Cloud Leaders Diverge on AI Approach -- AWSInsider

By Jeffrey Schwartz
10/10/2016

Many signs are pointing to artificial intelligence (AI) as the next battleground for the top three cloud providers.

Google last week touted its recent AI efforts, which include the launch of Google Home, a device similar to the Amazon Echo that taps the Google search engine and has its own personal assistant. The company also jumped back into the smartphone market with its new Android-based Pixel, Google's first phone to have the personal assistant built in.

Google's AI announcements came on the heels of Microsoft's. At its Ignite conference last month, Microsoft talked up the AI advancements in its Azure cloud, including the deployment of field-programmable gate arrays (FPGAs) in every Azure node to enable it to process the tasks of a supercomputer, accelerating its machine learning, AI and Intelligent Bots framework.

During an Ignite keynote, Microsoft Research networking expert Doug Burger said the deployment of FPGAs and GPUs in every Azure node provides what he described as "ExaScale" throughput, meaning the ability to run 1 billion billion (10^18) operations per second. That means Azure has "10 times the AI capability of the world's largest existing supercomputer," Burger claimed, noting that "a single [FPGA] board turbo-charges the server, allowing it to recognize the images significantly faster."
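As a quick check on the units in that claim: the SI prefix "exa" denotes 10^18, or a billion billion. The short Python sketch below works through the arithmetic only. The variable and function names are illustrative, and the "implied" supercomputer figure is simply the exascale number divided by the claimed factor of 10. It is not a measured benchmark, and the operations being compared may not be equivalent.

```python
# Unit arithmetic only; the inputs are the keynote claims quoted above.
BILLION = 10**9
PETA = 10**15  # SI prefix "peta"
EXA = 10**18   # SI prefix "exa"

# "ExaScale": one billion billion operations per second.
exascale_ops_per_sec = BILLION * BILLION

# The "10 times" claim implies a comparison system at one tenth of that rate.
implied_supercomputer_ops = exascale_ops_per_sec // 10


def in_peta(ops_per_sec: int) -> float:
    """Express an operations-per-second figure in peta-operations per second."""
    return ops_per_sec / PETA


print(exascale_ops_per_sec == EXA)         # True
print(in_peta(implied_supercomputer_ops))  # 100.0
```

In other words, the claim amounts to comparing Azure against a machine in the range of 100 peta-operations per second.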

From a network throughput standpoint, Azure can now support network speeds of up to 25Gbps, faster than any other cloud provider has claimed to date, he said.

Just as Microsoft is building out Azure to power the AI-based capabilities it aims to deliver, Google plans to do the same for its cloud. Google has the largest bench of AI scientists, while Microsoft and Baidu, the Chinese search engine and cloud provider, are a close second, said Karl Freund, a senior analyst at Moor Insights & Strategy. Microsoft recently announced the formation of an AI group staffed with 5,000 engineers.

Freund explained in a blog post published by Forbes that Microsoft's stealth deployment of FPGAs in Azure over the past few years is a big deal and will likely be an approach that other large cloud providers, looking to boost the machine learning capabilities of their platforms, will consider.

Freund said in an interview that he was aware of Microsoft's intense interest in FPGAs five years ago, when Microsoft Research quietly unveiled "Project Catapult," outlining a five-year proposal for deploying the accelerators throughout Azure. Microsoft first disclosed its work with FPGAs two years ago when Microsoft Research published a paper describing its deployment of the FPGA fabric on 1,632 servers to accelerate the Bing search engine.

Still, it was surprising that Microsoft actually moved forward with the deployment, Freund said. He also emphasized how Microsoft's choice to deploy FPGAs contrasts with how Google is building AI into its cloud using non-programmable ASICs. Google's fixed-function chip is called the TPU, or tensor processing unit, and is built for TensorFlow, Google's machine learning library, which expresses complex mathematical calculations as graphs. Google revealed back in May that it had started running the TPUs in its cloud more than a year ago.

The key difference between Google's and Microsoft's approaches to powering their respective clouds with AI-based computational and network power is that FPGAs are programmable and Google's TPUs, because they're ASIC-based, are not.

"Microsoft will be able to react more readily. They can reprogram their FPGAs once a month because they're field-programmable, meaning they can change the gates without replacing the chip, whereas you can't reprogram a TPU -- you can't change the silicon," Freund said. Consequently, to make a comparable change, Google would have to swap out the processors in every node of its cloud, he explained.

The advantage Google has over Microsoft is that its TPUs are substantially faster -- potentially 10 times faster -- than today's FPGAs, Freund said. "They're going to get more throughput," he said. "If you're the scale of Google, which is a scale few of us can even comprehend, you need a lot of throughput. You need literally hundreds or thousands or even millions of simultaneous transactions accessing these trained neural networks. So they have a total throughput performance advantage versus anyone using an FPGA. The problem is if a new algorithm comes along that allows them to double the performance, they have to change the silicon and they're going to be, you could argue, late to adopt those advances."

So who has an advantage: Microsoft with its ability to easily reprogram its FPGAs, or Google using its faster TPUs? "Time will tell who has the right strategy but my intuition says they are both right and there is going to be plenty of room for both approaches, even within a given datacenter," Freund said.

As for public cloud leader Amazon Web Services (AWS), it recently launched new GPU-based EC2 instances to support "heavier GPU compute workloads," including AI, Big Data processing and machine learning.

However, the technology AWS uses for its Elastic Compute Cloud (EC2) is a well-guarded secret. Back in June, AWS launched its Elastic Network Adapter (ENA), boosting its network speed from 10Gbps to 20Gbps.
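To put the network figures quoted in this article (10Gbps, 20Gbps and 25Gbps) in perspective, the Python sketch below estimates how long a payload would take to cross a link at each speed. It is a back-of-the-envelope calculation at ideal line rate: the 1TB payload and the function name are hypothetical, and real transfers would be slower because of protocol overhead and contention.

```python
def transfer_seconds(payload_bytes: int, link_gbps: float) -> float:
    """Ideal time to move a payload over a link, ignoring all overhead."""
    bits = payload_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 10**12  # hypothetical payload: 1 terabyte (decimal)

# Link speeds cited in the article, in gigabits per second.
for gbps in (10, 20, 25):
    print(f"{gbps}Gbps: {transfer_seconds(ONE_TB, gbps):.0f} seconds")
```

Under these assumptions, doubling the link speed from 10Gbps to 20Gbps halves the ideal transfer time, from roughly 800 seconds to roughly 400 seconds for the 1TB example.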

While AWS isn't revealing the underlying hardware model behind its ENAs, Freund said it's reasonable to presume the company's 2015 acquisition of Israeli chip maker Annapurna Labs is playing a role in boosting AWS' EC2. Annapurna was said to be developing system-on-a-chip-based programmable ARM network adapters, Freund said.

About the Author

Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.