AWS Announces Tools To Manage Machine Learning Workloads -- AWSInsider

By Gladys Rama
11/30/2018

In a bid to reduce the time, compute resources and costs required to train machine learning models, Amazon Web Services (AWS) announced new tools at re:Invent this week to lighten the load.

For starters, the company is preparing new GPU instances optimized for large-scale machine learning training. The upcoming P3dn.24xlarge instances, expected to launch in the first week of December, will be powered by eight NVIDIA Tesla V100 GPUs, each with 32GB of memory, and will support networking throughput of up to 100Gbps. They will also have 96 Intel Xeon Skylake vCPUs.

"The faster networking, new processors, doubling of GPU memory, and additional vCPUs enable developers to significantly lower the time to train their ML models or run more HPC simulations by scaling out their jobs across several instances (e.g., 16, 32 or 64 instances)," according to AWS. "[I]n addition to increasing the throughput of passing data between instances, the additional network throughput of P3dn.24xlarge instances can also be used to speed up access to large amounts of training data by connecting to Amazon S3 or shared file systems solutions such as Amazon EFS."
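To give a rough sense of what scaling out across 16, 32 or 64 instances means, the short Python sketch below totals the GPUs and GPU memory in such a cluster. It uses only the per-instance figures cited above (eight GPUs with 32GB each); the constant and function names are illustrative, and the totals are simple multiplication, not AWS-published figures.

```python
# Per-instance figures from the P3dn.24xlarge description above.
GPUS_PER_INSTANCE = 8
GPU_MEMORY_GB = 32


def cluster_totals(num_instances: int) -> tuple[int, int]:
    """Return (total GPUs, total GPU memory in GB) for a cluster.

    Illustrative helper: plain multiplication of the per-instance specs.
    """
    total_gpus = num_instances * GPUS_PER_INSTANCE
    total_memory_gb = total_gpus * GPU_MEMORY_GB
    return total_gpus, total_memory_gb


# The instance counts AWS gives as examples of scaled-out jobs.
for n in (16, 32, 64):
    gpus, memory_gb = cluster_totals(n)
    print(f"{n} instances: {gpus} GPUs, {memory_gb} GB of GPU memory")
```

By this arithmetic, a 32-instance job spans 256 GPUs.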

Additionally, for developers working with TensorFlow, AWS said it has made changes to the framework that improve its ability to scale across multiple GPUs, making machine learning training tasks more resource-efficient. The changes are now generally available.

"By improving the way in which TensorFlow distributes training tasks across those GPUs, the new AWS-Optimized TensorFlow achieves close to linear scalability when training multiple types of neural networks (90 percent efficiency across 256 GPUs, compared to the prior norm of 65 percent)," AWS said in an announcement.
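The efficiency figures can be turned into a rough speedup estimate. The sketch below assumes the common definition of scaling efficiency (speedup over a single GPU, divided by the number of GPUs); the article does not say how AWS measured it, so the results are illustrative arithmetic only, and the function name is hypothetical.

```python
def effective_speedup(num_gpus: int, efficiency: float) -> float:
    """Estimate speedup over one GPU, assuming efficiency = speedup / num_gpus.

    This definition is an assumption; AWS's methodology is not described
    in the article.
    """
    return num_gpus * efficiency


NUM_GPUS = 256

optimized = effective_speedup(NUM_GPUS, 0.90)  # AWS-Optimized TensorFlow figure
baseline = effective_speedup(NUM_GPUS, 0.65)   # "prior norm" figure

print(f"Optimized: about {optimized:.1f}x a single GPU")
print(f"Baseline:  about {baseline:.1f}x a single GPU")
print(f"Relative gain: about {optimized / baseline:.2f}x")
```

Under that assumption, the same 256 GPUs would deliver roughly 1.4 times the training throughput they did at 65 percent efficiency.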

Another new service that's now generally available is Amazon Elastic Inference. Inference is the process in which a trained machine learning model makes predictions about new data, using what it learned during training. The resource requirements and costs of inference can be significantly higher than those for training, according to AWS. Elastic Inference gives developers more options to manage the amount of compute power they buy, potentially cutting their spending by as much as 75 percent, according to the AWS announcement.

"Instead of running on a whole Amazon EC2 P2 or P3 instance with relatively low utilization, developers can run on a smaller, general-purpose Amazon EC2 instance and provision just the right amount of GPU performance from Amazon Elastic Inference," the company said. "Starting at just 1 TFLOP, developers can elastically increase or decrease the amount of inference performance, and only pay for what they use."

For heavier workloads, AWS is developing a new inference processor called AWS Inferentia. Due in 2019, Inferentia is a dedicated inference chip that promises to be cost-effective, with low latency and high throughput capacity.

"Each chip provides hundreds of TOPS (tera operations per second) of inference throughput to allow complex models to make fast predictions," according to the AWS product page. "For even more performance, multiple AWS Inferentia chips can be used together to drive thousands of TOPS of throughput."

About the Author

Gladys Rama (@GladysRama3) is the editorial director of Converge360.
