Machine Learning-Based Amazon DevOps Guru Hits Preview

Amazon Web Services (AWS) has unveiled its new DevOps Guru managed operations service, which uses machine learning to detect operational issues and automatically recommends specific actions for remediation.

Amazon DevOps Guru, now in preview, is designed to collect and analyze application metrics, logs, events and traces to recognize behaviors that deviate from normal operating patterns -- things like under-provisioned compute capacity, database I/O over-utilization and memory leaks, among others. AWS announced DevOps Guru as part of its ongoing re:Invent virtual conference.
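To make "deviating from normal operating patterns" concrete, here is a minimal sketch of baseline-deviation detection. It is not how DevOps Guru works internally; AWS describes its models only as specialized machine learning. The function name `find_anomalies`, the trailing-window z-score technique, the thresholds and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing-window
    baseline by more than `threshold` standard deviations.

    Illustrative z-score check only; not AWS's actual detection method.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical memory-usage samples in MB, with one sudden spike.
memory_mb = [512, 515, 510, 514, 511, 513, 512, 900, 514, 512]
print(find_anomalies(memory_mb))  # prints [7]
```

A production system would also need to handle seasonality and slow drifts such as memory leaks, which a short trailing window like this one would not catch.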

According to AWS, when DevOps Guru identifies anomalous app behavior that could cause outages or service disruptions -- such as resource usage approaching its limits, or code and configuration changes that might cause outages -- it alerts developers with issue details such as the resources involved, a timeline of events and specific recommendations for remediation. It sends these alerts via the Amazon Simple Notification Service (SNS) and partner integrations such as Atlassian's Opsgenie and PagerDuty.

Amazon DevOps Guru also "spotlights" things like under-provisioned compute capacity, database I/O overutilization and memory leaks.

"Customers have asked us to continue adding services around areas where we can apply our own expertise on how to improve application availability and learn from the years of operational experience that we have acquired running Amazon.com," said Swami Sivasubramanian, head of the AWS Machine Learning group, in a statement. "With Amazon DevOps Guru, we have taken our experience and built specialized machine learning models that help customers detect, troubleshoot, and prevent operational issues while providing intelligent recommendations when issues do arise. This enables teams to immediately benefit from operational best practices Amazon has learned from running Amazon.com, saving customers the time and effort that would otherwise be spent configuring and managing multiple monitoring systems."

The Amazon DevOps Guru service not only analyzes system and app data to detect anomalies, but it also groups this data into "operational insights" that include anomalous metrics, visualizations of application behavior over time and recommendations on actions for remediation, according to AWS. The service also correlates and groups related application and infrastructure metrics, such as Web app latency spikes, running out of disk space, bad code deployments and memory leaks.

The result is fewer redundant alarms, which helps users focus on high-severity issues. Users can see configuration change histories and deployment events, along with system and user activity, to generate a prioritized list of likely causes for an operational issue in the Amazon DevOps Guru console.
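The idea of folding related anomalies into a single insight, rather than raising one alarm per metric, can be sketched as follows. This is a simplified illustration that groups anomalies only by how close together they occur. The `Anomaly` and `Insight` classes, the `group_into_insights` function, the five-minute gap and the sample resources are hypothetical and do not reflect the DevOps Guru API or its actual correlation logic.

```python
from dataclasses import dataclass, field


@dataclass
class Anomaly:
    resource: str    # hypothetical resource name
    metric: str      # hypothetical metric name
    timestamp: int   # seconds since epoch


@dataclass
class Insight:
    anomalies: list = field(default_factory=list)


def group_into_insights(anomalies, max_gap_seconds=300):
    """Group anomalies into insights when each occurs within
    `max_gap_seconds` of the previous one (time-proximity heuristic)."""
    insights = []
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        last = insights[-1].anomalies[-1] if insights else None
        if last is not None and anomaly.timestamp - last.timestamp <= max_gap_seconds:
            # Close in time to the previous anomaly: same insight.
            insights[-1].anomalies.append(anomaly)
        else:
            # Too far apart (or first anomaly): start a new insight.
            insights.append(Insight(anomalies=[anomaly]))
    return insights


events = [
    Anomaly("web-app", "latency_ms", 1000),
    Anomaly("database", "io_utilization", 1120),
    Anomaly("web-app", "memory_mb", 1200),
    Anomaly("worker", "disk_free_gb", 5000),
]
insights = group_into_insights(events)
print([len(i.anomalies) for i in insights])  # prints [3, 1]
```

Here four raw anomalies become two insights, so an on-call engineer would receive two notifications instead of four.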

The service was also designed to provide intelligent recommendations with remediation steps and integration with AWS Systems Manager for runbook and collaboration tooling, which gives users the ability to more effectively maintain applications and manage infrastructure for their deployments.

Paired with Amazon CodeGuru, another machine learning-powered developer tool that provides intelligent recommendations and identifies an application's most expensive lines of code, Amazon DevOps Guru provides users with the automated benefits of machine learning for their operational data, so that developers can more easily improve application availability and reliability, the company said.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.

