AWS Launches Managed Chaos Engineering Service

The new Fault Injection Simulator from Amazon Web Services (AWS) is now generally available, letting users actively run tests against their applications to detect potential weaknesses.

First described by AWS CTO Werner Vogels during last December's virtual re:Invent conference, the Fault Injection Simulator (FIS) is a managed service that tests how an application will react to potential problems like bad code, performance bottlenecks or malicious attacks. It does this by running experiments that "inject faults" into an application.

Fault-injection experiments are a cornerstone of chaos engineering, the process by which software developers stress-test their applications to identify -- and preemptively correct -- the areas where breakdowns would likely occur in a real-world scenario. It's used extensively within AWS, Vogels said. Streaming giant Netflix was also an early proponent of chaos engineering, building a program called Chaos Monkey to test the resilience of its platform.

While introducing FIS at re:Invent, Vogels said, "We believe chaos engineering is for everyone, not just shops running on Amazon or Netflix scale." FIS, he said at the time, is designed to "simplify the process of running chaos experiments in the cloud."

Experiments run through FIS are built to "follow the typical chaos experimental workflow," Vogels said, "where you understand your steady state, set a hypothesis, inject faults and monitor your application. When the experiment is over, FIS will tell you if your hypothesis was confirmed, and you can use the data collected by [monitoring tool Amazon] CloudWatch to decide where you need to make improvements."
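The workflow Vogels describes (measure a steady state, state a hypothesis, inject a fault, observe) can be illustrated with a small, self-contained Python sketch. This is not FIS code: the ToyService class, the run_experiment helper and the latency figures are all hypothetical, chosen only to show the shape of an experiment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    steady_state: float          # metric measured before the fault
    observed: List[float]        # metric samples taken during the fault
    hypothesis_confirmed: bool   # True if every sample stayed within tolerance


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    tolerance: float,
    samples: int = 3,
) -> ExperimentResult:
    """Hypothesis: the metric stays within `tolerance` of its steady state."""
    baseline = measure()                 # 1. understand the steady state
    inject_fault()                       # 2. inject the fault
    try:
        observed = [measure() for _ in range(samples)]  # 3. monitor
    finally:
        remove_fault()                   # always roll the fault back
    confirmed = all(abs(value - baseline) <= tolerance for value in observed)
    return ExperimentResult(baseline, observed, confirmed)


class ToyService:
    """Hypothetical stand-in for an application under test."""

    def __init__(self) -> None:
        self.degraded = False

    def latency_ms(self) -> float:
        return 250.0 if self.degraded else 100.0

    def degrade(self) -> None:
        self.degraded = True

    def recover(self) -> None:
        self.degraded = False


service = ToyService()
result = run_experiment(
    service.latency_ms, service.degrade, service.recover, tolerance=50.0
)
print(result.hypothesis_confirmed)
```

In this toy run the hypothesis is rejected, because latency under the fault moves well outside the 50 ms tolerance. That is the kind of finding a chaos experiment is meant to surface before a real outage does.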

In a blog post Monday, AWS evangelist Jeff Barr shared more details about FIS. Currently, FIS supports running experiments on just these AWS services: EC2, Elastic Container Service, Elastic Kubernetes Service and Relational Database Service. Users can choose what "fault" they want to inject into their service from a set of actions, such as terminating specific EC2 instances or returning error messages in response to specific requests.

"You can select the target resources by type, tag, ARN, or by querying for specific attributes," Barr said. "You also have the ability to stop the experiment if one or more stop conditions (as defined by CloudWatch Alarms) are met. This allows you to quickly terminate the experiment if it has an unexpected impact on a crucial business or operational metric."
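As a rough illustration of the pieces Barr mentions (a target selected by tag, an action and a CloudWatch alarm as a stop condition), the Python sketch below assembles an experiment template as a plain dictionary. The field names follow our understanding of the FIS experiment template format and should be checked against the current AWS documentation. The build_experiment_template helper, the ARNs, the tag and the entry names are placeholders invented for this example.

```python
from typing import Any, Dict


def build_experiment_template(
    role_arn: str, alarm_arn: str, tag_key: str, tag_value: str
) -> Dict[str, Any]:
    """Assemble a template that terminates one tagged EC2 instance.

    Hypothetical helper: it only builds a dictionary and makes no AWS calls.
    """
    return {
        "description": "Terminate one tagged instance and watch the alarm",
        "roleArn": role_arn,  # IAM role the service assumes (placeholder)
        "targets": {
            "tagged-instances": {  # arbitrary name chosen for this sketch
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},  # select by tag
                "selectionMode": "COUNT(1)",           # pick a single match
            }
        },
        "actions": {
            "terminate-one": {  # arbitrary name chosen for this sketch
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "tagged-instances"},
            }
        },
        # Stop the experiment if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


template = build_experiment_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-alarm",
    tag_key="chaos-ready",
    tag_value="true",
)
print(sorted(template))
```

A dictionary like this could, in principle, be passed to the AWS SDK's FIS client to create a template. That step is left out here because it needs credentials and a real account.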

AWS plans to expand the list of supported products and actions this year, Barr said. More information is available on the FIS page.

About the Author

Gladys Rama (@GladysRama3) is the editorial director of Converge360.

