AWS Details DynamoDB Outage, Preventive Measures

Amazon Web Services Inc. (AWS) provided a detailed account of the recent service outage that reportedly brought down major Web sites and inconvenienced many customers in the US-East Region.

The primary problem for customers was a disruption of the DynamoDB database service, which hit Web sites such as Netflix, Tinder, Airbnb and IMDb, among many others, during a six- to eight-hour period on Sept. 20.

Early in the morning, a "brief network disruption" occurred, which normally would have been handled smoothly, AWS said in its technically detailed explanation. However, a metadata service that tells storage servers their "membership" -- that is, the assigned partitions containing the data each server is responsible for -- became overloaded. One problem led to another, and the result was a loss of service.

The DynamoDB outage also affected other services, such as Simple Queue Service, EC2 Auto Scaling, the CloudWatch monitoring service and more.

To prevent similar outages in the future, the company outlined the steps it is taking:

"There are several actions we'll take immediately to avoid a recurrence of Sunday's DynamoDB event. First, we have already significantly increased the capacity of the metadata service. Second, we are instrumenting stricter monitoring on performance dimensions, such as the membership size, to allow us to thoroughly understand their state and proactively plan for the right capacity. Third, we are reducing the rate at which storage nodes request membership data and lengthening the time allowed to process queries. Finally and longer term, we are segmenting the DynamoDB service so that it will have many instances of the metadata service each serving only portions of the storage server fleet. This will further contain the impact of software, performance/capacity, or infrastructure failures."
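To make the third and fourth measures concrete, here is a minimal Python sketch of the general ideas: a client that spaces out its retries with jittered exponential backoff and a longer timeout, and a hash-based mapping that assigns each storage server to one of several metadata service instances. This is an illustration only, not AWS's implementation. The function names, the number of instances, the timeout and the backoff parameters are all hypothetical.

```python
import hashlib
import random
import time


def metadata_cell_for(server_id: str, num_cells: int) -> int:
    """Map a storage server to one of num_cells metadata instances (hypothetical scheme)."""
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def next_retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Jittered exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def fetch_membership(request_fn, server_id, num_cells=4, timeout=10.0,
                     max_attempts=5, sleep=time.sleep):
    """Request membership data from the server's assigned metadata instance.

    request_fn(cell, server_id, timeout) is a caller-supplied function (an
    assumption of this sketch) that returns the membership data or raises
    TimeoutError when the metadata instance does not answer in time.
    """
    cell = metadata_cell_for(server_id, num_cells)
    for attempt in range(max_attempts):
        try:
            return request_fn(cell, server_id, timeout)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait a randomized, growing interval so that many servers
            # retrying at once do not hit the metadata service in lockstep.
            sleep(next_retry_delay(attempt))
```

In this sketch, a failure or overload in one metadata instance only affects the servers hashed to it, and the randomized delays lower the request rate that a struggling instance sees while it recovers.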

The company also apologized to customers, noting that, even though DynamoDB has effectively enjoyed 100 percent uptime in the past three years, "we know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future."

About the Author

David Ramel is an editor and writer for Converge360.

