AWS Blames Unplanned System Restarts for S3 Outage
This week's hours-long Amazon Web Services (AWS) outage was caused by an incorrectly typed command that forced inadvertent restarts of key Amazon S3 subsystems, the company said.
In its postmortem report released Thursday, AWS explained the chain of events that led to the outage, which bogged down much of the Internet -- including many of AWS' own applications -- for roughly four hours on Tuesday.
The outage affected the Amazon S3 storage service in AWS' Northern Virginia region, which houses a significant portion of the vendor's total cloud infrastructure. It lasted from 9:30 a.m. PST to just before 2 p.m. PST.
According to AWS' account, on the morning of the outage, its technicians had been in the middle of investigating some sluggishness in the S3 billing system when one of them incorrectly entered a command that had the domino effect of prompting an unplanned restart:
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.
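The dependency chain AWS describes can be modeled in a few lines. The sketch below is illustrative only: the table and function names are hypothetical, and the mapping simply encodes what the statement above says (every request type needs the index subsystem, PUT also needs the placement subsystem, and placement itself depends on index). It is not a representation of how S3 is actually implemented.

```python
# Which subsystems each request type needs, per the AWS statement.
REQUEST_NEEDS = {
    "GET": {"index"},
    "LIST": {"index"},
    "PUT": {"index", "placement"},
    "DELETE": {"index"},
}

# Which other subsystems each subsystem needs in order to function.
SUBSYSTEM_DEPS = {
    "index": set(),
    "placement": {"index"},
}


def functional_subsystems(healthy):
    """Return the subsystems that are healthy AND have all their dependencies functional."""
    functional = set()
    changed = True
    while changed:
        changed = False
        for name, deps in SUBSYSTEM_DEPS.items():
            if name in healthy and name not in functional and deps <= functional:
                functional.add(name)
                changed = True
    return functional


def servable_requests(healthy):
    """Return the request types that can be served given the healthy subsystems."""
    functional = functional_subsystems(healthy)
    return {op for op, needs in REQUEST_NEEDS.items() if needs <= functional}
```

Under this model, losing the index subsystem leaves no request type servable, even if the placement servers themselves are still running, which matches AWS' description of S3 being unable to service requests during the restart.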
AWS said that while Amazon S3 subsystems are designed to withstand "the removal or failure of significant capacity," the subsystems in some of its larger regions had not been restarted for many years.
The Northern Virginia region is considered to be AWS' densest in terms of how much of the company's overall cloud infrastructure is located there; AWS doesn't provide official figures, but a 2012 study by Accenture Technology Labs estimated that Northern Virginia accounted for 70 percent of AWS' total server racks at the time. It's also AWS' oldest region, having opened in 2006.
"S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected," the company said.
AWS outlined the steps it's taking to prevent similar outages in the future -- for instance, modifying its processes to make sure that technicians can't remove too much server capacity too fast and cause a restart.
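A safeguard of that kind might look something like the following minimal sketch. AWS has not published its tooling, so everything here is an assumption made for illustration: the `CapacityGuard` class, the 5 percent per-command limit and the minimum-capacity floor are all hypothetical.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class CapacityGuard:
    total_servers: int             # servers currently in service (hypothetical figure)
    min_required: int              # assumed floor the subsystem needs to keep running
    max_percent_per_step: int = 5  # assumed cap on how much one command may remove

    def check_removal(self, count: int) -> int:
        """Validate a removal request and return the capacity that would remain."""
        if count <= 0:
            raise CapacityGuardError("removal count must be positive")
        step_limit = self.total_servers * self.max_percent_per_step // 100
        if count > step_limit:
            raise CapacityGuardError(
                f"refusing to remove {count} servers at once (limit is {step_limit})"
            )
        remaining = self.total_servers - count
        if remaining < self.min_required:
            raise CapacityGuardError(
                f"removal would leave {remaining} servers, below the minimum {self.min_required}"
            )
        return remaining

    def remove(self, count: int) -> int:
        """Apply a removal only after it passes both checks."""
        self.total_servers = self.check_removal(count)
        return self.total_servers
```

With a guard like this in front of the command, a mistyped input that asks for far more servers than intended is rejected before any capacity is taken out of service, rather than triggering a restart.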
It is also prioritizing the refactoring of the S3 service into smaller, easier-to-manage "cells." Dividing services into cells lets technicians test for problems more easily and minimize downtime, AWS explained. While S3 had already undergone a certain degree of refactoring, AWS promised to do "further partitioning" in the wake of the outage.
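The general idea behind cell-based partitioning can be shown with a short sketch. AWS did not describe how it assigns work to cells, so the hash-based scheme and the function names below are assumptions chosen only to illustrate why a smaller cell limits the reach of a failure.

```python
import hashlib


def cell_for_key(key: str, num_cells: int) -> int:
    """Map an object key to a cell with a stable hash (assumed scheme, not AWS')."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def keys_in_cell(keys, num_cells: int, cell: int):
    """Return the keys that would be affected if the given cell had to restart."""
    return [k for k in keys if cell_for_key(k, num_cells) == cell]
```

In this model, a restart of one cell touches only the keys that hash to it. The more cells there are, the smaller the share of objects any single restart can affect, and the less metadata each restart has to validate.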
AWS also addressed one glaring casualty of the outage: its own Service Health Dashboard. Used by customers to check the status of individual AWS applications, the Dashboard was rendered basically useless for a good part of the outage, incorrectly showing impacted services as "operating normally."
"From the beginning of this event until 11:37AM PST, we were unable to update the individual services' status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3," AWS explained. "Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services' status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions."
Major Web sites -- including Expedia, Medium, Slack, Imgur, Trello, GitHub, Docker and the U.S. Securities and Exchange Commission -- slowed significantly or were unable to load altogether during Tuesday's outage. According to Web site analysis firm Apica, 54 of the Internet's top 100 retailers experienced performance declines of 20 percent or more, with load times being severely impacted for some. Target.com, for instance, loaded 991 percent slower due to the outage, according to Apica, while DisneyStore.com loaded 1,165 percent slower.
Gladys Rama is the senior site producer for Redmondmag.com, RCPmag.com and MCPmag.com.