Sunday, December 12, 2021

AWS attributes outage to surge from automated scaling of internal network

At 7:30 AM PST on December 7th, 2021, an automated activity at AWS Northern Virginia (US-EAST-1) Region that is used to scale capacity of services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. The unexpected behavior resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks." The traffic surge impacted the control planes that are used for creating and managing AWS resources.  In particular, API Gateway servers were impacted by their inability to communicate with the internal network during the early part of this event. As a result of these errors, many API Gateway servers eventually got into a state where they needed to be replaced in order to serve requests successfully. 

In a blog posting, Amazon Web Services (AWS) apologized for the incident and said it has already taken several actions to prevent a recurrence of this event, including disabling of the automated scaling process until a remediation method is deployed. 


https://aws.amazon.com/message/12721/