Here's Why a Vital Amazon Web Services Region Went Down on Dec. 7

XDJ

Amazon shares the results of its investigation into the ‘service disruption’ that AWS (and its customers) experienced on Dec. 7.

Amazon has explained why a vital Amazon Web Services (AWS) region, US-East-1, experienced what the company describes as a “service disruption” for about seven hours on Dec. 7.

The problems with US-East-1 affected many people’s ability to connect to streaming platforms like Netflix, Disney+, and Amazon Prime Video; games like Valorant, League of Legends, and PUBG; apps like Tinder, Venmo, and Coinbase; and many other services that rely on AWS.

The sheer popularity of those services makes it relatively easy to tell when AWS is having problems—just try to stream a video, play a game, or use a mobile app connected to the nigh-ubiquitous platform. But it can be much more difficult to figure out why AWS is down.

Here’s what Amazon says caused US-East-1’s woes:

“At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.”

The company also says the congestion “immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it,” as well as their ability to explain the issue to AWS customers.
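The feedback loop Amazon describes, in which errors prompt retries that add to the congestion that caused the errors, is a familiar failure pattern in networked systems. One common way to soften it is for clients to wait longer and longer between retries and to randomize those waits. The Python sketch below illustrates that general idea only; the quoted passages don't say how AWS's internal clients actually retry, and the function names, default values, and the use of `ConnectionError` are all assumptions made for illustration.

```python
import random
import time


def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound, in seconds, on the wait before retry number `attempt` (0-based).

    The bound doubles with each attempt and is capped at `cap`.
    `base` and `cap` are illustrative defaults, not values from Amazon.
    """
    return min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Before each retry, wait a random time between 0 and the current
    ceiling ("full jitter") so that many clients failing at the same
    moment do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(rng.uniform(0, backoff_ceiling(attempt, base, cap)))
```

A client would wrap a network call in it, for example `call_with_retries(fetch_status)`, where `fetch_status` is a hypothetical function that raises `ConnectionError` when the network is congested. Capping the number of attempts matters as much as the delay: a client that retries forever keeps adding load for as long as the outage lasts.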

AWS is a sprawling platform that offers a broad range of products used by many companies to serve a variety of purposes. It’s a wonder that it doesn’t experience major outages more often—and that it was able to recover from this particular disruption as quickly as it did.

However, the incident still highlights the inherent risk of so many companies relying on AWS, especially since Amazon's own monitoring and support tools run on the same network, which means problems with the platform can hinder efforts to solve problems with the platform. (And that’s when just a single region is involved!)

Amazon even acknowledges that relying too much on just one AWS region can be a problem:

“Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST. We have been working on several enhancements to our Support Services to ensure we can more reliably and quickly communicate with customers during operational issues. We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers.”

More information about what caused the disruption to US-East-1, how Amazon’s responding to the issue, and which services were affected can be found in the company’s summary.
