Amazon Cloud Service Outage (in US East) on October 22, 2012

On October 22, 2012, at 10:00 AM (PDT), a small number of Amazon's EBS servers in the US East region began to show degraded performance. The problem gradually spread to many more servers, leading to the biggest Amazon cloud outage in recent times. Operations returned to normal at about 4:15 PM (PDT).

1) The main impact of this outage was on Amazon's EBS (Elastic Block Store) servers. About 50% of the requests to these servers were lost during the outage.
2) As a side effect, users of Amazon's API servers, ELB (Elastic Load Balancing), and RDS (Relational Database Service) were also affected.
3) Many popular services that depend on Amazon's cloud services, such as Airbnb, Reddit, and Dropbox, were also affected during this outage.

Cause of this Outage
1) A week before the outage, one of the servers in the affected zone was replaced because of a hardware problem. Unexpectedly, the DNS update for the new server did not propagate successfully to all the storage servers in the system.
2) The data collection agent on those storage servers kept trying to connect to the old server. A software bug in the agent caused a latent memory leak during these unsuccessful attempts.
3) Amazon's monitoring system had no alarm for such memory leaks, so the problem went unnoticed until memory consumption crossed a threshold and performance began to degrade.
4) During the event, the Amazon team changed their throttling system to reduce the number of incoming requests, which in turn affected many API users.
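
The leak described in point 2 can be illustrated with a minimal sketch. This is not Amazon's agent code, which is not public. The class names, the `report` method, and the `unreachable_server` stand-in are all hypothetical, and the sketch only assumes that each failed connection attempt left some state behind that was never released.

```python
from collections import deque


class LeakyAgent:
    """Hypothetical agent that keeps every failed payload forever."""

    def __init__(self):
        self.pending = []  # grows without bound while the target is unreachable

    def report(self, connect, payload):
        try:
            connect(payload)
        except ConnectionError:
            self.pending.append(payload)  # never trimmed: this is the "leak"


class BoundedAgent:
    """Same agent, but failed payloads go into a fixed-size buffer."""

    def __init__(self, max_pending=100):
        # A deque with maxlen silently drops the oldest entry when full.
        self.pending = deque(maxlen=max_pending)

    def report(self, connect, payload):
        try:
            connect(payload)
        except ConnectionError:
            self.pending.append(payload)


def unreachable_server(payload):
    # Stands in for the old server address that no longer answers.
    raise ConnectionError("old server address is gone")


leaky, bounded = LeakyAgent(), BoundedAgent(max_pending=100)
for i in range(1000):
    leaky.report(unreachable_server, i)
    bounded.report(unreachable_server, i)

print(len(leaky.pending), len(bounded.pending))
```

Each failed attempt costs very little memory, which is why a leak of this kind can stay hidden for days before it starts to hurt.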

How the problem was resolved
1) The team made adjustments to reduce the incoming load, and the reduced load allowed the servers to automatically recover the affected volumes.
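
The summary does not say how the load was reduced beyond mentioning the throttling system. One common way to cap incoming requests is a token bucket, sketched below purely as an illustration. The `TokenBucket` class and its parameters are assumptions and do not describe Amazon's throttling implementation.

```python
class TokenBucket:
    """Illustrative rate limiter: `rate` tokens are added per second,
    up to `capacity`; each allowed request spends one token."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call

    def allow(self, now):
        # Refill according to the time elapsed since the last call.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request is throttled


bucket = TokenBucket(rate=2, capacity=2)
burst = [bucket.allow(now=0.0) for _ in range(5)]  # five requests at once
later = bucket.allow(now=1.0)                      # one request a second later
print(burst, later)
```

Lowering `rate` or `capacity` sheds more load. The trade-off is that legitimate callers get rejected too, which matches the side effect on API users described above.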

Post-Outage Refunds
The Amazon team promised to automatically issue a credit for 10 days of charges to RDS users who were affected for more than 20 minutes.

Future plans to avoid similar outages
1) The system will be changed to ensure that DNS updates are reliably propagated to all the storage servers.
2) The monitoring system will track each process's memory consumption so that latent memory leaks are detected.
3) The fix for the memory leak bug will be deployed in the coming week.
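
As a rough illustration of the monitoring in point 2, the sketch below flags a process whose memory has risen steadily over its most recent samples. The function name, the window size, and the 10 MB threshold are assumptions chosen for the example, not values taken from Amazon's monitoring system.

```python
MB = 1024 * 1024


def sustained_growth(samples, window=5, min_growth_bytes=10 * MB):
    """Return True if memory rose on every step across the last `window`
    samples and the total rise over that window is at least
    `min_growth_bytes`."""
    if len(samples) < window:
        return False  # not enough history to judge
    recent = samples[-window:]
    rising = all(b > a for a, b in zip(recent, recent[1:]))
    return rising and (recent[-1] - recent[0]) >= min_growth_bytes


# Memory readings in bytes, e.g. one per monitoring interval.
leaking = [100 * MB + i * 5 * MB for i in range(6)]  # climbs 5 MB per sample
steady = [100 * MB, 101 * MB, 100 * MB, 102 * MB, 100 * MB, 101 * MB]

print(sustained_growth(leaking), sustained_growth(steady))
```

A check on the trend, rather than on a fixed ceiling, can raise an alarm while the process is still healthy and long before performance degrades.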
