Google App Engine Outage on October 26, 2012

Here is a brief summary of the Google App Engine outage that occurred on October 26, 2012.

Time:
The problem started at about 4:00 am (PST) in one of the App Engine data centers, and normal operation was restored at about 11:45 am.

Impact during the outage
According to the Google App Engine team, about 50% of requests to App Engine applications failed during the outage.
A few users also noticed a marked increase in the number of server instances running for their applications.

How the outage happened
1) At about 4:00 am PST, load on the traffic routers in one of the Google App Engine data centers began increasing, and it later crossed the paging threshold.
2) The team restarted the traffic routers in the affected data center to address the load increase.
3) Unexpectedly, this restart reduced the number of available traffic routers below the minimum required, which overloaded the other data centers and caused a cascading failure.
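
The cascading failure in step 3 can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of Google's actual routing system: the function name, the router counts, the capacity, and the load figures are all hypothetical, and load is assumed to be split evenly across whichever routers are still healthy.

```python
def simulate_cascade(total_routers, restarting, capacity, total_load):
    """Toy model of a router pool (all numbers are hypothetical).

    Load is split evenly across healthy routers. While the per-router
    share exceeds a router's capacity, one more router drops out, which
    raises the share carried by the survivors.
    Returns the healthy-router count after each step.
    """
    healthy = total_routers - restarting
    history = [healthy]
    while healthy > 0 and total_load / healthy > capacity:
        healthy -= 1  # an overloaded router fails
        history.append(healthy)
    return history


# 800 units of load at 100 units per router needs at least 8 healthy routers.
print(simulate_cascade(10, 2, 100, 800))  # restarting 2 of 10 stays stable
print(simulate_cascade(10, 3, 100, 800))  # restarting 3 of 10 cascades to zero
```

In this model, taking one router too many out of service is enough to tip the pool over: every failure increases the load on the remaining routers, so the pool cannot recover on its own.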

How it was resolved
With no other option, the team performed a complete restart of the system and ramped traffic up gradually until service returned to normal.
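
A gradual traffic ramp-up can be sketched as a schedule of admitted-traffic fractions. This is a minimal illustration that assumes a linear ramp; the function name and the number of steps are hypothetical, and the source does not say what schedule Google actually used.

```python
def ramp_up_schedule(steps):
    """Return the fraction of traffic admitted at each step of a linear ramp.

    Hypothetical helper: the last step always admits all traffic (1.0).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [(i + 1) / steps for i in range(steps)]


# Admit a quarter more of the traffic at each of four steps.
print(ramp_up_schedule(4))
```

Admitting traffic in stages gives the freshly restarted routers time to come up before they carry full load, rather than hitting them with everything at once and triggering the same overload again.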

Post-Outage Refunds
Google promised to credit 10% of the monthly charges back to all paying App Engine customers.

What has been done to avoid a recurrence
The Google App Engine team increased traffic routing capacity in their data centers and adjusted their configuration to reduce the possibility of another cascading failure.
