Amazon Cloud Service Outage (in US East) on October 22, 2012

On October 22, 2012, at 10:00 AM (PDT), a small number of Amazon's EBS servers in the US East region began to show degraded performance. Gradually many more servers were affected, which led to the biggest Amazon cloud outage in recent times. Operations were restored to normal at about 4:15 PM (PDT).

1) The main impact of this outage was on Amazon's EBS (Elastic Block Store) servers. About 50% of the requests to these servers were lost during the outage.
2) As a side effect, users of Amazon's API servers, ELB (Elastic Load Balancing) and RDS (Relational Database Service) were also affected.
3) Many popular services that depended on Amazon's cloud services, such as Airbnb, Reddit and Dropbox, were also affected during this outage.

Cause of this Outage
1) A week earlier, one of the servers in the affected zone was replaced because of a hardware problem. Unexpectedly, the DNS update for the new server did not propagate successfully to all the storage servers in the system.
2) The data collection agent on those storage servers kept trying to connect to the old server. A software bug in the agent caused a latent memory leak during these unsuccessful attempts.
3) Amazon's monitoring system did not have an alarm for such memory leaks, so the problem went unnoticed until a threshold was crossed and performance began to degrade.
4) During the event, to reduce the number of incoming requests, the Amazon team changed their throttling system, which in turn affected many API users.
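
To make points 1 and 2 a little more concrete, here is a small Python sketch of how a stale DNS entry plus a cleanup bug can turn into a slow memory leak. This is only an illustration of the general bug class, not Amazon's actual agent code: the CollectionAgent class, the host name collector.internal and the IP addresses are all made up for the example.

```python
class CollectionAgent:
    """Hypothetical agent that reports metrics to a collection server."""

    def __init__(self, resolve, connect):
        self.resolve = resolve    # callable: hostname -> address (may be stale)
        self.connect = connect    # callable: address -> True if reachable
        self.pending = []         # buffers held while a report is in flight

    def report(self, hostname, payload):
        address = self.resolve(hostname)
        self.pending.append(bytearray(payload))  # allocated before connecting
        if self.connect(address):
            self.pending.pop()                   # released only on success
            return True
        return False  # the bug: on failure the buffer is never released

    def leaked_bytes(self):
        return sum(len(buf) for buf in self.pending)


# The agent's DNS view still points at the replaced server (assumed addresses).
stale_dns = {"collector.internal": "10.0.0.1"}
agent = CollectionAgent(stale_dns.get, lambda address: address == "10.0.0.2")

for _ in range(1000):
    agent.report("collector.internal", b"x" * 64)

print(agent.leaked_bytes())  # 64000 bytes held after 1000 failed attempts
```

Each failed attempt leaks only a few bytes, which is why a problem like this can stay hidden for days before memory actually runs short.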

How the problem was resolved
1) The team made adjustments to reduce the incoming load, and the reduced load allowed the servers to automatically recover the affected volumes.
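
Amazon did not publish how their throttling works, so the following is only a sketch of one common way to limit incoming load, a token bucket, written in Python. The TokenBucket class, the rate and the capacity values are assumptions made for illustration. It also shows why aggressive throttling hurts legitimate API users: once the bucket is empty, requests are rejected regardless of who sent them.

```python
class TokenBucket:
    """Minimal token bucket: allows bursts up to `capacity`, refills at `rate` per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0

    def allow(self, now):
        # Refill according to the time elapsed since the previous request.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request is throttled


bucket = TokenBucket(rate=1, capacity=2)
arrival_times = [0, 0, 0, 1, 1, 2, 5, 5, 5, 5]  # seconds, hypothetical traffic
decisions = [bucket.allow(t) for t in arrival_times]
print(decisions.count(True), "of", len(decisions), "requests allowed")  # 6 of 10
```

Lowering rate or capacity sheds more load from the servers, at the cost of rejecting more client requests.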

Post-Outage Refunds
The Amazon team promised to automatically issue a credit of 10 days of charges to RDS users who were affected for more than 20 minutes.

Future plans to avoid similar outages
1) The system will be changed to ensure that DNS updates are reliably propagated to all the storage servers.
2) The monitoring system will watch each process's memory consumption so that latent memory leaks are detected.
3) The fix for the memory leak bug will be deployed in the coming week.
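
As a rough idea of what the memory monitoring in point 2 could look like, here is a small Python sketch that flags a process whose memory keeps growing across consecutive samples. The function name leak_suspected, the window size and the 50 MB growth threshold are my own assumptions, not details from Amazon's report.

```python
def leak_suspected(samples_mb, window=6, min_growth_mb=50):
    """Return True if memory never dropped over the last `window` samples
    and grew by at least `min_growth_mb` in total."""
    if len(samples_mb) < window:
        return False  # not enough data to judge
    recent = samples_mb[-window:]
    rising = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return rising and (recent[-1] - recent[0]) >= min_growth_mb


steady_growth = [100, 120, 140, 160, 180, 200]  # hypothetical samples in MB
normal_churn = [100, 150, 90, 160, 100, 150]

print(leak_suspected(steady_growth))  # True
print(leak_suspected(normal_churn))   # False
```

An alarm on the growth trend, rather than on an absolute limit, can fire well before the process gets close to exhausting memory.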


Google App Engine Outage on October 26, 2012

Here is a brief summary of the Google App Engine outage that happened on October 26, 2012.

Time:
The problem started at about 4:00 am (PST) in one of the App Engine data centers, and operations were restored to normal at about 11:45 am.

Impact during the outage
(As per the Google App Engine team) about 50% of requests to App Engine applications failed during this outage.
A few users noticed a marked increase in the number of server instances for their applications.

How the outage happened
1) At about 4:00 am PST, in one of the Google App Engine data centers, the load on the traffic routers began increasing and later crossed the paging threshold.
2) The team had to restart the traffic routers in the affected data center to address this load increase.
3) Unexpectedly, the restart reduced the number of available traffic routers below the minimum requirement. This overloaded the other data centers and caused a cascading failure.
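
The cascading failure in point 3 can be illustrated with a toy simulation in Python. It assumes a very simplified model that is not taken from Google's report: each data center has a fixed routing capacity, and when one is overloaded it fails and its traffic is split evenly among the survivors. The data center names and all the numbers are made up.

```python
def simulate_cascade(loads, capacities):
    """Return the data centers that fail, in order, under a simple spill-over model."""
    loads = dict(loads)  # work on a copy
    alive = set(loads)
    failed = []
    while True:
        overloaded = sorted(dc for dc in alive if loads[dc] > capacities[dc])
        if not overloaded:
            return failed
        dc = overloaded[0]
        alive.remove(dc)
        failed.append(dc)
        if not alive:
            return failed  # nothing left to absorb the traffic
        share = loads.pop(dc) / len(alive)
        for other in alive:
            loads[other] += share  # survivors take over the failed DC's traffic


loads = {"dc-a": 80, "dc-b": 70, "dc-c": 70}

# A restart cuts dc-a's routing capacity from 100 to 60.
print(simulate_cascade(loads, {"dc-a": 60, "dc-b": 100, "dc-c": 100}))
# ['dc-a', 'dc-b', 'dc-c']

# Same event, but the other data centers have more headroom.
print(simulate_cascade(loads, {"dc-a": 60, "dc-b": 120, "dc-c": 120}))
# ['dc-a']
```

In the second run the extra headroom in the other data centers absorbs the spill-over, so the failure stays local.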

How it was resolved
With no other option, the team had to perform a complete restart of the system and bring the traffic back gradually until operations returned to normal.
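
Google's summary does not say how the traffic was brought back, so this is only a Python sketch of the general idea of a gradual ramp-up: admit a larger share of traffic step by step, and back off whenever a health check fails. The ramp_up function, the 25% step size and the health check are hypothetical.

```python
def ramp_up(is_healthy, step_percent=25, max_rounds=20):
    """Raise the admitted traffic percentage step by step while the system stays healthy."""
    percent = 0
    history = []
    for _ in range(max_rounds):
        if percent >= 100:
            break
        candidate = min(100, percent + step_percent)
        if is_healthy(candidate):
            percent = candidate                       # safe to admit more traffic
        else:
            percent = max(0, percent - step_percent)  # back off one step
        history.append(percent)
    return history


print(ramp_up(lambda percent: True))  # [25, 50, 75, 100]
```

The max_rounds limit just keeps the sketch from looping forever if the system never becomes healthy.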

Post-Outage Refunds 
Google App Engine promised to credit 10% of the monthly charges to all their paying customers.

What has been done to avoid this in the future
The Google App Engine team increased the traffic routing capacity in their data centers and also made a few configuration adjustments to reduce the possibility of another cascading failure.


Airtel Broadband Problems - My Customer Experience

I have reported these problems to Airtel's support several times, but every time, instead of understanding the problems and resolving them, their executives give some excuse to close the tickets. Here are a few of the problems I have been facing with the Airtel Broadband connection in Chennai for the past few months -

VettyOfficer.com - a fresh beginning

All these years I have been blogging on my personal blog @ vettyofficer.blogspot.com, and I also registered the new domain www.vettyofficer.com to point to that old blog. However, I don't know why, all of a sudden it occurred to me to start afresh. Hence this new home on the internet.

I know I am not an active blogger who posts frequently, but this time I have decided to post at regular intervals, and I hope I will :-)