[SATLUG] Amazon's April 21 Cloud Outage

Don Wright satlug at sbcglobal.net
Sun May 1 15:57:48 CDT 2011


Amazon has posted their analysis of the April 21st meltdown of several
of their cloud services.
http://aws.amazon.com/message/65648/

From this layman's view, it reveals a design that incorporated multiple
single points of failure that led to a cascading collapse as successive
levels of management and automatic recovery were unable to cope with the
load. In a way it's reminiscent of the July 20, 2008, global collapse of
their S3 (storage) service: system management messages were given
priority over actually doing work, and a flood of (mis-)management data
prevented otherwise innocent systems from performing their jobs.

Would any high-availability professionals care to comment or point me
toward better analysis than I've seen with a casual Google?  --Don

-- 
Price slightly higher west of the Rockies.

