[SATLUG] Amazon's April 21 Cloud Outage
satlug at sbcglobal.net
Sun May 1 15:57:48 CDT 2011
Amazon has posted their analysis of the April 21st meltdown of several
of their cloud services.
From this layman's view it reveals a design that incorporated multiple
single points of failure that led to a cascade collapse as successive
levels of management and automatic recovery were unable to cope with the
load. In a way it's reminiscent of the July 20, 2008 global collapse of
their S3 (storage) services - system management messages were given
priority over actually doing work, and a flood of (mis-)management data
prevented otherwise innocent systems from performing their jobs.
Would any high-availability professionals care to comment or point me
toward better analysis than I've seen with a casual Google? --Don
Price slightly higher west of the Rockies.
More information about the SATLUG