[SATLUG] Amazon's April 21 Cloud Outage
esanchezvela.satlug at gmail.com
Mon May 2 10:49:58 CDT 2011
On Sun, May 1, 2011 at 4:57 PM, Don Wright <satlug at sbcglobal.net> wrote:
> Amazon has posted their analysis of the April 21st meltdown of several
> of their cloud services.
> From this layman's view it reveals a design that incorporated multiple
> single points of failure that led to a cascade collapse as successive
> levels of management and automatic recovery were unable to cope with the
> load. In a way it's reminiscent of the July 20, 2008 global collapse of
> their S3 (storage) services - system management messages were given
> priority over actually doing work, and a flood of (mis-)management data
> prevented otherwise innocent systems from performing their jobs.
> Would any high-availability professionals care to comment or point me
> toward better analysis than I've seen with a casual Google? --Don
> Price slightly higher west of the Rockies.
i did not read the post to a great detail but it seems like in order
to guarantee service, NodeA replicates its storage to let's say NodeB,
if for any reason NodeB goes down, NodeA searches for another partner
for its data be replicated, apparently they had a routing problem
which caused a whole lot'a of servers loosing contact with their
replication partners, triggering a massive search for replication
partners and from there everything was downhill, it also looks to me
that they took the long path to recovery once they recovered from the
routing problem and other issues, instead of re-using the "failed"
replication partners, it looks like they deployed additional capacity
and restarted the replication to that area.
More information about the SATLUG