[SATLUG] Amazon's April 21 Cloud Outage

Jonathan Kelley jonkelley at gmail.com
Mon May 2 11:31:50 CDT 2011


Makes you wonder how closely they watch their network to detect race
conditions like these. That's why manual intervention should be involved in
any system that attempts to fix itself, lest it create a loop which brings
things down.
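
To make that concrete, here is a rough Python sketch of the kind of guard I
have in mind. It is not how Amazon (or anyone else) actually does it. The
class name RecoveryGovernor, its parameters, and operator_reset() are all
made up for illustration. The idea is simply to cap how many automated
recoveries may run inside a time window, and once the cap is hit, to stop
and wait for a human instead of looping.

```python
import time
from collections import deque


class RecoveryGovernor:
    """Caps automated recovery attempts; past the cap, a human must step in.

    Hypothetical helper for illustration only, not taken from any real system.
    """

    def __init__(self, max_attempts, window_seconds, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock        # injectable so the logic can be tested
        self.attempts = deque()   # timestamps of recent automated recoveries
        self.tripped = False      # True once automation has been halted

    def allow_recovery(self):
        """Return True if an automated recovery may run right now."""
        if self.tripped:
            return False
        now = self.clock()
        # Forget attempts that have aged out of the sliding window.
        while self.attempts and now - self.attempts[0] >= self.window_seconds:
            self.attempts.popleft()
        if len(self.attempts) >= self.max_attempts:
            # Too many recoveries too quickly: halt and wait for an operator.
            self.tripped = True
            return False
        self.attempts.append(now)
        return True

    def operator_reset(self):
        """Called by a human after reviewing the situation."""
        self.attempts.clear()
        self.tripped = False
```

A self-healing loop would call allow_recovery() before each automated fix
and page someone when it returns False. Only operator_reset() re-enables
automation, so a runaway loop cannot restart itself.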

Rackspace is a viable alternative to those burnt by AWS. I may be biased as
an employee.
--
Jon


On Mon, May 2, 2011 at 10:49 AM, Enrique Sanchez <
esanchezvela.satlug at gmail.com> wrote:

> On Sun, May 1, 2011 at 4:57 PM, Don Wright <satlug at sbcglobal.net> wrote:
> > Amazon has posted their analysis of the April 21st meltdown of several
> > of their cloud services.
> > http://aws.amazon.com/message/65648/
> >
> > From this layman's view it reveals a design that incorporated multiple
> > single points of failure that led to a cascade collapse as successive
> > levels of management and automatic recovery were unable to cope with the
> > load. In a way it's reminiscent of the July 20, 2008 global collapse of
> > their S3 (storage) services - system management messages were given
> > priority over actually doing work, and a flood of (mis-)management data
> > prevented otherwise innocent systems from performing their jobs.
> >
> > Would any high-availability professionals care to comment or point me
> > toward better analysis than I've seen with a casual Google?  --Don
> >
> > --
> > Price slightly higher west of the Rockies.
> > --
>
>
> I did not read the post in great detail, but it seems that in order
> to guarantee service, NodeA replicates its storage to, let's say, NodeB.
> If for any reason NodeB goes down, NodeA searches for another partner
> to replicate its data to. Apparently they had a routing problem
> which caused a whole lot of servers to lose contact with their
> replication partners, triggering a massive search for replication
> partners, and from there everything went downhill. It also looks to me
> like they took the long path to recovery: once they recovered from the
> routing problem and other issues, instead of re-using the "failed"
> replication partners, it looks like they deployed additional capacity
> and restarted the replication to that area.
> --
> _______________________________________________
> SATLUG mailing list
> SATLUG at satlug.org
> http://alamo.satlug.org/mailman/listinfo/satlug to manage/unsubscribe
> Powered by Rackspace (www.rackspace.com)
>

