[SATLUG] Average lifespan of an Adaptec SCSI card?

Alan Lesmerises alesmerises at satx.rr.com
Wed Jul 22 20:56:13 CDT 2009


Jeremy Mann wrote:
> After 6 years of faithful service, our Adaptec SCSI adaptor (Adaptec
> 3410S Ultra160 4 channel) may have bit the dust. Over the weekend, 2
> of the 6 drives failed and were marked as Dead. I have replaced those,
> however, the server keeps kernel panicking at random intervals with
> regards to the dpt_i2o driver (kernel SCSI driver).
>
> This server is our secondary backup server and it has been online and
> operational for quite some time. In fact, it's the one server in our
> lab with the highest uptime, until this weekend.
>
> We didn't lose any data other than the operating system so I can bring
> the system down for testing, pulling cards, etc...
>
> My question is, under this typical data center environment, what is
> the average lifespan of a SCSI card (or a SATA card)?

Actually, reliability & failure analysis is my particular area of
specialty and I deal with it every day, so I may be able to shed a 
little light on the subject.

An "average lifespan" by itself is an overly simplistic measure: it
doesn't tell you everything you need to know to determine whether a
particular failure is abnormally early (or late) compared to
comparable units elsewhere.  Failures occur over a range of
operating times (or cycles, miles, or other measures of life
utilization), and there is usually some sort of characteristic curve
(a.k.a. a statistical "distribution") that describes the variation.
Recall the "bell curve" that teachers usually referred to when grading
tests?  That's called a "normal distribution," and it is only one of
many types of distributions that can describe the spread of a set of
data.

Let's say all you have is an "average lifespan" of 8 years, and your
card failed at 6.  If you assume that lifespans are normally
distributed, then a standard deviation (a measure of the width of that
"bell curve") of 1 year would give you radically different results
than a standard deviation of 4 years.  In the first scenario your card
failed two standard deviations early, before about 98% of its peers,
whereas in the second it failed half a standard deviation early, and
only about 69% of its peers have not yet failed.
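
As a rough check on that arithmetic, here is a minimal Python sketch
that evaluates the normal distribution's cumulative probability at the
6-year mark for both standard deviations.  The 8-year mean is only the
illustrative figure used above, not real data for any card, and the
function and constant names are made up for this example.

```python
import math

def normal_cdf(x, mean, sd):
    """Fraction of a normally distributed population that has failed by age x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN_LIFE = 8.0   # years; illustrative average, not measured data
FAILED_AT = 6.0   # years; age at which the card in question failed

for sd in (1.0, 4.0):
    failed = normal_cdf(FAILED_AT, MEAN_LIFE, sd)
    print(f"sd = {sd:.0f} yr: {failed:.1%} failed by year 6, "
          f"{1 - failed:.1%} still running")
```

With these inputs, roughly 2% of cards have failed by year 6 in the
first case and roughly 31% in the second, so the same failure age looks
either very early or fairly ordinary depending on the spread.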

In the case of solid-state electronics, it has long been known that the 
bulk of the failures occur in a pattern known as "infant mortality" (and 
the name does come from human death studies).  That means that failures
occur with decreasing frequency over time, and the longer a part lasts,
the longer it's likely to keep lasting.  That being said, if parts are
subjected to some external factor (excessive heat, vibration, etc.),
failures can also be externally induced.
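
One common way to model that kind of decreasing failure rate is a
Weibull distribution with a shape parameter below 1.  The sketch below
assumes that model purely for illustration; the shape and scale values
are hypothetical and are not fitted to any SCSI card data.  It shows
both the falling failure rate and the rising chance that a unit which
has already survived will make it through one more year.

```python
import math

def survival(t, shape, scale):
    """Weibull survival function: fraction of units still working at age t."""
    return math.exp(-((t / scale) ** shape))

def hazard(t, shape, scale):
    """Weibull hazard (instantaneous failure rate) at age t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def survives_one_more_year(t, shape, scale):
    """Probability that a unit already t years old lasts one more year."""
    return survival(t + 1.0, shape, scale) / survival(t, shape, scale)

SHAPE = 0.5   # hypothetical; any value below 1 gives a falling failure rate
SCALE = 8.0   # hypothetical characteristic life, in years

for age in (1, 3, 6):
    print(f"age {age} yr: hazard = {hazard(age, SHAPE, SCALE):.3f}/yr, "
          f"P(survive next year) = "
          f"{survives_one_more_year(age, SHAPE, SCALE):.3f}")
```

Under these assumed parameters the failure rate drops with age while
the one-year survival probability climbs, which is the "infant
mortality" pattern in numbers.  It says nothing about externally
induced failures, which this model does not cover.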

So, without having specific data to refer to, I'd be inclined to say
that 6 years isn't a bad amount of service life to get out of that card,
but the question of whether the other ones are about to fail is much
more difficult to answer.  The biggest thing you could determine is
whether the other components were in fact exposed to excessive heat,
vibration, or other deleterious effects that could lead to a failure.

Al Lesmerises


