[SATLUG] Average lifespan of an Adaptec SCSI card?
alesmerises at satx.rr.com
Wed Jul 22 20:56:13 CDT 2009
Jeremy Mann wrote:
> After 6 years of faithful service, our Adaptec SCSI adaptor (Adaptec
> 3410S Ultra160 4 channel) may have bit the dust. Over the weekend, 2
> of the 6 drives failed and were marked as Dead. I have replaced those,
> however, the server keeps kernel panicing at random intervals with
> regards to the dpt_i2o driver (kernel SCSI driver).
> This server is our secondary backup server and it has been online and
> operational for quite some time. In fact, its the one server in our
> lab with the highest uptime, until this weekend.
> We didn't lose any data other than the operating system so I can bring
> the system down for testing, pulling cards, etc...
> My question is, under this typical data center environment, what is
> the average lifespan of a SCSI card (or a SATA card)?
Actually, reliability & failure analysis is my particular area of
specialty and I deal with it every day, so I may be able to shed a
little light on the subject.
"Average lifespan" is an overly simplistic measure. On its own it
doesn't tell you everything you need to know to determine whether a
particular failure is abnormally early (or late) relative to
comparable units elsewhere. Failures can occur over a range of
operating times (or cycles, miles, or other measures of life
utilization), and there is usually some sort of characteristic curve
(a.k.a. a statistical "distribution") to describe the variation.
Recall the "bell curve" that teachers usually referred to when grading
tests? That's called a "normal distribution" and is only one of many
types of distributions that can describe the spread of a set of data.
Let's say all you have is an "average lifespan" of 8 years. If you assume
that lifespans are normally distributed, then a standard deviation (a
measure of the width of that "bell curve") of +/- 1 year would give you
radically different results than a standard deviation of +/- 4 years.
The first scenario would mean your card, failing at 6 years, went before
roughly 98% of its peers, whereas the second would mean that about 69%
of its peers have not yet failed.
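The two scenarios can be checked numerically. This is a minimal sketch, assuming lifespans really are normally distributed with a mean of 8 years and using the 6-year failure from the original question. The helper name fraction_outlasting is hypothetical, made up for this illustration.

```python
from statistics import NormalDist

def fraction_outlasting(age_at_failure, mean_life, std_dev):
    """Fraction of peers expected to outlive a unit that failed at
    age_at_failure, assuming normally distributed lifespans
    (an assumption made purely for illustration)."""
    return 1.0 - NormalDist(mu=mean_life, sigma=std_dev).cdf(age_at_failure)

# Same 8-year average, two different spreads.
narrow = fraction_outlasting(6, mean_life=8, std_dev=1)
wide = fraction_outlasting(6, mean_life=8, std_dev=4)

print(f"std dev 1 year:  {narrow:.1%} of peers outlast the card")
print(f"std dev 4 years: {wide:.1%} of peers outlast the card")
```

The same average gives very different answers depending on the spread, which is why the average alone can't tell you whether a 6-year failure is early.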
In the case of solid-state electronics, it has long been known that the
bulk of the failures occur in a pattern known as "infant mortality" (and
the name does come from human death studies). That means that failures
occur with decreasing frequency with time, and the longer it lasts, the
longer it's going to last. That being said, if they are subjected to
some external factor (excessive overheating, vibration, etc.), failures
can also be externally induced.
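One common way to model infant mortality is a Weibull distribution with a shape parameter below 1, whose failure rate falls as the unit ages. That choice of model is an assumption here, and the shape and scale values below are illustrative only, not measured data for any real card.

```python
import math

SHAPE = 0.5   # assumed: shape < 1 gives a decreasing failure rate
SCALE = 8.0   # assumed characteristic life in years (illustrative only)

def survival(t):
    """Probability that a unit is still working at age t (Weibull model)."""
    return math.exp(-((t / SCALE) ** SHAPE))

def hazard(t):
    """Instantaneous failure rate at age t, valid for t > 0."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)

def survives_one_more_year(t):
    """Chance of reaching age t + 1, given the unit has reached age t."""
    return survival(t + 1) / survival(t)

for age in (1, 3, 6):
    print(f"age {age}: hazard {hazard(age):.3f}/yr, "
          f"P(one more year) {survives_one_more_year(age):.3f}")
```

Under these assumptions the hazard shrinks with age and the chance of surviving another year grows, which is the "longer it lasts, the longer it's going to last" effect. Externally induced failures (heat, vibration) are not part of this model.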
So, without having specific data to refer to, I'd be inclined to say
that 6 years isn't a bad amount of service life to get out of that card,
but the question of whether the other ones are about to fail is much
more difficult to answer. The biggest question you could answer is
whether the other components were in fact exposed to excessive heat,
vibration, or other deleterious effects that could lead to a failure.