[SATLUG] Open Source deduplication capabilities?
brad at shub-internet.org
Tue Jun 3 21:55:27 CDT 2008
On 6/3/08, Jeremy Mann wrote:
> I just read a great article in this month's Information Week about a
> new backup technology called 'data deduplication'. Anybody out there
> actually using it and if so, how much has it dropped your storage
> requirements for your backups?

It's not really that new. Network Appliance and EMC have been
selling software to do this on their systems for years now.

It's primarily used in backups, where you're running a
disk-to-disk-to-tape (D2D2T) environment and backing up a lot of
the same data from multiple clients (think full backups, including
the OS, of hundreds of desktop PCs). In those environments, it can
save you a huge amount of space.
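
To make the backup case concrete, here's a minimal Python sketch of
block-level deduplication across several client backups. Everything
in it is an assumption for illustration: the 4 KiB fixed block size,
SHA-256 as the block fingerprint, the dedup_stats helper, and the
synthetic "desktop images". Commercial products use their own
chunking and indexing schemes, which this doesn't try to reproduce.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def dedup_stats(backups, block_size=BLOCK_SIZE):
    """Return (raw_bytes, stored_bytes) for a list of backup images (bytes)."""
    seen = {}  # block digest -> block length, one entry per unique block
    raw = 0
    for image in backups:
        for offset in range(0, len(image), block_size):
            block = image[offset:offset + block_size]
            raw += len(block)
            digest = hashlib.sha256(block).digest()
            # Only the first occurrence of a block costs storage.
            seen.setdefault(digest, len(block))
    stored = sum(seen.values())
    return raw, stored


# Three hypothetical desktop backups: the same 8 "OS" blocks on every
# machine, plus one block of user data that is unique per machine.
os_image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
backups = [os_image + bytes([100 + n]) * BLOCK_SIZE for n in range(3)]

raw, stored = dedup_stats(backups)
print(f"raw={raw} stored={stored} ratio={raw / stored:.2f}")
```

Here the three images hold 27 blocks between them but only 11 unique
ones, so the more clients share the same OS blocks, the better the
ratio gets.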

However, this can come at some performance cost. Deduplication
increases the "hot spots" in your storage environment, and it also
increases your risk of having Single Points Of Failure (SPOFs) that
could seriously hurt you if a failure affected the critical block(s)
in question. Of course, NetApp and EMC are aware of these issues and
they've built in features to try to mitigate these problems, but
there's only so much they can do while keeping the feature useful.

Outside of those kinds of environments, you're spending a lot of CPU
time to identify data that could potentially be de-duplicated, in the
hope of reducing your storage and I/O requirements. Yet you're not
getting that much disk space or I/O capacity back in return for that
investment in CPU.

In essence, you're working backwards to create a single-instance
storage (SIS) mechanism (see
<http://en.wikipedia.org/wiki/Single_instance_storage>), and it takes
a lot of work to try to do that after the fact. It's much easier to
apply SIS right then and there, at the moment you receive something
that needs to be copied to multiple locations.
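
As a rough illustration of doing SIS "right then and there", here's a
small Python sketch of a mail store that keeps one copy of a message
body at delivery time and hands each recipient a reference to it. The
SingleInstanceStore class and its deliver/read methods are
hypothetical names made up for this example, not how any particular
mail server implements it.

```python
import hashlib


class SingleInstanceStore:
    """Toy mail store: one stored body per message, many references."""

    def __init__(self):
        self.blobs = {}      # digest -> message body, stored once
        self.mailboxes = {}  # recipient -> list of digests

    def deliver(self, body, recipients):
        # The body is hashed and written once, no matter how many
        # recipients there are; each mailbox only gets a reference.
        digest = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(digest, body)
        for rcpt in recipients:
            self.mailboxes.setdefault(rcpt, []).append(digest)
        return digest

    def read(self, recipient):
        return [self.blobs[d] for d in self.mailboxes.get(recipient, [])]


store = SingleInstanceStore()
store.deliver(b"Meeting moved to 3pm", ["alice", "bob", "carol"])
print(len(store.blobs), "stored body for", len(store.mailboxes), "mailboxes")
```

No scanning of existing data is needed here, because the duplication
is known at the moment of delivery.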

But for mail systems, SIS doesn't really buy you all that much in the
general case. Quoting from the Wikipedia article:

    However there is a common misconception that the primary benefit
    of single instance storage in mail server solutions is a reduction
    in disk space requirements. The truth is that its primary benefit
    is to greatly enhance delivery efficiency of messages sent to
    large distribution lists. In a mail server scenario disk space
    savings from single instance storage are transient and drop off
    very quickly over time.

And note that Exchange has had SIS for years, and yet SIS is also
the cause of a great many of the instability problems that Exchange
has: when a database gets corrupted and your only copy of a given
message is hosed, every single person who had that message in their
mailbox is now affected. If you hadn't tried to eke out that last
tiny little bit of disk space, then only one such user would be
affected, and they could potentially get a good copy sent to them by
someone else who wasn't affected by the corruption.
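
The difference in blast radius can be sketched in a few lines of
Python. The mailbox layouts and the affected_users helper below are
invented for illustration and say nothing about Exchange's actual
on-disk format; the point is just that a shared blob takes every
referencing mailbox down with it.

```python
def affected_users(mailboxes, corrupted_blob):
    """Return the users whose mailbox references the corrupted blob."""
    return sorted(
        user for user, blobs in mailboxes.items() if corrupted_blob in blobs
    )


# With SIS: all three mailboxes point at the same stored message.
shared = {"alice": ["m1"], "bob": ["m1"], "carol": ["m1"]}

# Without SIS: each mailbox holds its own private copy.
separate = {"alice": ["m1-a"], "bob": ["m1-b"], "carol": ["m1-c"]}

print(affected_users(shared, "m1"))      # everyone loses the message
print(affected_users(separate, "m1-b"))  # only bob does
```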
Brad Knowles <brad at shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>