[SATLUG] Open Source deduplication capabilities?

Brad Knowles brad at shub-internet.org
Tue Jun 3 21:55:27 CDT 2008

On 6/3/08, Jeremy Mann wrote:

>  I just read a great article in this month's Information Week about a
>  new backup technology called 'data deduplication'. Anybody out there
>  actually using it and if so, how much has it dropped your storage
>  requirements for your backups?

It's not really that new.  Network Appliance and EMC have been 
selling software to do this on their systems for years now.

It's primarily used for backups in a disk-to-disk-to-tape (D2D2T) 
environment, where you're backing up a lot of the same data from 
multiple clients (think full backups, including the OS, of hundreds 
of desktop PCs).  In those environments, it can save you a huge 
amount of space.
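
To make the space savings concrete, here is a minimal sketch of 
block-level deduplication.  It assumes fixed-size 4 KB blocks and 
SHA-256 fingerprints, and the names (dedup_store, restore, the fake 
"desktop" images) are made up for illustration.  It is not how any 
particular vendor implements the feature.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def dedup_store(images, block_size=BLOCK_SIZE):
    """Store each distinct block once; return (store, recipes)."""
    store = {}    # digest -> block bytes, kept exactly once
    recipes = []  # per image: ordered digests needed to rebuild it
    for data in images:
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # only the first copy is kept
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def restore(store, recipe):
    """Rebuild one image from its list of block digests."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical data: 100 desktops that share the same 8 "OS" blocks
# and each add one block of their own.
os_image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
clients = [
    os_image + f"client-{i:04d}".encode().ljust(BLOCK_SIZE, b"\xff")
    for i in range(100)
]

store, recipes = dedup_store(clients)
raw_bytes = sum(len(c) for c in clients)
stored_bytes = sum(len(b) for b in store.values())
print(f"raw: {raw_bytes} bytes, stored: {stored_bytes} bytes")
```

In this toy case the 900 logical blocks collapse to 108 stored ones 
(8 shared plus 100 unique).  Note that every block still gets hashed, 
whether or not it turns out to be a duplicate.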

However, this can come at some performance cost.  Deduplication 
creates more "hot spots" in your storage environment, because many 
logical files now depend on the same physical blocks, and it 
increases your risk of having Single Points Of Failure (SPOFs) that 
could seriously hurt you if a failure affected the critical block(s) 
in question.  Of course, NetApp and EMC are aware of these issues and 
they've built in features to try to mitigate these problems, but 
there's only so much they can do and keep the feature useful.

Outside of those kinds of environments, you're spending a lot of CPU 
time to identify data that could potentially be de-duplicated, so 
that you can reduce your storage and I/O requirements, and yet you're 
not getting that much disk space or I/O capacity returned on this 
investment in CPU.

In essence, you're working backwards to create a single-instance 
storage (SIS) mechanism (see 
<http://en.wikipedia.org/wiki/Single_instance_storage>), and it takes 
a lot of work to try to do that after-the-fact.  If you can take 
advantage of cases where you get something that needs to be copied to 
multiple locations and you apply SIS right then and there, that's 
much easier.
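
As a rough sketch of doing SIS "right then and there": at delivery 
time the system already knows it has one message for several 
recipients, so it can store the body once and hand out references, 
with no hashing or scanning needed.  The class and method names below 
are hypothetical and do not describe any real mail server.

```python
class SingleInstanceStore:
    """Toy single-instance message store with reference counting."""

    def __init__(self):
        self.bodies = {}     # message id -> body, stored once
        self.refcount = {}   # message id -> number of mailbox references
        self.mailboxes = {}  # user -> list of message ids
        self._next_id = 0

    def deliver(self, body, recipients):
        """Store the body once and reference it from every recipient."""
        msg_id = self._next_id
        self._next_id += 1
        self.bodies[msg_id] = body
        self.refcount[msg_id] = 0
        for user in recipients:
            self.mailboxes.setdefault(user, []).append(msg_id)
            self.refcount[msg_id] += 1
        return msg_id

    def delete(self, user, msg_id):
        """Drop one reference; free the body when nobody needs it."""
        self.mailboxes[user].remove(msg_id)
        self.refcount[msg_id] -= 1
        if self.refcount[msg_id] == 0:
            del self.bodies[msg_id]
            del self.refcount[msg_id]


sis = SingleInstanceStore()
msg = sis.deliver(b"Meeting moved to 3pm", ["alice", "bob", "carol"])
print(len(sis.bodies), sis.refcount[msg])  # one stored body, three references

sis.delete("alice", msg)
sis.delete("bob", msg)
print(msg in sis.bodies)  # body is kept while carol still references it
```

The delivery path does a single write no matter how many recipients 
there are, which is the cheap case.  Finding duplicates after the 
copies have already been written separately is the expensive one.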

But for mail systems, SIS doesn't really buy you all that much in the 
general case.  Quoting from the Wikipedia article:

	However there is a common misconception that the primary benefit
	of single instance storage in mail server solutions is a reduction
	in disk space requirements. The truth is that its primary benefit
	is to greatly enhance delivery efficiency of messages sent to
	large distribution lists. In a mail server scenario disk space
	savings from single instance storage are transient and drop off
	very quickly over time.

And note that Exchange has had SIS for years, and yet SIS is also 
the cause of a great many of Exchange's instability problems -- 
when a database gets corrupted and your only copy of a given message 
is hosed, every single person who had that message in their mailbox 
is now affected.  If you hadn't tried to eke out that last tiny 
little bit of disk space, then only one such user would be affected 
and they could potentially get a good copy sent to them by someone 
else who wasn't affected by the corruption.
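
The difference in who gets hurt can be shown with a toy model.  The 
mailbox layouts and key names below are invented for illustration and 
have nothing to do with how Exchange actually lays out its databases.

```python
def affected_users(mailboxes, corrupted_key):
    """Return the users whose mailbox references the corrupted copy."""
    return sorted(
        user for user, keys in mailboxes.items() if corrupted_key in keys
    )


users = ["alice", "bob", "carol", "dave"]

# Single-instance layout: every mailbox points at the same stored copy.
sis_mailboxes = {user: ["msg-1"] for user in users}

# Per-user layout: each mailbox holds its own copy of the message.
copy_mailboxes = {user: [f"msg-1/{user}"] for user in users}

print(affected_users(sis_mailboxes, "msg-1"))       # everyone on the list
print(affected_users(copy_mailboxes, "msg-1/bob"))  # only bob
```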

Brad Knowles <brad at shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>
