[SATLUG] Open Source deduplication capabilities?

John Pappas j at jvpappas.net
Wed Jun 18 18:11:31 CDT 2008

On Sun, Jun 15, 2008 at 5:39 PM, Brad Knowles <brad at shub-internet.org> wrote:

> On 6/15/08, John Pappas wrote:
>
>> IMHO, dedupe is the only way to bring offsite data replication within
>> reach of any SMB where bandwidth is the bottleneck.
>
> Anyone seriously considering de-duplication technologies should look at the
> series of articles at <http://www.backupcentral.com/content/view/175/1/>
> written by Curtis Preston (author of the O'Reilly books "Unix Backup &
> Restore", the 2nd edition "Backup and Restore" which covers Unix and
> Unix-like OSes as well as others, and "Using SANs and NAS").

Sure, but dedupe does not, AFAIK, imply data integrity.  Ultimately, the
design goals of your initiative define which (if any) dedupe methodologies
should be used.  I like DD because it has data integrity measures built in
(hashes, ECC, etc.).
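
To make the "hashes" part concrete, here is a minimal Python sketch of a
content-addressed chunk store that re-checks the hash on every read.  It is
an illustration only, not how DD or any other product implements it; the
class name VerifiedChunkStore and the choice of SHA-256 are assumptions.

```python
import hashlib


class VerifiedChunkStore:
    """Toy content-addressed store: each chunk is keyed by its SHA-256 digest."""

    def __init__(self):
        self._chunks = {}  # digest (hex string) -> chunk bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical chunks hash to the same key, so they are stored only once.
        self._chunks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._chunks[digest]
        # Re-hash on read so corrupted bytes are reported, never returned.
        if hashlib.sha256(data).hexdigest() != digest:
            raise IOError(f"chunk {digest[:12]} failed integrity check")
        return data

    def __len__(self):
        return len(self._chunks)
```

The same digest serves two purposes here: it is the dedupe key on write and
the integrity check on read.  ECC, which can repair damage rather than just
detect it, is a separate mechanism and is not shown.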

> Keep in mind that a lot depends on the kinds of other tools you're using.
> For example, in a TSM "incrementals forever" type of environment, the fact
> that you're taking only incremental backups means that you're effectively
> doing a lot of de-duplication on the input side, and further de-duplication
> is not likely to buy you as much.  See
> <http://www.backupcentral.com/content/view/136/47/>.

Agreed.  I am not a TSM guy, so I am not sure how a restore works (how many
tapes are needed for a complete restore in an incrementals-forever scheme?).
Most environments can get away with replacing tape with a dedupe methodology
(or single-instance storage) of any type and see a good benefit (think
fileshares and OS backups).  My comments were based on a first-hand
demonstration in a large datacenter with real data and replication
concerns.  I am not Curtis Preston, but I do know some things, and for
near-line or D2D2T needs, DD is a good fit.  SIR (unless it has DB knowledge
built in) falters on DB tablespace storage because any change makes the
whole file "different", whereas block dedupe can account for the tablespace
index changes and whatnot.
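
The difference between file-level and block-level dedupe on a tablespace can
be sketched in a few lines of Python.  This is a toy model under stated
assumptions: make_tablespace is a hypothetical stand-in for a database file,
the 4096-byte fixed block size is arbitrary, and real products often use
variable-size chunks instead.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def file_level_keys(files):
    """Single-instance storage: one hash per whole file."""
    return {hashlib.sha256(data).hexdigest() for data in files}


def block_level_keys(files, block_size=BLOCK_SIZE):
    """Block dedupe: one hash per fixed-size block, pooled across all files."""
    keys = set()
    for data in files:
        for off in range(0, len(data), block_size):
            keys.add(hashlib.sha256(data[off:off + block_size]).hexdigest())
    return keys


def make_tablespace(n_blocks, changed=None):
    """Hypothetical tablespace: n_blocks distinct blocks, optionally one modified."""
    blocks = [bytes([i % 256]) * BLOCK_SIZE for i in range(n_blocks)]
    if changed is not None:
        blocks[changed] = b"\xff" * BLOCK_SIZE
    return b"".join(blocks)


monday = make_tablespace(100)
tuesday = make_tablespace(100, changed=42)  # one block updated overnight

print(len(file_level_keys([monday, tuesday])))   # 2 whole files kept (200 blocks)
print(len(block_level_keys([monday, tuesday])))  # 101 unique blocks kept
```

With one block changed out of 100, the file-level approach sees two unrelated
files and stores both in full, while the block-level approach stores the
original 100 blocks plus the single new one.  Fixed-size blocks only line up
when data is updated in place; an insertion that shifts later bytes would
defeat this simple version.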

