Does anyone else feel that deduplication's final resting place is not going to be backups and WAN accelerators?
It’s only a matter of time before the algorithms are run, as a standard option, on the array processors themselves.
Of course, that means fewer disk sales, but also bigger/faster/more expensive processors.
Replication will also become more efficient – see EMC’s recent acquisition of Kashya (now RecoverPoint). One of its functions is dedup during replication from array to array; how long do you think it will take them to move that functionality onto the array processors?
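To illustrate the idea (a minimal, hypothetical Python sketch – not how Kashya/RecoverPoint actually works): dedup-aware replication hashes the data in chunks and only ships the chunks the target array hasn’t already seen.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks to keep the sketch simple; real products use smarter chunking

def chunks(data: bytes):
    """Split data into fixed-size chunks."""
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

def replicate(data: bytes, target_store: dict) -> int:
    """Ship only chunks the target has not already stored; return bytes sent over the wire."""
    sent = 0
    for c in chunks(data):
        digest = hashlib.sha256(c).hexdigest()
        if digest not in target_store:   # target already has this chunk: skip the transfer
            target_store[digest] = c
            sent += len(c)
    return sent

target = {}
day1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))  # 10 distinct chunks
day2 = day1 + b"\xff" * CHUNK_SIZE                           # one new chunk appended
print(replicate(day1, target))  # 40960 bytes sent
print(replicate(day2, target))  # 4096 bytes sent (only the new chunk)
```

Whether that hash-and-compare loop runs in a replication appliance or on the array processor is largely a question of where you want to spend the CPU cycles.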
Just some random thoughts…
D
Dimitris,
IMO, technologies that help with data footprint reduction will show up everywhere, considering that data is growing a lot faster than the capacity to store it. It will still result in more disk sales, not fewer.
I have been writing about this for the past year, most recently at http://andirog.blogspot.com/2007/03/data-deluge-storage-software-need-to.html.
Anil
Yes, also see my entry here: http://recoverymonkey.net/wordpress/?p=10
I still think that, in the end, the real issues will become indexing and classification.
You are asking two questions:
1: Are backups/WAN accelerators the ultimate endpoint for de-dupe?
2: When will de-dupe be put into the array processors?
For #1, backups and WAN accelerators are a far better fit for de-dupe than other applications because of their data profile. De-duplication is useful when you see the same data over and over again: when you do your weekly full backup, you’re going to get high duplication levels. Other applications, with structured data or not, don’t have such high levels of repetition, so they only get limited benefit out of de-dupe; often it isn’t enough to justify the pain of a new system, and the ROI just isn’t there. (The toy sketch at the end of this comment puts some numbers on this.)
For #2, there are already disk systems that include integrated de-dupe. Take a look at the NAS approach of Data Domain or NEC’s HydraStor: they both have de-dupe built in. And the Hifn Express DR 250/255 card will do de-dupe hashing in hardware to speed it up. File-system approaches make a lot more sense than block-based storage for the reasons in #1: the data profile for block-based applications just doesn’t fit. With the typical backup currently starting with disk-to-disk, there’s no killer need for block de-dupe.
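To put rough numbers on the point about weekly fulls in #1, here is a toy Python sketch under obviously simplified assumptions (fixed-size chunking, everything in memory): a content-addressed store keeps each unique chunk once, so a second, nearly identical full backup costs almost nothing in physical space.

```python
import hashlib

CHUNK = 8192  # fixed-size chunking, purely for illustration

class DedupStore:
    """Toy content-addressed store: each unique chunk is kept once;
    a backup is just a 'recipe' of chunk hashes."""
    def __init__(self):
        self.chunks = {}   # hash -> chunk bytes, stored once
        self.backups = {}  # backup name -> list of chunk hashes

    def ingest(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            h = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(h, piece)  # only previously unseen content consumes space
            recipe.append(h)
        self.backups[name] = recipe

    def dedup_ratio(self):
        logical = sum(len(self.chunks[h]) for r in self.backups.values() for h in r)
        physical = sum(len(c) for c in self.chunks.values())
        return logical / physical

store = DedupStore()
full = b"".join(bytes([i]) * CHUNK for i in range(32))  # a 256 KB "weekly full"
store.ingest("week1", full)
store.ingest("week2", full + b"\xff" * CHUNK)           # next week: one new chunk's worth of change
print(round(store.dedup_ratio(), 2))  # ~1.97x after just two fulls, and it climbs every week
```

An ordinary primary-storage workload doesn’t repeat itself like that, which is exactly why the ROI argument in #1 cuts against de-dupe outside of backup-style data.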