One subject I keep hearing about is deduplication. The idea is that you save a ton of space, since a lot of your computers hold identical data.
One way to do it is with an appliance-based solution such as Data Domain. Effectively, they put a little server and a cheap-but-not-cheerful, non-expandable 6TB RAID together, then charge a lot for it, claiming it can hold 90TB or whatever. Use many of them to scale.
The technology chops up incoming files into pieces (blocks). Then, the server calculates a unique numeric ID for each piece using a hash algorithm.
The ID is then associated with the block and both are stored.
If the ID of a new block matches one already stored, the new block is NOT stored, but its ID is, as is its association with the rest of the blocks in the file (so that deleting a file won’t adversely affect blocks it shares with other files).
This is what allows dedup technologies to store a lot of data.
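Here’s a toy sketch of the mechanism in Python (purely illustrative: the fixed 8KB blocks and SHA-256 hash are my assumptions, not necessarily what any vendor actually uses):

    import hashlib

    BLOCK_SIZE = 8 * 1024   # hypothetical fixed block size; real products vary

    block_store = {}        # block ID -> block data (each unique block stored once)
    file_recipes = {}       # file name -> ordered list of block IDs

    def backup_file(name, data):
        # Chop the file into blocks, hash each one, and store only blocks
        # we haven't seen before. The list of IDs (the "recipe") is always
        # kept, so the file can be rebuilt even when none of its blocks
        # were new.
        ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            block_id = hashlib.sha256(block).hexdigest()  # the "unique numeric ID"
            if block_id not in block_store:               # seen it before? skip the data...
                block_store[block_id] = block
            ids.append(block_id)                          # ...but always record the ID
        file_recipes[name] = ids

The point is the last two lines: duplicate data gets dropped, but the recipe of IDs is always kept.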
Now, here’s why how much you can actually store depends on your data:
If you’re backing up many different unique files (like images), there will be almost no similarity, so everything will be backed up.
If you’re backing up 1000 identical Windows servers (including the Windows directory), then there WILL be a lot of similarity, and great efficiencies.
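To put rough, purely illustrative numbers on it: if each of those 1000 servers carries a ~10GB Windows directory, that’s ~10TB of raw backup data, yet the dedup store only needs one copy of the common blocks: roughly 10GB, plus the per-server differences and the ID metadata.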
Now the drawbacks (and why I never bought it):
The thing relies on a weak server and a small database. As you’re backing up more and more, there will be millions (maybe billions) of IDs in the database (remember, a single file may have multiple IDs).
Imagine you have 2 billion entries.
Imagine you’re trying to back up someone’s 1GB PST, or some other large file that stays mostly the same over time (the ideal dedup scenario). The file gets chopped up into, say, 100 blocks.
Each block has its ID calculated (CPU-intensive).
Then, EACH ID has to be compared with the ENTIRE database to determine whether there’s a match or not.
This can take a while, depending on what search/sort/store algorithms they use.
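To see why the lookup strategy matters so much, here’s a scaled-down toy comparison (the numbers and the naive linear scan are mine; I have no idea what Data Domain’s internals actually look like):

    import hashlib, os, time

    # Pretend the box has 200,000 IDs on file (scaled way down from
    # billions so this runs quickly), and we're backing up a 100-block file.
    existing = [hashlib.sha256(os.urandom(16)).hexdigest() for _ in range(200_000)]
    id_list = existing          # naive: every lookup scans the whole database
    id_index = set(existing)    # indexed: constant-time membership test

    new_ids = [hashlib.sha256(os.urandom(16)).hexdigest() for _ in range(100)]

    t0 = time.perf_counter()
    misses = sum(1 for i in new_ids if i not in id_list)   # 100 full scans
    t1 = time.perf_counter()
    misses = sum(1 for i in new_ids if i not in id_index)  # 100 hash lookups
    t2 = time.perf_counter()
    print(f"linear scan: {t1 - t0:.3f}s   indexed: {t2 - t1:.6f}s")

The gap between the two only gets worse as the database grows, which is exactly the question the vendor couldn’t answer.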
I asked Data Domain about this, and all they kept telling me was “try it, we can’t predict your performance”. I asked them whether they had even tested the box to see what the limits were, and they hadn’t. Hmmm.
I did find out that, at best, the thing works at 50MB/s (slower than an LTO3 tape drive), unless you use tons of them.
Now, imagine you’re trying to RECOVER your 1GB PST.
Say you try to recover from a “full” backup on the Data Domain box, but that file has been living on it for a year, with only the new blocks being added to it over time.
When you request the file, the Data Domain box has to synthesize it (remember, even the “full” doesn’t include the whole file). It will read the IDs needed to recreate the file and put the blocks together so it can present the final file, as it should have looked.
This is CPU- and disk-intensive. Takes a while.
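Continuing the toy sketch from earlier, the restore path is just the recipe walked back; trivial in code, but on a real box each block fetch can be a random disk read:

    def restore_file(name):
        # Walk the recipe and fetch every block from the store. Simple here,
        # but on a real appliance each fetch can be a random read from
        # wherever that block landed on disk months ago, which is what
        # makes synthesizing a "full" slow.
        return b"".join(block_store[block_id] for block_id in file_recipes[name])

    # Round trip: what went in should come back out.
    # backup_file("mail.pst", pst_bytes)
    # assert restore_file("mail.pst") == pst_bytes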
The whole point of doing backups to disk is to back up and restore faster and more reliably. If you’re slowing things down in order to compress your disk as much as possible, you’re doing yourself a disservice.
Don’t get me wrong, dedup tech has its place, but I just don’t like the appliance model, for performance and scalability reasons.
EMC just purchased Avamar, a dedup company that does the exact same thing but lets you install the software on whatever you want.
There are also Asigra and Evault, both great backup/dedup products that can be installed on ANY server and work with ANY disk, not just the el cheapo quasi-JBOD Data Domain sells.
So, you can leverage your investment in disk and load the software on a beefy box that will actually work properly.
Another tack would be to use virtual tape. It doesn’t do dedup yet (though it will: EMC bought Avamar, and ADIC, now Quantum, also acquired another dedup company and will put the stuff in their VTL, so you’ll be able to get the best of both worlds), but it does do compression, just like real tape.
Plus, even the cheapest EMC virtual tape box works at over 300MB/s.
I sort of detest the “drop at the customer site” model Data Domain (and a bunch of the smaller storage vendors) use. They expect you to put the box in, and if it works OK, it’s easier to keep it than to send it back.
Most people will keep the first thing they try (unless it fails horrifically), since they don’t want to go through the trouble of testing 5 different products (unless we’re talking about huge companies that have dedicated testing staff).
Let me know what you think…