More FUD busting: Deduplication – is variable-block better than fixed-block, and should you care?

Before all the variable-block aficionados go up in arms, I freely admit variable-block deduplication may overall squeeze more dedupe out of your data.

I won’t go into a laborious explanation of variable vs fixed but, in a nutshell, fixed-block deduplication means the data is split into equal-sized chunks, each chunk is given a signature, the signatures are compared against a database, and chunks that have already been seen are not stored again.

Variable-block basically means the chunk boundaries (and therefore the chunk sizes) can vary, with the more intelligent algorithms also using a sliding window, so that even if the content in a file is shifted, the commonality will still be discovered.
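To make the difference concrete, here’s a minimal Python sketch. It is not any vendor’s actual algorithm: the chunk sizes, the adler32-based boundary test and the random sample data are purely illustrative. It chunks the same data both ways, shifts the data by one byte, and counts how many chunk fingerprints still match.

```python
import hashlib
import random
import zlib

def fixed_chunks(data, size=4096):
    """Fixed-block: cut at every `size`-byte offset and hash each chunk."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def variable_chunks(data, window=48, divisor=1024, min_size=512, max_size=8192):
    """Variable-block (content-defined): cut wherever a checksum of the
    trailing `window` bytes hits a target value, so boundaries follow the
    content instead of absolute offsets. Real products use Rabin
    fingerprints or similar rolling hashes; this is only an illustration."""
    hashes, start = [], 0
    for i in range(len(data)):
        chunk_len = i - start + 1
        at_boundary = zlib.adler32(data[max(0, i - window + 1):i + 1]) % divisor == 0
        if (at_boundary and chunk_len >= min_size) or chunk_len >= max_size:
            hashes.append(hashlib.sha256(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        hashes.append(hashlib.sha256(data[start:]).hexdigest())
    return hashes

# Insert one byte at the front: every fixed-size chunk shifts and its hash
# changes, while most content-defined boundaries re-align and still match.
random.seed(0)
original = bytes(random.randrange(256) for _ in range(128 * 1024))
shifted = b"\x00" + original
print("fixed chunks in common:   ",
      len(set(fixed_chunks(original)) & set(fixed_chunks(shifted))))
print("variable chunks in common:",
      len(set(variable_chunks(original)) & set(variable_chunks(shifted))))
```

With fixed blocks the one-byte shift changes every chunk, so nothing matches; with content-defined boundaries the chunks re-align right after the insertion point and most of them still dedupe. That, in a nutshell, is the entire variable-block argument.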

With that out of the way, let’s get to the FUD part of the post.

I recently had a TLA vendor tell my customer: “NetApp deduplication is fixed-block vs our variable-block, therefore far less efficient, therefore you must be sick in the head to consider buying that stuff for primary storage!”

This is a very good example of FUD based on accurate facts, which also focuses the customer’s mind on the tech nitty-gritty and away from the big picture (“primary storage”, in this case).

Using the argument for a pure backup solution is actually valid. But what if the customer is not just shopping for a backup solution? Or, what if, for the same money, they could have it all?

My question is: Why do we use deduplication?

At the most basic level, deduplication will reduce the amount of data stored on a medium, enabling you to buy less of said medium yet still store quite a bit of data.

So, backups were the most obvious place to deploy deduplication. Backup-to-disk is all the rage; what if you could store more backups on target disk with less gear? That’s pretty compelling. In that space you have, of course, Data Domain and the Quantum DXi as two of the more usual backup target suspects.

Another reason to deduplicate is not only to achieve more storage efficiency but also to improve backup times, by never sending data over the network that has already been transferred. In that space there’s Avamar, PureDisk, Asigra, EVault and others.
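The idea behind those source-side products is roughly the following (the `BackupTarget` class and `backup()` function below are hypothetical stand-ins, not any vendor’s actual protocol): the client fingerprints its chunks, asks the target which fingerprints it has never seen, and ships only those.

```python
import hashlib

class BackupTarget:
    """Toy backup target that remembers which chunk fingerprints it holds."""
    def __init__(self):
        self.store = {}                      # fingerprint -> chunk bytes

    def missing(self, fingerprints):
        """Tell the client which of these fingerprints it has never seen."""
        return [fp for fp in fingerprints if fp not in self.store]

    def put(self, chunks):
        for chunk in chunks:
            self.store[hashlib.sha256(chunk).hexdigest()] = chunk

def backup(client_chunks, target):
    """Send only the chunks the target does not already have."""
    digests = {hashlib.sha256(c).hexdigest(): c for c in client_chunks}
    needed = target.missing(list(digests))
    target.put([digests[fp] for fp in needed])
    return len(needed), len(digests)

target = BackupTarget()
monday = [b"os image"] * 10 + [b"monday log"]
tuesday = [b"os image"] * 10 + [b"tuesday log"]
print(backup(monday, target))    # first full: every unique chunk goes over the wire
print(backup(tuesday, target))   # next night: only the one new chunk is sent
```

The second night only ships the single chunk the target has never seen, which is where both the space savings and the shorter backup window come from.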

NetApp simply came up with a few more reasons to deduplicate, not mutually exclusive with the other 2 use cases above:

  1. What if you could deduplicate your primary storage – typically the most expensive part of any storage investment – and as a result buy less?
  2. What if deduplication could actually dramatically improve your performance in some cases, while not hindering it in most cases? (the cache is deduplicated as well; see the sketch after this list, with more info later).
  3. What if deduplication was not limited to infrequently-accessed data but, instead, could be used for high-performance access?
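To illustrate point 2, here is a minimal sketch of block sharing. It is purely illustrative and not NetApp’s actual implementation: many logical block addresses map to one physical block, so a single stored copy (and a single cache entry) serves them all.

```python
import hashlib

class DedupedVolume:
    """Toy block layer: many logical block addresses can point at one
    physical block, so both disk space and read cache are shared."""
    def __init__(self):
        self.physical = {}     # fingerprint -> block data ("disk")
        self.refcount = {}     # fingerprint -> number of logical owners
        self.block_map = {}    # logical block number -> fingerprint
        self.cache = {}        # fingerprint -> block data (read cache)
        self.cache_hits = 0

    def write(self, lbn, data):
        fp = hashlib.sha256(data).hexdigest()
        if fp not in self.physical:          # new data: store it only once
            self.physical[fp] = data
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        self.block_map[lbn] = fp

    def read(self, lbn):
        fp = self.block_map[lbn]
        if fp in self.cache:                 # one cached copy serves every
            self.cache_hits += 1             # logical block that shares it
            return self.cache[fp]
        data = self.physical[fp]
        self.cache[fp] = data
        return data

vol = DedupedVolume()
golden = b"common OS block"
for vm in range(100):                        # 100 VMs cloned from one image
    vol.write(vm, golden)
for vm in range(100):
    vol.read(vm)
print(len(vol.physical), "physical block(s) for 100 logical blocks,",
      vol.cache_hits, "cache hits")
```

One hundred “VM boot blocks” end up as a single physical block, and after the first read every subsequent read is a cache hit; that is why a deduplicated cache can help rather than hurt performance for workloads like VDI.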

For the uninitiated, NetApp is the only vendor, to date, that can offer block-level deduplication for all primary storage protocols for production data – block and file, FC, iSCSI, CIFS, NFS.

Which is a pretty big deal, as is anything useful AND exclusive.

What the FUD carefully fails to mention is that:

  1. Deduplication is free to all NetApp customers (whoever didn’t have it before can get it via a firmware upgrade for free)
  2. NetApp customers that use this free technology see primary storage savings ranging anywhere from 10% to 95% in my experience, despite all the limitations the FUD-slingers keep mentioning
  3. It works amazingly well with virtualization and actually greatly speeds things up especially for VDI
  4. Things that would defeat NetApp dedupe will also defeat the other vendors’ dedupe (movies, compressed images, large DBs with a lot of block shuffling). There is no magic (the quick demo after this list makes the point).
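To ground point 4, here is a quick, purely illustrative demo (hypothetical data, standard-library zlib): two nearly identical documents share almost every fixed-size block when stored raw, but essentially none once each copy is compressed, because a tiny edit ripples through the whole compressed stream.

```python
import hashlib
import random
import zlib

def chunk_hashes(data, size=4096):
    """Fingerprints of the fixed-size chunks of `data`."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

# Two nearly identical "documents": v2 changes five bytes in place.
random.seed(1)
words = [b"alpha", b"bravo", b"charlie", b"delta", b"echo", b"foxtrot"]
doc_v1 = b" ".join(random.choice(words) for _ in range(400_000))
doc_v2 = bytearray(doc_v1)
doc_v2[100:105] = b"EDIT!"                   # small in-place edit, same length
doc_v2 = bytes(doc_v2)

raw_shared = len(chunk_hashes(doc_v1) & chunk_hashes(doc_v2))
zip_shared = len(chunk_hashes(zlib.compress(doc_v1)) &
                 chunk_hashes(zlib.compress(doc_v2)))
print("blocks shared, stored raw:       ", raw_shared)   # nearly all of them
print("blocks shared, stored compressed:", zip_shared)   # essentially none
```

Variable-block chunking would not fare any better on the compressed copies, because the bytes genuinely diverge after the edit. That is the “no magic” point.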

So, if a customer is considering a new primary storage system, like it or not, NetApp is the only game in town with deduplication across all storage protocols.

Which brings us back to whether fixed-block is less efficient than variable-block:

WHO CARES? If, even with whatever limitations it may have, NetApp dedupe can reduce your primary storage footprint by any decent percentage, you’re already ahead! Heck, even 20% savings can mean a lot of money in a large primary storage system!
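For the sake of argument, a back-of-envelope calculation. Every figure below is a made-up placeholder, so plug in your own capacity and cost per usable TB:

```python
# All figures are hypothetical placeholders, not real pricing.
usable_tb = 500            # primary storage you would otherwise have to buy
cost_per_tb = 2_000        # fully burdened cost per usable TB (made-up figure)
dedupe_savings = 0.20      # even a "modest" 20% reduction

tb_avoided = usable_tb * dedupe_savings
print(f"capacity avoided: {tb_avoided:.0f} TB")
print(f"money not spent:  ${tb_avoided * cost_per_tb:,.0f}")
```

That is 100 TB and $200,000 worth of primary storage you never had to buy, before even counting any backup-side benefits.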

Not bad for a technology given away with every NetApp system…

D

8 Replies to “More FUD busting: Deduplication – is variable-block better than fixed-block, and should you care?”

  1. Nice to see you so excited about NetApp’s offering!

    Just to keep it a bit balanced.

    * dedupe is FREE on every EMC Celerra as well
    * EMC is better on some use cases than NetApp
    * NetApp is better on some use cases than EMC
    * backups are a pretty big use case for dedupe
    * EMC thinks we have that one handled pretty well
    * “primary storage” is a very broad category indeed
    * I don’t think anyone wants to dedupe their SAP instance
    * your mileage will always vary based on your use case

    So, now that there are multiple ways to do something, shouldn’t the discussion move to use cases and comparisons?

    Finally — and please don’t take this the wrong way — is there a business relationship between you and NetApp of some sort?

    I’m not saying that there is, but — if there is — you should disclose it (or lack thereof).

    Thanks!

    — Chuck

    Hi Chuck – click on the “about” link since the disclosure is in there. My HTML skills utterly suck, BTW; I just use a basic WordPress template that I very slightly modified.

    Agreed that backups are IMO a HUGE case for dedupe, not just big. That’s where it all started. The dedupe backup appliance market is doing some pretty brisk business.

    The point of the post was, though, again the use of FUD to dissuade customers from making a decision that has nothing to do with backup.

    Some people do turn dedupe on for DBs, and in some cases it removes a lot of the white space + other commonality.

    Dedupe also saves space for Exchange 2010 since Microsoft removed the single-instancing from it (we see about 30%). Could be significant in some shops.

    The best use cases, in my opinion, for primary storage dedupe (i.e. use cases where it’s almost a no-brainer):

    * VMware/Hyper-V/Xen of course
    * General OS drives in boot-from-SAN environments
    * File shares
    * Exchange 2010 (does very little for ’03)

    The above covers quite a bit of ground.

    Of the above 4 use cases, a Celerra would handle the third via the RecoverPoint compression algorithm and file single-instancing.

    Which, for older CIFS/NFS data that fits the scanner’s heuristics (compressible, identical files down to the last bit, not accessed in the last x days, etc.), might be just fine.

    But what about running VMware over NFS and deduping it? (you brought the Celerra up :))

    Anyway, let’s focus on the subject of the post pls…

    D

    You’re right that there’s no magic to this, Dimitris. For live VM storage, NetApp is actually a great fit, because a few blocks are changing all the time in that big VMDK file, so deduping needs to be done at the physical layer to avoid a big I/O performance impact.

    But it’s not fair to say "Who cares?" when differences in reduction rates are directly proportional to customer savings. A 10% difference isn’t going to matter much to anyone, but a 2x or 3x difference certainly will. In terms of dedupe there’s static block (NetApp), variable sliding block (Data Domain), SIS (Celerra), and content-aware object dedupe (Ocarina). Each has a different cost-benefit level.

    Which brings me to the *other* primary storage application: nearline or online archival. For data that is static (by nature or by age), it’s worth spending some extra CPU cycles to get the most out of it, potentially applying better dedupe algorithms (such as variable or object) as well as compression, to really amplify the benefits of tiering, for example.

    Again, to keep it balanced, it’s worth pointing out that Celerra, BlueArc (Titan & Mercury), HDS HNAS, Isilon, and HP x9000 all have dedupe solutions for primary storage applications delivered via Ocarina (who I work for), and we routinely deliver 30/50/70% savings on files that Netapp does nothing for. On the other hand, our current solution has too much performance overhead for live VM environments where Netapp plays well (even though we can shrink the heck out of VMDKs!).

    So I’m hopeful that as data reduction propagates as a standard offering across primary storage, the market will begin to recognize that these offerings are not equivalent in terms of optimal use case and price-performance, and we won’t make apples-and-oranges comparisons.

    Keep up the good posts!

  4. Thanks Mike.

    I had to edit this response for clarification: I think both Chuck and Mike either missed the point of the post or I wasn’t clear in making it. Or maybe it’s too hard to resist promoting one’s product 🙂

    The point simply was:

    The use of FUD, yet again, to obfuscate things and dissuade a customer from buying something that absolutely is a fit. In this case, the FUD being variable-block vs fixed-block dedupe, and the customer in question (should have mentioned that) wanting dedupe primarily for VMware, and whatever other savings in the NAS part.

    Mike: Is the compression lossy or lossless to deliver the stated savings? I heard someone say that the max savings are achieved by using lossy compression, but I’d rather have someone from Ocarina go on record than spread rumors. Lossy would of course provide even greater savings where images/video are concerned, and it makes sense to have “content-aware” compression that way. Not everyone is OK with lossy, of course, since it alters the data.

    The goal for NetApp dedupe was, simply, to enable general space savings (and even accelerate performance), not maximum savings, with the minimum of fuss, regardless of data.

    There are obviously solutions that can maximize the savings, such as yours (BTW doesn’t it theoretically work with any NAS?)

    It’s kinda like running a very efficient compression algorithm such as 7-zip at max compression: the CPU cost of delivering said compression gets too high for anything non-nearline. Of course, once it’s compressed, the savings are often worth the CPU cycles!

    I can see Ocarina being a good fit for someone with tons of rather infrequently-accessed data in fileshares. My only beef is that it adds another layer of stuff in the solution, and some customers might be averse to that. The market trend is simplification (another reason NetApp is winning so many deals). Of course, if the tech is SO good that the complexity and added headaches are worth it, then you have something!

    Isn’t technology cool? All those options.

    D

  5. @ Mike:

    " Any sufficiently advanced magic is indistinguishable from technology "

    You are talking about lossy compression, as Dimitris suggested, for the 30-70% savings where lossless dedupe can save nothing, correct?

  6. Hi Dimitris

    I was very surprised to learn on Storagebod’s blog that you are now a paid employee of NetApp.

    You wrote there “Recoverymonkey has always been paid for by me looong before I joined NetApp” as your rationale for not fully disclosing this fact.

    Trust me, the fact that you are now a paid employee of a vendor (any vendor!) is a salient fact that should be disclosed to your audience.

    Also, this would make you compliant with new FTC blogging rules.

    All of us in the storage blogging world think it’s very important to know who’s paying the bills. As a matter of fact, many of us believe that full disclosure and transparency about your employment and/or compensation is an extremely important matter.

    The EMC blogging policy is pretty clear on that — if you blog or comment on EMC-related topics, you should disclose your affiliation.

    I’m really quite surprised that NetApp doesn’t have a similar policy in place. It’s a smart business practice.

    I hope you can see this point, and make your current affiliation a bit more public.

    Thanks!

    — Chuck

  7. Chuck – that comment is a bit rich – even coming from you. Lecturing Dimitris regarding FTC rules on his own blog because you were too sloppy to click on the “about” link on this page makes you look petty at best.

    Tell me, when you eat breakfast in the morning and can’t find the orange juice, are you shocked to discover the kids already drank some because they looked behind the milk in the fridge and found it?

  8. Personally, I was surprised that another vendor would jump on a customer blog and speculate on a competitor’s roadmap. You want to comment on what’s on the truck or industry trends, have at it. You want to gin up rumors about a competitor when you know they can’t comment publicly? I’m pretty sure it doesn’t violate any FTC regulations, but it does show extremely poor form. Predictable, but disappointing nonetheless.
