NetApp usable space – beyond the FUD

I come across all kinds of FUD, and some of the most ridiculous claims against NetApp regard usable space. I won’t post screenshots from competitive docs since who knows who’ll complain, but suffice it to say that one of the usual strategies against NetApp is to claim the system has something like well under 50% space efficiency using a variety of calculations, anecdotes and obsolete information. In one case, 34% usable space 🙂 Right…

The purpose of this post is to outline the state of the art regarding NetApp usable space as of spring 2010.

Since NetApp systems can use free space in various ways instead of just for LUNs, there is frequent confusion regarding what each space-related parameter means, and what the best practices are. NetApp’s recommendations have changed over the years as the technology matured – my goal is to bring everybody up to speed.

Executive summary

Depending on the number and type of drives and the design – and aside from edge cases involving small systems with very few disks – the real usable space in NetApp systems can easily exceed 75% of the actual space on the drives; I've seen it as high as about 78%. That's remarkably efficient for something with double-parity protection as the default, and the figure already accounts for spares. The number is the same whether the data is NAS or SAN, and it doesn't include deduplication, compression or space-efficient clones, which can inflate effective capacity to over 1000%. Indeed, NetApp systems are used in some of the biggest storage installations on the planet partly because they're so space-efficient. Now, on to the details.
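
To make the ~75% figure concrete, here's a back-of-the-envelope sketch in Python. The configuration (48 drives, two spares, three RAID-DP groups) and the 418,000 MiB right-sized capacity of a "450GB" drive are assumptions for illustration, not output from any sizing tool:

```python
# Rough sketch of the ~75% usable figure (all inputs are illustrative).
GIB = 2**30
right_sized_gib = 418_000 * 2**20 / GIB   # ~408.2 GiB per "450GB" drive

total_drives = 48
spares = 2
raid_groups = [16, 16, 14]                # drives per RAID-DP group
parity_drives = 2 * len(raid_groups)      # RAID-DP: 2 parity per group

data_drives = total_drives - spares - parity_drives
usable_gib = data_drives * right_sized_gib * 0.90   # minus 10% system reserve

actual_gib = total_drives * right_sized_gib
print(f"usable: {usable_gib:,.0f} GiB of {actual_gib:,.0f} GiB actual "
      f"= {usable_gib / actual_gib:.0%}")
# -> usable: 14,695 GiB of 19,594 GiB actual = 75%
```

Change the drive count or the RAID group layout and the percentage moves, which is why very small systems are the edge case.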

What’s space good for anyway?

Legacy arrays use space in very simple terms – you create RAID groups, then you create LUNs on them and those LUNs pretend they’re normal disks, and that’s that. Figuring out where your space goes is easy – there’s a 1:1 relationship between LUN size and space used on the array. You buy an array that can provide 10TB after RAID and spares, and that’s all you ever get – nothing more, nothing less.

Legacy arrays can sometimes use features such as snapshots, but frequently there are so many caveats around their use (performance being a big one) that either they're never implemented, or so few are kept that they're not really useful.

Since NetApp gear doesn’t suffer from those limitations, customers invariably end up using snapshots a lot, and for various reasons, not just backup. I have customers with over 10,000 snapshots in their arrays – they replicate all those snapshots to another array, can retrieve data that’s several months old, and have stopped relying on legacy backup software, saving money and achieving far faster and easier DR in the process, since with snapshots there’s no restore needed.

What’s your effective space with NetApp gear?

If you consider that each snapshot looks like a complete copy of your data, without factoring in any deduplication at all, the effective logical space can be many, many times the physical space. A large law firm I deal with manages to fit about 2.5PB of data into 8TB of snapshot delta space – a logical-to-physical ratio north of 300:1, which is pretty efficient by anyone's standards. We're not talking about backups done to deduplicated disk that need to be restored to become useful – we're talking about many thousands of straight-up, application-consistent, "full" copies of LUNs, CIFS and NFS shares that you can mount at full speed instantly, without needing to restore from another medium or backup application.

Once you add deduplication and thin cloning, the storage efficiency goes even higher.

It’s not the size of your disk that matters, it’s how you use it

If you use a NetApp system like a legacy disk array, without taking advantage of any of the advanced features (maybe you just care about the multi-protocol functionality, with great performance and reliability), then your usable space falls right within norms. Once you start using the advanced snapshot features, they start eating space of course – but they give you something in return. What you need to figure out is whether the tradeoffs are worth it: for instance, if I can keep a month's worth of Exchange backups with a nominal capacity increase, what is that worth to me? Maybe:

  • I can eliminate backup software licenses
  • I can shrink my storage footprint
  • I can avoid purchasing external disk for backups
  • I don’t need to buy external CDP hardware/software and a bunch of extra disk
  • My restores take seconds
  • DR becomes trivial

Or, if I can create 150 clones of my SQL database that my developers can simultaneously use and only chew up a small fraction of the space I’d otherwise need, what is that worth? With other systems, I’d need 150x the space…

Or, create thousands of VM clones for VDI…
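
For a sense of scale on the clone example, here's the arithmetic with made-up but plausible numbers – the database size and per-clone change rate are both assumptions:

```python
# Hypothetical clone math: 150 writable clones of a 500 GiB database,
# where each clone only stores its own changed blocks (assumed 2% each).
db_gib = 500
clones = 150
change_rate = 0.02

full_copies_gib = clones * db_gib                     # 150 physical copies
clone_gib = db_gib + clones * db_gib * change_rate    # base + per-clone deltas
print(f"full copies: {full_copies_gib:,} GiB vs clones: {clone_gib:,.0f} GiB")
# -> full copies: 75,000 GiB vs clones: 2,000 GiB
```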

How much money are you saving?

What do simplicity and speed mean to your business from an OpEx savings standpoint?

Another way to look at it:

How much more efficient would your business be if you weren’t hampered by the limitations of legacy technology? It’s all about becoming aware of the expanded possibilities.

What you buy

FYI, and to clear any misconceptions in case you can’t be bothered to read the rest: if you ask me for a 10TB usable system, you’ll get a system that will truly provide 10TB usable, honest-to-goodness Base2 space protected against dual-drive failure (no RAID5 silliness), and after all overheads, spares etc. have been taken out. If you want snapshot space we’ll have to add some (like you’d need to with any other vendor). It’s as simple as that.

Right-sized, real space vs raw capacity

Others have explained some of this before but, for completeness, I’ll take a stab:

  • The real usable size of, say, a 450GB drive is not really 450GB regardless of the manufacturer.
  • The real usable capacity quoted depends on whether it’s Base2 or Base10 math and a bunch of other factors
  • All vendors that source drives from multiple manufacturers and use RAID groups need to right-size their drives – meaning that, if manufacturer A offers a tad more space in the drive than manufacturer B, then in order to use both kinds of drives in the same RAID group you kinda need to make them seem like the exact same size, so you go with the lowest common denominator between drive manufacturers.
  • Using our 450GB example above, the real addressable right-sized Base10 space on that drive is 438.3GB, and even less in Base2 (408.2GB – see the quick conversion sketch after this list). Base2 math simply means 1024 bytes in 1K, not 1000, and the rest follows.
  • Beware of analyses, comparisons or quotes showing Base10 from one vendor and Base2 from another, or raw disk space from one vendor vs right-sized from another! Always ask which base you’re looking at and whether the numbers reflect right-sized drives! If you look at the right-sized Base2 drive space from various vendors, it’s usually pretty close. Base your % usable calculations on that number, not the marketing 450GB number that isn’t real for any vendor anyway.
  • Everyone pretty much buys the same drives from the same drive manufacturers
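
Here's the base conversion spelled out. The 418,000 MiB right-sized figure for a "450GB" drive is an assumption based on commonly quoted ONTAP numbers – verify against your own system (e.g. the output of sysconfig -r) rather than taking it as gospel:

```python
# Base10 vs Base2 math for the "450GB" drive example.
right_sized_mib = 418_000                   # assumed right-sized capacity

base10_gb = right_sized_mib * 2**20 / 1e9   # bytes / 10^9  -> ~438.3 GB
base2_gib = right_sized_mib / 1024          # MiB / 1024    -> ~408.2 GiB

print(f"right-sized: {base10_gb:.1f} GB (Base10) = {base2_gib:.1f} GiB (Base2)")
# -> right-sized: 438.3 GB (Base10) = 408.2 GiB (Base2)
```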

Some space reservation axioms

Any system that allows snapshots, clones etc. typically needs some space for those advanced operations. For instance, if you completely fill up a system and then take a snapshot, it may let you – but as soon as you modify any data there's no space left to store the writes, and the snapshot gets invalidated and deleted. Kinda pointless.

As usual, there is no magic. If you expect to be able to store multiple snapshots, the system needs space to store the data changed between snapshots, regardless of array vendor!
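
If you want a rough planning number, a simple linear worst-case model does the job: delta space ≈ volume size × daily change rate × days of retention. The inputs below are illustrative assumptions, not recommendations:

```python
# Worst-case snapshot delta sizing (all inputs are illustrative).
volume_tib = 10
daily_change = 0.02        # assume 2% of the volume changes per day
retention_days = 30

delta_tib = volume_tib * daily_change * retention_days
print(f"~{delta_tib:.1f} TiB of delta space for {retention_days} daily snapshots")
# -> ~6.0 TiB of delta space for 30 daily snapshots
```

In practice the delta is often lower, since blocks rewritten several times between two snapshots only consume delta space once.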

And, out of curiosity – how many man-made devices do you own that you max out all the time? Not leaving breathing room is a recipe for trouble for any piece of equipment.

Explanation of the NetApp data organization

For the uninitiated, here’s a hierarchical list of NetApp structures (a toy code model follows the list):

  1. Disks
  2. RAID groups – made of multiple disks. The default RAID type is RAID-DP. The system creates them automatically – you don’t need to define them or worry about back-end balancing etc. NetApp RAID groups are typically large, 16 disks or so. RAID-DP provides better protection than RAID10 (the math shows roughly 163x better than RAID10 and 4,000x better than RAID5).
  3. Parity drives – drives containing extra information that can be used to rebuild data. RAID-DP uses 2 parity drives per RAID group.
  4. Spares – drives that can replace failed or failing drives (no need to wait until the drive is truly dead)
  5. Aggregates – a collection of RAID groups and the basic unit from which space is allocated. That’s really what you define; the system then figures out automatically how to allocate disks and create RAID groups for you (it can even expand RAID groups on the fly as you add more disks to the aggregate, even 1 disk at a time).
  6. Volumes – a container that takes space from an Aggregate. A volume can be NAS or SAN. A volume can only belong to one Aggregate, and there will typically be many volumes within an Aggregate. Most people will enable the automatic growing of Volumes.
  7. LUNs – they are placed inside the Volumes. One or more per volume, depending on what you’re trying to do. Usually one.
  8. Snapshots – logical, space-efficient copies of either entire Volumes or structures within volumes. There are 3 kinds depending on what you’re trying to do (Snapshot, SnapVault and FlexClone) but they all use similar underlying technology. I might get into the differences in a future post. Briefly: Snapshot – shorter term, SnapVault – longer term, FlexClone – writeable Snapshot.
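
And here's that hierarchy as a toy Python model – purely illustrative containment relationships, not any real NetApp API (spares and snapshots left out for brevity):

```python
# Toy model of the containment hierarchy above (illustrative only).
from dataclasses import dataclass, field

@dataclass
class RaidGroup:
    disks: int = 16           # total drives in the group
    parity: int = 2           # RAID-DP: two parity drives per group

    @property
    def data_disks(self) -> int:
        return self.disks - self.parity

@dataclass
class Volume:
    name: str
    luns: list = field(default_factory=list)  # SAN volumes hold LUNs

@dataclass
class Aggregate:
    raid_groups: list = field(default_factory=list)
    volumes: list = field(default_factory=list)

aggr = Aggregate(raid_groups=[RaidGroup(), RaidGroup(disks=14)])
aggr.volumes.append(Volume("db_vol", luns=["/vol/db_vol/lun0"]))
print(sum(rg.data_disks for rg in aggr.raid_groups))  # -> 26 data drives
```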

Explanation of the NetApp space allocations

  1. Snapshot Reserve – an accounting feature that sets aside a logical percentage of space in a Volume. For instance, if you create a 10TB volume and set a 10% Snap Reserve, the client system will see 9TB usable (see the sketch after this list). Most people will enable automatic deletion of Snapshots. The percentage to set aside is at your discretion and can be changed on the fly. The actual amount of space consumed depends on your rate of change between snapshots. See here for some real averages across thousands of systems.
  2. Aggregate Snap Reserve – this one is pretty unique. You can actually roll back an entire Aggregate on a NetApp system – handy if you accidentally deleted whole Volumes or in general made some gigantic boo-boo. Rolling back the entire Aggregate will undo whatever was done to that aggregate to break it! This feature is enabled by default with a 5% reservation. It is not mandatory unless you are running SyncMirror (mostly in MetroCluster setups). Depending on what you want to do, you could disable it altogether or set it to a small number like 1% (my recommendation).
  3. Fractional Reserve – the one that confuses everyone. In a nutshell: it’s a legacy safety net in case you want to modify all the data within a LUN yet still keep the snapshots. Think about it: let’s say you took a snapshot and then went ahead and modified every single block of your data. Your snap delta would balloon to the total size of the LUN – regardless of whether you use NetApp, EMC, XIV, Compellent, 3Par, HDS, HP etc. The data has to go someplace! There’s a great explanation in this document and I suggest you read it since it covers quite a bit more, too. This one is great, too. Long story short: with snapshot autodelete and/or volume autogrow, you can set it to zero. If you use the SnapManager products, they take care of snapshot deletion themselves.
  4. System reserve – the only one that’s not optional. It’s set to 10% by default. You can actually change it, but I’m not telling you how. That space is there for a reason, and changing it can cause problems in high-write-rate environments. The 10% is used for various operations and has been found to be a good percentage for maintaining performance. All NetApp sizing takes it into account. BTW – ask other vendors whether it’s perfectly safe to fill their systems to 100% all the time, and whether that impacts performance or prevents them from doing certain things. And finally, that 10% is gained back in spades through the other NetApp efficiency methodologies (starting at the low level with RAID-DP – do some simple math on our 16+ drive RAID groups vs typical RAID group sizes), so it doesn’t even matter.

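Here are the reserve mechanics from items 1 and 3 above in throwaway Python (all figures illustrative):

```python
# Item 1: snap reserve is just accounting on top of the volume.
vol_tib = 10
snap_reserve = 0.10
print(f"client sees {vol_tib * (1 - snap_reserve):.1f} TiB of a {vol_tib} TiB volume")
# -> client sees 9.0 TiB of a 10 TiB volume

# Item 3: fractional reserve worst case. Snapshot a LUN, then overwrite
# every block: the old blocks all have to live somewhere, so the delta
# equals the full LUN size -- on any vendor's array.
lun_tib = 4
print(f"worst-case snap delta: {lun_tib} TiB")
```
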
Bottom line: Aside from the 10% system reserve, the rest is all usable space.

The NetApp defaults and some advice

So, here’s where it can get interesting (and confusing) and where the competition gets all their ammunition. Depending on the age of the documentation and firmware, different best practices and defaults apply.

So, if you look at competitive docs from other vendors, they claim that if you use NetApp for LUNs you waste double the space due to fractional reserve. That recommendation was true many years ago, as a safety precaution around fractional reserve. The documentation was updated years ago with zero fractional reserve as the recommendation, but of course that doesn’t help competitors, so they kept the old messaging. So here’s a basic list of quick recommendations for LUNs (a quick sanity-check sketch follows the list):

  1. Snap reserve – 0
  2. Fractional reserve – 0
  3. Snap autodelete on (unless you have SnapManager products managing the snap deletion)
  4. Volume autogrow on
  5. Leave at least a little space available in your volumes – don’t let a LUN fill a volume 100% (the LUN space can be thick but the volume space can be thin-provisioned). Deduplication and other processes need that space temporarily
  6. Do consider embracing thin provisioning, even if you don’t want to oversubscribe your disk. It’s much more flexible long-term, and allows for storage elasticity.
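
As a sanity check, here's a tiny sketch that flags volumes deviating from those settings. The dict keys are made up for illustration, and the 7-mode-style commands in the comments are from memory – verify them against your ONTAP version's documentation:

```python
# Flag LUN-volume settings that deviate from the recommendations above.
# Keys are hypothetical; command hints in comments should be verified.
RECOMMENDED = {
    "snap_reserve_pct": 0,        # e.g. 'snap reserve <vol> 0'
    "fractional_reserve_pct": 0,  # e.g. 'vol options <vol> fractional_reserve 0'
    "snap_autodelete": "on",      # e.g. 'snap autodelete <vol> on'
    "vol_autosize": "on",         # e.g. 'vol autosize <vol> on'
}

def check(volume: dict) -> list:
    """Return the settings that don't match the recommendations."""
    return [f"{key}: {volume.get(key)!r} (recommended {value!r})"
            for key, value in RECOMMENDED.items() if volume.get(key) != value]

print(check({"snap_reserve_pct": 20, "fractional_reserve_pct": 100,
             "snap_autodelete": "on", "vol_autosize": "off"}))
```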

So, look at the defaults and ask your engineer whether it’s OK to change them if they don’t match the settings above. Especially on older systems, I notice that the fractional reserve is often still 100%, even after updating to the latest software (the update doesn’t change your config). Nothing like giving someone a bunch of disk space back with a few clicks…

If you want to do thin provisioning, depending on the firmware, you may see that thin provisioning a volume forces the fractional reserve to 100% – but, ultimately, no real space is being consumed. This was fine in 7.2x, changed to the 100% behavior in 7.3.1, and was fixed in 7.3.3 since it was confusing everyone.

The bottom line

Ultimately, I want you thinking about how you can use your storage as a resource that enables you to do more than just store your LUNs. And, finally, I wanted to dispel the notion that NetApp storage is less space-efficient than legacy systems. Comments are always appreciated!

D

7 Replies to “NetApp usable space – beyond the FUD”

  1. Good info, thanks for the clarification. For the record, I don’t talk about the competition in opportunities unless asked directly, and even then I really try to stay away from it by stating that I’m here to sell my product, not educate the prospect on the competitors.

    So that being said, I do have some questions! 🙂

    – Snap autodelete serves what function exactly?
    – If I can (and should per recommendation) set snap and fractional reserves to 0, why isn’t that the default? Better yet, why doesn’t NetApp just automate the whole process and make it easier on the administrator?

    I’m with you on thin provisioning – even if you don’t oversubscribe I think it’s a good practice to use it if available in your system. In fact, I’d recommend that you DON’T oversubscribe your storage unless you are VERY certain of your data growth and have implemented some controls to make sure you don’t run into a wall.

    When I used to manage NetApp gear, I really liked it except for the confusion around reserves and dealing with that as an impact to my usable storage. If you guys have fixed that problem that’s great.

    My other beef with NetApp was the nickel-and-dime feature licensing. I was “sold” a lot of great features but it turns out I didn’t actually buy them, and ended up spending a bit more than I originally thought I would. You could say I’m partially to blame, I guess, but then again the customer is never wrong – are they?

  2. Hey John, thanks for the comment.

    Snap autodelete automatically deletes snapshots as a volume fills up (there’s also a time-based snapshot scheduler with retentions, just like a backup tool, but this is different). I like to use it in conjunction with volume autogrow.

    On new systems, the snap and fractional reserves for LUN volumes are, indeed, zero, but I had to mention this for completeness since with older firmware that may not be the case. So yes, it is automated and ridiculously easy with current code.

    Regarding licensing:

    You can get a la carte licensing like in the past, or get the bundle licensing – much simpler and much more cost-effective. That’s what I sell 99.9% of the time.

    Also, NetApp licensing is NOT per client system and it’s NOT based on capacity or number of disks.

    So, regardless of how many SQL, VMware, Exchange, Oracle etc. servers you have, the deep application integration is a flat fee per controller.

    Regardless of how many drives or TB you have, licensing stays the same – regardless of how many TB you replicate, snap etc.

    Many arrays out there charge by capacity and/or number of drives – how does Compellent do it?

    Thx

    D

  3. Compellent offers enterprise licensing for application integration features (or you can buy per server). It’s not licensed per TB or drive. Some features, like Windows space reclamation, do not require a license.

  4. Thanks. What about the other features like Data Progression – how are those licensed?

    What about replication?

    What about zNAS? That’s by number of drives, right?

  5. Are you suggesting that snapshots eliminate the need for backup?

    1) What do you do when an auditor asks for proof that the system’s been protected? What do you show them and how do you report on that?
    2) What happens when an array fails? Snaps keep the data on the same box. Arrays fail – isn’t the purpose of backup getting the data off the array? Is replication really the answer?
    3) Replication means corruption at site A will yield corruption at site B, doesn’t it?
    4) If snapshots were really the answer, why was NetApp so big on buying Data Domain? Seemed like at some point they were willing to spend 2 billion to back up data off of the production storage array?

    Backup seems to be a fundamental part of IT that can’t be eliminated with just snaps and replication.

  6. Hi Alex,

    The “snapshot as backup” conversation has been covered much better here. It’s Curtis Preston’s article on the subject.

    In a nutshell, snapshots that have been replicated to some other disk than the primary do count as backups.

    To your points:

    1. You can report on the status of local and remote protection
    2. Snapshot replication and vaulting takes care of that
    3. Traditional replication does suffer from the issue you describe. Replication of app-consistent and verified snaps ensures that you only replicate consistent data at given points in time, THAT YOU CAN ROLL BACK TO.
    4. Oh, that’s way easy – because not everyone wants to totally convert to NetApp storage, and Data Domain is a great container for deduplicated target backups (plus they had a great customer list, existing or prospect). Even then, you need 2x dedupe boxes, replicating.

    Watching some NetApp demos would help you understand, I’d be more than happy to facilitate.

    Thx

    D

  7. Dimitris, I know you asked John the question almost a month ago, but since he didn’t respond, here is your answer.

    Data Progression (Automated Tiered Storage), Data Instant Replay (Snapshots), Remote Data Instant Replay (Replication), and FastTrack (short-stroking drives) are all licensed features on the array itself. Like you described for NetApp, some bundles include them or you can go à la carte. Regardless, the array feature licenses are all capped at 96 drives, which is far fewer than can be attached, and the licensing is perpetual, so any hardware component can be replaced/upgraded without purchasing additional licensing. In the event of growing beyond what a single array can manage (it’s variable and would take much longer to explain than I care to get into right now), additional controllers are purchased with more enclosures and drives, but all under the existing system’s license. Essentially, once you hit 96 licensed drives you can grow that system without limits and only pay for additional hardware.

    Again, sorry for the really late response, but hope this answers your question.

Leave a comment for posterity...