Category Archives: FUD

So now it is OK to sell systems using “Raw IOPS”???

As the self-proclaimed storage vigilante, I will keep bringing these idiocies up as I come across them.

So, the latest “thing” now is selling systems using “Raw IOPS” numbers.

Simply put, some vendors are selling based on the aggregate IOPS the system will do based on per-disk statistics and nothing else.

They are not providing realistic performance estimates for the proposed workload that account for the appropriate RAID type, the I/O sizes, hot vs cold data and the storage controller overhead of doing it all. That’s probably too much work.

For example, if one assumes 200 IOPS per disk, and 200 such disks are in the system, this vendor is showing 40,000 “Raw IOPS”.

This is about as useful as shoes on a snake. Probably less.

The reality is that this is the ultimate “it depends” scenario, since the achievable IOPS depend on far more than how many random 4K IOPS a single disk can sustain (just doing RAID6 could result in having to divide the raw IOPS by 6 where random writes are concerned – and that’s just one thing that affects performance, there are tons more!)
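To put rough numbers on this, here is a minimal Python sketch of how quickly a “Raw IOPS” figure shrinks once a RAID write penalty is applied. The function name and the 30% write mix are made up for illustration, and the model only counts back-end disk I/Os: it ignores cache, controller limits, I/O size and everything else that affects real performance.

```python
def effective_iops(disks: int, iops_per_disk: float,
                   write_fraction: float, write_penalty: int) -> float:
    """Front-end random IOPS the disks could sustain under a simple model.

    Each front-end read costs 1 back-end I/O; each front-end write costs
    `write_penalty` back-end I/Os (6 for RAID6 with no write optimization).
    Hypothetical helper: cache and controller overhead are NOT modeled.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    raw = disks * iops_per_disk
    backend_cost_per_io = (1.0 - write_fraction) + write_fraction * write_penalty
    return raw / backend_cost_per_io


print(200 * 200)                                # the "Raw IOPS" number: 40000
print(round(effective_iops(200, 200, 1.0, 6)))  # pure random writes on RAID6
print(round(effective_iops(200, 200, 0.3, 6)))  # assumed 30% write mix on RAID6
```

Even this crude model lands far below 40,000 for any workload that writes, which is the whole point: the raw number says nothing about the workload.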

Please refer to prior articles on the subject such as the IOPS/latency primer here and undersizing here. And some RAID goodness here.

If you’re a customer reading this, you have the ultimate power to keep vendors honest. Use it!



An explanation of IOPS and latency

<I understand this extremely long post is redundant for seasoned storage performance pros – however, these subjects come up so frequently, that I felt compelled to write something. Plus, even the seasoned pros don’t seem to get it sometimes… :) >

IOPS: Possibly the most common measure of storage system performance.

IOPS means Input/Output (operations) Per Second. Seems straightforward. A measure of work vs time (not the same as MB/s, which is actually easier to understand – simply, MegaBytes per Second).

How many of you have seen storage vendors extolling the virtues of their storage by using large IOPS numbers to illustrate a performance advantage?

How many of you decide on storage purchases and base your decisions on those numbers?

However: how many times has a vendor actually specified what they mean when they utter “IOPS”? :)

For the impatient, I’ll say this: IOPS numbers by themselves are meaningless and should be treated as such. Without additional metrics such as latency, read vs write % and I/O size (to name a few), an IOPS number is useless.

And now, let’s elaborate… (and, as a refresher regarding the perils of ignoring such things when it comes to sizing, you can always go back here).


One hundred billion IOPS…


I’ve competed with various vendors that promise customers high IOPS numbers. On a small system with under 100 standard 15K RPM spinning disks, a certain three-letter vendor was claiming half a million IOPS. Another, a million. Of course, my customer was impressed, since that was far, far higher than the number I was providing. But what’s reality?

Here, I’ll do one right now: The old NetApp FAS2020 (the smallest box NetApp had to offer at the time) can do a million IOPS. Maybe even two million.

Go ahead, prove otherwise.

It’s impossible, since there is no standard way to measure IOPS, and the official definition of IOPS (operations per second) does not specify certain extremely important parameters. By doing any sort of I/O test on the box, you are automatically imposing your benchmark’s definition of IOPS for that specific test.


What’s an operation? What kind of operations are there?

It can get complicated.

An I/O operation is simply some kind of work the disk subsystem has to do at the request of a host and/or some internal process. Typically a read or a write, with sub-categories (for instance read, re-read, write, re-write, random, sequential) and a size.

Depending on the operation, its size could range anywhere from bytes to kilobytes to several megabytes.

Now consider the following most assuredly non-comprehensive list of operation types:

  1. A random 4KB read
  2. A random 4KB read followed by more 4KB reads of blocks in logical adjacency to the first
  3. A 512-byte metadata lookup and subsequent update
  4. A 256KB read followed by more 256KB reads of blocks in logical sequence to the first
  5. A 64MB read
  6. A series of random 8KB writes followed by 256KB sequential reads of the same data that was just written
  7. Random 8KB overwrites
  8. Random 32KB reads and writes
  9. Combinations of the above in a single thread
  10. Combinations of the above in multiple threads
…this could go on.

As you can see, there’s a large variety of I/O types, and true multi-host I/O is almost never of a single type. Virtualization further mixes up the I/O patterns, too.

Now here comes the biggest point (if you can remember one thing from this post, this should be it):

No storage system can do the same maximum number of IOPS irrespective of I/O type, latency and size.

Let’s re-iterate:

It is impossible for a storage system to sustain the same peak IOPS number when presented with different I/O types and latency requirements.


Another way to see the limitation…

Here’s a gross oversimplification that might help prove the point that the type and size of the operation matter when it comes to IOPS, meaning that a system that can do a million 512-byte IOPS can’t necessarily do a million 256K IOPS.
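A quick way to see why: multiply the IOPS figure by the I/O size to get the bandwidth it implies. The sketch below is only that arithmetic; the function name is mine, and 1 MB is taken as 1,000,000 bytes.

```python
def implied_bandwidth_mb_s(iops: int, io_size_bytes: int) -> float:
    """Throughput in MB/s (1 MB = 1,000,000 bytes) implied by an IOPS figure."""
    return iops * io_size_bytes / 1_000_000


small = implied_bandwidth_mb_s(1_000_000, 512)          # a million 512-byte I/Os
large = implied_bandwidth_mb_s(1_000_000, 256 * 1024)   # a million 256K I/Os

print(small)          # 512.0 MB/s
print(large)          # 262144.0 MB/s
print(large / small)  # 512.0 times more data moved per second
```

Same “million IOPS”, but the second case has to move 512 times as much data every second.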

Imagine a bucket, or a shotshell, or whatever container you wish.

Imagine in this container you have either:

  1. A few large balls or…
  2. Many tiny balls
The bucket ultimately contains about the same volume of stuff either way, and it is the major limiting factor. Clearly, you can’t completely fill that same container with the same number of large balls as you can with small balls.

[Image: IOPS containers]

They kinda look like shotshells, don’t they?

Now imagine the little spheres being forcibly evacuated rapidly out of one end… which takes us to…


Latency matters

So, we’ve established that not all IOPS are the same – but what is of far more significance is latency as it relates to the IOPS.

If you want to read no further – never accept an IOPS number that doesn’t come with latency figures, in addition to the I/O sizes and read/write percentages.

Simply speaking, latency is a measure of how long it takes for a single I/O request to happen from the application’s viewpoint.

In general, when it comes to data storage, high latency is just about the least desirable trait, right up there with poor reliability.

Databases especially are very sensitive with respect to latency – DBs make several kinds of requests that need to be acknowledged quickly (ideally in under 10ms, and writes especially in well under 5ms). In particular, the redo log writes need to be acknowledged almost instantaneously for a heavy-write DB – under 1ms is preferable.

High sustained latency in a mission-critical app can have a nasty compounding effect – if a DB can’t write to its redo log fast enough for a single write, everything stalls until that write can complete, then moves on. However, if it constantly can’t write to its redo log fast enough, the user experience will be unacceptable as requests get piled up – the DB may be a back-end to a very busy web front-end for doing Internet sales, for example. A delay in the DB will make the web front-end also delay, and the company could well lose thousands of customers and millions of dollars while the delay is happening. Some companies could also face penalties if they cannot meet certain SLAs.

On the other hand, applications doing sequential, throughput-driven I/O (like backup or archival) are nowhere near as sensitive to latency (and typically don’t need high IOPS anyway, but rather need high MB/s).

Here’s an example from an Oracle DB – a system doing about 15,000 IOPS at 25ms latency. Doing more IOPS would be nice but the DB needs the latency to go a lot lower in order to see significantly improved performance – notice the increased IO waits and latency, and that the top event causing the system to wait is I/O:

[Image: AWR example]

Now compare to this system (the data is in a different format, but you’ll get the point):

Notice that, in this case, the system is waiting primarily for CPU, not storage.

A significant amount of I/O wait is a good way to determine if storage is an issue (there can be other latencies outside the storage of course – CPU and network are a couple of usual suspects). Even with good latencies, if you see a lot of I/O waits it means that the application would like faster speeds from the storage system.

But this post is not meant to be a DB sizing class. Here’s the important bit that I think is confusing a lot of people and is allowing vendors to get away with unrealistic performance numbers:

It is possible (but not desirable) to have high IOPS and high latency simultaneously.

How? Here’s a, once again, oversimplified example:

Imagine 2 different cars, both with a top speed of 150mph.

  • Car #1 takes 50 seconds to reach 150mph
  • Car #2 takes 200 seconds to reach 150mph

The maximum speed of the two cars is identical.

Does anyone have any doubt as to which car is actually faster? Car #1 indeed feels about 4 times faster than Car #2, even though they both hit the exact same top speed in the end.

Let’s take it an important step further, keeping the car analogy since it’s very relatable to most people (but mostly because I like cars):

  • Car #1 has a maximum speed of 120mph and takes 30 seconds to hit 120mph
  • Car #2 has a maximum speed of 180mph, takes 50 seconds to hit 120mph, and takes 200 seconds to hit 180mph

In this example, Car #2 actually has a much higher top speed than Car #1. Many people, looking at just the top speed, might conclude it’s the faster car.

However, Car #1 reaches its top speed (120mph) far faster than Car #2 reaches that same 120mph.

Car #2 continues to accelerate (and, eventually, overtakes Car #1), but takes an inordinately long amount of time to hit its top speed of 180mph.

Again – which car do you think would feel faster to its driver?

You know – the feeling of pushing the gas pedal and the car immediately responding with extra speed that can be felt? Without a large delay in that happening?

Which car would get more real-world chances of reaching high speeds in a timely fashion? For instance, overtaking someone quickly and safely?

Which is why car-specific workload benchmarks like the quarter mile were devised: How many seconds does it take to traverse a quarter mile (the workload), and what is the speed once the quarter mile has been reached?

(I fully expect fellow geeks to break out the slide rules and try to prove the numbers wrong, probably factoring in gearing, wind and rolling resistance – it’s just an example to illustrate the difference between throughput and latency, I had no specific cars in mind… really).


And, finally, some more storage-related examples…

Some vendor claims… and the fine print explaining the more plausible scenario beneath each claim:

“Mr. Customer, our box can do a million IOPS!”

512-byte ones, sequentially out of cache.

“Mr. Customer, our box can do a quarter million random 4K IOPS – and not from cache!”

at 50ms latency.

“Mr. Customer, our box can do a quarter million 8K IOPS, not from cache, at 20ms latency!”

but only if you have 1000 threads going in parallel.

“Mr. Customer, our box can do a hundred thousand 4K IOPS, at under 20ms latency!”

but only if you have a single host hitting the storage so the array doesn’t get confused by different I/O from other hosts.

Notice how none of these claims are talking about writes or working set sizes… or the configuration required to support the claim.
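One sanity check for claims like these comes from Little’s Law, which is standard queueing theory rather than anything the vendors publish: the average number of I/Os in flight equals IOPS multiplied by latency. The Python sketch below applies it to the numbers above; the function names are mine, and it assumes steady-state averages.

```python
def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Average I/Os that must be in flight to sustain `iops` at `latency_ms`."""
    return iops * latency_ms / 1000.0


def max_iops(outstanding: float, latency_ms: float) -> float:
    """IOPS ceiling for a given number of in-flight I/Os at `latency_ms`."""
    return outstanding * 1000.0 / latency_ms


# The "quarter million random 4K IOPS at 50ms" claim:
print(outstanding_ios(250_000, 50))   # 12500.0 I/Os in flight, all the time

# The Oracle example earlier: 15,000 IOPS at 25ms
print(outstanding_ios(15_000, 25))    # 375.0

# A strictly serialized writer (one I/O at a time), e.g. a log writer:
print(max_iops(1, 25))                # 40.0 writes per second at 25ms
print(max_iops(1, 1))                 # 1000.0 writes per second at 1ms
```

A big IOPS number at high latency simply means a very deep queue. An application that can’t keep thousands of I/Os in flight will never see that number, which is why latency has to come with the claim.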


What to look for when someone is making a grandiose IOPS claim

Audited validation and a specific workload to be measured against (that includes latency as a metric) both help. I’ll pick on HDS since they habitually show crazy numbers in marketing literature.

For example, from their website:



It’s pretty much the textbook case of unqualified IOPS claims. No information as to the I/O size, reads vs writes, sequential or random, what type of medium the IOPS are coming from, or, of course, the latency…

However, that very same box barely breaks 200,000 SPC-1 IOPS with good latency in the audited SPC-1 benchmark:



Last I checked, 200,000 was 20 times less than 4,000,000. Don’t get me wrong, 200,000 low-latency IOPS is a great SPC-1 result, but it’s not 4 million SPC-1 IOPS.

Check my previous article on SPC-1 and how to read the results here. And if a vendor is not posting results for a platform – ask why.


Where are the IOPS coming from?

So, when you hear those big numbers, where are they really coming from? Are they just fictitious? Not necessarily. So far, here are just a few of the ways I’ve seen vendors claim IOPS prowess:

  1. What the controller will theoretically do given unlimited back-end resources.
  2. What the controller will do purely from cache.
  3. What a controller that can compress data will do with all zero data.
  4. What the controller will do assuming the data is at the FC port buffers (“huh?” is the right reaction, only one three-letter vendor ever did this so at least it’s not a widespread practice).
  5. What the controller will do given the configuration actually being proposed driving a very specific application workload with a specified latency threshold and real data.
The figures provided by the approaches above are all real, in the context of how the test was done by each vendor and how they define “IOPS”. However, of the (non-exhaustive) options above, which one do you think is the most realistic when it comes to dealing with real application data?


What if someone proves to you a big IOPS number at a PoC or demo?

Proof-of-Concept engagements or demos are great ways to prove performance claims.

But, as with everything, garbage in – garbage out.

If someone shows you IOmeter doing crazy IOPS, use the information in this post to help you at least find out what the exact configuration of the benchmark is. What’s the block size, is it random, sequential, a mix, how many hosts are doing I/O, etc. Is the config being short-stroked? Is it coming all out of cache?

Typically, things like IOmeter can be a good demo but that doesn’t mean the combined I/O of all your applications’ performance follows the same parameters, nor does it mean the few servers hitting the storage at the demo are representative of your server farm with 100x the number of servers. Testing with as close to your application workload as possible is preferred. Don’t assume you can extrapolate – systems don’t always scale linearly.
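One way to keep track of what a demo or a datasheet has actually told you is to write the claim down as a record and list what is still missing. The Python sketch below does that; the class and field names are hypothetical, and the list of qualifiers is just the ones this post keeps asking about, not an exhaustive or standard set.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IopsClaim:
    """An IOPS claim plus the qualifiers needed to make sense of it.

    Hypothetical structure for illustration; None means "not stated".
    """
    iops: float
    io_size_bytes: Optional[int] = None
    read_fraction: Optional[float] = None      # 0.0 = all writes, 1.0 = all reads
    random_fraction: Optional[float] = None    # 0.0 = sequential, 1.0 = random
    latency_ms: Optional[float] = None
    working_set_gb: Optional[float] = None
    host_count: Optional[int] = None
    served_from_cache: Optional[bool] = None

    def missing_qualifiers(self) -> List[str]:
        """Names of every qualifier the claim leaves unspecified."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


marketing = IopsClaim(iops=4_000_000)
print(marketing.missing_qualifiers())
# ['io_size_bytes', 'read_fraction', 'random_fraction', 'latency_ms',
#  'working_set_gb', 'host_count', 'served_from_cache']
```

If the list isn’t empty, you have your next set of questions for whoever is running the demo.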


Factors affecting storage system performance

In real life, you typically won’t have a single host pumping I/O into a storage array. More likely, you will have many hosts doing I/O in parallel. Here are just some of the factors that can affect storage system performance in a major way:


  1. Controller, CPU, memory, interlink counts, speeds and types.
  2. A lot of random writes. This is the big one, since, depending on RAID level, the back-end I/O overhead could be anywhere from 2 I/Os (RAID 10) to 6 I/Os (RAID6) per write, unless some advanced form of write management is employed.
  3. Uniform latency requirements – certain systems will exhibit latency spikes from time to time, even if they’re SSD-based (sometimes especially if they’re SSD-based).
  4. A lot of writes to the same logical disk area. This, even with autotiering systems or giant caches, still results in tremendous load on a rather limited set of disks (whether they be spinning or SSD).
  5. The storage type used and the amount – different types of media have very different performance characteristics, even within the same family (the performance between SSDs can vary wildly, for example).
  6. CDP tools for local protection – sometimes this can result in 3x the I/O to the back-end for the writes.
  7. Copy on First Write snapshot algorithms with heavy write workloads.
  8. Misalignment.
  9. Heavy use of space efficiency techniques such as compression and deduplication.
  10. Heavy reliance on autotiering (resulting in the use of too few disks and/or too many slow disks in an attempt to save costs).
  11. Insufficient cache with respect to the working set coupled with inefficient cache algorithms, too-large cache block size and poor utilization.
  12. Shallow port queue depths.
  13. Inability to properly deal with different kinds of I/O from more than a few hosts.
  14. Inability to recognize per-stream patterns (for example, multiple parallel table scans in a Database).
  15. Inability to intelligently prefetch data.


What you can do to get a solution that will work…

You should work with your storage vendor to figure out, at a minimum, the items in the following list, and, after you’ve done so, go through the sizing with them and see the sizing tools being used in front of you. (You can also refer to this guide).

  1. Applications being used and size of each (and, ideally, performance logs from each app)
  2. Number of servers
  3. Desired backup and replication methods
  4. Random read and write I/O size per app
  5. Sequential read and write I/O size per app
  6. The percentages of read vs write for each app and each I/O type
  7. The working set (amount of data “touched”) per app
  8. Whether features such as thin provisioning, pools, CDP, autotiering, compression, dedupe, snapshots and replication will be utilized, and what overhead they add to the performance
  9. The RAID type (R10 has an impact of 2 I/Os per random write, R5 4 I/Os, R6 6 I/Os – is that being factored?)
  10. The impact of all those things to the overall headroom and performance of the array.
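As a rough idea of what factoring in the RAID type looks like, here is a small Python sketch that turns per-app random IOPS and write percentages into back-end disk I/Os and a spindle count. The write penalties come from the list above; the two apps, their numbers and the 200 IOPS per disk are invented for illustration, and cache, sequential I/O and controller limits are deliberately left out. A real sizing tool does far more than this.

```python
import math
from dataclasses import dataclass
from typing import List

# Back-end I/Os per random front-end write, per the list above.
RAID_WRITE_PENALTY = {"R10": 2, "R5": 4, "R6": 6}


@dataclass
class AppWorkload:
    name: str
    random_iops: int     # front-end random IOPS the app needs
    write_percent: int   # 0-100, share of those IOPS that are writes


def backend_iops(app: AppWorkload, raid: str) -> float:
    """Back-end disk I/Os per second: reads cost 1, writes cost the RAID penalty."""
    penalty = RAID_WRITE_PENALTY[raid]
    reads = app.random_iops * (100 - app.write_percent)
    writes = app.random_iops * app.write_percent * penalty
    return (reads + writes) / 100


def disks_needed(apps: List[AppWorkload], raid: str, iops_per_disk: int) -> int:
    """Minimum spindle count to absorb the combined back-end load (no cache modeled)."""
    total = sum(backend_iops(a, raid) for a in apps)
    return math.ceil(total / iops_per_disk)


# Hypothetical workloads, purely illustrative.
apps = [AppWorkload("oltp", 10_000, 30), AppWorkload("vdi", 5_000, 80)]

for raid in ("R10", "R5", "R6"):
    print(raid, disks_needed(apps, raid, iops_per_disk=200))
# R10 110
# R5 180
# R6 250
```

Same 15,000 front-end IOPS, yet the disk count more than doubles between R10 and R6 in this simple model. If a proposal doesn’t show this kind of working, ask for it.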

If your vendor is unwilling or unable to do this type of work, or, especially, if they tell you it doesn’t matter and that their box will deliver umpteen billion IOPS – well, at least now you know better :)




NetApp vs EMC usability report: malice, stupidity or both?

Most are familiar with Hanlon’s Razor:

Never attribute to malice that which is adequately explained by stupidity.

A variation of that is:

Never attribute to malice that which is adequately explained by stupidity, but don’t rule out malice.

You see, EMC sponsored a study comparing their systems to ones from the company they look up to and try to emulate. The report deals with ease-of-use (and I’ll be the first to admit the current iteration of EMC boxes is far easier to use than in the past and the GUI has some cool stuff in it). I was intrigued, but after reading the official-looking report posted by Chuck Hollis, I wondered who in their right mind will lend it credence, and ignored it since I have a real day job solving actual customer problems and can’t possibly respond to every piece of FUD I see (and I see a lot).

Today I’m sitting in a rather boring meeting so I thought I’d spend a few minutes to show how misguided the document is.

In essence, the document tackles the age-old dilemma of which race car to get by comparing how easy it is to change the oil, and completely ignores the “winning the race with said car” part. My question would be: “which car allows you to win the race more easily and with the least headaches, least cost and least effort?”

And if you think winning a “race” is just about performance, think again.

It is also interesting how the important aspects of efficiency, reliability and performance are not tackled, but I guess this is a “usability” report…

Strange that a company named “Strategic Focus” reduces itself to comparing arrays by measuring the number of mouse clicks. Not sure how this is strategic for customers. They were commissioned by EMC, so maybe EMC considers this strategic.

I’ll show how wrong the document is by pointing at just some of the more glaring issues, but I’ll start by saying a large multinational company has many PB of NetApp boxes around the globe and 3 relaxed guys to manage it all. How’s that for a real example? :)

  1. Page 2, section 4, “Methodology”: EMC’s own engineers set up the VNX properly. No mention of who did the NetApp testing, what their qualifications are, and so on. So, first question: “Do these people even know what they’re doing? Have they really used a NetApp system before?”
  2. Page 10, Table A, showing the configurations. A NetApp FAS3070 was used, running the latest code at this moment (8.01). Thanks EMC for the unintended compliment – you see, that system is 2 generations old (the current one is 3270 and the previous one is 3170) yet it can still run the very latest 64-bit ONTAP code just fine. What about the EMC CX3? Can it run FLARE31? Or is that a forklift upgrade? Something to be said for investment protection :)
  3. Page 3 table 5-1, #1. Storage pools on all modern arrays would typically be created upfront, so the wording is very misleading. In order to create a new LUN one does NOT NEED to create a pool. Same goes for all vendors.
  4. Same table and section (also mentioned in section 7): Figuring out the space available is as simple as going to the aggregate page, where the space is clearly shown for the aggregates. So, unsure what is meant here.
  5. Regarding LUN creation… Let me ask you a question: After you create a LUN on any array, what do you need to do next? You see, the goal is to attach the LUN to a host, do alignment, partition creation, multipathing and create a filesystem and write stuff to it. You know, use it. NetApp largely automates end-to-end creation of host filesystems and, indeed, does not need an administrator to create a LUN on the array at all. Clearly the person doing the testing is either not aware of this or decided to omit this fact.
  6. Page 4, item 4 (thin provisioning). Asinine statement – plus, any NetApp LUN can be made thick or thin with a single click, whereas a VNX needs to do a migration. Indeed, NetApp does not complicate things whether thin or thick is required, does not differentiate between thin and thick when writing, and therefore does not incur a performance penalty, whereas EMC does (according to EMC documentation).
  7. Page 4, item 5 (Creation of virtual CIFS servers). The Multistore feature is free of charge on all new systems and allows one to create fully segregated, secure multitenancy virtual CIFS, NFS and iSCSI NetApp “partitions” – far beyond the capabilities of EMC. Again, misleading.
  8. Page 4, item 6 (growing storage elements). No measurable difference? Kindly show all the steps to grow a LUN until the new space is visible from the host side. End-to-end is important to real users since they want to use the storage. Or maybe not, for the authors of this document.
  9. Page 5, Item 1. We are really talking here about EMC snapshots? Seriously? Versus NetApp? To earn the right to do so assumes your snapshots are a usable and decent feature and that you can take a good number of them without the box crumbling to pieces. Ask any vendor about a production array with the most snaps and ask to talk to the customer using it. Then compare the number of snaps to a typical NetApp customer’s. Don’t be surprised if one number is a few hundred times less than the other.
  10. Page 5, item 3 (storage tiering): part of a longer conversation but this assumes all arrays need to do tiering. If my solution is optimized to the level that it doesn’t need to do this but yours is not optimized so it needs tiering, why on earth am I being penalized for doing storage more efficiently than you? (AKA the “not invented here” syndrome).
  11. Page 6 item 1 (VMware awareness): NetApp puts all the awareness inside vCenter and, indeed, datastore creation (including volume/LUN/NAS creation and resizing), VM cloning etc. are all done from within vCenter itself. Ask for a demo and prepare to be amazed.
  12. Page 6, item 2 – (dedupe/compress individual VMs): This one had my blood boiling. You see, EMC cannot even dedupe individual VMs, (impossible, given the fact that current DART code only does dedupe at the file and not block level and no two active VMs will ever be exactly the same), can’t dedupe at all for block storage (maybe in the future but not today), and in general doesn’t recommend compression for VMs! Ask to see the best practices guide that states all this is supported and recommended for active production VMs, and to talk to a customer doing it at scale (not 10 VMs). A feature you can theoretically turn on but that will never work is not quite useful, you see…
  13. Page 8, entire table: Too much to comment on, suffice it to say that NetApp systems come with tools not mentioned in this report that go so far beyond what Unisphere does that it’s not even funny (at no additional cost). Used by customers that have thousands of NetApp systems. That’s how much those tools scale. EMC would need vast portions of the Ionix suite to do anything remotely similar (at $$$). Of course, mentioning that would kinda derail this document… and the piece about support and upgrades is utterly wrong, but I like to keep the surprise for when I do the demos and not share cool IP ideas here :)
  14. Page 11, Table B1: In the end, the funniest one of all! If you add up the total number of mouse clicks, NetApp needed 92 vs EMC’s 111. Since the whole point of this usability report is to show overall ease of use by measuring the total number of clicks to do stuff, it’s interesting that they didn’t do a simple total to show who won in the end… :)

I could keep going but I need to pay attention to my meeting now since it suddenly became interesting.

Ultimately, when it comes to ease of use, it’s simple to just do a demo and have the customer decide for themselves which approach they like best. Documents such as this one mean less than nothing for actual end users.

I should have another similar list showing clicks and TIME needed to do certain other things. For instance, using RecoverPoint (or any other method), kindly show the number of clicks and time (and disk space) for creating 30 writable clones of a 10TB SQL DB and mounting them on 30 different DB servers simultaneously. Maintaining unique instance names etc. Kinda goes a bit beyond LUN creation, doesn’t it? :)

All this BTW doesn’t mean any vendor should rest on their laurels and stop working on improving usability. It’s a never-ending quest. Just stop it with the FUD, please…

Finishing with something funny: Check this video for a good demonstration of something needing few clicks yet not being that easy to do.

Comments welcome.



EMC conclusively proves that VNX bottlenecks NAS performance

A bit of a controversial title, no?

Allow me to elaborate.

EMC posted a new SPEC SFS result as part of a marketing stunt (which is working, look at what I’m doing – I’m talking about them, if only to clear the air).

In simple terms, EMC got almost 500,000 SPEC SFS NFS IOPS (not to be confused with, say, block-based SPC-1 IOPS) with the following configuration:

  1. Four (4) totally separate VNX arrays, each loaded with SSD storage, utterly unaware of each other (8 total controllers since each box has 2)
  2. Five (5) Celerra VG8 NAS heads/gateways (1 spare), one on top of each VNX box
  3. 2 Control Stations
  4. 8 exported filesystems (2 per VG8 head/VNX system)
  5. Multiple pools of storage (at least 1 per VG8) – not shared among the various boxes, no data mobility between boxes
  6. Only 60TB NAS space with RAID5 (or 15TB per box)

Now, this post is not about whether this configuration is unrealistic and expensive (almost nobody would pay $6m for merely 60TB of NAS, not today). I get it that EMC is trying to publish the best possible number by loading a bunch of separate arrays with SSD. It’s OK as long as everyone understands the details.

My beef has to do with how it’s marketed.

EMC is very vague about the configuration, unless you look at the actual SPEC website. In the marketing materials they just mention VNX, as in “The EMC VNX performed at 497,623 SPECsfs2008_nfs.v3 operations per second”. Kinda like saying it’s OK to take three 5-year-olds and a 6-year-old to a bar because their ages add up to 21.

No – the far more accurate statement is “four separate VNXs working independently and utterly unaware of each other did 124,405 SPECsfs2008_nfs.v3 operations per second each”.

All EMC did was add up the result of 4 boxes.

Heck, that’s easy to do!

NetApp already has a result for the 6240 (just 2 controllers doing a respectable 190,675 SPEC NFS ops taking care of NAS and RAID all at once since they’re actually unified, no cornucopia of boxes there) without using Solid State Drives (common SAS drives plus a large cache were used instead – a standard, realistic config we sell every day, and not a “lab queen”).

If all we’re doing is adding up the result of different boxes, simply multiply this by 4 (plus we do have Cluster-Mode for NAS so it would count as a single clustered system with failover etc. among the nodes) and end up with the following result:

  1. 762,700 SPEC SFS NFS operations
  2. 8 exported filesystems
  3. 343TB usable with RAID-DP (thousands of times more resilient than RAID5)

So, which one do you think is the better deal? More speed, 343TB and better protection, or less speed, 60TB and far less protection? :)

Customers curious about other systems can do the same multiplication trick for other configs, the sky is the limit!
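The arithmetic is trivial enough to put in a few lines of Python. The sketch only uses the figures quoted in this post; the function names are mine, and multiplying a published result by a box count is, of course, exactly the trick being criticized, not a valid benchmark.

```python
def per_array_ops(total_ops: int, arrays: int) -> float:
    """Split an aggregate result evenly across independent arrays."""
    return total_ops / arrays


def added_up_ops(single_system_ops: int, systems: int) -> int:
    """The 'just add up the boxes' trick: NOT a measured result."""
    return single_system_ops * systems


# EMC: 497,623 ops published for 4 separate VNX arrays
print(per_array_ops(497_623, 4))   # 124405.75 per VNX

# NetApp: 190,675 ops published for one 6240 (2 controllers)
print(added_up_ops(190_675, 4))    # 762700 if four were simply added together
```

Normalizing to a single array is the only way to compare the two published numbers on anything like equal terms.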

The other, more serious part, and what prompted me to title the post the way I did, is that EMC’s benchmarking made it pretty clear that the VNX is the bottleneck, only really able to support a single VG8 head at top speed, which is why 4 separate VNX systems were needed to accomplish the final result. So, the fact that a VNX can have up to 8 Celerra heads on top of it means nothing since the back-end is your limiting factor. You might as well stick to a dual-head VG8 config (1 active, 1 passive) since that’s all it can comfortably drive (otherwise why benchmark it that way?).

But with only 1 active NAS head you’d be limited to just 256TB max NAS capacity, since that’s how much total space a Celerra head can address as of the time of this writing. Which is probably enough for most people.

I wonder if the NAS heads that can be bought as a package with VNX are slower than VG8 heads, and by how much. You see, most people buying the VNX will be getting the NAS heads that can be packaged with it since it’s cheaper that way. How fast does that go? I’m sure customers would like to know, since that’s what they will typically buy.

I also wonder how fast it would be with RAID6.

Here’s a novel idea: benchmark what customers will actually buy!

So apples-to-apples comparisons can become easier instead of something like this:


For the curious: on the left you see an “Autumn Glory” Malus Floribunda (miniature apple). Photo courtesy of John Fullbright.



Questions to ask EMC regarding their new VNX systems…

It’s that time of the year again. The usual websites are busy with news of the upcoming EMC midrange refresh called VNX. And records being broken.

(NEWSFLASH: Watching the webcast now, the record they kept saying they would break ended up being some guy jumping over a bunch of EMC arrays with a motorcycle – and here I was hoping to see some kind of performance record…)

I’m not usually one to rain on anyone’s parade, but I keep seeing the “unified” word a lot, but based on what I’m seeing, it’s all more of the same, albeit with newer CPUs, a different faceplate, and (join the club) SAS. I’m sure the new systems will be faster courtesy of faster CPUs, more RAM and SAS. But are they offering something materially closer to a unified architecture?

Note that I’m not attacking anything in the EMC announcement, merely the continued “unified” claim. I’m sure the new Data Domain, Isilon and Vmax systems are great.

So here are some questions to ask EMC regarding VNX – I’ll keep this as a list instead of a more verbose entry to keep things easy for the ADD-afflicted and allow easier copy-paste into emails :)

  1. Let’s say I have a 100TB VNX system. Let’s say I allocate all 100TB to NAS. Then let’s say that all the 100TB is really chewed up in the beginning but after a year my real data requirements are more like 70TB. Can I take that 30TB I’m not using any more and instantly use it for FC? Since it’s “unified” and all? Without breaking best practices for LUN allocation to Celerra? Or is it forever tied to the NAS part and I have to buy all new storage if I don’t want to destroy what’s there and start from scratch?
  2. Is the VNX (or even the NS before it) 3rd-party verified as an over 5-nines system? (I believe the CX is but is the CX/NS combo?)
  3. How is the architecture of these boxes any different than before? It looks like you still have 2 CX SPs, then some NAS gateways. Seems like very much the same overall architecture and there’s (still) nothing unified about it. I call for some truth in advertising! Only the little VNXe seems materially different (not in the software but in the number of blades it takes to run it all).
  4. Are the new systems licenced by capacity?
  5. Can the new systems use more than the 2TB of FAST Cache?
  6. On the subject of cache, what is the best practice regarding the minimum number of SSDs to use for cache? Is it 8? How many shelves/buses should they be distributed on?
  7. What is the best practice regarding cache oversubscription and how is this sized?
  8. Since the FAST Cache can also cache writes, what are the ramifications if the cache fails? How many customers have had this happen? After all, we are talking about SSDs, and even mirrored SSDs are much less reliable than mirrored RAM.
  9. What’s the granularity for using RecoverPoint to replicate the NAS piece? Seems like it needs to replicate everything NAS as one chunk as a large consistency group, with Celerra Replicator needed for more granular replication.
  10. What’s the granularity for recovering NAS with RecoverPoint? Seems like you can’t do things by file or by volume even. The entire data mover may need to be recovered in one go, regardless of the volumes within.
  11. When using RecoverPoint, does one need to not use storage pools for certain operations? And what does that mean regarding the complexity of implementation?
  12. Speaking of storage pools, when are they recommended, when not, and why? And what does that mean about the complexity of administration?
  13. What functionality does one lose if one does not use pools?
  14. Can one prioritize FAST Cache in pool LUNs or is cache simply on or off for the entire pool?
  15. Can I do a data-in-place upgrade from CX3 or CX4 or is this a forklift upgrade?
  16. Why is FASTv2 not recommended for Exchange 2010 and various other DBs?
  17. If Autotiering is not really applicable to many workloads, what is it really good for?
  18. What is the percentage of flash needed to properly do autotiering on VNX? (it’s only 3% on VMAX since it uses a 7MB page, but VNX uses a 1GB page, which is far more inefficient). Why is FAST still at the grossly inefficient 1GB chunk?
  19. Can FAST on the VNX exclude certain time periods that can confuse the algorithms, like when backups occur?
  20. Is file-level FAST still a separate system?
  21. Why does the low-end VNXe not offer FC?
  22. Can I upgrade from VNXe to VNX?
  23. Does the VNXe offer FAST?
  24. Can a 1GB chunk span RAID groups or is performance limited to 1 RAID group’s worth of drives?
  25. Why are functions like block, NAS and replication still in separate hardware and software?
  26. Why are there still 2 kinds of snapshotting systems?
  27. Are the block snaps finally without a huge write performance impact? How about the NAS snaps?
  28. Are the snaps finally able to be retained for years if needed?
  29. Why are there 4 kinds of replication? (Mirrorview, Celerra Replicator, Recoverpoint, SAN copy)
  30. Why are there still all these OSes to patch? (Win XP in the SPs, Linux on the Control Station and RecoverPoint, DART on the NAS blades, maybe more if they can run Rainfinity and Atmos on the blades as well)
  31. Why still no dedupe for FC and iSCSI?
  32. Why no dedupe for memory and cache?
  33. Why not sub-file dedupe?
  34. Why is Celerra still limited to 256TB per data mover?
  35. Is Celerra still limited to 16TB per volume? Or is yet another, completely separate system (Isilon) needed to do that?
  36. Is Celerra still limited to not being able to share a volume between data movers? Or is, again, Isilon needed to do that?
  37. Can Celerra non-disruptively move CIFS and NFS volumes between data movers?
  38. Why can there not be a single FCoE link to transfer all the protocols if the boxes are “unified”?
  39. Have the thin provisioning performance overheads been fixed?
  40. Have the pool performance bottlenecks been fixed? Or is it still recommended to use normal RAID LUNs for highest performance?
  41. Can one actually stripe/restripe within a FLARE pool now? When adding storage? With thin provisioning?
  42. What is the best practice for expanding, say, a 50 drive pool? How many drives do I have to expand by? Why?
  43. Does one still need to do a migration to use thin provisioning?
  44. Does one need to do yet another migration to “re-thin” a LUN once it gets temporarily chunky?
  45. Have the RAID5 and RAID6 write inefficiencies been fixed? And how?
  46. Will the benchmarks for the new systems use RAID6 or will they, again, show RAID10? After all, most customers don’t deploy RAID10 for everything, and RAID5 is thousands of times less reliable than RAID6. How about some SPC-1 benchmarks?
  47. Why is EMC still not fessing up to using a filesystem for their new pools? Maybe because they keep saying doing so is not a “real” SAN, even in recent communication?
  48. Since EMC is using a filesystem in order to get functionality in the CX SPs like pools, thin provisioning, compression and auto-tiering (and probably dedupe in the future), how are they keeping fragmentation under control? (how the tables have turned!)

What I notice is a lack of thought leadership when it comes to technology innovation – EMC is still playing catch-up with other vendors in many important architectural areas, and keeps buying companies left and right to plug portfolio holes. All vendors play catch-up to some extent; the trick is finding the one playing catch-up in the fewest areas and leading in the most, with the fewest compromises.

Some areas of NetApp leadership to answer a question in the comments:

  • First Unified architecture (since 2002)
  • First with RAID that has the space efficiency of RAID5, the performance of RAID10 and the reliability of RAID6
  • First with block-level deduplication for all protocols
  • First with zero-impact snapshots
  • First with Megacaches (up to 16TB cache per system possible)
  • First with VMware integration including VM clones
  • First with space- and time-efficient, integrated replication for all protocols
  • First with snapshot-based archive storage (being able to store different versions of your data for years on nearline storage)
  • First with Unified Connect and FCoE – single cable capability for all protocols (FC, iSCSI, NFS, CIFS)

However, EMC is strong when it comes to marketing, messaging and – wait for it – the management part. Since it’s amazingly difficult to integrate all the technologies EMC has acquired over the years (heck, it’s taking NetApp forever to properly integrate Spinnaker and that’s just one other architecture), EMC is focusing instead on the management of the various bits (the current approach being Unisphere, tying together a subset of EMC’s acquisitions).

So, Unified Storage in EMC-speak really means unified management. Which would be fine if they were upfront about it. Somehow, “our new arrays with unified management but not unified architecture” doesn’t quite roll off the tongue as easily as “unified storage”.

Mike Riley eloquently explains whether it’s easier to fix an architecture or fix management here. Ultimately, unified management can’t tackle all the underlying problems and limitations, but it does allow for some very nice demos.

A cool GUI with frankenstorage behind it is like putting lipstick on a pig, or putting a nice shell on top of a car cobbled together from disparate bits. The underlying build is masked superficially, until it’s not… usually, at the worst possible time.

Sure, ultimately, management is what the end user interfaces with. Many people won’t really care about what goes on inside, nor have the time or inclination to learn. I merely invite them to start thinking more about the inner bits, because when things get tricky is also when something like a portal GUI meshing 4-5 different products together also stops working as expected, and that’s also when you start bouncing between 3-4 completely different support teams all trying to figure out which of the underlying products is causing the problem.

Always think in terms of what happens if something goes wrong with a certain subsystem and always assume things will break – only then can you have proper procedures and be prepared for the worst.









And always remember that the more complex a machine, the more difficult it can be to troubleshoot and fix when it does break (and it will break – everything does). There’s no substitute for clean and simple engineering.

Of course, Rube Goldberg-esque machines can be entertaining… if entertainment is what you’re after :)





FUD tales from the blogosphere: when vendors attack (and a wee bit on expanding and balancing RAID groups)

Haven’t blogged in a while, way too busy. Against my better judgment, I thought I’d respond to some comments I’ve seen on the blogosphere, adding one of my trademark extremely long titles. Part response, part tutorial. People with no time to read it all: Skip to the end and see if you know the answer to the question or if you have ideas on how to do such a thing.

It’s funny how some vendors won’t hesitate to wholeheartedly agree when some “independent” blogger criticizes their competition (before I get flamed, independent in quotes since, as I discussed before, there ain’t no such thing whether said blogger realizes it or not – being biased is a basic human condition).

The equivalent of someone posting in an Audi forum about excessive brake dust, and having guys from Mercedes and BMW chime in and claim how they “tested” Audis and indeed they had issues (but of course!) and how their cars are better now and indeed maybe Audi doesn’t have as much of a lead any more (if, indeed, they ever did). I think the term for that is “shill” but I can understand taking every opportunity to harm an opponent.

So the “Storage Architect” posted entries asking about certain features to be implemented on NetApp storage, one of them being able to reduce the size of an aggregate. Then everyone and their mum jumped on and complained how on earth such an important feature isn’t there :) BTW I’m not saying such a thing wouldn’t be useful to have from time to time. I’ll just try to explain why it’s tricky to implement and maybe ways to avoid problems.

For the uninitiated, a NetApp aggregate is a collection of RAID-DP RAID groups that are pooled and striped, so that I/O hits all the drives from all RAID groups equally for performance. You then carve volumes out of that aggregate (containers for NFS, CIFS, iSCSI, FC).

A pretty simple structure, really, but effective. Similar constructs are used by many other storage vendors that allow pooling.

So, the question was, why not be able to make an aggregate smaller? (you can already make it bigger on-the-fly, as well as grow or shrink the existing volumes within).

An HP guy then proceeded to complain about how he put too few drives in an aggregate and ended up with an imbalanced configuration while trying to test a NetApp box.

So, some basics:  the following picture shows a well-balanced pool – notice the equal number of drives per RAID group:

The idea being that everything is load-balanced:

Makes sense, right?

You then end up with pieces of data across all disks, which is the intent. Growing it is easy – which is, after all, what 99.99% of customers ever want to do.

However, the HP dude didn’t have enough disks to create a balanced config with the default-sized RAID group (16). So he ended up with something like this, not performance-optimal:

So what the HP dude wanted to do was reduce the size of the RAID group and remove drives, even though he had originally expanded the aggregate (and by extension the RAID group).

Normally, before one starts creating pools of storage (with any storage system), one also knows (or should) what one has to play with in order to get the best overall config. It’s like “I want to build a 12-cylinder car engine, but I only have 9 cylinders”. Well – either buy more cylinders, or build an 8-cylinder engine! Don’t start building the 12-cylinder engine and go “oops” :) This is just Storage 101. Mistakes can and do happen, of course.

So, with the current state of tech, if I only had 20 drives to play with (and no option to get more), assuming no spares, I’d rather do one of the following:

  1. Aggregate with 10 + 10 RAID groups inside or
  2. Use all 20 drives in a single RAID group for max space
  3. Ask someone that knows the system better than I do for some advice

This is common sense and both doable and trivial with a NetApp system. The idea is you set the desired RAID group size for that aggregate BEFORE you put in disks. Not really difficult and pretty logical.

For instance, aggr options HPdudeAggr raidsize 10 before adding the drives would have achieved #1 above. Graphically, the Web GUI has that option in there as well, when you modify an aggregate. The option exists and it’s well-known and documented. Not knowing about it is a basic education issue. Arguing that no education should be needed to use a storage device (with an extreme number of features) properly even for deeply involved, low-level operations, is a romantic notion at best. Maybe some day. We are all working hard to make it a reality. Indeed, a lot of things that would take a really long time in the past (or still, with other boxes) have become trivialized – look at SnapDrive and the SnapManager products, for instance.

Back to our example: if, in the future, 10 more disks were purchased, and approach #1 above was taken, one would simply add the ten disks to the aggregate with aggr add HPdudeAggr 10. Resulting in a 10+10+10 config.

But what if I had done #2 above (make a 20-drive RAID group the default for that aggregate)?

Then, simply, you’d end up imbalanced again, with a 20+10. Some thought is needed before embarking on such journeys.

Maybe a better approach would be to add, say, a more reasonable number of drives to achieve good balance? Adding 12 more drives, for example, would allow for an aggregate with 16+16 drives. So, one could simply change the raidsize using aggr options HPdudeAggr raidsize 16, then, add the 12 disks to the aggregate with aggr add HPdudeAggr -g all 12.

This would expand both RAID groups contained within the aggregate dynamically to 16 drives per, resulting in a 16+16 configuration. Which, BTW, is not something you can easily do with most other storage systems!
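To make the balancing arithmetic above concrete, here is a minimal Python sketch. It is a toy model and not how Data ONTAP actually assigns disks: the function names `add_disks` and `is_balanced` are made up for illustration, and the model simply assumes existing RAID groups get topped up to the raidsize in order, with leftover disks forming new groups.

```python
from typing import List


def add_disks(groups: List[int], new_disks: int, raidsize: int) -> List[int]:
    """Return RAID group sizes after adding disks (simplified, hypothetical model).

    Existing groups are topped up to `raidsize` in order; any disks left
    over form new groups of at most `raidsize` disks each.
    """
    result = list(groups)
    remaining = new_disks
    # Top up existing groups that are below the target size.
    for i, size in enumerate(result):
        if remaining == 0:
            break
        take = min(max(raidsize - size, 0), remaining)
        result[i] = size + take
        remaining -= take
    # Disks still left over go into new RAID groups.
    while remaining > 0:
        take = min(raidsize, remaining)
        result.append(take)
        remaining -= take
    return result


def is_balanced(groups: List[int]) -> bool:
    """A layout counts as balanced when every RAID group has the same size."""
    return len(set(groups)) <= 1


# 20 drives at the default raidsize of 16: the imbalanced layout.
print(add_disks([], 20, raidsize=16))        # [16, 4]
# 20 drives with raidsize set to 10 first: option 1 from the list.
print(add_disks([], 20, raidsize=10))        # [10, 10]
# Ten more disks later, still at raidsize 10.
print(add_disks([10, 10], 10, raidsize=10))  # [10, 10, 10]
# Single 20-drive group, then ten more disks: imbalanced again.
print(add_disks([20], 10, raidsize=20))      # [20, 10]
# Raise raidsize to 16 and add 12 disks across both groups.
print(add_disks([10, 10], 12, raidsize=16))  # [16, 16]
```

Running the layout through something like `is_balanced` before committing disks is roughly the kind of sanity check the interfaces could do for you.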

Having said all that, I think that for people that are not storage savvy (or for the storage savvy that are suffering from temporary brain fog), a good enhancement would be for the interfaces to warn you about imbalanced final configs and show you what will be created in a nice graphical fashion, asking you if you agree (and possibly providing hints on how it could be done better).

I’m not aware of any other storage system that does that degree of handholding but hey, I don’t know everything.

Indeed, maybe the nature of the other posts was being bait so I’ll obligingly take the bait and ask the question so you can advertise your wares here: :)

Is anyone aware of a well-featured storage system from an established, viable vendor that currently (Aug 7, 2010, not roadmap or “Real Soon Now”) allows the creation of a wide-striped pool of drives with some RAID structures underneath; then allows one to evacuate and then destroy some of those underlying RAID groups selectively, non-disruptively, without losing data, even though they already contain parts of the stripes; then change the RAID layout to something else using those same existing drives and restripe without requiring some sort of data migration to another pool and without needing to buy more drives? Again, NOT for expansion, but for the shrinking of the pool?

To clarify even further: What the HP guy did was exactly this: He had 20 drives to play with, he created by mistake a pool with 2 RAID groups, 14+2 and a 2+2, how would your solution take those 2 RAID groups, with data, and change the config to something like 10 + 10 without needing more drives or the destruction of anything?

Can you dynamically reduce a RAID group? (NetApp can dynamically expand, but not reduce a RAID group).

I’m not implying such a thing doesn’t exist, I’m merely curious. I could see ways to make this work by virtualizing RAID further. Still, it’s just one (small) part of the storage puzzle.

The one without sin may cast the first stone! :)



Et tu, Brute? EMC offering capacity guarantees? The sky is falling! Will Chuck resign?

It came to my attention that EMC is offering a 20% efficiency guarantee vs the competition (they seem to be focusing on NetApp as usual but that’s beside the point in this post). See here.

Now, I won’t go ahead and attack their guarantee. Good luck with that, more power to you etc etc. They need all the competitive edge they can get.

No, what I’ll do is expose yet more EMC messaging inconsistency. If you’ve been following the posts in my site you’ll notice that I have absolutely nothing against EMC products – but I do have issues with how they’re sold and marketed and what they’ll say about the competition.

First and foremost: most major storage players, with the notable exception of EMC, have been offering some kind of efficiency guarantee. Sure, you needed to read the fine print to see if your specific use case would be covered (like with every binding document), but at least the guarantees were there. NetApp was first with our 50% efficiency guarantee, then came others (HDS and 3Par are just some that come to mind). We even offer a 35% guarantee if we virtualize EMC arrays :)

We all have different ways of getting the efficiency. NetApp has a combo of deduplication, thin provisioning, snapshots, highly efficient RAID and thin cloning, for instance. Others have a subset (3Par has their really good thin provisioning, for example). Regardless, we all tried to offer some measure of extra efficiency in these hard economic times.

And it’s not just marketing: I have multiple customers that, especially on virtualized environments, save at least 70% (that’s a real 70%, not 70% because we switched them from RAID10 to RAID-DP – literally, a 10TB data set is occupying 3TB). And for deployments like VDI, the savings are in the extreme range.

EMC’s stance was to, at a minimum, ridicule said guarantees. The inimitable Barry Burke (the storage anarchist) had this pretty funny post.

Chuck Hollis has been far more polemic about this – the worst was when he said he’d quit if EMC tried to do something similar (see here in the comments). BTW – we are all waiting for that resignation :) (on a more serious note, Chuck, if you don’t resign because of this, at least refrain from promising next time).

He also called other guarantees “shenanigans” here. I guess he’s really against the idea of guarantees.

But now it’s all good you see, EMC is offering a blanket 20% efficiency guarantee versus the competition! I.e. they will be able to provide 20% more actual usable storage or else they’ll give you free drives to cover the difference. You see, this guarantee is real, not like what all the other companies offer :)

Kidding aside, methinks they’re missing the point – this (to go back to my favorite car analogies) is like saying: “Both our car and your car have a 3-liter engine, but yours has twin turbos and a racing intercooler and 3 times the horsepower but we won’t take any of that into account, we will strictly examine whether you indeed have a 3-liter engine, and we’ll bore ours out to make it 3.6 liters for free”. Alrighty then. I’ll keep my turbos. But how will they deal with an existing NetApp customer that’s getting something like 3x efficiency already? Fulfilling the guarantee terms could get mighty expensive.

If a NetApp customer is getting 3x the usable storage due to deduplication and other means, will EMC come up with the difference or will they just make sure they offer 20% more raw storage?

To the customer, all that matters is how much effective storage they’re able to use, not how much raw storage is in the box.
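A quick back-of-the-envelope sketch of that gap, in Python. The 20% guarantee and the 3x efficiency figure come from the discussion above; the 10TB starting point and the function names are purely hypothetical, picked to keep the numbers round.

```python
def effective_tb(usable_tb: float, efficiency_factor: float) -> float:
    """Logical data that fits in `usable_tb` given an efficiency multiplier."""
    return usable_tb * efficiency_factor


def guaranteed_usable_tb(competitor_usable_tb: float, guarantee_pct: int = 20) -> float:
    """Usable space promised by an 'N% more usable storage' guarantee."""
    return competitor_usable_tb * (100 + guarantee_pct) / 100


usable = 10.0                                # hypothetical system size, in TB
with_efficiency = effective_tb(usable, 3.0)  # customer already storing 3x
with_guarantee = guaranteed_usable_tb(usable)

print(with_efficiency)                   # 30.0
print(with_guarantee)                    # 12.0
print(with_efficiency - with_guarantee)  # 18.0 TB of data with nowhere to go
```

In other words, if the guarantee only counts usable space, the customer in this made-up example would still be 18TB short of what they are effectively storing today.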

But, still, this is not what this post is about.

Throughout the years, NetApp and other vendors have offered true innovation on different fronts. Each time that happens, EMC (that also innovates – through acquisition mostly – but likes to act as if nobody else does) employs their usual “minimize and divert” technique. Either they will trivialize the innovation (“who’d want to do that?”) or they will proclaim it false, then divert attention to something they already do (or will do in a few years).

This is even the case for technologies EMC eventually acquired, like Data Domain. Before EMC acquired Data Domain, they disparaged the product, claimed it was the worst kind of device you’d ever want in your datacenter, then tried to sell you the execrable DL3D (AKA Quantum DXi – don’t get me started, the first release was an utter mess).

We all know what happened to that story eventually: at the moment, EMC is offering to swap out existing DL3Ds for free in many cases, and put Data Domain in their place since it’s infinitely better. But wait, weren’t they saying how terrible Data Domain was compared to DL3D?

Some will say this is fine since they’re just trying to compete, and “all is fair”. Personally, if I were approached by sales teams with those about-face tactics, I’d be annoyed.

So, without further ado, I present you with a slide a colleague created. Some of the timing may be a bit off, but the gist should be fairly clear… :)

I could have added a few more lines (Flash Cache, for instance) but it would have made for too busy a slide.

EDIT: I’ll add something I posted as a comment on someone else’s blog that I think is germane.

Since, to provide apples-to-apples protection, EMC HAS to be configured with RAID6, where are the public benchmarks showing EMC RAID6? As you well know, ALL NetApp benchmarks (SPEC, SPC) are with RAID-DP. Any EMC benchmarks around are with RAID10.

Maybe another guarantee is needed:

Provide no worse protection, functionality, space and performance than X competitor.

Otherwise, you’re only tackling a relatively unimportant part of the big picture.



NetApp usable space – beyond the FUD

I come across all kinds of FUD, and some of the most ridiculous claims against NetApp regard usable space. I won’t post screenshots from competitive docs since who knows who’ll complain, but suffice it to say that one of the usual strategies against NetApp is to claim the system has something like well under 50% space efficiency using a variety of calculations, anecdotes and obsolete information. In one case, 34% usable space :) Right…

The purpose of this post is to outline the state of the art regarding NetApp usable space as of Spring of 2010.

Since NetApp systems can use free space in various ways instead of just for LUNs, there is frequent confusion regarding what each space-related parameter means, and what the best practices are. NetApp’s recommendations have changed over the years as the technology matured – my goal is to bring everybody up to speed.

Executive summary

Depending on the number and type of drives and the design, aside from edge cases dealing with small systems with a very low number of disks, the real usable space in NetApp systems can easily exceed 75% of the real usable space in the drives. I’ve seen it as high as about 78% of the actual space on the drives. That’s amazingly efficient for something with double-parity protection as default and includes spares. This number is the same whether it represents NAS or SAN data and doesn’t include deduplication, compression or space-efficient clones, which could inflate it to over 1000%. Indeed, NetApp systems are used in the biggest storage installations on the planet partly because they’re so space-efficient. Now, on to the details.

What’s space good for anyway?

Legacy arrays use space in very simple terms – you create RAID groups, then you create LUNs on them and those LUNs pretend they’re normal disks, and that’s that. Figuring out where your space goes is easy – there’s a 1:1 relationship between LUN size and space used on the array. You buy an array that can provide 10TB after RAID and spares, and that’s all you ever get – nothing more, nothing less.

Legacy arrays can sometimes use features such as snapshots, but frequently there are so many caveats around their use (performance being a big one) that either they’re never implemented, or their number is too small to make them really useful.

Since NetApp gear doesn’t suffer from those limitations, customers invariably end up using snapshots a lot, and for various reasons, not just backup. I have customers with over 10,000 snapshots in their arrays – they replicate all those snapshots to another array, can retrieve data that’s several months old, and have stopped relying on legacy backup software, saving money and achieving far faster and easier DR in the process, since with snapshots there’s no restore needed.

What’s your effective space with NetApp gear?

If you consider that each snapshot looks like a complete copy of your data, without factoring in any deduplication at all, the effective logical space could be many, many times more than the physical space. A large law firm I deal with manages to fit about 2.5PB of data into 8TB of snapshot delta space – which is pretty efficient by anyone’s standards. We’re not talking about backups done on deduplicated disk here that need to be restored to become useful – we’re talking about many thousands of straight-up, application-consistent, “full” copies of LUNs, CIFS and NFS shares that you can mount at full speed instantly, without needing to restore from another medium or backup application.

Once you add deduplication and thin cloning, the storage efficiency goes even higher.

It’s not the size of your disk that matters, it’s how you use it

If you use a NetApp system like a legacy disk array, without taking advantage of any of the advanced features (maybe you just care for the multi-protocol functionality, with great performance and reliability) then your usable space falls right within norms. Once you start using the advanced snapshot features, they start eating space of course – but giving you something in return. What you need to figure out is if the tradeoffs are worth it: for instance, if I can keep a month’s worth of Exchange backups with a nominal capacity increase, what is that worth for me? Maybe:

  • I can eliminate backup software licenses
  • I can shrink my storage footprint
  • I can avoid purchasing external disk for backups
  • I don’t need to buy external CDP hardware/software and a bunch of extra disk
  • My restores take seconds
  • DR becomes trivial

Or, if I can create 150 clones of my SQL database that my developers can simultaneously use and only chew up a small fraction of the space I’d otherwise need, what is that worth? With other systems, I’d need 150x the space…

Or, create thousands of VM clones for VDI…

How much money are you saving?

What do simplicity and speed mean to your business from an OpEx savings standpoint?

Another way to look at it:

How much more efficient would your business be if you weren’t hampered by the limitations of legacy technology? It’s all about becoming aware of the expanded possibilities.

What you buy

FYI, and to clear any misconceptions in case you can’t be bothered to read the rest: if you ask me for a 10TB usable system, you’ll get a system that will truly provide 10TB usable, honest-to-goodness Base2 space protected against dual-drive failure (no RAID5 silliness), and after all overheads, spares etc. have been taken out. If you want snapshot space we’ll have to add some (like you’d need to with any other vendor). It’s as simple as that.

Right-sized, real space vs raw capacity

Others have explained some of this before but, for completeness, I’ll take a stab:

  • The real usable size of, say, a 450GB drive is not really 450GB regardless of the manufacturer.
  • The real usable capacity quoted depends on whether it’s Base2 or Base10 math and a bunch of other factors
  • All vendors that source drives from multiple manufacturers that use RAID groups need to right-size their drives – meaning that, if manufacturer A offers a tad more space in the drive than manufacturer B, in order to use both kinds of drives in the same RAID group, you kinda need to make them seem like the exact same size, meaning you go for the lowest common denominator between drive vendors.
  • Using our 450GB example above, the real addressable right-sized Base10 space in that drive is 438.3GB, and even less in Base2 (402.2). Base2 math simply means 1024 bytes in 1K, not 1000, and the rest follows.
  • Beware of analysis, comparisons or quotes showing Base10 from one vendor and Base2 from another, or raw disk space from one vendor vs right-sized from another! Always ask what base is what you’re seeing and whether the numbers reflect right-sized drives! If you look at the right-sized drive Base2 space from various vendors, it’s usually pretty close. Base your % usable calculations on that number and not the marketing 450GB number that’s not real for any vendor anyway.
  • Everyone pretty much buys the same drives from the same drive manufacturers
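For anyone who wants to check the Base10 vs Base2 part themselves, here is a small Python sketch. It only covers the unit conversion (1000-based vs 1024-based); it does not model right-sizing or any other per-vendor overhead, so it will not reproduce right-sized figures. The function name is an arbitrary choice for illustration.

```python
def base10_gb_to_base2_gb(gb: float) -> float:
    """Convert Base10 gigabytes (10**9 bytes) to Base2 gigabytes (2**30 bytes)."""
    return gb * 10**9 / 2**30


# The marketing "450GB" number expressed in Base2, before any right-sizing.
print(round(base10_gb_to_base2_gb(450), 1))  # 419.1
```

So even before right-sizing, roughly 7% of the marketing number disappears just by switching bases, which is why mixing bases in a comparison is so misleading.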

Some space reservation axioms

Any system that allows snapshots, clones etc. typically needs some space for those advanced operations. For instance, if you completely fill up a system and then want to take a snapshot, it may let you but if you modify any data then it won’t have space to store the writes and the snapshot will be invalidated and deleted – kinda pointless.

As usual, there is no magic. If you expect to be able to store multiple snapshots, the system needs space to store the data changed between snapshots, regardless of array vendor!
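A rough way to size that space, sketched in Python. This is a deliberately pessimistic estimate that assumes every snapshot interval changes a different set of blocks; the 2% daily change rate and 30-snapshot retention are hypothetical inputs, not measured values, and the function names are made up.

```python
def snapshot_delta_tb(volume_tb: float, change_pct_per_snapshot: float,
                      snapshots_retained: int) -> float:
    """Upper-bound estimate of space needed to retain snapshots.

    Assumes each interval changes a distinct set of blocks, so the
    deltas simply add up (real-world overlap makes this smaller).
    """
    return volume_tb * change_pct_per_snapshot * snapshots_retained / 100


def snapshots_fit(free_tb: float, volume_tb: float,
                  change_pct_per_snapshot: float, snapshots_retained: int) -> bool:
    """True if the free space can hold the estimated snapshot deltas."""
    needed = snapshot_delta_tb(volume_tb, change_pct_per_snapshot, snapshots_retained)
    return free_tb >= needed


# Hypothetical: 10TB volume, 2% change per daily snapshot, 30 days kept.
print(snapshot_delta_tb(10, 2, 30))   # 6.0
print(snapshots_fit(1, 10, 2, 30))    # False: 1TB free is not enough
print(snapshots_fit(8, 10, 2, 30))    # True
```

The same arithmetic applies to any array vendor; only the efficiency of how the deltas are stored differs.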

And, out of curiosity – how many man-made devices do you own that you max out all the time? Not leaving breathing room is a recipe for trouble for any piece of equipment.

Explanation of the NetApp data organization

For the uninitiated, here’s a hierarchical list of NetApp structures:

  1. Disks
  2. RAID groups – made of multiple disks. Default RAID is RAID-DP. The system automatically makes them, you don’t need to define them or worry about back-end balancing etc. NetApp RAID groups are typically large, 16 disks or so. RAID-DP ensures better protection than RAID10 (the math shows 163x better than RAID10 and 4,000x better than RAID5).
  3. Parity drives – drives containing extra information that can be used to rebuild data. RAID-DP uses 2 parity drives per RAID group.
  4. Spares – drives that can replace failed or failing drives (no need to wait until the drive is truly dead)
  5. Aggregates – a collection of RAID groups and the basic unit from which space is allocated. That’s really what you define, then the system figures out automatically how to allocate disks and create RAID groups for you (can even expand RAID groups on the fly as you add more disks to the aggregate, even 1 disk at a time).
  6. Volumes – a container that takes space from an Aggregate. A volume can be NAS or SAN. A volume can only belong to one Aggregate, and there will typically be many volumes within an Aggregate. Most people will enable the automatic growing of Volumes.
  7. LUNs – they are placed inside the Volumes. One or more per volume, depending on what you’re trying to do. Usually one.
  8. Snapshots – logical, space-efficient copies of either entire Volumes or structures within volumes. There are 3 kinds depending on what you’re trying to do (Snapshot, Snapvault and Flexclone) but they all use similar underlying technology. I might get into the differences in a future post. Briefly: Snapshot – shorter term, Snapvault – longer term, Flexclone – writeable Snapshot.

Explanation of the NetApp space allocations

  1. Snapshot Reserve – an accounting feature that sets aside a logical percentage of space on a Volume. For instance, if you create a 10TB volume and set a 10% Snap Reserve, the client system will see 9TB usable. Most people will enable automatic deletion of Snapshots. The percentage to set aside is at your discretion and is variable on the fly. The actual amount of space consumed is related to your rate of change between snapshots. See here for some real averages across thousands of systems.
  2. Aggregate Snap Reserve – this is pretty unique. One can actually roll back an entire Aggregate on a NetApp system – can come in handy if you accidentally deleted whole Volumes or in general did some gigantic boo-boo. Rolling back the entire Aggregate will undo whatever was done to that aggregate to break it! This feature is enabled by default and has a 5% reservation. It is not mandatory unless you are running Syncmirror (mostly in Metrocluster setups). Depending on what you want to do, you could disable this altogether or set it to a small number like 1% (my recommendation).
  3. Fractional Reserve – The one that confuses everyone. In a nutshell: it’s a legacy safety net in case you want to modify all the data within a LUN yet still keep the snapshots. Think about it: Let’s say you took a snapshot and you then went ahead and modified every single block of your data. Your snap delta would balloon to the total size of the LUN – regardless of whether you use NetApp, EMC, XIV, Compellent, 3Par, HDS, HP etc etc. The data has to go someplace! There’s a great explanation in this document and I suggest you read it since it covers quite a bit more, too. This one is great, too. Long story short: With snapshot autodelete, and/or volume autogrow, you can set it to zero. If you use the SnapManager products, they take care of snapshot deletion themselves.
  4. System reserve – this is the only one that’s not optional. It’s set to 10% by default. You can actually change it but I’m not telling you how. That space is there for a reason, and changing it will potentially cause problems with high write rate environments. That 10% is used for various operations and has been found to be a good percentage to maintain good performance. All NetApp sizing takes this into account. BTW – ask other vendors if it’s perfectly safe to fill their systems at 100% all the time and whether that impacts performance or prevents them from being able to do certain things. And finally, that 10% lost is gained back in spades with the other NetApp efficiency methodologies (starting at the low level with RAID-DP – please do some simple math based on our 16+ drive RAID group vs typical RAID group sizes) so it doesn’t even matter.
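The Snapshot Reserve accounting from item 1 can be sketched in a few lines of Python. This is only an illustration of the percentage arithmetic; the function name is hypothetical, and it does not attempt to model how the other reserves in the list interact.

```python
def client_visible_tb(volume_tb: float, snap_reserve_pct: int) -> float:
    """Space the client sees once a logical snapshot reserve is set aside."""
    return volume_tb * (100 - snap_reserve_pct) / 100


print(client_visible_tb(10, 10))  # the 10TB volume with a 10% Snap Reserve from the text
print(client_visible_tb(10, 0))   # the same volume with the reserve set to zero
```

Because the reserve is just accounting, changing the percentage changes what the client sees without moving any data.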

Bottom line: Aside from the 10% system reserve, the rest is all usable space.

The NetApp defaults and some advice

So, here’s where it can get interesting (and confusing) and where the competition gets all their ammunition. Depending on the age of the documentation and firmware, different best practices and defaults apply.

So, if you look at competitive docs from other vendors, they claim that if you use NetApp for LUNs you waste double the space for fractional reserve. That recommendation was true many years ago and it was a safety precaution regarding fractional reserve. The documentation was updated years ago with zero fractional reserve as the recommendation, but of course that doesn’t help competitors so they kept the old messaging. So here’s a basic list of quick recommendations for LUNs:

  1. Snap reserve – 0
  2. Fractional reserve – 0
  3. Snap autodelete on (unless you have SnapManager products managing the snap deletion)
  4. Volume autogrow on
  5. Leave at least a little space available in your volumes, don’t let a LUN 100% fill a volume (the LUN space can be thick but the volume space can be thin-provisioned). This space is needed temporarily for deduplication and other processes.
  6. Do consider embracing thin provisioning, even if you don’t want to oversubscribe your disk. It’s much more flexible long-term, and allows for storage elasticity.

So, look at the defaults and ask your engineer if it’s OK to change them if they don’t agree with the settings above. Especially on older systems, I notice that the fractional reserve is still 100%, even after getting updated with the latest software (the update doesn’t change your config). Nothing like giving someone a bunch of disk space back with a few clicks…

If you want to do thin provisioning, depending on the firmware, you may see that using thin provisioning on a volume forces the fractional reserve to 100% – but, ultimately, no real space is being consumed. This was OK in 7.2x, changed to the 100% behavior in 7.3.1, and was fixed in 7.3.3 since it was confusing everyone.

The bottom line

Ultimately, I want you thinking of how you can use your storage as a resource that enables you to do more than just storing your LUNs. And, finally, I wanted to dispel notions that NetApp storage has less storage efficiency than legacy systems. Comments are always appreciated!


FUD and The Invention of Lying

I watched “The Invention of Lying” movie the other day. Fairly entertaining, and it had an interesting concept:

Imagine a society where nobody can lie – the very concept of lying is alien and never even enters anyone’s mind. Obviously, tons of jokes can be made using that premise, and the movie is riddled with them – such as their fictional Pepsi ad: “Pepsi: when they’re out of Coke!”

In the movie, a single man stumbles upon the concept of lying, and realizes he can do whatever he wishes since nobody else can tell he’s lying.

Obviously, in our society lying is quite prevalent – a large percentage of the population wouldn’t have jobs or offspring without lying.

I thought – what if, just for fun, we applied “The Invention of Lying” movie concept to IT sales? (I guess this is another take on comparing vendors to cars or wines and whatnot). I’m going for an alphabetical, non-comprehensive list (and added a few non-storage entries). I’ll leave it to the reader to figure out if this is more accurate from the standpoint of a rep that cannot lie, or vice versa… :)

  • 3Par: Our best asset is Marc Farley, his highly entertaining blog is what sells our gear. Our gear is pretty fast, though the software is not as good as others’. Unsure how we are still in business. Also unsure why nobody has bought us yet. We do have a handful of very large, loyal customers.
  • Apple: Our stuff is prettier but inside it’s all the same, actually often slower than others. Oh, and it’s a lot more expensive. But the software is cool (when you can find it). You’ll probably need to run Windows in a VM anyway to get the full functionality. Did we mention our stuff is prettier?
  • Bluearc: We have limited-functionality NAS with good sequential and random read speeds but not so much for random writes. Oh, and no application integration. But it’s good for certain workloads. Why is nobody acquiring us?
  • Compellent: Data Progression is the coolest thing we do, and we’ll probably go under now that the big vendors can do it. Oh, and it never did much in the real world, especially for performance. Hopefully we’ll get acquired, but if our technology is that good, why did nobody acquire us yet? We’re extremely affordable!
  • Equallogic: We’ll give you free storage (the first hit is free) if/since you also buy Dell servers. We might even throw in a free laptop and a projector. And a mouse pad. Make sure you convert everything to iSCSI since that’s all we do. Oh, you wanted to know specifics about the storage? Well – it’s free! If you buy some servers. You really want to know about the storage? Well, it’s free if… What? You want to understand the failure math of RAID 50? It’s atrocious, but the box is free if…
  • EMC: We buy companies since innovating is kinda hard and time-consuming, so our solutions end up being a mish-mash of technologies. It all mostly works, though interoperability between platforms sucks. Regarding storage, you should really only buy Symmetrix since all our other stuff doesn’t even come close to that quality, we have the other boxes just to meet price points and plug portfolio holes. We trash competitors until we acquire them or until we build something good enough that’s similar. We also sell futures. Hard. We focus too much on NetApp.
  • HDS: We don’t know how to write software but our high-end gear hardware is pretty solid. The cheaper stuff is OK, severely lacks in functionality but we’ll just drop the price enough that you’ll buy it anyway. Capisce?
  • HP: Seems that buying companies works for EMC, we’ll do the same, let’s see what happens. We used to make the best calculators in the world. Oh, and our best array is actually made by HDS. Our servers are great! Please, also buy some printers, they’re pretty good.
  • IBM: We used to be some of the best in storage, now our only 2 products are SVC and DS8K (oops, and now XIV), everything else we resell after we put our faceplates on it. Our biggest sellers are products made by LSI and NetApp. Oh, and we internally compete with the XIV team we acquired. Our storage solutions don’t talk to one another since they’re all made by different people. But SVC can tie it all together! Well, some of it, anyway.
  • Intel: We are so big that even if AMD has better stuff, eventually we catch up. Just you wait. In the meantime, buy more Intel to keep us going. Resistance is futile.
  • Isilon: We are decent for bulk sequential-access NAS, just don’t do any kind of random workload on our gear.
  • LeftHand: If you want any reasonable storage efficiency plus resiliency you need to buy a bunch of boxes (5 or so), since each box is essentially an HP server with internal disks, and the whole server can die. Oh, and we only do iSCSI. So you better make sure you only do iSCSI.
  • NetApp: We probably have some of the worst marketing of all vendors, and often can’t clearly articulate what makes our systems better to C-level execs, focusing almost entirely on techies. We also have issues with making some acquisitions pan out. ONTAP 8 is taking us forever to release, and until then you won’t have very wide striping (update: GA’d 3/19/10). We complicate sales because our engineers are too technical and insist on explaining how the boxes work at a low level, frequently confusing customers, who seldom care about understanding Row-Diagonal Parity equations. Too much good information is tribal knowledge, including performance tuning and the gigantic customers we have. We focus too much on EMC.
  • Pillar: We cry ourselves to sleep because all we have is Larry Ellison and QoS. Maybe Larry will finally force Oracle to finally buy some of his^H^H^H our gear? I wonder how that will go down since Oracle is already using a superior technology and achieving great savings… but we do make a fairly fast box if you’re OK with limited functionality and RAID50.
  • Sun: We can sell you some LSI storage, but even that may be going away. You can also get the exact same storage from IBM that also resells LSI. How about a Thumper? We may also have some leftover HDS gear that we can give you real cheap.
  • Xiotech: Our value prop is extremely obscure and only understood well by about 5 engineers. Out of those 5 engineers, 2 understand the exact failure scenarios of our ISE architecture, and they can’t explain it to anyone else. We are pretty cheap though.
  • XIV: We believe in success through obfuscation. Our box can only do about 17K IOPS if the workload isn’t cache-friendly but we know how to cheat in benchmarks and make it seem faster (make sure your benchmark writes all zeros and/or fits in cache). The box also consumes more power and space than any other storage system. Our reps compete with IBM reps even though we are owned by IBM, since we only get paid on XIV sales, regardless of what the customer’s needs are. Oh, and under certain conditions, a 2-disk failure will bring down the entire system. But don’t you worry about that. BTW, the GUI is amazingly pretty.

Hope you had a chuckle reading some of this!

(minor edits – typo plus some on Twitter complained I was too gentle in the NetApp section :) )


More tales from the field: Sizing best practices – does Compellent follow them?


Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers – remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (first case I witnessed was Exchange sizing where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent over perfmon data for their servers so I could analyze it and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance – there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last, and see if the caches would be able to accommodate them. If a spike lasts 30 minutes, it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)

Then, you have your average performance. That just takes a straight math average across all performance points – so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of steady-state performance during normal working periods. Easy to eyeball actually if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:


In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
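If you would rather not eyeball the graph, the same idea can be approximated in code. The Python sketch below is one possible stand-in, not the method described above: it drops samples under an assumed idle threshold and takes the median of what is left, so a brief spike doesn’t drag the result up. The sample trace, the threshold and the function name are all made up for illustration and are not the data behind the chart.

```python
from statistics import mean, median


def steady_state_iops(samples, idle_threshold):
    """Median of the samples taken while the system was actually busy."""
    busy = [s for s in samples if s > idle_threshold]
    return median(busy) if busy else 0


# Hypothetical trace: a quiet period, one brief spike, then sustained load.
samples = [20] * 12 + [1500] + [500] * 12

print(f"plain average: {mean(samples):.0f} IOPS")
print(f"steady state:  {steady_state_iops(samples, idle_threshold=100):.0f} IOPS")
```

The plain average lands well below the sustained load because the idle samples pull it down, while the busy-period median sits right on the level the system actually has to sustain. Picking the idle threshold is still a judgment call, so treat the result as a starting point and sanity-check it against the graph.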

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
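As a rough illustration of how the read/write mix changes the back-end load, here is a small Python sketch. It uses the commonly quoted rule-of-thumb penalties for small random writes (RAID5: 4 back-end I/Os per write, RAID10: 2, RAID0: 1); these are assumptions for the sketch, and real arrays with write coalescing or full-stripe writes will behave differently. The 1,000 IOPS workload and 30% write mix are made up.

```python
# Rule-of-thumb back-end I/Os per random front-end write (assumed values).
WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4}


def backend_iops(frontend_iops: int, write_pct: int, raid: str) -> float:
    """Disk-level IOPS needed to serve a front-end workload on a given RAID type."""
    reads = frontend_iops * (100 - write_pct) / 100
    writes = frontend_iops * write_pct / 100
    return reads + writes * WRITE_PENALTY[raid]


# Hypothetical workload: 1,000 front-end IOPS, 30% writes.
for raid in WRITE_PENALTY:
    print(raid, backend_iops(1000, write_pct=30, raid=raid))
```

The more write-heavy the mix, the wider the gap between the RAID types, which is why the read/write percentage has to be known before any disk count means anything.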

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
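The arithmetic above is simple enough to check in a few lines of Python. The inputs are the figures quoted in this post (roughly 3,000 steady-state IOPS, about 220 IOPS per 15K disk, 50% more IOPS once adjusted for RAID5); the function and variable names are just for illustration, and the result ignores cache and I/O spikes entirely.

```python
import math


def disks_needed(iops: float, iops_per_disk: float) -> int:
    """Minimum whole disks needed to serve the given IOPS."""
    return math.ceil(iops / iops_per_disk)


steady_state = 3000   # pre-RAID steady-state IOPS
per_disk = 220        # 15K RPM disk at reasonable (<20ms) latency
raid_factor = 1.5     # RAID5-adjusted workload needs 50% more disk IOPS

pre_raid = disks_needed(steady_state, per_disk)
with_raid = disks_needed(steady_state * raid_factor, per_disk)

print(f"pre-RAID disks:  {pre_raid}")
print(f"with RAID5:      {with_raid}")
print(f"with 1-2 spares: {with_raid + 1}-{with_raid + 2}")
```

Swap in your own per-disk figure and RAID adjustment to redo the estimate for a different drive type or workload mix.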

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?