Category Archives: FUD

When competitors try too hard and miss the point – part two

This will be another FUD-busting post in the two-part series (first part here).

It’s interesting how some competitors, in their quest to beat us at any cost, set aside all common sense.

Recently, an Oracle blogger attempted to understand a document NetApp originally wrote in the 90’s (and which we haven’t really updated since, which is admittedly our bad) that explains how WAFL, the block layout engine of Data ONTAP (the storage OS on the FAS platform) works at a high level.

Apparently, he thinks that we turn everything into 4K I/Os, so if someone tried to read 256K, it would have to become 64 separate I/Os, and, by extension, believes this means no NetApp system running ONTAP can ever sustain good read throughput since the back-end would be inundated with IOPS.

The conclusions he comes to are interesting to say the least. I will copy-paste one of the calculations he makes for a 100% read workload:

Erroneous oracle calcs

I like the SAS logo, I guess this is meant to make the numbers look legit, as if they came from actual SAS testing :)

So this person truly believes that to read 2.6GB/s we need 5,120 drives due to the insane back-end IOPS we purportedly generate :)

This would be hilarious if it were true since it would mean NetApp managed to quietly perpetrate the biggest high tech scam in history, fooling customers for 22 years, and somehow managing to become the industry’s #1 storage OS and remain so.

Because customers are that gullible.


Well – here are some stats from a single 8040 controller, with 24 drives, doing over 2.7GB/s reads, at well under 1ms latency, so it’s not even stressed. Thanks to the Australian team for providing the stats:

8040 singlenode

In this example, 2.74GB/s are being read. From stable storage, not cache.

Now, if we do the math the way the competitor would like, it means the back-end is running at over 700,000 4K IOPS. On a single mid-range controller :)

That would be really impressive and hugely wasteful at the same time. Wait – maybe I should turn this around and claim 700,000 4K IOPS at 0.6ms capability per mid-range controller! Imagine how fast the big ones go!

It would also assume 35,000 IOPS per disk at a consistent speed and sub-millisecond response (0.64ms) – because the numbers above are from a single node with only about 20 data SSDs (plus parity and spares).
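For anyone who wants to check the arithmetic, here is a minimal sketch of the competitor's "everything becomes 4K" math applied to the numbers above. It assumes binary units (1GB = 1024^3 bytes) and exactly 20 data SSDs; the function and variable names are purely illustrative.

```python
KIB = 1024
GIB = 1024 ** 3


def split_iops(throughput_bytes_per_s, io_size_bytes=4 * KIB):
    """Back-end IOPS if every read were chopped into fixed-size I/Os."""
    return throughput_bytes_per_s / io_size_bytes


throughput = 2.74 * GIB   # observed read rate from the 8040 example
data_drives = 20          # assumption: "about 20 data SSDs" taken as exactly 20

backend_iops = split_iops(throughput)
per_drive_iops = backend_iops / data_drives

print(f"{backend_iops:,.0f} back-end 4K IOPS")
print(f"{per_drive_iops:,.0f} IOPS per data SSD")
```

With those assumptions, the result lands a little above 700,000 back-end IOPS and a little above 35,000 IOPS per drive, which is where the figures quoted above come from.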

SSDs are fast but they’re not really that fast, and the purpose of this blog is to illuminate and not obfuscate.

Remember Occam’s razor. What explanation do you think makes more sense here? Pixie-dust drives and controllers, or that the Oracle blogger is massively wrong? :)

Another example – with spinning disks this time

This is a different output, to also illustrate our ability to provide detailed per-disk statistics.

From a single 8060 node, running at over 3GB/s reads during an actual RMAN job and not a benchmark tool (to use a real Oracle application example). There are 192x 10,000 RPM 600GB disks in the config (168x data, 24x parity – we run dual-parity RAID, there were 12x 16-drive RAID groups in a 14+2 config).

Numbers kindly provided by the legendary neto from Brazil (@netofrombrazil on Twitter). Check the link for his blog and all kinds of DB coolness.

This is part of the statit command’s output. I’m not showing all the disks since there are 192 of them after all and each one is a line in the output:

Read chain

The key in these stats is the “chain” column. This shows, per read command, how many blocks were read as a single entity. In this case, the average is about 49, or 196KB per read operation.

Notice the “xfers” – these drives are only doing about 88 physical IOPS on average per drive, and each operation just happens to be large. They could go faster (see the “ut%” column) but that’s just how much they were loaded during the RMAN job.

Again, if we used the blogger’s calculations, this system would have needed over 5,000 drives and generated over 750,000 back-end disk IOPS.
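The chain math is easy to reproduce. The sketch below assumes 4KB blocks, binary units, and the rounded averages quoted above (chain of 49, 88 xfers per drive); the names are illustrative and not part of statit.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
BLOCK = 4 * KIB            # block size assumed for the chain math

avg_chain = 49             # blocks per read command ("chain" column)
xfers_per_drive = 88       # physical IOPS per drive ("xfers" column)
read_rate = 3 * GIB        # "over 3GB/s", taken as exactly 3 for the sketch

# What each physical read actually moves
bytes_per_op = avg_chain * BLOCK
per_drive_mib_s = xfers_per_drive * bytes_per_op / MIB

# What the 4K-split theory would demand from the back end
naive_backend_iops = read_rate / BLOCK

print(f"{bytes_per_op // KIB}KB per read operation")
print(f"{per_drive_mib_s:.1f}MB/s per drive at {xfers_per_drive} IOPS")
print(f"{naive_backend_iops:,.0f} back-end IOPS if every read were 4K")
```

A few dozen large reads per second per drive move the same data that the 4K-split theory claims needs hundreds of thousands of back-end operations.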

A public apology and retraction would be nice, guys…

Let’s extrapolate this performance at scale.

My examples are for single mid-range controllers. You can multiply that by 24 to see how fast it could go in a full cluster (yes, it’s linear). And that’s not the max these systems will do – just what was in the examples I found that were close to the competitor’s read performance example.

You see, where most of the competition is still dealing with 2-controller systems, NetApp FAS systems running Clustered ONTAP can run 8 engines for block workloads and 24 engines for NAS (8 if mixed), and each engine can have multiple TB of read/write cache (18TB max cache per node currently with ONTAP 8.2.x).

Even if a competitor’s 2 engines are faster than 2 FAS engines, if they stop at 2 and FAS stops at 24, the fight is over before it begins.

People that live in glass houses shouldn’t throw stones.

Since the competitor questioned why NetApp bought Engenio (the acquisition for our E-Series), I have a similar question: Why did Oracle buy Pillar Data? It was purchased after the Sun acquisition. Does that signify a major lack in the ZFS boxes that Pillar is supposed to address?

The Oracle blogger mentioned how their ZFS system had a great score in the SPC-2 tests (which measure throughput and not IOPS). Great.

Interestingly, Oracle ZFS systems can significantly degrade in performance over time (see here), especially after writes, deletes and overwrites. Unlike ONTAP systems, ZFS boxes don’t have mechanisms to perform the necessary block reallocations to optimize the data layout in order to bring performance back to original levels (backing up, wiping the box, rebuilding and restoring is not a solution, sorry). There are ways to delay the inevitable, but nothing to fix the core issue.

It follows that the ZFS performance posted in the benchmarks may not be anywhere near what one will get long-term once the ZFS pools are fragmented and full. Making the ZFS SPC-2 benchmark result pretty useless.

NetApp E-Series inherently doesn’t have this fragmentation problem (and is near the top as a price-performance leader in the SPC-2 benchmark, as tested by SGI that resells it). Since there is no long-term speed deterioration issue with E-Series, the throughput you see in the SPC-2 benchmark will be perpetually maintained. The box is in it for the long haul.

Wouldn’t E-Series then be a better choice for a system that needs to constantly deal with such a workload? Both cost-effective and able to sustain high throughput no matter what?

As an aside, I do need to write an article on block layout optimizations available in ONTAP. Many customers are unaware of the possibilities, and competitors use FUD based on observations from back when mud was a novelty. In the meantime, if you’re a NetApp FAS customer, ask your SE and/or check your documentation for the volume option read_realloc space_optimized – great for volumes containing DB data files. Also, check the documentation for the Aggregate option free_space_realloc.

So you’re fast. What else can you do?

There were other “fighting words” in the blogger’s article and they were all about speed and how much faster the new boxes from the competitor are versus some ancient boxes they had from us. Amazing, new controllers being faster than old ones! :)

I see this trend recently, new vendors focusing solely on speed. Guess what – it’s easy to go fast. It’s also easy to be cheap. I’ll save that for a full post another time. But I fully accept that speed sells.

I can build you a commodity-based million-IOPS box during my lunch break. It’s really not that hard. Building a server with dozens of cores and TB of RAM is pretty easy.

But for Enterprise Storage, Reliability is extremely important, far more than sheer speed.

Plus Availability and Serviceability (where the RAS acronym comes from).


Non-Disruptive Operations, even during events that would leave other systems down for extended periods of time.

Extensive automation, management, monitoring and alerting at scale as well.

And of crucial importance is Application Integration, including the ability to perform application-aware data manipulation (fully consistent backups, restores, clones, replication).

So if a system can go fast but can’t do much else, its utility is more towards being a point solution rather than as part of a large, strategic, long-term deployment. Point solutions are useful, yes – but they are also interchangeable with the next cheap fast thing. Most won’t survive.

You know who you are.



When competitors try too hard and miss the point

(edit: fixed the images)

After a long hiatus, we return to our regularly scheduled programming with a 2-part series that will address some wild claims Oracle has been making recently.

I’m pleased to introduce Jeffrey Steiner, ex-Oracle employee and all-around DB performance wizard. He helps some of our largest customers with designing high performance solutions for Oracle DBs:

Greetings from a guest-blogger.

I’m one of the original NetApp customers.

I bought my first NetApp in 1995 (I have a 3-digit support case in the system) and it was an F330. I think it came with 512MB SCSI drives, and maxed out at 16GB. It met our performance needs, it was reliable, and it was cost effective.  I continued to buy more over the following years at other employers. We must have been close to the first company to run Oracle databases on NetApp storage. It was late 1999. Again, it met our performance needs, it was reliable, and it was cost effective. My employer immediately prior to joining NetApp was Oracle.

I’m now with NetApp product operations as the principal architect for enterprise solutions, which usually means a big Oracle database is involved, but it can also include DB2, SAS, MongoDB, and others.

I normally ignore competitive blogs, and I had never commented on a blog in my life until I ran into something entitled “Why your NetApp is so slow…” and found this statement:

If an application such MS SQL is writing data in a 64k chunk then before Netapp actually writes it on disk it will have to split it into 16 different 4k writes and 16 different disk IOPS

That’s just openly false. I tried to correct the poster, but was met with nothing but other unsubstantiated claims and insults to the product line. It was clear the blogger wasn’t going to acknowledge their false premise, so I asked Dimitris if I could borrow some time on his blog.

Here’s one of the alleged results of this behavior with ONTAP – the blogger was nice enough to do this calculation for a system reading at 2.6GB/s:




I’m not sure how to interpret this. Are they saying that this alleged horrible, awful design flaw in ONTAP leads to customers buying 50X more drives than required, and our evil sales teams have somehow fooled our customer base into believing this was necessary? Or, is this a claim that ZFS arrays have some kind of amazing ability to use 50X fewer drives?

Given the false premise about ONTAP chopping up any and all IO’s into little 4K blocks and spraying them over the drives, I’m guessing readers are supposed to believe the first interpretation.

Ordinarily, I enjoy this type of marketing. Customers bring this to our attention, and it allows us to explain how things actually work, plus it discredits the account team who provided the information. There was a rep in the UK who used to tell his customers that Oracle had replaced all competing storage arrays in their OnDemand centers with Pillar. I liked it when he said stuff like that. The reason I’m responding is not because I care about the existence of the other blog, but rather that I care about openly false information being spread about how ONTAP works.

How does ONTAP really work?

Some of NetApp’s marketing folks might not like this, but here’s my usual response:

Why does it matter?

It’s an interesting subject, and I’m happy to explain write tetrises and NVMEM write coalescence, and core utilization, but what does that have to do with your business? There was a time we dealt with accusations that NetApp was slow because we had 25 nanometer process CPU’s while the state of the art was 17nm or something like that. These days ‘cores’ seems to come up a lot, as if this happens:


That’s the Brawndo approach to storage sales:

“Our storage arrays contain

5 kinds of technology

which make them AWESOME

unlike other storage arrays which are


A Better Way

I prefer to promote our products based on real business needs. I phrase this especially bluntly when talking to our sales force:

When you are working with a new enterprise customer, shut up about NetApp for at least the first 45 minutes of the discussion

I say that all the time. Not everyone understands it. If you charge into a situation saying, “NetApp is AWESOME, unlike EMC who is NOT AWESOME” the whole conversation turns into PowerPoint wars, links to silly blog articles like the one that prompted this discussion, and whoever wins the deal will win it based on a combination of luck and speaking ability. Providing value will become secondary.

I’m usually working in engineeringland, but in major deals I get involved directly. Let’s say we have a customer with a database performance issue and they’re looking for new storage. I avoid PowerPoint and usually request Oracle AWR/statspack data. That allows me to size a solution with extreme accuracy. I know exactly what the customer needs, I know their performance bottlenecks, and I know whatever solution I propose will meet their requirements. That reduces risk on both sides. It also reduces costs because I won’t be proposing unnecessary capabilities.

None of this has anything to do with who’s got the better SPC-2 benchmark, unless you plan on buying that exact hardware described, configuring it exactly the same way, and then you somehow make money based on running SPC-2 all day.

Here’s an actual Oracle AWR report from a real customer using NetApp. I have pruned the non-storage related parameters to make it easier to read, and I have anonymized the identifying data. This is a major international insurance company calculating its balance sheet at end-of-month. I know of at least 9 or 10 customers that have similar workloads and configurations.


Look at the line that says “Physical reads”. That’s the blocks read per second. Now look at “Std Block Size”. That’s the block size. This is 90K physical block reads per second, which is 90K IOPS in a sense. The IO is predominantly db_file_scattered_read, which counter-intuitively is sequential IO. A parameter called db_file_multiblock_read_count is set to 128. This means Oracle is attempting to read 128 blocks at a time, which equates to 1MB block sizes. It’s a sequential IO read of a file.

Here’s what we got:

1)     89K read “IOPS”, sort of.

2)     Those 89K read IOPS are actually packaged as units of 8 blocks in a single 64k unit.

3)     3K write IOPS

4)     8MB/sec of redo logging.
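To keep "blocks per second" and "I/O requests per second" apart, here is a small sketch of the unit conversion. It assumes the 8K database block size mentioned elsewhere in this post; the variable names are illustrative only.

```python
KIB = 1024

block_size = 8 * KIB            # database block size (assumed 8K)
multiblock_read_count = 128     # db_file_multiblock_read_count

# Largest read Oracle will attempt in one request
max_request_bytes = multiblock_read_count * block_size

# What the AWR numbers above translate to
blocks_per_s = 89_000           # "Physical reads" per second
blocks_per_request = 8          # reads arrive packaged 8 blocks at a time
request_bytes = blocks_per_request * block_size
requests_per_s = blocks_per_s / blocks_per_request

print(f"max multiblock read: {max_request_bytes // KIB}KB")
print(f"observed request size: {request_bytes // KIB}KB")
print(f"actual read requests/s: {requests_per_s:,.0f}")
```

So the 89K "IOPS" are really on the order of 11,000 read requests of 64K each, which is why the figure carries a "sort of".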

The most important point here is that the customer needed about 800MB/sec of throughput, they liked the cost savings of IP, and the storage system is meeting their needs. They refresh with NetApp on occasion, so obviously they’re happy with the TCO.

To put a final nail in the coffin of the Oracle blogger’s argument, if we are really doing 89K block reads/sec, and those blocks are really chopped up into 4k units, that’s a total of about 180,000 4k IOPS that would need to be serviced at the disk layer, per the blogger’s calculation.

  • Our opposing blogger thinks that would require about 1000 disks in theory
  • This customer is using 132 drives in a real production system.

There’s also a ton of other data on those drives for other workloads. That’s why we have QoS – it allows mixed workloads to play nicely on a single unified system.

To make this even more interesting, the data would have been randomly written in 8k units, yet they are still able to read at 800MB/sec? How is this possible? For one, ONTAP does NOT break up individual IO’s into 4k units. It tries very, very hard to never break up an IO across disks, although that can happen on occasion, notably if you fill your system up to 99% capacity or do something very much against best practices.

The main reason ONTAP can provide good sequential performance with randomly written data is the blocks are organized contiguously on disk. Strictly speaking, there is a sort of ‘fragmentation’ as our competitors like to say, but it’s not like randomly spraying data everywhere. It’s more like large contiguous chunks of data are evenly distributed across the disks. As long as those contiguous segments are sufficiently large, readahead can ensure good throughput.

That’s somewhat of an oversimplification, but it would take a couple hours and a whiteboard to explain the complete details. 20+ years of engineering can’t exactly be summarized in a couple paragraphs. The document misrepresented by the original blog was clearly dated 2006 (and that was to slightly refresh the original posting back in the nineties), and while it’s still correct as far as I can see, it’s also lacking information on the enhancements and how we package data onto disks.

By the way, this database mentioned above? It’s virtualized in VMware too.

Why did I pick an example of only 90K IOPS?  My point was this customer needed 90K IOPS, so they bought 90K IOPS.

If you need this performance:


then let us know. Not a problem. This is from a large SAP environment for a manufacturing company. It beats me what they’re doing, because this is about 10X more IO than what we typically see for this type of SAP application. Maybe they just built a really, really good network that permits this level of IO performance even though they don’t need it.

In any case, that’s 201,734 blocks/sec using a block size of 8k. That’s roughly 1.6GB/sec, and it’s from a dual-controller FAS3220 configuration which is rather old (and was the smallest box in its range when it was new).

Using the bizarro-universe math from the other blog, these 200K IOPS would have been chopped up into 4k blocks and require a total of 400K back-end disk IOPS to service the workload. Divided by 125 IOPS/drive, we have a requirement for 3200 drives. It was ACTUALLY using more like 200 drives.
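Here is that bizarro-universe calculation as a small sketch, using the unrounded 201,734 figure. The helper name is made up for illustration, and the 4k split and 125 IOPS/drive are the other blog's premises, not how ONTAP behaves.

```python
import math


def naive_drive_count(host_iops, host_io_bytes,
                      split_bytes=4096, iops_per_drive=125):
    """Drives 'required' if every host I/O were split into 4k back-end I/Os."""
    fragments = math.ceil(host_io_bytes / split_bytes)
    backend_iops = host_iops * fragments
    drives = math.ceil(backend_iops / iops_per_drive)
    return backend_iops, drives


backend_iops, drives = naive_drive_count(201_734, 8 * 1024)
actual_drives = 200   # "more like 200 drives"

print(f"{backend_iops:,} back-end IOPS, {drives:,} drives in theory")
print(f"about {drives / actual_drives:.0f}x more drives than were actually in use")
```

The unrounded inputs give slightly more than the 400K and 3200 quoted above, and the conclusion is the same: the theory is off by more than an order of magnitude.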

We can do a lot more, especially with the newer platforms and ONTAP clustering, which would enable up to 24 controllers in the storage cluster. So the performance limits per cluster are staggeringly high.


To put a really interesting (and practical) twist on this, sequential IO in the Oracle realm is probably going to become less important. You know why? Oracle’s new in-memory feature. Several others and I were floored when we got the first debrief on Oracle In-Memory. I couldn’t have asked for a better implementation if I was in charge of Oracle engineering myself. Folks at NetApp started asking what this means for us, and here’s my list:

  1. Oracle customers will be spending less on storage.

That’s it. That’s my list. The data format on disk remains unchanged, the backup/restore process is the same, the data commitment process is the same. All the NetApp features that earned us around 12,500 Oracle customers are still applicable.

The difference is customers will need smaller controllers, fewer disks, and less bandwidth because they’ll be able to replace a lot of the brute-force full table scan activity with a little In-Memory magic. No, the In-Memory licenses aren’t free, but the benefits will be substantial.

SPC-2 Benchmarks and Engenio Purchases

The other blog demanded two additional answers:

1)     Why hasn’t NetApp done an SPC-2 benchmark?

2)     Why did NetApp purchase Engenio?


I personally don’t know why we haven’t done an SPC-2 benchmark with ONTAP, but they are rather expensive and it’s aimed at large sequential IO processing. That’s not exactly the prime use case for FAS systems, but not because they’re weak on it. I’ve got AWR reports well into the GB/sec, so it certainly can do all the sequential IO you want in the real world, but what workloads are those?

I see little point in using an ONTAP system for most (but certainly not all) such workloads because the features overall aren’t applicable. I’m aware of some VOD applications on ONTAP where replication and backups were important. Overall, if you want that type of workload, you’d specify a minimum bandwidth requirement, capacity requirement, and then evaluate the proposals from vendors. Cost is usually the deciding factor.

Engenio Acquisition

Again, my personal opinion here on why NetApp acquired Engenio.

Tom Georgens, our CEO, spent 9 years leading Engenio and obviously knew the company and its financials well. I can’t think of any better way to know you’re getting value for money than having someone in Georgens’ position making this decision.

Here’s the press release about it:

Engenio will enable NetApp to address emerging and fast-growing market segments such as video, including full-motion video capture and digital video surveillance, as well as high performance computing applications, such as genomics sequencing and scientific research.

Yup, sounds about right. That’s all about maximum capacity, high throughput, and low cost. In contrast, ONTAP is about manageability and advanced features. Those are aimed at different sets of business drivers.

Hey, check this out. Here’s an SEC filing:

Since the acquisition of the Engenio business in May 2011, NetApp has been offering the formerly-branded Engenio products as NetApp E-Series storage arrays for SAN workloads. Core differentiators of this price-performance leader include enterprise reliability, availability and scalability. Customers choose E-Series for general purpose computing, high-density content repositories, video surveillance, and high performance computing workloads where data is managed by the application and the advanced data management capabilities of Data ONTAP storage operating system are not required.

Key point here is “where the advanced data management capabilities of Data ONTAP are not required.” It also reflected my logic in storage decisions prior to joining NetApp, and it reflects the message I still repeat to account teams:

  1. Is there any particular feature in ONTAP that is useful for your customer’s actual business requirements? Would they like to snapshot something? Do they need asynchronous replication? Archival? SnapLock? Scale-out clusters with many nodes? Non-disruptive everything? Think carefully, and ask lots of questions.
  2. If the answer is “yes”, go with ONTAP.
  3. If the answer is “no”, go with E-Series.

That’s what I did. I probably influenced or approved around $5M in total purchases. It wasn’t huge, but it wasn’t nothing either. I’d guess we went ONTAP about 70% of the time, but I had a lot of IBM DS3K arrays around too, now known as E-Series.

“Dumb Storage”

I’ve annoyed the E-Series team a few times by referring to it as “dumb storage”, but I mean that in the nicest possible way. Its primary job is to just sit there and work. It needs to do it fast, reliably, and cost effectively, but on a day-to-day basis it’s not generally doing anything all that advanced.

In some ways, the reliability was a weakness. It was so reliable, that we forgot it was there at all, and we’d do something like changing the email server addresses, and forget to update the RAS feature of the E-Series. Without email notification, it can take a couple years before someone notices the LED that indicates a drive needs replacement.


So now it is OK to sell systems using “Raw IOPS”???

As the self-proclaimed storage vigilante, I will keep bringing these idiocies up as I come across them.

So, the latest “thing” now is selling systems using “Raw IOPS” numbers.

Simply put, some vendors are selling based on the aggregate IOPS the system will do based on per-disk statistics and nothing else.

They are not providing realistic performance estimates for the proposed workload, with the appropriate RAID type and I/O sizes and hot vs cold data and what the storage controller overhead will be to do everything. That’s probably too much work. 

For example, if one assumes 200x IOPS per disk, and 200 such disks are in the system, this vendor is showing 40,000 “Raw IOPS”.

This is about as useful as shoes on a snake. Probably less.

The reality is that this is the ultimate “it depends” scenario, since the achievable IOPS depend on far more than how many random 4K IOPS a single disk can sustain (just doing RAID6 could result in having to divide the raw IOPS by 6 where random writes are concerned – and that’s just one thing that affects performance, there are tons more!)
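To see how quickly "Raw IOPS" shrink, here is a rough sketch using the common write-penalty rule of thumb. It is a deliberately simplified model (no cache, no controller overhead, no I/O size effects), the 70/30 read/write mix is a hypothetical chosen for illustration, and the function names are made up.

```python
def raw_iops(disks, iops_per_disk):
    """The 'Raw IOPS' number: disks multiplied by per-disk IOPS."""
    return disks * iops_per_disk


def effective_iops(raw, read_fraction, write_penalty):
    """Host-visible IOPS when each random write costs `write_penalty` disk I/Os."""
    write_fraction = 1.0 - read_fraction
    return raw / (read_fraction + write_fraction * write_penalty)


raw = raw_iops(disks=200, iops_per_disk=200)
all_writes = effective_iops(raw, read_fraction=0.0, write_penalty=6)
mixed = effective_iops(raw, read_fraction=0.7, write_penalty=6)

print(f"raw: {raw:,}")
print(f"100% random writes, penalty 6: {all_writes:,.0f}")
print(f"70/30 read/write, penalty 6: {mixed:,.0f}")
```

Even this crude model takes the headline 40,000 down to well under half for a mixed workload, and a real system has many more variables than this.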

Please refer to prior articles on the subject such as the IOPS/latency primer here and undersizing here. And some RAID goodness here.

If you’re a customer reading this, you have the ultimate power to keep vendors honest. Use it!



An explanation of IOPS and latency

<I understand this extremely long post is redundant for seasoned storage performance pros – however, these subjects come up so frequently, that I felt compelled to write something. Plus, even the seasoned pros don’t seem to get it sometimes… :) >

IOPS: Possibly the most common measure of storage system performance.

IOPS means Input/Output (operations) Per Second. Seems straightforward. A measure of work vs time (not the same as MB/s, which is actually easier to understand – simply, MegaBytes per Second).

How many of you have seen storage vendors extolling the virtues of their storage by using large IOPS numbers to illustrate a performance advantage?

How many of you decide on storage purchases and base your decisions on those numbers?

However: how many times has a vendor actually specified what they mean when they utter “IOPS”? :)

For the impatient, I’ll say this: IOPS numbers by themselves are meaningless and should be treated as such. Without additional metrics such as latency, read vs write % and I/O size (to name a few), an IOPS number is useless.

And now, let’s elaborate… (and, as a refresher regarding the perils of ignoring such things when it comes to sizing, you can always go back here).


One hundred billion IOPS…


I’ve competed with various vendors that promise customers high IOPS numbers. On a small system with under 100 standard 15K RPM spinning disks, a certain three-letter vendor was claiming half a million IOPS. Another, a million. Of course, my customer was impressed, since that was far, far higher than the number I was providing. But what’s reality?

Here, I’ll do one right now: The old NetApp FAS2020 (the older smallest box NetApp had to offer) can do a million IOPS. Maybe even two million.

Go ahead, prove otherwise.

It’s impossible, since there is no standard way to measure IOPS, and the official definition of IOPS (operations per second) does not specify certain extremely important parameters. By doing any sort of I/O test on the box, you are automatically imposing your benchmark’s definition of IOPS for that specific test.


What’s an operation? What kind of operations are there?

It can get complicated.

An I/O operation is simply some kind of work the disk subsystem has to do at the request of a host and/or some internal process. Typically a read or a write, with sub-categories (for instance read, re-read, write, re-write, random, sequential) and a size.

Depending on the operation, its size could range anywhere from bytes to kilobytes to several megabytes.

Now consider the following most assuredly non-comprehensive list of operation types:

  1. A random 4KB read
  2. A random 4KB read followed by more 4KB reads of blocks in logical adjacency to the first
  3. A 512-byte metadata lookup and subsequent update
  4. A 256KB read followed by more 256KB reads of blocks in logical sequence to the first
  5. A 64MB read
  6. A series of random 8KB writes followed by 256KB sequential reads of the same data that was just written
  7. Random 8KB overwrites
  8. Random 32KB reads and writes
  9. Combinations of the above in a single thread
  10. Combinations of the above in multiple threads
…this could go on.

As you can see, there’s a large variety of I/O types, and true multi-host I/O is almost never of a single type. Virtualization further mixes up the I/O patterns, too.

Now here comes the biggest point (if you can remember one thing from this post, this should be it):

No storage system can do the same maximum number of IOPS irrespective of I/O type, latency and size.

Let’s re-iterate:

It is impossible for a storage system to sustain the same peak IOPS number when presented with different I/O types and latency requirements.


Another way to see the limitation…

A gross oversimplification that might help prove the point that the type and size of operation you do matters when it comes to IOPS. Meaning that a system that can do a million 512-byte IOPS can’t necessarily do a million 256K IOPS.

Imagine a bucket, or a shotshell, or whatever container you wish.

Imagine in this container you have either:

  1. A few large balls or…
  2. Many tiny balls
The bucket ultimately contains about the same volume of stuff either way, and it is the major limiting factor. Clearly, you can’t completely fill that same container with the same number of large balls as you can with small balls.
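The bucket idea can be put in numbers with a small sketch. It assumes the only limit is a fixed bytes-per-second ceiling (the "volume" of the bucket), which is itself a gross oversimplification; the names are illustrative.

```python
def iops_at_size(bandwidth_bytes_per_s, io_size_bytes):
    """IOPS achievable if a fixed bandwidth ceiling is the only limit."""
    return bandwidth_bytes_per_s / io_size_bytes


small_io = 512             # bytes
large_io = 256 * 1024      # bytes

# Data moved per second by a system doing a million 512-byte IOPS
bandwidth = 1_000_000 * small_io

large_iops = iops_at_size(bandwidth, large_io)
needed_for_million_large = 1_000_000 * large_io

print(f"bucket size: {bandwidth / 1e6:,.0f} MB/s")
print(f"256K IOPS that fit in the same bucket: {large_iops:,.0f}")
print(f"bandwidth needed for a million 256K IOPS: "
      f"{needed_for_million_large / 1e9:,.0f} GB/s")
```

Under that assumption, the same bucket that holds a million tiny I/Os holds fewer than two thousand large ones.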
IOPS containers

They kinda look like shotshells, don’t they?

Now imagine the little spheres being forcibly evacuated rapidly out of one end… which takes us to…


Latency matters

So, we’ve established that not all IOPS are the same – but what is of far more significance is latency as it relates to the IOPS.

If you want to read no further – never accept an IOPS number that doesn’t come with latency figures, in addition to the I/O sizes and read/write percentages.

Simply speaking, latency is a measure of how long it takes for a single I/O request to happen from the application’s viewpoint.

In general, when it comes to data storage, high latency is just about the least desirable trait, right up there with poor reliability.

Databases especially are very sensitive with respect to latency – DBs make several kinds of requests that need to be acknowledged quickly (ideally in under 10ms, and writes especially in well under 5ms). In particular, the redo log writes need to be acknowledged almost instantaneously for a heavy-write DB – under 1ms is preferable.

High sustained latency in a mission-critical app can have a nasty compounding effect – if a DB can’t write to its redo log fast enough for a single write, everything stalls until that write can complete, then moves on. However, if it constantly can’t write to its redo log fast enough, the user experience will be unacceptable as requests get piled up – the DB may be a back-end to a very busy web front-end for doing Internet sales, for example. A delay in the DB will make the web front-end also delay, and the company could well lose thousands of customers and millions of dollars while the delay is happening. Some companies could also face penalties if they cannot meet certain SLAs.

On the other hand, applications doing sequential, throughput-driven I/O (like backup or archival) are nowhere near as sensitive to latency (and typically don’t need high IOPS anyway, but rather need high MB/s).

Here’s an example from an Oracle DB – a system doing about 15,000 IOPS at 25ms latency. Doing more IOPS would be nice but the DB needs the latency to go a lot lower in order to see significantly improved performance – notice the increased IO waits and latency, and that the top event causing the system to wait is I/O:

AWR example

Now compare to this system (the data is in a different format, but you’ll get the point):

Notice that, in this case, the system is waiting primarily for CPU, not storage.

A significant amount of I/O wait is a good way to determine if storage is an issue (there can be other latencies outside the storage of course – CPU and network are a couple of usual suspects). Even with good latencies, if you see a lot of I/O waits it means that the application would like faster speeds from the storage system.

But this post is not meant to be a DB sizing class. Here’s the important bit that I think is confusing a lot of people and is allowing vendors to get away with unrealistic performance numbers:

It is possible (but not desirable) to have high IOPS and high latency simultaneously.

How? Here’s a, once again, oversimplified example:

Imagine 2 different cars, both with a top speed of 150mph.

  • Car #1 takes 50 seconds to reach 150mph
  • Car #2 takes 200 seconds to reach 150mph

The maximum speed of the two cars is identical.

Does anyone have any doubt as to which car is actually faster? Car #1 indeed feels about 4 times faster than Car #2, even though they both hit the exact same top speed in the end.

Let’s take it an important step further, keeping the car analogy since it’s very relatable to most people (but mostly because I like cars):

  • Car #1 has a maximum speed of 120mph and takes 30 seconds to hit 120mph
  • Car #2 has a maximum speed of 180mph, takes 50 seconds to hit 120mph, and takes 200 seconds to hit 180mph

In this example, Car #2 actually has a much higher top speed than Car #1. Many people, looking at just the top speed, might conclude it’s the faster car.

However, Car #1 reaches its top speed (120mph) far faster than Car #2 reaches that same 120mph.

Car #2 continues to accelerate (and, eventually, overtakes Car #1), but takes an inordinately long amount of time to hit its top speed of 180mph.

Again – which car do you think would feel faster to its driver?

You know – the feeling of pushing the gas pedal and the car immediately responding with extra speed that can be felt? Without a large delay in that happening?

Which car would get more real-world chances of reaching high speeds in a timely fashion? For instance, overtaking someone quickly and safely?

Which is why car-specific workload benchmarks like the quarter mile were devised: How many seconds does it take to traverse a quarter mile (the workload), and what is the speed once the quarter mile has been reached?

(I fully expect fellow geeks to break out the slide rules and try to prove the numbers wrong, probably factoring in gearing, wind and rolling resistance – it’s just an example to illustrate the difference between throughput and latency, I had no specific cars in mind… really).


And, finally, some more storage-related examples…

Some vendor claims… and the fine print explaining the more plausible scenario beneath each claim:

“Mr. Customer, our box can do a million IOPS!”

512-byte ones, sequentially out of cache.

“Mr. Customer, our box can do a quarter million random 4K IOPS – and not from cache!”

at 50ms latency.

“Mr. Customer, our box can do a quarter million 8K IOPS, not from cache, at 20ms latency!”

but only if you have 1000 threads going in parallel.

“Mr. Customer, our box can do a hundred thousand 4K IOPS, at under 20ms latency!”

but only if you have a single host hitting the storage so the array doesn’t get confused by different I/O from other hosts.

Notice how none of these claims are talking about writes or working set sizes… or the configuration required to support the claim.
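One way to sanity-check claims like these is Little’s Law, a general queueing result (not something any of the vendors above published): the average number of I/Os in flight equals IOPS multiplied by latency. A rough sketch, taking the claimed figures at face value and treating “under 20ms” as exactly 20ms:

```python
def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Little's Law: average I/Os in flight = throughput x time in system."""
    return iops * latency_ms / 1000.0


def latency_ms_at(iops: float, in_flight: float) -> float:
    """Average latency (ms) implied by an IOPS rate and a number of I/Os in flight."""
    return in_flight / iops * 1000.0


if __name__ == "__main__":
    # (label, claimed IOPS, claimed latency in ms), figures from the claims above
    claims = [
        ("250K random 4K IOPS at 50ms", 250_000, 50),
        ("250K 8K IOPS at 20ms", 250_000, 20),
        ("100K 4K IOPS at 20ms", 100_000, 20),
    ]
    for label, iops, latency in claims:
        print(f"{label}: ~{outstanding_ios(iops, latency):,.0f} I/Os in flight")
```

The point of the exercise: a big IOPS number at high latency only exists if thousands of I/Os are outstanding at once. If your applications can’t generate that much parallelism, you will never see that number.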


What to look for when someone is making a grandiose IOPS claim

Audited validation and a specific workload to be measured against (that includes latency as a metric) both help. I’ll pick on HDS since they habitually show crazy numbers in marketing literature.

For example, from their website:



It’s pretty much the textbook case of unqualified IOPS claims. No information as to the I/O size, reads vs writes, sequential or random, what type of medium the IOPS are coming from, or, of course, the latency…

However, that very same box makes almost 270,000 SPC-1 IOPS with good latency in the audited SPC-1 benchmark:


Last I checked, 270,000 was almost 15 times less than 4,000,000. Don’t get me wrong, almost 270,000 low-latency IOPS is a great SPC-1 result, but it’s not 4 million SPC-1 IOPS.

Check my previous article on SPC-1 and how to read the results here. And if a vendor is not posting results for a platform – ask why.


Where are the IOPS coming from?

So, when you hear those big numbers, where are they really coming from? Are they just fictitious? Not necessarily. So far, here are just a few of the ways I’ve seen vendors claim IOPS prowess:

  1. What the controller will theoretically do given unlimited back-end resources.
  2. What the controller will do purely from cache.
  3. What a controller that can compress data will do with all zero data.
  4. What the controller will do assuming the data is at the FC port buffers (“huh?” is the right reaction, only one three-letter vendor ever did this so at least it’s not a widespread practice).
  5. What the controller will do given the configuration actually being proposed driving a very specific application workload with a specified latency threshold and real data.

The figures provided by the approaches above are all real, in the context of how the test was done by each vendor and how they define “IOPS”. However, of the (non-exhaustive) options above, which one do you think is the most realistic when it comes to dealing with real application data?


What if someone proves to you a big IOPS number at a PoC or demo?

Proof-of-Concept engagements or demos are great ways to prove performance claims.

But, as with everything, garbage in – garbage out.

If someone shows you IOmeter doing crazy IOPS, use the information in this post to help you at least find out what the exact configuration of the benchmark is. What’s the block size? Is the I/O random, sequential or a mix? How many hosts are doing I/O? Is the config being short-stroked? Is it all coming out of cache?

Typically, things like IOmeter can be a good demo but that doesn’t mean the combined I/O of all your applications’ performance follows the same parameters, nor does it mean the few servers hitting the storage at the demo are representative of your server farm with 100x the number of servers. Testing with as close to your application workload as possible is preferred. Don’t assume you can extrapolate – systems don’t always scale linearly.


Factors affecting storage system performance

In real life, you typically won’t have a single host pumping I/O into a storage array. More likely, you will have many hosts doing I/O in parallel. Here are just some of the factors that can affect storage system performance in a major way:


  1. Controller, CPU, memory, interlink counts, speeds and types.
  2. A lot of random writes. This is the big one, since, depending on RAID level, the back-end I/O overhead could be anywhere from 2 I/Os (RAID10) to 6 I/Os (RAID6) per write, unless some advanced form of write management is employed.
  3. Uniform latency requirements – certain systems will exhibit latency spikes from time to time, even if they’re SSD-based (sometimes especially if they’re SSD-based).
  4. A lot of writes to the same logical disk area. This, even with autotiering systems or giant caches, still results in tremendous load on a rather limited set of disks (whether they be spinning or SSD).
  5. The storage type used and the amount – different types of media have very different performance characteristics, even within the same family (the performance between SSDs can vary wildly, for example).
  6. CDP tools for local protection – sometimes this can result in 3x the I/O to the back-end for the writes.
  7. Copy on First Write snapshot algorithms with heavy write workloads.
  8. Misalignment.
  9. Heavy use of space efficiency techniques such as compression and deduplication.
  10. Heavy reliance on autotiering (resulting in the use of too few disks and/or too many slow disks in an attempt to save costs).
  11. Insufficient cache with respect to the working set coupled with inefficient cache algorithms, too-large cache block size and poor utilization.
  12. Shallow port queue depths.
  13. Inability to properly deal with different kinds of I/O from more than a few hosts.
  14. Inability to recognize per-stream patterns (for example, multiple parallel table scans in a Database).
  15. Inability to intelligently prefetch data.


What you can do to get a solution that will work…

You should work with your storage vendor to figure out, at a minimum, the items in the following list, and, after you’ve done so, go through the sizing with them and see the sizing tools being used in front of you. (You can also refer to this guide).

  1. Applications being used and size of each (and, ideally, performance logs from each app)
  2. Number of servers
  3. Desired backup and replication methods
  4. Random read and write I/O size per app
  5. Sequential read and write I/O size per app
  6. The percentages of read vs write for each app and each I/O type
  7. The working set (amount of data “touched”) per app
  8. Whether features such as thin provisioning, pools, CDP, autotiering, compression, dedupe, snapshots and replication will be utilized, and what overhead they add to the performance
  9. The RAID type (R10 has an impact of 2 I/Os per random write, R5 4 I/Os, R6 6 I/Os – is that being factored in?)
  10. The impact of all those things on the overall headroom and performance of the array.
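To make the RAID item concrete, here is a back-of-the-envelope sketch of how the write penalties quoted above inflate back-end I/O. It assumes every front-end read becomes one back-end read (no cache hits) and every write is a small random write with no coalescing or other write optimization; the 10,000 IOPS, 30% write workload is a made-up example:

```python
# Back-end I/Os per random front-end write, as quoted in the list above.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(frontend_iops: float, write_fraction: float, raid: str) -> float:
    """Estimate back-end IOPS for a random workload on a given RAID type."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    reads = frontend_iops * (1.0 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * WRITE_PENALTY[raid]


if __name__ == "__main__":
    # Hypothetical workload: 10,000 front-end IOPS, 30% writes.
    for raid in WRITE_PENALTY:
        print(f"{raid}: ~{backend_iops(10_000, 0.30, raid):,.0f} back-end IOPS")
```

Real arrays will deviate from this (cache, full-stripe writes, and write management schemes all change the picture), which is exactly why you want to see the vendor’s sizing tool account for it in front of you.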

If your vendor is unwilling or unable to do this type of work, or, especially, if they tell you it doesn’t matter and that their box will deliver umpteen billion IOPS – well, at least now you know better :)




NetApp vs EMC usability report: malice, stupidity or both?

Most are familiar with Hanlon’s Razor:

Never attribute to malice that which is adequately explained by stupidity.

A variation of that is:

Never attribute to malice that which is adequately explained by stupidity, but don’t rule out malice.

You see, EMC sponsored a study comparing their systems to ones from the company they look up to and try to emulate. The report deals with ease-of-use (and I’ll be the first to admit the current iteration of EMC boxes is far easier to use than in the past and the GUI has some cool stuff in it). I was intrigued, but after reading the official-looking report posted by Chuck Hollis, I wondered who in their right mind would lend it credence, and ignored it since I have a real day job solving actual customer problems and can’t possibly respond to every piece of FUD I see (and I see a lot).

Today I’m sitting in a rather boring meeting so I thought I’d spend a few minutes to show how misguided the document is.

In essence, the document tackles the age-old dilemma of which race car to get by comparing how easy it is to change the oil, and completely ignores the “winning the race with said car” part. My question would be: “which car allows you to win the race more easily and with the least headaches, least cost and least effort?”

And if you think winning a “race” is just about performance, think again.

It is also interesting how the important aspects of efficiency, reliability and performance are not tackled, but I guess this is a “usability” report…

Strange that a company named “Strategic Focus” reduces itself to comparing arrays by measuring the number of mouse clicks. Not sure how this is strategic for customers. They were commissioned by EMC, so maybe EMC considers this strategic.

I’ll show how wrong the document is by pointing at just some of the more glaring issues, but I’ll start by saying a large multinational company has many PB of NetApp boxes around the globe and 3 relaxed guys to manage it all. How’s that for a real example? :)

  1. Page 2, section 4, “Methodology”: EMC’s own engineers set up the VNX properly. No mention of who did the NetApp testing, what their qualifications are, and so on. So, first question: “Do these people even know what they’re doing? Have they really used a NetApp system before?”
  2. Page 10, Table A, showing the configurations. A NetApp FAS3070 was used, running the latest code at this moment (8.0.1). Thanks EMC for the unintended compliment – you see, that system is 2 generations old (the current one is 3270 and the previous one is 3170) yet it can still run the very latest 64-bit ONTAP code just fine. What about the EMC CX3? Can it run FLARE31? Or is that a forklift upgrade? Something to be said for investment protection :)
  3. Page 3 table 5-1, #1. Storage pools on all modern arrays would typically be created upfront, so the wording is very misleading. In order to create a new LUN one does NOT NEED to create a pool. Same goes for all vendors.
  4. Same table and section (also mentioned in section 7): Figuring out the space available is as simple as going to the aggregate page, where the space is clearly shown for the aggregates. So, unsure what is meant here.
  5. Regarding LUN creation… Let me ask you a question: After you create a LUN on any array, what do you need to do next? You see, the goal is to attach the LUN to a host, do alignment, partition creation, multipathing and create a filesystem and write stuff to it. You know, use it. NetApp largely automates end-to-end creation of host filesystems and, indeed, does not need an administrator to create a LUN on the array at all. Clearly the person doing the testing is either not aware of this or decided to omit this fact.
  6. Page 4, item 4 (thin provisioning). Asinine statement – plus, any NetApp LUN can be made thick or thin with a single click, whereas a VNX needs to do a migration. Indeed, NetApp does not complicate things whether thin or thick is required, does not differentiate between thin and thick when writing, and therefore does not incur a performance penalty, whereas EMC does (according to EMC documentation).
  7. Page 4, item 5 (Creation of virtual CIFS servers). The Multistore feature is free of charge on all new systems and allows one to create fully segregated, secure multitenancy virtual CIFS, NFS and iSCSI NetApp “partitions” – far beyond the capabilities of EMC. Again, misleading.
  8. Page 4, item 6 (growing storage elements). No measurable difference? Kindly show all the steps to grow a LUN until the new space is visible from the host side. End-to-end is important to real users since they want to use the storage. Or maybe not, for the authors of this document.
  9. Page 5, Item 1. We are really talking here about EMC snapshots? Seriously? Versus NetApp? To earn the right to do so assumes your snapshots are a usable and decent feature and that you can take a good number of them without the box crumbling to pieces. Ask any vendor about a production array with the most snaps and ask to talk to the customer using it. Then compare the number of snaps to a typical NetApp customer’s. Don’t be surprised if one number is a few hundred times less than the other.
  10. Page 5, item 3 (storage tiering): part of a longer conversation but this assumes all arrays need to do tiering. If my solution is optimized to the level that it doesn’t need to do this but yours is not optimized so it needs tiering, why on earth am I being penalized for doing storage more efficiently than you? (AKA the “not invented here” syndrome).
  11. Page 6 item 1 (VMware awareness): NetApp puts all the awareness inside vCenter and, indeed, datastore creation (including volume/LUN/NAS creation and resizing), VM cloning etc. are all done from within vCenter itself. Ask for a demo and prepare to be amazed.
  12. Page 6, item 2 – (dedupe/compress individual VMs): This one had my blood boiling. You see, EMC cannot even dedupe individual VMs (impossible, given the fact that current DART code only does dedupe at the file and not block level and no two active VMs will ever be exactly the same), can’t dedupe at all for block storage (maybe in the future but not today), and in general doesn’t recommend compression for VMs! Ask to see the best practices guide that states all this is supported and recommended for active production VMs, and to talk to a customer doing it at scale (not 10 VMs). A feature you can theoretically turn on but that will never work is not quite useful, you see…
  13. Page 8, entire table: Too much to comment on, suffice it to say that NetApp systems come with tools not mentioned in this report that go so far beyond what Unisphere does that it’s not even funny (at no additional cost). Used by customers that have thousands of NetApp systems. That’s how much those tools scale. EMC would need vast portions of the Ionix suite to do anything remotely similar (at $$$). Of course, mentioning that would kinda derail this document… and the piece about support and upgrades is utterly wrong, but I like to keep the surprise for when I do the demos and not share cool IP ideas here :)
  14. Page 11, Table B1: In the end, the funniest one of all! If you add up the total number of mouse clicks, NetApp needed 92 vs EMC’s 111. Since the whole point of this usability report is to show overall ease of use by measuring the total number of clicks to do stuff, it’s interesting that they didn’t do a simple total to show who won in the end… :)

I could keep going but I need to pay attention to my meeting now since it suddenly became interesting.

Ultimately, when it comes to ease of use, it’s simple to just do a demo and have the customer decide for themselves which approach they like best. Documents such as this one mean less than nothing for actual end users.

I should have another similar list showing clicks and TIME needed to do certain other things. For instance, using RecoverPoint (or any other method), kindly show the number of clicks and time (and disk space) for creating 30 writable clones of a 10TB SQL DB and mounting them on 30 different DB servers simultaneously. Maintaining unique instance names etc. Kinda goes a bit beyond LUN creation, doesn’t it? :)

All this BTW doesn’t mean any vendor should rest on their laurels and stop working on improving usability. It’s a never-ending quest. Just stop it with the FUD, please…

Finishing with something funny: Check this video for a good demonstration of something needing few clicks yet not being that easy to do.

Comments welcome.



EMC conclusively proves that VNX bottlenecks NAS performance

A bit of a controversial title, no?

Allow me to elaborate.

EMC posted a new SPEC SFS result as part of a marketing stunt (which is working, look at what I’m doing – I’m talking about them, if only to clear the air).

In simple terms, EMC got almost 500,000 SPEC SFS NFS IOPS (not to be confused with, say, block-based SPC-1 IOPS) with the following configuration:

  1. Four (4) totally separate VNX arrays, each loaded with SSD storage, utterly unaware of each other (8 total controllers since each box has 2)
  2. Five (5) Celerra VG8 NAS heads/gateways (1 spare), one on top of each VNX box
  3. 2 Control Stations
  4. 8 exported filesystems (2 per VG8 head/VNX system)
  5. Multiple pools of storage (at least 1 per VG8) – not shared among the various boxes, no data mobility between boxes
  6. Only 60TB NAS space with RAID5 (or 15TB per box)

Now, this post is not about whether this configuration is unrealistic and expensive (almost nobody would pay $6m for merely 60TB of NAS, not today). I get it that EMC is trying to publish the best possible number by loading a bunch of separate arrays with SSD. It’s OK as long as everyone understands the details.

My beef has to do with how it’s marketed.

EMC is very vague about the configuration, unless you look at the actual SPEC website. In the marketing materials they just mention VNX, as in “The EMC VNX performed at 497,623 SPECsfs2008_nfs.v3 operations per second”. Kinda like saying it’s OK to take three 5-year-olds and a 6-year-old to a bar because their ages add up to 21.

No – the far more accurate statement is “four separate VNXs working independently and utterly unaware of each other did 124,405 SPECsfs2008_nfs.v3 operations per second each”.

All EMC did was add up the result of 4 boxes.

Heck, that’s easy to do!

NetApp already has a result for the 6240 (just 2 controllers doing a respectable 190,675 SPEC NFS ops taking care of NAS and RAID all at once since they’re actually unified, no cornucopia of boxes there) without using Solid State Drives (common SAS drives plus a large cache were used instead – a standard, realistic config we sell every day, and not a “lab queen”).

If all we’re doing is adding up the result of different boxes, simply multiply this by 4 (plus we do have Cluster-Mode for NAS so it would count as a single clustered system with failover etc. among the nodes) and end up with the following result:

  1. 762,700 SPEC SFS NFS operations
  2. 8 exported filesystems
  3. 343TB usable with RAID-DP (thousands of times more resilient than RAID5)

So, which one do you think is the better deal? More speed, 343TB and better protection, or less speed, 60TB and far less protection? :)
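For anyone who wants to check the adding-up arithmetic, here it is spelled out as a tiny sketch. The ops and capacity figures are the ones quoted in this post; the multiply-by-four step for the 6240 is the same thought experiment as above, not a published benchmark result:

```python
# Figures quoted in the post.
EMC_TOTAL_OPS = 497_623      # published result across 4 separate VNX arrays
EMC_VNX_ARRAYS = 4
EMC_TOTAL_NAS_TB = 60
NETAPP_6240_OPS = 190_675    # published result for one 2-controller system
NETAPP_COPIES = 4            # hypothetical: "just add up 4 boxes"


def per_box(total: float, boxes: int) -> float:
    """Split an added-up result back into what each box actually did."""
    return total / boxes


def added_up(single: float, copies: int) -> float:
    """Apply the same 'add up N boxes' trick to a single-system result."""
    return single * copies


if __name__ == "__main__":
    print(f"Per VNX array: {per_box(EMC_TOTAL_OPS, EMC_VNX_ARRAYS):,.2f} ops/s, "
          f"{per_box(EMC_TOTAL_NAS_TB, EMC_VNX_ARRAYS):.0f}TB")
    print(f"Four 6240s added up: {added_up(NETAPP_6240_OPS, NETAPP_COPIES):,} ops/s")
```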

Customers curious about other systems can do the same multiplication trick for other configs, the sky is the limit!

The other, more serious part, and what prompted me to title the post the way I did, is that EMC’s benchmarking made pretty clear the fact that the VNX is the bottleneck, only able to really support a single VG8 head at top speed, necessitating 4 separate VNX systems to accomplish the final result. So, the fact that a VNX can have up to 8 Celerra heads on top of it means nothing since the back-end is your limiting factor. You might as well stick to a dual-head VG8 config (1 active 1 passive) since that’s all it can comfortably drive (otherwise why benchmark it that way?)

But with only 1 active NAS head you’d be limited to just 256TB max NAS capacity, since that’s how much total space a Celerra head can address as of the time of this writing. Which is probably enough for most people.

I wonder if the NAS heads that can be bought as a package with VNX are slower than VG8 heads, and by how much. You see, most people buying the VNX will be getting the NAS heads that can be packaged with it since it’s cheaper that way. How fast does that go? I’m sure customers would like to know, since that’s what they will typically buy.

I also wonder how fast it would be with RAID6.

Here’s a novel idea: benchmark what customers will actually buy!

So apples-to-apples comparisons can become easier instead of something like this:


For the curious: on the left you see an “Autumn Glory” Malus Floribunda (miniature apple). Photo courtesy of John Fullbright.



Questions to ask EMC regarding their new VNX systems…

It’s that time of the year again. The usual websites are busy with news of the upcoming EMC midrange refresh called VNX. And records being broken.

(NEWSFLASH: Watching the webcast now, the record they kept saying they would break ended up being some guy jumping over a bunch of EMC arrays with a motorcycle – and here I was hoping to see some kind of performance record…)

I’m not usually one to rain on anyone’s parade, but I keep seeing the word “unified” a lot and, based on what I’m seeing, it’s all more of the same, albeit with newer CPUs, a different faceplate, and (join the club) SAS. I’m sure the new systems will be faster courtesy of faster CPUs, more RAM and SAS. But are they offering something materially closer to a unified architecture?

Note that I’m not attacking anything in the EMC announcement, merely the continued “unified” claim. I’m sure the new Data Domain, Isilon and Vmax systems are great.

So here are some questions to ask EMC regarding VNX – I’ll keep this as a list instead of a more verbose entry to keep things easy for the ADD-afflicted and allow easier copy-paste into emails :)

  1. Let’s say I have a 100TB VNX system. Let’s say I allocate all 100TB to NAS. Then let’s say that all the 100TB is really chewed up in the beginning but after a year my real data requirements are more like 70TB. Can I take that 30TB I’m not using any more and instantly use it for FC? Since it’s “unified” and all? Without breaking best practices for LUN allocation to Celerra? Or is it forever tied to the NAS part and I have to buy all new storage if I don’t want to destroy what’s there and start from scratch?
  2. Is the VNX (or even the NS before it) 3rd-party verified as an over 5-nines system? (I believe the CX is but is the CX/NS combo?)
  3. How is the architecture of these boxes any different than before? It looks like you still have 2 CX SPs, then some NAS gateways. Seems like very much the same overall architecture and there’s (still) nothing unified about it. I call for some truth in advertising! Only the little VNXe seems materially different (not in the software but in the number of blades it takes to run it all).
  4. Are the new systems licenced by capacity?
  5. Can the new systems use more than the 2TB of FAST Cache?
  6. On the subject of cache, what is the best practice regarding the minimum number of SSDs to use for cache? Is it 8? How many shelves/buses should they be distributed on?
  7. What is the best practice regarding cache oversubscription and how is this sized?
  8. Since the FAST Cache can also cache writes, what are the ramifications if the cache fails? How many customers have had this happen? After all, we are talking about SSDs, and even mirrored SSDs are much less reliable than mirrored RAM.
  9. What’s the granularity for using RecoverPoint to replicate the NAS piece? Seems like it needs to replicate everything NAS as one chunk as a large consistency group, with Celerra Replicator needed for more granular replication.
  10. What’s the granularity for recovering NAS with RecoverPoint? Seems like you can’t do things by file or by volume even. The entire data mover may need to be recovered in one go, regardless of the volumes within.
  11. When using RecoverPoint, does one need to not use storage pools for certain operations? And what does that mean regarding the complexity of implementation?
  12. Speaking of storage pools, when are they recommended, when not, and why? And what does that mean about the complexity of administration?
  13. What functionality does one lose if one does not use pools?
  14. Can one prioritize FAST Cache in pool LUNs or is cache simply on or off for the entire pool?
  15. Can I do a data-in-place upgrade from CX3 or CX4 or is this a forklift upgrade?
  16. Why is FASTv2 not recommended for Exchange 2010 and various other DBs?
  17. If Autotiering is not really applicable to many workloads, what is it really good for?
  18. What is the percentage of flash needed to properly do autotiering on VNX? (it’s only 3% on VMAX since it uses a 7MB page, but VNX uses a 1GB page, which is far more inefficient). Why is FAST still at the grossly inefficient 1GB chunk?
  19. Can FAST on the VNX exclude certain time periods that can confuse the algorithms, like when backups occur?
  20. Is file-level FAST still a separate system?
  21. Why does the low-end VNXe not offer FC?
  22. Can I upgrade from VNXe to VNX?
  23. Does the VNXe offer FAST?
  24. Can a 1GB chunk span RAID groups or is performance limited to 1 RAID group’s worth of drives?
  25. Why are functions like block, NAS and replication still in separate hardware and software?
  26. Why are there still 2 kinds of snapshotting systems?
  27. Are the block snaps finally without a huge write performance impact? How about the NAS snaps?
  28. Are the snaps finally able to be retained for years if needed?
  29. Why are there 4 kinds of replication? (MirrorView, Celerra Replicator, RecoverPoint, SAN Copy)
  30. Why are there still all these OSes to patch? (Win XP in the SPs, Linux on the Control Station and RecoverPoint, DART on the NAS blades, maybe more if they can run Rainfinity and Atmos on the blades as well)
  31. Why still no dedupe for FC and iSCSI?
  32. Why no dedupe for memory and cache?
  33. Why not sub-file dedupe?
  34. Why is Celerra still limited to 256TB per data mover?
  35. Is Celerra still limited to 16TB per volume? Or is yet another, completely separate system (Isilon) needed to do that?
  36. Is Celerra still limited to not being able to share a volume between data movers? Or is, again, Isilon needed to do that?
  37. Can Celerra non-disruptively move CIFS and NFS volumes between data movers?
  38. Why can there not be a single FCoE link to transfer all the protocols if the boxes are “unified”?
  39. Have the thin provisioning performance overheads been fixed?
  40. Have the pool performance bottlenecks been fixed? Or is it still recommended to use normal RAID LUNs for highest performance?
  41. Can one actually stripe/restripe within a FLARE pool now? When adding storage? With thin provisioning?
  42. What is the best practice for expanding, say, a 50 drive pool? How many drives do I have to expand by? Why?
  43. Does one still need to do a migration to use thin provisioning?
  44. Does one need to do yet another migration to “re-thin” a LUN once it gets temporarily chunky?
  45. Have the RAID5 and RAID6 write inefficiencies been fixed? And how?
  46. Will the benchmarks for the new systems use RAID6 or will they, again, show RAID10? After all, most customers don’t deploy RAID10 for everything, and RAID5 is thousands of times less reliable than RAID6. How about some SPC-1 benchmarks?
  47. Why is EMC still not fessing up to using a filesystem for their new pools? Maybe because they keep saying doing so is not a “real” SAN, even in recent communication?
  48. Since EMC is using a filesystem in order to get functionality in the CX SPs like pools, thin provisioning, compression and auto-tiering (and probably dedupe in the future), how are they keeping fragmentation under control? (how the tables have turned!)

What I notice is a lack of thought leadership when it comes to technology innovation – EMC is still playing catch-up with other vendors in many important architectural areas, and keeps buying companies left and right to plug portfolio holes. All vendors play catch-up to some extent, the trick is finding the one playing catch-up in the fewest areas and leading in the most, with the fewest compromises.

Some areas of NetApp leadership to answer a question in the comments:

  • First Unified architecture (since 2002)
  • First with RAID that has the space efficiency of RAID5, the performance of RAID10 and the reliability of RAID6
  • First with block-level deduplication for all protocols
  • First with zero-impact snapshots
  • First with Megacaches (up to 16TB cache per system possible)
  • First with VMware integration including VM clones
  • First with space- and time-efficient, integrated replication for all protocols
  • First with snapshot-based archive storage (being able to store different versions of your data for years on nearline storage)
  • First with Unified Connect and FCoE – single cable capability for all protocols (FC, iSCSI, NFS, CIFS)

However, EMC is strong when it comes to marketing, messaging and – wait for it – the management part. Since it’s amazingly difficult to integrate all the technologies EMC has acquired over the years (heck, it’s taking NetApp forever to properly integrate Spinnaker and that’s just one other architecture), EMC is focusing instead on the management of the various bits (the current approach being Unisphere, tying together a subset of EMC’s acquisitions).

So, Unified Storage in EMC-speak really means unified management. Which would be fine if they were upfront about it. Somehow, “our new arrays with unified management but not unified architecture” doesn’t quite roll off the tongue as easily as “unified storage”.

Mike Riley eloquently explains whether it’s easier to fix an architecture or fix management here. Ultimately, unified management can’t tackle all the underlying problems and limitations, but it does allow for some very nice demos.

A cool GUI with frankenstorage behind it is like putting lipstick on a pig, or putting a nice shell on top of a car cobbled together from disparate bits. The underlying build is masked superficially, until it’s not… usually, at the worst possible time.

Sure, ultimately, management is what the end user interfaces with. Many people won’t really care about what goes on inside, nor have the time or inclination to learn. I merely invite them to start thinking more about the inner bits, because when things get tricky is also when something like a portal GUI meshing 4-5 different products together also stops working as expected, and that’s also when you start bouncing between 3-4 completely different support teams all trying to figure out which of the underlying products is causing the problem.

Always think in terms of what happens if something goes wrong with a certain subsystem and always assume things will break – only then can you have proper procedures and be prepared for the worst.

And always remember that the more complex a machine, the more difficult it can be to troubleshoot and fix when it does break (and it will break – everything does). There’s no substitute for clean and simple engineering.

Of course, Rube Goldberg-esque machines can be entertaining… if entertainment is what you’re after :)





FUD tales from the blogosphere: when vendors attack (and a wee bit on expanding and balancing RAID groups)

Haven’t blogged in a while, way too busy. Against my better judgment, I thought I’d respond to some comments I’ve seen on the blogosphere, adding one of my trademark extremely long titles. Part response, part tutorial. People with no time to read it all: Skip to the end and see if you know the answer to the question or if you have ideas on how to do such a thing.

It’s funny how some vendors won’t hesitate to wholeheartedly agree when some “independent” blogger criticizes their competition (before I get flamed, independent in quotes since, as I discussed before, there ain’t no such thing whether said blogger realizes it or not – being biased is a basic human condition).

The equivalent of someone posting in an Audi forum about excessive brake dust, and having guys from Mercedes and BMW chime in and claim how they “tested” Audis and indeed they had issues (but of course!) and how their cars are better now and indeed maybe Audi doesn’t have as much of a lead any more (if, indeed, they ever did). I think the term for that is “shill” but I can understand taking every opportunity to harm an opponent.

So the “Storage Architect” posted entries asking about certain features to be implemented on NetApp storage, one of them being able to reduce the size of an aggregate. Then everyone and their mum jumped on and complained how on earth such an important feature isn’t there :) BTW I’m not saying such a thing wouldn’t be useful to have from time to time. I’ll just try to explain why it’s tricky to implement and maybe ways to avoid problems.

For the uninitiated, a NetApp aggregate is a collection of RAID-DP RAID groups that are pooled and striped, so that I/O hits all the drives from all RAID groups equally for performance. You then carve volumes out of that aggregate (containers for NFS, CIFS, iSCSI, FC).

A pretty simple structure, really, but effective. Similar constructs are used by many other storage vendors that allow pooling.

So, the question was, why not be able to make an aggregate smaller? (you can already make it bigger on-the-fly, as well as grow or shrink the existing volumes within).

An HP guy then proceeded to complain about how he put too few drives in an aggregate and ended up with an imbalanced configuration while trying to test a NetApp box.

So, some basics: the following picture shows a well-balanced pool – notice the equal number of drives per RAID group:

The idea being that everything is load-balanced:

Makes sense, right?

You then end up with pieces of data across all disks, which is the intent. Growing it is easy – which is, after all, what 99.99% of customers ever want to do.

However, the HP dude didn’t have enough disks to create a balanced config with the default-sized RAID group (16). So he ended up with something like this, not performance-optimal:

So what the HP dude wanted to do, was to reduce the size of the RAID group and remove drives, even though he expanded the aggregate (and by extension the RAID group) originally.

Normally, before one starts creating pools of storage (with any storage system), one also knows (or should) what one has to play with in order to get the best overall config. It’s like “I want to build a 12-cylinder car engine, but I only have 9 cylinders”. Well – either buy more cylinders, or build an 8-cylinder engine! Don’t start building the 12-cylinder engine and go “oops” :) This is just Storage 101. Mistakes can and do happen, of course.

So, with the current state of tech, if I only had 20 drives to play with (and no option to get more), assuming no spares, I’d rather do one of the following:

  1. Aggregate with 10 + 10 RAID groups inside or
  2. Use all 20 drives in a single RAID group for max space
  3. Ask someone that knows the system better than I do for some advice

This is common sense and both doable and trivial with a NetApp system. The idea is you set the desired RAID group size for that aggregate BEFORE you put in disks. Not really difficult and pretty logical.

For instance, aggr options HPdudeAggr raidsize 10 before adding the drives would have achieved #1 above. Graphically, the Web GUI has that option in there as well, when you modify an aggregate. The option exists and it’s well-known and documented. Not knowing about it is a basic education issue. Arguing that no education should be needed to use a storage device (with an extreme number of features) properly even for deeply involved, low-level operations, is a romantic notion at best. Maybe some day. We are all working hard to make it a reality. Indeed, a lot of things that would take a really long time in the past (or still, with other boxes) have become trivialized – look at SnapDrive and the SnapManager products, for instance.

Back to our example: if, in the future, 10 more disks were purchased, and approach #1 above was taken, one would simply add the ten disks to the aggregate with aggr add HPdudeAggr 10. Resulting in a 10+10+10 config.

But what if I had done #2 above (make a 20-drive RAID group the default for that aggregate)?

Then, simply, you’d end up imbalanced again, with a 20+10. Some thought is needed before embarking on such journeys.

Maybe a better approach would be to add, say, a more reasonable number of drives to achieve good balance? Adding 12 more drives, for example, would allow for an aggregate with 16+16 drives. So, one could simply change the raidsize using aggr options HPdudeAggr raidsize 16, then add the 12 disks to the aggregate with aggr add HPdudeAggr -g all 12.

This would expand both RAID groups contained within the aggregate dynamically to 16 drives per, resulting in a 16+16 configuration. Which, BTW, is not something you can easily do with most other storage systems!
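
To make the balancing arithmetic above a bit more concrete, here is a minimal Python sketch. It is a toy model and not how ONTAP actually places disks: it assumes drives simply fill one RAID group up to the configured raidsize before a new group is started, and that each RAID-DP group gives up 2 drives to parity. It only describes the final layouts, not the steps taken to get there, and the function names are made up for illustration.

```python
PARITY_PER_GROUP = 2  # RAID-DP: two parity drives per RAID group


def raid_group_layout(total_drives, raidsize):
    """Fill RAID groups sequentially up to raidsize (simplified assumption)."""
    full_groups, leftover = divmod(total_drives, raidsize)
    layout = [raidsize] * full_groups
    if leftover:
        layout.append(leftover)
    return layout


def data_drives(layout):
    """Drives left for data once each group gives up its parity drives."""
    return sum(max(size - PARITY_PER_GROUP, 0) for size in layout)


def is_balanced(layout):
    """Balanced here simply means every RAID group has the same drive count."""
    return len(set(layout)) <= 1


scenarios = [
    ("20 drives, default raidsize 16", 20, 16),
    ("20 drives, raidsize 10", 20, 10),
    ("20 drives, raidsize 20", 20, 20),
    ("30 drives, raidsize 10", 30, 10),
    ("30 drives, raidsize 20", 30, 20),
    ("32 drives, raidsize 16", 32, 16),
]

for label, drives, raidsize in scenarios:
    layout = raid_group_layout(drives, raidsize)
    groups = "+".join(str(size) for size in layout)
    print(f"{label}: {groups}, data drives={data_drives(layout)}, "
          f"balanced={is_balanced(layout)}")
```

The point of the exercise: with 20 drives, the default raidsize gives the lopsided 16+4 (that is, 14+2 and 2+2), while setting the raidsize first gives either 10+10 or a single 20-drive group.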

Having said all that, I think that for people that are not storage savvy (or for the storage savvy that are suffering from temporary brain fog), a good enhancement would be for the interfaces to warn you about imbalanced final configs and show you what will be created in a nice graphical fashion, asking you if you agree (and possibly providing hints on how it could be done better).

I’m not aware of any other storage system that does that degree of handholding but hey, I don’t know everything.

Indeed, maybe the nature of the other posts was being bait so I’ll obligingly take the bait and ask the question so you can advertise your wares here: :)

Is anyone aware of a well-featured storage system from an established, viable vendor that currently (Aug 7, 2010, not roadmap or “Real Soon Now”) allows the creation of a wide-striped pool of drives with some RAID structures underneath; then allows one to evacuate and then destroy some of those underlying RAID groups selectively, non-disruptively, without losing data, even though they already contain parts of the stripes; then change the RAID layout to something else using those same existing drives and restripe without requiring some sort of data migration to another pool and without needing to buy more drives? Again, NOT for expansion, but for the shrinking of the pool?

To clarify even further: What the HP guy did was exactly this: He had 20 drives to play with and, by mistake, created a pool with 2 RAID groups, a 14+2 and a 2+2. How would your solution take those 2 RAID groups, with data, and change the config to something like 10 + 10 without needing more drives or the destruction of anything?

Can you dynamically reduce a RAID group? (NetApp can dynamically expand, but not reduce a RAID group).

I’m not implying such a thing doesn’t exist, I’m merely curious. I could see ways to make this work by virtualizing RAID further. Still, it’s just one (small) part of the storage puzzle.

The one without sin may cast the first stone! :)



Et tu, Brute? EMC offering capacity guarantees? The sky is falling! Will Chuck resign?

It came to my attention that EMC is offering a 20% efficiency guarantee vs the competition (they seem to be focusing on NetApp as usual but that’s beside the point in this post). See here.

Now, I won’t go ahead and attack their guarantee. Good luck with that, more power to you etc etc. They need all the competitive edge they can get.

No, what I’ll do is expose yet more EMC messaging inconsistency. If you’ve been following the posts in my site you’ll notice that I have absolutely nothing against EMC products – but I do have issues with how they’re sold and marketed and what they’ll say about the competition.

First and foremost: most major storage players, with the notable exception of EMC, have been offering some kind of efficiency guarantee. Sure, you needed to read the fine print to see if your specific use case would be covered (like with every binding document), but at least the guarantees were there. NetApp was first with our 50% efficiency guarantee, then came others (HDS and 3Par are just some that come to mind). We even offer a 35% guarantee if we virtualize EMC arrays :)

We all have different ways of getting the efficiency. NetApp has a combo of deduplication, thin provisioning, snapshots, highly efficient RAID and thin cloning, for instance. Others have a subset (3Par has their really good thin provisioning, for example). Regardless, we all tried to offer some measure of extra efficiency in these hard economic times.

And it’s not just marketing: I have multiple customers that, especially on virtualized environments, save at least 70% (that’s a real 70%, not 70% because we switched them from RAID10 to RAID-DP – literally, a 10TB data set is occupying 3TB). And for deployments like VDI, the savings are in the extreme range.

EMC’s stance was to, at a minimum, ridicule said guarantees. The inimitable Barry Burke (the storage anarchist) had this pretty funny post.

Chuck Hollis has been far more polemic about this – the worst was when he said he’d quit if EMC tried to do something similar (see here in the comments). BTW – we are all waiting for that resignation :) (on a more serious note, Chuck, if you don’t resign because of this, at least refrain from promising next time).

He also called other guarantees “shenanigans” here. I guess he’s really against the idea of guarantees.

But now it’s all good you see, EMC is offering a blanket 20% efficiency guarantee versus the competition! I.e. they will be able to provide 20% more actual usable storage or else they’ll give you free drives to cover the difference. You see, this guarantee is real, not like what all the other companies offer :)

Kidding aside, methinks they’re missing the point – this (to go back to my favorite car analogies) is like saying: “Both our car and your car have a 3-liter engine, but yours has twin turbos and a racing intercooler and 3 times the horsepower but we won’t take any of that into account, we will strictly examine whether you indeed have a 3-liter engine, and we’ll bore ours out to make it 3.6 liters for free”. Alrighty then. I’ll keep my turbos. But how will they deal with an existing NetApp customer that’s getting something like 3x efficiency already? Fulfilling the guarantee terms could get mighty expensive.

If a NetApp customer is getting 3x the usable storage due to deduplication and other means, will EMC come up with the difference or will they just make sure they offer 20% more raw storage?

To the customer, all that matters is how much effective storage they’re able to use, not how much raw storage is in the box.
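
A back-of-the-envelope sketch of that comparison, in Python. The 10TB starting point is an arbitrary illustrative figure; the 3x multiplier, the 20% uplift and the "10TB occupying 3TB" case come from the discussion above, and the function names are hypothetical.

```python
def effective_capacity(usable_tb, efficiency_multiplier):
    """Logical data that fits in usable_tb at a given efficiency multiplier."""
    return usable_tb * efficiency_multiplier


def savings_percent(logical_tb, physical_tb):
    """Space saved when logical_tb of data occupies physical_tb on disk."""
    return (1 - physical_tb / logical_tb) * 100


usable_tb = 10  # illustrative figure only, not from any real quote

with_3x_efficiency = effective_capacity(usable_tb, 3.0)
with_20_percent_more_disk = effective_capacity(usable_tb * 1.2, 1.0)
gap = with_3x_efficiency - with_20_percent_more_disk

print(f"10TB usable at 3x efficiency: {with_3x_efficiency:.0f}TB effective")
print(f"10TB usable plus 20% more disk: {with_20_percent_more_disk:.0f}TB effective")
print(f"Difference someone would have to cover: {gap:.0f}TB")
print(f"10TB data set occupying 3TB: {savings_percent(10, 3):.0f}% saved")
```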

But, still, this is not what this post is about.

Throughout the years, NetApp and other vendors have offered true innovation on different fronts. Each time that happens, EMC (that also innovates – through acquisition mostly – but likes to act as if nobody else does) employs their usual “minimize and divert” technique. Either they will trivialize the innovation (“who’d want to do that?”) or they will proclaim it false, then divert attention to something they already do (or will do in a few years).

This is even the case for technologies EMC eventually acquired, like Data Domain. Before EMC acquired Data Domain, they disparaged the product, claimed it was the worst kind of device you’d ever want in your datacenter, then tried to sell you the execrable DL3D (AKA Quantum DXi – don’t get me started, the first release was an utter mess).

We all know what happened to that story eventually: at the moment, EMC is offering to swap out existing DL3Ds for free in many cases, and put Data Domain in their place since it’s infinitely better. But wait, weren’t they saying how terrible Data Domain was compared to DL3D?

Some will say this is fine since they’re just trying to compete, and “all is fair”. Personally, if I were approached by sales teams with those about-face tactics, I’d be annoyed.

So, without further ado, I present you with a slide a colleague created. Some of the timing may be a bit off, but the gist should be fairly clear… :)

I could have added a few more lines (Flash Cache, for instance) but it would have made for too busy a slide.

EDIT: I’ll add something I posted as a comment on someone else’s blog that I think is germane.

Since, to provide apples-to-apples protection, EMC HAS to be configured with RAID6, where are the public benchmarks showing EMC RAID6? As you well know, ALL NetApp benchmarks (SPEC, SPC) are with RAID-DP. Any EMC benchmarks around are with RAID10.

Maybe another guarantee is needed:

Provide no worse protection, functionality, space and performance than X competitor.

Otherwise, you’re only tackling a relatively unimportant part of the big picture.



NetApp usable space – beyond the FUD

I come across all kinds of FUD, and some of the most ridiculous claims against NetApp regard usable space. I won’t post screenshots from competitive docs since who knows who’ll complain, but suffice it to say that one of the usual strategies against NetApp is to claim the system has something like well under 50% space efficiency using a variety of calculations, anecdotes and obsolete information. In one case, 34% usable space :) Right…

The purpose of this post is to outline the state of the art regarding NetApp usable space as of Spring of 2010.

Since NetApp systems can use free space in various ways instead of just for LUNs, there is frequent confusion regarding what each space-related parameter means, and what the best practices are. NetApp’s recommendations have changed over the years as the technology matured – my goal is to bring everybody up to speed.

Executive summary

Depending on the number and type of drives and the design, aside from edge cases dealing with small systems with a very low number of disks, the real usable space in NetApp systems can easily exceed 75% of the real usable space in the drives. I’ve seen it as high as about 78% of the actual space on the drives. That’s amazingly efficient for something with double-parity protection as default and includes spares. This number is the same whether it represents NAS or SAN data and doesn’t include deduplication, compression or space-efficient clones, which could inflate it to over 1000%. Indeed, NetApp systems are used in the biggest storage installations on the planet partly because they’re so space-efficient. Now, on to the details.

What’s space good for anyway?

Legacy arrays use space in very simple terms – you create RAID groups, then you create LUNs on them and those LUNs pretend they’re normal disks, and that’s that. Figuring out where your space goes is easy – there’s a 1:1 relationship between LUN size and space used on the array. You buy an array that can provide 10TB after RAID and spares, and that’s all you ever get – nothing more, nothing less.

Legacy arrays can sometimes use features such as snapshots, but frequently there are so many caveats around their use (performance being a big one) that either they’re never implemented, or their number is too small to make them really useful.

Since NetApp gear doesn’t suffer from those limitations, customers invariably end up using snapshots a lot, and for various reasons, not just backup. I have customers with over 10,000 snapshots in their arrays – they replicate all those snapshots to another array, can retrieve data that’s several months old, and have stopped relying on legacy backup software, saving money and achieving far faster and easier DR in the process, since with snapshots there’s no restore needed.

What’s your effective space with NetApp gear?

If you consider that each snapshot looks like a complete copy of your data, without factoring in any deduplication at all, the effective logical space could be many, many times more than the physical space. A large law firm I deal with manages to fit about 2.5PB of data into 8TB of snapshot delta space – which is pretty efficient by anyone’s standards. We’re not talking about backups done on deduplicated disk here that need to be restored to become useful – we’re talking about many thousands of straight-up, application-consistent, “full” copies of LUNs, CIFS and NFS shares that you can mount at full speed instantly, without needing to restore from another medium or backup application.

Once you add deduplication and thin cloning, the storage efficiency goes even higher.

It’s not the size of your disk that matters, it’s how you use it

If you use a NetApp system like a legacy disk array, without taking advantage of any of the advanced features (maybe you just care for the multi-protocol functionality, with great performance and reliability) then your usable space falls right within norms. Once you start using the advanced snapshot features, they start eating space of course – but giving you something in return. What you need to figure out is if the tradeoffs are worth it: for instance, if I can keep a month’s worth of Exchange backups with a nominal capacity increase, what is that worth for me? Maybe:

  • I can eliminate backup software licenses
  • I can shrink my storage footprint
  • Avoid purchasing external disk for backups
  • I don’t need to buy external CDP hardware/software and a bunch of extra disk
  • My restores take seconds
  • DR becomes trivial

Or, if I can create 150 clones of my SQL database that my developers can simultaneously use and only chew up a small fraction of the space I’d otherwise need, what is that worth? With other systems, I’d need 150x the space…

Or, create thousands of VM clones for VDI…

How much money are you saving?

What do simplicity and speed mean to your business from an OpEx savings standpoint?

Another way to look at it:

How much more efficient would your business be if you weren’t hampered by the limitations of legacy technology? It’s all about becoming aware of the expanded possibilities.

What you buy

FYI, and to clear any misconceptions in case you can’t be bothered to read the rest: if you ask me for a 10TB usable system, you’ll get a system that will truly provide 10TB usable, honest-to-goodness Base2 space protected against dual-drive failure (no RAID5 silliness), and after all overheads, spares etc. have been taken out. If you want snapshot space we’ll have to add some (like you’d need to with any other vendor). It’s as simple as that.

Right-sized, real space vs raw capacity

Others have explained some of this before but, for completeness, I’ll take a stab:

  • The real usable size of, say, a 450GB drive is not really 450GB regardless of the manufacturer.
  • The real usable capacity quoted depends on whether it’s Base2 or Base10 math and a bunch of other factors
  • All vendors that source drives from multiple manufacturers that use RAID groups need to right-size their drives – meaning that, if manufacturer A offers a tad more space in the drive than manufacturer B, in order to use both kinds of drives in the same RAID group, you kinda need to make them seem like the exact same size, meaning you go for the lowest common denominator between drive vendors.
  • Using our 450GB example above, the real addressable right-sized Base10 space in that drive is 438.3GB, and even less in Base2 (408.2). Base2 math simply means 1024 bytes in 1K, not 1000, and the rest follows.
  • Beware of analysis, comparisons or quotes showing Base10 from one vendor and Base2 from another, or raw disk space from one vendor vs right-sized from another! Always ask what base is what you’re seeing and whether the numbers reflect right-sized drives! If you look at the right-sized drive Base2 space from various vendors, it’s usually pretty close. Base your % usable calculations on that number and not the marketing 450GB number that’s not real for any vendor anyway.
  • Everyone pretty much buys the same drives from the same drive manufacturers
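
A quick way to sanity-check Base10 vs Base2 figures when comparing quotes, sketched in Python (the function name is mine). It only does the unit conversion and says nothing about how any vendor right-sizes a drive. Plain division puts 438.3GB at roughly 408 in Base2 terms; published figures can differ a little depending on rounding and on what exactly is being counted.

```python
def base10_gb_to_base2_gib(gb):
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return gb * 10**9 / 2**30


for label, gb in [("marketing size", 450.0), ("right-sized", 438.3)]:
    gib = base10_gb_to_base2_gib(gb)
    print(f"{label}: {gb:.1f} (Base10) is about {gib:.1f} (Base2)")
```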

Some space reservation axioms

Any system that allows snapshots, clones etc. typically needs some space for those advanced operations. For instance, if you completely fill up a system and then want to take a snapshot, it may let you but if you modify any data then it won’t have space to store the writes and the snapshot will be invalidated and deleted – kinda pointless.

As usual, there is no magic. If you expect to be able to store multiple snapshots, the system needs space to store the data changed between snapshots, regardless of array vendor!

And, out of curiosity – how many man-made devices do you own that you max out all the time? Not leaving breathing room is a recipe for trouble for any piece of equipment.

Explanation of the NetApp data organization

For the uninitiated, here’s a hierarchical list of NetApp structures:

  1. Disks
  2. RAID groups – made of multiple disks. Default RAID is RAID-DP. The system automatically makes them, you don’t need to define them or worry about back-end balancing etc. NetApp RAID groups are typically large, 16 disks or so. RAID-DP ensures better protection than RAID10 (the math shows 163x better than RAID10 and 4,000x better than RAID5).
  3. Parity drives – drives containing extra information that can be used to rebuild data. RAID-DP uses 2 parity drives per RAID group.
  4. Spares – drives that can replace failed or failing drives (no need to wait until the drive is truly dead)
  5. Aggregates – a collection of RAID groups and the basic unit from which space is allocated. That’s really what you define, then the system figures out automatically how to allocate disks and create RAID groups for you (can even expand RAID groups on the fly as you add more disks to the aggregate, even 1 disk at a time).
  6. Volumes – a container that takes space from an Aggregate. A volume can be NAS or SAN. A volume can only belong to one Aggregate, and there will typically be many volumes within an Aggregate. Most people will enable the automatic growing of Volumes.
  7. LUNs – they are placed inside the Volumes. One or more per volume, depending on what you’re trying to do. Usually one.
  8. Snapshots – logical, space-efficient copies of either entire Volumes or structures within volumes. There are 3 kinds depending on what you’re trying to do (Snapshot, Snapvault and Flexclone) but they all use similar underlying technology. I might get into the differences in a future post. Briefly: Snapshot – shorter term, Snapvault – longer term, Flexclone – writeable Snapshot.
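
As a rough illustration of why large RAID-DP groups matter for space, the Python sketch below compares parity overhead for a few layouts. The 14+2 RAID-DP group follows from the 16-disk group size and 2 parity drives described above; the RAID5 4+1 and RAID6 6+2 group sizes are assumptions I picked for comparison, not figures from any specific vendor.

```python
def parity_overhead(data_drives, protection_drives):
    """Fraction of a RAID group's drives not available for data."""
    return protection_drives / (data_drives + protection_drives)


layouts = {
    "RAID-DP 14+2 (16-drive group)": (14, 2),
    "RAID6 6+2 (assumed group size)": (6, 2),
    "RAID5 4+1 (assumed group size)": (4, 1),
    "RAID10 (mirrored pair)": (1, 1),
}

for name, (data, protection) in layouts.items():
    print(f"{name}: {parity_overhead(data, protection):.1%} overhead")
```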

Explanation of the NetApp space allocations

  1. Snapshot Reserve – an accounting feature that sets aside a logical percentage of space on a Volume. For instance, if you create a 10TB volume and set a 10% Snap Reserve, the client system will see 9TB usable. Most people will enable automatic deletion of Snapshots. The percentage to set aside is at your discretion and is variable on the fly. The actual amount of space consumed is related to your rate of change between snapshots. See here for some real averages across thousands of systems.
  2. Aggregate Snap Reserve – this is pretty unique. One can actually roll back an entire Aggregate on a NetApp system – can come in handy if you accidentally deleted whole Volumes or in general did some gigantic boo-boo. Rolling back the entire Aggregate will undo whatever was done to that aggregate to break it! This feature is enabled by default and has a 5% reservation. It is not mandatory unless you are running Syncmirror (mostly in Metrocluster setups). Depending on what you want to do, you could disable this altogether or set it to a small number like 1% (my recommendation).
  3. Fractional Reserve – The one that confuses everyone. In a nutshell: it’s a legacy safety net in case you want to modify all the data within a LUN yet still keep the snapshots. Think about it: Let’s say you took a snapshot and you then went ahead and modified every single block of your data. Your snap delta would balloon to the total size of the LUN – regardless of whether you use NetApp, EMC, XIV, Compellent, 3Par, HDS, HP etc etc. The data has to go someplace! There’s a great explanation in this document and I suggest you read it since it covers quite a bit more, too. This one is great, too. Long story short: With snapshot autodelete, and/or volume autogrow, you can set it to zero. If you use the SnapManager products, they take care of snapshot deletion themselves.
  4. System reserve – this is the only one that’s not optional. It’s set to 10% by default. You can actually change it but I’m not telling you how. That space is there for a reason, and changing it will potentially cause problems with high write rate environments. That 10% is used for various operations and has been found to be a good percentage to maintain good performance. All NetApp sizing takes this into account. BTW – ask other vendors if it’s perfectly safe to fill their systems at 100% all the time and whether that impacts performance or prevents them from being able to do certain things. And finally, that 10% lost is gained back in spades with the other NetApp efficiency methodologies (starting at the low level with RAID-DP – please do some simple math based on our 16+ drive RAID group vs typical RAID group sizes) so it doesn’t even matter.

Bottom line: Aside from the 10% system reserve, the rest is all usable space.
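
Here is one way to sketch how those reserves stack up, in Python. The 50-drive, 2-spare, 16-drive-group system is a made-up example, the function names are mine, and applying the aggregate snap reserve on top of whatever is left after the system reserve is my simplifying assumption. Percentages are relative to right-sized drive capacity, not the marketing number.

```python
import math


def usable_fraction(total_drives, spares, raid_group_size,
                    parity_per_group=2, system_reserve=0.10,
                    aggr_snap_reserve=0.0):
    """Rough usable fraction of right-sized capacity (simplified model)."""
    drives_in_groups = total_drives - spares
    groups = math.ceil(drives_in_groups / raid_group_size)
    data_drives = drives_in_groups - groups * parity_per_group
    after_reserves = data_drives * (1 - system_reserve) * (1 - aggr_snap_reserve)
    return after_reserves / total_drives


def client_visible_tb(volume_tb, snap_reserve):
    """Space the client sees in a volume after the snapshot reserve."""
    return volume_tb * (1 - snap_reserve)


no_aggr_reserve = usable_fraction(50, spares=2, raid_group_size=16)
default_aggr_reserve = usable_fraction(50, spares=2, raid_group_size=16,
                                       aggr_snap_reserve=0.05)

print(f"Aggregate snap reserve 0%: {no_aggr_reserve:.1%} usable")
print(f"Aggregate snap reserve 5%: {default_aggr_reserve:.1%} usable")
print(f"10TB volume, 10% snap reserve: {client_visible_tb(10, 0.10):.1f}TB visible")
```

With the aggregate snap reserve set to zero, this hypothetical layout lands at roughly 75% usable, dropping to about 72% with the default 5% left on.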

The NetApp defaults and some advice

So, here’s where it can get interesting (and confusing) and where the competition gets all their ammunition. Depending on the age of the documentation and firmware, different best practices and defaults apply.

So, if you look at competitive docs from other vendors, they claim that if you use NetApp for LUNs you waste double the space for fractional reserve. That recommendation was true many years ago and it was a safety precaution regarding fractional reserve. The documentation was updated years ago with zero fractional reserve as the recommendation, but of course that doesn’t help competitors so they left the old messaging. So here’s a basic list of quick recommendations for LUNs:

  1. Snap reserve – 0
  2. Fractional reserve – 0
  3. Snap autodelete on (unless you have SnapManager products managing the snap deletion)
  4. Volume autogrow on
  5. Leave at least a little space available in your volumes, don’t let a LUN 100% fill a volume (the LUN space can be thick but the volume space can be thin-provisioned). This space is needed for deduplication and other processes temporarily
  6. Do consider embracing thin provisioning, even if you don’t want to oversubscribe your disk. It’s much more flexible long-term, and allows for storage elasticity.

So, look at the defaults and ask your engineer if it’s OK to change them if they don’t agree with the settings above. Especially on older systems, I notice that the fractional reserve is still 100%, even after getting updated with the latest software (the update doesn’t change your config). Nothing like giving someone a bunch of disk space back with a few clicks…
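
To put a number on what the old default costs, here is a deliberately simplified Python sketch. It assumes the space set aside for a LUN is just its size plus the fractional reserve percentage of that size plus whatever snapshot delta you plan for; the real accounting depends on snapshots existing and on how much of the LUN has been written, so treat this as illustration only. The function name and the 1TB LUN are hypothetical.

```python
def volume_space_for_lun(lun_tb, fractional_reserve, snapshot_delta_tb=0.0):
    """Simplified: LUN size + overwrite reserve + planned snapshot delta."""
    overwrite_reserve = lun_tb * fractional_reserve
    return lun_tb + overwrite_reserve + snapshot_delta_tb


old_default = volume_space_for_lun(1.0, fractional_reserve=1.0)
recommended = volume_space_for_lun(1.0, fractional_reserve=0.0)

print(f"1TB LUN, fractional reserve 100%: {old_default:.1f}TB set aside")
print(f"1TB LUN, fractional reserve 0%: {recommended:.1f}TB set aside")
print(f"Space given back: {old_default - recommended:.1f}TB")
```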

If you want to do thin provisioning, depending on the firmware, you may see that using thin provisioning on a volume forces the fractional reserve to 100% – but, ultimately, no real space is being consumed. Was OK in 7.2x, changed to the 100% behavior in 7.3.1, fixed in 7.3.3 since it was confusing everyone.

The bottom line

Ultimately, I want you thinking of how you can use your storage as a resource that enables you to do more than just storing your LUNs. And, finally, I wanted to dispel notions that NetApp storage has less storage efficiency than legacy systems. Comments are always appreciated!