When competitors try too hard and miss the point – part two

This is the second FUD-busting post in the two-part series (first part here).

It’s interesting how some competitors, in their quest to beat us at any cost, set aside all common sense.

Recently, an Oracle blogger attempted to understand a document NetApp originally wrote in the 1990s (and which we haven’t really updated since, which is admittedly our bad) that explains, at a high level, how WAFL, the block layout engine of Data ONTAP (the storage OS on the FAS platform), works.

Apparently, he thinks we turn everything into 4K I/Os: if someone tried to read 256K, it would have to become 64 separate I/Os. By extension, he believes this means no NetApp system running ONTAP can ever sustain good read throughput, since the back end would be inundated with IOPS.
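
To make the claimed math concrete, here is a minimal sketch of that assumption in Python. This models the blogger’s claim, not how WAFL actually issues reads, and the function name and unit conversions are mine, purely for illustration:

```python
# The *claimed* model: every host read is chopped into fixed 4KiB back-end I/Os.
# This is the assumption being debunked below, not actual ONTAP behavior.

def backend_iops_if_everything_were_4k(read_mib_per_sec: float, block_kib: int = 4) -> float:
    """Back-end IOPS implied by the '4K-only' assumption for a given read rate."""
    return read_mib_per_sec * 1024 / block_kib

# A single 256KiB read would allegedly become 64 separate 4KiB I/Os:
print(256 // 4)  # 64

# And a steady 2.6GB/s of reads (the figure used in his example below) would
# allegedly require close to 700,000 back-end 4K IOPS:
print(round(backend_iops_if_everything_were_4k(2.6 * 1024)))  # ~681574
```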

The conclusions he comes to are interesting, to say the least. I will copy-paste one of the calculations he makes for a 100% read workload:

Erroneous Oracle calcs

I like the SAS logo; I guess it’s meant to make the numbers look legit, as if they came from actual SAS testing 🙂

So this person truly believes that to read 2.6GB/s we need 5,120 drives due to the insane back-end IOPS we purportedly generate 🙂

This would be hilarious if it were true, since it would mean NetApp managed to quietly perpetrate the biggest high-tech scam in history, fooling customers for 22 years while somehow becoming, and remaining, the industry’s #1 storage OS.

Because customers are that gullible.

Right.

Well – here are some stats from a single 8040 controller (not an HA system with at least two controllers; I really do mean a single controller doing the work), with 24 drives, doing over 2.7GB/s of reads at well under 1ms latency, so it’s not even stressed. Thanks to the Australian team for providing the stats:

8040 singlenode

In this example, 2.74GB/s are being read. From stable storage, not cache.

Now, if we do the math the way the competitor would like, it means the back-end is running at over 700,000 4K IOPS. On a single mid-range controller 🙂

That would be really impressive and hugely wasteful at the same time. Wait – maybe I should turn this around and claim 700,000 4K IOPS at 0.6ms capability per mid-range controller! Imagine how fast the big ones go!

It would also mean each disk was doing about 35,000 IOPS at a consistent sub-millisecond response time (0.64ms) – because the numbers above are from a single node with only about 20 data SSDs (plus parity and spares).
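
For the skeptical, here is the same reductio as a quick sanity check. The 2.74GB/s figure and the roughly 20 data SSDs come from the example above; the per-disk division is my own rough arithmetic:

```python
# What the 'everything becomes 4K' assumption would imply for the single
# FAS8040 example above (~2.74GB/s of reads served by roughly 20 data SSDs).

read_bytes_per_sec = 2.74 * 1024**3            # ~2.74GB/s, as in the screenshot
implied_4k_iops = read_bytes_per_sec / 4096    # back-end IOPS if every I/O were 4KiB
print(round(implied_4k_iops))                  # ~718,000 IOPS on one mid-range node

data_ssds = 20                                 # approximate data drive count
print(round(implied_4k_iops / data_ssds))      # ~36,000 IOPS per SSD, at ~0.64ms
```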

SSDs are fast but they’re not really that fast, and the purpose of this blog is to illuminate and not obfuscate.

Remember Occam’s razor. What explanation do you think makes more sense here? Pixie-dust drives and controllers, or that the Oracle blogger is massively wrong? 🙂

Another example – with spinning disks this time

This is a different output, to also illustrate our ability to provide detailed per-disk statistics.

This is from a single 8060 node running at over 3GB/s of reads during an actual RMAN job, not a benchmark tool (to use a real Oracle application example). There are 192x 10,000 RPM 600GB disks in the config (168x data, 24x parity – we run dual-parity RAID, with 12x 16-drive RAID groups in a 14+2 configuration).

Numbers kindly provided by the legendary neto from Brazil (@netofrombrazil on Twitter). Check the link for his blog and all kinds of DB coolness.

This is part of the statit command’s output. I’m not showing all the disks, since there are 192 of them and each one is a line in the output:

Read chain

The key in these stats is the “chain” column. This shows, per read command, how many blocks were read as a single entity. In this case, the average is about 49, or 196KB per read operation.

Notice the “xfers” – these drives are only doing about 88 physical IOPS on average per drive, and each operation just happens to be large. They could go faster (see the “ut%” column) but that’s just how much they were loaded during the RMAN job.
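
As a rough cross-check, the chain and xfers averages alone are enough to reconstruct the aggregate throughput. The data-disk count and the rounded averages below make this a ballpark reconstruction rather than an exact figure, but it lands right where you would expect:

```python
# Reconstructing the spinning-disk example from the statit averages quoted above:
# average op size = chain (in 4KiB WAFL blocks) x 4KiB, and total throughput is
# roughly per-disk xfers x op size x number of data disks.

chain_blocks = 49                  # average blocks read per physical read command
op_size_kib = chain_blocks * 4     # ~196KiB per physical read
xfers_per_disk = 88                # average physical ops/sec per drive
data_disks = 168                   # 12 RAID groups x 14 data drives each

throughput_gib_per_sec = xfers_per_disk * op_size_kib * data_disks / 1024**2
print(round(throughput_gib_per_sec, 2))  # ~2.76GiB/s -- the same ballpark as the
                                         # >3GB/s reported, given the rounded averages
```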

Again, if we used the blogger’s calculations, this system would have needed over 5,000 drives and generated over 750,000 back-end disk IOPS.

A public apology and retraction would be nice, guys…

Let’s extrapolate this performance at scale.

My examples are for single mid-range controllers. You can multiply that by 24 to see how fast it could go in a full cluster (yes, it’s linear). And that’s not the max these systems will do – just what was in the examples I found that were close to the competitor’s read performance example.

You see, where most of the competition is still dealing with 2-controller systems, NetApp FAS systems running Clustered ONTAP can run 8 engines for block workloads and 24 engines for NAS (8 if mixed), and each engine can have multiple TB of read/write cache (18TB max cache per node currently with ONTAP 8.2.x).

Even if a competitor’s 2 engines are faster than 2 FAS engines, if they stop at 2 and FAS stops at 24, the fight is over before it begins.

People who live in glass houses shouldn’t throw stones.

Since the competitor questioned why NetApp bought Engenio (the acquisition behind our E-Series), I have a similar question: why did Oracle buy Pillar Data? It was purchased after the Sun acquisition. Does that signify a major gap in the ZFS boxes that Pillar is supposed to address?

The Oracle blogger mentioned how their ZFS system had a great score in the SPC-2 tests (which measure throughput and not IOPS). Great.

Interestingly, Oracle ZFS systems can significantly degrade in performance over time (see here: http://blog.delphix.com/uday/2013/02/19/78/), especially after writes, deletes and overwrites. Unlike ONTAP systems, ZFS boxes don’t have mechanisms to perform the necessary block reallocations to optimize the data layout and bring performance back to its original levels (backing up, wiping the box, rebuilding and restoring is not a solution, sorry). There are ways to delay the inevitable, but nothing that fixes the core issue.

It follows that the ZFS performance posted in the benchmarks may not be anywhere near what one will get long-term once the ZFS pools are fragmented and full. That makes the ZFS SPC-2 benchmark result pretty useless.

NetApp E-Series inherently doesn’t have this fragmentation problem (and is near the top as a price-performance leader in the SPC-2 benchmark, as tested by SGI, which resells it). Since there is no long-term speed deterioration with E-Series, the throughput you see in the SPC-2 benchmark will be maintained over the long term. The box is in it for the long haul.

Wouldn’t E-Series then be a better choice for a system that needs to constantly deal with such a workload? Both cost-effective and able to sustain high throughput no matter what?

As an aside, I do need to write an article on the block layout optimizations available in ONTAP. Many customers are unaware of the possibilities, and competitors spread FUD based on observations from back when mud was a novelty. In the meantime, if you’re a NetApp FAS customer, ask your SE and/or check your documentation for the volume option read_realloc space_optimized – great for volumes containing DB data files. Also check the documentation for the aggregate option free_space_realloc.

So you’re fast. What else can you do?

There were other “fighting words” in the blogger’s article, and they were all about speed and how much faster the competitor’s new boxes are than some ancient boxes they had from us. Amazing – new controllers being faster than old ones! 🙂

I’ve seen this trend recently: new vendors focusing solely on speed. Guess what – it’s easy to go fast. It’s also easy to be cheap. I’ll save that for a full post another time. But I fully accept that speed sells.

I can build you a commodity-based million-IOPS box during my lunch break. It’s really not that hard. Building a server with dozens of cores and TB of RAM is pretty easy.

But for Enterprise Storage, Reliability is extremely important, far more than sheer speed.

Plus Availability and Serviceability (where the RAS acronym comes from).

Predictability.

Non-Disruptive Operations, even during events that would leave other systems down for extended periods of time.

Extensive automation, management, monitoring and alerting at scale as well.

And of crucial importance is Application Integration, including the ability to perform application-aware data manipulation (fully consistent backups, restores, clones, replication).

So if a system can go fast but can’t do much else, its utility is more as a point solution than as part of a large, strategic, long-term deployment. Point solutions are useful, yes – but they are also interchangeable with the next cheap fast thing. Most won’t survive.

You know who you are.

D

3 Replies to “When competitors try too hard and miss the point – part two”

  1. Hi gents!

    Another great post, with information which (in my experience) you just cannot get from 90% of sales managers even at NetApp distributor companies (to say nothing of competitors). And this is pretty sad…

    As far as I understand now, the generally accepted approach (widely used in training courses and vendor-comparison presentations) of using a table of different HDD types (FC, SAS, SATA, SSD) and their respective IOPS numbers is completely unacceptable when we talk about any kind of storage system (even DAS)? There are so many factors that can affect these “per drive IOPS” numbers – like the mentioned “write tetrises and NVMEM write coalescence” and the different kinds of flash caches – which make using that line of reasoning for performance calculations just wrong.
    As I can clearly see from this two-part post, NetApp has almost the cheapest and the highest IOPS number per drive (with its very rich feature set and high availability options, of course), with the assumption that you have chosen a system adequate for your application and its load profile. It seems to me now that this is one of the most important parts of the sale if the customer really wants to spend their money optimally.

    However, there is one question regarding this post which is still unclear to me: am I correct in understanding that the only conclusion we can draw from the various real NetApp systems’ read performance examples mentioned is that we can *never* say how a physical read I/O will be split or organized by the controller when it hits the HDDs, even if we know exactly what block size our application uses? So, in our argumentation before a customer, we should never descend to such IOPS comparisons, because we simply cannot predict these numbers for real-life random workloads?
    One good example, from another of your great colleagues (John Martin) with an excellent blog, which guided me to these thoughts (http://storagewithoutborders.wordpress.com/2010/07/19/data-storage-for-vdi-part-5-raid-dp-wafl-the-ultimate-write-accelerator):
    “Suppose, for example, that in writing data to a 32-block long region on a single disk in the RAID group, we find that there are 4 blocks already allocated that cannot be overwritten. First, we read those in, this will likely involve fewer than 4 reads, even if the data is not contiguous. We will issue some smaller number of reads (perhaps only 1) to pick up up the blocks we need and the blocks in between, and then discard the blocks in between (called dummy reads). When we go to write the data back out, we’ll send all 28 (32-4) blocks down as a single write operation, along with a skip-mask that tells the disk which blocks to skip over. Thus we will send at most 5 operations (1 write + 4 reads) to this disk, and perhaps as few as 2. The parity reads will almost certainly combine, as almost any stripe that has an already allocated block will cause us to read parity. So suppose we have to do a write to an area that is 25% allocated. We will write .75 * 14 * 32 blocks, or 336 blocks. The writes will be performed in 16 operations (1 for each data disk, 1 for each parity). On each parity we’ll issue 1 read. There are expected to be 8 blocks read from each disk, but with dummy reads we expect substantial combining, so lets assume we issue 4 reads per disk (which is very conservative). There are 4 * 14 + 2 read operations, or 58 read operations. Thus we expect to write 336 blocks in 58+16= 74 disk operations. “

    1. Hi Nick,

      Actually you can tell how I/O will be split if you are doing I/O to an aggregate with a uniform I/O size.

      My point when customers ask me this is:

      Who cares exactly how the I/Os are laid down?

      We virtualize that part, and overall system performance is a function of controller speed, spindle speed and count, and cache size when faced with a specific blend of I/O.

      Then prefetching also plays a role, depending on the workload.

      So when we size systems, we use an elaborate tool (web-based – called SPM; ask your local SE if you want to see it) that simulates overall I/O – you can be generic and feed it an overall I/O blend (including random vs. sequential, how much of the data is “hot”, and I/O sizes), or you can be specific and dedicate I/O profiles to certain pools (maybe a pool for large-block sequential I/O, for example).

      If you want to do all your own calculations, it’s far easier to do them for a simpler architecture like NetApp E-Series. That one can be configured with “old school” RAID as well as with the fancy DDP (which makes it harder to figure stuff out – the more intelligent the back end, the harder it is in general to manually calculate stuff for it).

      I would typically suggest a couple of different sizings based on I/O profiles (maybe day and/or time-dependent), and pick the one that needs the biggest configuration.

      Hope this helps.

      Thx

      D
