This is another FUD-busting post, the second in a two-part series (first part here).
It’s interesting how some competitors, in their quest to beat us at any cost, set aside all common sense.
Recently, an Oracle blogger attempted to understand a document NetApp originally wrote in the ’90s (and which we admittedly haven’t really updated since – our bad) that explains, at a high level, how WAFL, the block layout engine of Data ONTAP (the storage OS on the FAS platform), works.
Apparently, he thinks we turn everything into 4K I/Os: if someone tried to read 256K, it would have to become 64 separate I/Os. By extension, he believes this means no NetApp system running ONTAP can ever sustain good read throughput, since the back end would be inundated with IOPS.
The conclusions he comes to are interesting, to say the least. I will copy-paste one of the calculations he makes for a 100% read workload:
I like the SAS logo; I guess it’s meant to make the numbers look legit, as if they came from actual SAS testing.
So this person truly believes that to read 2.6GB/s we would need 5,120 drives, due to the insane back-end IOPS we purportedly generate.
This would be hilarious if it were true, since it would mean NetApp had quietly perpetrated the biggest high-tech scam in history, fooling customers for 22 years while somehow becoming, and remaining, the industry’s #1 storage OS.
Because customers are that gullible.
Well – here are some stats from a single 8040 controller (not an HA system with 2 or more controllers; I really do mean a single controller doing all the work) with 24 drives, doing over 2.7GB/s of reads at well under 1ms latency, so it’s not even stressed. Thanks to the Australian team for providing the stats:
In this example, 2.74GB/s is being read – from stable storage, not cache.
Now, if we do the math the way the competitor would like, it means the back end is running at over 700,000 4K IOPS. On a single mid-range controller.
That would be really impressive and hugely wasteful at the same time. Wait – maybe I should turn this around and claim 700,000 4K IOPS at 0.6ms capability per mid-range controller! Imagine how fast the big ones go!
It would also mean 35,000 IOPS per disk at a consistent sub-millisecond response time (0.64ms), because the numbers above come from a single node with only about 20 data SSDs (plus parity drives and spares).
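Here’s a quick back-of-the-envelope sketch of that math (a minimal sanity check, assuming binary gigabytes, the blogger’s fixed 4KiB I/O size, and the roughly 20 data SSDs from the example above):

```python
# Sanity check of the "everything becomes 4K I/Os" model.
# Assumptions: binary gigabytes (GiB) and a fixed 4KiB back-end I/O size,
# per the competitor's model; ~20 data SSDs, per the example above.

GIB = 2**30
throughput = 2.74 * GIB        # measured read throughput, bytes/sec
io_size = 4 * 1024             # the assumed 4KiB back-end I/O size

backend_iops = throughput / io_size
iops_per_ssd = backend_iops / 20

print(f"implied back-end IOPS: {backend_iops:,.0f}")  # ~718,000
print(f"implied IOPS per SSD:  {iops_per_ssd:,.0f}")  # ~36,000
```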
SSDs are fast, but they’re not that fast – and the purpose of this blog is to illuminate, not obfuscate.
Remember Occam’s razor. Which explanation makes more sense here: pixie-dust drives and controllers, or an Oracle blogger who is massively wrong?
Another example – with spinning disks this time
This is a different output, to also illustrate our ability to provide detailed per-disk statistics.
From a single 8060 node, running at over 3GB/s of reads during an actual RMAN job, not a benchmark tool (to use a real Oracle application example). There are 192x 10,000 RPM 600GB disks in the config: 168x data and 24x parity (we run dual-parity RAID; there were 12x 16-drive RAID groups in a 14+2 config).
Numbers kindly provided by the legendary neto from Brazil (@netofrombrazil on Twitter). Check the link for his blog and all kinds of DB coolness.
This is part of the statit command’s output. I’m not showing all the disks, since there are 192 of them and each one gets its own line in the output:
The key column in these stats is “chain”, which shows how many blocks were read as a single entity per read command. In this case, the average is about 49 blocks, or roughly 196KB per read operation.
Notice the “xfers” column – these drives are only doing about 88 physical IOPS each on average; each operation just happens to be large. They could go faster (see the “ut%” column), but that’s simply how much they were loaded during the RMAN job.
Again, if we used the blogger’s calculations, this system would have needed over 5,000 drives and generated over 750,000 back-end disk IOPS.
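Here’s a rough sketch of both calculations (using the rounded per-disk averages above, which is why the totals land a touch under the exact figures):

```python
# Sanity checks for the RMAN example, using rounded averages.
# Assumptions: 4KiB WAFL blocks, an average chain of ~49 blocks, ~88 physical
# IOPS per drive, 168 data drives, and a rule-of-thumb ~150 IOPS for a
# 10K RPM drive in the competitor's model.

block = 4 * 1024                      # WAFL block size, bytes
bytes_per_read = 49 * block           # "chain" column: ~49 blocks, ~196KB/read
per_drive_bw = 88 * bytes_per_read    # "xfers" column: ~88 physical IOPS/drive
total_bw = per_drive_bw * 168         # data drives only

print(f"what actually happened: ~{total_bw / 1e9:.1f} GB/s")        # ~3.0 GB/s

# The blogger's model: the same throughput chopped into 4KiB random I/Os.
iops_4k = total_bw / block
print(f"blogger's back-end IOPS: ~{iops_4k:,.0f}")                  # ~724,000
print(f"drives 'needed' at 150 IOPS each: ~{iops_4k / 150:,.0f}")   # ~4,800
```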
A public apology and retraction would be nice, guys…
Let’s extrapolate this performance at scale.
My examples are for single mid-range controllers. You can multiply that by 24 to see how fast a full cluster could go (yes, it scales linearly – at 2.7GB/s per node, that’s over 65GB/s). And that’s not the max these systems will do – just what was in the examples I found that were close to the competitor’s read performance example.
You see, where most of the competition is still dealing with 2-controller systems, NetApp FAS systems running Clustered ONTAP can have 8 engines for block workloads and 24 engines for NAS (8 if mixed). Each engine can have multiple TB of read/write cache (currently 18TB max cache per node with ONTAP 8.2.x).
Even if a competitor’s 2 engines are faster than 2 FAS engines, if they stop at 2 and FAS stops at 24, the fight is over before it begins.
People that live in glass houses shouldn’t throw stones.
Since the competitor questioned why NetApp bought Engenio (the acquisition behind our E-Series line), I have a similar question: why did Oracle buy Pillar Data? It was purchased after the Sun acquisition. Does that signify a major gap in the ZFS boxes that Pillar is supposed to address?
The Oracle blogger mentioned how their ZFS system got a great score in the SPC-2 benchmark (which measures throughput, not IOPS). Great.
Interestingly, Oracle ZFS systems can degrade significantly in performance over time (see here: http://blog.delphix.com/uday/2013/02/19/78/), especially after writes, deletes and overwrites. Unlike ONTAP systems, ZFS boxes have no mechanism to perform the block reallocations needed to optimize the data layout and bring performance back to original levels (backing up, wiping the box, rebuilding and restoring is not a solution, sorry). There are ways to delay the inevitable, but nothing that fixes the core issue.
It follows that the ZFS performance posted in the benchmarks may be nowhere near what one will get long-term, once the ZFS pools are fragmented and full – which makes the ZFS SPC-2 benchmark result pretty useless.
NetApp E-Series inherently doesn’t have this fragmentation problem (and is near the top as a price-performance leader in the SPC-2 benchmark, as tested by SGI, which resells it). Since there is no long-term speed deterioration with E-Series, the throughput you see in the SPC-2 benchmark will be maintained over the long term. The box is in it for the long haul.
Wouldn’t E-Series then be a better choice for a system that needs to constantly deal with such a workload? Both cost-effective and able to sustain high throughput no matter what?
As an aside, I do need to write an article on the block layout optimizations available in ONTAP. Many customers are unaware of the possibilities, and competitors use FUD based on observations from back when mud was a novelty. In the meantime, if you’re a NetApp FAS customer, ask your SE and/or check the documentation for the volume option read_realloc space_optimized (great for volumes containing DB data files), and for the aggregate option free_space_realloc.
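For instance, setting those options might look something like this (hypothetical volume and aggregate names, 7-Mode style syntax; check the documentation for the exact form in your ONTAP version):

```
# dbvol and aggr1 are hypothetical names – substitute your own
vol options dbvol read_realloc space_optimized
aggr options aggr1 free_space_realloc on
```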
So you’re fast. What else can you do?
There were other “fighting words” in the blogger’s article, all about speed and how much faster the competitor’s new boxes are than some ancient boxes they had from us. Amazing – new controllers being faster than old ones!
I see this trend lately: new vendors focusing solely on speed. Guess what – it’s easy to go fast. It’s also easy to be cheap (I’ll save that for a full post another time). But I fully accept that speed sells.
I can build you a commodity-based million-IOPS box during my lunch break. It’s really not that hard. Building a server with dozens of cores and TB of RAM is pretty easy.
But for Enterprise Storage, Reliability is extremely important, far more than sheer speed.
Plus Availability and Serviceability (where the RAS acronym comes from).
Non-Disruptive Operations, even during events that would leave other systems down for extended periods of time.
Extensive automation, management, monitoring and alerting at scale as well.
And of crucial importance is Application Integration, including the ability to perform application-aware data manipulation (fully consistent backups, restores, clones, replication).
So if a system can go fast but can’t do much else, it’s more of a point solution than a part of a large, strategic, long-term deployment. Point solutions are useful, yes – but they’re also interchangeable with the next cheap, fast thing. Most won’t survive.
You know who you are.