<Edited to add some more information on how SPC-1 works since there was some confusion based on the comments received>
We’ve been busy at NetApp… busy perfecting the industry’s only scale-out unified platform, among other things.
We’ve already released ONTAP 8.1, which, in Cluster-Mode, allows 24 nodes (each with up to 8TB cache) for NAS workloads, and 4 nodes for block workloads (FC and iSCSI).
With ONTAP 8.1.1 (released on June 14th), we increased the node count to 6 for block workloads and added some extra optimizations and features. FYI: the node count is just what’s officially supported now; there’s no hard limit.
After our record NFS benchmark results, people have been curious about the block I/O performance of ONTAP Cluster-Mode, so we submitted an SPC-1 benchmark result using part of the same gear left over from the SPEC SFS NFS testing.
To the people who think NetApp is not a fit for block workloads (typically the ones believing competitor FUD): These are among the best SPC-1 results for enterprise disk-based systems given the low latency for the IOPS provided (it’s possible to get higher IOPS with higher latency, as we’ll explain later on in this post).
This blog has covered SPC-1 tests before. A quick recap: The SPC-1 benchmark is an industry-standard, audited, tough, block-based benchmark (on Fibre Channel) that tries to stress-test disk subsystems with a lot of writes, overwrites, hotspots, a mix of random and sequential, write after read, read after write, etc. About 60% of the workload is writes. The I/O sizes vary widely, from small to large (so SPC-1 IOPS are decidedly not the same thing as fully random uniform 4KB IOPS and should not be treated as such).
The benchmark access patterns do have hotspots that are a significant percentage of the total workload. Such hotspots can be either partially cached if the cache is large enough or placed on SSD if the arrays tested have an autotiering system granular and intelligent enough.
If an array can perform well in the SPC-1 workload, it will usually perform extremely well under difficult, latency-sensitive, dynamically changing DB workloads and especially OLTP. The full spec is here for the morbidly curious.
The trick with benchmarks is interpreting the results. A single IOPS number, while useful, doesn’t tell the whole story with respect to the result being useful for real applications. We’ll attempt to assist in the deciphering of the results in this post.
Before we delve into the obligatory competitive analysis, some notes for the ones lacking in faith:
- There was no disk short-stroking in the NetApp benchmark (a favorite way for many vendors to get good speeds out of disk systems by using only the outer part of the disk – the combination of higher linear velocity and smaller head movement providing higher performance and reduced seeks). Indeed, we used a tuning parameter that uses the entire disk surface, no matter how full the disks. Look at the full disclosure report here, page 61. For the FUD-mongers out there: This effectively pre-ages WAFL. We also didn’t attempt to optimize the block layout by reallocating blocks.
- There was no performance degradation over time.
- Average latency (“All ASUs” in the results) was flat and stayed below 5ms during multiple iterations of the test, including the sustainability test (page 28 of the full disclosure report).
- No extra cache beyond what comes with the systems was added (512GB comes standard with each 6240 node, 3TB per node is possible on this model, so there’s plenty of headroom for much larger working sets).
- It was not a “lab queen” system. We used very few disks to achieve the performance compared to the other vendors, and it’s not even the fastest box we have.
As for what to look for in any benchmark result:
- High sustained IOPS (inconsistency is frowned upon).
- IOPS/drive (a measure of efficiency: 500 IOPS/drive is twice as efficient as 250 IOPS/drive, meaning far fewer drives are needed, which results in lower costs, a smaller physical footprint, etc.)
- Low, stable latency over time (big spikes are frowned upon).
- IOPS as a function of latency (do you get high IOPS but also very high latency at the top end? Is that a useful system?)
- The RAID protection used (RAID6? RAID10? RAID6 can provide both better protection and better space efficiency than mirroring, resulting in lower cost yet more reliable systems).
- What kind of drives were used? Ones you are likely to purchase?
- Was autotiering used? If not, why not? Isn’t it supposed to help in such difficult scenarios? Some SSDs would be able to handle the hotspots.
- The amount of hardware needed to get the stated performance (are way too many drives and controllers needed to do it? Does that mean a more complex and costly system? What about management?)
- The cost (some vendors show discounts and others show list price, so be careful there).
- The cost/op (which is the more useful metric – assuming you compare list price to list price).
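The efficiency and cost metrics in the list above are simple ratios, and a quick sketch makes them easy to compute from any full disclosure report. This is only an illustration: the function names, the 100,000 IOPS target and the $1,000,000 list price are made up; only the 500 versus 250 IOPS/drive comparison comes from the text.

```python
import math

def iops_per_drive(spc1_iops, drive_count):
    """Efficiency: benchmark IOPS divided by the number of drives used."""
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return spc1_iops / drive_count

def cost_per_iops(list_price, spc1_iops):
    """Cost/op: use list price for every vendor so the comparison is fair."""
    if spc1_iops <= 0:
        raise ValueError("spc1_iops must be positive")
    return list_price / spc1_iops

def drives_needed(target_iops, per_drive_iops):
    """Whole drives required to reach a target at a given IOPS/drive."""
    return math.ceil(target_iops / per_drive_iops)

# Hypothetical workload target, not taken from any disclosure report.
TARGET_IOPS = 100_000
for efficiency in (500, 250):
    print(f"{efficiency} IOPS/drive -> {drives_needed(TARGET_IOPS, efficiency)} drives")

# Hypothetical list price, for illustration only.
print(f"${cost_per_iops(1_000_000, TARGET_IOPS):.2f} per IOPS")
```

At the same target, the 250 IOPS/drive system needs twice as many drives (400 versus 200), which is exactly why IOPS/drive matters.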
In this analysis we are comparing disk-based systems. Pure-SSD (or plain old RAM) performance-optimized configs can (predictably) get very high performance and may be a good choice if someone has a very small workload that needs to run very fast.
The results we are focusing on, on the other hand, come from highly reliable, general-purpose systems that can provide high performance, low latency and high capacity at a reasonable cost to many hosts and applications, along with rich functionality (snaps, replication, megacaching, thin provisioning, deduplication, compression, multiple protocols incl. NAS etc. Whoops: none of the other boxes aside from NetApp do all this, but such is the way the cookie crumbles).
Here’s a list of the systems with links to their full SPC-1 disclosure where you can find all the info we’ll be displaying. Those are all systems with high results and relatively flat sustained latency results.
There are some other disk-based systems with decent IOPS results but if you look at their sustained latency (“Sustainability – Average Response Time (ms) Distribution Data” in any full disclosure report) there’s too high a latency overall and too much jitter past the initial startup phase, with spikes over 30ms (which is extremely high), so we ignored them.
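If you want to apply the same filter yourself, the check is easy to script once you have typed the sustainability response times out of a report. The sketch below is a minimal illustration, not an SPC tool: the function name, the warm-up handling and both sample series are invented. Only the 30ms spike threshold comes from the paragraph above.

```python
from statistics import mean

def screen_latency(samples_ms, spike_threshold_ms=30.0, warmup_samples=0):
    """Summarize a sustained-latency series after skipping the startup phase.

    samples_ms:      average response times (ms), one per measurement interval
    warmup_samples:  number of leading intervals to ignore (startup phase)
    """
    steady = samples_ms[warmup_samples:]
    if not steady:
        raise ValueError("no samples left after the warm-up phase")
    return {
        "mean_ms": mean(steady),
        "max_ms": max(steady),
        "spikes": sum(1 for s in steady if s > spike_threshold_ms),
    }

# Both series are hypothetical and only show the shape of the check.
flat_system = [9.0, 3.0, 3.2, 2.9, 3.1]
jittery_system = [8.0, 12.0, 35.0, 11.0, 41.0]

for name, series in (("flat", flat_system), ("jittery", jittery_system)):
    print(name, screen_latency(series, warmup_samples=1))
```

A result with any spikes past the startup phase, or a high mean, would be dropped from the comparison.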
Here’s a quick chart of the results sorted according to latency. In addition, the prices shown are the true list prices (which can be found in the disclosures) plus the true $/IO cost based on that list price (a lot of vendors show discounted pricing to make that seem lower):
…BUT THAT CHART SHOWS THAT SOME OF THE OTHER BIG BOXES ARE FASTER THAN NETAPP… RIGHT?
That depends on whether you value and need low latency or not (and whether you take RAID type into account). For the vast majority of DB workloads, very low I/O latencies are vastly preferred to high latencies.
To check this for yourself:
- Choose any of the full disclosure links you are interested in. Let’s say the 3Par one, since it shows both high IOPS and high latency.
- Find the section titled “Response Time – Throughput Curve”. Page 13 in the 3Par result.
- Check whether latency rises sharply as load is added to the system.
Shown below is the 3Par curve:
Notice how latency rises quite sharply after a certain point.
Now compare this to the NetApp result (page 13):
Notice how the NetApp result has in general much lower latency but, more importantly, the latency stays low and rises slowly as load is added to the system.
That is why the column “SPC-1 IOPS around 3ms” was added to the table. Effectively, what would the IOPS be at around the same latency for all the vendors?
Once you do that, you realize that the 3Par system is actually slower than the NetApp system if similarly low latency is desired. Plus it costs several times more.
You can get the exact latency numbers just below the graphs on page 13. The NetApp table (under the heading “Response Time – Throughput Data”) looks like this:
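One way to estimate “IOPS at around 3ms” from a published curve is to interpolate linearly between the two load points that bracket the target latency. The sketch below assumes that approach; it is not necessarily how the column in the table was derived, and the curve in the example is invented, not taken from any vendor’s report.

```python
def iops_at_latency(curve, target_ms):
    """Estimate IOPS at a target latency by linear interpolation.

    curve: (iops, latency_ms) pairs from a "Response Time - Throughput" table.
    Assumes latency does not decrease as load increases.
    Returns None if the target lies outside the measured latency range.
    """
    points = sorted(curve)  # order by increasing IOPS
    for (iops_lo, lat_lo), (iops_hi, lat_hi) in zip(points, points[1:]):
        if lat_lo <= target_ms <= lat_hi:
            if lat_hi == lat_lo:
                return float(iops_hi)
            fraction = (target_ms - lat_lo) / (lat_hi - lat_lo)
            return iops_lo + fraction * (iops_hi - iops_lo)
    return None

# Hypothetical curve: (IOPS, average response time in ms).
example_curve = [(10_000, 1.0), (50_000, 2.0), (80_000, 4.0), (100_000, 8.0)]
print(iops_at_latency(example_curve, 3.0))
```

With this made-up curve, the headline number would be 100,000 IOPS at 8ms, but the system delivers only about 65,000 IOPS at 3ms. That gap is the whole point of normalizing on latency.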
Indeed, of all the results compared, only the IBM SVC (with a bunch of V7000 boxes behind it) is faster than NetApp at that low latency point. Which neatly takes us to the next section…
WHAT IS THE 100% LOAD POINT?
I had to add this since it is confusing. The 100% load point does not mean the arrays tested were necessarily maxed out. Indeed, most of the arrays mentioned could sustain bigger workloads given higher latencies. 3Par just decided to show the performance at that much higher latency point. The other vendors decided to show the performance at latencies more palatable to Tier 1 DB workloads.
The SPC-1 load generators are simply told to run at a specific target IOPS, and that target is chosen to be the 100% load level, the goal being to balance cost, IOPS and latency.
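To make that concrete: the other load points on a response time curve are just fractions of whatever target the vendor picked. The sketch below assumes the percentages usually seen in full disclosure reports (10%, 50%, 80%, 90%, 95% and 100%); treat that list as an assumption and check the spec for the exact definition. The target value is hypothetical.

```python
# Load levels as whole percentages of the vendor-chosen target
# (assumed from typical full disclosure reports; verify against the spec).
LOAD_PERCENTS = (10, 50, 80, 90, 95, 100)

def load_points(target_iops):
    """Map each load percentage to the IOPS the load generators are asked for."""
    return {pct: target_iops * pct // 100 for pct in LOAD_PERCENTS}

# Hypothetical target: had the vendor chosen a higher number here, every
# point (including "100%") would move with it.
for pct, iops in load_points(100_000).items():
    print(f"{pct:>3}% load -> {iops} IOPS requested")
```

Nothing in this arithmetic says the array is saturated at 100%; it only says the vendor stopped there.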
JUST HOW MUCH HARDWARE IS NEEDED TO GET A SYSTEM TO PERFORM?
Almost any engineering problem can be solved given the application of enough hardware. The IBM result is a great example of a very fast system built by adding a lot of hardware together:
- 8 SVC virtualization engines plus…
- …16 separate V7000 systems under the SVC controllers…
- …each consisting of 2 more SVC controllers and 2 RAID controllers
- 1,920 146GB 15,000 RPM disks (not quite the drive type people buy these days)
- For a grand total of 40 Linux-based SVC controllers (8 larger and 32 smaller), 32 RAID controllers, and a whole lot of disks.
Putting aside for a moment the task of actually putting together and managing such a system, or the amount of power it draws, or the rack space consumed, that’s quite a bit of gear. I didn’t even attempt to add up all the CPUs working in parallel; I’m sure it’s a lot.
Compare that to the NetApp configuration:
- 6 controllers in one cluster
- 432 450GB 15,000 RPM disks (a pretty standard and common drive type as of the time of this writing in June 2012).
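For anyone who wants to double-check the tallies, the arithmetic from the two hardware lists above is short enough to script. All the counts come straight from those lists (the 6-controller, 432-disk configuration being the NetApp one); the variable names and the ratios at the end are just a convenience for comparison.

```python
# IBM configuration, as listed above.
svc_engines = 8            # larger SVC virtualization engines
v7000_systems = 16         # V7000 systems behind the SVC layer
svc_per_v7000 = 2          # smaller SVC controllers in each V7000
raid_per_v7000 = 2         # RAID controllers in each V7000
ibm_disks = 1920

ibm_svc_controllers = svc_engines + v7000_systems * svc_per_v7000
ibm_raid_controllers = v7000_systems * raid_per_v7000
ibm_total_controllers = ibm_svc_controllers + ibm_raid_controllers

# NetApp configuration, as listed above.
netapp_controllers = 6
netapp_disks = 432

print("IBM SVC controllers: ", ibm_svc_controllers)
print("IBM RAID controllers:", ibm_raid_controllers)
print("Controller ratio (IBM/NetApp):", ibm_total_controllers / netapp_controllers)
print("Disk ratio (IBM/NetApp):      ", round(ibm_disks / netapp_disks, 2))
```

That works out to 72 controllers of one kind or another against 6, and roughly 4.4 times as many disks.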
SOME QUESTIONS (OTHER VENDORS FEEL FREE TO RESPOND):
- What would performance be with RAID6 for the other vendors mentioned? NetApp always tests with our version of RAID6 (RAID-DP). RAID6 is more reliable than mirroring, especially when large pools are in question (not to mention more space-efficient). Most customers won’t buy big systems with all-RAID10 configs these days… (customers, ask your vendor. There is no magic – I bet they have internal results with RAID6, make them show you).
- Autotiering is the most talked-about feature it seems, with attributes that make it seem more important than the invention of penicillin or even the wheel, maybe even fire… However, none of the arrays mentioned are using any SSDs for autotiering (IBM published a result once: nothing amazing, draw your own conclusions). One would think that a benchmark that creates hot spots would be an ideal candidate… (and, to re-iterate, there are hotspots and of a percentage small enough to easily fit in SSD). At least IBM’s result proves that (after about 19 hours) autotiering works for the SPC-1 workload, which only sharpens the question: Why is nobody doing this if it’s supposed to be so great?
- Why are EMC and Dell unwilling to publish SPC-1 results? (they are both SPC members). They are the only 2 major storage vendors that won’t publish SPC-1 results. EMC said in the past they don’t think SPC-1 is a realistic test – well, only running your applications with your data on the array is ever truly realistic. What SPC-1 is, though, is an industry-standard benchmark for a truly difficult random workload with block I/O, and a great litmus test.
- For a box regularly marketed for Tier-1 workloads, the IBM XIV is, once more, suspiciously absent, even in its current Gen3 guise. It’s not like IBM is shy about submitting SPC-1 results.
- Finally, some competitors keep saying NetApp is “not true SAN”, “emulated SAN” etc. Whatever that means, maybe the NetApp approach is better after all… The maximum write latency of the NetApp submission was 1.91ms for a predominantly write workload.
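On the first question above (RAID6 versus mirroring), the space-efficiency part is plain arithmetic and worth seeing in numbers. A minimal sketch follows; the group sizes are hypothetical examples and not the RAID-DP group size used in the NetApp submission.

```python
def usable_fraction_mirroring():
    """RAID10: every block is stored twice, so half the raw capacity is usable."""
    return 0.5

def usable_fraction_double_parity(group_size):
    """RAID6-style group: two drives' worth of capacity go to parity."""
    if group_size < 3:
        raise ValueError("a double-parity group needs at least 3 drives")
    return (group_size - 2) / group_size

# Drive failures each scheme is guaranteed to survive within one group or pair.
GUARANTEED_FAILURES = {"mirroring": 1, "double parity": 2}

# Hypothetical group sizes, for illustration only.
for size in (8, 16):
    print(f"double parity, {size} drives per group: "
          f"{usable_fraction_double_parity(size):.1%} usable")
print(f"mirroring: {usable_fraction_mirroring():.1%} usable")
```

Mirroring can survive a second failure only if it doesn’t hit the partner of the first failed drive, whereas double parity survives any two failures in a group, and does so while leaving more of the raw capacity usable.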
To sum up, ONTAP Cluster-Mode:
- Allows for highly performant and dynamically-scalable unified clusters for FC, iSCSI, NFS and CIFS.
- Exhibits proven low latency while maintaining high performance.
- Provides excellent price/performance.
- Allows data on any node to be accessed from any other node.
- Moves data non-disruptively between nodes (including CIFS, which normally is next to impossible).
- Maintains the traditional NetApp features (write optimization, application awareness, snapshots, deduplication, compression, replication, thin provisioning, megacaching).
- Can use the exact same FAS gear as ONTAP running in the legacy 7-Mode for investment protection.
- Can virtualize other arrays behind it.