NetApp posts SPC-1 Top Ten Performance results for its high end systems – Tier 1 meets high functionality and high performance

It’s been a while since our last SPC-1 benchmark submission with high-end systems in 2012. Since then we launched all-new systems and went from ONTAP 8.1 to ONTAP 8.3 – big jumps in both hardware and software.

In 2012 we posted an SPC-1 result with a 6-node FAS6240 cluster  – not our biggest system at the time but we felt it was more representative of a realistic solution and used a hybrid configuration (spinning disks boosted by flash caching technology). It still got the best overall balance of low latency, high SPC-1 IOPS, price, scalability, data resiliency and functionality compared to all other spinning disk systems at the time.

Today (April 22, 2015) we published SPC-1 results with an 8-node all-flash high-end FAS8080 cluster to illustrate the performance of the largest current NetApp FAS systems in this industry-standard benchmark.

For the impatient…

  • The NetApp All-Flash FAS8080 SPC-1 submission places the system in the #5 performance spot in the SPC-1 Top Ten by performance list.
  • And #3 if you look at performance at 1ms latency.
  • Highest performing All-Flash Enterprise Unified system.
  • The NetApp system uses RAID-DP, similar to RAID-6, whereas the other entries use RAID-10 – performance would be far lower for the other entries with RAID-6
  • Price-performance-wise, the FAS8080 gets the #4 spot once adjusted for all list prices
  • In addition, the FAS8080 shows the best storage efficiency, by far, of any SPC-1 submission to date (and without using compression or deduplication).
  • The FAS8080 offers far more functionality than any other system in the list.

We also recently posted results with the NetApp EF560 – the other major hardware platform NetApp offers. See my post here and the official results here. That platform has a different value proposition – fewer features, but very low latency and great cost effectiveness are the key themes for the EF560.

In this post I want to explain the current Clustered ONTAP results and why they are important.

Flash performance without compromise

Solid state storage technologies are becoming increasingly popular.

The challenge with flash offerings from most vendors is that customers typically either have to give up a lot in order to get the high performance of flash, or have to combine 4-5 different products into a complex “solution” in order to satisfy different requirements.

For instance, dedicated all-flash offerings may not be able to natively replicate to less expensive, spinning-drive solutions.

Or, a flash system may offer high performance but not the functionality, scalability, reliability and data integrity of more mature solutions.

But what if you could have it all? Performance and reliability and functionality and scalability and maturity? That’s exactly what Clustered ONTAP 8.3 provides.

Here are some highlights of Clustered ONTAP 8.3 running on the FAS8080:

  • All the NetApp signature ultra-tight application integration and automation for replication, SnapShots, Clones
  • Fancy write-accelerated RAID6-equivalent protection by default
  • Comprehensive data integrity and protection against insidious lost write/torn page/misplaced write errors that RAID and normal checksums don’t always catch
  • Non-disruptive data mobility for all protocols
  • Non-disruptive operations – no downtime even when doing things that would require downtime and extensive PS with other vendors
  • Granular QoS
  • Deduplication and compression
  • Highly scalable – 5,760 drives possible in an 8-node cluster, 17,280 drives possible in the max 24 nodes. Various drive types, from SSD to SATA and everything else in between.
  • Multiprotocol (FC, iSCSI, NFS, SMB1,2,3) on the same hardware (no “helper” boxes needed, no dedicated SAN vs NAS pools needed)
  • 96,000 LUNs per 8-node cluster (that’s right, ninety-six thousand LUNs)
  • ONTAP is VMware vVol ready
  • The only array that has been validated by VMware for VMware Horizon 6 with vVols – hopefully the competitors will follow our lead
  • Over 460TB (yes, TeraBytes) of usable cache after all overheads are accounted for (and without accounting for cache amplification through deduplication and clones) in an 8-node cluster. Makes competitor maximum cache amounts seem like rounding errors – indeed, the actual figure might be 465TB or more, but it’s OK… :) (and 3x that number in a 24-node cluster, over 1.3PB cache!)
  • The ability to virtualize other storage arrays behind it
  • The ability to have a cluster with dissimilar size and type nodes – no need to keep all engines the same (unlike monolithic offerings). Why pay the same for all nodes when some nodes may not need all the performance? Why be forced to keep all nodes in the same hardware family? What if you don’t want to buy all at once? Maybe you want to upgrade part of the cluster with a newer-gen system? :)
  • The ability to evacuate part of a cluster and build that part as a different cluster elsewhere
  • The ability to have multiple disk types in a cluster and, indeed, dedicate nodes to functions (for instance, have a few nodes all-flash, some nodes with flash-accelerated SAS and a couple with very dense yet flash-accelerated NL-SAS, with full online data mobility between nodes)
That last bullet deserves a picture:
 MixedCluster

 

“SVM” stands for Storage Virtual Machine –  it means a logical storage partition that can span one or more cluster nodes and have parts of the underlying capacity (performance and space) available to it, with its own users, capacity and performance limits etc.

In essence, Clustered ONTAP offers the best combination of performance, scalability, reliability, maturity and features of any storage system extant as of this writing. Indeed – look at some of the capabilities like maximum cache and number of LUNs. This is designed to be the cornerstone of a datacenter.

It makes most other systems seem like toys in comparison…

Ships

FUD buster

Another reason we wanted to show this result was FUD from competitors struggling to find an angle to fight NetApp. It goes a bit like this: “NetApp FAS systems aren’t real SAN, it’s all simulated and performance will be slow!”

Right…

Drevilsimulated

Well – for a “simulated” SAN (whatever that means), the performance is pretty amazing given the level of protection used (RAID6-equivalent – far more resilient and capacity-efficient for large pooled deployments than the RAID10 the other submissions use) and all the insane scalability, reliability and functionality on tap :)

Another piece of FUD has been that ONTAP isn’t “flash-optimized” since it’s a very mature storage OS and wasn’t written “from the ground up”. We’ll let the numbers speak for themselves. It’s worth noting that we have been incorporating a lot of flash-related innovations into FAS systems well before any other competitor did so, something conveniently ignored by the FUD-mongers. In addition, ONTAP 8.3 has a plethora of flash optimizations and path length improvements that helped with the good latency results. And lots more is coming.

The final piece of FUD we made sure was addressed was system fullness – last time we ran the test we didn’t fill up as much as we could have, which prompted the FUD-mongers to say that FAS systems need gigantic amounts of free space to perform. Let’s see what they’ll come up with this time ;)

On to the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a 100% block-based benchmark with its own I/O blend and, as such, the results from any vendor SPC-1 submission should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend of the SPC-1 workload). Indeed, the tested configuration might perform in the millions of “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 Performance” systems page so you can compare other submissions that are in the upper performance echelon (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 list).

I recommend you look beyond the initial table in each submission showing the performance and $/IOPS and at least go to the actual price list to see the detail. For instance, HDS shows a 58% discount if you go to the detail here, and calculates their $/IOPS number based on the discounted price. Just be aware and remember – the only way to get a real price is to talk to your sales rep.
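To make the discount point concrete, here is a minimal sketch of how a discounted price can be walked back to list price before computing $/IOPS. The 58% figure is the one mentioned above; the dollar amount and IOPS number are made up purely for illustration, and the function names are my own.

```python
def list_price_from_discounted(discounted_price: float, discount_fraction: float) -> float:
    """Recover the list price from a discounted price.

    discount_fraction is the discount as a fraction, e.g. 0.58 for 58%.
    """
    if not 0 <= discount_fraction < 1:
        raise ValueError("discount_fraction must be in [0, 1)")
    return discounted_price / (1.0 - discount_fraction)


def dollars_per_iops(price: float, spc1_iops: float) -> float:
    """Price divided by SPC-1 IOPS."""
    return price / spc1_iops


# Hypothetical submission: $420,000 after a 58% discount, 100,000 SPC-1 IOPS.
discounted = 420_000.0
iops = 100_000.0
list_price = list_price_from_discounted(discounted, 0.58)

print(f"$/IOPS at discounted price: {dollars_per_iops(discounted, iops):.2f}")
print(f"$/IOPS at list price:       {dollars_per_iops(list_price, iops):.2f}")
```

With these invented numbers the list-price $/IOPS comes out at well over double the discounted one, which is why comparing list to list matters.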

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

  • Latency vs IOPS – many submissions will show high IOPS at huge latency, which would be rather useless when it comes to Flash storage
  • Sustainability – was performance even or are there constant huge spikes?
  • RAID level – most submissions use RAID10 for speed, what would happen with RAID6?
  • Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.

Let’s go over these one by one.

Latency vs IOPS

Our average latency was 1.23ms at 685,281.71 SPC-1 IOPS, and pretty flat over time during the test:

Response_time_complete

Sustainability

The SPC-1 rules state the minimum runtime should be 8 hours. We ran the test for 18 hours to observe if there would be variation in the performance. There was no significant variation:

IOdistributionRamp

RAID level

RAID-DP was used for all testing. This is mathematically analogous in protection to RAID-6. Given that these systems are typically deployed in very large pooled configurations, we elected long ago to not recommend single parity RAID since it’s simply not safe enough. RAID-10 is fast and fine for smaller capacity SSD systems but, at scale, it gets too expensive for anything but a lab queen (a system that nobody in their right mind will ever buy but which benchmarks well).

Application Utilization

Our Application Utilization was a very high 61.92% – unheard of among other vendors posting SPC-1 results, since they use RAID10 which, by definition, wastes half the capacity (plus spares and other overheads to worry about on top of that).

AppUtilization

Some vendors using RAID10 will fill up the resulting space after RAID, spares etc. to a very high degree, and call out the “Protected Application Utilization” as being the key thing to focus on.

This could not be further from the truth – Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used and signifies how space-efficient the storage was.

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% :)
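The quadruple-mirroring example can be written out as a tiny calculation. This is a deliberately simplified sketch: it only models mirroring and how full the post-mirroring space is, and ignores spares and other overheads that a real SPC-1 disclosure accounts for. The function name and parameters are mine, not SPC terminology.

```python
def application_utilization(raw_tb: float, mirror_copies: int, fill_fraction: float) -> float:
    """Capacity the benchmark consumed, as a fraction of total raw capacity.

    raw_tb        -- total raw capacity before any protection
    mirror_copies -- number of copies kept of each block (2 for RAID10, 4 for quad mirroring)
    fill_fraction -- how full the space left after mirroring is (1.0 = 100%)
    """
    usable_tb = raw_tb / mirror_copies
    used_tb = usable_tb * fill_fraction
    return used_tb / raw_tb


# The example from the text: 100TB raw, quadruple mirrored, filled to 100%.
print(application_utilization(100, 4, 1.0))  # 0.25

# RAID10 filled to the brim, still ignoring spares and other overheads.
print(application_utilization(100, 2, 1.0))  # 0.5
```

However full the post-mirroring space looks, the result can never exceed one over the number of copies.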

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the Top Ten Performance results both in absolute terms and also normalized around 1ms latency.

Here are the Top Ten highest performing systems as of April 22, 2015, with vendor results links if you want to look at things in detail:

  1. Hitachi Virtual Storage Platform G1000
  2. Kaminario K2
  3. Huawei OceanStor 18800
  4. IBM Power Server 780
  5. NetApp FAS8080
  6. Huawei OceanStor 6800 V3
  7. HDS VSP
  8. HP XP P9500 (same as the VSP above, HP resells it as their high end offering)
  9. Huawei OceanStor Dorado 5100
  10. IBM SVC with V7000
  11. IBM System Storage DS8870

I will show columns that explain the results of each vendor around 1ms. Why 1ms and not more or less? Because in the Top Ten SPC-1 performance list, most results show fairly low latency, but some have very high latency, and it’s useful to show performance at that lower latency point, which is becoming the latency standard for All-Flash systems. 1ms seems to be a good point for multi-function SSD systems (vs simpler, smaller but more speed-optimized architectures like the NetApp EF560).

The way you determine the 1ms latency point is by looking at the graph that shows latency vs SPC-1 IOPS. Let’s pick IBM’s 780 since it has a very interesting curve so you learn what to look for.

From page 5 of the IBM 780 SPC-1 report:

IBM780

IBM’s submitted SPC-1 IOPS are high but at a huge latency number for an all-SSD solution (18.90ms). Not very useful for customers picking an all-SSD system. Even the next load point, with an average latency of 6.41ms, is high for an all-flash solution.

To more accurately compare this to the rest of the vendors with decent latency, you need to look at the chart around 1ms.

They didn’t publish a load point close to 1ms so I’ll “grant” them 200,000 SPC-1 IOPS at that point (the chart shows it’s probably less but it’s OK, it makes no difference to the overall standing in the end).
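If a submission does publish load points on either side of 1ms, a rough way to estimate the IOPS at that latency is to interpolate linearly between the two nearest points. The sketch below assumes latency rises with load, and the load points in it are invented for illustration; they are not taken from any SPC-1 report.

```python
from typing import List, Optional, Tuple


def iops_at_latency(load_points: List[Tuple[float, float]], target_ms: float) -> Optional[float]:
    """Estimate IOPS at target_ms by linear interpolation.

    load_points is a list of (iops, latency_ms) pairs. Returns None when no
    two neighbouring points bracket the target latency.
    """
    pts = sorted(load_points)  # order by increasing IOPS
    for (iops_lo, lat_lo), (iops_hi, lat_hi) in zip(pts, pts[1:]):
        if lat_lo <= target_ms <= lat_hi:
            if lat_hi == lat_lo:
                return iops_hi
            frac = (target_ms - lat_lo) / (lat_hi - lat_lo)
            return iops_lo + frac * (iops_hi - iops_lo)
    return None


# Hypothetical response time / throughput curve, NOT real vendor data.
points = [(100_000, 0.5), (200_000, 0.8), (300_000, 1.2), (400_000, 3.0)]

print(iops_at_latency(points, 1.0))   # roughly 250,000 with these made-up points
print(iops_at_latency(points, 10.0))  # None: nothing brackets 10ms
```

When nothing brackets the target, the function returns None rather than guessing, and you are back to eyeballing the chart.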

You can do a similar exercise for the rest, it’s worth a look – I don’t want to paste all these graphs since this post will get too big and firmly in tl;dr territory if it isn’t already :)

Here’s the table with the current Top Ten SPC-1 Performance results as of 4/22/2015. Click on it for a clearer picture, there’s a lot going on.

8080SPC1chart

What do the results show?

Predictably, all-flash systems trump disk-based and hybrid systems for performance and can offer very nice $/SPC-1 IOPS numbers. That is the major allure of flash – high performance density.

Some takeaways from the comparison:

  • Once adjusted for 1ms latency and list price, the results shift dramatically: what was once awesome suddenly is no more.
  • The other vendors used RAID10 – NetApp used RAID-DP (similar to RAID6 in protection). What would happen to their results if they switched to RAID6 to provide a similar level of protection and efficiency?
  • Some vendors try to fit a lot of the benchmark in RAM. I show that calculation as “Working Set Size as a % of RAM”. You want that number to be comfortably bigger than 100%. 100% and under means there’s a high likelihood much of the I/O was cached in RAM. This is important – and possibly explains why some vendors used such a small capacity (indeed, on the verge of legality within the SPC-1 rules). FYI, the “hot” data in SPC-1 is about 6.75% of the overall capacity used.
  • Aside from the NetApp FAS result, the rest of the Top Ten Performance submissions offer vastly lower Application Utilization – about half! Which means that NetApp is able to use 2x the capacity vs raw compared to the other submissions. And that’s before starting to count the possible storage efficiencies we can turn on like dedupe and compression.
  • No competitor system offers the sheer functionality the FAS8080 does – not even close.
  • Certain competitors have very questionable viability and/or tiny market penetration, making them a risky proposition for a high end system purchase.
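For the “Working Set Size as a % of RAM” takeaway, here is one way such a number could be computed. I am assuming the working set is the “hot” 6.75% of the capacity used, divided by the array RAM; the exact formula behind the table column is not spelled out here, and the capacities below are hypothetical.

```python
HOT_FRACTION = 0.0675  # share of SPC-1 capacity that is "hot", per the text above


def working_set_pct_of_ram(capacity_used_gb: float, ram_gb: float,
                           hot_fraction: float = HOT_FRACTION) -> float:
    """Hot data size as a percentage of array RAM (assumed formula)."""
    return 100.0 * (capacity_used_gb * hot_fraction) / ram_gb


# Hypothetical systems, both with 1,024GB of RAM.
big = working_set_pct_of_ram(capacity_used_gb=40_000, ram_gb=1_024)
small = working_set_pct_of_ram(capacity_used_gb=10_000, ram_gb=1_024)

print(f"{big:.1f}%")    # comfortably above 100%: hot data cannot all sit in RAM
print(f"{small:.1f}%")  # under 100%: much of the hot data could be cached in RAM
```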

Overall – the all-flash FAS8080EX gets a pretty amazing performance and efficiency result, especially given the sheer amount of functionality it offers.

How does one pick a flash array?

It depends. What are you trying to do? Solve a tactical problem? Just need a lot of extra speed and far lower latency for some workloads? No need for the array to have a ton of functionality? A lot of the data management happens in the application? Need something cost-effective, simple yet reliable? Then an all-flash system like the NetApp EF560 is a solid answer, and it can still be front-ended by a Clustered ONTAP system to provide more functionality if the need arises in the future (we are firm believers in hardware reuse and investment protection – you see, some companies talk about Software Defined Storage, we do Software Defined Storage).

On the other hand, if you would prefer an Enterprise architecture that can serve as the cornerstone of your datacenter for almost any workload and protocol, offers rich data management functionality and tight application integration, insane scalability and offers the most features (reliably) compared to any other platform – then the FAS line running Clustered Data ONTAP is the only possible answer.

Couple that with OnCommand Insight – the best multivendor fabric management tool on the planet – plus Workflow Automation, and we’ve got you covered.

Thx

D


Marketing fun: NetApp industry first of up to 13 million IOPS in a single rack

I’m seeing some really “out there” marketing lately, every vendor (including us) trying to find an angle that sounds exciting while not being an outright lie (most of the time).

A competitor recently claimed an industry first of up to 1.7 million (undefined type) IOPS in a single rack.

The number (which admittedly sounds solid) got me thinking. Was the “industry first” that nobody else did up to 1.7 million IOPS in a single rack?

Would that statement also be true if someone else did up to 5 million IOPS in a rack?

I think that, in the world of marketing, it would – since the faster vendor doesn’t do up to 1.7 million IOPS in a rack, they do up to 5! It’s all about standing out in some way.

Well – let’s have some fun.

I can stuff 21x EF560 systems in a single rack.

Each of those systems can do 650,000 random 4K reads at a stable 800 microseconds (since I like defining my performance stats), 600,000 random 8K reads at under 1ms, and over 300,000 random 32KB reads at under 1ms. Also 12GB/s large block sequential reads. This is I/O straight from the SSDs and not RAM cache (the I/O from cache can of course be higher but let’s not count that).

See here for the document showing some of the performance numbers.

Well – some simple math shows a standard 42U rack fully populated with EF560 will do the following:

  • 13,650,000 IOPS.
  • 252GB/s throughput.
  • Up to 548TB of usable SSD capacity using DDP protection (up to 639TB with RAID5).
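The “simple math” for the first two bullets is just multiplication, using the per-system figures quoted above (the capacity bullet depends on drive sizes and protection overheads that are not given here, so it is left out):

```python
RACK_UNITS = 42
SYSTEMS_PER_RACK = 21            # from the text: 21x EF560 in one rack
IOPS_4K_PER_SYSTEM = 650_000     # random 4K reads per system
GBPS_PER_SYSTEM = 12             # large block sequential reads, GB/s per system

rack_iops = SYSTEMS_PER_RACK * IOPS_4K_PER_SYSTEM
rack_gbps = SYSTEMS_PER_RACK * GBPS_PER_SYSTEM
units_per_system = RACK_UNITS // SYSTEMS_PER_RACK

print(f"{rack_iops:,} IOPS")                   # 13,650,000 IOPS
print(f"{rack_gbps} GB/s")                     # 252 GB/s
print(f"{units_per_system}U per system slot")  # 2U per system slot
```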

Not half bad.

Doesn’t quite roll off the tongue though – industry first of up to thirteen million six hundred and fifty thousand IOPS in a single rack. :)

I hope rounding down to 13 million is OK with everyone.

 

D


Beware of benchmarking storage that does inline compression

In this post I will examine the effects of benchmarking highly compressible data and why that’s potentially a bad idea.

Compression is not a new storage feature. Of the large storage vendors, at a minimum NetApp, EMC and IBM can do it (depending on the array). <EDIT (thanks to Matt Davis for reminding me): Some arrays also do zero detection and will not write zeroes to disk – think of it as a specialized form of compression that ONLY works on zeroes>

A lot of the newer storage vendors are now touting real-time compression for all data (often used instead of true deduplication – it’s just easier to implement compression).

Nothing wrong with real-time compression. However, and here’s where I have a problem with some of the sales approaches some vendors follow:

Real-time compression can provide grossly unrealistic benchmark results if the benchmarks used are highly compressible!

Compression can indeed provide a performance benefit for various data types (simply since less data has to be read and written from disk), with the tradeoff being CPU. However, most normal data isn’t composed of all zeroes. Typically, compressing data will provide a decent benefit on average, but usually not several times.

So, what will typically happen is, a vendor will drop off one of their storage appliances and provide the prospect with some instructions on how to benchmark it with your garden variety benchmark apps. Nothing crazy.

Here’s the benchmark problem

A lot of the popular benchmarks just write zeroes. Which of course are extremely easy for compression and zero-detect algorithms to deal with and get amazing efficiency out of, resulting in extremely high benchmark performance.

I wanted to prove this out in an easy way that anyone can replicate with free tools. So I installed Fedora 18 with the btrfs filesystem and ran the bonnie++ benchmark with and without compression. The raw data with mount options etc. is here. An explanation of the various fields here. Not everything is accelerated by btrfs compression in the bonnie++ benchmark, but a few things really are (sequential writes, rewrites and reads):

Bonniebtrfs

Notice the gigantic improvement (in write throughput especially) btrfs compression affords with all-zero data.

Now, does anyone think that, in general, the write throughput will be 300MB/s for a decrepit 5400 RPM SATA disk?  That will be impossible unless the user is constantly writing all-zero data, at which point the bottlenecks lie elsewhere.
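You can see the same effect without a filesystem at all. The sketch below uses Python’s zlib as a stand-in compressor (an assumption on my part; it is not meant to reproduce the btrfs test) and compares 1MiB of zeroes with 1MiB of random bytes.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


SIZE = 1 << 20                  # 1 MiB
zeros = bytes(SIZE)             # all-zero buffer, like many benchmark payloads
random_data = os.urandom(SIZE)  # effectively incompressible

print(f"zeroes: {compression_ratio(zeros):.0f}:1")
print(f"random: {compression_ratio(random_data):.2f}:1")
```

The zeroes shrink to a tiny fraction of their size while the random buffer does not shrink at all, which is the gap a compressing array exploits when the benchmark writes zeroes.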

Some easy ways for dealing with the compressible benchmark issue

So what can you do in order to ensure you get a more realistic test for your data? Here are some ideas:

  • Always best is to use your own applications and not benchmarks. This is of course more time-consuming and a bigger commitment. If you can’t do that, then…
  • Create your own test data using, for example, dd and /dev/urandom (or /dev/random, which can block on older kernels) as a source in some sort of Unix/Linux variant. Some instructions here. You can even move that data to use with Windows and IOmeter – just generate the random test data in UNIX-land and move the file(s) to Windows.
  • Another, far more realistic way: Use your own data. In IOmeter, you just copy one of your large DB files to iobw.tst and IOmeter will use your own data to test… Just make sure it’s large enough and doesn’t all fit in array cache. If not large enough, you could probably make it large enough by concatenating multiple data files and random data together.
  • Use a tool that generates incompressible data automatically, like the AS-SSD benchmark (though it doesn’t look like it can deal with multi-TB stuff – but worth a try).
  • vdbench seems to be a very solid benchmark tool – with tunable compression and dedupe settings.
  • And don’t forget the obvious but often forgotten rule: never test with a data set that fits entirely in RAM!
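If dd isn’t handy, the random test file idea from the list above can be sketched in a few lines of Python. The function name, chunk size and file name are my own choices; os.urandom is simply one convenient source of incompressible bytes.

```python
import os
import tempfile


def write_random_file(path: str, size_bytes: int, chunk_bytes: int = 1 << 20) -> None:
    """Write size_bytes of random (incompressible) data to path, one chunk at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(os.urandom(n))
            remaining -= n


# Small demo in a temporary directory; a real test file should be far larger
# than the array cache.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "testdata.bin")
    write_random_file(demo_path, 4 * (1 << 20))
    print(os.path.getsize(demo_path))  # 4194304
```

Writing in chunks keeps memory use flat, so the same function works for files much larger than RAM.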

 

In all cases though, be aware of how you are testing. There is no magic :)

D

 


NetApp posts great Cluster-Mode SPC-1 result

<Edited to add some more information on how SPC-1 works since there was some confusion based on the comments received>

We’ve been busy at NetApp… busy perfecting the industry’s only scale-out unified platform, among other things.

We’ve already released ONTAP 8.1, which, in Cluster-Mode, allows 24 nodes (each with up to 8TB cache) for NAS workloads, and 4 nodes for block workloads (FC and iSCSI).

With ONTAP 8.1.1 (released on June 14th), we increased the node count to 6 for block workloads plus we added some extra optimizations and features. FYI: the node count is just what’s officially supported now, there’s no hard limit.

After our record NFS benchmark results, people have been curious about the block I/O performance of ONTAP Cluster-Mode, so we submitted an SPC-1 benchmark result using part of the same gear left over from the SPEC SFS NFS testing.

To the people that think NetApp is not a fit for block workloads (typically the ones believing competitor FUD): These are among the best SPC-1 results for enterprise disk-based systems given the low latency for the IOPS provided (it’s possible to get higher IOPS with higher latency, as we’ll explain later on in this post).

Here’s the link to the result and another with the page showing all the results.

This blog has covered SPC-1 tests before. A quick recap: The SPC-1 benchmark is an industry-standard, audited, tough, block-based benchmark (on Fibre Channel) that tries to stress-test disk subsystems with a lot of writes, overwrites, hotspots, a mix of random and sequential, write after read, read after write, etc. About 60% of the workload is writes. The I/O sizes are of a large variety – from small to large (so, SPC-1 IOPS are decidedly not the same thing as fully random uniform 4KB IOPS and should not be treated as such).

The benchmark access patterns do have hotspots that are a significant percentage of the total workload. Such hotspots can be either partially cached if the cache is large enough or placed on SSD if the arrays tested have an autotiering system granular and intelligent enough.

If an array can perform well in the SPC-1 workload, it will usually perform extremely well under difficult, latency-sensitive, dynamically changing DB workloads and especially OLTP. The full spec is here for the morbidly curious.

The trick with benchmarks is interpreting the results. A single IOPS number, while useful, doesn’t tell the whole story with respect to the result being useful for real applications. We’ll attempt to assist in the deciphering of the results in this post.

Before we delve into the obligatory competitive analysis, some notes for the ones lacking in faith:

  1. There was no disk short-stroking in the NetApp benchmark (a favorite way for many vendors to get good speeds out of disk systems by using only the outer part of the disk – the combination of higher linear velocity and smaller head movement providing higher performance and reduced seeks). Indeed, we used a tuning parameter that uses the entire disk surface, no matter how full the disks. Look at the full disclosure report here, page 61. For the FUD-mongers out there: This effectively pre-ages WAFL. We also didn’t attempt to optimize the block layout by reallocating blocks.
  2. There was no performance degradation over time.
  3. Average latency (“All ASUs” in the results) was flat and stayed below 5ms during multiple iterations of the test, including the sustainability test (page 28 of the full disclosure report).
  4. No extra cache beyond what comes with the systems was added (512GB comes standard with each 6240 node, 3TB per node is possible on this model, so there’s plenty of headroom for much larger working sets).
  5. It was not a “lab queen” system. We used very few disks to achieve the performance compared to the other vendors, and it’s not even the fastest box we have.


ANALYSIS

When looking at this type of benchmark, one should probably focus on:
  1. High sustained IOPS (inconsistency is frowned upon).
  2. IOPS/drive (a measure of efficiency – 500 IOPS/drive is twice as efficient as 250 IOPS/drive, meaning far fewer drives are needed, which results in lower costs, less physical footprint, etc.)
  3. Low, stable latency over time (big spikes are frowned upon).
  4. IOPS as a function of latency (do you get high IOPS but also very high latency at the top end? Is that a useful system?)
  5. The RAID protection used (RAID6? RAID10? RAID6 can provide both better protection and better space efficiency than mirroring, resulting in lower cost yet more reliable systems).
  6. What kind of drives were used? Ones you are likely to purchase?
  7. Was autotiering used? If not, why not? Isn’t it supposed to help in such difficult scenarios? Some SSDs would be able to handle the hotspots.
  8. The amount of hardware needed to get the stated performance (are way too many drives and controllers needed to do it? Does that mean a more complex and costly system? What about management?)
  9. The cost (some vendors show discounts and others show list price, so be careful there).
  10. The cost/op (which is the more useful metric – assuming you compare list price to list price).
SPC-1 is not a throughput-type benchmark; for sheer GB/s, look elsewhere. Most of the systems didn’t do more than 4GB/s in this benchmark since a lot of the operations are random (and 4GB/s is quite a lot of random I/O).
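The IOPS/drive point in the list above is easy to turn into numbers. A minimal sketch follows; the 500 and 250 IOPS/drive figures come from the list, while the 100,000 IOPS target is a hypothetical workload chosen only for illustration.

```python
import math


def iops_per_drive(spc1_iops: float, drive_count: int) -> float:
    """Efficiency metric: benchmark IOPS divided by the number of drives used."""
    return spc1_iops / drive_count


def drives_needed(target_iops: float, per_drive_iops: float) -> int:
    """Drives required to reach target_iops at a given per-drive efficiency."""
    return math.ceil(target_iops / per_drive_iops)


target = 100_000  # hypothetical workload

print(drives_needed(target, 500))  # 200
print(drives_needed(target, 250))  # 400
```

Twice the per-drive efficiency means half the drives for the same workload, along with the cost, power and rack space that go with them.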

SYSTEMS COMPARED

In this analysis we are comparing disk-based systems. Pure-SSD (or plain old RAM) performance-optimized configs can (predictably) get very high performance and may be a good choice if someone has a very small workload that needs to run very fast.

The results we are focusing on, on the other hand, are highly reliable, general-purpose systems that can provide high performance, low latency and high capacity at a reasonable cost to many hosts and applications, along with rich functionality (snaps, replication, megacaching, thin provisioning, deduplication, compression, multiple protocols incl. NAS etc. Whoops – none of the other boxes aside from NetApp do all this, but such is the way the cookie crumbles).

Here’s a list of the systems with links to their full SPC-1 disclosure where you can find all the info we’ll be displaying. Those are all systems with high results and relatively flat sustained latency results.

There are some other disk-based systems with decent IOPS results but if you look at their sustained latency (“Sustainability – Average Response Time (ms) Distribution Data” in any full disclosure report) there’s too high a latency overall and too much jitter past the initial startup phase, with spikes over 30ms (which is extremely high), so we ignored them.

Here’s a quick chart of the results sorted according to latency. In addition, the prices shown are the true list prices (which can be found in the disclosures) plus the true $/IO cost based on that list price (a lot of vendors show discounted pricing to make that seem lower):

…BUT THAT CHART SHOWS THAT SOME OF THE OTHER BIG BOXES ARE FASTER THAN NETAPP… RIGHT?

That depends on whether you value and need low latency or not (and whether you take RAID type into account). For the vast majority of DB workloads, very low I/O latencies are vastly preferred to high latencies.

Here’s how you figure out the details:
  1. Choose any of the full disclosure links you are interested in. Let’s say the 3Par one, since it shows both high IOPS and high latency.
  2. Find the section titled “Response Time – Throughput Curve”. Page 13 in the 3Par result.
  3. Check whether latency rises sharply as load is added to the system.

Shown below is the 3Par curve:

3parlatency

Notice how latency rises quite sharply after a certain point.

Now compare this to the NetApp result (page 13):

Netappspclatency

Notice how the NetApp result has in general much lower latency but, more importantly, the latency stays low and rises slowly as load is added to the system.

Which is why the column “SPC-1 IOPS around 3ms” was added to the table. Effectively, what would the IOPS be at around the same latency for all the vendors?

Once you do that, you realize that the 3Par system is actually slower than the NetApp system if a similar amount of low latency is desired. Plus it costs several times more.

You can get the exact latency numbers just below the graphs on page 13, the NetApp table looks like this (under the heading “Response Time – Throughput Data”):

Netappspcdata

Indeed, of all the results compared, only the IBM SVC (with a bunch of V7000 boxes behind it) is faster than NetApp at that low latency point. Which neatly takes us to the next section…

WHAT IS THE 100% LOAD POINT?

I had to add this since it is confusing. The 100% load point does not mean the arrays tested were necessarily maxed out. Indeed, most of the arrays mentioned could sustain bigger workloads given higher latencies. 3Par just decided to show the performance at that much higher latency point. The other vendors decided to show the performance at latencies more palatable to Tier 1 DB workloads.

The SPC-1 load generators are simply told to run at a specific target IOPS and that is chosen to be the load level. The goal being to balance cost, IOPS and latency.

JUST HOW MUCH HARDWARE IS NEEDED TO GET A SYSTEM TO PERFORM?

Almost any engineering problem can be solved given the application of enough hardware. The IBM result is a great example of a very fast system built by adding a lot of hardware together:

  • 8 SVC virtualization engines plus…
  • …16 separate V7000 systems under the SVC controllers…
  • …each consisting of 2 more SVC controllers and 2 RAID controllers
  • 1,920 146GB 15,000 RPM disks (not quite the drive type people buy these days)
  • For a grand total of 40 Linux-based SVC controllers (8 larger and 32 smaller), 32 RAID controllers, and a whole lot of disks.

Putting aside for a moment the task of actually putting together and managing such a system, or the amount of power it draws, or the rack space consumed, that’s quite a bit of gear. I didn’t even attempt to add up all the CPUs working in parallel, but I’m sure it’s a lot.

Compare it to the NetApp configuration:
  • 6 controllers in one cluster
  • 432 450GB 15,000 RPM disks (a pretty standard and common drive type as of the time of this writing in June 2012).
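For anyone who wants to check the hardware tally, the arithmetic is simple enough to sketch. All counts come from the two lists above; the raw capacities are plain disk-count times disk-size products in base-10 units, ignoring RAID, spares and formatting overhead (that simplification is an assumption of this sketch).

```python
# IBM configuration, as listed above.
ibm_svc_engines = 8
ibm_v7000_systems = 16
svc_per_v7000 = 2
raid_per_v7000 = 2

ibm_svc_controllers = ibm_svc_engines + ibm_v7000_systems * svc_per_v7000
ibm_raid_controllers = ibm_v7000_systems * raid_per_v7000

# Disk counts and sizes for both submissions.
ibm_disks, ibm_disk_gb = 1920, 146
netapp_disks, netapp_disk_gb = 432, 450

# Raw capacity only: no RAID, spares or formatting overhead considered.
ibm_raw_tb = ibm_disks * ibm_disk_gb / 1000
netapp_raw_tb = netapp_disks * netapp_disk_gb / 1000
disk_ratio = ibm_disks / netapp_disks

print(f"IBM: {ibm_svc_controllers} SVC controllers, {ibm_raid_controllers} RAID controllers")
print(f"IBM raw: {ibm_raw_tb:.1f} TB on {ibm_disks} disks")
print(f"NetApp raw: {netapp_raw_tb:.1f} TB on {netapp_disks} disks")
print(f"IBM uses {disk_ratio:.1f}x as many disks")
```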

SOME QUESTIONS (OTHER VENDORS FEEL FREE TO RESPOND):

  1. What would performance be with RAID6 for the other vendors mentioned? NetApp always tests with our version of RAID6 (RAID-DP). RAID6 is more reliable than mirroring, especially when large pools are in question (not to mention more space-efficient). Most customers won’t buy big systems with all-RAID10 configs these days… (customers, ask your vendor. There is no magic – I bet they have internal results with RAID6, make them show you).
  2. Autotiering is the most talked-about feature it seems, with attributes that make it seem more important than the invention of penicillin or even the wheel, maybe even fire… However, none of the arrays mentioned are using any SSDs for autotiering (IBM published a result once – nothing amazing, draw your own conclusions). One would think that a benchmark that creates hot spots would be an ideal candidate… (and, to re-iterate, there are hotspots and of a percentage small enough to easily fit in SSD). At least IBM’s result proves that (after about 19 hours) autotiering works for the SPC-1 workload – which further solidifies the question: Why is nobody doing this if it’s supposed to be so great?
  3. Why are EMC and Dell unwilling to publish SPC-1 results? (they are both SPC members). They are the only 2 major storage vendors that won’t publish SPC-1 results. EMC said in the past they don’t think SPC-1 is a realistic test – well, only running your applications with your data on the array is ever truly realistic. What SPC-1 is, though, is an industry-standard benchmark for a truly difficult random workload with block I/O, and a great litmus test.
  4. For a box regularly marketed for Tier-1 workloads, the IBM XIV is, once more, suspiciously absent, even in its current Gen3 guise. It’s not like IBM is shy about submitting SPC-1 results :)
  5. Finally – some competitors keep saying NetApp is “not true SAN”, “emulated SAN” etc. Whatever that means – maybe the NetApp approach is better after all… the maximum write latency of the NetApp submission was 1.91ms for a predominantly write workload :)
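On the space-efficiency point in question 1, a rough sketch of the difference between mirroring and dual parity follows. The group sizes of 16 and 20 drives are illustrative, based on the “2 drives every 16-20 drives” parity penalty quoted for RAID-DP later in this document, and the sketch assumes the group size includes the two parity drives. Spares and filesystem overhead are ignored.

```python
def usable_fraction_raid10():
    # Mirroring keeps two copies of everything, so half the raw space is usable.
    return 0.5


def usable_fraction_dual_parity(group_size, parity_drives=2):
    # In a dual-parity group, every drive except the parity drives holds data.
    return (group_size - parity_drives) / group_size


for size in (16, 20):
    share = usable_fraction_dual_parity(size)
    print(f"dual parity, {size}-drive group: {share:.1%} usable")
print(f"RAID-10: {usable_fraction_raid10():.0%} usable")
```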

FINAL THOUGHTS

With this recent SPC-1 result, NetApp showed once more that ONTAP running in Cluster-Mode is high-performing and highly scalable for both SAN and NAS workloads. In summary, ONTAP Cluster-Mode:
  • Allows for highly performant and dynamically-scalable unified clusters for FC, iSCSI, NFS and CIFS.
  • Exhibits proven low latency while maintaining high performance.
  • Provides excellent price/performance.
  • Allows data on any node to be accessed from any other node.
  • Moves data non-disruptively between nodes (including CIFS, which normally is next to impossible).
  • Maintains the traditional NetApp features (write optimization, application awareness, snapshots, deduplication, compression, replication, thin provisioning, megacaching).
  • Can use the exact same FAS gear as ONTAP running in the legacy 7-mode for investment protection.
  • Can virtualize other arrays behind it.
 Courteous comments always welcome.
D

NetApp posts world-record SPEC SFS2008 NFS benchmark result

Just as NetApp dominated the older version of the benchmark, SPEC SFS97_R1 NFS, back in May of 2006 (and was unsurpassed in that benchmark with 1 million SFS operations per second), the time has come to once again dominate the current version, SPEC SFS2008 NFS.

Recently we have been focusing on benchmarking realistic configurations that people might actually put in their datacenters, instead of lab queens with unusable configs focused on achieving the highest result regardless of cost.

However, it seems the press doesn’t care about realistic configs (or even about understanding the configs) but instead likes headline-grabbing big numbers.

So we decided to go for the best of both worlds – a headline-grabbing “big number” but also a config that would make more financial sense than the utterly crazy setups being submitted by competitors.

Without further ado, NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode. Click here for the specific result. There are other results in the page showing different size clusters so you can get some idea of the scaling possible.

See the table below for a high-level analysis (including the list pricing I could find for these specific performance-optimized configs, for whatever that’s worth). The comparison is between NetApp and the nearest scale-out competitor result (one of EMC’s many recent acquisitions – Isilon, the niche, dedicated NAS box – nothing else is close enough to bother including in the comparison).

BTW – the EMC price list is publicly available from here (and other places I’m sure): http://www.emc.com/collateral/emcwsca/master-price-list.pdf

From page 422:

S200-6.9TB & 200GB SSD, 48GB RAM, 2x10GE SFP+ & 2x1G: $84,061. Times 140…
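Spelling out the “times 140” multiplication: the per-node price is the list price quoted above and 140 is the node count in the Isilon submission. This sketch covers only the nodes themselves; whether anything else belongs in the total is not something it tries to model.

```python
node_list_price_usd = 84_061   # list price per S200 node, quoted above
node_count = 140               # nodes in the Isilon submission

total_usd = node_list_price_usd * node_count
print(f"${total_usd:,}")
```

That works out to roughly 11.8 million USD at list.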

Before we dive into the comparison, an important note since it seems the competition doesn’t understand how to read SPEC SFS results:

Out of 1728 450GB disks (the number includes spares and OS drives, otherwise it was 1632 disks), the usable capacity was 574TB (73% of all raw space – even more if one considers a 450GB disk never actually provides 450 real GB in base2). The exported capacity was 288TB. This doesn’t mean we tried to short-stroke or that there is a performance benefit exporting a smaller filesystem – the way NetApp writes to disk, the size of the volume you export has nothing to do with performance. Since SPEC SFS doesn’t use all the available disk space, the person doing the setup thought like a real storage admin and didn’t give it all the available space. 
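The capacity figures in that paragraph can be reproduced with a few lines. Disk count, disk size, usable and exported capacity are the numbers quoted above; the sketch assumes the 450GB rating is in base-10 gigabytes, which is what makes the base-2 figure smaller.

```python
disks = 1728
disk_gb = 450        # rated (base-10) gigabytes per disk
usable_tb = 574
exported_tb = 288

raw_tb = disks * disk_gb / 1000
usable_share = usable_tb / raw_tb
exported_share = exported_tb / usable_tb

# What a 450GB (base-10) disk amounts to in base-2 gigabytes (GiB).
disk_gib = disk_gb * 10**9 / 2**30

print(f"raw: {raw_tb:.1f} TB")
print(f"usable share of raw: {usable_share:.1%}")
print(f"exported share of usable: {exported_share:.1%}")
print(f"one disk in base 2: {disk_gib:.1f} GiB")
```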

Lest we be accused of tuning this config or manually making sure client accesses were load-balanced and going to the optimal nodes, please understand this: 23 out of 24 client accesses were not going to the nodes owning the data and were instead happening over the cluster interconnect (which, for any scale-out architecture, is worst-case-scenario performance). Look under “Uniform Access Rules Compliance” in the full disclosure details of the result on the SPEC website here. This means that, compared to the 2-node ONTAP 7-mode results, there is a degradation due to the cluster operating (intentionally) through non-optimal paths.
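The “23 out of 24” figure falls straight out of uniform access. As a sketch, assuming each piece of data is owned by exactly one node and client requests are spread evenly across all nodes (the function name is just for illustration):

```python
from fractions import Fraction


def remote_access_fraction(nodes):
    # With requests spread evenly over all nodes, only 1 in `nodes`
    # lands on the node that owns the data; the rest cross the interconnect.
    return Fraction(nodes - 1, nodes)


share = remote_access_fraction(24)
print(share, f"= {float(share):.1%} of accesses go over the cluster interconnect")
```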

Metric | EMC | NetApp | Difference
Cost (approx. USD List) | 11,800,000 | 6,280,000 | NetApp is almost half the cost while offering much higher performance
SPEC SFS2008 NFS operations per second | 1,112,705 | 1,512,784 | NetApp is over 35% faster, while using potentially better RAID protection
Average Latency (ORT, ms) | 2.54 | 1.53 | NetApp offers almost 40% better average latency without using costly SSDs, and is usable for challenging random workloads like DBs, VMs etc.
Space (TB) | 864 (out of which 128889GB was used in the test) | 574 (out of which 176176GB was used in the test) | Isilon offers about 50% more usable space (coming from a lot more drives, 28% more raw space and potentially less RAID protection – N+2 results from Isilon would be different)
$/SPEC SFS2008 NFS operation | 10.6 | 4.15 | NetApp is less than half the cost per SPEC SFS2008 NFS operation
$/TB | 13,657 | 10,940 | NetApp is about 20% less expensive than EMC per usable TB
RAID | Per-file protection. Files < 128K are at least mirrored. Files over 128K are at a 13+1 level protection in this specific test. | RAID-DP | Ask EMC what 13+1 protection means in an Isilon cluster (I believe 1 node can be completely gone but what about simultaneous drive failures that contain the same protected file?) NetApp RAID-DP is mathematically analogous to RAID6 and has a parity drive penalty of 2 drives every 16-20 drives.
Boxes needed to accomplish result | 140 nodes, 3,360 drives (incl. 25TB of SSDs for cache), 1,120 CPU cores, 6.7TB RAM. | 24 unified controllers, 1,728 drives, 12.2TB Flash Cache, 192 CPU cores, 1.2TB RAM. | NetApp is far more powerful per node, and achieves higher performance with far fewer drives and CPUs, and far less RAM and cache. In addition, NetApp can be used for all protocols (FC, iSCSI, NFS, CIFS) and all connectivity methods (FC 4/8Gb, Ethernet 1/10Gb, FCoE).
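The derived rows of the comparison ($/operation, $/TB and the percentage differences) can be recomputed from the raw rows. This is a minimal sketch using only the cost, throughput, latency and usable-space figures from the table; the dictionary keys and function name are made up for this example.

```python
emc = {"cost_usd": 11_800_000, "ops": 1_112_705, "ort_ms": 2.54, "usable_tb": 864}
netapp = {"cost_usd": 6_280_000, "ops": 1_512_784, "ort_ms": 1.53, "usable_tb": 574}


def derived(system):
    # Cost per SPEC SFS2008 NFS operation and per usable TB, at list price.
    return {
        "usd_per_op": system["cost_usd"] / system["ops"],
        "usd_per_tb": system["cost_usd"] / system["usable_tb"],
    }


speedup_pct = (netapp["ops"] / emc["ops"] - 1) * 100
latency_gain_pct = (1 - netapp["ort_ms"] / emc["ort_ms"]) * 100
extra_space_pct = (emc["usable_tb"] / netapp["usable_tb"] - 1) * 100

for name, system in (("EMC", emc), ("NetApp", netapp)):
    d = derived(system)
    print(f"{name}: ${d['usd_per_op']:.2f}/op, ${d['usd_per_tb']:,.0f}/TB")
print(f"NetApp throughput advantage: {speedup_pct:.1f}%")
print(f"NetApp latency advantage: {latency_gain_pct:.1f}%")
print(f"Isilon usable space advantage: {extra_space_pct:.1f}%")
```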

 

Notice the response time charts:

[Figure: Isilon vs. FAS6240 response time chart]

NetApp exhibits traditional storage system behavior – latency is very low initially and gradually gets higher the more the box is pushed, as one would expect. Isilon on the other hand starts out slow and gets faster as more metadata gets cached, until the controllers run out of steam (SPEC SFS is very heavy in NAS metadata ops, and should not be compared to heavy-duty block benchmarks like SPC-1).

This is one of the reasons an Isilon cluster is not really applicable for low-latency DB-type apps, or low-latency VMs. It is a great architecture designed to provide high sequential speeds for large files over NAS protocols, and is not a general-purpose storage system. Kudos to the Isilon guys for even getting the great SPEC result in the first place, given that this isn’t what the box is designed to do (the extreme Isilon configuration needed to run the benchmark is testament to that). The better application for Isilon would be capacity-optimized configs (which is what the system is designed for to begin with).

 

Some important points:

  1. First and foremost, the cluster-mode ONTAP architecture now supports all protocols; it is the only unified scale-out architecture available. Any competitors playing in that space only have NAS or SAN offerings but not both in a single architecture.
  2. We didn’t even test with the even faster 6280 box and extra cache (that one can take 8TB cache per node). The result is not the fastest a NetApp cluster can go :) With 6280s it would be a healthy percentage faster, but we had a bunch of the 6240s in the lab so it was easier to test them, plus they’re a more common and less expensive box, making for a more realistic result.
  3. ONTAP in cluster-mode is a general-purpose storage OS, and can be used to run Exchange, SQL, Oracle, DB2, VMs, etc. etc. Most other scale-out architectures are simply not suitable for low-latency workloads like DBs and VMs and are instead geared towards high NAS throughput for large files (IBRIX, SONAS, Isilon to name a few – all great at what they do best).
  4. ONTAP in cluster mode is, indeed, a single scale-out cluster and administered as such. It should not be compared to block boxes with NAS gateways in front of them like VNX, HDS + Bluearc, etc.
  5. In ONTAP cluster mode, workloads and virtual interfaces can move around the cluster non-disruptively, regardless of protocol (FC, iSCSI, NFS and yes, even CIFS can move around non-disruptively assuming you have clients that can talk SMB 2.1 and above).
  6. In ONTAP cluster mode, any data can be accessed from any node in the cluster – again, impossible with non-unified gateway solutions like VNX that have individual NAS servers in front of block storage, with zero awareness between the NAS heads aside from failover.
  7. ONTAP cluster mode allows certain cool things like upgrading storage controllers from one model to another completely non-disruptively; most other storage systems need some kind of outage to do this. All we do is add the new boxes to the existing cluster :)
  8. ONTAP cluster mode supports all the traditional NetApp storage efficiency and protection features: RAID-DP, replication, deduplication, compression, snaps, clones, thin provisioning. Again, the goal is to provide a scale-out general-purpose storage system, not a niche box for only a specific market segment. It even supports virtualizing your existing storage.
  9. There was a single namespace for the NFS data. Granted, not the same architecture as a single filesystem from some competitors.
  10. Last but not least – no “special” NetApp boxes are needed to run Cluster Mode. In contrast to other vendors selling a completely separate scale-out architecture (different hardware and software and management), normal NetApp systems can enter a scale-out cluster as long as they have enough connectivity for the cluster network and can run ONTAP 8. This ensures investment protection for the customer plus it’s easier for NetApp since we don’t have umpteen hardware and software architectures to develop for and support :)
  11. Since people have been asking: The SFS benchmark generates about 120MB of data per operation per second of load. The slower you go, the less space you will use on the disks, regardless of how many disks you have. This creates some imbalance in large configs (for example, only about 128TB of the 864TB available was used on Isilon).
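The roughly-120MB figure in the last point can be sanity-checked against the space-used and throughput numbers from the comparison table. This sketch divides the space used in each test by the achieved operations per second; it assumes 1GB = 1000MB, and the result would come out a few MB higher with 1024.

```python
# (space used in the test in GB, achieved SPEC SFS2008 NFS ops/sec), from the table.
results = {
    "Isilon": (128_889, 1_112_705),
    "NetApp": (176_176, 1_512_784),
}


def mb_per_op(space_used_gb, ops_per_sec):
    # Assumes decimal units: 1 GB = 1000 MB.
    return space_used_gb * 1000 / ops_per_sec


for name, (used_gb, ops) in results.items():
    print(f"{name}: {mb_per_op(used_gb, ops):.1f} MB per op/sec")
```

Both systems land in the same ballpark, which is consistent with the space used being driven by the load level rather than by how many disks are behind it.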

Just remember – in order to do what ONTAP in Cluster Mode does, how many different architectures would other vendors be proposing?

  • Scale-out SAN
  • Scale-out NAS
  • Replication appliances
  • Dedupe appliances
  • All kinds of management software

How many people would it take to keep it all running? And patched? And how many firmware inter-dependencies would there be?

And what if you didn’t need, say, scale-out SAN to begin with, but some time after buying traditional SAN realized you needed scale-out? Would your current storage vendor tell you you needed, in addition to your existing SAN platform, that other one that can do scale-out? That’s completely different than the one you bought? And that you can’t re-use any of your existing stuff as part of the scale-out box, regardless of how high-end your existing SAN is?

How would that make you feel?

Always plan for the future…

Comments welcome.

D

PS: Made some small edits in the RAID parts and also added the official EMC pricelist link.

 
