NetApp posts SPC-1 Top Ten Performance results for its high end systems – Tier 1 meets high functionality and high performance

It’s been a while since our last SPC-1 benchmark submission with high-end systems in 2012. Since then we have launched all-new systems and moved from ONTAP 8.1 to ONTAP 8.3, big jumps in both hardware and software.

In 2012 we posted an SPC-1 result with a 6-node FAS6240 cluster  – not our biggest system at the time but we felt it was more representative of a realistic solution and used a hybrid configuration (spinning disks boosted by flash caching technology). It still got the best overall balance of low latency, high SPC-1 IOPS, price, scalability, data resiliency and functionality compared to all other spinning disk systems at the time.

Today (April 22, 2015) we published SPC-1 results with an 8-node all-flash high-end FAS8080 cluster to illustrate the performance of the largest current NetApp FAS systems in this industry-standard benchmark.

For the impatient…

  • The NetApp All-Flash FAS8080 SPC-1 submission places the system in the #5 performance spot in the SPC-1 Top Ten by performance list.
  • And #3 if you look at performance at 1ms latency.
  • Highest performing All-Flash Enterprise Unified system.
  • The NetApp system uses RAID-DP, similar to RAID-6, whereas the other entries use RAID-10 – performance would be far lower for the other entries with RAID-6
  • Price-performance-wise, the FAS8080 gets the #4 spot once adjusted for all list prices
  • In addition, the FAS8080 shows the best storage efficiency, by far, of any SPC-1 submission to date (and without using compression or deduplication).
  • The FAS8080 offers far more functionality than any other system in the list.

We also recently posted results with the NetApp EF560 – the other major hardware platform NetApp offers. See my post here and the official results here. That platform has a different value proposition – fewer features, but very low latency and great cost effectiveness are the key themes for the EF560.

In this post I want to explain the current Clustered ONTAP results and why they are important.

Flash performance without compromise

Solid state storage technologies are becoming increasingly popular.

The challenge with flash offerings from most vendors is that customers typically either have to give up a lot in order to get the high performance of flash, or have to combine 4-5 different products into a complex “solution” in order to satisfy different requirements.

For instance, dedicated all-flash offerings may not be able to natively replicate to less expensive, spinning-drive solutions.

Or, a flash system may offer high performance but not the functionality, scalability, reliability and data integrity of more mature solutions.

But what if you could have it all? Performance and reliability and functionality and scalability and maturity? That’s exactly what Clustered ONTAP 8.3 provides.

Here are some highlights of Clustered ONTAP 8.3 running on the FAS8080:

  • All the NetApp signature ultra-tight application integration and automation for replication, SnapShots, Clones
  • Fancy write-accelerated RAID6-equivalent protection by default
  • Comprehensive data integrity and protection against insidious lost write/torn page/misplaced write errors that RAID and normal checksums don’t always catch
  • Non-disruptive data mobility for all protocols
  • Non-disruptive operations – no downtime even when doing things that would require downtime and extensive PS with other vendors
  • Granular QoS
  • Deduplication and compression
  • Highly scalable – 5,760 drives possible in an 8-node cluster, 17,280 drives possible in the max 24 nodes. Various drive types, from SSD to SATA and everything else in between.
  • Multiprotocol (FC, iSCSI, NFS, SMB1,2,3) on the same hardware (no “helper” boxes needed, no dedicated SAN vs NAS pools needed)
  • 96,000 LUNs per 8-node cluster (that’s right, ninety-six thousand LUNs)
  • ONTAP is VMware vVol ready
  • The only array that has been validated by VMware for VMware Horizon 6 with vVols – hopefully the competitors will follow our lead
  • Over 460TB (yes, TeraBytes) of usable cache after all overheads are accounted for (and without accounting for cache amplification through deduplication and clones) in an 8-node cluster. Makes competitor maximum cache amounts seem like rounding errors – indeed, the actual figure might be 465TB or more, but it’s OK… :) (and 3x that number in a 24-node cluster, over 1.3PB cache!)
  • The ability to virtualize other storage arrays behind it
  • The ability to have a cluster with dissimilar size and type nodes – no need to keep all engines the same (unlike monolithic offerings). Why pay the same for all nodes when some nodes may not need all the performance? Why be forced to keep all nodes in the same hardware family? What if you don’t want to buy all at once? Maybe you want to upgrade part of the cluster with a newer-gen system? :)
  • The ability to evacuate part of a cluster and build that part as a different cluster elsewhere
  • The ability to have multiple disk types in a cluster and, indeed, dedicate nodes to functions (for instance, have a few nodes all-flash, some nodes with flash-accelerated SAS and a couple with very dense yet flash-accelerated NL-SAS, with full online data mobility between nodes)

That last bullet deserves a picture:
 MixedCluster

 

“SVM” stands for Storage Virtual Machine –  it means a logical storage partition that can span one or more cluster nodes and have parts of the underlying capacity (performance and space) available to it, with its own users, capacity and performance limits etc.

In essence, Clustered ONTAP offers the best combination of performance, scalability, reliability, maturity and features of any storage system extant as of this writing. Indeed – look at some of the capabilities like maximum cache and number of LUNs. This is designed to be the cornerstone of a datacenter.

It makes most other systems seem like toys in comparison…

FUD buster

Another reason we wanted to show this result was FUD from competitors struggling to find an angle to fight NetApp. It goes a bit like this: “NetApp FAS systems aren’t real SAN, it’s all simulated and performance will be slow!”

Right…

Well – for a “simulated” SAN (whatever that means), the performance is pretty amazing given the level of protection used (RAID6-equivalent – far more resilient and capacity-efficient for large pooled deployments than the RAID10 the other submissions use) and all the insane scalability, reliability and functionality on tap :)

Another piece of FUD has been that ONTAP isn’t “flash-optimized” since it’s a very mature storage OS and wasn’t written “from the ground up”. We’ll let the numbers speak for themselves. It’s worth noting that we have been incorporating a lot of flash-related innovations into FAS systems well before any other competitor did so, something conveniently ignored by the FUD-mongers. In addition, ONTAP 8.3 has a plethora of flash optimizations and path length improvements that helped with the good latency results. And lots more is coming.

The final piece of FUD we made sure was addressed was system fullness – last time we ran the test we didn’t fill up as much as we could have, which prompted the FUD-mongers to say that FAS systems need gigantic amounts of free space to perform. Let’s see what they’ll come up with this time ;)

On to the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a 100% block-based benchmark with its own I/O blend and, as such, the results from any vendor SPC-1 submission should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend of the SPC-1 workload). Indeed, the tested configuration might perform in the millions of “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 Performance” systems page so you can compare other submissions that are in the upper performance echelon (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 list).

I recommend you look beyond the initial table in each submission showing the performance and $/IOPS and at least go to the actual price list to see the detail. For instance, HDS shows a 58% discount if you go to the detail here, and calculates their $/IOPS number based on the discounted price. Just be aware and remember – the only way to get a real price is to talk to your sales rep.
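To compare submissions on the same footing, a discounted total can be converted back to list price before working out $/SPC-1 IOPS. Here is a minimal Python sketch of that arithmetic, assuming one flat discount applies to the whole configuration (real price sheets may discount line items differently). The quoted price and IOPS figure are made up for illustration and do not come from any SPC-1 report; only the 58% discount echoes the example above.

```python
def list_price(discounted_price: float, discount_pct: float) -> float:
    """Recover the list price from a price quoted after a flat percentage discount."""
    if not 0 <= discount_pct < 100:
        raise ValueError("discount_pct must be in [0, 100)")
    return discounted_price / (1 - discount_pct / 100)


def dollars_per_iops(price: float, spc1_iops: float) -> float:
    """Price-performance expressed as dollars per SPC-1 IOPS."""
    return price / spc1_iops


# Hypothetical system: quoted at $420,000 after a 58% discount,
# delivering 500,000 SPC-1 IOPS.
quoted = 420_000.0
iops = 500_000.0
full = list_price(quoted, 58)

print(f"list price:        ${full:,.0f}")
print(f"$/IOPS discounted: {dollars_per_iops(quoted, iops):.2f}")
print(f"$/IOPS at list:    {dollars_per_iops(full, iops):.2f}")
```

The same hardware looks more than twice as expensive per IOPS once the discount is backed out, which is why the headline table alone can mislead.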

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

  • Latency vs IOPS – many submissions will show high IOPS at huge latency, which would be rather useless when it comes to Flash storage
  • Sustainability – was performance even or are there constant huge spikes?
  • RAID level – most submissions use RAID10 for speed, what would happen with RAID6?
  • Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.

Let’s go over these one by one.

Latency vs IOPS

Our average latency was 1.23ms at 685,281.71 SPC-1 IOPS, and pretty flat over time during the test:

Response_time_complete

Sustainability

The SPC-1 rules state the minimum runtime should be 8 hours. We ran the test for 18 hours to observe if there would be variation in the performance. There was no significant variation:

IOdistributionRamp

RAID level

RAID-DP was used for all testing. This is mathematically analogous in protection to RAID-6. Given that these systems are typically deployed in very large pooled configurations, we elected long ago to not recommend single parity RAID since it’s simply not safe enough. RAID-10 is fast and fine for smaller capacity SSD systems but, at scale, it gets too expensive for anything but a lab queen (a system that nobody in their right mind will ever buy but which benchmarks well).
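To put rough numbers on the capacity cost of mirroring versus dual parity, here is a small Python sketch. It is a simplification: it ignores spares, metadata and other overheads, and the group sizes are illustrative assumptions rather than the layout used in the tested configuration.

```python
def mirror_usable_fraction(copies: int = 2) -> float:
    """Fraction of raw capacity left after mirroring (RAID-10 keeps 2 copies)."""
    return 1 / copies


def dual_parity_usable_fraction(drives_per_group: int) -> float:
    """Fraction of raw capacity left in a dual-parity group (RAID-6 style):
    two drives' worth of every group is spent on parity."""
    if drives_per_group < 3:
        raise ValueError("a dual-parity group needs at least 3 drives")
    return (drives_per_group - 2) / drives_per_group


print(f"RAID-10:                {mirror_usable_fraction():.0%}")
for n in (8, 16, 24):  # illustrative group sizes only
    print(f"dual parity, {n:2d} drives: {dual_parity_usable_fraction(n):.0%}")
```

The larger the dual-parity group, the smaller the parity share, while mirroring always costs half of the raw capacity.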

Application Utilization

Our Application Utilization was a very high 61.92% – unheard of by other vendors posting SPC-1 results since they use RAID10 which, by definition, wastes half the capacity (plus spares and other overheads to worry about on top of that).

AppUtilization

Some vendors using RAID10 will fill up the resulting space after RAID, spares etc. to a very high degree, and call out the “Protected Application Utilization” as being the key thing to focus on.

This could not be further from the truth – Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used and signifies how space-efficient the storage was.

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% :)
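The quadruple-mirroring example is easy to check in a few lines of Python. This is a deliberately simplified model (mirroring only, no spares or other overheads) and not the official SPC-1 formula; the function and key names are my own.

```python
def utilization(raw_tb: float, mirror_copies: int, fill_fraction: float) -> dict:
    """Simplified model: mirroring divides raw capacity by the number of copies,
    and the benchmark fills some fraction of what is left."""
    usable_tb = raw_tb / mirror_copies
    consumed_tb = usable_tb * fill_fraction
    return {
        "usable_tb": usable_tb,
        "consumed_tb": consumed_tb,
        "fill_of_usable": consumed_tb / usable_tb,  # the flattering number
        "fill_of_raw": consumed_tb / raw_tb,        # what Application Utilization captures
    }


quad = utilization(raw_tb=100, mirror_copies=4, fill_fraction=1.0)
print(quad)
```

Measured against the post-mirroring space the system looks 100% full, yet only a quarter of the raw capacity holds benchmark data.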

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the Top Ten Performance results both in absolute terms and also normalized around 1ms latency.

Here are the Top Ten highest performing systems as of April 22, 2015, with vendor results links if you want to look at things in detail:

  1. Hitachi Virtual Storage Platform G1000
  2. Kaminario K2
  3. Huawei OceanStor 18800
  4. IBM Power Server 780
  5. NetApp FAS8080
  6. Huawei OceanStor 6800 V3
  7. HDS VSP
  8. HP XP P9500 (same as the VSP above, HP resells it as their high end offering)
  9. Huawei OceanStor Dorado 5100
  10. IBM SVC with V7000
  11. IBM System Storage DS8870

I will show columns that explain the results of each vendor around 1ms. Why 1ms and not more or less? Because in the Top Ten SPC-1 performance list, most results show fairly low latency, but some have very high latency, and it’s useful to show performance at that lower latency point, which is becoming the latency standard for All-Flash systems. 1ms seems to be a good point for multi-function SSD systems (vs simpler, smaller but more speed-optimized architectures like the NetApp EF560).

The way you determine the 1ms latency point is by looking at the graph that shows latency vs SPC-1 IOPS. Let’s pick IBM’s 780 since it has a very interesting curve so you learn what to look for.

From page 5 of the IBM 780 SPC-1 report:

IBM780

IBM’s submitted SPC-1 IOPS are high but at a huge latency number for an all-SSD solution (18.90ms). Not very useful for customers picking an all-SSD system. Even the next load point, with an average latency of 6.41ms, is high for an all-flash solution.

To more accurately compare this to the rest of the vendors with decent latency, you need to look at the chart around 1ms.

They didn’t publish a load point close to 1ms so I’ll “grant” them 200,000 SPC-1 IOPS at that point (the chart shows it’s probably less but it’s OK, it makes no difference to the overall standing in the end).

You can do a similar exercise for the rest, it’s worth a look – I don’t want to paste all these graphs since this post will get too big and firmly in tl;dr territory if it isn’t already :)
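If you would rather not eyeball every chart, the exercise can be sketched in Python: take the published load points and pick the one whose average latency is closest to the target. The curve below is hypothetical, invented purely for illustration and taken from no SPC-1 report, and the tolerance parameter is my own addition to flag cases where no load point sits near the target and an estimate from the chart is needed.

```python
def nearest_load_point(points, target_ms, tolerance_ms=None):
    """Return the (latency_ms, iops) pair whose latency is closest to target_ms.

    If tolerance_ms is given and no point falls within it, return None so the
    caller knows a chart-based estimate is needed instead.
    """
    best = min(points, key=lambda p: abs(p[0] - target_ms))
    if tolerance_ms is not None and abs(best[0] - target_ms) > tolerance_ms:
        return None
    return best


# Hypothetical response-time curve: (average latency in ms, SPC-1 IOPS).
curve = [
    (0.35, 50_000),
    (0.60, 250_000),
    (0.95, 400_000),
    (1.80, 475_000),
    (4.20, 500_000),
]

print(nearest_load_point(curve, target_ms=1.0))
print(nearest_load_point(curve, target_ms=10.0, tolerance_ms=0.5))
```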

Here’s the table with the current Top Ten SPC-1 Performance results as of 4/22/2015. Click on it for a clearer picture, there’s a lot going on.

8080SPC1chart

What do the results show?

Predictably, all-flash systems trump disk-based and hybrid systems for performance and can offer very nice $/SPC-1 IOPS numbers. That is the major allure of flash – high performance density.

Some takeaways from the comparison:

  • Once adjusted for 1ms latency and list price, the results shift dramatically: what was once awesome suddenly is no more.
  • The other vendors used RAID10 – NetApp used RAID-DP (similar to RAID6 in protection). What would happen to their results if they switched to RAID6 to provide a similar level of protection and efficiency?
  • Some vendors try to fit a lot of the benchmark in RAM. I show that calculation as “Working Set Size as a % of RAM”. You want that number to be comfortably bigger than 100%. 100% and under means there’s a high likelihood much of the I/O was cached in RAM. This is important – and possibly explains why some vendors used such a small capacity (indeed, on the verge of legality within the SPC-1 rules). FYI, the “hot” data in SPC-1 is about 6.75% of the overall capacity used.
  • Aside from the NetApp FAS result, the rest of the Top Ten Performance submissions offer vastly lower Application Utilization – about half! Which means that NetApp is able to use 2x the capacity vs raw compared to the other submissions. And that’s before starting to count the possible storage efficiencies we can turn on like dedupe and compression.
  • No competitor system offers the sheer functionality the FAS8080 does – not even close.
  • Certain competitors have very questionable viability and/or tiny market penetration, making them a risky proposition for a high end system purchase.
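For the “Working Set Size as a % of RAM” takeaway, here is one plausible way to reproduce that kind of figure in Python. I am assuming the working set is simply the roughly 6.75% hot fraction of the capacity the benchmark used, divided by controller RAM; treat the formula as an illustration of the idea rather than the exact calculation behind the chart. The capacities in the usage lines are hypothetical.

```python
HOT_FRACTION = 0.0675  # share of the used capacity that is "hot", per the takeaway above


def working_set_pct_of_ram(used_capacity_gb: float, ram_gb: float) -> float:
    """Hot working set expressed as a percentage of controller RAM (assumed formula)."""
    working_set_gb = used_capacity_gb * HOT_FRACTION
    return 100 * working_set_gb / ram_gb


# Hypothetical configurations, both with 1,024 GB of RAM.
print(f"{working_set_pct_of_ram(20_000, 1_024):.1f}%")  # working set larger than RAM
print(f"{working_set_pct_of_ram(8_000, 1_024):.1f}%")   # working set fits in RAM
```

A result well above 100% means the hot data cannot simply sit in RAM, so the back-end storage is genuinely being exercised.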

Overall – the all-flash FAS8080EX gets a pretty amazing performance and efficiency result, especially given the sheer amount of functionality it offers.

How does one pick a flash array?

It depends. What are you trying to do? Solve a tactical problem? Just need a lot of extra speed and far lower latency for some workloads? No need for the array to have a ton of functionality? A lot of the data management happens in the application? Need something cost-effective, simple yet reliable? Then an all-flash system like the NetApp EF560 is a solid answer, and it can still be front-ended by a Clustered ONTAP system to provide more functionality if the need arises in the future (we are firm believers in hardware reuse and investment protection – you see, some companies talk about Software Defined Storage, we do Software Defined Storage).

On the other hand, if you would prefer an Enterprise architecture that can serve as the cornerstone of your datacenter for almost any workload and protocol, offers rich data management functionality and tight application integration, insane scalability and offers the most features (reliably) compared to any other platform – then the FAS line running Clustered Data ONTAP is the only possible answer.

Couple that with OnCommand Insight – the best multivendor fabric management tool on the planet – plus Workflow Automation, and we’ve got you covered.

Thx

D

Marketing fun: NetApp industry first of up to 13 million IOPS in a single rack

I’m seeing some really “out there” marketing lately, with every vendor (including us) trying to find an angle that sounds exciting while not being an outright lie (most of the time).

A competitor recently claimed an industry first of up to 1.7 million (undefined type) IOPS in a single rack.

The number (which admittedly sounds solid) got me thinking. Was the “industry first” that nobody else did up to 1.7 million IOPS in a single rack?

Would that statement also be true if someone else did up to 5 million IOPS in a rack?

I think that, in the world of marketing, it would – since the faster vendor doesn’t do up to 1.7 million IOPS in a rack, they do up to 5! It’s all about standing out in some way.

Well – let’s have some fun.

I can stuff 21x EF560 systems in a single rack.

Each of those systems can do 650,000 random 4K reads at a stable 800 microseconds (since I like defining my performance stats), 600,000 random 8K reads at under 1ms, and over 300,000 random 32KB reads at under 1ms. Also 12GB/s of large-block sequential reads. This is I/O straight from the SSDs and not RAM cache (the I/O from cache can of course be higher but let’s not count that).

See here for the document showing some of the performance numbers.

Well – some simple math shows a standard 42U rack fully populated with EF560 will do the following:

  • 13,650,000 IOPS.
  • 252GB/s throughput.
  • Up to 548TB of usable SSD capacity using DDP protection (up to 639TB with RAID5).
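The “simple math” for the IOPS and throughput figures is spelled out below as a short Python sketch. It assumes each EF560 occupies 2U, which is what lets 21 of them fill a 42U rack, and that performance scales linearly across independent systems. The capacity figure is left out since it depends on drive counts and protection overhead.

```python
RACK_UNITS = 42
SYSTEM_UNITS = 2            # assumption: each EF560 occupies 2U
IOPS_PER_SYSTEM = 650_000   # random 4K reads per system
GBPS_PER_SYSTEM = 12        # sequential read throughput per system, GB/s

systems_per_rack = RACK_UNITS // SYSTEM_UNITS
rack_iops = systems_per_rack * IOPS_PER_SYSTEM
rack_gbps = systems_per_rack * GBPS_PER_SYSTEM

print(f"{systems_per_rack} systems, {rack_iops:,} IOPS, {rack_gbps} GB/s")
```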

Not half bad.

Doesn’t quite roll off the tongue though – industry first of up to thirteen million six hundred and fifty thousand IOPS in a single rack. :)

I hope rounding down to 13 million is OK with everyone.

 

D

NetApp Posts Top Ten SPC-1 Price-Performance Results for the new EF560 All-Flash Array

<edit: updated with the changes in the SPC-1 price/performance lineup as of 3/27/2015, fixed some typos>

I’m happy to say that today we announced the new, third-gen EF560 all-flash array, and also posted SPC-1 results showing the impressive performance it is capable of in this extremely difficult benchmark.

If you have no time to read further – the EF560 achieves, by far, the absolute best price/performance at very low latencies in the SPC-1 benchmark.

The EF line has been enjoying great success for some time now with huge installations in some of the biggest companies in the world with the highest profile applications (as in, things most of us use daily).

The EF560 is the latest all-flash variant of the E-Series family, optimized for very low latency and high performance workloads while ensuring high reliability, cost effectiveness and simplicity.

EF560 highlights

The EF560 runs SANtricity – a lean, heavily optimized storage OS with an impressively short path length (the overhead imposed by the storage OS itself to all data going through the system). In the case of the EF the path length is tiny, around 30 microseconds. Most other storage arrays have a much longer path length as a result of more features and/or coding inefficiencies.

Keeping the path length this impressively short is one of the reasons the EF does away with fashionable All-Flash features like compression and deduplication –  make no mistake, no array that performs those functions is able to sustain that impressively short a path length. There’s just too much in the way. If you really want data reduction and an incredible number of features, we offer that in the FAS line – but the path length naturally isn’t as short as the EF560’s.

A result of the short path length is impressively low latency while maintaining high IOPS with a very reasonable configuration, as you will see further in the article.

Some other EF560 features:

  • No write cliff due to SSD aging or fullness
  • No performance impact due to SSD garbage collection
  • Enterprise components – including SSDs
  • Six-nines availability
  • Up to 120x 1.6TB SSDs per system (135TB usable with DDP protection, even more with RAID5/6)
  • High throughput – 12GB/s reads, 8GB/s writes per system (many people forget that DB workloads need not just low latency and high IOPS but also high throughput for certain operations).
  • All software is included in the system price, apart from encryption
  • The system can do snaps and replication, including fully synchronous replication
  • Consistency Group support
  • Several application plug-ins
  • There are no NAS capabilities but instead there is a plethora of block connectivity options: FC, iSCSI, SAS, InfiniBand
  • The usual suspects of RAID types – 5, 10, 6 plus…
  • DDP – Dynamic Disk Pools, a type of declustered RAID6 implementation that performs RAID at the sub-disk level – very handy for large pools, rapid disk rebuilds with minimal performance impact and overall increased flexibility (for instance, you could add a single disk to the system instead of entire RAID groups’ worth)
  • T10-PI to help protect against insidious data corruption that might bypass RAID and normal checksums, and provide end-to-end protection, from the application all the way to the storage device
  • Can also be part of a Clustered Data ONTAP system using the FlexArray license on FAS.

The point of All-Flash Arrays

Going back to the short path length and low latency discussion…

Flash has been a disruptive technology because, if used properly, it allows an unprecedented performance density, at increasingly reasonable costs.

The users of All-Flash Arrays typically fall in two camps:

  1. Users that want lots of features, data reduction algorithms, good but not deterministic performance and not crazy low latencies – 1-2ms is considered sufficient for this use case (with the occasional latency spike), as it is better than hybrid arrays and way better than all-disk systems.
  2. Users that need the absolute lowest possible latency (starting in the microseconds – and definitely less than 1ms worst-case) while maintaining uncompromising reliability for their applications, and are willing to give up certain features to get that kind of performance. The performance for this type of user needs to be deterministic, without weird latency spikes, ever.

The low latency camp typically uses certain applications that need very low latency to generate more revenue. Every microsecond counts, while failures would typically mean significant revenue loss (to the point of making the cost of the storage seem like pocket change).

Some of you may be reading this and be thinking “so what, 1ms to 2ms is a tiny difference, it’s all awesome”. Well – at that level of the game, 2ms is twice the latency of 1ms, and it is a very big deal indeed. For the people that need low latency, a 1ms latency array is half the speed of a 500 microsecond array, even if both do the same IOPS.
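To see why latency matters even when the IOPS number is identical, consider a single thread that has to wait for each I/O to finish before issuing the next one. The Python sketch below works that out, and uses Little’s law to show how many I/Os must be in flight to hit a given IOPS figure at each latency. The 200,000 IOPS target is an arbitrary number picked for illustration.

```python
def serial_ops_per_second(latency_ms: float) -> float:
    """Operations per second a single thread achieves when each I/O must
    finish before the next one is issued."""
    return 1000 / latency_ms


def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Little's law: average I/Os in flight = arrival rate x time in system."""
    return iops * latency_ms / 1000


for latency in (0.5, 1.0, 2.0):
    print(f"{latency} ms -> {serial_ops_per_second(latency):,.0f} serial ops/s, "
          f"{outstanding_ios(200_000, latency):,.0f} I/Os in flight for 200,000 IOPS")
```

Both arrays can reach the same IOPS, but the slower one only gets there with twice the concurrency, and any serialized step in the application runs at half the rate.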

You may also be thinking “SSDs that fit in a server’s PCI slot have low latency, right?”

The answer is yes, but what’s missing is the reliability a full-fledged array brings. If the server dies, access is lost. If the card dies, all is lost.

So, when looking for an All-Flash Array, think about what type of flash user you are. What your business actually needs. That will help shape your decisions.

All-Flash Array background operations can affect latency

The more complex All-Flash Arrays have additional capabilities compared to the ultra-low-latency gang, but also have a higher likelihood of producing relatively uneven latency under heavy load while full, and even latency spikes (besides their naturally higher latency due to the longer path length).

For instance, things like cleanup operations, various kinds of background processing that kicks off at different times, and different ways of dealing with I/O depending on how full the array is, can all cause undesirable latency spikes and overall uneven latency. It’s normal for such architectures, but may be unacceptable for certain applications.

Notably, the EF560 doesn’t suffer from such issues. We have been beating competitors in difficult performance situations with the slower predecessors of the EF560, and we will keep doing it with the new, faster system :)

Enough already, show me the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a block-based benchmark with its own I/O blend and, as such, the results from any vendor’s SPC-1 Result should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend and hotspots of the SPC-1 workload). Indeed, the tested configuration could perform way more “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The EF560 SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 by Price-Performance” systems page so you can compare to other submissions (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 lists).

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

  • Latency vs IOPS – many submissions will show high IOPS at huge latency, which would be rather useless for the low-latency crowd
  • Sustainability – was performance even or are there constant huge spikes?
  • RAID level – most submissions use RAID10 for speed, what would happen with RAID6?
  • Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.
  • Price – discounted or list?

Let’s go over these one by one.

Latency vs IOPS

Our average latency was 0.93ms at 245,011.76 SPC-1 IOPS, and extremely flat during the test:

EF560distrib

Sustainability

The SPC-1 rules state the minimum runtime should be 8 hours. There was no significant variation in performance during the test:

EF560distramp

RAID level

RAID-10 was used for all testing, with T10-PI Data Assurance enabled (which has a performance penalty but the applications these systems are used for typically need paranoid data integrity). This system would perform slower with RAID5 or RAID6. But for applications where the absolute lowest latency is important, RAID10 is a safe bet, especially with systems that are not write-optimized for RAID6 writes like Data ONTAP is. Not to fret though – the price/performance remained stellar as you will see.

Application Utilization

Our Application Utilization was a very high 46.90% – among the highest of any submission with RAID10 (and among the highest overall, only Data ONTAP submissions can go higher due to RAID-DP).

EF560CapUtil

We did almost completely fill up the resulting RAID10 space, to show that the system’s performance is unaffected when very full. However, Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used and signifies how space-efficient the storage was.

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% :)

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the SPC-1 Top Ten Price-Performance results both in absolute terms and also normalized around 500 microsecond latency to illustrate the fact that very low latency with great performance is still possible at a compelling price point with this solution.

Why 500 microseconds you might ask? Because that’s a good place for very low latency flash storage systems. Why not 1 millisecond you might also ask? Well, 1ms is more commonly found on systems that have more features and don’t concentrate on low latency as much (1ms is half the speed of 500 microseconds).

Here are the Top Ten Price-Performance systems as of March 27, 2015, with SPC-1 Results links if you want to look at things in detail:

  1. X-IO ISE 820 G3 All Flash Array
  2. Dell Storage SC4020 (6 SSDs)
  3. NetApp EF560 Storage System
  4. Huawei OceanStor Dorado2100 G2
  5. HP 3PAR StoreServ 7400 Storage System
  6. FUJITSU ETERNUS DX200 S3
  7. Kaminario K2 (28 nodes)
  8. Huawei OCEANSTOR Dorado 5100
  9. Huawei OCEANSTOR Dorado 2100
  10. FUJITSU ETERNUS DX100 S3

I will show columns that explain the results of each vendor around 500 microseconds, plus how changing the latency target affects SPC-1 IOPS and also how it affects $/SPC-1 IOPS.

The way you determine that lower latency point (SPC calls it “Average Response Time”) is by looking at the graph that shows latency vs SPC-1 IOPS and finding the load point closest to 500 microseconds. Let’s pick Kaminario’s K2 so you learn what to look for:

K2curve

Notice how the SPC-1 IOPS around half a millisecond is about 10x slower than the performance around 3ms latency. The system picks up after that very rapidly, but if your requirements are for latency to not exceed 500 microseconds, you will be better off spending your money elsewhere (indeed, a very high profile client asked us for 400 microsecond max response at the host level from the first-gen EF systems for their Oracle DBs – this is actually very realistic for many market segments).

Here’s the table with all this analysis done for you. BTW, the “adjusted latency” $/SPC-1 IOPS is not something in the SPC-1 Reports but simply calculated for our example by dividing system price by the SPC-1 IOPS found at the 500 microsecond point in all the reports.
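The “adjusted latency” calculation is just a division, but it can reorder a ranking. Here is a small Python sketch with two hypothetical systems. The names, prices and IOPS figures are invented for illustration only and correspond to no real submission.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    name: str
    list_price: float      # total tested-configuration price, USD
    submitted_iops: float  # SPC-1 IOPS at the 100% load point
    iops_at_target: float  # SPC-1 IOPS at the load point nearest the latency target

    def submitted_ratio(self) -> float:
        """$/SPC-1 IOPS as submitted."""
        return self.list_price / self.submitted_iops

    def adjusted_ratio(self) -> float:
        """$/SPC-1 IOPS using the IOPS found near the latency target."""
        return self.list_price / self.iops_at_target


# Hypothetical systems, not taken from any SPC-1 report.
systems = [
    Submission("System A", 300_000, 600_000, 60_000),
    Submission("System B", 200_000, 250_000, 125_000),
]

by_submitted = sorted(systems, key=Submission.submitted_ratio)
by_adjusted = sorted(systems, key=Submission.adjusted_ratio)

for s in systems:
    print(f"{s.name}: submitted ${s.submitted_ratio():.2f}/IOPS, "
          f"adjusted ${s.adjusted_ratio():.2f}/IOPS")
print("best as submitted:", by_submitted[0].name)
print("best at target latency:", by_adjusted[0].name)
```

A system that wins on the headline ratio can lose badly once only the IOPS delivered near the latency target are counted.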

What do the results show?

As submitted, the EF560 is #3 in the absolute Price-Performance ranking. Interestingly, once adjusted for latency around 500 microseconds at list prices (to keep a level playing field), the price/performance of the EF560 is far better than anything else on the chart.

Regarding pricing: some vendors show discounted pricing and some do not, so always check the SPC-1 report for the prices and don’t just read the summary at the beginning (for example, Fujitsu shows 30% discounts in the reports, while Dell, X-IO and HP are all at 45% off – the rest aren’t discounted).

Our price-performance is even better once you adjust for discounts in some of the other results. Update: In this edited version of the chart I show the list price calculations as well. We are #1 in price/performance when adjusted for list pricing even at the higher submitted latencies for all vendors… :)

Another interesting observation is the effect of longer path length on some platforms – for instance, Dell’s lowest reported latency is 0.70ms at a mere 11,249.97 SPC-1 IOPS. Clearly, that is not a system geared towards high performance at very low latency. In addition, the response time for the submitted max SPC-1 IOPS for the Dell system is 4.83ms, firmly in the “nobody cares” category for all-flash systems :) (sorry guys).

Conversely… The LRT (Least Response Time) we submitted for the EF560 was a tiny 0.18ms (180 microseconds) at 24,501.04 SPC-1 IOPS. This is the lowest LRT anyone has ever posted on any array for the SPC-1 benchmark.

Clearly we are doing something right :)

Final thoughts

If your storage needs require very low latency coupled with very high reliability, the EF560 would be an ideal candidate. In addition, the footprint of the system is extremely compact: the SPC-1 results shown are with just a 2U EF560 with 24x 400GB SSDs.

Coupled with Clustered Data ONTAP systems and OnCommand Insight and WorkFlow Automation, NetApp has an incredible portfolio, able to take on any challenge.

Thx

D


Beware of storage performance guarantees

Ah, nothing to bring joy to the holidays like a bit of good old-fashioned sales craziness.

Recently we started seeing weird performance “guarantees” by some storage vendors who, it seems, will try anything for a sale.

Probably by people that haven’t read this.

It goes a bit like this:

“Mr. Customer, we guarantee our storage will do 100,000 IOPS no matter the I/O size and workload”.

Next time a vendor pulls this, show them the following chart. It’s a simple plot of I/O size vs throughput for 100,000 IOPS:

Throughput IO Size

Notice that at a 1MB I/O size the throughput is a cool 100GB/s :)
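The arithmetic behind the chart is just IOPS multiplied by I/O size. Here is a small sketch, assuming decimal units (1 GB = 1,000,000 KB) and an arbitrary set of I/O sizes picked for illustration:

```python
IOPS = 100_000  # the "guaranteed" number from the sales pitch


def throughput_gb_per_s(iops: int, io_size_kb: float) -> float:
    """Throughput in GB/s, using decimal units (1 GB = 1,000,000 KB)."""
    return iops * io_size_kb / 1_000_000


# Illustrative I/O sizes in KB; 1000 KB stands in for the 1MB case.
for io_size_kb in (4, 8, 32, 64, 256, 1000):
    gb_s = throughput_gb_per_s(IOPS, io_size_kb)
    print(f"{io_size_kb:>5} KB I/O -> {gb_s:6.1f} GB/s")
```

The same 100,000 IOPS is a modest 0.4GB/s at 4KB and an absurd 100GB/s at 1MB, which is why an IOPS guarantee without an I/O size is meaningless.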

Then ask that vendor again if they’re sure they still want to make that guarantee. In writing. With severe penalties if it’s not met. As in free gear UNTIL the requirement is met. At any point during the lifetime of the equipment.

Then sit back and enjoy the backpedalling. 

You can make it even more fun, especially if it’s a hybrid storage vendor (mixed spinning and flash storage for caching, with or without autotiering):

  • So you will guarantee those IOPS even if the data is not in cache?
  • For completely random reads spanning the entire pool?
  • For random overwrites? (that should be a fun one, 100GB/s of overwrite activity).
  • For non-zero or at least not crazily compressible data?
  • And what’s the latency for the guarantee? (let’s not forget the big one).
  • etc. You get the point.
 
Happy Holidays everyone!
 
Thx
 
D
 

 

When competitors try too hard and miss the point – part two

This will be another FUD-busting post in the two-part series (first part here).

It’s interesting how some competitors, in their quest to beat us at any cost, set aside all common sense.

Recently, an Oracle blogger attempted to understand a document NetApp originally wrote in the 90’s (and which we haven’t really updated since, which is admittedly our bad) that explains how WAFL, the block layout engine of Data ONTAP (the storage OS on the FAS platform) works at a high level.

Apparently, he thinks that we turn everything into 4K I/Os, so if someone tried to read 256K, it would have to become 64 separate I/Os, and, by extension, believes this means no NetApp system running ONTAP can ever sustain good read throughput since the back-end would be inundated with IOPS.

The conclusions he comes to are interesting to say the least. I will copy-paste one of the calculations he makes for a 100% read workload:

Erroneous oracle calcs

I like the SAS logo, I guess this is meant to make the numbers look legit, as if they came from actual SAS testing :)

So this person truly believes that to read 2.6GB/s we need 5,120 drives due to the insane back-end IOPS we purportedly generate :)

This would be hilarious if it were true since it would mean NetApp managed to quietly perpetrate the biggest high tech scam in history, fooling customers for 22 years, and somehow managing to become the industry’s #1 storage OS and remain so.

Because customers are that gullible.

Right.

Well – here are some stats from a single 8040 controller (not an HA system with at least 2 controllers, I really mean a single controller doing work, not two or more), with 24 drives, doing over 2.7GB/s reads, at well under 1ms latency, so it’s not even stressed. Thanks to the Australian team for providing the stats:

8040 singlenode

In this example, 2.74GB/s are being read. From stable storage, not cache.

Now, if we do the math the way the competitor would like, it means the back-end is running at over 700,000 4K IOPS. On a single mid-range controller :)

That would be really impressive and hugely wasteful at the same time. Wait – maybe I should turn this around and claim 700,000 4K IOPS at 0.6ms capability per mid-range controller! Imagine how fast the big ones go!

It would also assume 35,000 IOPS per disk at a consistent speed and sub-millisecond response (0.64ms) – because the numbers above are from a single node with only about 20 data SSDs (plus parity and spares).
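For anyone who wants to check the numbers, here is the competitor-style math as a quick sketch. It assumes binary units (1GB = 1024^3 bytes, 4K = 4096 bytes) and uses the round figure of 20 data SSDs from above:

```python
KIB = 1024
GIB = 1024 ** 3


def implied_backend_iops(throughput_gib_s: float, block_kib: int = 4) -> float:
    """Back-end IOPS if every read were split into fixed-size blocks."""
    return throughput_gib_s * GIB / (block_kib * KIB)


observed_read_gib_s = 2.74  # from the single-controller stats above
data_ssds = 20              # approximate data SSD count from the text

iops = implied_backend_iops(observed_read_gib_s)
per_ssd = iops / data_ssds

print(f"Implied 4K back-end IOPS: {iops:,.0f}")
print(f"Implied IOPS per data SSD: {per_ssd:,.0f}")
```

That works out to a bit over 700,000 IOPS in total and roughly 35,000 per SSD, which is the implausible result the 4K-only assumption forces.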

SSDs are fast but they’re not really that fast, and the purpose of this blog is to illuminate and not obfuscate.

Remember Occam’s razor. What explanation do you think makes more sense here? Pixie-dust drives and controllers, or that the Oracle blogger is massively wrong? :)

Another example – with spinning disks this time

This is a different output, to also illustrate our ability to provide detailed per-disk statistics.

These numbers are from a single 8060 node, running at over 3GB/s reads during an actual RMAN job and not a benchmark tool (to use a real Oracle application example). There are 192x 10,000 RPM 600GB disks in the config (168x data, 24x parity – we run dual-parity RAID, there were 12x 16-drive RAID groups in a 14+2 config).

Numbers kindly provided by the legendary neto from Brazil (@netofrombrazil on Twitter). Check the link for his blog and all kinds of DB coolness.

This is part of the statit command’s output. I’m not showing all the disks since there are 192 of them after all and each one is a line in the output:

Read chain

The key in these stats is the “chain” column. This shows, per read command, how many blocks were read as a single entity. In this case, the average is about 49, or 196KB per read operation.

Notice the “xfers” – these drives are only doing about 88 physical IOPS on average per drive, and each operation just happens to be large. They could go faster (see the “ut%” column) but that’s just how much they were loaded during the RMAN job.

Again, if we used the blogger’s calculations, this system would have needed over 5,000 drives and generated over 750,000 back-end disk IOPS.
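Here is the same comparison as a quick sketch, using the approximate averages quoted above (chain of 49 blocks, 88 transfers per second per drive, 192 drives). Treating every drive as equally loaded is a simplification for illustration, and the units are binary (1MB = 1024KB):

```python
BLOCK_KB = 4          # WAFL block size
avg_chain = 49        # average blocks read per physical read (the "chain" column)
xfers_per_drive = 88  # average physical IOPS per drive (the "xfers" column)
drives = 192          # simplification: assume all drives are equally loaded

io_size_kb = avg_chain * BLOCK_KB                      # size of one physical read
per_drive_mb_s = xfers_per_drive * io_size_kb / 1024   # throughput per drive
aggregate_gb_s = per_drive_mb_s * drives / 1024        # throughput for the node
actual_backend_iops = xfers_per_drive * drives         # what the disks really did

# The blogger's assumption: every read is split into separate 4K I/Os.
naive_4k_iops = 3 * 1024 * 1024 // BLOCK_KB            # 3GB/s expressed as 4K reads

print(f"Average physical read size: {io_size_kb} KB")
print(f"Per-drive throughput: {per_drive_mb_s:.1f} MB/s")
print(f"Aggregate throughput: {aggregate_gb_s:.2f} GB/s")
print(f"Actual back-end IOPS: {actual_backend_iops:,}")
print(f"IOPS under the 4K-only assumption: {naive_4k_iops:,}")
```

Under these assumptions the disks deliver a little over 3GB/s with fewer than 17,000 physical IOPS, versus the more than 750,000 the 4K-only model would require.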

A public apology and retraction would be nice, guys…

Let’s extrapolate this performance at scale.

My examples are for single mid-range controllers. You can multiply that by 24 to see how fast it could go in a full cluster (yes, it’s linear). And that’s not the max these systems will do – just what was in the examples I found that were close to the competitor’s read performance example.

You see, where most of the competition is still dealing with 2-controller systems, NetApp FAS systems running Clustered ONTAP can run 8 engines for block workloads and 24 engines for NAS (8 if mixed), and each engine can have multiple TB of read/write cache (18TB max cache per node currently with ONTAP 8.2.x).

Even if a competitor’s 2 engines are faster than 2 FAS engines, if they stop at 2 and FAS stops at 24, the fight is over before it begins.

People that live in glass houses shouldn’t throw stones.

Since the competitor questioned why NetApp bought Engenio (the acquisition for our E-Series), I have a similar question: Why did Oracle buy Pillar Data? It was purchased after the Sun acquisition. Does that signify a major lack in the ZFS boxes that Pillar is supposed to address?

The Oracle blogger mentioned how their ZFS system had a great score in the SPC-2 tests (which measure throughput and not IOPS). Great.

Interestingly, Oracle ZFS systems can significantly degrade in performance over time (see here http://blog.delphix.com/uday/2013/02/19/78/) especially after writes, deletes and overwrites. Unlike ONTAP systems, ZFS boxes don’t have mechanisms to perform the necessary block reallocations to optimize the data layout in order to bring performance back to original levels (backing up, wiping the box, rebuilding and restoring is not a solution, sorry). There are ways to delay the inevitable, but nothing to fix the core issue.

It follows that the ZFS performance posted in the benchmarks may not be anywhere near what one will get long-term once the ZFS pools are fragmented and full, which makes the ZFS SPC-2 benchmark result pretty useless.

NetApp E-Series inherently doesn’t have this fragmentation problem (and is near the top as a price-performance leader in the SPC-2 benchmark, as tested by SGI that resells it). Since there is no long-term speed deterioration issue with E-Series, the throughput you see in the SPC-2 benchmark will be perpetually maintained. The box is in it for the long haul.

Wouldn’t E-Series then be a better choice for a system that needs to constantly deal with such a workload? Both cost-effective and able to sustain high throughput no matter what?

As an aside, I do need to write an article on block layout optimizations available in ONTAP. Many customers are unaware of the possibilities, and competitors use FUD based on observations from back when mud was a novelty. In the meantime, if you’re a NetApp FAS customer, ask your SE and/or check your documentation for the volume option read_realloc space_optimized – great for volumes containing DB data files. Also, check the documentation for the Aggregate option free_space_realloc.

So you’re fast. What else can you do?

There were other “fighting words” in the blogger’s article and they were all about speed and how much faster the new boxes from the competitor are versus some ancient boxes they had from us. Amazing, new controllers being faster than old ones! :)

I see this trend recently: new vendors focusing solely on speed. Guess what – it’s easy to go fast. It’s also easy to be cheap. I’ll save that for a full post another time. But I fully accept that speed sells.

I can build you a commodity-based million-IOPS box during my lunch break. It’s really not that hard. Building a server with dozens of cores and TB of RAM is pretty easy.

But for Enterprise Storage, Reliability is extremely important, far more than sheer speed.

Plus Availability and Serviceability (where the RAS acronym comes from).

Predictability.

Non-Disruptive Operations, even during events that would leave other systems down for extended periods of time.

Extensive automation, management, monitoring and alerting at scale as well.

And of crucial importance is Application Integration, including the ability to perform application-aware data manipulation (fully consistent backups, restores, clones, replication).

So if a system can go fast but can’t do much else, it is more useful as a point solution than as part of a large, strategic, long-term deployment. Point solutions are useful, yes – but they are also interchangeable with the next cheap fast thing. Most won’t survive.

You know who you are.

D
