NetApp Enterprise Grade Flash

June 23rd marked the release of new NetApp Enterprise Flash technology. The release consists of:

  • New ONTAP 8.3.1 with significant performance, feature and usability enhancements
  • New Inline Storage Efficiencies
  • New All-Flash FAS systems (AFF works only with SSDs)
  • New, aggressive pricing model
  • New maintenance options (price protection on warranty for up to 7 years, even if you buy less warranty up front)

Enterprise Grade Flash

So what does “Enterprise Grade Flash” mean?


The combination of new technologies that Flash made possible, like high performance and inline storage efficiencies, plus ease of use, plus Enterprise features such as:

  • Deep application integration for clones, backups, restores, migrations
  • Comprehensive automation
  • Scale up
  • Scale out
  • Non-disruptive everything
  • Multiprotocol (SAN and NAS in the same system)
  • Secure multi-tenancy
  • Encryption (FIPS 140-2 validated hardware)
  • Hardened storage OS
  • Enterprise hardware
  • Comprehensive backup, cloning, archive
  • Comprehensive integration with leading backup software
  • Synchronous replication
  • Active-Active Datacenters
  • The ability to have AFF systems in the same cluster as hybrid
  • Seamless data movement between All-Flash and Hybrid
  • Easy replication between All-Flash and Hybrid
  • Cloud capable (storage OS ability to run in the cloud) – with replication to and from the cloud instances of ONTAP
  • QoS and SLO provisioning
  • Inline Foreign LUN Import (to make migrations from third-party arrays less impactful to the users)

The point is that competitor All-Flash offerings typically focus on a few of these points (for example, performance, ease of use and maybe inline storage efficiencies) but are severely lacking in the rest. This limited vision approach creates inflexibility and silos, and neither inflexibility nor silos are particularly desirable elements in Enterprise IT.

But what if you could have Flash without compromises? That’s what we are offering with the new AFF systems. A high performance offering with all the “coolness” but also all the “seriousness” and features Enterprise Grade storage demands.

It’s a bit like this Venn diagram:


AFF simply offers far more flexibility than All-Flash competitors. There may be certain things AFF doesn’t do vs some of the competitors, but they pale vs what AFF does and the competitors cannot.

And even if you don’t need all the features – at least they’re there waiting for you in case you do need them in the future (for instance, you may not need to replicate to the cloud and back today, but knowing that you have the option is reassuring in case your IT strategy changes).


It’s far harder to add serious enterprise data management features to newly built architectures than add new architecture benefits to a platform that already had the Enterprise Grade stuff down pat.

WAFL (the block layout engine of ONTAP) is already naturally well suited to working with SSDs:

  • Avoids modifying data in place
  • Writes to free space
  • Performs I/O coalescing in order to lump many operations in a single large I/O
  • Preserves the temporal locality of user data with metadata to further reduce I/O
  • Achieves a naturally low Write Amplification Factor (it’s worth noting that in all the years we’ve been selling Flash and after hundreds of PB, we have had exactly zero worn out SSDs – they’re not even close to wearing out).

So we optimized ONTAP where it mattered, while keeping the existing codebase where needed. It helps that the ONTAP architecture is already modular – it was fairly straightforward to enhance the parts of the code that dealt with storage media, for instance, or the parts that dealt with the I/O path. This significant optimization started with 8.3.0 and continued with 8.3.1.

The overall effect has been dramatically reduced latencies, enhanced code parallelism, increased resiliency, while at the same time enabling inline storage efficiencies. For customers still on 8.2.x and prior, or on 7-mode ONTAP, the differences with 8.3.1 will be pretty extreme… :)

Ease of Use

New with ONTAP 8.3.1 is a completely redesigned administrative GUI, and a SAN-optimized config for the ability to be serving I/O in 15 minutes from unpacking the system.

In addition, wizards allow the easy creation of LUNs for Databases with just 3 questions, and ONTAP can now be upgraded from the GUI.

Storage Efficiencies

ONTAP has had various flavors of storage efficiencies for a while. Those efficiencies typically had to be turned on manually, and often affected performance.

With ONTAP 8.3.1, Inline Compression is on by default, as is Inline Zero Deduplication (very helpful in VM deployments that use EagerZeroThick). In addition, Always On Deduplication is also available – a deduplication that runs very frequently (every 5 minutes).

In conjunction with already excellent thin provisioning plus state-of-the-art cloning and snapshot capabilities, some excellent efficiency ratios are possible. We can show up to 30:1 for certain kinds of VDI deployments, while things like Databases will not be able to be squeezed quite that much (especially if DB-side compression is already active). The overall efficiency ratio will vary depending on how the system is used.

Ultimately, Storage Efficiencies aim to reduce overall cost. Focus not so much on the actual efficiency ratio. Instead look at the effective price/TB. The efficiency differences between most vendors are probably less than most vendors want you to think they are – in real terms you will not save more than a few SSD’s worth of capacity. The real value lies elsewhere.


The AFF systems are fast. A maximum-size AFF cluster will do about 4 million IOPS at 1ms latency (8K random, with inline efficiencies). Throughput-wise, 100GB/s is possible.

We wanted to have performance stop being a discussion point except in the most extreme of situations. The reality is that any top-tier Flash solution will be fast. Once again – the real value lies elsewhere.

For the curious, here’s an IOPS vs Latency chart for a 2-node AFF8080 system, given that this level of performance is more than enough for the vast majority of customers out there (the max speed of a cluster is 12x what’s shown in this graph):


Many optimizations were implemented in ONTAP 8.3.1 – certain operations are over 4x faster than with ONTAP 8.2.x, and almost 50% faster than with 8.3.0. SSDs, being as fast as they are, benefit from those optimizations the most.

If you want more performance proof points, we have previously shown SSD performance for a very difficult workload (over 60% writes, combination of block sizes, random and sequential I/O) with 8.3.0 in our SPC-1 benchmarks (needs to be updated for 8.3.1 but even the 8.3.0 result is solid). We also have SQL, Oracle and VDI Technical Reports.

We are also always happy to demonstrate these systems for you.


There are some significant pricing changes, leading to an overall far more cost-effective solution. For instance, the cost delta between different controller models with the same amount of storage is far smaller than in the past. The scalability of all the systems is the same (240 SSDs * 1.6TB max currently per 2 nodes). The only difference is performance and the amount of connectivity possible.

Warranty pricing is now stable even if you buy 3 years up front and extend later on.

Oh – and all AFF models now include all NetApp FAS software: All the protocols, all the SnapManager application integration modules, replication, the works.

Final Words

The pace of innovation at NetApp is accelerating dramatically. We had to work hard to bring Clustered ONTAP to feature parity with the older 7-mode, which delayed things. Now with  8.3.x dropping 7-mode altogether, we have many more developers to focus on improving Clustered ONTAP. The big enhancements in 8.3.1 came very rapidly after 8.3.0 became GA… and there’s a lot more to come soon.

Make no mistake: NetApp is a storage giant and has an Engineering organization not to be trifled with.

Now, how do you like your Flash? With or without compromises? :)


Technorati Tags: , , , , , , ,

NetApp posts SPC-1 Top Ten Performance results for its high end systems – Tier 1 meets high functionality and high performance

It’s been a while since our last SPC-1 benchmark submission with high-end systems in 2012. Since then we launched all new systems, and went from ONTAP 8.1 to ONTAP 8.3, big jumps in both hardware and software.

In 2012 we posted an SPC-1 result with a 6-node FAS6240 cluster  – not our biggest system at the time but we felt it was more representative of a realistic solution and used a hybrid configuration (spinning disks boosted by flash caching technology). It still got the best overall balance of low latency (Average Response Time or ART in SPC-1 parlance, to be used from now on), high SPC-1 IOPS, price, scalability, data resiliency and functionality compared to all other spinning disk systems at the time.

Today (April 22, 2015) we published SPC-1 results with an 8-node all-flash high-end FAS8080 cluster to illustrate the performance of the largest current NetApp FAS systems in this industry-standard benchmark.

For the impatient…

  • The NetApp All-Flash FAS8080 SPC-1 submission places the system in the #5 performance spot in the SPC-1 Top Ten by performance list.
  • And #3 if you look at performance at load points around 1ms Average Response Time (ART).
  • The NetApp system uses RAID-DP, similar to RAID-6, whereas the other entries use RAID-10 (typically, RAID-6 is considered slower than RAID-10).
  • In addition, the FAS8080 shows the best storage efficiency, by far, of any Top Ten SPC-1 submission (and without using compression or deduplication).
  • The FAS8080 offers far more functionality than any other system in the list.

We also recently posted results with the NetApp EF560  – the other major hardware platform NetApp offers. See my post here and the official results here. Different value proposition for that platform – less features but very low ART and great cost effectiveness are the key themes for the EF560.

In this post I want to explain the current Clustered Data ONTAP results and why they are important.

Flash performance without compromise

Solid state storage technologies are becoming increasingly popular.

The challenge with flash offerings from most vendors is that customers typically either have to give up a lot in order to get the high performance of flash, or have to combine 4-5 different products into a complex “solution” in order to satisfy different requirements.

For instance, dedicated all-flash offerings may not be able to natively replicate to less expensive, spinning-drive solutions.

Or, a flash system may offer high performance but not the functionality, scalability, reliability and data integrity of more mature solutions.

But what if you could have it all? Performance and reliability and functionality and scalability and maturity? That’s exactly what Clustered Data ONTAP 8.3 provides.

Here are some Clustered Data ONTAP 8.3 running on FAS8080 highlights:

  • All the NetApp signature ultra-tight application integration and automation for replication, SnapShots, Clones
  • Fancy write-accelerated RAID6-equivalent protection by default
  • Comprehensive data integrity and protection against insidious lost write/torn page/misplaced write errors that RAID and normal checksums don’t always catch
  • Non-disruptive data mobility for all protocols
  • Non-disruptive operations – no downtime even when doing things that would require downtime and extensive PS with other vendors
  • Granular QoS
  • Deduplication and compression
  • Highly scalable – 5,760 drives possible in an 8-node cluster, 17,280 drives possible in the max 24 nodes. Various drive types in the cluster, from SSD to SATA and everything else in between.
  • Multiprotocol (FC, iSCSI, NFS, SMB1,2,3) on the same hardware (no “helper” boxes needed, no dedicated SAN vs NAS pools needed)
  • 96,000 LUNs per 8-node cluster (that’s right, ninety-six thousand LUNs – about 50% more than the maximum possible with the other high-end systems)
  • ONTAP is VMware vVol ready
  • The only array that has been validated by VMware for VMware Horizon 6 with vVols – hopefully the competitors will follow our lead
  • Over 460TB (yes, TeraBytes) of usable cache after all overheads are accounted for (and without accounting for cache amplification through deduplication and clones) in an 8-node cluster. Makes competitor maximum cache amounts seem like rounding errors – indeed, the actual figure might be 465TB or more, but it’s OK… :) (and 3x that number in a 24-node cluster, over 1.3PB cache!)
  • The ability to virtualize other storage arrays behind it
  • The ability to have a cluster with dissimilar size and type nodes – no need to keep all engines the same (unlike monolithic offerings). Why pay the same for all nodes when some nodes may not need all the performance? Why be forced to keep all nodes in the same hardware family? What if you don’t want to buy all at once? Maybe you want to upgrade part of the cluster with a newer-gen system? :)
  • The ability to evacuate part of a cluster and build that part as a different cluster elsewhere
  • The ability to have multiple disk types in a cluster and, indeed, dedicate nodes to functions (for instance, have a few nodes all-flash, some nodes with flash-accelerated SAS and a couple with very dense yet flash-accelerated NL-SAS, with full online data mobility between nodes)
That last bullet deserves a picture:


“SVM” stands for Storage Virtual Machine –  it means a logical storage partition that can span one or more cluster nodes and have parts of the underlying capacity (performance and space) available to it, with its own users, capacity and performance limits etc.

In essence, Clustered Data ONTAP offers the best combination of performance, scalability, reliability, maturity and features of any storage system extant as of this writing. Indeed – look at some of the capabilities like maximum cache and number of LUNs. This is designed to be the cornerstone of a datacenter.

it makes most other systems seem like toys in comparison…


FUD buster

Another reason we wanted to show this result was FUD from competitors struggling to find an angle to fight NetApp. It goes a bit like this: “NetApp FAS systems aren’t real SAN, it’s all simulated and performance will be slow!”



Well – for a “simulated” SAN (whatever that means), the performance is pretty amazing given the level of protection used (RAID6-equivalent – far more resilient and capacity-efficient for large pooled deployments than the RAID10 the other submissions use) and all the insane scalability, reliability and functionality on tap :)

Another piece of FUD has been that ONTAP isn’t “flash-optimized” since it’s a very mature storage OS and wasn’t written “from the ground up for flash”. We’ll let the numbers speak for themselves. It’s worth noting that we have been incorporating a lot of flash-related innovations into FAS systems well before any other competitor did so, something conveniently ignored by the FUD-mongers. In addition, ONTAP 8.3 has a plethora of flash optimizations and path length improvements that helped with the excellent response time results. And lots more is coming.

The final piece of FUD we made sure was addressed was system fullness – last time we ran the test we didn’t fill up as much as we could have, which prompted the FUD-mongers to say that FAS systems need gigantic amounts of free space to perform. Let’s see what they’ll come up with this time 😉

On to the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a 100% block-based benchmark with its own I/O blend and, as such, the results from any vendor SPC-1 submission should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend of the SPC-1 workload). Indeed, the tested configuration might perform in the millions of “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 Performance” systems page so you can compare other submissions that are in the upper performance echelon (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 list).

I recommend you look beyond the initial table in each submission showing the performance and $/SPC-1 IOPS and at least go to the price table to see the detail. The submissions calculate $/SPC-1 IOPS based on submitted price but not all vendors use discounted pricing. You may want to do your own price/performance calculations.

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

  • ART vs IOPS – many submissions will show high IOPS at huge ART, which would be rather useless when it comes to Flash storage
  • Sustainability – was performance even or are there constant huge spikes?
  • RAID level – most submissions use RAID10 for speed, what would happen with RAID6?
  • Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.

Let’s go over these one by one.


Our ART was 1.23ms at 685,281.71 SPC-1 IOPS, and pretty flat over time during the test:



The SPC-1 rules state the minimum runtime should be 8 hours. We ran the test for 18 hours to observe if there would be variation in the performance. There was no significant variation:


RAID level

RAID-DP was used for all FAS8080EX testing. This is mathematically analogous in protection to RAID-6. Given that these systems are typically deployed in very large pooled configurations, we elected long ago to not recommend single parity RAID since it’s simply not safe enough. RAID-10 is fast and fine for smaller capacity SSD systems but, at scale, it gets too expensive for anything but a lab queen (a system that nobody in their right mind will ever buy but which benchmarks well).

Application Utilization

Our Application Utilization was a very high 61.92% – unheard of by other vendors posting SPC-1 results since they use RAID10 which, by definition, wastes half the capacity (plus spares and other overheads to worry about on top of that).


Some vendors using RAID10 will fill up the resulting space after RAID, spares etc. to a very high degree, and call out the “Protected Application Utilization” as being the key thing to focus on.

This could not be further from the truth – Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used and signifies how space-efficient the storage was.

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% :)

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the Top Ten Performance results both in absolute terms and also normalized around 1ms ART.

Here are the Top Ten highest performing systems as of April 22, 2015, with vendor results links if you want to look at things in detail:

FYI, the HP XP 9500 and the Hitachi system above it in the list are the exact same system, HP resells the HDS array as their high-end offering.

I will show columns that explain the results of each vendor around 1ms. Why 1ms and not more or less? Because in the Top Ten SPC-1 performance list, most results show fairly low ART, but some have very high ART, and it’s useful to show performance at that lower ART load point, which is becoming the ART standard for All-Flash systems. 1ms seems to be a good point for multi-function SSD systems (vs simpler, smaller but more speed-optimized architectures like the NetApp EF560).

The way you determine the 1ms ART load point is by looking at the table that shows ART vs SPC-1 IOPS. Let’s pick IBM’s 780 since it has a very interesting curve so you learn what to look for.

From page 5 of the IBM Power Server 780 SPC-1 Executive Summary:


IBM’s submitted SPC-1 IOPS are high but at a huge ART number for an all-SSD solution (18.90ms). Not very useful for customers picking an all-SSD system. Even the next load point, with an average ART of 6.41ms, is high for an all-flash solution.

To more accurately compare this to the rest of the vendors with decent ART, you need to look at the table to find the closest load point around 1ms (which, in this case, it’s the 10% load point at 0.71ms – the next one up is much higher at 2.65ms).

You can do a similar exercise for the rest, it’s worth a look – I don’t want to paste all these tables and graphs since this post will get too big. But it’s interesting to see how SPC-1 IOPS vs ART are related and translate that to your business requirements for application latency.

Here’s the table with the current Top Ten SPC-1 Performance results as of 4/22/2015. Click on it for a clearer picture, there’s a lot going on.


Key for the chart (the non-obvious parts anyway):

  • The “SPC-1 Load Level near 1ms” is the load point in each SPC-1 Report that corresponds to the SPC-1 IOPS achieved near 1ms. This is not how busy each array was (I see this misinterpreted all the time).
  • The “Total ASU Capacity” is the amount of capacity the test consumed.
  • The “Physical Storage Capacity” is the total amount of capacity in the array before RAID etc.
  • “Application Utilization” is ASU Capacity divided by Physical Storage Capacity.

What do the results show?

Predictably, all-flash systems trump disk-based and hybrid systems for performance and can offer very nice $/SPC-1 IOPS numbers. That is the major allure of flash – high performance density.

Some takeaways from the comparison:

  • Based on SPC-1 IOPs around 1ms Average Response Time load points, the FAS8080 EX shifts from 5th place to 3rd
  • The other vendors used RAID10 – NetApp used RAID-DP (similar to RAID6 in protection). What would happen to their results if they switched to RAID6 to provide a similar level of protection and efficiency?
  • Aside from the NetApp FAS result, the rest of the Top Ten Performance submissions offer vastly lower Application Utilization – about half! Which means that NetApp is able to use 2x the capacity vs raw compared to the other submissions. And that’s before starting to count the possible storage efficiencies we can turn on like dedupe and compression.

How does one pick a flash array?

It depends. What are you trying to do? Solve a tactical problem? Just need a lot of extra speed and far lower latency for some workloads? No need for the array to have a ton of functionality? A lot of the data management happens in the application? Need something cost-effective, simple yet reliable? Then an all-flash system like the NetApp EF560 is a solid answer, and it can still be front-ended by a Clustered Data ONTAP system to provide more functionality if the need arises in the future (we are firm believers in hardware reuse and investment protection – you see, some companies talk about Software Defined Storage, we do Software Defined Storage).

On the other hand, if you would prefer an Enterprise architecture that can serve as the cornerstone of your datacenter for almost any workload and protocol, offers rich data management functionality and tight application integration, insane scalability, non-disruptive everything and offers the most features (reliably) compared to any other platform – then the FAS line running Clustered Data ONTAP is the only possible answer.

Couple that with OnCommand Insight – the best multivendor fabric management tool on the planet – plus Workflow Automation, and we’ve got you covered.

In summary – the all-flash FAS8080EX gets a pretty amazing performance and efficiency SPC-1 result, especially given the extensive portfolio of functionality it offers. In my opinion, no competitor system offers the sheer functionality the FAS8080 does – not even close. Additionally, I believe that certain competitors have very questionable viability and/or tiny market penetration, making them a risky proposition for a high end system purchase.



Technorati Tags: , , , , , , , ,

How to decipher EMC’s new VNX pre-announcement and look behind the marketing.

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab only etc).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to prove that the increases are not proportionally as much as EMC claims they are, and/or they’re not so much because of software, and, moreover, that some planned obsolescence might be coming the way of the VNX for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

MCX core util arrow

I added the arrows for clarification.

Notice that the chart above left shows the current VNX using, according to EMCmaybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we’ll come back to it. But, currently, if true, this shows extremely poor multi-core utilization. Seems like there is a dedication of processes to cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software,  they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then, comes the interesting part. They did an IOmeter test for the new box only.

They mentioned how the current VNX 7500 would max out at 170,000 8K random reads from SSD (this in itself a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS). And that the current model’s relative lack of performance is due to the fact its software can’t take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O. Which is impressive, indeed, even though that’s hardly a realistic way to prove performance, but I accept the fact they were trying to show how much more read-only speed they could get out of extra cores, plus it’s a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn’t as dramatic as EMC claims… right? :)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremeIO if the new VNX is that fast? What about VMAX? :)

Point #4 above is important. For instance, EMC has been touting multi-core enhancements for years now. The current VNX FLARE release has 50% better core efficiency than the one before, supposedly. And, before that, in 2008, multi-core was advertised as getting 2x the performance vs the software before that. However, the chart above shows extremely poor core efficiency. So which is it? 

Or is it maybe that the box demonstrated is getting most of its speed increase not so much by the magic of better software, but mostly by vastly faster hardware – the fastest Intel CPUs (more clockspeed, not just more cores, plus more efficient instruction processing), latest chipset, faster memory, faster SSDs, faster buses, etc etc. A potential 3-5x faster box by hardware alone.

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1. Since it’s all about the software after all:)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe that there’s some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.



Technorati Tags: , , , , , , ,

Are some flash storage vendors optimizing too heavily for short-lived NAND flash?

I really resisted using the “flash in the pan” phrase in the title… first, because the term is overused and second, because I don’t believe solid state is of limited value. On the contrary.

However, I am noticing an interesting trend among some newcomers in the array business, desperate to find a flash niche to compete in:

Writing their storage OS around very specific NAND flash technologies. Almost as bad as writing an entire storage OS to support a single hypervisor technology, but that’s a story for another day.

Solid state technology is still too fluid. Unlike spinning disk technology that is overall very reliable and mature and likely won’t see huge advances in the years to come, solid state technology seems to advance almost weekly. New SSD controllers are coming out almost too frequently, and new kinds of solid state storage are either out now (Triple Level Cell, anyone?) or coming in the future (MRAM, ReRAM, FeRAM, PCM, PMC, and probably a lot more that I’m forgetting).

My point is:

How far ahead are certain vendors thinking if they are writing an entire storage OS around the limitations of a class of storage that may look very different in just a year or two?

Some of them go really deep and try to do all kinds of clever optimizations to ensure good wear leveling for the flash chips. Some write their own controller software and use bare NAND flash chips, not even off-the-shelf SSDs. Which is great, but what if you don’t need to do that in two years? Or what if the optimizations need to be drastically different for the new technologies? How long will coding for the new flash technologies take? Or will they be stuck using old technologies? Food for thought.

I guess some of us are in it for the long haul, and some aren’t. “Can’t see the forest for the trees” comes to mind. “Gold rush” also seems relevant.

I strongly believe general-purpose storage OSes need to be flexible enough to be reasonably adaptable to different underlying media. And storage OSes that are specifically designed for solid state storage need to be especially flexible regarding the underlying SSD technology to avoid the problems outlined above, and to avoid the relative lack of reliability of current SSD solutions (another story for another day).

At the moment I don’t see clear winners yet. I see a few great short-term stories, but who has the most flexible architecture to be able to deal with different kinds of technologies for years to come?


Technorati Tags: , ,

Buyer beware: is your storage vendor sizing properly for performance, or are they under-sizing technologies like Megacaching and Autotiering?

With the advent of performance-altering technologies (notice the word choice), storage sizing is just not what it used to be.

I’m writing this post because more and more I see some vendors not using scientific methods to size their solution, instead aiming to reach a price point, hoping the technology will work to achieve the requisite performance (and if it doesn’t, it’s sold anyway, either they can give some free gear to make the problem go away, or the customer can always buy more, right?)

Back in the “good old days”, with legacy arrays one could (and still can) get fairly deterministic performance by knowing the workload required and, given a RAID type, know roughly how many disks would be needed to maintain the required performance in a sustained fashion, as long as the controller and buses were not overloaded.

With modern systems, there is now a plethora of options that can be used to get more performance out of the array, or, alternatively, get the same average performance as before, using less hardware (hopefully for less money).

If anything, advanced technologies have made array sizing more complex than before.

For instance, Megacaches can be used to dramatically change the I/O reaching the back-end disks of the array. NetApp FAS systems can have up to 16TB of deduplication-aware, ultra-granular (4K) and intelligent read cache. Truly a gigantic size, bigger than the vast majority of storage users will ever need (and bigger than many customers’ entire storage systems). One could argue that with such an enormous amount of cache, one could dispense with most disk drives and instead save money by using SATA (indeed, several customers are doing exactly that). Other vendors are following NetApp’s lead and starting to implement similar technologies — simply because it makes a lot of sense.


It is crucial that, when relying on caching, extra care is taken to size the solution properly, if a reduction in the number and speed of the back-end disks is desired.

You see, caches only work well if they can cache the majority of what’s called the active working set.

Simply put, the working set is not all your data, but the subset of the data you’re “touching” constantly over a period of time. For a customer that has, say, a 20TB Database, the true working set may only be something as small as 5% — enabling most of the active data to fit in 1TB of cache. So, during daily use, a 1TB cache could satisfy most of the I/O requirements of the DB. The back-end disks could comfortably be just enough SATA to fit the DB.

But what about the times when I/O is not what’s normally expected? Say, during a re-indexing, or a big DB export, or maybe month-end batch processing. Such operations could vastly change the working set and temporarily raise it from 5% to something far larger — at which point, a 1TB cache and a handful of back-end SATA may not be enough.

Which is why, when sizing, multiple measurements need to be taken, and not just average or even worst-case.

Let’s use a database as an example again (simply because the I/O can change so dramatically with DBs).You could easily have the following I/O types:

  1. Normal use – 20,000 IOPS, all random, 8K I/O size, 80% reads
  2. DB exports — high MB/s, mostly sequential write,large I/O size, relatively few IOPS
  3. Sequential read after random write — maybe data is added to the DB randomly, then a big sequential read (or maybe many parallel ones) are launched.

You see, the I/O profile can change dramatically. If you only size for case #1, you may not have enough back-end disk to sustain the DB exports or the parallel sequential table scans. If you size for case 2, you may think you don’t need much cache since the I/O is mostly sequential (and most caches are bypassed for sequential I/O). But that would be totally wrong during normal operation.

If your storage vendor has told you they sized for what generates the most I/O, then the question is, what kind of I/O was it?

The other new trendy technology (and the most likely to be under-sized) is Autotiering.

Autotiering, simply put, allows moving chunks of data around the array depending on their “heat index”. Chunks that are very active may end up on SSD, whereas chunks that are dormant could safely stay on SATA.

Different arrays do different kinds of Autotiering, mostly based on various underlying architectural characteristics and limitations. For example, on an EMC Symmetrix the chunk size is about 7.5MB. On an HDS VSP, the chunk is about 40MB. On an IBM DS8000, SVC or EMC Clariion/VNX, it’s 1GB.

With Autotiering, just like with caching, the smaller the chunk size, the more efficient the end result will ultimately be. For instance, a 7.5MB chunk could need as little as 3-5%% of ultra-fast disk as a tier, whereas a 1GB chunk may need as much as 10-15%, due to the larger size chunk containing not very active data mixed together with the active data.

Since most arrays write data with a geometric locality of reference (in contrast, NetApp uses geometric and temporal), with large-chunk autotiering you end up with pieces of data that are “hot” that always occupy the same chunk as neighboring “cool” pieces of data. This explains why the smaller the chunk, the better off you are.

So, with a large chunk, this can happen:


The array will try to cache as much as it can, then migrate chunks if they are consistently busy or not. But the whole chunk has to move, not just the active bits within the chunk… which may be just fine, as long as you have enough of everything.

So what can you do to ensure correct sizing?

There are a few things you can do to make sure you get accurate sizing with modern technologies.

  1. Provide performance statistics to vendors — the more detailed the better. If we don’t know what’s going on, it’s hard to provide an engineered solution.
  2. Provide performance expectations — i.e. “I want Oracle queries to finish in 1/4th the time compared to what I have now” — and tie those expectations to business benefits (makes it easier to justify).
  3. Ask vendors to show you their sizing tools and explain the math behind the sizing — there is no magic!
  4. Ask vendors if they are sizing for all the workloads you have at the moment (not just different apps but different workloads within each app) — and how.
  5. Ask them to show you what your working set is and how much of it will fit in the cache.
  6. Ask them to show you how your data would be laid out in an Autotiered environment and what bits of it would end up on what tier. How is that being calculated? Is the geometry of the layout taken into consideration?
  7. Do you have enough capacity for each tier? On Autotiering architectures with large chunks, do you have 10-15% of total storage being SSD?
  8. Have the controller RAM and CPU overheads due to caching and autotiering been taken into account? Such technologies do need extra CPU and RAM to work. Ask to see the overhead (the smaller the Autotiering chunk size, the more metadata overhead, for example). Nothing is free.
  9. Beware of sizings done verbally or on cocktail napkins, calculators, or even spreadsheets – I’ve yet to see a spreadsheet model storage performance accurately.
  10. Beware of sizings of the type “a 15K disk can do 180 IOPS” — it’s a lot more complicated than that!
  11. Understand the difference between sequential, random, reads, writes and I/O size for each proposed architecture — the differences in how I/O is done depending on the platform are staggering and can result in vastly different disk requirements — making apples-to-apples comparisons challenging.
  12. Understand the extra I/O and capacity impact of certain CDP/Replication devices — it can be as much as 3x, and needs to be factored in.
  13. What RAID type is each vendor using? That can have a gigantic performance impact on write-intensive workloads (in addition to the reliability aspect).
  14. If you are getting unbelievably low pricing — ask for a contract ensuring upgrade pricing will be along the same lines. “The first hit is free” is true in more than one line of business.
  15. And, last but by no means least — ask how busy the proposed solution will be given the expected workload! It surprises me that people will try to sell a box that can do the workload but will be 90% busy doing so. Are you OK with that kind of headroom? Remember – disk arrays are just computers running specialized software and hardware, and as such their CPU can run out of steam just like anything else.

If this all seems hard — it’s because it is. But see it as due diligence — you owe it to your company, plus you probably don’t want to be saddled with an improperly-sized box for the next 3-5 years, just because the offer was too good to refuse…



Technorati Tags: , , , , , , , , ,