The Importance of the Effective Capacity Ratio in Modern Storage Systems

In modern storage devices (especially All Flash Arrays), extensive data reduction techniques are commonplace and expected by customers.

This has, unavoidably, led to various marketing schemes that aim to make certain systems seem more appealing than the rest. Or at least not less appealing…

I will attempt to explain what customers should be looking for when trying to decipher capacity claims from a manufacturer.

In a nutshell – and for the ADD-afflicted – the most important number you should be looking for is the Effective Capacity Ratio, which is simply: (Effective Capacity)/(Raw Capacity). Ignore the more common but far less useful Data Reduction Ratio, which is: (Effective Capacity)/(Usable Capacity).

Read on for more detail…

The problem

Most modern (and some not so modern) storage vendors claim a similar Data Reduction Ratio for their devices. Numbers like 5:1 seem commonplace these days. Which, to anyone with a modicum of common sense, simply means “I can store five times more stuff in the same space”. Unfortunately, the same 5:1 ratio does not mean the same end result for all vendors…

Why this is a problem

Customers simply want to get a good deal for their money. The reality is that the exact same 5:1 Data Reduction Ratio may mean very different things between storage devices.

For instance: Not all vendors count space savings using the same math. Is the virtual size of snapshots calculated in the final savings ratio? (for example, if I take 5 snaps of a 1TB volume one after another, is that showing up as 6:1 savings?)

How about Thin Provisioning? Is that included? Such math can wildly alter the overall efficiency numbers.

Here’s a clear example of why counting thin provisioning in the overall ratio is probably misleading, and best used only for educational purposes… Remember that overall savings ratios are multiplicative. For example, 2:1 compression and 2:1 deduplication mean 4:1 overall savings:

Thin Craziness

Does anyone really believe that such a system is actually providing 1000:1 savings? Or even 10:1? Yet that’s what many vendors are counting towards savings ratios, without separating the thin provisioning savings from the overall savings number.

However, there is another dimension, and it has to do with how efficiently the raw capacity is actually utilized. It will be one of my many Captain Obvious moments for some of you, but I’ve seen enough people get confused, so it’s worth explaining.

The Effective Capacity Ratio

Different storage systems have different ways of utilizing their raw capacity. For example, a mirrored system can never have better than a 50% Usable:Raw ratio. By definition, it’s mathematically impossible since 2 copies are needed. That’s not even counting spares and other possible overheads.

Systems that do triple mirroring can’t do better than a theoretical 33.3% Usable:Raw etc.

Some definitions are in order:

  • Usable Capacity: How much data I can store in a system after overheads such as RAID, sparing etc. but before data reduction techniques.
  • Raw Capacity: Add up the capacity of all the storage media in the system. Usually a Base 10 number (TB/GB, not Base 2, TiB/GiB)
  • Effective Capacity: How much data I can store in a system after data reduction techniques like Deduplication and Compression but not Thin Provisioning
  • Data Reduction Ratio: (Effective Capacity)/(Usable Capacity)
  • Effective Capacity Ratio: (Effective Capacity)/(Raw Capacity)

Who is More Efficient?

If every vendor is claiming 5:1 average savings, who is truly more efficient? A Reductio ad Absurdum example makes it pretty clear:

  1. A vendor that can do a Data Reduction Ratio of 5:1 but has a Usable Capacity of 10% vs Raw or…
  2. A vendor that can do a Data Reduction Ratio of 5:1 but has a Usable Capacity of 70% vs Raw?

Let’s put some numbers on a table. They roughly correspond to some existing storage vendors today (there may be some variation depending on whether the numbers for each line are TB vs TiB – everyone shows numbers differently – but the overall point remains the same):

Effective Cap Ratio

As you can see, the same Data Reduction Ratio, on the same amount of Raw Capacity, can have wildly different Effective Capacity results, depending on the system.

Clearly, a truly efficient system is one that can both:

  • Provide a high Usable:Raw ratio and
  • Provide a high Data Reduction Ratio (that does not include fluff like Thin Provisioning)

The Business Benefits of a High Effective Capacity Ratio

There are multiple business reasons why chasing a high Effective Capacity Ratio is important:

  • High rack density. Some vendors are eight times more dense than others (ironically, the ones claiming “white box economics” are the least dense and most expensive). This could lead to significant power/cooling and $/rack savings
  • Lower CapEx – if a vendor needs 2x the Raw Capacity vs another, guess who’s paying for the difference?
  • Less complexity – denser devices means less devices, less cables, less switching

What you can do as a Customer

Level the playing field. It’s actually pretty easy:

  • Insist on seeing all capacity numbers expressed as TiB from every vendor – and if they don’t know what that means, run… (or at least subtract 9% from any capacity number you see and you’ll be safer)
  • Insist on seeing the Data Reduction Ratio without including Thin Provisioning or Snapshots. If they can’t break out the savings by category, run
  • Insist on seeing the Usable:Raw percentage for various configurations. Is it better in larger configs vs smaller? Is it following all best practices? What are the best practices for space consumption? If they tell you to ignore the man behind the curtain, run… (there is a theme developing)
  • Do your own Effective Capacity Ratio calculation and assign a $/Effective TiB to each vendor

D

Technorati Tags: , , , , , ,

The Well-Behaved Storage System: Automatic Noisy Neighbor Avoidance

This topic is very near and dear to me, and is one of the big reasons I came over to Nimble Storage.

I’ve always believed that storage systems should behave gracefully and predictably under pressure. Automatically. Even under complex and difficult situations.

It sounds like a simple request and it makes a whole lot of sense, but very few storage systems out there actually behave this way. This creates business challenges and increases risk and OpEx.

The Problem

The simplest way to state the problem is that most storage systems can enter conditions where workloads can suffer from unfair and abrupt performance starvation under several circumstances.

OK, maybe that wasn’t the simplest way.

Consider the following scenarios:

  1. A huge sequential I/O job (backup, analytics, data loads etc.) happening in the middle of latency-sensitive transaction processing
  2. Heavy array-generated workloads (garbage collection, post-process dedupe, replication, big snapshot deletions etc.) happening at the same time as user I/O
  3. Failed drives
  4. Controller failover (due to an actual problem or simply a software update)

#3 and #4 are more obvious – a well-behaved system will ensure high performance even during a drive failure (or three), and after a controller fails over. For instance, if total system headroom is automatically kept at 50% for a dual-controller system (or, simplistically, 100/n, where n is the controller count for shared-everything architectures), even after a controller fails, performance should be fine.

#1 and #2 are a bit more complicated to deal with. Let’s look at this in more detail.

The Case of Competing Workloads During Hard Times

Inside every array, at any given moment, a balancing act occurs. Multiple things need to happen simultaneously.

Several user-generated workloads, for instance:

  • DB
  • VDI
  • File Services
  • Analytics

Various internal array processes – they also are workloads, just array-generated, and often critical:

  • Data reduction (dedupe, compression)
  • Cleanup (object deletion, garbage collection)
  • Data protection (integrity-related)
  • Backups (snaps, replication)

If the system has enough headroom, all these things will happen without performance problems.

If the system runs out of headroom, that’s where most arrays have challenges with prioritizing what happens when.

The most common way a system may run out of headroom is the sudden appearance of a hostile “bully” workload. This is also called a “noisy neighbor”. Here’s an example of system behavior in the presence of a bully workload:

bully_vs_victim_2

 

In this example, the latency-sensitive workload will greatly and unfairly suffer after the “noisy neighbor” suddenly appears. If the latency-sensitive workload is a mission-critical application, this could cause a serious business problem (slow processing of financial transactions, for instance).

This is an extremely common scenario. A lot of the time it’s not even a new workload. Often, an existing workload changes behavior (possibly due to an application change – for instance a patch or a modified SQL query). This stuff happens.

How some vendors have tried to fix the issue with Manual “QoS”

As always, there is more than one way to skin a cat, if one is so inclined. Here are a couple of manual methods to fix workload contention:

  • Some arrays have a simple IOPS or throughput limit that an administrator can manually adjust in order to fix a performance problem. This is an iterative and reactive method and hard to automate properly in real time. In addition, if the issue was caused by an internal array-generated workload, there is often no tooling available to throttle those processes.
  • Other arrays insist on the user setting up minimum, maximum and burst IOPS values for every single volume in the system, upon volume creation. This assumes the user knows in advance what performance envelope is required, in detail, per volume. The reality is that almost nobody knows these things beforehand, and getting the numbers wrong can itself cause a huge problem with latencies. Most people just want to get on with their lives and have their stuff work without babysitting.
Manual mechanisms for fixing the “bully” workload challenge result in systems that are hard to consume and complex to support while under performance pressure. Moreover, when a performance issue occurs, speed of resolution is critical. The issue needs to be resolved immediately, especially for latency-sensitive workloads. Manual methods will simply not be fast enough. Business will be impacted.

How Nimble Storage Fixed the Noisy Neighbor Issue

No cats were harmed in the process. Nimble engineers looked at the extensive telemetry in InfoSight, used data science, and neatly identified areas that could be massively automated in order to optimize system behavior under a wide variety of adverse conditions. Some of what was done:

  • Highly advanced Fair Share disk scheduling (separate mechanisms that deal with different scenarios)
  • Fair Share CPU scheduling
  • Dynamic Weight Adjustment – automatically adjust priorities in various ways under different resource contention conditions, so that the system can always complete critical tasks and not fall dangerously behind

The end result is a system that:

  • Lets system latency increase gracefully and progressively as load increases
  • Carefully and automatically balances user and system workloads
  • Achieves I/O deadlines and preemption behavior
  • Eliminates the Noisy Neighbor problem without the need for any manual QoS adjustments
  • Allows latency-sensitive small-block I/O to proceed without interference from bully workloads
Such automation achieves a better business result: less risk, less OpEx, easy supportability, simple and safe overall consumption even under difficult conditions.

What Should Nimble Customers do to get this Capability?

As is typical with Nimble systems and their impressive Ease of Consumption, nothing fancy needs to be done apart from simply upgrading to a specific release of the code (in this case 3.1 and up – 2.3 did some of the magic but 3.1 is the fully realized vision).

A bit anticlimactic, apologies… if you like complexity, watching this instead is probably more fun than juggling QoS manually.

D

Technorati Tags: , , , ,

Going Green: Why I Joined Nimble Storage

I am proud to announce that, as of today, I am a member of the Nimble Storage team.

Nimble Logo

This marks the end of an era – I spent quite a bit of time at NetApp: learned a lot, did a lot – by the end I had my hands in all kinds of sausage making… 🙂

I wish my friends at NetApp the best of luck for the future. The storage industry is a very tough arena, and one that will be increasingly harder and with less tolerance than ever before.

Why?

I compared Nimble Storage with many competitors before making my decision. Quite simply, Nimble’s core values agree with mine. It goes without saying that I wouldn’t choose to move to a company unless I believed they had the best technology (and the best support), but the core values is where it all starts. The product is built upon those core values.

I firmly believe that modern storage should be easy to consume. Indeed, it should be a joy to consume, even for complex multi-site environments. It should not be a burden to deal with. Nor should it be a burden to sell.

Systems that are holistically easy to consume have several business benefits, some of which are: lower OPEX and CAPEX, increased productivity, less risk, easier planning, faster problem resolution.

It’s important to understand that easy to consume is not at all the same as easy to usethat is but a very small subset of easy consumption.

The core value of easy consumption encompasses several aspects, many of which are ignored by most storage vendors. Most modern players will focus on ease of use, show demos of a pretty GUI and suchlike. “Look how easy it is to install” or “look how easy it is to create a LUN”. Well – there’s a lot more to worry about in real life.

The lifecycle of a storage system

Beyond initial installation and simple element creation, there is a multitude of things users and vendors need to be concerned with. Here’s a partial list:

  • Installation
  • Migration to/from
  • Provisioning
  • Host/fabric configuration
  • Backups, restores, replication
  • Scaling up/out
  • Upgrading from a smaller/older version
  • Firmware updates for all components (including drives)
  • Tech refresh
  • Support

What about more advanced topics?

A storage solution cannot exist in a vacuum. There are several ancillary (but extremely important) services needed in order to help consume storage, especially at scale. Services that typically cannot (and many that should not) reside on the storage system itself. How about…

  • Initial and future sizing
  • Capacity planning based on long-term usage data
  • Performance analysis and profiling
  • Performance issue resolution/recommendations
  • Root cause analysis
  • What-if scenario modeling
  • Support case resolution
  • Comprehensive end-to-end monitoring and alerting
  • Comprehensive reporting (including auditing)
  • Security (including RBAC and delegation)
  • Upgrade planning
  • Pervasive automation (including host-side)
  • Ensuring adherence to best practices
If a storage solution doesn’t make all or most of the above straightforward, then it is not truly easy to consume.

The problem:

Storage vendors will typically either be lacking in many of the above areas, or may need many different tools, and significant manual effort, in order to provide even some of these services.

Not having the tools creates an obvious problem – the customer and vendor simply can’t perform these functions, or the implementation is too basic. Most smaller vendors are in this camp. Not much functionality beyond what’s inside the storage device itself. Long-term consumption, especially at scale, becomes challenging.

On the other hand, having a multitude of tools to help with these areas also makes the solution hard to consume overall. Larger vendors fall into this category.

For instance: Customers may need to access many different tools just to monitor and alert on various metrics. One tool may provide certain information, another tool provides different metrics (often with significant overlap with the first tool), and so on. And not all tools work with all versions of the product. This increases administrative complexity and overall time and money spent. And the end result is often compromised and incredibly hard to support.

Vendors that need many different tools also create a problem for themselves: Almost nobody on staff will have the expertise to deal with the plethora of tools necessary to do certain things like sizing, performance troubleshooting or even a tech refresh. Or optimizing a product for specific workloads. Deep expertise is often needed to interpret the results of the tools. This causes interminable delays in problem resolution, lengthens sales cycles, complicates product development, creates staffing challenges, increases costs, and in general makes life miserable.

RG autojack

How?

What always fascinated me about Nimble Storage is that not only did they recognize these challenges, they actually built an entire infrastructure and innovative approach in order to solve the problem.

Nimble recognized the value of Predictive Analytics.

The challenge: How to use Big Data to solve the challenges faced by storage customers and storage vendors. And how to do this in a way that achieves a dramatically better end result.

While most vendors have call-home features, and some even have rudimentary capacity, configuration and maybe even performance telemetry being sent to some central repository (usually very infrequently), Nimble elected instead to send extremely comprehensive sensor telemetry to a huge analytics engine. A difficult undertaking, but one that would define the company in the years to come.

Nimble also recognized the need to do this from the very beginning. Each Nimble array sends 30-70 million data points back to Nimble every day. Trying to retrofit telemetry of this scope would be extremely difficult if not impossible to achieve effectively.

This wealth of data (the largest storage-related analytics engine in the world, by far) is used to help customers with the challenges mentioned previously, while at the same time lowering complexity.

It also, crucially, helps Nimble better support customers and design better products without having to bother customers for data dumps.

For example: What if a Nimble engineer trying to optimize SQL I/O performance wants to see detailed I/O statistics only for SQL workloads on all Nimble arrays in the world? Or on one array? Or on all arrays at a certain customer? It’s only a simple query away… and that’s just scratching the surface of what’s possible. It certainly beats trying to design storage based on arbitrary synthetic benchmarks, or manually grabbing performance traces from customer gear…

What?

Enter InfoSight. That’s the name of the gigantic analytics engine currently ingesting trillions of anonymized sensor data points every week. And growing. Check some numbers here

Nimble Storage customers do not need to install custom monitoring tools to perform highly advanced storage analytics, performance troubleshooting, and even hardware upgrade recommendations based on automated performance analysis heuristics.

No need to use the CLI, no need to manually send data dumps to the vendor, no need to use 10 different tools.

All the information customers need is available through a browser GUI. Even the vast majority of support cases are automatically handled by InfoSight, and I’m not talking about simply sending replacement hardware (that’s trivial).

I always saw InfoSight as the core offering of Nimble Storage, the huge differentiator that works hand in hand with the hardware and helps customers consume storage easily. Yes, Nimble Storage arrays are fast, reliable, easy to use, have impressive data reduction abilities, scale nicely, have great features, are cost-effective etc. But other vendors can claim they can satisfy at least some of those attributes.

Nobody else has anything even remotely the depth and capability of InfoSight. This is why Nimble calls their offering the Predictive Flash Platform. InfoSight Predictive Analytics + great hardware = Predictive Flash.

I will be covering this fascinating topic in a lot more depth in the future. An AI Expert System powered by a behemoth analytics engine, helping reduce complexity and making the solution Easy To Consume is a pretty impressive piece of engineering.

Watch this space…

D

Technorati Tags: , , , ,

7-Mode to Clustered ONTAP Transition

I normally deal with different aspects of storage (arguably far more exciting) but I thought I would write something to provide some common sense perspective on the current state of 7-Mode to cDOT adoption.

I will tackle the following topics:

  1. cDOT vs 7-Mode capabilities
  2. Claims that not enough customers are moving to cDOT
  3. 7-Mode to cDOT transition is seen by some as difficult and expensive
  4. Some argue it might make sense to look at competitors and move to those instead
  5. What programs and tools are offered by NetApp to make transition easy and quick
  6. Migrating from competitors to cDOT

cDOT vs 7-Mode capabilities

I don’t want to make this into a dissertation or take a trip down memory lane. Suffice it to say that while cDOT has most of the 7-Mode features, it is internally very different but also much, much more powerful than 7-Mode – cDOT is a far more capable and scalable storage OS in almost every possible way (and the roadmap is utterly insane).

For instance, cDOT is able to nondisruptively do anything, including crazy stuff like moving SMB shares around cluster nodes (moving LUNs around is much easier than dealing with the far more quirky SMB protocol). Some reasons to move stuff around the cluster could be node balancing, node evacuation and replacement… all done on the fly in cDOT, regardless of protocol.

cDOT also handles Flash much better (cDOT 8.3.x+ can be several times faster than 7-Mode for Flash on the exact same hardware). Even things like block I/O (FC and iSCSI) are completely written from the ground up in cDOT. Cloud integration. Automation. How failover is done. Or how CPU cores are used, how difficult edge conditions are handled… I could continue but then the ADD-afflicted would move on, if they haven’t already…

In a nutshell, cDOT is a more flexible, forward-looking architecture that respects the features that made 7-Mode so popular with customers, but goes incredibly further. There is no competitor with the breadth of features available in cDOT, let alone the features coming soon.

cDOT is quite simply the next logical step for a 7-Mode customer.

Not enough customers moving to cDOT?

The reality is actually pretty straightforward.

  • Most new customers simply go with cDOT, naturally. 7-Mode still has a couple of features cDOT doesn’t, and if those features are really critical to a customer, that’s when someone might go with 7-Mode today. With each cDOT release the feature delta list gets smaller and smaller. Plus, as mentioned earlier, cDOT has a plethora of features and huge enhancements that will never make it to 7-Mode, with much more coming soon.
  • The things still missing in cDOT (like WORM) aren’t even offered by the majority of storage vendors… many large customers use our WORM technology.
  • Large existing customers, especially ones running critical applications, naturally take longer to cycle technologies. The average time to switch major technologies (irrespective of vendor) is around 5 years. The big wave of cDOT transitions hasn’t even hit yet!
  • Given that cDOT 8.2 with SnapVault was the appropriate release for many of our customers and 8.3 the release for most of our customers, a huge number of systems are still within that 5-year window prior to converting to cDOT, given when those releases came out.
  • Customers with mission-critical systems will typically not convert an existing system – they will wait for the next major refresh cycle. Paranoia rules in those environments (that’s a general statement regardless of vendor). And we have many such customers.

7-Mode to cDOT transition is seen by some as difficult and expensive

This is a fun one, and a favorite FUD item for competitors and so-called “analysts”. I sometimes think we confused people by calling cDOT “ONTAP”. I bet expectations would be different if we’d called it “SuperDuper ClusterFrame OS”.

You see, cDOT is radically different in its internals versus 7-Mode – however, it’s still officially also called “ONTAP”. As such, customers are conditioned to super-easy upgrades between ONTAP releases (just load the new code and you’re done). cDOT is different enough that we can’t just do that.

I lobbied for the “ClusterFrame” name but was turned down BTW. I still think it rocks.

The fact that you can run either 7-Mode or cDOT on the same physical hardware confuses people even further. It’s a good thing to be able to reuse hardware (software-defined and all that). Some vendors like to make each new rev of the same family line (and its code) utterly incompatible with the last one… we don’t do that.

And for the startup champions: Startups haven’t been around long enough to have seen a truly major hardware and/or software change! (another thing conveniently ignored by many). Nor do they have the sheer amount of features and ancillary software ONTAP does. And of course, some vendors forget to mention what even a normal tech refresh looks like for their fancy new “built from the ground up” box with the extremely exciting name.

We truly know how to do upgrades… probably better than any vendor out there. For instance: What most people don’t know is that WAFL (ONTAP’s underlying block layout abstraction layer) has been quietly upgraded many, many times over the years. On the fly. In major ways. With a backout option. Another vendor’s product (again the one with the extremely exciting name) needed to be wiped twice by as many “upgrades” in one year in order to have its block layout changed.

Here’s the rub:

Transition complexity really depends on how complex your current deployment is, your appetite for change and tolerance of risk. But transition urgency depends on how much you need the fully nondisruptive nature of cDOT and all the other features it has vs 7-Mode.

What I mean by that:

We have some customers that lose upwards of $4m/hour of downtime. The long-term benefits of a truly nondisruptive architecture make any arguments regarding migration efforts effectively moot.

If you are using a lot of the 7-Mode features and companion software (and it has more features than almost any other storage OS), specific tools written only for 7-Mode, older OS clients only supported on 7-Mode, tons of snaps and clones going back to several years’ retention etc…

Then, in order to retain that kind of similar elaborate deployment in cDOT, the migration effort will also naturally be a bit more complex. But still doable. And we can automate most of it. Including moving over all the snapshots and archives!

On the other hand, if you are using the system like an old-fashioned device and aren’t taking advantage of all the cool stuff, then moving to anything is relatively easy. And especially if you’re close to 100% virtualized, migration can be downright trivial (though simply moving VMs around storage systems ignores any snapshot history – the big wrinkle with VM migrations).

Some argue it might make sense to look at competitors and move to those instead

Looking at options is something that makes business sense in general. It would be very disingenuous of me to say it’s foolish to look at options.

But this holds firmly true: If you want to move to a competitor platform, and use a lot of the 7-Mode features, it would arguably be impossible to do cleanly and maintain full functionality (at a bare minimum you’d lose all your snaps and clones – and some customers have several years’ worth of backup data in SnapVault – try asking them to give that up).

This is true for all competitor platforms: Someone using specific features, scripts, tools, snaps, clones etc. on any platform, will find it almost impossible to cleanly migrate to a different platform. I don’t care who makes it. Doesn’t matter. Can you cleanly move from VMware + VMware snaps to Hyper-V and retain the snaps?

Backup/clone retention is really the major challenge here – for other vendors. Do some research and see how frequently customers switch backup platforms… 🙂 We can move snaps etc. from 7-Mode to cDOT just fine 🙂

The less features you use, the easier the migration and acclimatization to new stuff becomes, but the less value you are getting out of any given product.

Call it vendor lock-in if you must, but it’s merely a side effect of using any given device to its full potential.

The reality: it is incredibly easier to move from 7-Mode to cDOT than from 7-Mode to other vendor products. Here’s why…

What programs and tools are offered by NetApp to make transition easy and quick?

Initially, migration of a complex installation was harder. But we’ve been doing this a while now, and can do the following to make things much easier:

  1. The very cool 7MTT (7-Mode Transition Tool). This is an automation tool we keep rapidly enhancing that dramatically simplifies migrations of complex environments from 7-Mode to cDOT. Any time and effort analysis that ignores how this tool works is quite simply a flawed and incomplete analysis.
  2. After migrating to a new cDOT system, you can take your old 7-Mode gear and convert it to cDOT (another thing that’s impossible with a competitor – you can’t move from, say, a VNX to a VMAX and then convert the VNX to a VMAX).
  3. As of cDOT 8.3.0: We made SnapMirror replication work for all protocols from 7-Mode to cDOT! This is the fundamental way we can easily move over not just the baseline data but also all the snaps, clones etc. Extremely important, and something that moving to a competitor would simply be impossible to carry forward.
  4. As of cDOT 8.3.2 we are allowing something pretty amazing: CFT (Copy Free Transition). Which does exactly what the name suggests: Allows not having to move any data over to cDOT! It’s a combination ONTAP and 7MTT feature, and allows disconnecting the shelves from 7-Mode controllers and re-attaching them to cDOT controllers, and thereby converting even a gigantic system in practically no time. See here for a quick guide, here for a great blog post.
And before I forget…

What about migrating from other vendors to cDOT?

It cuts both ways – anything less would be unfair. Not only is it far easier to move from 7-Mode to cDOT than to a competitor, it’s also easy to move from a competitor to cDOT! Since it’s all about growth, and that’s the only way real growth happens.

As of version 8.3.1 we have what’s called Online Foreign LUN Import (FLI). See link here. It’s all included – no special licenses needed.

With Online FLI we can migrate LUNs from other arrays and maintain maximum uptime during the migration (a cutover is needed at some point of course but that’s quick).

And all this we do without external “helper” gear or special software tools.

In the case of NAS migrations, we have the free and incredibly cool XCP software that can migrate things like high file count environments 25-100x faster than traditional methods. Check it out here.

In summary

I hardly expect to change the minds of people suffering from acute confirmation bias (I wish I could say “you know who you are”, not knowing you are afflicted is the major problem), but hopefully the more level-headed among us should recognize by now that:

  • 7-Mode to cDOT migrations are extremely straightforward in all but the most complex and custom environments
  • Those same complex environments would find it impossible to transparently migrate to anything else anyway
  • Backups/clones is one of those things that complicates migrations for any vendor – ONTAP happens to be used by a lot of customers to handle backups as part of its core value prop
  • NetApp provides extremely powerful tools to help with migrations from 7-Mode to cDOT and from competitors to cDOT (with amazing tools for both SAN and NAS!) that will also handle the backups/clones/archives!
  • The grass isn’t always greener on the other side – The transition from 7-Mode to cDOT is the first time NetApp has asked customers to do anything that major in over 20 years. Other, especially younger, vendors haven’t even seen a truly major code change yet. How will they react to such a thing? NetApp is handling it just fine 🙂

D

Technorati Tags: , , , , , ,

NetApp E-Series posts top ten price performance SPC-2 result

NetApp published a Top Ten Price/Performance SPC-2 result for the E5660 platform (the all-SSD EF560 variant of the platform also comfortably placed in the Top Ten Price/Performance for the SPC-1 test). In this post I will explain why the E-Series platform is an ideal choice for workloads requiring high performance density while being cost effective.

Executive Summary

The NetApp E5660 offers a unique combination of high performance, low cost, high reliability and extreme performance density, making it perfectly suited for applications that require high capacity, very high speed sequential I/O and a dense form factor. In EF560 form with all SSD, it offers that same high sequential speed shown here coupled with some of the lowest latencies for extremely hard OLTP workloads as demonstrated by the SPC-1 results.

The SPC-2 test

Whereas the SPC-1 test deals with an extremely busy OLTP system (high percentage of random I/O, over 60%writes, with SPC-1 IOPS vs latency being a key metric), SPC-2 instead focuses on throughput, with SPC-2 MBPS (MegaBytes Per Second) being a key metric.

SPC-2 tries to simulate applications that need to move a lot of data around very rapidly. It is divided into three workloads:

  1. Large File Processing: Applications that need to do operations with large files, typically found in analytics, scientific computing and financial processing.
  2. Large Database Queries: Data mining, large table scans, business intelligence…
  3. Video on Demand: Applications that deal with providing multiple simultaneous video streams (even of the same file) by drawing from a large library.
The SPC-2 test measures each workload and then provides an average SPC-2 MBPS across the three workloads. Here’s the link to the SPC-2 NetApp E5660 submission.

Metrics that matter

As I’ve done in previous SPC analyses, I like using the published data to show business value with additional metrics that may not be readily spelled out in the reports. Certainly $/SPC-2 MBPS is a valuable metric, as is the sheer SPC-2 MBPS metric that shows how fast the array was in the benchmark.

However… think of what kind of workloads represent the type of I/O shown in SPC-2. Analytics, streaming media, data mining. 

Then think of the kind of applications driving such workloads.

Such applications typically treat storage like lego bricks and can use multiple separate arrays in parallel (there’s no need or even advantage in having a single array). Those deployments invariably need high capacities, high speeds and at the same time don’t need the storage to do anything too fancy beyond being dense, fast and reliable.

In addition, such solutions often run in their own dedicated silo, and don’t share their arrays with the rest of the applications in a datacenter. Specialized arrays are common targets in these situations.

It follows that a few additional metrics are of paramount importance for these types of applications and workloads to make a system relevant in the real world:

  • I/O density – how fast does this system go per rack unit of space?
  • Price per used (not usable) GB – can I use most of my storage space to drive this performance, or just a small fraction of the available space?
  • Used capacity per Rack Unit – can I get a decent amount of application space per Rack Unit and maintain this high performance?

Terms used

Some definition of the terms used in the chart will be necessary. The first four are standard metrics found in all SPC-2 reports:

  • $/SPC-2 MBPS – the standard SPC-2 reported metric of price/performance
  • SPC-2 MBPS – the standard SPC-2 reported metric of the sheer performance attained for the test. It’s important to look at this number in the context of the application used to generate it – SPC-2. It will invariably be a lot less than marketing throughput numbers… Don’t compare marketing numbers for systems that don’t have SPC-2 results to official, audited SPC-2 numbers. Ask the vendor that doesn’t have SPC-2 results to submit their systems and get audited numbers.
  • Price – the price (in USD) submitted for the system, after all discounts etc.
  • Used Capacity – this is the space actually consumed by the benchmark, in GB. “ASU Capacity” in SPC parlance. This is especially important since many vendors will use a relatively small percentage of their overall capacity (Oracle for example uses around 28% of the total space in their SPC-2 submissions, NetApp uses over 79%). Performance is often a big reason to use a small percentage of the capacity, especially with spinning disks.

This type of table is found in all submissions, the sections before it explain how capacity was calculated. Page 8 of the E5660 submission:

E5660 Utilization

The next four metrics are very easily derived from existing SPC-2 metrics in each report:

  • $/ASU GB: How much do I pay for each GB actually consumed by the benchmark? Since that’s what the test actually used to get the reported performance… metrics such as cost per raw GB are immaterial. To calculate, just divide $/ASU Capacity.
  • Rack Units: Simply the rack space consumed by each system based on the list of components. How much physical space does the system consume?
  • SPC-2 MBPS/Rack Unit: How much performance is achieved by each Rack Unit of space? Shows performance density. Simply divide the posted MBPS/total Rack Units.
  • ASU GB/Rack Unit: Of the capacity consumed by the benchmark, how much of it fits in a single Rack Unit of space? Shows capacity density. To calculate, divide ASU Capacity/total Rack Units.
A balanced solution needs to look at a combination of the above metrics.

Analysis of Results

Here are the Top Ten Price/Performance SPC-2 submissions as of December 4th, 2015:

SPC 2 Analysis

Notice that while the NetApp E5660 doesn’t dominate in sheer SPC-2 MBPS in a single system, it utterly annihilates the competition when it comes to performance and capacity density and cost/used GB. The next most impressive system is the SGI InfiniteStorage 5600, which is the SGI OEM version of the NetApp E5600 🙂 (tested with different drives and cost structure).

Notice that the NetApp E5660 with spinning disks is faster per Rack Unit than the fastest all-SSD system HP has to offer, the 20850… this is not contrived, it’s simple math. Even after a rather hefty HP discount, the HP system is over 20x more expensive per used GB than the NetApp E5600.

If you were buiding a real system for large-scale heavy-duty sequential I/O, what would you rather have, assuming non-infinite budget and rack space?

Another way to look at it: One could buy 10x the NetApp systems for the cost of a single 3Par system, and achieve, while running them easily in parallel:

  • 30% faster performance than the 3Par system
  • 19.7x the ASU Capacity of the 3Par system
These are real business metrics – same cost, almost 20x the capacity and 30% more performance.

This is where hero numbers and Lab Queens fail to match up to real world requirements at a reasonable cost. You can do similar calculations for the rest of the systems.

A real-life application: The Lawrence Livermore National Laboratory. They run multiple E-Series systems in parallel and can achieve over 1TB/s throughput. See here.

Closing Thoughts

With Top Ten Price/Performance rankings for both the SPC-1 and SPC-2 benchmarks, the NetApp E-Series platform has proven that it can be the basis for a highly performant, reliable yet cost-efficient and dense solution. In addition, it’s the only platform that is in the Top Ten Price/Performance for both SPC-1 and SPC-2.

A combination of high IOPS, extremely low latency and very high throughput at low cost is possible with this system. 

When comparing solutions, look beyond the big numbers and think how a system would provide business value. Quite frequently, the big numbers come with big caveats…

D

Technorati Tags: , , , ,