Practical Considerations for Implementing NVMe Storage

Before we begin, something needs to be clear: although dual-ported NVMe drives are not yet cost-effective, the architecture of Nimble Storage is NVMe-ready today. And always remember that in order to get real benefits from NVMe, one needs to implement it all the way from the client. Doing NVMe only at the array isn’t nearly as effective.

In addition, Nimble already uses technology far faster than NVMe: Our write buffers use byte-addressable NVDIMM-N, instead of slower NVRAM HBAs or NVMe drives that other vendors use. Think about it: I/O happens at DDR4 RAM speeds, which makes even the fastest NVMe drive seem positively glacial.

[Image: NVDIMM-N module]

I did want to share my personal viewpoint on where storage technology in general may be headed if NVMe is to be mass-adopted in a realistic fashion, without making huge sacrifices.

About NVMe

Lately, a lot of noise is being made about NVMe technology, the idea being that NVMe is the next step in the evolution of storage. And, as is the natural order of things, new vendors are popping up to take advantage of this perceived opening.

For the uninitiated: NVMe is a relatively new standard that was created specifically for devices connected over a PCIe bus. It has some nice advantages over SCSI, such as reduced latency and improved IOPS. Sequential throughput can be significantly higher, and it can be more CPU-efficient. It needs a small, simple driver (the base standard requires only 13 commands), and it can also be used over certain FC or Ethernet networks (NVMe over Fabrics). Going through a fabric adds only a small amount of extra latency to the stack compared to DAS.

NVMe is strictly an optimized block protocol, and not applicable to NAS/object platforms unless one is talking about their internal drives.

Due to the additional performance, NVMe drives are a no-brainer in laptops and as direct-attached/internal storage in servers. Usually there is only a small number of devices (often just one), and no fancy data services run on something like a laptop… replacing the media with better media plus a better interface is a good idea.

For enterprise arrays though, the considerations are different.

NVMe Performance

Marketing has managed to confuse people regarding NVMe’s true performance. It’s important to note that tests illustrating NVMe performance show a single NVMe device being faster than a single SAS or SATA SSD. But storage arrays usually don’t have just a single device, so drive performance isn’t the bottleneck the way it is in low-media-count systems.

In addition, most tests and research papers comparing NVMe to other technologies use wildly dissimilar SSD models. For instance, pitting a modern, ultra-high-end NVMe SSD against an older consumer SATA SSD with a totally different internal controller. This can make proper performance comparisons difficult. How much of the performance boost is due to NVMe and how much because the expensive, fancy SSD is just a much better engineered device?

For instance, consider this chart of NVMe device latency, courtesy of Intel:

[Figure: Intel device latency chart (3D XPoint)]

As you can see, as a drive connection protocol NVMe offers better latency than SAS or SATA, but the difference is on the order of a few microseconds. The protocol differences become truly important only with next-gen media like 3D XPoint, which ideally needs a memory interconnect to shine (or, at a minimum, PCIe) since the media is so much faster than typical NAND. But such media will be prohibitively expensive to use as the entire capacity of an array in the foreseeable future, and would quickly be bottlenecked by the array CPUs at scale.

NVMe over Fabrics

Additional latency savings will come from connecting clients using NVMe over Fabrics. By doing I/O over an RDMA network, a latency reduction of around 100 microseconds is possible versus encapsulated SCSI protocols like iSCSI, assuming all the right gear is in place (HBAs, switches, host drivers). Doing NVMe at the client side also helps with lowering CPU utilization, which can make client processing overall more efficient.

Where are the Bottlenecks?

The reality is that the main bottleneck in today’s leading AFAs is the controller itself and not the SSDs (simply because a couple of dozen modern SAS/SATA SSDs have enough performance to saturate most systems). Moving to competent NVMe SSDs means those same controllers will now be saturated by maybe 10 NVMe SSDs. For example, a single NVMe drive may be able to read sequentially at 3GB/s, whereas a single SATA SSD manages about 500MB/s. Putting 24 NVMe drives behind the controller doesn’t mean the controller will magically deliver 72GB/s. In the same way, a single SATA SSD might do 100,000 small-block random read IOPS and an NVMe SSD with better innards 400,000 IOPS. Again, it doesn’t mean that the same controller with 24 devices will all of a sudden do 9.6 million IOPS!
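To make the arithmetic concrete, here is a minimal sketch with purely illustrative numbers; the controller ceiling is an assumption, not a measurement of any specific array:

```python
# Host-visible throughput is capped by the controller, not by the sum of the media.
# All figures here are illustrative assumptions.

def deliverable_gbps(drive_count: int, per_drive_gbps: float, controller_ceiling_gbps: float) -> float:
    """Aggregate media bandwidth, clamped at what the controller can actually push."""
    raw_media_gbps = drive_count * per_drive_gbps
    return min(raw_media_gbps, controller_ceiling_gbps)

# 24 NVMe drives at 3 GB/s each offer 72 GB/s of raw media bandwidth...
print(deliverable_gbps(24, 3.0, controller_ceiling_gbps=15.0))  # ...yet the host sees ~15 GB/s
# 24 SATA SSDs at 0.5 GB/s each may already be close to the same ceiling:
print(deliverable_gbps(24, 0.5, controller_ceiling_gbps=15.0))  # 12.0 GB/s
```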

How Tech is Adopted

Tech adoption comes in waves until a significant technology advancement is affordable and reliable enough to become pervasive. For instance, ABS brakes were first used in planes in 1929 and were too expensive and cumbersome to use in everyday cars. Today, most cars have ABS brakes and we take for granted the added safety they offer.

But consider this: What if someone told you that, in order to get a new kind of car (one with several great benefits), you would have to utterly give up things like airbags, ABS brakes, all-wheel drive, traction control and the limited-slip differential, without an equivalent replacement for these functions?

You would probably realize that you’re not that excited about the new car after all, no matter how much better than your existing car it might be in other key aspects.

Storage arrays follow a similar paradigm. There are several very important business reasons that make people ask for things like HA, very strong RAID, multi-level checksums, encryption, compression, data reduction, replication, snaps, clones, hot firmware updates. Or the ability to dynamically scale a system. Or comprehensive cross-stack analytics and automatic problem prevention.

Such features evolved over a long period of time, and help mitigate risk and accelerate business outcomes. They’re also not trivial to implement properly.

NVMe Arrays Today

The challenge I see with the current crop of ultra-fast NVMe over Fabrics arrays is that they’re so focused on speed that they sacrifice the aforementioned enterprise features in favor of sheer performance. I get it: it takes great skill, time and effort to reliably implement such features, especially in a way that doesn’t sap the performance potential of a system.

There is also a significant cost challenge in safely utilizing NVMe media en masse. Dual-ported SSDs are crucial for delivering proper HA, and current dual-ported NVMe SSDs tend to be very expensive per TB compared to current SAS/SATA SSDs. In addition, due to the much higher speed of the NVMe interface, even with future CPUs that include FPGAs, many CPUs and PCIe switches are needed to create a highly scalable system that can fully utilize such SSDs (and maintain enterprise features), which further explains why most NVMe solutions using the more interesting devices tend to be rather limited.

There are also client-side challenges: Using NVMe over Fabrics can often mean purchasing new HBAs and switches, plus dealing with some compromises. For instance, in the case of RoCE, DCB switches are necessary, end-to-end congestion management is a challenge, and routability is not there until v2.

There’s a bright side: There actually exist some very practical ways to give customers the benefits of NVMe without taking away business-critical capabilities.

Realistic Paths to NVMe Adoption

We can divide the solution into two scenarios; the direction chosen will then depend on customer readiness and component availability. All of the following assumes no loss of important enterprise functionality (as we discussed, giving up all the enterprise functionality is the easy way out when it comes to speed):

Scenario 1: Most customers are not ready to adopt host-side NVMe connectivity:

If this is the case, a good option would be an ultra-fast, byte-addressable device inside the controller to massively augment the RAM buffers (like 3D XPoint in a DIMM form factor), or, if that’s not available, some next-gen NVMe drives acting as cache. That would provide an overall speed boost to clients without requiring any client-side modifications. This approach would be the most friendly to an existing infrastructure (and a relatively economical enhancement for arrays), without needing all internal drives to be NVMe or extensive array modifications.

You see, part of any competent array’s job is to use intelligence to hide underlying media issues from the end user. A good example: even super-fast SSDs can suffer from garbage-collection latency incidents. A good system will smooth out the user experience so users don’t see extreme latency spikes. The chosen media and host interface are immaterial here, but I bet that if you were used to 100μs latencies and they suddenly spiked to 10ms for a while, it would be a bad day. Having an extra-large buffer in the array makes this easier to achieve, and requires no host-side changes from customers.

An evolutionary second option would be to change all internal drives to NVMe, but to make this practical would require wide availability of cost-effective dual-ported devices. Note that with low SSD counts (fewer than 12) this would provide speed benefits even if the customer doesn’t adopt a host-side NVMe interface, but it becomes a diminishing-returns endeavor at larger scale unless the controllers are significantly modified.

Scenario 2: Large numbers of customers are ready and willing to adopt NVMe over Fabrics.

In this case, the first thing that needs to change is the array connectivity to the outside world. That alone will boost speeds on modern systems even without major modifications. Of course, to be most effective this will often mean client and networking changes, and such changes can be costly.

The next step depends on the availability of cost-effective dual-ported NVMe devices. But in order for very large performance benefits to be realized, pretty big boosts to CPU and PCIe switch counts may be needed, necessitating bigger changes to storage systems (and increased costs).

Architecture Matters

In the quest for ultra-low latency and high throughput without sacrificing enterprise features (yet remaining reasonably cost-effective), overall architecture becomes extremely important.

For instance, how will one do RAID? Even with NVMe over Fabrics, approaches like erasure coding and triple mirroring can be costly from an infrastructure perspective. Erasure coding remains CPU-hungry (even more so when trying to hit ultra-low latencies), and triple mirroring across an RDMA fabric would mean massive extra traffic on that fabric.

Localized CPU:RAID domains remain more efficient, and mechanisms such as Nimble NCM can fairly distribute the load across multiple storage nodes without relying on a cluster network for heavy I/O. This technology is available today.
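As a rough, back-of-the-envelope illustration of the fabric-traffic point (the write rate and copy placement below are assumptions, not measurements of any product):

```python
# Back-of-the-envelope fabric write amplification. Figures are illustrative assumptions.

def fabric_write_traffic_gbps(host_write_gbps: float, remote_copies: int) -> float:
    """Bytes crossing the RDMA fabric per second for data protection copies that leave the node."""
    return host_write_gbps * remote_copies

host_writes_gbps = 5.0  # assumed sustained host write rate

# Triple mirroring with one local copy still pushes two full copies over the fabric:
print(fabric_write_traffic_gbps(host_writes_gbps, remote_copies=2))  # 10.0 GB/s of extra east-west traffic

# A localized CPU:RAID domain computes parity locally, so protection traffic on the fabric stays near zero
# (only the normal host I/O path touches the network):
print(fabric_write_traffic_gbps(host_writes_gbps, remote_copies=0))  # 0.0 GB/s
```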

Next Steps

In summary, I urge customers to carefully consider the overall business impact of their storage decisions, especially when it comes to new technologies and protocols. Understand the true benefits first, carefully balance risk against the desired outcome, and consider the overall system and not just the components – understanding the risks vs rewards is, of course, the point of this article. Just make sure that, in pursuit of a certain ideal, you don’t give up critical functionality that you’ve been taking for granted.

The Importance of Automated Headroom Management

Before we begin: This is another vendor-neutral post. I realize there may be no architecture that can do everything I’m proposing, but some may come closer to what you need than others. Whether you’re a vendor or a customer, see it as stuff you should be doing or be asking for respectively…

Headroom!

Headroom is a term that applies to almost all technologies, and it’s crucially important for all of them. Some examples:

  • Photography
  • Cars
  • Bridges
  • Storage arrays…

Why is Sufficient Headroom Important?

Maintaining sufficient headroom in any solution is a way to ensure safety and predictability of operation under most conditions (especially under unfavorable ones).

For instance, if the maximum load for an evenly loaded bridge before it collapses is X, the overall recommended load will be a fraction of that. Even the weight/length/axle count of a single truck on the bridge is subject to strict limits, in order to avoid excessive localized stress on the structure.

Headroom in Storage Arrays

Apologies to the seasoned storage pros for all the foundational material, but it’s crucial to take this step by step.

It is important to note that headroom in arrays is not necessarily as simple as how busy the CPU is. Headroom is a multi-dimensional concept.

More factors than just CPU come into play, including how busy the underlying storage media are, how saturated various buses are, and how much of the CPU is spent on true workload vs opportunistic tasks (that could be deferred). Not to mention that in some systems, certain tasks are single-threaded and could pose an overall headroom bottleneck by maxing out a single CPU core, while the rest of the CPU is not busy at all.
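As a minimal sketch of what multi-dimensional means in practice (the dimensions and utilization figures below are illustrative assumptions), the usable headroom is bounded by the most saturated resource, not by average CPU:

```python
# Headroom is dictated by the busiest dimension, not by overall CPU. Illustrative only.

def remaining_headroom(utilization_pct: dict) -> float:
    """Headroom left is limited by the most saturated resource."""
    return 100.0 - max(utilization_pct.values())

system = {
    "cpu_total": 55.0,     # average CPU looks comfortable...
    "busiest_core": 92.0,  # ...but a single-threaded task has nearly maxed out one core
    "media": 60.0,
    "pci_bus": 40.0,
}
print(remaining_headroom(system))  # 8.0 -- far less than "45% CPU headroom" would suggest
```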

Maintaining sufficient headroom in storage arrays is necessary in order to provide acceptable latency, especially in the event of high load during a controller failover. Depending on the underlying architecture of an array, different headroom approaches and calculations are necessary.

Some examples of different architectures:

  • Active-Active controllers, per-controller pool
  • Active-Standby, single pool
  • Active-Active, single pool
  • Grid, single pool
  • Permutations thereof (it’s beyond the scope of this article to explore all possible options)

The single vs multiple pool question complicates things a bit, plus things like disk ownership are also hugely important. This isn’t an argument about which architecture is better (it depends anyway), but rather about headroom management in different architectures.

Dual-Controller Headroom

Dual-controller architectures need to be extremely careful with headroom management. After all, there are only two controllers in play. Here’s what sufficient headroom looks like in a dual-controller system:

[Figure: healthy headroom in a dual-controller HA system]

There are not many options to keep things healthy in a dual-controller architecture. In an Active-Standby system, the Standby controller is ready to immediately take over. There is no danger in loading up the Active controller, aside from expected load-related latency.

In an Active-Active HA system, maintaining a healthy amount of headroom has to be managed so that there is, overall, an entire controller’s worth of free headroom available.

Headroom in a Cluster of HA Pairs Architecture

There are several implementations that make use of a multiple-HA-pair architecture. Often, the multiple HA pairs present a virtual pool to the outside world even if, internally, there are multiple private pools. Some implementations simply keep separate pools owned by each controller.

Here’s an example of healthy headroom in such a system:

[Figure: healthy headroom across a cluster of HA pairs]

Even though there are multiple controllers (at least 4 in total), in order to keep the overall system healthy, one controller’s worth (100%) of headroom needs to be maintained within each HA pair; otherwise the performance of an underlying private pool (in green) might suffer, making the overall virtual pool performance (light blue) unpredictable.

Headroom in a True Grid Architecture

Grid Architectures spread overall load among multiple nodes (often plain servers with some disks inside and connected via a network).

In such a scheme, the headroom that needs to be maintained per node, expressed as a percentage, is 100/N, where N is the number of nodes in the storage cluster.

So, in a 4-node cluster, 100/4=25% headroom per node needs to be maintained.

This doesn’t account for the significant work that rebalancing after a node failure takes in such architectures, nor the capacity headroom needed, but it’s roughly accurate enough for our purposes.

Schematically:

[Figure: per-node headroom in a grid architecture]
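A minimal sketch of the per-node headroom arithmetic, ignoring rebalancing effort and capacity headroom as noted above:

```python
# Reserve 100/N percent per node so the cluster can absorb the loss of a node.
# This deliberately ignores rebalancing cost and capacity headroom.

def per_node_headroom_pct(node_count: int, nodes_to_survive: int = 1) -> float:
    """Percentage of each node that should stay free."""
    return 100.0 * nodes_to_survive / node_count

print(per_node_headroom_pct(4))                       # 25.0 -- the 4-node example above
print(per_node_headroom_pct(2))                       # 50.0 -- equivalent to an Active-Active HA pair
print(per_node_headroom_pct(8, nodes_to_survive=2))   # 25.0 -- a more cautious 8-node configuration
```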

How Headroom is Managed is Crucial

In order to manage headroom, four things need to be able to happen first:

  1. Be able to calculate headroom
  2. Be able to throttle workloads
  3. Be able to prioritize between types of workload
  4. Be able to move workloads around (architecture-dependent).

The only architecture that inherently makes this a bit easier is Active-Standby since there is always a controller waiting to take over if anything bad happens. But even with a single active controller, headroom needs to be managed in order to avoid bad latency conditions during normal operation (see here for an example approach). Remember, headroom is a multi-dimensional thing.

Example Problem Case: Imbalanced & Overloaded Controllers

Consider the following scenario: An Active-Active system has both controllers overloaded, and one of them is really busy:

[Figure: an imbalanced and overloaded Active-Active pair]

Clearly, there are a few problems with this picture:

  1. It may be impossible to fail over in the event of a controller failure (the total load is 165% of what a single controller can handle)
  2. The first controller may already be experiencing latency issues
  3. Why did the system even get to this point in the first place?

This is a commonplace occurrence, unfortunately.
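To make the first problem concrete, here is a minimal sketch of the failover feasibility check implied above; the 95/70 split is an assumption consistent with the 165% total in the example:

```python
# Can the surviving controller absorb both loads? Illustrative sketch only.

def can_fail_over(load_a_pct: float, load_b_pct: float) -> bool:
    """True only if the combined load fits within a single controller."""
    return load_a_pct + load_b_pct <= 100.0

print(can_fail_over(95.0, 70.0))  # False -- 165% cannot run on one controller
print(can_fail_over(60.0, 40.0))  # True  -- exactly one controller's worth of headroom preserved
```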

Automation is Key in Managing Headroom

The biggest problem in our example is actually the last point: Why was the system allowed to get to that state to begin with?

Even if a system is able to calculate headroom, throttle workloads and move workloads around, if nothing is done automatically to prevent problems, it’s extremely easy for users to get into the problem situation depicted above. I’ve seen it affect critical production systems far too many times for comfort.

Manual QoS is Not The Best Answer

Being able to manually throttle workloads can obviously help in such a situation. The problems with the manual QoS approach are outlined in a past article, but, in summary, most users simply have absolutely no idea what the actual limits should be (nor should they be expected to). Most importantly, placing QoS limits up front doesn’t result in balanced controllers… and may even result in other kinds of performance problems.

Of course, using QoS limits reactively is not going to prevent the problem from occurring in the first place.

Some companies offer Data Classification as a Professional Services engagement, in order to try to figure out an IOPS/TB/application metric. Even if that is done, it doesn’t result in balanced controllers, and it’s not very useful in dynamic environments. It’s used more as a guideline for setting up manual QoS.

Automation Mechanisms to Consider for Managing Headroom

Clearly, pervasive automation is needed in order to keep headroom at safe levels.

I will split up the proposed mechanisms per architecture. There is some common functionality needed regardless of architecture:

Common Automation Needed

Every architecture needs to have the ability to automatically achieve the following, without user intervention at any point:

  1. Conserve headroom per controller
  2. Differentiate between different kinds of user workloads
  3. Differentiate between different kinds of system workloads
  4. Automatically prioritize between different workloads, especially under pressure
  5. Automatically throttle different kinds of workloads, especially under pressure

And now for the extra automation needed per architecture:

Active-Standby Automation

If in a single HA pair, nothing else is needed. If in a scale-out cluster of Active-Standby pairs:

  1. Automatically balance capacity and headroom utilization between HA pairs even if they’re different types
  2. Be able to auto-migrate workloads to other cluster nodes (if using multiple pools instead of one)

Active-Active Automation

  1. Automatically conserve one node’s worth of headroom across the HA pair (50/50, 60/40, 70/30 – all are OK)
  2. When provisioning new workloads, auto-balance them by performance and capacity across the nodes
  3. Be able to balance by auto-migrating workloads to the other node (if using multiple pools instead of one)

Active-Active with Multiple HA Pairs Automation

  1. Automatically conserve one node’s worth of headroom per HA pair
  2. Be able to auto-migrate workloads to any other node
  3. Automatically balance workloads and capacity utilization in the underlying per-HA pools

Grid Automation

  1. Automatically conserve at least one node’s worth of headroom across the grid
  2. Automatically conserve enough capacity to be able to lose one node, rebalance, and have enough capacity left to lose another one (the more cautious may want the capability to lose 2-3 nodes simultaneously)
  3. Automatically take into account grid size and rebalancing effort in order to conserve the right amount of headroom

In Closing…

If you’re a consumer of storage systems, always remember to run your storage with sufficient headroom to sustain a major failure without overly affecting your performance.

In addition, when looking to refresh your storage system, always ask the vendors about how they automate their headroom management.

Finally, if any vendor is quoting you performance numbers, always ask them how much headroom is left on the array at that performance level… (in addition to the extra questions about read/write percentages and latencies you should be asking already).

The answer may surprise you.

D

The Well-Behaved Storage System: Automatic Noisy Neighbor Avoidance

This topic is very near and dear to me, and is one of the big reasons I came over to Nimble Storage.

I’ve always believed that storage systems should behave gracefully and predictably under pressure. Automatically. Even under complex and difficult situations.

It sounds like a simple request and it makes a whole lot of sense, but very few storage systems out there actually behave this way. This creates business challenges and increases risk and OpEx.

The Problem

The simplest way to state the problem is that most storage systems can enter conditions where workloads can suffer from unfair and abrupt performance starvation under several circumstances.

OK, maybe that wasn’t the simplest way.

Consider the following scenarios:

  1. A huge sequential I/O job (backup, analytics, data loads etc.) happening in the middle of latency-sensitive transaction processing
  2. Heavy array-generated workloads (garbage collection, post-process dedupe, replication, big snapshot deletions etc.) happening at the same time as user I/O
  3. Failed drives
  4. Controller failover (due to an actual problem or simply a software update)

#3 and #4 are more obvious – a well-behaved system will ensure high performance even during a drive failure (or three), and after a controller fails over. For instance, if total system headroom is automatically kept at 50% for a dual-controller system (or, simplistically, 100/n, where n is the controller count for shared-everything architectures), even after a controller fails, performance should be fine.

#1 and #2 are a bit more complicated to deal with. Let’s look at this in more detail.

The Case of Competing Workloads During Hard Times

Inside every array, at any given moment, a balancing act occurs. Multiple things need to happen simultaneously.

Several user-generated workloads, for instance:

  • DB
  • VDI
  • File Services
  • Analytics

Various internal array processes – these are also workloads, just array-generated, and often critical:

  • Data reduction (dedupe, compression)
  • Cleanup (object deletion, garbage collection)
  • Data protection (integrity-related)
  • Backups (snaps, replication)

If the system has enough headroom, all these things will happen without performance problems.

If the system runs out of headroom, that’s where most arrays have challenges with prioritizing what happens when.

The most common way a system may run out of headroom is the sudden appearance of a hostile “bully” workload. This is also called a “noisy neighbor”. Here’s an example of system behavior in the presence of a bully workload:

[Figure: a latency-sensitive “victim” workload before and after a “bully” workload appears]

In this example, the latency-sensitive workload will greatly and unfairly suffer after the “noisy neighbor” suddenly appears. If the latency-sensitive workload is a mission-critical application, this could cause a serious business problem (slow processing of financial transactions, for instance).

This is an extremely common scenario. A lot of the time it’s not even a new workload. Often, an existing workload changes behavior (possibly due to an application change – for instance a patch or a modified SQL query). This stuff happens.

In addition, the noisy neighbor doesn’t have to be a different LUN/Volume. It could be a competing workload within the same Volume! In such cases, simple fair sharing will not work.

How some vendors have tried to fix the issue with Manual “QoS”

As always, there is more than one way to skin a cat, if one is so inclined. Here are some manual methods for dealing with workload contention:

  • Some arrays have a simple IOPS or throughput limit that an administrator can manually adjust in order to fix a performance problem. This is an iterative and reactive method and hard to automate properly in real time. In addition, if the issue was caused by an internal array-generated workload, there is often no tooling available to throttle those processes.
  • Other arrays insist on the user setting up minimum, maximum and burst IOPS values for every single volume in the system, upon volume creation. This assumes the user knows in advance what performance envelope is required, in detail, per volume. The reality is that almost nobody knows these things beforehand, and getting the numbers wrong can itself cause a huge problem with latencies. Most people just want to get on with their lives and have their stuff work without babysitting.
  • A few modern arrays do “fair sharing” among LUNs/Volumes. This can help a bit overall but doesn’t address the issue of competing workloads within the same Volume.

Manual mechanisms for fixing the “bully” workload challenge result in systems that are hard to consume and complex to support while under performance pressure. Moreover, when a performance issue occurs, speed of resolution is critical. The issue needs to be resolved immediately, especially for latency-sensitive workloads. Manual methods will simply not be fast enough. Business will be impacted.
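To make the first bullet above concrete, here is a minimal token-bucket sketch of a static, manually set IOPS cap. The class and the 5,000 IOPS figure are hypothetical; picking that number correctly is exactly the hard, reactive part described above:

```python
# A static per-volume IOPS cap as a simple token bucket. Purely illustrative.

import time

class IopsLimiter:
    def __init__(self, iops_limit: int):
        self.iops_limit = iops_limit
        self.tokens = float(iops_limit)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Admit one I/O if a token is available; the bucket refills at iops_limit per second."""
        now = time.monotonic()
        self.tokens = min(self.iops_limit, self.tokens + (now - self.last) * self.iops_limit)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller must queue or delay the I/O

limiter = IopsLimiter(iops_limit=5000)  # the admin's guess -- choosing this number is the hard part
```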

How Nimble Storage Fixed the Noisy Neighbor Issue

No cats were harmed in the process. Nimble engineers looked at the extensive telemetry in InfoSight, used data science, and neatly identified areas that could be massively automated in order to optimize system behavior under a wide variety of adverse conditions. Some of what was done:

  • Highly advanced Fair Share disk scheduling (separate mechanisms that deal with different scenarios)
  • Fair Share CPU scheduling
  • Dynamic Weight Adjustment (the most important innovation listed) – automatically adjust priorities in various ways under different resource contention conditions, so that the system can always complete critical tasks and not fall dangerously behind. For instance, even within the same Volume, preferentially prioritize latency-sensitive I/O (like a DB redo log write or random DB access) vs things like a large DB table scan (which isn’t latency-sensitive).
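Below is a simplified, hypothetical sketch of the dynamic-weight idea described in the last bullet: classify queued I/O (even within one volume) and boost the weight of latency-sensitive classes as contention grows. This is not Nimble’s actual implementation, just an illustration of the concept:

```python
# Illustrative only: not Nimble's actual scheduler.

BASE_WEIGHT = {"latency_sensitive": 3.0, "sequential_bulk": 1.0, "system_background": 1.0}

def effective_weight(io_class: str, system_load_pct: float) -> float:
    """Under pressure, latency-sensitive I/O gets a progressively larger weight."""
    weight = BASE_WEIGHT[io_class]
    if io_class == "latency_sensitive" and system_load_pct > 80:
        weight *= (system_load_pct / 80.0) ** 2  # the boost grows with contention
    return weight

def next_io(queue: list, system_load_pct: float) -> dict:
    """Serve the request with the highest effective weight; ties go to the oldest request."""
    return max(queue, key=lambda io: (effective_weight(io["class"], system_load_pct), io["age_ms"]))

queue = [
    {"class": "sequential_bulk", "age_ms": 5, "desc": "large DB table scan"},
    {"class": "latency_sensitive", "age_ms": 1, "desc": "DB redo log write"},
]
print(next_io(queue, system_load_pct=95)["desc"])  # DB redo log write
```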

The end result is a system that:

  • Lets system latency increase gracefully and progressively as load increases
  • Carefully and automatically balances user and system workloads, even within the same Volume
  • Achieves I/O deadlines and preemption behavior for latency-sensitive I/O
  • Eliminates the Noisy Neighbor problem without the need for any manual QoS adjustments
  • Allows latency-sensitive small-block I/O to proceed without interference from bully workloads even at extremely high system loads.

Such automation achieves a better business result: less risk, less OpEx, easy supportability, simple and safe overall consumption even under difficult conditions.

What Should Nimble Customers do to get this Capability?

As is typical with Nimble systems and their impressive Ease of Consumption, nothing fancy needs to be done apart from simply upgrading to a specific release of the code (in this case 3.1 and up – 2.3 did some of the magic but 3.1 is the fully realized vision).

A bit anticlimactic, apologies… if you like complexity, watching this instead is probably more fun than juggling QoS manually.

D


NetApp E-Series posts top ten price performance SPC-2 result

NetApp published a Top Ten Price/Performance SPC-2 result for the E5660 platform (the all-SSD EF560 variant of the platform also comfortably placed in the Top Ten Price/Performance for the SPC-1 test). In this post I will explain why the E-Series platform is an ideal choice for workloads requiring high performance density while remaining cost-effective.

Executive Summary

The NetApp E5660 offers a unique combination of high performance, low cost, high reliability and extreme performance density, making it perfectly suited for applications that require high capacity, very high speed sequential I/O and a dense form factor. In EF560 form with all SSD, it offers that same high sequential speed shown here coupled with some of the lowest latencies for extremely hard OLTP workloads as demonstrated by the SPC-1 results.

The SPC-2 test

Whereas the SPC-1 test deals with an extremely busy OLTP system (a high percentage of random I/O, over 60% writes, with SPC-1 IOPS vs latency being a key metric), SPC-2 instead focuses on throughput, with SPC-2 MBPS (megabytes per second) being a key metric.

SPC-2 tries to simulate applications that need to move a lot of data around very rapidly. It is divided into three workloads:

  1. Large File Processing: Applications that need to do operations with large files, typically found in analytics, scientific computing and financial processing.
  2. Large Database Queries: Data mining, large table scans, business intelligence…
  3. Video on Demand: Applications that deal with providing multiple simultaneous video streams (even of the same file) by drawing from a large library.

The SPC-2 test measures each workload and then provides an average SPC-2 MBPS across the three workloads. Here’s the link to the SPC-2 NetApp E5660 submission.

Metrics that matter

As I’ve done in previous SPC analyses, I like using the published data to show business value with additional metrics that may not be readily spelled out in the reports. Certainly $/SPC-2 MBPS is a valuable metric, as is the sheer SPC-2 MBPS metric that shows how fast the array was in the benchmark.

However… think of what kind of workloads represent the type of I/O shown in SPC-2. Analytics, streaming media, data mining. 

Then think of the kind of applications driving such workloads.

Such applications typically treat storage like Lego bricks and can use multiple separate arrays in parallel (there’s no need for, or even advantage in, having a single array). Those deployments invariably need high capacities and high speeds, and at the same time don’t need the storage to do anything too fancy beyond being dense, fast and reliable.

In addition, such solutions often run in their own dedicated silo, and don’t share their arrays with the rest of the applications in a datacenter. Specialized arrays are common targets in these situations.

It follows that a few additional metrics are of paramount importance for these types of applications and workloads to make a system relevant in the real world:

  • I/O density – how fast does this system go per rack unit of space?
  • Price per used (not usable) GB – can I use most of my storage space to drive this performance, or just a small fraction of the available space?
  • Used capacity per Rack Unit – can I get a decent amount of application space per Rack Unit and maintain this high performance?

Terms used

Some definition of the terms used in the chart will be necessary. The first four are standard metrics found in all SPC-2 reports:

  • $/SPC-2 MBPS – the standard SPC-2 reported metric of price/performance
  • SPC-2 MBPS – the standard SPC-2 reported metric of the sheer performance attained for the test. It’s important to look at this number in the context of the application used to generate it – SPC-2. It will invariably be a lot less than marketing throughput numbers… Don’t compare marketing numbers for systems that don’t have SPC-2 results to official, audited SPC-2 numbers. Ask the vendor that doesn’t have SPC-2 results to submit their systems and get audited numbers.
  • Price – the price (in USD) submitted for the system, after all discounts etc.
  • Used Capacity – this is the space actually consumed by the benchmark, in GB (“ASU Capacity” in SPC parlance). This is especially important since many vendors use a relatively small percentage of their overall capacity (Oracle, for example, uses around 28% of the total space in their SPC-2 submissions, while NetApp uses over 79%). Performance is often a big reason to use a small percentage of the capacity, especially with spinning disks.

This type of table is found in all submissions; the sections before it explain how capacity was calculated. From page 8 of the E5660 submission:

[Figure: E5660 storage utilization table from the SPC-2 submission]

The next four metrics are very easily derived from existing SPC-2 metrics in each report:

  • $/ASU GB: How much do I pay for each GB actually consumed by the benchmark? Since that’s what the test actually used to get the reported performance, metrics such as cost per raw GB are immaterial. To calculate, divide Price by ASU Capacity.
  • Rack Units: Simply the rack space consumed by each system, based on the list of components. How much physical space does the system consume?
  • SPC-2 MBPS/Rack Unit: How much performance is achieved by each Rack Unit of space? Shows performance density. Simply divide the posted SPC-2 MBPS by the total Rack Units.
  • ASU GB/Rack Unit: Of the capacity consumed by the benchmark, how much fits in a single Rack Unit of space? Shows capacity density. To calculate, divide ASU Capacity by the total Rack Units.

A balanced solution needs to look at a combination of the above metrics.
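For completeness, here is a minimal sketch of how these derived metrics fall out of the standard numbers in any SPC-2 full disclosure report. The sample values are placeholders, not figures from any specific submission:

```python
# Derive the additional metrics from the standard SPC-2 report numbers.
# The sample inputs below are placeholders.

def derived_metrics(price_usd: float, spc2_mbps: float, asu_capacity_gb: float, rack_units: int) -> dict:
    return {
        "$/ASU GB": price_usd / asu_capacity_gb,
        "SPC-2 MBPS per Rack Unit": spc2_mbps / rack_units,
        "ASU GB per Rack Unit": asu_capacity_gb / rack_units,
        "$/SPC-2 MBPS": price_usd / spc2_mbps,  # also published in the report itself
    }

print(derived_metrics(price_usd=500_000, spc2_mbps=8_000, asu_capacity_gb=300_000, rack_units=8))
```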

Analysis of Results

Here are the Top Ten Price/Performance SPC-2 submissions as of December 4th, 2015:

[Table: Top Ten Price/Performance SPC-2 submissions with derived metrics]

Notice that while the NetApp E5660 doesn’t dominate in sheer SPC-2 MBPS in a single system, it utterly annihilates the competition when it comes to performance and capacity density and cost/used GB. The next most impressive system is the SGI InfiniteStorage 5600, which is the SGI OEM version of the NetApp E5600 🙂 (tested with different drives and cost structure).

Notice that the NetApp E5660 with spinning disks is faster per Rack Unit than the fastest all-SSD system HP has to offer, the 20850… this is not contrived; it’s simple math. Even after a rather hefty HP discount, the HP system is over 20x more expensive per used GB than the NetApp E5600.

If you were building a real system for large-scale, heavy-duty sequential I/O, what would you rather have, assuming a non-infinite budget and rack space?

Another way to look at it: One could buy 10x the NetApp systems for the cost of a single 3Par system, and achieve, while running them easily in parallel:

  • 30% faster performance than the 3Par system
  • 19.7x the ASU Capacity of the 3Par system

These are real business metrics – same cost, almost 20x the capacity and 30% more performance.

This is where hero numbers and Lab Queens fail to match up to real world requirements at a reasonable cost. You can do similar calculations for the rest of the systems.

A real-life application: The Lawrence Livermore National Laboratory. They run multiple E-Series systems in parallel and can achieve over 1TB/s throughput. See here.

Closing Thoughts

With Top Ten Price/Performance rankings for both the SPC-1 and SPC-2 benchmarks, the NetApp E-Series platform has proven that it can be the basis for a highly performant, reliable yet cost-efficient and dense solution. In addition, it’s the only platform that is in the Top Ten Price/Performance for both SPC-1 and SPC-2.

A combination of high IOPS, extremely low latency and very high throughput at low cost is possible with this system. 

When comparing solutions, look beyond the big numbers and think how a system would provide business value. Quite frequently, the big numbers come with big caveats…

D


Architecture has long term scalability implications for All Flash Appliances

Recently, NetApp announced the availability of a 3.84TB SSD. It’s not extremely exciting – it’s just a larger storage medium. Sure, it’s really advanced 3D NAND, it’s fast and ultra-reliable, and will allow some nicely dense configurations at a reduced $/GB. Another day in Enterprise Storage Land.

But, ultimately, that’s how drives roll – they get bigger. And in the case of SSD, the roadmaps seem extremely aggressive regarding capacities.

Then I realized that several of our competitors don’t have this large SSD capacity available. Some don’t even have half that.

But why? Why ignore such a seemingly easy and hugely cost-effective way to increase density?

In this post I will attempt to explain why certain architectural decisions may lead to inflexible design constructs that can have long-term flexibility and scalability ramifications.

Design Center

Each product has its genesis somewhere. It is designed to address certain key requirements in specific markets and behave in a better/different way than competitors in some areas. Plug specific gaps. Possibly fill a niche or even become a new product category.

This is called the “Design Center” of the product.

Design centers can evolve over time. But, ultimately, every product’s Design Center is an exercise in compromise and is one of the least malleable parts of the solution.

There’s no such thing as a free lunch. Every design decision has tradeoffs. Often, those tradeoffs sacrifice long term viability for speed to market. There’s nothing wrong with tradeoffs as long as you know what those are, especially if the tradeoffs have a direct impact on your data management capabilities long term.

It’s all about the Deduplication/RAM relationship

Aside from compression, scale-up and/or scale-out, deduplication is a common way to achieve better scalability and efficiency out of storage.

There are several ways to offer deduplication in storage arrays: Inline, post-process, fixed chunk, variable chunk, per volume scope, global scope – all are design decisions that can have significant ramifications down the line and various business impacts.

Some deduplication approaches require very large amounts of memory to store metadata (hashes representing unique chunk signatures). This may limit scalability or make a product more expensive, even with scale-out approaches (since many large, costly controllers would be required).

There is no perfect answer, since each kind of architecture is better at certain things than others. This is what is meant by “tradeoffs” in specific Design Centers. But let’s see how things look for some example approaches (this is not meant to be a comprehensive list of all permutations).

I am keeping it simple – I’m not showing how metadata might get shared and compared between nodes (in itself a potentially hugely impactful operation, as some scale-out AFA vendors have found to their chagrin). In addition, I’m not exploring container vs global deduplication or different scale-out methods – this post would become unwieldy… If there’s interest, drop me a line or comment and I will do a multi-part series covering the other aspects.

Fixed size chunk approach

In the picture below you can see the basic layout of a fixed size chunk deduplication architecture. Each data chunk is represented by a hash value in RAM. Incoming new chunks are compared to the RAM hash store in order to determine where and whether they may be stored:

[Figure: fixed-size chunk deduplication – one hash entry in RAM per chunk]

The benefit of this kind of approach is that it’s relatively straightforward from a coding standpoint, and it probably made a whole lot of sense a couple of years ago when small SSDs were all that was available and speed to market was a major design decision.

The tradeoff is that a truly exorbitant amount of memory is required in order to store all the hash metadata values in RAM. As SSD capacities increase, the linear relationship of SSD size vs RAM size results in controllers with multi-TB RAM implementations – which gets expensive.
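A rough, hedged estimate of that linear relationship is sketched below. The 4KB chunk size and 32-byte per-entry overhead are assumptions (real systems vary), but the scaling behavior is the point:

```python
# Approximate RAM needed to keep one hash entry per fixed-size chunk entirely in memory.
# Chunk size and per-entry overhead are assumptions; the linear scaling is what matters.

def hash_metadata_ram_gb(usable_capacity_tb: float, chunk_size_kb: int = 4, bytes_per_entry: int = 32) -> float:
    chunks = usable_capacity_tb * 1e12 / (chunk_size_kb * 1024)
    return chunks * bytes_per_entry / 1e9

print(round(hash_metadata_ram_gb(100), 1))    # ~780 GB of RAM for 100 TB of flash
print(round(hash_metadata_ram_gb(1000), 1))   # ~7,800 GB (multi-TB RAM) for 1 PB -- the scaling wall
```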

It follows that systems using this type of approach will find it increasingly difficult (if not impossible) to use significantly larger SSDs without either a major architectural change or the cost of multiple TB of RAM dropping dramatically. You should really ask the vendor what their roadmap is for things like 10+TB SSDs… and whether you can expand by adding the larger SSDs into a current system without having to throw everything you’ve already purchased away.

Variable size chunk approach

This one is almost identical to the previous example, but instead of a small, fixed block, the architecture allows for variable size blocks to be represented by the same hash size:

[Figure: variable-size chunk deduplication – one hash entry per variable-size chunk]

This one is more complex to code, but the massive benefit is that metadata space is hugely optimized, since much larger data chunks are represented by the same hash size as smaller data chunks. The system does this chunk division automatically. Fewer hashes are needed with this approach, leading to better utilization of memory.

Such an architecture needs far less memory than the previous example. However, it is still plagued by the same fundamental scaling problem – only at a far smaller scale. Conversely, it allows a less expensive system to be manufactured than in the previous example since less RAM is needed for the same amount of storage. By combining multiple inexpensive systems via scale-out, significant capacity at scale can be achieved at a lesser cost than with the previous example.
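For illustration, here is a toy sketch of content-defined (variable-size) chunking with a crude rolling-style boundary test. It is not any vendor’s actual algorithm; it only shows why larger, variable chunks mean fewer hash entries for the same data. The chunk-size parameters are assumptions:

```python
# Toy content-defined chunking: chunk boundaries are picked from the data itself,
# producing larger, variable chunks and therefore fewer hash entries.

import hashlib

AVG_CHUNK = 64 * 1024                     # target average chunk size (assumption)
MASK = AVG_CHUNK - 1                      # boundary when the low bits of the rolling value are zero
MIN_CHUNK, MAX_CHUNK = 16 * 1024, 256 * 1024

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of variable-size chunks."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # crude rolling-style value over recent bytes
        if (i - start >= MIN_CHUNK and (h & MASK) == 0) or i - start >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

def fingerprints(data: bytes):
    """One hash entry per variable chunk -- far fewer than one per fixed 4KB block."""
    return [hashlib.sha256(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]
```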

Fixed chunk, metadata both in RAM and on-disk

An approach to lower the dependency on RAM is to have some metadata in RAM and some on SSD:

[Figure: fixed-chunk deduplication with metadata split between RAM and SSD]

This type of architecture finds it harder to do full speed inline deduplication since not all metadata is in RAM. However, it also offers a more economical way to approach hash storage. SSD size is not a concern with this type of approach. In addition, being able to keep dedupe metadata on cold storage aids in data portability and media independence, including storing data in the cloud.
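A minimal sketch of the hybrid idea follows: keep only the hottest hash entries in a bounded RAM cache and fall back to an on-SSD index for the rest. The cache sizing, eviction policy and the dict standing in for the persistent index are all assumptions:

```python
# Two-tier hash index: hot entries in a bounded LRU cache (RAM), the rest on SSD.
# The dict below stands in for a persistent on-SSD index. Illustrative only.

from collections import OrderedDict

class TwoTierHashIndex:
    def __init__(self, ram_entries: int, on_disk_index: dict):
        self.ram = OrderedDict()          # LRU cache of fingerprint -> block location
        self.ram_entries = ram_entries
        self.disk = on_disk_index         # stand-in for the SSD-resident index

    def lookup(self, fingerprint: str):
        """Return the existing block location for a duplicate, else None."""
        if fingerprint in self.ram:
            self.ram.move_to_end(fingerprint)    # keep hot entries resident
            return self.ram[fingerprint]
        loc = self.disk.get(fingerprint)          # slower path: read SSD metadata
        if loc is not None:
            self._cache(fingerprint, loc)
        return loc

    def insert(self, fingerprint: str, location):
        self.disk[fingerprint] = location         # authoritative copy lives on SSD
        self._cache(fingerprint, location)

    def _cache(self, fingerprint, location):
        self.ram[fingerprint] = location
        if len(self.ram) > self.ram_entries:
            self.ram.popitem(last=False)           # evict the least recently used entry
```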

Variable chunk, multi-tier metadata store

Now that you’ve seen examples of various approaches, it starts making logical sense what kind of architectural compromises are necessary to achieve both high deduplication performance and capacity scale.

For instance, how about variable blocks and the ability to store metadata on multiple tiers of storage? Upcoming, ultra-fast Storage Class Memory technologies are a good intermediate step between RAM and SSD. Lots of metadata can be placed there yet retain high speeds:

[Figure: variable-chunk deduplication with metadata across RAM, SCM and SSD tiers]

Coding for this approach is of course complex since SCM and SSD have to be treated as a sort of Level 2/Level 3 cache combination but with cache access time spans in the days or weeks, and parts of the cache never going “cold”. It’s algorithmically more involved, plus relies on technologies not yet widely available… but it does solve multiple problems at once. One could of course use just SCM for the entire metadata store and simplify implementation, but that would somewhat reduce the performance afforded by the approach shown (RAM is still faster). But if the SCM is fast enough… 🙂

However, being able to embed dedupe metadata in cold storage can still help with data mobility and being able to retain deduplication even across different types of storage and even cloud. This type of flexibility is valuable.

Why should you care?

Aside from the academic interest and nerd appeal, different architecture approaches have a business impact:

  • Will the storage system scale large enough for significant future growth?
  • Will I be able to use significantly new media technologies and sizes without major disruption?
  • Can I use extremely large media sizes?
  • Can I mix media sizes?
  • Can I mix controller types in a scale-out cluster?
  • Can I use cost-optimized hardware?
  • Does deduplication at scale impact performance negatively, especially with heavy writes?
  • If inline efficiencies aren’t comprehensive, how does that affect overall capacity sizing?
  • Does the deduplication method enforce a single large failure domain? (single pool – meaning that any corruption would result in the entire system being unusable)
  • What is the interoperability with Cloud and Disk technologies?
  • Can data mobility from All Flash to Disk to Cloud retain deduplication savings?
  • What other tradeoffs is this shiny new technology going to impose now and in the future? Ask to see a 5-year vision roadmap!

Always look beyond the shiny feature and think of the business benefits/risks. Some of the above may be OK for you. Some others – not so much.

There’s no free lunch.

D
