Practical Considerations for Implementing NVMe Storage

Before we begin, something needs to be clear: although dual-ported NVMe drives are not yet cost-effective, the architecture of Nimble Storage is NVMe-ready today. And always remember that in order to get the full benefit of NVMe, one needs to implement it all the way from the client. Doing NVMe only at the array isn't as effective.

In addition, Nimble already uses technology far faster than NVMe: Our write buffers use byte-addressable NVDIMM-N, instead of slower NVRAM HBAs or NVMe drives that other vendors use. Think about it: I/O happens at DDR4 RAM speeds, which makes even the fastest NVMe drive seem positively glacial.

[Image: NVDIMM-N module]

I did want to share my personal viewpoint of where storage technology in general may be headed if NVMe is to be mass-adopted in a realistic fashion and without making huge sacrifices.

About NVMe

Lately, a lot of noise is being made about NVMe technology, the idea being that NVMe is the next step in storage technology evolution. And, as is the natural order of things, new vendors are popping up to take advantage of this perceived opening.

For the uninitiated: NVMe is a relatively new standard that was created specifically for devices connected over a PCI Express (PCIe) bus. It has some nice advantages over SCSI, such as reduced latency and improved IOPS. Sequential throughput can be significantly higher. It can be more CPU-efficient. It needs a small and simple driver (the standard requires only 13 commands), and it can also be used over certain FC or Ethernet networks (NVMe over Fabrics). Going through a fabric adds only a small amount of extra latency to the stack compared to DAS.

NVMe is strictly an optimized block protocol, and not applicable to NAS/object platforms unless one is talking about their internal drives.

Due to the additional performance, NVMe drives are a no-brainer in systems like laptops and as internal/direct-attached storage in servers. There is usually only a small number of devices (often just one), and no fancy data services are running on something like a laptop… so replacing the media with better media plus a better interface is a good idea.

For enterprise arrays though, the considerations are different.

NVMe Performance

Marketing has managed to confuse people regarding NVMe's true performance. It's important to note that tests illustrating NVMe performance show a single NVMe device being faster than a single SAS or SATA SSD. But storage arrays usually don't have just a single device, so drive performance isn't the bottleneck the way it is in systems with very few media devices.

In addition, most tests and research papers comparing NVMe to other technologies use wildly dissimilar SSD models: for instance, pitting a modern, ultra-high-end NVMe SSD against an older consumer SATA SSD with a totally different internal controller. This makes proper performance comparisons difficult. How much of the performance boost is due to NVMe, and how much is because the expensive, fancy SSD is simply a much better-engineered device?

For instance, consider this chart of NVMe device latency, courtesy of Intel:

[Image: Intel chart of NVMe device latency (NAND vs 3D XPoint)]

As you can see, regarding latency, NVMe as a drive connection protocol offers better latency than SAS or SATA, but the difference is on the order of a few microseconds. The protocol differences become truly important only with next-generation media like 3D XPoint, which ideally needs a memory interconnect to shine (or, at a minimum, PCIe) since the media is so much faster than typical NAND. But such media will be prohibitively expensive to use as the entire storage within an array for the foreseeable future, and would quickly be bottlenecked by the array CPUs at scale.
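
To illustrate why the protocol matters so much more for next-generation media, here's a rough back-of-the-envelope sketch. The latency figures below are illustrative assumptions only, not measurements from the chart above or from any particular product:

```python
# Illustrative only: rough, assumed latency figures (in microseconds) showing why
# protocol overhead matters far more for next-gen media than for NAND flash.
scenarios = {
    "SAS SSD (NAND media)":       {"media_us": 90, "protocol_us": 20},
    "NVMe SSD (NAND media)":      {"media_us": 90, "protocol_us": 5},
    "NVMe 3D XPoint-class media": {"media_us": 10, "protocol_us": 5},
}

for name, s in scenarios.items():
    total = s["media_us"] + s["protocol_us"]
    share = 100 * s["protocol_us"] / total
    print(f"{name}: ~{total} us total, protocol stack is ~{share:.0f}% of that")
```

With NAND, the protocol is a small slice of total latency either way; with far faster media, the protocol (and ideally a memory interconnect) becomes the dominant factor.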

NVMe over Fabrics

Additional latency savings will come from connecting clients using NVMe over Fabrics. By doing I/O over an RDMA network, a latency reduction of around 100 microseconds is possible versus encapsulated SCSI protocols like iSCSI, assuming all the right gear is in place (HBAs, switches, host drivers). Doing NVMe at the client side also helps with lowering CPU utilization, which can make client processing overall more efficient.

Where are the Bottlenecks?

The reality is that the main bottleneck in today's leading AFAs is the controller itself, not the SSDs (simply because a couple of dozen modern SAS/SATA SSDs already have enough performance to saturate most systems). Moving to competent NVMe SSDs means those same controllers will be saturated by maybe 10 NVMe SSDs. For example, a single NVMe drive may be able to read sequentially at 3GB/s, whereas a single SATA drive manages around 500MB/s. Putting 24 NVMe drives behind the controller doesn't mean the controller will magically deliver 72GB/s. In the same way, a single SATA SSD might do 100,000 small-block random read IOPS and an NVMe SSD with better innards 400,000 IOPS; that doesn't mean the same controller with 24 devices will suddenly do 9.6 million IOPS!
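
To make the arithmetic concrete, here's a minimal sketch. The controller saturation figure is a made-up illustrative number, not a spec for any particular array:

```python
# Back-of-the-envelope: what the drives could supply vs. what the controller can push.
# The controller limit below is an illustrative assumption, not a spec.

def delivered_throughput(drive_count, per_drive_gbps, controller_limit_gbps):
    """An array delivers the lesser of aggregate drive capability and controller capability."""
    return min(drive_count * per_drive_gbps, controller_limit_gbps)

CONTROLLER_LIMIT_GBPS = 20.0  # hypothetical controller saturation point for sequential reads

sata = delivered_throughput(24, 0.5, CONTROLLER_LIMIT_GBPS)  # 24 SATA SSDs @ ~500MB/s each
nvme = delivered_throughput(24, 3.0, CONTROLLER_LIMIT_GBPS)  # 24 NVMe SSDs @ ~3GB/s each

print(f"24x SATA: drives could supply 12 GB/s, array delivers {sata:.0f} GB/s")
print(f"24x NVMe: drives could supply 72 GB/s, array delivers {nvme:.0f} GB/s")
```

In the SATA case the drives are the limit; in the NVMe case the controller is, which is exactly why swapping media alone doesn't multiply array performance.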

How Tech is Adopted

Tech adoption comes in waves until a significant technology advancement is affordable and reliable enough to become pervasive. For instance, ABS brakes were first used in planes in 1929 and were too expensive and cumbersome to use in everyday cars. Today, most cars have ABS brakes and we take for granted the added safety they offer.

But consider this: What if someone told you that in order to get a new kind of car (that has several great benefits) you would have to utterly give up things like airbags, ABS brakes, all-wheel-drive, traction control, limited-slip differential? Without an equivalent replacement for these functions?

You would probably realize that you’re not that excited about the new car after all, no matter how much better than your existing car it might be in other key aspects.

Storage arrays follow a similar paradigm. There are several very important business reasons that make people ask for things like HA, very strong RAID, multi-level checksums, encryption, compression, data reduction, replication, snaps, clones, hot firmware updates. Or the ability to dynamically scale a system. Or comprehensive cross-stack analytics and automatic problem prevention.

Such features evolved over a long period of time, and help mitigate risk and accelerate business outcomes. They’re also not trivial to implement properly.

NVMe Arrays Today

The challenge I see with the current crop of ultra-fast NVMe over Fabrics arrays is that they're so focused on speed that they sacrifice the aforementioned enterprise features in pursuit of sheer performance. I get it: it takes great skill, time and effort to reliably implement such features, especially in a way that doesn't sap the performance potential of a system.

There is also a significant cost challenge in safely utilizing NVMe media en masse. Dual-ported SSDs are crucial for delivering proper HA, and current dual-ported NVMe SSDs tend to be very expensive per TB compared to current SAS/SATA SSDs. In addition, due to the much higher speed of the NVMe interface, many CPUs and PCIe switches are needed (even with future CPUs that include FPGAs) to create a highly scalable system that can fully utilize such SSDs while maintaining enterprise features, which further explains why most NVMe solutions using the more interesting devices tend to be rather limited.

There are also client-side challenges: using NVMe over Fabrics can often mean purchasing new HBAs and switches, plus dealing with some compromises. For instance, in the case of RoCE, DCB switches are necessary, end-to-end congestion management is a challenge, and routability isn't available until RoCE v2.

There’s a bright side: There actually exist some very practical ways to give customers the benefits of NVMe without taking away business-critical capabilities.

Realistic Paths to NVMe Adoption

We can divide the solution into two scenarios; the direction chosen will then depend on customer readiness and component availability. All of the following assumes no loss of important enterprise functionality (as we discussed, giving up on enterprise functionality is the easy way out when it comes to speed):

Scenario 1: Most customers are not ready to adopt host-side NVMe connectivity:

If this is the case, a good option would be an ultra-fast, byte-addressable device inside the controller to massively augment the RAM buffers (like 3D XPoint in a DIMM) or, if that isn't available, some next-gen NVMe drives acting as cache. That would provide an overall speed boost to clients without needing any client-side modifications. This approach would be the friendliest to an existing infrastructure (and a relatively economical enhancement for arrays), requiring neither all internal drives to be NVMe nor extensive array modifications.

You see, part of any competent array's job is using intelligence to hide underlying media issues from the end user. A good example: even super-fast SSDs can suffer from garbage collection latency incidents. A good system will smooth out the user experience so users won't see extreme latency spikes. This holds regardless of the chosen media and host interface, and I bet that if you were used to 100μs latencies and they suddenly spiked to 10ms for a while, it would be a bad day. Having an extra-large buffer in the array would make this smoothing easier, without requiring customers to change anything host-side.

An evolutionary second option would be to change all internal drives to NVMe, but making this practical would require wide availability of cost-effective dual-ported devices. Note that with low SSD counts (fewer than 12) this would provide speed benefits even if the customer doesn't adopt a host-side NVMe interface, but it becomes a diminishing-returns endeavor at larger scale unless the controllers are significantly modified.

Scenario 2: Large numbers of customers are ready and willing to adopt NVMe over Fabrics.

In this case, the first thing that needs to change is the array connectivity to the outside world. That alone will boost speeds on modern systems even without major modifications. Of course, this will often mean client and networking changes to be most effective, and often such changes can be costly.

The next step depends on the availability of cost-effective dual-ported NVMe devices. But in order for very large performance benefits to be realized, pretty big boosts to CPU and PCIe switch counts may be necessary, necessitating bigger changes to storage systems (and increased costs).

Architecture Matters

In the quest for ultra-low latency and high throughput without sacrificing enterprise features (yet remaining reasonably cost-effective), overall architecture becomes extremely important.

For instance, how will one do RAID? Even with NVMe over Fabrics, approaches like erasure coding and triple mirroring can be costly from an infrastructure perspective. Erasure coding remains CPU-hungry (even more so when trying to hit ultra-low latencies), and triple mirroring across an RDMA fabric would mean massive extra traffic on that fabric.

Localized CPU:RAID domains remain more efficient, and mechanisms such as Nimble NCM can fairly distribute the load across multiple storage nodes without relying on a cluster network for heavy I/O. This technology is available today.

Next Steps

In summary, I urge customers to carefully consider the overall business impact of their storage decisions, especially when it comes to new technologies and protocols. Understand the true benefits first, carefully balance risk against desired outcome, and consider the overall system and not just the components; hence this article. Just make sure that, in the quest for a certain ideal, you don't give up critical functionality you've been taking for granted.

Uncompromising Resiliency

(cross-posted at https://www.nimblestorage.com/blog/uncompromising-resiliency/)

The cardinal rule for enterprise storage systems is to never compromise when it comes to data integrity and resiliency.  Everything else, while important, is secondary.

Many storage consumers are not aware of what data integrity mechanisms are available or which ones are necessary to meet their protection expectations and requirements. It doesn’t help that a lot of the technologies and the errors they prevent are rather esoteric. However, if you want a storage system that safely stores your data and always returns it correctly, no measure is too extreme.

The Golden Rules of Storage Engineering

When architecting enterprise storage systems, there are three Golden Rules to follow.

In order of criticality:

  1. Integrity: Don’t ever return incorrect data
  2. Durability: Don’t ever lose any data
  3. Availability: Don’t ever lose access to data

To better understand the order, ask yourself: which is preferable, temporary loss of access to data, or the storage system returning the wrong data without anyone even knowing it's wrong?

Imagine life-or-death situations, where the wrong piece of information could have catastrophic consequences. Interestingly, vendors exist that focus a lot on Availability (even offering uptime "guarantees") but are lacking in Integrity and Durability. Being able to access the array while the data is corrupt is almost entirely useless. Consider modern storage arrays with data deduplication and/or multi-petabyte storage pools: the effects of corruption are far more severe now that a single block can represent the data of anywhere from one to 100+ logical blocks, and data is spread across tens to hundreds of drives instead of a few.

The Nimble Storage Approach

Nimble Storage has taken a multi-stage approach to satisfy the Golden Rules, and in some cases, the amount of protection offered verges on being paranoid (but the good kind of paranoid).

Simply, Nimble employs these mechanisms:

  1. Integrity: Comprehensive multi-level checksums
  2. Durability: Hardened RAID protection and resilience upon power loss
  3. Availability: Redundant hardware coupled with predictive analytics

We will primarily focus on the first two as they are often glossed over, assumed, or not well understood. Availability will be discussed in a separate blog, however it is good to mention a few details here.

To start, Nimble has achieved greater than 99.9997% measured uptime since 2014. This is measured across more than 9,000 customers using multiple generations of hardware and software. A key aspect of Nimble's availability comes from InfoSight, which continually improves and learns as more systems are used. Each week, trillions of data points are analyzed and processed with the goal of predicting and preventing issues, not just in the array, but across the entire infrastructure. 86% of issues are detected and automatically resolved before the customer is even aware of the problem. To further enhance this capability, Nimble's Technical Support Engineers can resolve issues faster because they have all the data available when an issue arises. This bypasses the hours, days, or weeks often required to collect data, send it to support, analyze, and repeat until a solution can be found.

Data Integrity Mechanisms in Detail

The goal is simple: What is read must always match what was written. And, if it doesn’t, we fix it on the fly.

What many people don't realize is that there are occasions where storage media will lose a write, corrupt it, or place it at the wrong location on the media. RAID (including 3-way mirroring) and erasure coding are not enough to protect against such issues. The older T10 PI scheme employed by some systems is also not enough to protect against all eventualities.

The solution involves checksums, which get more computationally intensive the more paranoid one is. Because of that computational cost, certain vendors employ checksums minimally or not at all in order to gain more performance or a faster time to market. Unfortunately, that trade-off can lead to data corruption.

Broadly, Nimble creates a checksum and a “self-ID” for each piece of data. The checksum protects against data corruption. The self-ID protects against lost/misplaced writes and misdirected reads (incredible as it may seem, these things happen enough to warrant this level of protection).

For instance, if the written data has a checksum, and corruption occurs, when the data is read and checksummed again, the checksums will not match. However, if instead the data was placed at an incorrect location on the media, the checksums will match, but the self-IDs will not match.

[Image: block checksum and self-ID example]
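
To make the checksum-plus-self-ID idea concrete, here's a deliberately tiny sketch. It's purely illustrative: a CRC32 and a block address stand in for whatever the real on-disk format and algorithms actually are.

```python
import zlib

# Toy illustration of block checksums plus a "self-ID": a CRC32 and the intended
# block address stand in for whatever the real on-disk format uses.

def make_block(payload: bytes, intended_address: int) -> dict:
    return {
        "data": payload,
        "checksum": zlib.crc32(payload),  # catches corrupted data
        "self_id": intended_address,      # catches data returned from the wrong location
    }

def verify_block(block: dict, address_read_from: int) -> str:
    if zlib.crc32(block["data"]) != block["checksum"]:
        return "corruption detected -> repair from parity/RAID"
    if block["self_id"] != address_read_from:
        return "misplaced write or misdirected read detected -> repair from parity/RAID"
    return "ok"

blk = make_block(b"user data", intended_address=1000)
print(verify_block(blk, 1000))  # ok
print(verify_block(blk, 2042))  # this block doesn't belong at address 2042
```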

Where it gets interesting:

Nimble doesn’t just do block-level checksums/IDs. These multi-level checksums are also performed:

  1. Per segment in each write stripe
  2. Per block, before and after compression
  3. Per snapshot (including all internal housekeeping snaps)
  4. For replication
  5. For all data movement within a cluster
  6. For all data and metadata in NVRAM

This way, every likely data corruption event is covered, including metadata consistency and replication issues, which are often overlooked.

Durability Mechanisms in Detail

There are two kinds of data on a storage system and both need to be protected:

  1. Data in flight
  2. Data on persistent storage

One may differentiate between user data and metadata but we protect both with equal paranoid fervor. Some systems try to accelerate operations by not protecting metadata sufficiently, which greatly increases risk. This is especially true with deduplicating systems, where metadata corruption can mean losing everything!

Data in flight is data that is not yet committed to persistent storage. Nimble ensures all critical data in flight is checksummed and committed to both RAM and an ultra-fast byte-addressable NVDIMM-N memory module sitting right on the motherboard. The NVDIMM-N is mirrored to the partner controller and both controller NVDIMMs are protected against power loss via a supercapacitor. In the event of a power loss, the NVDIMMs simply flush their contents to flash storage. This approach is extremely reliable and doesn’t need inelegant solutions like a built-in UPS.

Data on persistent storage is protected by what we call Triple+ Parity RAID, which is three orders of magnitude more resilient than RAID6 (for comparison, RAID6 is three orders of magnitude more resilient than RAID5). The "+" sign means that there is extra intra-drive parity that can safeguard against entire sectors being lost even if three whole drives fail in a single RAID group.

Some might say this is a bit much; however, with drive sizes increasing rapidly (especially for SSDs) and read error rates increasing as drives age, it was the architecturally correct choice to make.

In Summary

Users frequently assume that all storage systems will safely store their data. And they will, most of the time. But when it comes to your data, “most of the time” isn’t good enough. No measure should be considered too extreme. When looking for a storage system, it’s worth taking the time to understand all situations where your data could be compromised. And, if nothing else, it’s worth choosing a vendor who is paranoid and goes to extremes to keep your data safe.

D

The Importance of Automated Headroom Management

Before we begin: This is another vendor-neutral post. I realize there may be no architecture that can do everything I’m proposing, but some may come closer to what you need than others. Whether you’re a vendor or a customer, see it as stuff you should be doing or be asking for respectively…

Headroom!

Headroom is a term that applies to almost all technologies, and it’s crucially important for all of them. Some examples:

  • Photography
  • Cars
  • Bridges
  • Storage arrays…

Why is Sufficient Headroom Important?

Maintaining sufficient headroom in any solution is a way to ensure safety and predictability of operation under most conditions (especially under unfavorable ones).

For instance, if the maximum load for an evenly loaded bridge before it collapses is X, the overall recommended load will be a fraction of that. But even the weight/length/axle count of a single truck on a bridge will also be subject to certain strict limits in order to avoid excessive localized stress on the structure.

Headroom in Storage Arrays

Apologies to the seasoned storage pros for all the foundational material but it’s crucial to take this step by step.

It is important to note that headroom in arrays is not necessarily as simple as how busy the CPU is. Headroom is a multi-dimensional concept.

More factors than just CPU come into play, including how busy the underlying storage media are, how saturated various buses are, and how much of the CPU is spent on true workload vs opportunistic tasks (that could be deferred). Not to mention that in some systems, certain tasks are single-threaded and could pose an overall headroom bottleneck by maxing out a single CPU core, while the rest of the CPU is not busy at all.
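
One way to express that multi-dimensional view is a sketch like the following; the dimensions, numbers, and the simple min/max model are purely illustrative, not any vendor's actual calculation:

```python
# Illustrative sketch (not any vendor's actual model): headroom is bounded by the
# most saturated dimension, and deferrable "opportunistic" CPU work shouldn't count.

def headroom_pct(utilization: dict, deferrable_cpu_pct: float = 0.0) -> float:
    dims = dict(utilization)
    # Only true workload counts against CPU headroom; GC, scrubs etc. can be deferred.
    dims["cpu"] = max(0.0, dims["cpu"] - deferrable_cpu_pct)
    # The busiest dimension (including a single maxed-out core) sets the ceiling.
    return 100.0 - max(dims.values())

util = {"cpu": 70, "busiest_core": 85, "media": 40, "pci_bus": 30}
print(headroom_pct(util, deferrable_cpu_pct=15))  # -> 15.0: one hot core is the real limit
```

Notice that in this toy example the overall CPU looks comfortable, yet a single maxed-out core is what actually determines the remaining headroom.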

Maintaining sufficient headroom in storage arrays is necessary in order to provide acceptable latency, especially in the event of high load during a controller failover. Depending on the underlying architecture of an array, different headroom approaches and calculations are necessary.

Some examples of different architectures:

  • Active-Active controllers, per-controller pool
  • Active-Standby, single pool
  • Active-Active, single pool
  • Grid, single pool
  • Permutations thereof (it’s beyond the scope of this article to explore all possible options)

The single vs multiple pool question complicates things a bit, plus things like disk ownership are also hugely important. This isn’t an argument about which architecture is better (it depends anyway), but rather about headroom management in different architectures.

Dual-Controller Headroom

Dual-controller architectures need to be extremely careful with headroom management. After all, there are only two controllers in play. Here’s what sufficient headroom looks like in a dual-controller system:

[Image: healthy headroom in a dual-controller HA system]

There are not many options to keep things healthy in a dual-controller architecture. In an Active-Standby system, the Standby controller is ready to immediately take over. There is no danger in loading up the Active controller, aside from expected load-related latency.

In an Active-Active HA system, maintaining a healthy amount of headroom has to be managed so that there is, overall, an entire controller’s worth of free headroom available.

Headroom in a Cluster of HA Pairs Architecture

There are several implementations that make use of a multiple-HA-pair architecture. Often, the multiple HA pairs present a virtual pool to the outside world, even if, internally, there are multiple private pools. Some implementations simply keep separate pools owned by each controller.

Here’s an example of healthy headroom in such a system:

[Image: healthy headroom across multiple HA pairs presenting a single virtual pool]

Even though there are multiple controllers (at least 4 total), in order to maintain an overall healthy system, one controller's worth (100%) of headroom needs to be maintained within each HA pair; otherwise the performance of an underlying private pool (in green) might suffer, making the overall virtual pool performance (light blue) unpredictable.

Headroom in a True Grid Architecture

Grid Architectures spread overall load among multiple nodes (often plain servers with some disks inside and connected via a network).

In such a scheme, overall headroom that needs to be maintained per node as a percentage is 100/N, where N is the number of nodes in the storage cluster.

So, in a 4-node cluster, 100/4=25% headroom per node needs to be maintained.

This doesn’t account for the significant work that rebalancing after a node failure takes in such architectures, nor the capacity headroom needed, but it’s roughly accurate enough for our purposes.
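
Here's the simple per-node rule expressed as a quick sketch (which, as noted above, ignores rebalancing effort and capacity reserves):

```python
# Per-node headroom to reserve in an N-node grid, using the simple 100/N rule above.

def grid_headroom_pct(node_count: int) -> float:
    return 100.0 / node_count

for n in (4, 8, 16):
    print(f"{n}-node grid: reserve ~{grid_headroom_pct(n):.1f}% headroom per node")
# 4 nodes -> 25.0%, 8 nodes -> 12.5%, 16 nodes -> 6.2%
```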

Schematically:

[Image: healthy per-node headroom in a grid architecture]

How Headroom is Managed is Crucial

In order to manage headroom, four things need to be able to happen first:

  1. Be able to calculate headroom
  2. Be able to throttle workloads
  3. Be able to prioritize between types of workload
  4. Be able to move workloads around (architecture-dependent).

The only architecture that inherently makes this a bit easier is Active-Standby since there is always a controller waiting to take over if anything bad happens. But even with a single active controller, headroom needs to be managed in order to avoid bad latency conditions during normal operation (see here for an example approach). Remember, headroom is a multi-dimensional thing.

Example Problem Case: Imbalanced & Overloaded Controllers

Consider the following scenario: An Active-Active system has both controllers overloaded, and one of them is really busy:

[Image: imbalanced and overloaded Active-Active controllers]

Clearly, there are a few problems with this picture:

  1. It may be impossible to fail over in the event of a controller failure (the total load is 165% of what a single controller can handle)
  2. The first controller may already be experiencing latency issues
  3. Why did the system even get to this point in the first place?

Unfortunately, this is a commonplace occurrence.

Automation is Key in Managing Headroom

The biggest problem in our example is actually the last point: Why was the system allowed to get to that state to begin with?

Even if a system is able to calculate headroom, throttle workloads and move workloads around, if nothing is done automatically to prevent problems, it’s extremely easy for users to get into the problem situation depicted above. I’ve seen it affect critical production systems far too many times for comfort.

Manual QoS is Not The Best Answer

Being able to manually throttle workloads can obviously help in such a situation. The problems with the manual QoS approach are outlined in a past article, but, in summary, most users simply have absolutely no idea what the actual limits should be (nor should they be expected to). Most importantly, placing QoS limits up front doesn’t result in balanced controllers… and may even result in other kinds of performance problems.

Of course, using QoS limits reactively is not going to prevent the problem from occurring in the first place.

Some companies offer Data Classification as a Professional Services engagement, in order to try and figure out an IOPS/TB/Application metric. Even if that is done, it doesn’t result in balanced controllers… it’s also not very useful in dynamic environments. It’s more used as a guideline for setting up manual QoS.

Automation Mechanisms to Consider for Managing Headroom

Clearly, pervasive automation is needed in order to keep headroom at safe levels.

I will split up the proposed mechanisms per architecture. There is some common functionality needed regardless of architecture:

Common Automation Needed

Every architecture needs the ability to automatically achieve the following, without user intervention at any point (a minimal sketch of the prioritization and throttling idea follows the list):

  1. Conserve headroom per controller
  2. Differentiate between different kinds of user workloads
  3. Differentiate between different kinds of system workloads
  4. Automatically prioritize between different workloads, especially under pressure
  5. Automatically throttle different kinds of workloads, especially under pressure
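
Here is one possible, heavily simplified way to picture automatic prioritization and throttling under pressure; the workload names, priorities, and thresholds are made up for illustration, not any vendor's actual implementation:

```python
# Illustrative sketch, not any vendor's implementation: as headroom shrinks,
# deferrable system work gets throttled first so user I/O keeps its latency.

PRIORITY = {"user_io": 0, "replication": 1, "garbage_collection": 2, "scrub": 3}

def throttle_plan(headroom_pct: float, active_workloads: list) -> dict:
    """Return a relative rate cap per workload (1.0 means unthrottled)."""
    plan = {w: 1.0 for w in active_workloads}
    if headroom_pct >= 30:                # healthy: nothing gets throttled
        return plan
    pressure = (30 - headroom_pct) / 30   # 0..1: how far below the safety threshold we are
    for w in active_workloads:
        plan[w] = max(0.1, 1.0 - pressure * PRIORITY[w] / 3)
    return plan

print(throttle_plan(10, ["user_io", "replication", "garbage_collection", "scrub"]))
# user I/O stays at 1.0 while lower-priority background work is progressively slowed
```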

And now for the extra automation needed per architecture:

Active-Standby Automation

If in a single HA pair, nothing else is needed. If in a scale-out cluster of Active-Standby pairs:

  1. Automatically balance capacity and headroom utilization between HA pairs even if they’re different types
  2. Be able to auto-migrate workloads to other cluster nodes (if using multiple pools instead of one)

Active-Active Automation

  1. Automatically conserve one node’s worth of headroom across the HA pair (50/50, 60/40, 70/30 – all are OK)
  2. When provisioning new workloads, auto-balance them by performance and capacity across the nodes
  3. Be able to balance by auto-migrating workloads to the other node (if using multiple pools instead of one)

Active-Active with Multiple HA Pairs Automation

  1. Automatically conserve one node’s worth of headroom per HA pair
  2. Be able to auto-migrate workloads to any other node
  3. Automatically balance workloads and capacity utilization in the underlying per-HA pools

Grid Automation

  1. Automatically conserve at least one node’s worth of headroom across the grid
  2. Automatically conserve enough capacity to be able to lose one node, rebalance, and have enough capacity left to lose another one (the more cautious may want the capability to lose 2-3 nodes simultaneously)
  3. Automatically take into account grid size and rebalancing effort in order to conserve the right amount of headroom

In Closing…

If you're a consumer of storage systems, always remember to run your storage with sufficient headroom to sustain a major failure without overly affecting your performance.

In addition, when looking to refresh your storage system, always ask the vendors about how they automate their headroom management.

Finally, if any vendor is quoting you performance numbers, always ask them how much headroom is left on the array at that performance level… (in addition to the extra questions about read/write percentages and latencies you should be asking already).

The answer may surprise you.

D

The Importance of SSD Firmware Updates

I wanted to bring this crucial issue to light since I’m noticing several storage vendors being either cavalier about this or simply unaware.

I will explain why solutions that don’t offer some sort of automated, live SSD firmware update mechanism are potentially extremely risky propositions. Yes, this is another “vendor hat off, common sense hat on” type of post.

Modern SSD Architecture is Complex

The increased popularity and lower costs of fast SSD media are good things for storage users, but there is some inherent complexity within each SSD that many people are unaware of.

Each modern SSD is, in essence, an entire pocket-sized storage array, that includes, among other things:

  • An I/O interface to the outside world (often two)
  • A CPU
  • An OS
  • Memory
  • Sometimes Compression and/or Encryption
  • What is, in essence, a log-structured filesystem, complete with complex load balancing and garbage collection algorithms
  • An array of flash chips driven in parallel through multiple channels
  • Some sort of RAID protection for the flash chips, including sparing, parity, error checking and correction…
  • A supercapacitor to safely flush cache to the flash chips in case of power failure.

Sound familiar?

With Great Power and Complexity Come Bugs

To make something clear: this discussion has nothing to do with overall SSD endurance and hardware reliability; it is only about the software aspect of the devices.

All this extra complexity in modern SSDs means that an increased number of bugs compared to simpler storage media is a statistical certainty. There is just a lot going on in these devices.

Bugs aren’t necessarily the end of the world. They’re something understood, a fact of life, and there’s this magical thing engineers thought of called… Patching!

As a fun exercise, go to the firmware download pages of various popular SSDs and check the release notes for some of the bugs fixed. Many fixes address some rather abject gibbering horrors… 🙂

Even costlier enterprise SSDs have been afflicted by some really dangerous bugs – usually latent defects (as in: they don’t surface until you’ve been using something for a while, which may explain why these bugs were missed by QA).

I fondly remember a bug that hit some arrays at a previous place of employment: the SSDs would work great, but after a certain number of hours of operation, if you shut your machine down, the SSDs would never come up again. Or another bug, affecting a very popular SSD, that would downsize the drive to an awesome 8MB of capacity (losing all existing data, of course) once certain conditions were met.

Clearly, these are some pretty hairy situations. And, what’s more, RAID, checksums and node-level redundancy wouldn’t protect against all such bugs.

For instance, think of the aforementioned power-off bug: all SSDs of the same firmware vintage would be affected simultaneously, and the entire array would have zero functioning SSDs. This actually happened; I'm not talking about a theoretical possibility. You know, just in case someone starts saying "but SSDs are reliable, and think of all the RAID!"

It’s all about approaching correctness from a holistic point of view. Multiple lines of defense are necessary.

The Rules: How True Enterprise Storage Deals with Firmware

Just like with Fight Club, there are some basic rules storage systems need to follow when it comes to certain things.

  1. Any firmware patching should be a non-event. Doesn’t matter what you’re updating, there should be no downtime.
  2. ANY firmware patching should be a NON-EVENT. Doesn’t matter what you’re updating, there should be NO downtime!
  3. Firmware updates should be automated even when dealing with devices en masse.
  4. The customer should automatically be notified of important updates they need to perform.
  5. Different vintage and vendor component updates should be handled automatically and centrally. And, most importantly: Safely.

If these rules are followed, bug risks are significantly mitigated and higher uptime is possible. Enterprise arrays typically follow the above rules (but always ask the vendor).

Why Firmware Updating is a Challenge with Some Storage Solutions

Certain kinds of solutions make it inherently harder to manage critical tasks like component firmware updates.

You see, being able to hot-update different kinds of firmware in any given set of hardware means that the mechanism doing the updating must be intimately familiar with the underlying hardware & software combination, however complex.

Consider the following kind of solution, maybe for someone sold on the idea that white box approaches are the future:

  • They buy a bunch of diskless server chassis from Vendor A
  • They buy a bunch of SSDs from Vendor B
  • They buy some Software Defined Storage offering from Vendor C
  • All running on the underlying OS of Vendor D…

Now, let’s say Vendor B has an emergency SSD firmware fix they made available, easily downloadable on their website. Here are just some of the challenges:

  1. How will that customer be notified by Vendor B that such a critical fix is available?
  2. Once they have the fix located, which Vendor will automate updating the firmware on the SSDs of Vendor B, and how?
  3. How does the customer know that Vendor B’s firmware fix doesn’t violently clash with something from Vendor A, C or D?
  4. How will all that affect the data-serving functionality of Vendor C?
  5. Can any of Vendors A, B, C or D orchestrate all the above safely?
  6. With no downtime?

In most cases I've seen, the above chain of events will not even progress past #1. The user will remain unaware of any update, simply because component vendors don't usually have a mechanism that alerts individual customers about firmware.

You could inject a significant permutation here: what if you buy the servers pre-built, including SSDs, from Vendor A, with full certification for Vendors C and D?

Sure – it still does not materially change the steps above. One of Vendors A, C or D still needs to somehow:

  1. Automatically alert the customer about the critical SSD firmware fix being available
  2. Be able to non-disruptively update the firmware…
  3. …While not clashing with the other hardware and software from Vendors A, C and D

I could expand this conversation to other things like overall environmental monitoring and checksums, but let's keep it simple for now and focus just on component firmware updates…

Always Remember – Solve Business Problems & Balance Risk

Any solution is a compromise. Always make sure you are comfortable with the added risk certain areas of compromise bring (and that you are fully aware of said risk).

The allure of certain approaches can be significant (at the very least because of lower promised costs). It’s important to maintain a balance between increased risk and business benefit.

In the case of SSDs specifically, the utter criticality of certain firmware updates means that it’s crucially important for any given storage solution to be able to safely and automatically address the challenge of updating SSD firmware.

D

The Importance of the Effective Capacity Ratio in Modern Storage Systems

In modern storage devices (especially All Flash Arrays), extensive data reduction techniques are commonplace and expected by customers.

This has, unavoidably, led to various marketing schemes that aim to make certain systems seem more appealing than the rest. Or at least not less appealing…

I will attempt to explain what customers should be looking for when trying to decipher capacity claims from a manufacturer.

In a nutshell – and for the ADD-afflicted – the most important number you should be looking for is the Effective Capacity Ratio, which is simply: (Effective Capacity)/(Raw Capacity). Ignore the more common but far less useful Data Reduction Ratio, which is: (Effective Capacity)/(Usable Capacity).

Read on for more detail…

The problem

Most modern (and some not so modern) storage vendors claim a similar Data Reduction Ratio for their devices. Numbers like 5:1 seem commonplace these days. Which, to anyone with a modicum of common sense, simply means “I can store five times more stuff in the same space”. Unfortunately, the same 5:1 ratio does not mean the same end result for all vendors…

Why this is a problem

Customers simply want to get a good deal for their money. The reality is that the exact same 5:1 Data Reduction Ratio may mean very different things between storage devices.

For instance: Not all vendors count space savings using the same math. Is the virtual size of snapshots calculated in the final savings ratio? (for example, if I take 5 snaps of a 1TB volume one after another, is that showing up as 6:1 savings?)

How about Thin Provisioning? Is that included? Such math can wildly alter the overall efficiency numbers.

Here’s a clear example of why counting thin provisioning in the overall ratio is probably misleading, and best used only for educational purposes… Remember that overall savings ratios are multiplicative. For example, 2:1 compression and 2:1 deduplication mean 4:1 overall savings:

[Image: thin provisioning inflating the overall savings ratio]

Does anyone really believe that such a system is actually providing 1000:1 savings? Or even 10:1? Yet that’s what many vendors are counting towards savings ratios, without separating the thin provisioning savings from the overall savings number.
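
Here's that math spelled out with made-up numbers (the provisioned and written figures are purely illustrative):

```python
# Why counting thin provisioning makes the ratio meaningless (made-up numbers).
# Savings ratios multiply.

compression     = 2.0    # 2:1
deduplication   = 2.0    # 2:1 -> 4:1 of real data reduction

provisioned_tb  = 500.0  # thin-provisioned volumes, mostly empty
written_tb      = 2.0    # data actually written
thin_ratio      = provisioned_tb / written_tb   # 250:1 of "savings" from empty space

honest_ratio    = compression * deduplication   # 4:1
marketing_ratio = honest_ratio * thin_ratio     # 1000:1

print(f"Real data reduction: {honest_ratio:.0f}:1")
print(f"With thin provisioning counted: {marketing_ratio:.0f}:1")
```

Nothing about the hardware or the data changed between those two numbers; only the accounting did.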

However, there is another dimension, and it has to do with how efficiently the raw capacity is actually utilized. It will be one of my many Captain Obvious moments for some of you, but I’ve seen enough people get confused, so it’s worth explaining.

The Effective Capacity Ratio

Different storage systems have different ways of utilizing their raw capacity. For example, a mirrored system can never have better than a 50% Usable:Raw ratio. By definition, it’s mathematically impossible since 2 copies are needed. That’s not even counting spares and other possible overheads.

Systems that do triple mirroring can’t do better than a theoretical 33.3% Usable:Raw etc.

Some definitions are in order:

  • Usable Capacity: How much data I can store in a system after overheads such as RAID, sparing etc. but before data reduction techniques.
  • Raw Capacity: The sum of the capacities of all the storage media in the system. Usually a base-10 number (TB/GB), not base-2 (TiB/GiB)
  • Effective Capacity: How much data I can store in a system after data reduction techniques like Deduplication and Compression but not Thin Provisioning
  • Data Reduction Ratio: (Effective Capacity)/(Usable Capacity)
  • Effective Capacity Ratio: (Effective Capacity)/(Raw Capacity)

Who is More Efficient?

If every vendor is claiming 5:1 average savings, who is truly more efficient? A Reductio ad Absurdum example makes it pretty clear:

  1. A vendor that can do a Data Reduction Ratio of 5:1 but has a Usable Capacity of 10% vs Raw or…
  2. A vendor that can do a Data Reduction Ratio of 5:1 but has a Usable Capacity of 70% vs Raw?

Let’s put some numbers on a table. They roughly correspond to some existing storage vendors today (there may be some variation depending on whether the numbers for each line are TB vs TiB – everyone shows numbers differently – but the overall point remains the same):

[Image: table of Effective Capacity results for the same Data Reduction Ratio at different Usable:Raw percentages]

As you can see, the same Data Reduction Ratio, on the same amount of Raw Capacity, can have wildly different Effective Capacity results, depending on the system.
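
A quick sketch of that point, using the hypothetical 10% and 70% Usable:Raw vendors from the earlier comparison and an arbitrary 100TB of raw capacity:

```python
# Hypothetical vendors, illustrative numbers: same 5:1 Data Reduction Ratio,
# very different Effective Capacity Ratios.

def effective_capacity_tb(raw_tb: float, usable_fraction: float, data_reduction_ratio: float) -> float:
    return raw_tb * usable_fraction * data_reduction_ratio

RAW_TB = 100.0
for name, usable in [("Vendor with 10% Usable:Raw", 0.10),
                     ("Vendor with 70% Usable:Raw", 0.70)]:
    eff = effective_capacity_tb(RAW_TB, usable, 5.0)
    print(f"{name}: {eff:.0f} TB effective, Effective Capacity Ratio = {eff / RAW_TB:.1f}:1")
# -> 50 TB (0.5:1) vs 350 TB (3.5:1) from the exact same raw capacity
```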

Clearly, a truly efficient system is one that can both:

  • Provide a high Usable:Raw ratio and
  • Provide a high Data Reduction Ratio (that does not include fluff like Thin Provisioning)

The Business Benefits of a High Effective Capacity Ratio

There are multiple business reasons why chasing a high Effective Capacity Ratio is important:

  • High rack density. Some vendors are eight times more dense than others (ironically, the ones claiming “white box economics” are the least dense and most expensive). This could lead to significant power/cooling and $/rack savings
  • Lower CapEx – if a vendor needs 2x the Raw Capacity vs another, guess who’s paying for the difference?
  • Less complexity – denser devices mean fewer devices, fewer cables, less switching

What you can do as a Customer

Level the playing field. It’s actually pretty easy:

  • Insist on seeing all capacity numbers expressed as TiB from every vendor – and if they don’t know what that means, run… (or at least subtract 9% from any capacity number you see and you’ll be safer)
  • Insist on seeing the Data Reduction Ratio without including Thin Provisioning or Snapshots. If they can’t break out the savings by category, run
  • Insist on seeing the Usable:Raw percentage for various configurations. Is it better in larger configs vs smaller? Is it following all best practices? What are the best practices for space consumption? If they tell you to ignore the man behind the curtain, run… (there is a theme developing)
  • Do your own Effective Capacity Ratio calculation and assign a $/Effective TiB to each vendor

D
