Uncompromising Resiliency

(cross-posted at https://www.nimblestorage.com/blog/uncompromising-resiliency/)

The cardinal rule for enterprise storage systems is to never compromise when it comes to data integrity and resiliency.  Everything else, while important, is secondary.

Many storage consumers are not aware of what data integrity mechanisms are available or which ones are necessary to meet their protection expectations and requirements. It doesn’t help that a lot of the technologies and the errors they prevent are rather esoteric. However, if you want a storage system that safely stores your data and always returns it correctly, no measure is too extreme.

The Golden Rules of Storage Engineering

When architecting enterprise storage systems, there are three Golden Rules to follow.

In order of criticality:

  1. Integrity: Don’t ever return incorrect data
  2. Durability: Don’t ever lose any data
  3. Availability: Don’t ever lose access to data

To better understand the order, ask yourself: “Which is preferable, a temporary loss of access to data, or a storage system that returns the wrong data without anyone even knowing it’s wrong?”

Imagine life-or-death situations, where the wrong piece of information could have catastrophic consequences. Interestingly, some vendors focus heavily on Availability (even offering uptime “guarantees”) but are lacking in Integrity and Durability. Being able to access an array whose data is corrupt is almost entirely useless. Consider modern storage arrays with data deduplication and/or multi-petabyte storage pools. The effects of corruption are far more severe now that a single stored block can represent the data for anywhere from one to more than a hundred blocks, and data is spread across tens to hundreds of drives instead of a few.

The Nimble Storage Approach

Nimble Storage has taken a multi-stage approach to satisfy the Golden Rules, and in some cases, the amount of protection offered verges on being paranoid (but the good kind of paranoid).

Simply, Nimble employs these mechanisms:

  1. Integrity: Comprehensive multi-level checksums
  2. Durability: Hardened RAID protection and resilience upon power loss
  3. Availability: Redundant hardware coupled with predictive analytics

We will primarily focus on the first two, as they are often glossed over, assumed, or not well understood. Availability will be discussed in a separate blog post, but a few details are worth mentioning here.

To start, Nimble has greater than six nines of measured uptime. This is measured across more than 9,000 customers using multiple generations of hardware and software. A key aspect of Nimble’s availability comes from InfoSight, which continually learns and improves as more systems are used. Each week, trillions of data points are analyzed and processed with the goal of predicting and preventing issues, not just in the array but across the entire infrastructure. 86% of issues are detected and automatically resolved before the customer is even aware of the problem. When an issue does need hands-on attention, Nimble’s Technical Support Engineers can resolve it faster because they already have all the data available. This bypasses the hours, days, or weeks often required to collect data, send it to support, analyze it, and repeat until a solution is found.
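To put “six nines” in perspective, the arithmetic below converts an availability figure into a yearly downtime budget. It is only a back-of-the-envelope sketch: the function name and the use of an average 365.25-day year are assumptions made for illustration, not part of how Nimble measures uptime.

```python
# Downtime budget implied by an availability figure expressed in "nines".
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length


def downtime_seconds_per_year(nines: int) -> float:
    """Maximum yearly downtime for N nines of availability (6 -> 99.9999%)."""
    unavailability = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {downtime_seconds_per_year(n):,.1f} seconds/year")
```

Six nines works out to roughly half a minute of downtime per year, compared with several hours per year at three nines.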

Data Integrity Mechanisms in Detail

The goal is simple: What is read must always match what was written. And, if it doesn’t, we fix it on the fly.

What many people don’t realize is that there are occasions where storage media will lose a write, corrupt it, or place it at the wrong location on the media. Neither RAID (including 3-way mirroring) nor Erasure Coding is enough to protect against such issues. The older T10 PI employed by some systems is also not enough to protect against all eventualities.

The solution involves checksums, which become more computationally intensive the more paranoid one is. Because of that cost, certain vendors skip checksums or employ them only minimally to gain performance or reach the market faster. Unfortunately, the trade-off can lead to data corruption.

Broadly, Nimble creates a checksum and a “self-ID” for each piece of data. The checksum protects against data corruption. The self-ID protects against lost/misplaced writes and misdirected reads (incredible as it may seem, these things happen enough to warrant this level of protection).

For instance, data is checksummed when it is written. If corruption occurs, the checksum computed when the data is read back will not match the stored one. If instead the data was placed at an incorrect location on the media, the checksums will match, but the self-IDs will not.
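The two failure cases can be sketched in a few lines. This is an illustration only, not Nimble’s implementation: the `StoredBlock` layout, the use of CRC32 as the checksum, and the choice of the logical address as the self-ID are all assumptions. With only an address as the self-ID, the sketch demonstrates corruption and misplaced writes; it does not model how a lost write would be detected.

```python
import zlib
from dataclasses import dataclass


class IntegrityError(Exception):
    """Raised when a block fails verification on read."""


@dataclass(frozen=True)
class StoredBlock:
    self_id: int    # hypothetical self-ID: the address the block was written for
    checksum: int   # CRC32 of the payload (stand-in for a stronger checksum)
    payload: bytes


def write_block(media: dict, address: int, payload: bytes) -> None:
    """Store the payload together with its checksum and self-ID."""
    media[address] = StoredBlock(address, zlib.crc32(payload), payload)


def read_block(media: dict, address: int) -> bytes:
    """Return the payload only if both the checksum and the self-ID verify."""
    block = media[address]
    if zlib.crc32(block.payload) != block.checksum:
        raise IntegrityError("checksum mismatch: payload corrupted")
    if block.self_id != address:
        raise IntegrityError("self-ID mismatch: block is at the wrong location")
    return block.payload


if __name__ == "__main__":
    media = {}
    write_block(media, 7, b"hello")
    media[9] = media[7]  # simulate a write landing at the wrong address
    try:
        read_block(media, 9)
    except IntegrityError as err:
        print(err)
```

In the misplaced-write case the payload and checksum are both intact, so a checksum alone would accept the block; only the self-ID comparison rejects it.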


Where it gets interesting:

Nimble doesn’t just do block-level checksums/IDs. Multi-level checksums are also performed:

  1. Per segment in each write stripe
  2. Per block, before and after compression
  3. Per snapshot (including all internal housekeeping snaps)
  4. For replication
  5. For all data movement within a cluster
  6. For all data and metadata in NVRAM

This way, every likely data corruption event is covered, including metadata consistency and replication issues, which are often overlooked.
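As an illustration of item 2 in the list above, the sketch below checksums a block both before and after compression. The record layout, the function names, and the use of zlib and CRC32 are assumptions chosen for a self-contained example; the post does not describe Nimble’s actual on-disk format or algorithms.

```python
import zlib


def pack_block(raw: bytes) -> dict:
    """Compress a block and record a checksum of both representations."""
    compressed = zlib.compress(raw)
    return {
        "raw_crc": zlib.crc32(raw),            # checksum before compression
        "stored_crc": zlib.crc32(compressed),  # checksum after compression
        "stored": compressed,
    }


def unpack_block(record: dict) -> bytes:
    """Verify the stored bytes, decompress, then verify the original data."""
    if zlib.crc32(record["stored"]) != record["stored_crc"]:
        raise ValueError("stored (compressed) bytes are corrupted")
    raw = zlib.decompress(record["stored"])
    if zlib.crc32(raw) != record["raw_crc"]:
        raise ValueError("decompressed data does not match what was written")
    return raw


if __name__ == "__main__":
    record = pack_block(b"enterprise data " * 64)
    print(unpack_block(record) == b"enterprise data " * 64)
```

One plausible reason to keep both checksums: the post-compression checksum catches damage to the bytes as stored, while the pre-compression checksum would catch a fault in the compression or decompression path itself.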

Durability Mechanisms in Detail

There are two kinds of data on a storage system and both need to be protected:

  1. Data in flight
  2. Data on persistent storage

One may differentiate between user data and metadata, but we protect both with equal paranoid fervor. Some systems try to accelerate operations by not protecting metadata sufficiently, which greatly increases risk. This is especially true with deduplicating systems, where metadata corruption can mean losing everything!

Data in flight is data that is not yet committed to persistent storage. Nimble ensures all critical data in flight is checksummed and committed to both RAM and an ultra-fast byte-addressable NVDIMM-N memory module sitting right on the motherboard. The NVDIMM-N is mirrored to the partner controller and both controller NVDIMMs are protected against power loss via a supercapacitor. In the event of a power loss, the NVDIMMs simply flush their contents to flash storage. This approach is extremely reliable and doesn’t need inelegant solutions like a built-in UPS.

Data on persistent storage is protected by what we call Triple+ Parity RAID, which is three orders of magnitude more resilient than RAID6. For comparison, RAID6 is three orders of magnitude more resilient than RAID5. The “+” sign means that there is extra intra-drive parity that can safeguard against entire sectors being lost even if three whole drives fail in a single RAID group.

Some might say this is a bit much. However, with drive sizes increasing rapidly (especially SSDs) and drive read error rates increasing as drives age, it was the architecturally correct choice to make.

In Summary

Users frequently assume that all storage systems will safely store their data. And they will, most of the time. But when it comes to your data, “most of the time” isn’t good enough. No measure should be considered too extreme. When looking for a storage system, it’s worth taking the time to understand all situations where your data could be compromised. And, if nothing else, it’s worth choosing a vendor who is paranoid and goes to extremes to keep your data safe.