Uncompromising Resiliency

The cardinal rule for enterprise storage systems is to never compromise when it comes to data integrity and resiliency. Everything else, while important, is secondary.

Many storage consumers are not aware of what data integrity mechanisms are available or which ones are necessary to meet their protection expectations and requirements. It doesn’t help that a lot of the technologies and the errors they prevent are rather esoteric. However, if you want a storage system that safely stores your data and always returns it correctly, no measure is too extreme.

The Golden Rules of Storage Engineering

When architecting enterprise storage systems, there are three Golden Rules to follow.

In order of criticality:

  1. Integrity: Don’t ever return incorrect data
  2. Durability: Don’t ever lose any data
  3. Availability: Don’t ever lose access to data

To better understand the order, ask yourself: "What is preferable, a temporary loss of access to data, or a storage system returning the wrong data without anyone even knowing it's wrong?"

Imagine life-or-death situations, where returning the wrong piece of information could have catastrophic consequences. Interestingly, some vendors focus heavily on Availability (even offering uptime guarantees) but are lacking in Integrity and Durability. Being able to access an array that serves up corrupted data is almost entirely useless, especially for truly critical data.

Another reason why focusing on extreme data integrity is important: consider modern storage arrays with data deduplication. The effects of any data corruption are far more severe when a single block represents the data for potentially hundreds of blocks.

Arrays nowadays do more than just store user data in a single place. Data moves around, which means further writes happen beyond what the user intended. For example:

  • Systems that do autotiering move data around the system as a matter of course.
  • Systems that do garbage collection constantly re-write valid data and clean up segments.
  • Systems that do post-process data reduction constantly re-write data.
  • Systems that do wide striping routinely rebalance existing data, especially if new capacity has been added.

So it follows that you don't only need to worry about the long-term validity of the data the user wrote in the first place.

You also need to worry about the validity of that data as the array constantly re-writes it.

What would happen if the array, while re-writing data for its own housekeeping purposes, had no way to reliably protect against every eventuality of data corruption?

In other words, you also have to worry about arrays creating their own internal errors and propagating them, undetected, during normal internal maintenance…

The Nimble Storage Approach

Nimble Storage has taken a multi-stage approach to satisfy the Golden Rules, and in some cases, the amount of protection offered verges on being ultra-paranoid (but the good kind of paranoid).

Simply put, Nimble employs these mechanisms:

  1. Integrity: Comprehensive multi-level checksums that protect even against lost writes, misplaced writes, and misdirected reads
  2. Durability: Hardened RAID protection and resilience upon power loss
  3. Availability: Redundant hardware coupled with predictive analytics

We will primarily focus on the first two, as they are often glossed over, assumed, or not well understood. Availability will be discussed in a separate blog post; however, a few details are worth mentioning here.

To start, Nimble has greater than six nines of measured uptime. This is measured over a long period of time, across more than 13,000 customers using multiple generations of hardware and software.

A key aspect of Nimble's availability comes from InfoSight, which continually improves and learns as more systems are used. Trillions of data points are analyzed and processed with the goal of predicting and preventing issues, not just in the array but across the entire infrastructure. 86% of issues are detected and automatically resolved before the customer is even aware of the problem. To further enhance this capability, Nimble's Technical Support Engineers can resolve issues faster because they have all the data available when an issue arises. This bypasses the hours, days, or weeks often required to collect data, send it to support, analyze, and repeat until a solution can be found.

Data Integrity Mechanisms in Detail

The goal is simple: What is read must always match what was written. And, if it doesn’t, we fix it on the fly.

What many people don't realize is that there are occasions when storage media will lose a write or place it at the wrong location on the media. RAID (including N-way mirroring) and Erasure Coding are not enough to protect against such issues. The T10 standards employed by many systems are not quite enough to protect against all eventualities, such as lost writes.

The anti-corruption solution generally involves checksums, which get more computationally intensive the more paranoid one is. Because checksums are computationally expensive, certain vendors employ them minimally or not at all in order to gain more performance or a faster time to market. Unfortunately, that trade-off can lead to data corruption.

Broadly, at the lowest level, Nimble creates a checksum and a “self-ID” for each piece of data. The checksum protects against data corruption. The self-ID protects against lost/misplaced writes and misdirected reads (incredible as it may seem, these things happen enough to warrant this level of protection).

For instance, if the written data has a checksum and corruption occurs, the checksums will not match when the data is read and checksummed again. If, instead, the data was placed at an incorrect location on the media, the checksums will match but the self-IDs will not.

[Figure: checksums]
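
To make this concrete, here's a minimal Python sketch of the idea. This is not Nimble's implementation; the CRC32 checksum and the use of the LBA as the self-ID are just illustrative stand-ins:

    import zlib

    def write_block(media, lba, data):
        """Store data together with a checksum and a self-ID (here, the intended LBA)."""
        media[lba] = {
            "data": data,
            "checksum": zlib.crc32(data),  # detects corruption of the payload itself
            "self_id": lba,                # detects misplaced writes / misdirected reads
        }

    def read_block(media, lba):
        """Verify both the checksum and the self-ID before returning data."""
        record = media[lba]
        if zlib.crc32(record["data"]) != record["checksum"]:
            raise IOError(f"checksum mismatch at LBA {lba}: data corrupted")
        if record["self_id"] != lba:
            raise IOError(f"self-ID mismatch at LBA {lba}: misplaced write or misdirected read")
        return record["data"]

    # Simulate a misplaced write: the drive put LBA 7's record where LBA 9 should be.
    media = {}
    write_block(media, 7, b"payload")
    media[9] = media.pop(7)   # data is intact, so a checksum alone would pass...
    read_block(media, 9)      # ...but the self-ID check raises an error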

Where it gets interesting:

Nimble doesn't just do single-level, block-level checksums/IDs. The following multi-level checksums are also performed (I call this cascade checksumming):

  1. Per chunk (unit written to each drive, see here)
  2. Per block, before and after compression
  3. For snapshots (including all internal housekeeping snaps)
  4. For replication
  5. For all data movement within a cluster
  6. All data and metadata in NVRAM

This way, every likely data corruption event is covered, including metadata consistency and replication issues, which are often overlooked.
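
As a small illustration of item 2 in the list above, here is a hedged Python sketch of checksumming a block both before and after compression, so that on-media corruption and problems in the compression path are caught independently (CRC32 and zlib are stand-ins; Nimble's actual checksums and compressor aren't specified here):

    import zlib

    def store_compressed(block):
        """Checksum the logical block, compress it, then checksum the compressed form too."""
        logical_ck = zlib.crc32(block)       # protects the data the host wrote
        payload = zlib.compress(block)
        physical_ck = zlib.crc32(payload)    # protects the on-media representation
        return {"payload": payload, "physical_ck": physical_ck, "logical_ck": logical_ck}

    def retrieve(record):
        """Verify both layers: the stored bytes and the decompressed result."""
        if zlib.crc32(record["payload"]) != record["physical_ck"]:
            raise IOError("on-media corruption detected before decompression")
        block = zlib.decompress(record["payload"])
        if zlib.crc32(block) != record["logical_ck"]:
            raise IOError("decompression returned data that differs from what was written")
        return block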

Durability Mechanisms in Detail

There are two kinds of data on a storage system and both need to be protected:

  1. Data in flight
  2. Data on persistent storage

One may differentiate between user data and metadata, but we protect both with equal paranoid fervor. Some systems try to accelerate operations by not protecting metadata sufficiently, which greatly increases risk. This is especially true with deduplicating systems, where metadata corruption can mean losing everything!

Data in flight is data that is not yet committed to persistent storage. Nimble ensures all critical data in flight is checksummed and committed to both RAM and an ultra-fast byte-addressable NVDIMM-N memory module sitting right on the motherboard. The NVDIMM-N is mirrored to the partner controller and both controller NVDIMMs are protected against power loss via a supercapacitor. In the event of a power loss, the NVDIMMs simply flush their contents to flash storage. This approach is extremely reliable and doesn’t need inelegant solutions like a built-in UPS.
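
Conceptually, the write-acknowledgement path looks something like the Python sketch below. This is a simplification under my own assumptions, not Nimble's code: the host write is checksummed, persisted to the local NVDIMM-backed log, mirrored to the partner controller, and only then acknowledged.

    import zlib

    class NVRAMLog:
        """Stand-in for a power-protected, byte-addressable NVDIMM-N log."""
        def __init__(self):
            self.entries = []

        def append(self, entry):
            # In real hardware this would be a persistent, power-loss-safe write.
            self.entries.append(entry)

    def handle_host_write(local_nvram, partner_nvram, data):
        """Only acknowledge once the entry is safe in two independent protected locations."""
        entry = {"data": data, "checksum": zlib.crc32(data)}
        local_nvram.append(entry)    # committed on the local controller
        partner_nvram.append(entry)  # mirrored to the partner controller
        return "ACK"                 # now it is safe to acknowledge the host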

Data on persistent storage is protected by what we call Triple+ Parity RAID, which is many orders of magnitude more resilient than RAID6. For comparison, RAID6 is itself several orders of magnitude more resilient than RAID5 (making Triple+ hundreds of thousands of times more resilient than RAID5).

The "+" sign means there is extra intra-drive parity that can safeguard against entire sectors being lost even after three whole drives have failed simultaneously in a single RAID group. The extra parity is at the chunk level (the unit written per drive).

Think about it: a Nimble system can lose any three drives simultaneously, and, while parity is being rebuilt, every remaining drive could simultaneously be suffering from sector read errors; still, no data would be lost and the rebuild would complete.

Some might say this is a bit over-engineered; however, with drive sizes increasing rapidly (especially for SSDs) and read error rates increasing as drives age, it was the architecturally correct choice for a future-proofed solution.

It also really helps in the case of a bad batch of drives (I’ve seen it happen to various vendors – delayed-onset errors).

If you want to test the reliability numbers on your own, find a Mean Time to Data Loss calculator (there are some online). Compare triple parity to other schemes (the calculator might have triple parity listed as z3). Then multiply the number for triple parity by 5 (the extra reliability the additional chunk parity affords) and you get the Nimble Triple+ number 🙂

A cool experiment is to try really large drive sizes in the MTTDL calculator, like 100TB, and assume poor MTBF and read error rates (simulating maybe a bad batch of drives, overheating, solar flare activity, whatever bad circumstances you can think of).

Look at the numbers.
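
If you'd rather script it than use an online calculator, the following Python sketch uses the classic simplified MTTDL formulas (independent failures, exponential lifetimes, unrecoverable read errors ignored). The drive count, MTTF, and rebuild time are illustrative assumptions, and the factor of 5 for the extra chunk parity is simply the rule of thumb from above:

    HOURS_PER_YEAR = 24 * 365

    def mttdl_single_parity(n, mttf, mttr):
        # RAID5-style: data is lost when a second drive fails during a rebuild
        return mttf**2 / (n * (n - 1) * mttr)

    def mttdl_double_parity(n, mttf, mttr):
        # RAID6-style: requires three overlapping failures
        return mttf**3 / (n * (n - 1) * (n - 2) * mttr**2)

    def mttdl_triple_parity(n, mttf, mttr):
        # Triple parity (often listed as "z3" in online calculators)
        return mttf**4 / (n * (n - 1) * (n - 2) * (n - 3) * mttr**3)

    n = 12            # drives in the RAID group (assumed)
    mttf = 1_000_000  # drive MTTF in hours (assumed; try pessimistic values too)
    mttr = 24         # rebuild time in hours (assumed; grows with drive size)

    for name, fn in [("single parity", mttdl_single_parity),
                     ("double parity", mttdl_double_parity),
                     ("triple parity", mttdl_triple_parity)]:
        print(f"{name:>13}: {fn(n, mttf, mttr) / HOURS_PER_YEAR:.3e} years")

    # Per the rule of thumb above, multiply triple parity by 5 to approximate
    # the extra protection from the additional intra-drive (chunk) parity.
    print(f"{'Triple+':>13}: {5 * mttdl_triple_parity(n, mttf, mttr) / HOURS_PER_YEAR:.3e} years")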

In Summary

Users frequently assume that all storage systems will safely store their data. And they will, most of the time. But when it comes to your data, “most of the time” isn’t good enough. No measure should be considered too extreme. When looking for a storage system, it’s worth taking the time to understand all situations where your data could be compromised. And, if nothing else, it’s worth choosing a vendor who is uncompromisingly paranoid about data integrity and goes to extremes to keep your data safe.

D
