HCI Failure Modes and Maintenance

I got the idea for this post after speaking with multiple customers who were contemplating a switch to certain kinds of grid computing/storage (like HCI) without fully understanding the ramifications of doing so.

You see, they were (rightly) enamored with concepts such as automation, ease of consumption and scaling. But they forgot to ask some very important questions, and that's the danger of getting too carried away with something new and taking things for granted.

This isn’t a post claiming HCI and grid-type storage constructs are bad. Like any tool, they can be used in various ways, some of them aggressively ill-advised. The point of this post is to help customers ask for the right configuration so they don’t get stuck with a sub-optimal and risky design.

I tried to make this post as short as possible but as someone once said, “Everything should be made as simple as possible, but not simpler”. Which, ironically, is a simplification of what Einstein actually said 🙂

Protection Trends

I find it interesting, and not a little depressing, that while modern storage systems default to extremely strong protection against component failures, some vendors are taking several steps backwards when it comes to such matters.

For instance, while almost all modern storage systems can sustain losing any two drives simultaneously by default (and Nimble can lose any three and then some), certain HCI vendors mostly sell systems with single drive loss protection.

But why do this? Why the cavalier attitude? Cutting costs is the obvious answer, but it’s worth exploring this a bit more to show just how much of a problem it can be and why it’s dangerous to ignore.

HCI Protection While Using Data Reduction

The HCI offering from a certain hypervisor vendor separates storage into something called Disk Groups (one cache drive and up to seven capacity drives per group). Each server node has at least one Disk Group, sometimes two or more. This is not the same kind of construct as a RAID group. It’s best to think of it as one large logical drive consisting of up to eight physical drives.

Here’s the interesting part: If the customer is using data reduction (compression + dedupe), and ANY ONE drive in the Disk Group is lost, then the ENTIRE Disk Group is considered lost and all data on it has to be rebuilt elsewhere!

Oh, and even if not using data reduction at all, if the cache drive dies, then all the drives in the Disk Group are, again, marked as failed.

This suddenly turns a normal single-drive failure into an event involving the loss of up to EIGHT drives.

Yes, the contents of those drives are protected elsewhere in the cluster, but it’s far easier, safer and faster to rebuild one drive than many.

But, more importantly, if the system only protects against a single drive loss, you now have seven capacity drives with unprotected data, not one. With only one copy of the data left for each of those seven drives, you are at risk of suffering an Unrecoverable Read Error (URE) during the rebuild, which would render the data corrupt and the rebuild failed. And, statistically, rebuilding the contents of all seven drives means reading data from a lot of drives, which increases the chances of hitting a URE.
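To put rough numbers on that risk, here is a minimal back-of-the-envelope sketch. The UBER and drive sizes are illustrative assumptions, not vendor data; it simply compares the chance of hitting at least one URE when rebuilding a single drive versus a whole seven-drive Disk Group:

```python
import math

def p_ure(bytes_read: float, uber: float = 1e-16) -> float:
    """Probability of at least one unrecoverable read error while
    reading bytes_read bytes, assuming independent bit errors at
    the given Unrecoverable Bit Error Rate (UBER)."""
    bits = bytes_read * 8
    # log1p/expm1 preserve precision: the per-bit probability is tiny.
    return -math.expm1(bits * math.log1p(-uber))

TB = 1e12  # terabyte in bytes
print(f"rebuild one 4 TB drive:        {p_ure(4 * TB):.2%}")
print(f"rebuild a 7 x 4 TB disk group: {p_ure(7 * 4 * TB):.2%}")
```

With these assumed numbers, the disk-group rebuild reads seven times the data, so the odds of encountering a URE rise roughly seven-fold. Real-world rates vary considerably with media age and workload.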

Why subject yourself to that kind of risk?

HCI Protection While Performing Maintenance or In Case of Node Loss

Grid-type storage, and by extension most HCI, uses drives inside servers to build up the totality of the storage pool.

It follows that when you perform routine maintenance on a server, or a node disappears for any reason, that whole server’s storage is gone from the cluster until the maintenance is complete. Which means there is now less protection, since the copies of data on that server have disappeared from the cluster.

It’s just how these systems work, but that’s another reason to require “simultaneous any” dual drive loss protection at a minimum. That way, even if one whole server is out of the equation, there’s still sufficient protection for all your data. 

Checksums – Yes, They Are Still Needed!

Concepts such as checksumming don’t become obsolete just because one decides to consume storage capacity in a different way. That’s like saying “I bought a battery-powered car so now I don’t need ABS brakes any more”.

Don’t ignore fundamental concepts.

Strong checksums can protect against errors that are not caught by RAID. The simplest one is a corrupt read, and almost all disk arrays at a minimum protect against that.

However, stricter checksums are needed for other errors: misplaced reads, misplaced writes, lost writes, torn pages…

Such errors, if not detected, can result in the wrong information being read, and error propagation. Which can be a very bad thing indeed…
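To illustrate why stricter checksums catch these errors, here is a minimal, hypothetical sketch (not any vendor’s actual on-disk format). A plain per-sector CRC would pass a block that is internally intact but sitting in the wrong place; binding the block’s logical address into the checksum catches misplaced reads and writes:

```python
import hashlib

def seal(data: bytes, lba: int) -> bytes:
    # The checksum covers the payload PLUS its intended logical
    # block address, so the block can't silently be the wrong one.
    h = hashlib.sha256(lba.to_bytes(8, "big") + data).digest()[:8]
    return h + data

def unseal(blob: bytes, lba: int) -> bytes:
    h, data = blob[:8], blob[8:]
    if hashlib.sha256(lba.to_bytes(8, "big") + data).digest()[:8] != h:
        raise IOError(f"checksum mismatch at LBA {lba}: corrupt or misplaced block")
    return data

disk = {}
disk[100] = seal(b"hello", 100)
assert unseal(disk[100], 100) == b"hello"   # normal read: passes

# Simulate a misplaced write: a block meant for LBA 100 lands at LBA 200.
disk[200] = disk[100]
try:
    unseal(disk[200], 200)                  # address doesn't match: caught
except IOError as e:
    print(e)
```

The same idea extends to lost writes by also including a version or generation number under the checksum, so a stale-but-valid block is detected too.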

The problem is that many of the people designing scale-out and HCI systems don’t think about such concepts, so don’t be afraid to ask any vendor about the strength of their checksumming and what it protects against.

FYI – some people will say that SSDs do some of their own checksumming. That is no excuse for not doing strong checksums outside the media! Case in point: a large SSD manufacturer recently had a bug in some of their high-end drives that resulted in lost writes that not all systems caught. Without strong checksumming, this can’t be detected, and the end result is silent data corruption.

Interestingly, one of the most popular HCI products didn’t even have any checksumming until relatively recently, so no vendor is immune to this line of questioning. Please keep your data safe.

This is Not Fear Mongering – Just “Back to Basics” Advice

The challenge with having single parity protection for this kind of architecture is that, while you’re rebuilding those multitudes of drives, if you have an Unrecoverable Read Error (URE) from the drives you’re reading from in order to rebuild, then the rebuild will fail. Your data will be corrupt, and you’ll have to recover from backups.

Such read errors happen more frequently than most realize; RAID plus scrubbing simply repairs them silently, so you never know about them. Unless you have run out of parity, in which case you will find out the very hard way.

Indeed, that’s the main reason stuff like RAID5 has gone the way of the dodo. It’s not so much the fear that many drives will fail simultaneously, but the far more likely possibility that, while rebuilding, a sector from the remaining “good” drives won’t be readable. Simple mirroring is also becoming obsolete.
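To see why single parity stopped being viable as drives grew, here is a rough sketch. The UBER of 1e-15 (a common nearline-HDD spec value) and the 7+1 RAID5 geometry are assumptions for illustration:

```python
import math

def p_rebuild_ure(surviving_drives: int, drive_tb: float,
                  uber: float = 1e-15) -> float:
    """Chance of at least one URE while reading all surviving drives
    to rebuild a failed one (independent bit errors assumed)."""
    bits = surviving_drives * drive_tb * 1e12 * 8
    return -math.expm1(bits * math.log1p(-uber))

# Rebuilding one failed drive in a 7+1 RAID5 set means reading 7 survivors:
for tb in (1, 4, 16):
    print(f"{tb:>2} TB drives: {p_rebuild_ure(7, tb):.1%} rebuild failure risk")
```

Under single parity, any one of those UREs fails the rebuild; a dual-parity (RAID6-style) scheme can repair the bad sector from the second parity and carry on. That scaling with drive size, not simultaneous whole-drive failures, is what killed RAID5.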

Oh, and SSDs don’t have fewer UREs than spinning drives. As they age (regardless of how much they’ve been written to), they may have more. The Google drive reliability study is a really interesting document that tackles this subject; take a look. This is a deeper issue than the UBER (Unrecoverable Bit Error Rate) you’ll see in drive specs, with many factors affecting the frequency of the various kinds of errors.

Advice for Customers

Fortunately, there’s a simple way to be safer. Follow these very straightforward rules and you’ll be fine. I will not tell you to avoid specific products or what to choose, just use common sense:

  • Never accept anything with less than “any two” simultaneous drive loss protection, no matter who the vendor is. This typically means at a minimum triple mirroring and/or RAID6 erasure coding.
  • Consider not using data reduction if you are planning to use the vendor that loses a whole disk group if a single drive doing data reduction fails. The risks and management complications are simply not worth it, especially if the space savings aren’t significant.
  • Never accept anything less than an N+1 configuration, meaning that if you lose a node or are doing maintenance, there’s at least one spare node that can absorb the workload. That’s not the same as losing a drive; I’m talking about a whole node. I’ve seen too many crazy N+0 configs, hence this rule. And in this age of Novel Coronavirus craziness, consider that even spare parts may take longer to arrive, which means N+2 may be safer.
  • Ensure plenty of space for rebuilds – I recommend always reserving a free capacity of the equivalent of the largest two nodes. Some vendors will say a fixed percentage like 25% or 30%, but always make sure the math translates into having at a minimum the equivalent free capacity of the largest two nodes.
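The last rule is easy to sanity-check. Here is a quick sketch (the node capacities are hypothetical) showing how a flat-percentage reservation can fall short of the “two largest nodes” rule on a mixed cluster:

```python
def reserve_needed_tb(node_capacities_tb: list[float]) -> float:
    """Free capacity to reserve: the sum of the two largest nodes."""
    return sum(sorted(node_capacities_tb, reverse=True)[:2])

nodes = [20.0, 20.0, 40.0, 40.0]   # hypothetical mixed cluster, TB per node
total = sum(nodes)                 # 120 TB raw
rule = reserve_needed_tb(nodes)    # 80 TB (the two 40 TB nodes)
flat = 0.25 * total                # 30 TB: a flat 25% is not enough here

print(f"total {total:.0f} TB, two-largest rule: {rule:.0f} TB, flat 25%: {flat:.0f} TB")
```

On a uniform cluster the fixed percentage may happen to work out; on a cluster with a couple of big nodes, as above, it quietly leaves you without enough room to rebuild after losing one of them.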

Don’t cut corners and don’t risk your business just to adopt a different consumption model! If you can’t follow the previous four rules, then maybe grid-type HCI just isn’t the best fit for you, and that’s perfectly fine. There’s more than one way to achieve a business outcome.

Just make sure you’re perfectly clear on what that outcome is. HCI, SDS and Scale-Out aren’t business outcomes, just like Cloud isn’t a business outcome. They’re all means to an end.

Use the tools at your disposal wisely, and don’t forget the basics.


3 Replies to “HCI Failure Modes and Maintenance”

  1. Seems like all these apply only to VMWare pseudo distributed architecture. The real distributed architecture from Nutanix wouldn’t have the problems explained here.

  2. Nutanix would have different issues. Yes, it is more distributed, but the default data protection scheme (RF2, somewhat similar to vSAN’s FTT=1) is still fundamentally a single-drive-resiliency model (unless the second drive you lose, or URE, happens to be in the same node as the first). I.e. node 1 has a copy of the data, and the 2nd copy is distributed across the drives of the remaining nodes. That yields fast recovery from a single drive failure, as the rebuild activity is widely distributed (remember XIV?), but until that rebuild completes there is no redundancy for the remaining copy (also remember XIV).

     Also, per Dimitris’ commentary, that remaining copy may have bit rot – that’s a very real probability that’s a function of the media plus how often the storage subsystem checksums the blocks on that media.

     Dimitris’ core point here and elsewhere, that customers should thoroughly inspect the failure modes and statistical availability and durability of the storage architecture they’re considering, is a good one. That architectural inspection and statistical, data-driven analysis is why Google, for the original Google File System that Nutanix and other HCI platforms cite as architectural inspiration, chose RF3, not RF2, for their data resiliency scheme, analyzing it to be approximately 3500x more resilient than RF2.
