Modern RAID Must Protect Against Multiple Temporally Correlated Errors

Modern data protection needs to adapt to modern media, and RAID is no exception. In this article I will explain why storage consumers need to ask for certain kinds of protection and not settle for less.

To summarize, don’t bother with storage that can’t provide at least dual parity protection for any given piece of data (whether that’s an array, HCI or the cloud, it doesn’t matter).

Why? Two big reasons:

  1. Because media these days is both larger and fails in different ways than in the past, which means Temporally Correlated Errors are far more likely to happen, so you need protection against them. It’s not doom-mongering. It’s based on data.
  2. In the olden days, arrays had small RAID groups that each held a handful of volumes. If something was damaged in a RAID group, at most you’d just lose that handful of volumes. Modern arrays use pools of space, typically made up of multiple RAID groups. This means that you can potentially damage all volumes in an array merely by losing data integrity in a single RAID group in the pool. I’m sure you aren’t exactly looking forward to experiencing that.

I will take you step by step through this, as is my idiom. It is, though, rather sad that I have to write this kind of thing in 2020…

How Modern Drives Fail and Why This is Important

SSDs tend to generate more Unrecoverable Read Errors (UREs) as they age. Interestingly, this is the case regardless of underlying NAND technology (SLC, eMLC, 3D MLC – doesn’t really make a material difference).

This age-related increase in error rate is independent of errors due to increased wear level (meaning that even a drive that is only ever read, and thus barely wears, will still see more read errors over time).

There is a very interesting study (covering a huge number of drives) that is good reading if you want to learn more about this. A key finding was that at least 20% of SSDs develop uncorrectable errors during a 4-year period. FYI, that is much higher than the rate for HDDs.

Why is this important though?

Simply because, over time, it greatly increases the number of UREs RAID has to deal with. And RAID doesn’t like UREs while rebuilding failed disks, or multiple kinds of errors happening at the same time in general.

And why is that important?

People like keeping their assets longer. This means that systems need to be able to deal with greatly increased read errors on aging media, even if that media’s wear level indicator is very low.

So, over time you may be faced with a lot more UREs than weak RAID can handle. If those UREs coincide with a drive rebuild, or with other errors during normal operation (a Temporally Correlated Error), the result is an inability to finish the disk rebuild and/or data corruption and/or a completely downed system.

Also referred to as “Not a good day in the datacenter”.

What Different RAID Types Can Simultaneously Protect Against

Let’s compare some RAID levels to see how they fare when faced with temporally correlated errors:

  • RAID5: Within the RAID protection domain (e.g. 8+1) – a single URE if all drives are intact, zero UREs while rebuilding a lost drive
  • RAID10: Same as R5, but the domain is limited to 2 drives at a time, which reduces the chances of error vs R5
  • RAID6: Two UREs if all drives intact, one URE while rebuilding a lost drive
  • HPE Nimble Storage Triple+ RAID: N UREs, where N is the number of drives in the system, even if any three drives have been lost. It is rather extreme 🙂

It follows, then, that merely to protect against two UREs happening simultaneously, or a single URE happening while a lost drive is being rebuilt, the absolute minimum degree of RAID protection required is RAID6.
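
If it helps to see the rule behind those bullets written down, here is a minimal sketch in Python. The function and names are mine, purely for illustration, not any vendor’s implementation: each lost drive in a protection domain consumes one level of parity, and whatever parity is left over is all that remains to absorb UREs.

    # Illustrative sketch only: how many additional UREs a protection domain
    # can absorb, given its parity count and how many of its drives are lost.
    def ures_tolerated(parity_count: int, failed_drives: int) -> int:
        """Each lost drive consumes one level of parity; the leftover parity
        is what remains to absorb UREs on the surviving drives."""
        return max(parity_count - failed_drives, 0)

    # RAID5 (single parity): one URE with all drives intact, zero during a rebuild.
    assert ures_tolerated(parity_count=1, failed_drives=0) == 1
    assert ures_tolerated(parity_count=1, failed_drives=1) == 0

    # RAID6 (dual parity): two UREs intact, one while rebuilding a lost drive.
    assert ures_tolerated(parity_count=2, failed_drives=0) == 2
    assert ures_tolerated(parity_count=2, failed_drives=1) == 1

    # Note: schemes like Triple+ add protection beyond plain triple parity,
    # which is how they keep absorbing UREs even with three drives down;
    # that extra layer isn't captured by this simple model.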

Guess what the most common reason RAID loses data is, by far? Suffering a URE while rebuilding a failed drive.

Why Sub-Disk RAID Alone Isn’t The Solution

There’s a class of RAID that doesn’t do RAID across whole disks, but rather across smaller data segments. Such implementations have several useful attributes. Different vendors call it different things, but ultimately the basic protection rules are the same as for all RAID:

  • You need to spread the protection across multiple physical devices
  • You need enough parity to protect against the desired number of temporally correlated errors
  • You need space to rebuild

See the following figure that shows a sub-disk RAID5 implementation. The different colors denote objects that are part of the same protection segment (a logical RAID5 6+1).

In this example there are 7 physical disks, and 7 segments, with one of the segments being parity (denoted by P in front of the segment number).

The top row shows the layout without a failure. The bottom row shows what happens during a disk failure – notice how the segments that were on the first disk have now been rebuilt in the free space of the other disks:
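
To make the mechanics concrete in code as well, here is a tiny toy model of that layout. Everything in it is my own illustration (disk counts, spare-capacity numbers, data structures), not any vendor’s actual code: 7 physical disks, 7 logical stripes with one object per disk, and a little free space per disk for the rebuild to use.

    # Toy model of the sub-disk RAID5 (6+1) layout in the figure: 7 physical
    # disks, 7 logical stripes with one object per disk, plus assumed spare
    # capacity per disk for rebuilds. Purely illustrative.
    NUM_DISKS = 7

    # disks[d] holds (stripe, position) pairs: "the object of this stripe that
    # originally lived at this position" (one position per stripe is parity).
    disks = {d: [(stripe, d) for stripe in range(NUM_DISKS)] for d in range(NUM_DISKS)}
    free_slots = {d: 2 for d in range(NUM_DISKS)}  # assumed spare capacity per disk

    def rebuild(failed_disk: int) -> None:
        """Reconstruct every object from the failed disk (using the surviving
        objects of each stripe) and place it into free space on the survivors."""
        survivors = [d for d in disks if d != failed_disk]
        for i, (stripe, position) in enumerate(disks.pop(failed_disk)):
            target = survivors[i % len(survivors)]  # spread rebuilt objects around
            disks[target].append((stripe, position))
            free_slots[target] -= 1

    rebuild(failed_disk=0)

    # Every stripe still has all 7 of its objects (6 data + 1 parity) available,
    # now spread across the 6 surviving disks, matching the bottom row of the figure.
    for stripe in range(NUM_DISKS):
        assert sum(1 for d in disks for s, _ in disks[d] if s == stripe) == NUM_DISKS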

What would happen if you had a URE while rebuilding a sub-disk RAID5 you may ask? Well, pretty much the same thing that happens when having a URE while rebuilding any RAID5 implementation 🙂

Check out this next illustration. Same as before, the first disk dies, only now while rebuilding, there is a URE affecting the light blue (teal?) #10.

This results in the entire light blue segment becoming invalid, with absolutely no hope of repair. So now we have lost objects 4, 10, 16, 22, 28, 34 and 40.

Since array volumes are distributed and take space from potentially all objects, the entire array may now have to be wiped and rebuilt from a backup.
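
The arithmetic behind that outcome is trivial. The helper below is just my shorthand (not any vendor’s API) for the standard RAID rule: a logical stripe can only be reconstructed if the objects it cannot read don’t outnumber its parity objects.

    # Assumed shorthand for the standard RAID rule: a logical stripe is
    # reconstructable only if its unreadable objects don't exceed its parity count.
    def stripe_recoverable(parity_count: int, unreadable_objects: int) -> bool:
        return unreadable_objects <= parity_count

    # Sub-disk RAID5 (single parity): one object gone with the failed disk,
    # plus one more hit by a URE during the rebuild = two unreadable objects.
    print(stripe_recoverable(parity_count=1, unreadable_objects=2))  # False: stripe lost
    # The same double fault with dual parity survives.
    print(stripe_recoverable(parity_count=2, unreadable_objects=2))  # True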

All of this nonsense could be avoided if there was just one more parity object for the light blue segment… 🙂

“But Vendor XYZ Claims Their RAID5 Rebuilds Very Quickly – Does That Mean They Don’t Need RAID6 or Better?”

That’s like saying “this computer crashes all the time and may lose all your data, but it reboots very quickly!”

Quick rebuild speeds only do one thing: They help lower the time during which there is reduced parity protection. Which is useful, but doesn’t protect against read errors while rebuilding.

How about a simple reductio ad absurdum example to illustrate why quick rebuild speeds don’t help with Temporally Correlated Errors:

Imagine you have 15TB drives protected with RAID5. The drives are full. You lose a 15TB drive. You’re using some hypothetical RAID5 that is so awesome that it can rebuild instantly.

What this means is that you STILL have to rebuild 15TB worth of data (even if it happens instantly).

And what did we learn so far? Simply that if you only have single-parity RAID and you encounter even a single URE during the huge read process that a 15TB rebuild requires, your rebuild is impossible.

That is why I roll my eyes every time I see a vendor claiming quick rebuilds and proposing RAID5 as a safe solution! The chances of a single URE happening as you’re reading 15TB worth of data from a bunch of SSDs aren’t exactly small.
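
Just to put a rough number on that, here’s a back-of-the-envelope sketch. The URE rate is an assumption picked purely for illustration (one bit per 10^16 bits read is a common spec-sheet figure; real drives vary, and aging media can be worse), and note that a real RAID5 rebuild has to read from every surviving drive in the group, so the actual exposure is even larger than a single 15TB read.

    import math

    # Assumed, for illustration only: probability of a URE per bit read.
    URE_RATE = 1e-16
    BITS_PER_TB = 8e12  # 1 TB = 10^12 bytes = 8 x 10^12 bits

    def p_at_least_one_ure(tb_read: float, ure_rate: float = URE_RATE) -> float:
        """Probability of hitting at least one URE while reading tb_read terabytes,
        using a Poisson approximation to stay numerically sane."""
        expected_ures = tb_read * BITS_PER_TB * ure_rate
        return -math.expm1(-expected_ures)  # 1 - e^(-expected)

    print(f"{p_at_least_one_ure(15):.1%}")         # roughly 1.2% per rebuild at a 1e-16 rate
    print(f"{p_at_least_one_ure(15, 1e-15):.1%}")  # roughly 11% if the rate is 1e-15

Not a huge number per rebuild, perhaps, but far from negligible, and it only grows with bigger drives, higher error rates on aging media, and more rebuilds over a system’s life.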

Once again: quick rebuilds in single-parity RAID systems do absolutely nothing to protect you against Temporally Correlated UREs.

Quick rebuilds only help protect against losing a second drive before the first one has been rebuilt. Which raises the question – why bother with single parity RAID if it’s so sensitive to all these things?

I trust you have been sufficiently warned.

“But They Say Their RAID Is Separated Into RAID Groups or Availability Zones and I Could Lose a Drive in Each Group/Zone!”

Don’t Fall Into The Trap of “You Can Lose a Drive in Each RAID Group”!

That’s the language single-parity slingers use. Yes, with RAID5 you could lose one drive in each RAID group, and if you have many small groups, it looks like a lot of drives can be lost while still having a functioning system.

For example, let’s say you have 4 RAID5 groups in a pool. You could lose a single drive per group, so that whole pool could lose 4 drives as long as only one drive is lost per RAID group.

While rebuilding those 4 drives, you have multiplied your risk by 4, since 4 is the number of RAID groups rebuilding with a lost drive 🙂

This now means that you have to worry about Temporally Correlated Errors times 4.
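
To see how that compounds, assume (numbers purely for illustration) that each degraded RAID5 group has some probability of hitting a fatal URE during its own rebuild, and that the groups fail independently:

    # Assumed numbers for illustration only.
    def combined_rebuild_risk(per_group_risk: float, degraded_groups: int) -> float:
        """Probability that at least one of the degraded groups fails its rebuild,
        assuming each group's rebuild fails independently with per_group_risk."""
        return 1 - (1 - per_group_risk) ** degraded_groups

    print(f"{combined_rebuild_risk(0.05, 1):.1%}")  # 5.0%  (one degraded group)
    print(f"{combined_rebuild_risk(0.05, 4):.1%}")  # 18.5% (a lost drive in each of 4 groups)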

So, the desired outcome is protecting against multiple Temporally Correlated Errors within a single group. Multiple groups don’t help with this situation.

“But Some Vendor Said Sync Replication Makes R5 Safe!”

FYI, that vendor isn’t running a philanthropy; they want to quickly sell something cheaper than anyone else and move on to the next sale.

In essence, what this does is make DR scenarios more likely, which nobody in their right mind should be OK with.

Imagine this example: You have two sites, A and B, sync replicating, each with a single storage system filled with important data, running RAID5. If you suffer catastrophic pool failure on site A due to Temporally Correlated Errors, here’s what would need to happen:

  1. This is now a true DR situation. You must fail over your business to site B. Now you’re running on a single site with all the R5 risks.
  2. Fix the storage in site A.
  3. Replicate everything back to site A from site B, which may take a very long time, during which you are exposed.

All this could take days.

Are you sure you have absolutely nothing better to do with your time? Are you certain you are OK with all this extra risk?

Even when sync mirroring, your goal should be to make each site very resilient so you do not have to fail over. Failing over is the last resort, not a desired outcome.

Lessons Learned

As a consumer, the easiest thing you can do is to simply not accept implementations that don’t protect against a minimum of two Temporally Correlated Errors within any given system. That’s it.

Doesn’t matter if it’s a standard array, or HCI, or some sort of SDS, or in the cloud.

Always ask how many simultaneous read errors a given piece of data is protected against. The answer should be a minimum of 2. That will also shield you from silly tricks like “I can lose a drive from each RAID group, that’s several drives in total!”

It’s time to stop the single parity madness; all the science points to the fact that it’s simply not safe with today’s storage devices.

This is the reason all major vendors have completely given up on single parity RAID on their enterprise storage systems. With the notable and utterly bizarre exception of Dell EMC, who somehow seem to think math doesn’t apply to them, and that data integrity can be readily sacrificed on the altar of lower costs. Or perhaps their performance really suffers with RAID6.

And no, replicating to a second site isn’t the answer that allows sub-standard protection. You should be doing that anyway, and safeguarding each site as much as possible.

It’s your data. Keep it safe if you think it’s important. The only exception may be if you are considering a system with just a handful of drives, and they’re small drives, and you’re incredibly cost-sensitive.

“The bitterness of poor quality remains long after the sweetness of low price is forgotten”.

Possibly Benjamin Franklin. Whoever said it had a good point… 🙂

D


2 Replies to “Modern RAID Must Protect Against Multiple Temporally Correlated Errors”

    1. T10-PI doesn’t fix Temporally Correlated Errors. If you lose a drive, and have single parity RAID, and are trying to rebuild and some sector simply can’t be read properly, you WILL lose data regardless of what checksums you have.
