The true XIV fail condition finally revealed (?)

I just got this information:

For XIV to be in jeopardy you need to lose 1 drive from one of the host-facing ingest nodes AND 1 drive from the normal data nodes within a few minutes (so there’s no time to rebuild) while writing to the thing.

Have no way of confirming this but it did come from a reliable source.

A customer recently tried pulling random drives and the XIV didn’t shut down and kept working fine, but the drives they pulled were all from the data nodes.

Why can’t anyone post something concrete here? I’m sure IBM won’t post since the confusion serves them well.

For what it’s worth, the customer is really happy with the simplicity of the XIV GUI.

D

5 Replies to “The true XIV fail condition finally revealed (?)”

  1. “1 drive from one of the host-facing ingest nodes AND 1 drive from the normal data nodes within 30 minutes”

    That sounds reasonable; the “randomizing” function that duplicates the 1MB chunk may very well not use the host node for that purpose. But a bit unbalanced; the ratio is (6 host+9 data), so at some point there are going to be copies on two data nodes, unless it’s got even worse usable capacity than I thought.

    So, a failure on an XIV with (6+9) means death if 1 drive fails and then a second drive, any 1 of the 9*12=108 data drives, fails within a 30-minute period.

    Firstly, the second drive doesn’t need to fail outright; it just needs to spasm, or declare the block missing due to a lost write, or hit any one of the hundred or so possible data-loss issues on SATA.

    Second, drive bit error rates aren’t quoted by time; they’re quoted as uncorrectable errors per bit read, and minutes don’t get a mention. Around 1 error per 10^14 to 10^15 bits read for SATA, iirc. Not a lot of TB, that. And that’s for just the disk; there’s a lot of other gubbins between disk and host that can push the effective rate higher.

    I reckon this is as bad as a RAID-5 with 1 parity for every 108 data disks (rough numbers sketched below). Thoughts?
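
    For scale, a quick back-of-the-envelope in Python (the 1-in-10^14 spec and the ~1 TB read while re-protecting a failed drive are assumptions for illustration, not XIV figures):

        # Back-of-the-envelope on the URE spec quoted above.
        # Assumptions (not vendor figures): 1 unrecoverable error per 1e14 bits read,
        # and roughly 1 TB read from surviving copies while re-protecting a failed drive.

        URE_RATE = 1e-14                       # errors per bit read (typical nearline SATA spec)

        spec_in_tb = 1 / URE_RATE / 8 / 1e12   # bits between errors, expressed in TB
        print(f"1 error per ~{spec_in_tb:.1f} TB read")                # ~12.5 TB: not a lot of TB

        rebuild_bits = 1e12 * 8                # ~1 TB read during re-protection (assumed)
        p_hit = 1 - (1 - URE_RATE) ** rebuild_bits
        print(f"P(at least one URE during that read) ~= {p_hit:.1%}")  # roughly 8%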

  2. Posting to a really old thread, but I just got a presentation from them.

    The sales guy was fairly obviously ready for the whole 2x drive failure thing (rolled his eyes and went into canned-statement mode). His pitch was that since they scrub every 2x days and use SMART monitoring within the drive, that was more than sufficient to proactively protect against having a URE. I was fairly dumbstruck by that, and he was very adamant that it’s great protection and that they’ve never had a customer lose a single byte of data. He then went on to imply that if you lost data during a rebuild (we had that happen a number of years ago under raid4, which is why everything is on raid-6 now), it was because the array didn’t use SMART. I didn’t want to be argumentative, so I just shook my head and let him rattle on with the rest of his presentation (he’s a sales guy, not a tech, and it was fairly obvious he wasn’t going to budge from “it’s awesome, you will never ever have a problem, SMART will fix the world” no matter what the stats are or the fact that the two are not related).

    A URE would probably only affect a single sector of the datastore during a rebuild, but that minuscule spot may hold a million-dollar bank transaction.

  3. Dimitris,
    You are mostly correct. For every 1MB chunk, XIV selects two drives from two different modules. The algorithm is complicated, but basically XIV will not select two drives that are both on host-facing interface modules, so either the two drives are on two separate data modules, or one is on an interface module and the other is on a data module. Since nearly all double-drive failures in the industry happen in the same drawer, having this separation has greatly reduced the likelihood of data loss from a multi-component failure.

    Tony Pearson (IBM)
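
    For concreteness, a minimal sketch in Python of the pairing constraint described above. The module and drive counts are the ones quoted in this thread, and the random selection is only a stand-in, not IBM’s actual placement algorithm:

        import random

        # Sketch of the constraint only: each 1MB chunk gets two copies on drives in
        # two different modules, and never on two interface (host-facing) modules.
        TOTAL_MODULES = 15                  # modules 0-5 interface, 6-14 data (per this thread)
        INTERFACE_MODULES = range(0, 6)
        DRIVES_PER_MODULE = 12

        def place_chunk():
            """Return (module, drive) slots for the two copies of a 1MB chunk."""
            while True:
                m1, m2 = random.sample(range(TOTAL_MODULES), 2)   # two different modules
                if m1 in INTERFACE_MODULES and m2 in INTERFACE_MODULES:
                    continue                                      # both interface: not allowed
                d1 = random.randrange(DRIVES_PER_MODULE)
                d2 = random.randrange(DRIVES_PER_MODULE)
                return (m1, d1), (m2, d2)

        print(place_chunk())   # e.g. ((3, 7), (11, 2)): at most one interface module

    This only illustrates the pairing rule; as noted above, the real selection logic is more complicated.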

  4. Thanks for confirming, Tony.

    We can all agree that a 2-drive failure is highly improbable. It’s only happened to me a couple of times, and I’m sure to just a few others.

    I think everyone’s problem is that it’s actually POSSIBLE for a specific 2-drive failure to completely bring down the XIV. Mathematically, provably POSSIBLE. Which is why high rebuild speeds are of the utmost importance for XIV; a rough count of how many 2-drive combinations qualify is at the end of this comment.

    The statistic that double-disk failures tend to come from the same drawer is even more relevant for XIV, since a drawer is just a Linux server, which can go down wholesale.

    On legacy arrays, RAID groups are typically within the same shelf, so again your assertion is right.

    With more modern systems though, RAID groups can and will be spread out among many disk shelves.

    For instance, on NetApp, with enough shelves you might automatically end up with only 1-2 disks of each RAID group in any given shelf.

    The likelihood of 2 disks from the same RAID group failing in the same shelf is then not nearly as high.

    Oh, and the NetApp (and EMC and others’) shelves are not servers; no single component failure can bring them down.

    And, of course, RAID-DP protects, demonstrably, so that ANY 2 disks can fail and the system won’t skip a beat.

    It’s all about risk-reward.

    D
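
    As promised above, a rough count of how many 2-drive combinations can take out both copies of some chunk, using the 15-module, 12-drives-per-module layout discussed in this thread and assuming data is spread widely enough that every allowed module pair shares some chunks:

        from math import comb

        # Count drive pairs that could hold both copies of the same 1MB chunk.
        drives_per_module = 12
        interface_modules = 6
        data_modules = 9
        total_drives = (interface_modules + data_modules) * drives_per_module          # 180

        all_pairs = comb(total_drives, 2)                                              # 16110

        # Pairs that can never share a chunk: both in the same module, or both on
        # interface modules (per the placement rule above).
        same_module = (interface_modules + data_modules) * comb(drives_per_module, 2)  # 990
        both_interface = comb(interface_modules, 2) * drives_per_module ** 2           # 2160

        fatal = all_pairs - same_module - both_interface                               # 12960
        print(f"{fatal} of {all_pairs} possible 2-drive failures ({fatal / all_pairs:.0%}) "
              f"can hit both copies of some chunk")

    On those assumptions, roughly four in five 2-drive combinations are exposed, which is exactly why the length of the rebuild window matters so much.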

  5. Has anyone with XIV system(s) seen a high number of drive failures… say, one per week? Or do you not get notified (automatically) when a drive fails?
