So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by taking normal servers running Linux plus the XIV “sauce” and coupling them together over an Ethernet backbone. A few of the nodes get FC cards and become FC targets. Some of the other features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no dedicated global spares; spare space is instead reserved on each drive, which makes faster rebuilds possible)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers for the XIV (“it will destroy DMX/USP!” — sure). But let's take a look at what you actually get:

  • 180 drives maximum (and, effectively, minimum): you can get a half configuration, but as I understand it you still get all 180 drives and just license half of them, and you have to commit to buying the full capacity within a year (I might be mistaken on the details)
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or InfiniBand (Ethernet incurs much, much higher latency than either of those technologies)

The way IBM claims it can sustain high speed is by not letting the SATA drives get bound by their low transactional performance relative to 15K FC drives or, even worse, SSDs. From what I understand (IBM employees, feel free to chime in), XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes, the simple rule being that a chunk and its mirror cannot live on the same server/shelf of drives! (A rough sketch of this placement scheme follows the list.)
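
Here is a minimal sketch of that placement scheme, purely for illustration: the module and drive counts match the published configuration, but the placement function and its randomness are my assumptions, not IBM's actual algorithm.

```python
# Hypothetical sketch of the chunk distribution described above: split a write
# into 1 MB chunks and place each chunk plus its mirror on two *different*
# modules, chosen pseudo-randomly.  Not IBM's code, just an illustration.
import random

MODULES = 15              # data modules (Linux servers)
DRIVES_PER_MODULE = 12
CHUNK = 1024 * 1024       # 1 MB

def place_chunks(total_bytes, seed=0):
    rng = random.Random(seed)
    layout = []
    for chunk_id in range((total_bytes + CHUNK - 1) // CHUNK):
        primary_module = rng.randrange(MODULES)
        # the mirror is never allowed on the same server/shelf as the primary
        mirror_module = rng.choice([m for m in range(MODULES) if m != primary_module])
        primary = (primary_module, rng.randrange(DRIVES_PER_MODULE))
        mirror = (mirror_module, rng.randrange(DRIVES_PER_MODULE))
        layout.append((chunk_id, primary, mirror))
    return layout

# e.g. a 100 MB write becomes 100 chunk pairs spread over the 15 modules
for chunk_id, primary, mirror in place_chunks(100 * CHUNK)[:3]:
    print(chunk_id, "primary on (module, drive)", primary, "mirror on", mirror)
```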

Obviously, by turning as much of the workload as possible into large-block writes to the SATA drives and using the cache to great effect, one should be able to get the 180 SATA drives performing pretty much as fast as they can (ideally, the drives see streaming rather than random I/O). However (there's always that little word…)

  1. There is no magic!
  2. If random IOPS come in at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk eventually, I don't care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens, which I think is optimistic at 111 IOPS/drive (see the arithmetic sketch after this list). At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where's the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!
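
The arithmetic behind point 2 is simple enough to spell out. The 20,000 IOPS figure is the number quoted above; the per-drive rate for a 7,200 rpm SATA drive is a common rule of thumb, not a measured value.

```python
# Back-of-the-envelope check on the cache-saturation argument above.
claimed_system_iops = 20000
drives = 180
needed_per_drive = claimed_system_iops / drives      # ~111 IOPS per drive
typical_sata_random_iops = 75                        # rough rule of thumb for 7.2K SATA
print(f"needed per drive: {needed_per_drive:.0f} IOPS")
print(f"typical 7.2K SATA random IOPS: ~{typical_sata_random_iops}")
print(f"gap once the cache is saturated: {needed_per_drive / typical_sata_random_iops:.1f}x")
```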

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 can be configured with 960 drives. Even assuming that not all of them are usable because of spares and so on, you are left with a system with over 5 times the number of physical disks of an XIV, far more capacity, etc. Even if the “magic” of XIV makes its drives more efficient, are those XIV SATA drives really 5 times more efficient? Five times would only make it EQUAL to the 960; XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to actually beat it.
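
For what it's worth, the spindle-count framing boils down to this (drive counts only; it deliberately ignores drive type, cache and controller differences):

```python
# Rough spindle-count comparison behind the paragraph above.
xiv_drives = 180
cx4_960_drives = 960
ratio = cx4_960_drives / xiv_drives                  # ~5.3x more spindles
print(f"The CX4-960 has {ratio:.1f}x the spindles; each XIV SATA drive would have to")
print(f"deliver {ratio:.1f}x the work of a CX4 drive just to break even.")
```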

Let's put it this way:

If my system were as efficient as IBM claims, and I had IBM's money, I'd buy all the competitive arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can't find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood, there are other, far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester. I can hardly believe IBM bought the XIV IP without seeing some kind of roadmap; otherwise the purchase would be kinda stupid! If you are a beta tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from addresses that have never been written
  • XIV will be super-fast with 1-2 hosts pushing it; push it realistically, with the number of hosts you'll actually run
  • Load up the box: if it's not full enough you'll get an extremely skewed view of performance. Put dummy data in if you have to, but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of, IBM dropped off the box without giving even a ballpark figure. I think that's insane.

And last, but not least: I keep hearing and reading that the following is true, and I'd love for IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system, you will suffer a catastrophic failure (it logically makes sense looking at how the chunks get allocated, but I'd love to know how it plays out in real life; a rough back-of-the-envelope model follows). And before someone tells me that this never happens in real life: it has personally happened to me at least once (I lost 2 drives in rapid succession), and to many other people I know with serious real-world experience…
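
Here is that back-of-the-envelope model, assuming uniform pseudo-random chunk placement (an assumption on my part; IBM's actual layout may behave differently): a loaded 1TB drive holds on the order of 880,000 1MB chunks, and each chunk's mirror sits on one of the roughly 168 drives outside that drive's module, so the odds that a second failed drive in another module holds none of those mirrors are effectively zero.

```python
# Rough probability sketch for the two-drive failure scenario described above,
# under an assumed uniform pseudo-random placement model.
import math

chunks_on_drive = 880000   # ~880 GB of 1 MB chunks on a loaded 1 TB drive
peer_drives = 168          # drives outside the first failed drive's module
# chance that NONE of those chunks have their mirror on one specific peer drive
log10_p = chunks_on_drive * math.log1p(-1 / peer_drives) / math.log(10)
print(f"log10 of P(the two failed drives share no chunk): {log10_p:.0f}")
# astronomically negative, i.e. two loaded drives in different modules will
# essentially always share at least one chunk/mirror pair
```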

D

12 Replies to “So what exactly is IBM trying to do with the XIV?”

  1. Have you seen one? Have you really tested one? Do you really think companies like BOEING are naive? They just bought 10 XIVs. Remember, the guy who invented Symmetrix designed XIV. We have tested the XIV and were really surprised by the results. No tuning needed!!!!!

    It is statistically proven that double disk failures mostly happen in the same enclosure. XIV can handle 12 disk failures in one enclosure without failing!! Recovery from a single disk failure usually completes within 10 minutes, not hours like others. Statistically speaking, it is almost impossible that 2 disks will fail within that window.

  2. Frans, not saying anyone is naive, just trying to figure out exactly what the amazing value prop is.

    I’ve worked in companies far larger than Boeing and trust me, the size of the company has nothing to do with how naive or not the company is… I remember having lots of budget dollars left over and not knowing what to do with it, so I’d get new technology to try out from time to time.

    Maybe someone from Boeing will be willing to be a reference, share the numbers and also pull 2 drives from 2 different trays for us! 🙂

    Yes, Moshe Yanai was head of the very large team that developed Symmetrix (I wouldn't go as far as to say HE invented it). Then either he was fired or he quit years ago; I hear different stories and they're all colorful, like the fact that he talked back to Tucci, or that he had his own helicopter that took him from his house to EMC HQ and took up a huge part of the parking lot, or that he had a van full of gear his whole team would ride in so they could keep working even in transit to various places… unsure how much of it is true.

    However, Moshe had NOTHING to do with the current DMX, and the current DMX is totally removed from the original Symm architecturally! So, readers beware, it’s not like the “inventor” of XIV left EMC to build something better than the current EMC product.

    XIV as an architecture is not even revolutionary! If it is, someone please explain to me how that is so, compared to 3Par.

    And, HOW did you test XIV? With artificial benchmarks? Real apps? How many hosts? How full was it? Were your tests always living in cache? If so, is that how you’ll use the box in real life?

    Finally, on the subject of rebuilds: Yes, if a drive is not nearly full I can see it being rebuilt in a very short time since you’re not rebuilding the whole thing, just the chunks that were on it, so if (best-case) there was 1MB of data, rebuild would take a few seconds!

    However, think how fast a 1TB SATA drive can do writes. Then figure how long 1TB would take to be written to it and you have the worst-case XIV rebuild time.
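
    A quick worked version of that worst case, assuming (as the argument above does) that the full 1TB has to land on a single replacement drive, and assuming a sustained write rate typical of a 2008-era 1TB 7,200 rpm SATA drive:

    ```python
    # Worst-case rebuild time if 1 TB must be streamed onto ONE drive (the
    # assumption made above).  The write rate is an assumption, not a spec.
    drive_capacity_gb = 1000
    assumed_write_mb_s = 80            # assumed sustained sequential write rate
    hours = drive_capacity_gb * 1000 / assumed_write_mb_s / 3600
    print(f"~{hours:.1f} hours to stream a full 1 TB onto a single drive")   # ~3.5 h
    ```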

    Repeat after me: THERE IS NO MAGIC!!!!!!!!!!!

    No amount of magical cache algorithms can make 1TB of data get written to an individual drive faster than the drive's mechanical, interface and electronic limits allow!!!!! How is that not understood?

    And I disagree that 2 disks will statistically fail from the same tray. And I don't want to hear again how you can lose a whole tray with the XIV and be OK; of course that's true based on the algorithm, and it's NECESSARY for the XIV since a tray is NOT as robust as a normal array tray!!!!

    EMC has been doing that exact thing with the Centera for years now. The idea is NOT new!

    An XIV enclosure/tray/shelf is really a server, and that server can crash, its motherboard can fry, etc!

    An enclosure from, say, an EMC Clariion, can never be a single point of failure due to the way it’s designed. Short of violent acts against it, it will just work since it has 2 of everything. You can also architect the RAID groups so that the array will survive losing the enclosure if you want to.

    An individual server is NOT that robust, unless you’re talking about paired servers running instructions in lockstep, i.e. from Stratus. XIV is NOT built like that.

    Lastly, on the “no tuning needed” point: most modern arrays need little if any tuning, unless you want the array to deal with edge cases, in which case it is a BENEFIT to be able to tweak some things if need be! I absolutely do not buy that the algorithms are SO good that the array will adapt no matter what. There's always some obscure scenario the engineers haven't tested…

    D

  3. “However, think how fast a 1TB SATA drive can do writes. Then figure how long 1TB would take to be written to it and you have the worst-case XIV rebuild time.”

    Dimitris,
    The worst-case XIV rebuild time for 1TB SATA drives is 30 minutes. This is because we are not rebuilding a single drive as you suspect, but rather writing up to 1TB of chunks across 168 drives in parallel, less than 6GB per drive, into the spare space set aside for this purpose.
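
    (For what it's worth, the arithmetic behind that figure seems to be roughly the following; the per-drive write rate here is an assumption for illustration, not an IBM number.)

    ```python
    # Distributed rebuild arithmetic: the lost drive's chunks are re-created in
    # parallel across the drives outside its module.
    data_gb = 1000                      # worst case: a completely full 1 TB drive
    target_drives = 168                 # drives outside the failed drive's module
    per_drive_gb = data_gb / target_drives               # ~6 GB of chunks per drive
    assumed_write_mb_s = 30             # assumed rate while normal I/O continues
    minutes = per_drive_gb * 1000 / assumed_write_mb_s / 60
    print(f"~{per_drive_gb:.1f} GB per drive, ~{minutes:.1f} minutes of pure writes")
    ```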

    For more on the value proposition of XIV, see my post:
    http://www.ibm.com/developerworks/blogs/page/InsideSystemStorage?entry=when_factual_observations_mislead_and

    — Tony

  4. Just tossing in my 3PAR experience here. We recently purchased a T400 with 200 x 750GB SATA disks in it. It was pretty simple to configure and came standard with the ability to protect against an entire shelf/cage (40 disks) failing, with no data layout planning needed. So of course I tested this. I was writing 100MB/s via a half dozen clients to an NFS cluster from Exanet and pulled a 3PAR drive magazine (4 disks); of course, no impact. I plugged it back in a few minutes later and it re-synced, and I didn't have to touch the management UI. A short while later I powered off one of the shelves (32 drives in my case; the other 8 slots were not populated), no impact again. Powered it back on a few minutes later and it re-synced on its own.

    A few days later we had a disk failure on one of our older LSI Logic storage arrays (146GB 10K RPM disks); it took a good 4 hours to rebuild 146GB, which I calculated to be about 9 megabytes a second (RAID 5 12+1). To simulate a lone drive failure on the 3PAR I powered a disk down, and after a few minutes the system determined the drive was dead (it goes into a logging mode waiting to see if it self-recovers before marking it down) and rebuilt.

    The drive had 23GB of data written to it, and it took the system 4 minutes to rebuild, or approximately 64 megabytes per second. The system doesn't rebuild data that was never written to the disk, whereas traditional RAID systems typically seem to rebuild bit for bit regardless of whether the drive is full or nearly empty. The 3PAR system is running mostly RAID 5 (3+1) to try to maximize the probability of full stripe writes.

    Even if the drive had been full, I think it would have taken the array about 19.5 minutes to rebuild the 750GB (715GB actually) worth of data on the SATA disks.

    For our real-world testing of the system so far we've managed to sustain roughly 650 megabytes per second of random writes, so you can imagine that writing 64 megabytes per second won't even make the system blink as far as latency goes. Meanwhile the LSI system that took 4 hours had its disk write latency spike to well over 100ms throughout the rebuild process, according to the front-end NAS head unit from BlueArc.

    If I pulled several random drives out of the system from different shelves it is likely I would suffer data loss of some sort as well, depending on which drives I pulled, given that drives are pulled out in groups of 4. Though with the distributed RAID 5 (3+1) split into 256MB chunks, I'd have to pull out 2 drive magazines out of a set of 4, from a population of 50 drive magazines, to trigger data loss for a particular RAID group.

    Given the fast parallel rebuild times the system offers, the automatic protection against a full shelf failing, and the overall intelligence of the system as a whole, the likelihood of something like this actually occurring is so small it's not even on the radar. I'm more worried about someone accidentally deleting a LUN or something.

    The array defaults to reserving roughly 18GB per disk on the system as “spare”. This is adjustable, and if this space is exhausted (I think that takes more than 5 disks failing), the system will automatically use any available unallocated space on the remaining drives as “spare” in the interim until disks can be replaced.

  5. To Tony:

    So does it or does it not crap out when, in a full system, 2 disks that are in different modules are removed simultaneously?

    To Nate:

    Yes, the mere fact that 3Par has been able to do that kind of stuff for years is what makes me wonder about the value prop of XIV.

    The one thing I do like: that adding bigger/smaller drives to XIV is OK since it doesn’t care about drive size.

    I think the limiting factor is the Ethernet interconnect; the same architecture could scale way better with InfiniBand, 10GbE, or proprietary cluster-type interconnects (which add a lot to the price).

    D

  6. While one must admit the chunklet striping without the overhead of RAID is a clever scheme, I agree that it is certainly not a new approach.

    Frans, so I heard that Boeing actually purchased nine boxes and had a tenth one thrown in for free for testing (speaking of giveaways).

    BTW, has anyone else observed that all capacity information about this frame is presented in decimal form only? The 79 TBs of usable capacity (in decimal) actually represents only 73.5 TBs (7% less) of binary capacity that's usable by the supported OSes.

    In Boeing's case those nine frames would provide 662 TBs of usable (active and snapshot) OS capacity. From what I understand, that would fall really short of what they stated in their RFP. Speaking of falling short, if this platform is only rated at 20K IOPS then that too fell short, FWIU.

    Quite frankly, I have to think that Boeing would be sensitive to the possibility of leakage of classified data from their defense side. Since this is one of the most virtualized schemes out there (any data on any drive, with continuous redistribution?), I am surprised that Boeing would use these, since a leakage, by DOD standards, would require formatting and wiping of all drives that the data might have ever been on. Does the capability to sanitize to DOD standards even exist on the XIV?

    Maybe Boeing was in love with the GUI and a giveaway pricing strategy? Given the economy, and the state of their 787 program, I can’t blame them for taking the lowest bidder, but how long do they think IBM can afford to give these away before they close up that shop?

    Anyway, keeping complicated calculations to a minimum, the Redbook states that 4% (40 GB) of each drive is given up to system overhead functions like metadata, traces, distribution and partition tables. Additionally, another 8% (80 GB) of each drive is reserved for the sparing area.

    Per 1 TB drive: 4% = 40 GB overhead, 8% = 80 GB sparing, leaving 880 GB for data.
    ((1000 - (1000 * 0.04) - (1000 * 0.08)) * 180) / 2 = 79,200 GB usable (decimal)

    This is right out of the Redbook (2.3.2), so I don't know where the 6 GB cited above for sparing came from. The net of the math is that there are 880 GBs of available decimal capacity per drive for housing the primary, mirrored and snapshot chunklets; that is the maximum burden when recovering from a lost drive. On the other hand, if a relatively full data module fails, then we are talking about 10.56 TBs of data that must be read from, and written back to, the remaining 168 drives to recover all the primary, mirror and snapshot chunklets that resided in the failed data module. That calcs out to 62.8 GBs read from each drive and 62.8 GBs written to each drive.
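
    Putting the Redbook-derived numbers in one place (decimal units throughout, matching the Redbook):

    ```python
    # Capacity and module-failure arithmetic from the paragraphs above.
    raw_per_drive_gb = 1000
    overhead_gb = raw_per_drive_gb * 0.04          # metadata, traces, tables: 40 GB
    spare_gb = raw_per_drive_gb * 0.08             # sparing area: 80 GB
    usable_per_drive_gb = raw_per_drive_gb - overhead_gb - spare_gb   # 880 GB
    drives = 180
    usable_system_gb = usable_per_drive_gb * drives / 2               # mirrored
    print(f"usable system capacity (decimal): {usable_system_gb:,.0f} GB")  # 79,200

    # Loss of a full data module: 12 drives' worth of chunks re-created from,
    # and onto, the remaining 168 drives.
    module_data_gb = usable_per_drive_gb * 12                          # 10,560 GB
    per_surviving_drive_gb = module_data_gb / 168                      # ~62.9 GB
    print(f"module failure: ~{per_surviving_drive_gb:.1f} GB read and written per surviving drive")
    ```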

    So out of all of that, here is the question: if it can take up to 30 minutes to rebuild after a single disk failure, would it take six hours to rebuild after a data module failure, or is it IBM's assertion that a fully loaded system can rebuild 10.56 TBs in that same half hour?

    BTW Frans, what do you define as an enclosure? Do you mean a tray of disks that shares the same backplane, or anything up to a cabinet (as EMC would define a Symm cabinet)? I'd really like to see a reputable citation stating that most simultaneous double disk failures occur in the same ‘backplane’. Can you provide one?

    In closing, while I agree that the XIV may do a good job of balancing the performance of the OS tasks, does it not treat all of the primary data chunklets the same? If so, then I submit that all of the data going into and out of the front end is treated the same. I guess those who want to use this box have no concern about QoS, and despite having a room full of these things, you are still stuck with what, 20K IOPS across the install base?

  7. I really don't care how fast it does a rebuild of a drive. What I care about is the probability that, during a rebuild, I can actually read all the needed sections of all the other drives. I can honestly say that with the thousands of drives I've got, I haven't seen two simultaneously stop rotating, but I have seen a single drive fail and then, during the rebuild, a portion of what was thought to be a good drive fail when the data was requested from it. Even if it took 10 seconds to rebuild the drive you would still be dead, because there is bad data already sitting there.

    ATA drives tend to have a much higher URE rate than other drives, and I've seen it in full effect. Since the XIV only does mirroring, the probability is definitely reduced compared to RAID-5, but it's not zero, and it is much higher than with RAID-6, etc. There have been multiple studies on the probability of going to a disk that is thought to be good, asking for data, and getting something wrong back (or an outright failure); for ATA drives the studies generally don't have good things to say. Having run into these latent errors before on ATA drives (from two different manufacturers), I can say that one *must* consider this if you care about your data; bigger disks and more of them only amplify the issue.
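
    To put a rough number on the URE concern, using the commonly quoted spec of one unrecoverable read error per 10^14 bits for desktop-class SATA drives (an assumed spec; real drives vary), and assuming roughly a full drive's worth of surviving chunk copies has to be read back:

    ```python
    # Rough URE math for re-reading ~880 GB of surviving chunk copies.
    import math

    bits_read = 880e9 * 8        # ~880 GB read back during the rebuild
    ure_per_bit = 1e-14          # commonly quoted desktop SATA spec (assumed)
    expected_errors = bits_read * ure_per_bit                          # ~0.07
    p_at_least_one = 1 - math.exp(bits_read * math.log1p(-ure_per_bit))
    print(f"expected UREs during the rebuild read: {expected_errors:.2f}")
    print(f"chance of hitting at least one: {p_at_least_one:.1%}")     # ~7%
    ```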

    What XIV needs to do is support some form of RAID-6 or 3-way mirroring for ATA drives before I'll come near it with critical data. A fast rebuild time doesn't mean anything if you are going to rebuild something bad.

  8. I'm confused: 12 out of the 180 drives can fail? Does that mean that if 2 drives in any one data module fail there's a system failure? Is that what people here are saying about the double disk failure?

    Also, what about configurability? Yeah, it's great that it does everything without my input, but that also means I have no control over how it utilizes storage or which apps are truly tier 1 vs. tier 2.

    Then there's the question of expansion: they are saying 24 FC ports and 6 iSCSI. What if I need additional iSCSI ports for my shop? Can I just add a card?

    Lastly… omg, 79TB usable out of 180TB raw. That's pretty bad utilization even if it is fast, and unless it's cheap there are a lot of people who are going to run to other SATA-based arrays.

  9. Gary,

    The XIV is made up of 15 data modules (a Xyratex box with an Intel motherboard/processors in it and 12 SATA drives). A data module is not simply a shelf of drives, but a Linux PC, with all the inherent issues that entails. If anything in that arrangement goes south, those 12 data drives, and all the data they contain, become unavailable to the rest of the system.

    These systems come to you delivered, set up and HARDWIRED in a fixed configuration. I have seen nothing to indicate that there is any option to add more FC cards or iSCSI ports to the currently shipping XIV, so no, they can't be expanded today.

    Lastly, 180 TBs (decimal) of RAW storage, as you phrased it, actually delivers only 73.6 TBs of usable (binary) storage to the OSes that use it. The 79.2 TB capacity quoted by IBM is decimal, so it is NOT the capacity that any system connected to the XIV would ever realize.

    As an aside, I have noticed that some of IBM's public-facing materials represent the 79.2 TBs as NET capacity, while other materials call it usable capacity. For most of us, RAW and NET capacity usually represent the physical (decimal) capacity, and USABLE is typically used to describe the logical (binary) capacity, so IBM's inconsistent use of these terms adds some confusion.

    For example, NET = 79 TBs is used here:

    ftp://ftp.software.ibm.com/common/ssi/pm/sp/n/tsd03055usen/TSD03055USEN.PDF

    But here, the materials call the 79 TBs usable capacity:
    http://www.redbooks.ibm.com/redpieces/pdfs/sg247659.pdf (Page 21/22).

    Sometimes I think it would be great if everyone would just start all capacity conversations not with the decimal measurement of capacity, but with the binary measurement of the same.

    One other thing. As noted earlier, this frame has 15 PC-type Linux servers in it. I scanned the referenced Redbook and found no reference to any process for upgrading the OS on them. The GUI client has an upgrade feature, but this appears to be for the GUI itself, not the OS on the individual servers. I wonder how this is to be done?

    As Always, Caveat Emptor.

  10. Well, if I had done a little more research, I would have found the answer to the OS upgrade question. This was actually snipped from here: http://www.drunkendata.com/?p=1800

    Microcode upgrades to Data Modules can be accomplished by a full outage (all at once), or, a slow rolling process where data is migrated off a DM, the DM is upgraded/restarted, then data is redistributed back.

    I presume this is actually the case given that Tony Pearson from IBM posted a comment to this same article and did not dispute the upgrade process.

    All I can say about this is wow… this is certainly not a non-disruptive upgrade process.

  11. We no longer worry about double-drive failures at that big airplane company… the double power supply failure we had is killing us in St. Louis, and I still don't believe IBM is set up to support a global company when it comes to XIV… another XIV down in Bellevue for 7 hours without a callback from a customer service guy!
