NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous blog post, I decided to post an actual graph showing a NetApp system's I/O latency while under load and during a disk rebuild. It was a bakeoff vs. another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats, using no more than 84 drives, so we're not talking about some gigantic lab queen system (I love Marc Farley's term). The box was set up per best practices, with each aggregate sized at 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

[Graph: read and write latency over time under load; point 1 marks the disk pull, point 2 the rebuild completion. Volume names redacted.]

With a 53:47 read/write mix of 8K-size IOPS running, a single disk was pulled. A pretty realistic failure scenario: a disk breaks while the system is under production-level load. Plenty of writes, too, at almost 50%.

OK… the fuzzy line around 6ms is the read latency. At point 1 a disk was pulled, and at point 2 the rebuild completed. Read latency increased to 8ms during the rebuild but dropped back down to 5ms after the rebuild completed. The line at less than 1ms response time, straight across the bottom, is the write latency. Yes, it's that good.

So – there was a tiny bit of performance degradation for the reads, but I wouldn't say it “killed” performance, as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph 🙂 ), but then again the box used faster 15K drives (and smaller ones, 300GB vs. 500GB), so before anyone complains: it's not apples-to-apples with the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D

28 Replies to “NetApp disk rebuild impact on performance (or lack thereof)”

  1. There was also an Enterprise Vault benchmark – unsolicited and not paid for by NetApp – that showed a strikingly similar result. Ingest performance did not waver; it was statistically consistent. And I say “statistically consistent” to customers because the statistics actually show write performance improving ever so slightly during disk failure and rebuild. I don't know why; it sounds unreal, and, God forbid, I don't want customers going around yanking drives in an effort to improve write performance. Still, interesting results for customers who run Exchange and Enterprise Vault – and there are a few!

  2. Let’s have some fun…

    Hmmm, awfully quiet on the comments here. Where are the competitors with their hypothetical bashing scenarios? Oh yeah, I remember now.

    Disputing NetApp value in real-world customer scenarios is …. what’s the word …. oh yeah! Hard. ;^D

  3. < 1ms response times on writes? That's just sick! From what I recall (thanks, John), this system used PAM I cards and, during the (MS best-practice) Exchange DB verification, was able to saturate the PCI bus on the attached host. Do the math, folks. Sorry to steal any thunder here, Dimitris. Guess you just proved WAFL actually rocks at sequential I/O as well as random :)

    The cherry on this cake is what PAM II will likely do to read response times for this workload. My *conservative* prediction? 3ms @ steady state, 4ms during the "endless" RAID-DP disk rebuild. And with Exchange 2010 moving away from Single Instance Storage on DBs, primary dedupe makes this story almost impossibly better.

    So to recap:

    * Real-world customer scenario on a popular horizontal app
    * Practical, affordable storage array
    * Insane performance, no real impact of "degraded mode"
    * Better double-disk failure protection than RAID 10
    * Better disk failure protection efficiency than RAID 5
    * Auto-tiered, zero-policy solid-state performance acceleration
    * Automatically thin-provisioned
    * Multiple embedded RPOs with fast RTOs on the primary side
    * Application-consistent backup/recovery system, (Snap)Vaulted on the secondary storage system/site
    * Forward-compatible with a primary-dedupe-friendly application upgrade

    Gotta love it when a plan comes together!

  4. Lest Jim McKinstry think his question is being avoided:

    The disks were full, since that other space-related FUD tidbit was being fought as well.

    D

  5. Thanks Jim. I hope you reciprocate: You never answered what happens to a Pillar system if you lose 2 disks in a RAID group – since the LUNs are all over the disks, what happens?

    D

  6. Kinda like 3Par then? Seriously, I'm curious about the details.

    I’ll donate but won’t shave my head, I look scary enough as it is.

  7. @ Jim:

    I thought about this and mathematically it’s pretty cool. So, with enough drives I can see doing 2 levels of RAID:

    Level 1: within each shelf (brick, whatever)
    Level 2: at the chunklet level between shelves (to borrow a 3Par term)

    So, if you have 5 shelves, and a LUN is equally spread out among 5 shelves including parity, I could see losing an entire shelf. Is that how it works? And how does it affect efficiency?

    With most systems, given enough shelves you could spread things out so the system can sustain some pretty insane failures. I can easily configure EMC, NetApp and others to be able to sustain a full shelf failure.

    However, I’m curious how the math works with just a few bricks.

    If you buy an Axiom 300 with only 1 brick, according to your post you can lose the entire brick without data loss, which is frankly amazing and I probably need to go work for Pillar, then apply the same algorithm to my money!

    If you have an Axiom 300 with 2 bricks, again according to your post you can lose an entire brick and be OK. That means you have 2 RAID5 groups for the brick and mirroring on top of that, so RAID51. Obviously efficiency ain’t gonna be that great – parity for RAID within the brick, then mirroring on top of that, plus spares.

    So, I want to understand, how many disks do you need to be able to have the following:

    1. 80% efficiency vs raw
    2. be able to sustain the failures you’re describing

    Dude, honestly I’m intrigued, you probably think I’m trying to out Pillar or something, but this is not a NetApp corporate blog, it’s a personal technology blog and I’m as jaded a techie as you’re ever gonna find, so feed my addiction please 🙂

    Thx

    D

  8. @Jim:

    That comment of yours cannot apply to the config in that Dec '07 report:

    4 bricks, each with 2 RAID5 groups of 6 disks (5 * 500G usable per group): 20T net capacity.

    From this, you carve out 30 * 600G = 18T usable…

    This leaves about 2T worth of space to protect against the failure of 5T…

    How do you do that? Exceeding the Shannon Limit is a feat rivaled only by a perpetuum mobile of the 1st order (which you could only top with one of the 2nd order)…

    I have seen a couple of stunts where people wanted to sell triple-bit redundancy using only two bits (disks) – for obvious reasons, that didn't work out too well either 🙂

    Anyway, covering one entire brick/shelf (which would naturally require an additional brick) sounds like an awful lot of wasted disks…

    In a 10-brick scenario, you have 130 disks:
    10 disks go away for the in-brick hot spares,
    20 disks go away for the RAID5 parity disks,
    10 disks go away for brick-level redundancy (RAID 5+4 or 5+5 or whatever the vertical scheme is),

    leaving 90 data disks (69.2%).

    In comparison, 130 disks with RAID6 would come out to
    8 RGs of size 16 -> 8*14 = 112 data disks + 8*2 = 16 parity disks + 2 global hot spares (86.1%).

    The chance of any single RG going astray is about a factor of ~2000 lower (taking into consideration the 5+1 vs. 14+2 RAID group size, instead of the more common comparison of 7+1 vs. 14+2, which would give a factor of ~4000).

    Are these ballpark numbers about right?
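
    For anyone who wants to sanity-check those ballpark figures, here is a rough Python back-of-the-envelope sketch. The disk counts and layouts are just the assumptions laid out in this comment (130 disks, one spare and two RAID5 parity disks per brick, ~10 disks reserved vertically), not anything from a vendor spec sheet:

        # Back-of-the-envelope usable-disk efficiency for the two layouts above.
        # All numbers are assumptions from this comment, not vendor specifications.

        def efficiency(total_disks, spares, parity, extra_redundancy=0):
            """Return (data disks, fraction of total) after spares/parity/redundancy."""
            data = total_disks - spares - parity - extra_redundancy
            return data, data / total_disks

        # Hypothetical 10-brick layout: 130 disks, 1 hot spare per brick,
        # 2 RAID5 parity disks per brick, ~10 disks for brick-level redundancy.
        data, eff = efficiency(130, spares=10, parity=20, extra_redundancy=10)
        print(f"brick layout: {data} data disks, {eff:.1%}")   # 90 disks, ~69%

        # The same 130 disks as 8 RAID6 groups of 14+2, plus 2 global hot spares.
        data, eff = efficiency(130, spares=2, parity=8 * 2)
        print(f"RAID6 layout: {data} data disks, {eff:.1%}")   # 112 disks, ~86%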

  9. @Jim

    “Since I’ve brought your blog so much traffic how about donating some money towards kids for cancer?”

    I'll do that; however, given that what hair I have finally grew back, I'll pass on the head-shaving part.

    John

  10. Still wondering what kind of drive rebuild times to expect on a full 1TB SATA drive, say one that is part of a 12+2 RAID-DP, or a 5+2, for example? This industry seems to largely avoid the question, or fob it off with “it depends”. The only vendors that will talk about it are those with specific products designed to address it (e.g. XIV, DDN), but real live customers need to know, and I am struggling to find accurate rebuild expectations for NetApp, LSI, CX4, EVA, etc. Thanks, Jim

  11. If a whole drive is being rebuilt, and not just pieces of it (as XIV/3Par and others can do if the drive isn't too full), then it really will depend on the rated MB/s of the drive, plus any “nice” factors that might throttle it.

    If, simplistically, one can get 50MB/s to the drive, you’d be looking at 5.5 hours to do a full 1TB rebuild.

    If there’s other I/O happening and the system is trying to “nice” the rebuild so it’s throttled to 25MB/s then you’re looking at 11 hours.

    If the drive is slow, or the throughput to it is hampered by other things like poor configuration, then obviously that number will go up.
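
    To put that arithmetic in a form anyone can tweak, here is a tiny sketch; the 50MB/s and 25MB/s figures are just the illustrative assumptions above, not rated or measured numbers for any particular drive:

        # Naive full-drive rebuild estimate: capacity divided by sustained rebuild rate.
        # The rates used here are illustrative assumptions, not rated or measured figures.

        def rebuild_hours(capacity_gb, rate_mb_per_s):
            """Hours to rewrite an entire drive at a sustained MB/s."""
            seconds = (capacity_gb * 1000) / rate_mb_per_s
            return seconds / 3600

        for rate in (50, 25):
            print(f"1TB drive at {rate} MB/s: ~{rebuild_hours(1000, rate):.1f} hours")
        # -> ~5.6 hours at 50 MB/s, ~11.1 hours at 25 MB/s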

  12. And another thing @ Jim Kelly:

    The products specifically designed for super-fast rebuilds typically need them.

    It’s kinda like saying “my OS crashes more than yours but hey, it will reboot quicker”.

    Guess what – I’d rather take the more stable OS that takes longer to boot.

    I recommend you follow this thread:
    http://bit.ly/aSbEED, since this was all discussed ad nauseam.

    And also this one… http://bit.ly/9bSMgB

    D

  13. I just found a presentation from IBM Advanced Tech Support in the US, dated Feb 2008 (when IBM first released 1TB SATA for N Series/NetApp), which says:

    1TB SATA Hard Disk Drives (continued)
    Best Practices:
    * RAID rebuilds: ~12 hours with an idle controller
    * Consider FlexShare™ with system priority set on each volume
    * Consider increasing raid.reconstruct.perf_impact
    * Consider using smaller RAID groups

  14. Here is a nice analysis from Sun, talking about RAID-5 resiliency and comparing it to RAID-6 and RAID with triple parity.

    p30-leventhal.pdf

    Also note that this analysis is really pretty much independent of the percentage of disk I/Os dedicated to rebuilds – drive interface speeds, even when maxed out, will not cope with the cruel reality of a single read error every ~10^14 bits read…

    Fortunately, most vendors have some way of dealing with RAID5 failures these days (RAID10, blocks copied/protected across multiple RAID5 groups, RAID6, RAID-DP…)

  15. I dunno, ASIS + thin-provisioned volumes + thin-provisioned VMs + PAM II so your data density does not overrun the spindles you're running it on is pretty darn close to magic.

  16. Take a look at the models of drives going into your system. I fear for a vendor that uses consumer-class SATA drives. The UER rate for drives used by almost all SAN vendors is 1 block in 10^15, not 10^14. This changed a few years ago as SATA vendors wanted to increase reliability and claim their drives were as reliable as SAS/SCSI/FC drives. Of course, the SAS/FC drives have since moved from a 10^15 UER to a 10^16 UER…

    Of course, with a 10+1 drive RAID5 you are very likely to hit at least one UER during a rebuild, which is kind of the point of RAID6 or RAID-DP: you can repair a UER on the fly during a rebuild.
    RAID-5 is DEAD, long live RAID-6.

    Aggressive background scrubs lower the chance, as do preemptive drive failures, where you can copy data off the drive before it fails all the way.
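
    To put a number on “very likely”, here is a quick sketch assuming a 10+1 RAID5 of 1TB drives and the 10^14 vs. 10^15 UER specs mentioned above; it models read errors as independent per bit, which is a simplification of how drives actually fail:

        import math

        # Probability of hitting at least one unrecoverable read error (UER)
        # while reading the surviving drives during a RAID5 rebuild.
        # Assumes independent bit errors (a simplification).

        def p_uer_during_rebuild(surviving_drives, drive_tb, bits_per_error):
            bits_read = surviving_drives * drive_tb * 1e12 * 8   # decimal TB -> bits
            expected_errors = bits_read / bits_per_error
            return 1 - math.exp(-expected_errors)                # Poisson approximation

        # A 10+1 RAID5 of 1TB drives: a rebuild must read all 10 surviving drives.
        for spec in (1e14, 1e15):
            p = p_uer_during_rebuild(10, 1, spec)
            print(f"UER spec of 1 in {spec:.0e} bits -> P(at least one error) ~ {p:.0%}")
        # -> roughly 55% at 10^14, roughly 8% at 10^15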
