Updated blog code, plus a bit about NetApp recovery for cloud providers

Sometime last night/this morning a config file in my blog got corrupted. Maybe it got hacked (I was running an ancient WordPress version, 2.1), but at any rate the site was down.

It’s hosted on a large, famous service provider, and they use NetApp gear.

I was able to recover my file through NetApp snapshots. The provider makes this trivial by giving all users a GUI for it that looks like a normal file manager. All self-service.


No Vblocks, Avamar or Data Domain were harmed in the process, which literally took all of one second to complete, most of which was probably spent on JavaScript doing its thing and the browser refreshing. BTW, I hadn’t touched that file since 2006.

This is a good example of storage for service providers doing more than just storing data.

With alternative solutions, a ticket would have to be opened, a helpdesk person would have to use a backup tool to find my file and restore it, then let me know. A whole lot more effort than what happened in this case.

In other news, I’m running the latest WordPress code, the site is now auto-optimized for mobile devices, and things are smooth again. Oh, and the old theme that most seemed to hate is gone. I’ll see if I can find a suitable picture for the header; for now this is OK.

If only that old version of WordPress I was using had a clean way of exporting stuff. If you look at older articles you’ll notice weird characters here and there. I might fix it. Probably not.


Has NetApp sold more flash than any other enterprise disk vendor?

NetApp has been selling our custom cache boards with flash chips for a while now. We have sold over 3PB of usable cache this way.

The question was raised in public forums such as Twitter – someone mentioned that this figure may be more usable Solid State storage than all other enterprise disk vendors have sold combined (whether it’s used for caching or normal storage – I know we have greatly outsold anyone else that does it for caching alone 🙂 ).

I don’t know if it is. Maybe the boys from the other vendors can chime in on this and tell us how much usable SSD they’ve sold, after RAID. But the facts remain:

  • NetApp has demonstrated thought leadership in pioneering the pervasive use of Megacaches
  • The market has widely adopted the NetApp Flash Cache technology (I’d say 3PB of usable cache is pretty wide adoption)
  • The performance benefits in the real world are great, due to the extra-granular nature of the cache (4KB blocks vs 64+ KB for others) and extremely intelligent caching algorithms
  • The cost of entry is extremely reasonable
  • It’s a very easy way to add extra performance without forcing data into faster tiers.
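A simple way to see why cache granularity matters: when the hot data is scattered in small pieces, a cache that can only hold whole large blocks spends most of its capacity on cold neighbors. The sketch below is a deliberately simplified worst-case model, assuming every hot 4KB block lands in a different cache line. The 512GB cache size is a hypothetical figure for illustration, not a product spec.

```python
def hot_data_capacity_gib(cache_gib, line_kib, hot_kib=4):
    """Worst case: each hot block of `hot_kib` sits alone in its own cache line.

    Returns how many GiB of genuinely hot data fit in a cache of `cache_gib`
    when the cache can only store whole lines of `line_kib`.
    """
    kib_per_gib = 1024 * 1024
    lines = cache_gib * kib_per_gib // line_kib      # number of cache lines
    useful_kib = lines * min(hot_kib, line_kib)      # hot bytes held per line
    return useful_kib / kib_per_gib


# Hypothetical 512 GiB cache, compared at 4 KiB and 64 KiB granularity.
fine = hot_data_capacity_gib(512, line_kib=4)
coarse = hot_data_capacity_gib(512, line_kib=64)
print(f"4 KiB lines:  {fine:.0f} GiB of hot data")
print(f"64 KiB lines: {coarse:.0f} GiB of hot data")
print(f"Advantage of finer granularity: {fine / coarse:.0f}x")
```

Real workloads have some locality, so the gap is usually smaller than this worst case, but the direction of the effect is the same.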

Comments welcome…


NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous blog post, I decided to post an actual graph showing a NetApp system’s I/O latency while under load and during a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Marc Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:


Under a load of 8K-size I/Os at a 53:47 read/write ratio, a single disk was pulled. It’s a pretty realistic failure scenario: a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.
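As a back-of-the-envelope illustration of what that mix means per aggregate, the sketch below splits the load using the figures from this post (over 4K IOPS per aggregate, 8K I/Os, 53:47 read/write). Treating "over 4K" as exactly 4,000 and "8K" as 8 KiB are assumptions made here, so read the results as a floor rather than a measurement.

```python
def split_workload(total_iops, read_pct, io_size_kib):
    """Split a mixed workload into read IOPS, write IOPS and throughput (MiB/s)."""
    read_iops = total_iops * read_pct / 100
    write_iops = total_iops - read_iops
    throughput_mib_s = total_iops * io_size_kib / 1024
    return read_iops, write_iops, throughput_mib_s


# Assumed floor values: 4,000 IOPS per aggregate, 53% reads, 8 KiB I/Os.
reads, writes, mib_s = split_workload(total_iops=4000, read_pct=53, io_size_kib=8)
print(f"Reads:  {reads:.0f} IOPS")
print(f"Writes: {writes:.0f} IOPS")
print(f"Throughput: {mib_s:.2f} MiB/s per aggregate")
```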

Ok… The fuzzy line around 6ms is the read latency. At point 1 a disk was pulled and at point 2 the rebuild completed. Read latency increased to 8ms during the rebuild, but dropped back down to 5ms after the rebuild completed. The line at less than 1ms response time straight across the bottom is the write latency. Yes, it’s that good.

So – there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph 🙂 ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…


Vendor FUD-slinging: at what point should legal action be taken? And who do you believe as a customer?

I’m all for a good fight, but in the storage industry it seems that all too many creative liberties are taken when competing.

Let’s assume, for a moment, that we’re talking about the car industry instead. I like cars, and I love car analogies. So we’ll use that, and it illustrates the absurdity really well.

The competitors in this example will be BMW and Mercedes. Nobody would dispute that they are two of the most prominent names in luxury cars today.

BMW has the high-performance M-series. Let’s take as an example the M6 – a 500HP performance coupe. Looks nice on paper, right?

Let’s say that Mercedes has this hypothetical new marketing campaign to discredit BMW, with the following claims (I need to, again, clarify that this campaign is entirely fictitious, and used only to illustrate my point, lest I get attacked by their lawyers):

  1. Claim the M6 doesn’t really have 500HP, but more like 200HP.
  2. Claim the M6 only does 0-60 in under 5 seconds with only 5% of the gas tank filled, a 50lb driver, downhill, with a tail wind and help from nitrous.
  3. Claim that if you fill the gas tank past 50%, performance will drop so the M6 does 0-60 in more like 30 seconds. Downhill.
  4. Claim that it breaks like clockwork past 5K miles.
  5. Claim that they have one, they tested it, and it performs as they say.
  6. Claim that, since they are Mercedes, the top name in the automotive industry, you should trust them implicitly.

Imagine Mercedes, at all levels, going to market with this kind of information – official company announcements, messages from the CEO, company blogs, engineers, sales reps, dealer reps and mechanics.

Now, imagine BMW’s reaction.

How quickly do you think they’d start suing Mercedes?

How quickly would they have 10 independent authorities testing 10 different M6 cars, full of gas, in uphill courses, with overweight drivers, just to illustrate how absurd Mercedes’ claims are?

How quickly would Mercedes issue a retraction?

And, to the petrolheads among us: wouldn’t such a stunt look like Mercedes is really, really afraid of the M6? And don’t we all know better?

More to the point – do you ever see Mercedes pulling such a stunt?

Ah, but you can get away with stuff like that in the storage industry!

Unfortunately, the storage industry is rife with vendors claiming all kinds of stuff about each other. Some of it is or was true, much of it is blown all out of proportion, and some is blatant fabrication.

For instance, take XIV breaking if you pull 2 disks out. As I state in a previous post, it’s possible if the right 2 drives fail within a few minutes of each other. I think it’s unacceptable, even though it’s highly unlikely to happen in real life. But I’ve seen sales campaigns against the XIV use this as the mantra, to the point that the fallacy is finally stated: “ANY 2 drive failure will bring down the system”.

Obviously this is not true and IBM can demonstrate how untrue that is. Still, it may slow down the IBM campaign.

Other fallacies are far more complicated to prove wrong, unfortunately.

An example: Pillar Data has an asinine yet highly detailed report by Demartek showing NetApp and EMC arrays having significantly lower rebuild speeds than Pillar (as if that’s the most important piece of data management, but anyway, rebuild speed hasn’t helped Pillar sales much, even if it’s true).

Anyone who knows how to configure NetApp and EMC would see that the Pillar box was correctly configured, whereas the others were intentionally made to look 4x worse (in the case of NetApp, they literally went against not just best practices but blatantly against system defaults in order to make it slower). However, some CIOs might read this and give credence to it, since they don’t know the details and don’t read past the first graph.

For EMC and NetApp to dispute this, they have to go to the trouble of properly configuring a similar system, running similar tests, then writing a detailed and coherent response. It’s like wounding an enemy soldier instead of killing them: their squadmates have to help them out, wasting manpower. I get it – it’s effective in war. But is it legal in the business world?

Last but not least: EMC and HP, at the very least, have anti-NetApp reports, blogs, PPTs etc. that literally look just like the absurd Mercedes/BMW example above, sometimes worse. Some of it was true a long time ago (the famous FUD “2x + snap delta” space requirement for LUNs is really “1x + snap delta” and has been for years), some of it is pure fabrication (“it slows down to 5% of its original speed if you fill it up!”). See here for a good explanation.

Of course, again that’s like wounding the enemy soldiers: NetApp engineers have to go and defend their honor, show all kinds of reports, customer examples, etc. Even so, at some point many CIOs will just say “I trust EMC/HP, I’ve been buying their stuff forever, I’ll just keep buying it, it works”. The FUD is enough to make many people who were just about to consider something else go running back to mama HP.

Should NetApp sue? I’ve seen some of the FUD examples and they are not just a bit wrong but magnificently, spectacularly, outrageously wrong. Is that slander? Tortious interference? Simply a mistake? I’m sure some lawyer, somewhere, knows the answer. Maybe that lawyer needs to talk to some engineers and marketing people.

Let’s flip the tables:

If NetApp went ahead and simply claimed an EMC CX4-960 can only really hold 450TB, what would EMC do?

I can only imagine the insanity that would ensue.

I’ll finish with something simple from the customer standpoint:

NetApp sold 1 exabyte of enterprise storage last year. If it was as bad as the other (obviously worried) vendors are saying, does that mean all those customers buying it by the truckload and getting all those efficiencies and performance are stupid and wasted their money?


Pillar claiming their RAID5 is more reliable than RAID6? Wizardry or fiction?

Competing against Pillar at an account. One of the things they said: that their RAID5 is superior in reliability to RAID6. I wanted to put this in the public domain and, if true, invite Pillar engineers to comment here and explain how it works for all to see. If untrue, again I invite the Pillar engineers to comment and explain why it’s untrue.

The way I see it: very simply, RAID5 is N+1 protection, RAID6 is N+2. Mathematically, RAID5 is about 4,000 times more likely to lose data than a RAID6 group with the same number of data disks. Even RAID10 is about 160 times more likely to lose data than RAID6.
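To make the N+1 versus N+2 comparison concrete, here is a rough sketch using the classic textbook mean-time-to-data-loss (MTTDL) approximations for single- and double-parity groups. The drive MTTF, rebuild time and group size are illustrative assumptions, not vendor figures, and the model ignores unrecoverable read errors and correlated failures. It is not meant to reproduce the exact ratios quoted above, only to show why the gap is measured in thousands.

```python
def mttdl_single_parity(data_disks, mttf_hours, mttr_hours):
    """Approximate MTTDL (hours) for an N+1 group such as RAID5.

    Data is lost if a second disk fails while the first is still rebuilding.
    """
    n = data_disks + 1  # total disks in the group
    return mttf_hours ** 2 / (n * (n - 1) * mttr_hours)


def mttdl_double_parity(data_disks, mttf_hours, mttr_hours):
    """Approximate MTTDL (hours) for an N+2 group such as RAID6.

    Data is lost only if a third disk fails while two are still rebuilding.
    """
    n = data_disks + 2  # total disks in the group
    return mttf_hours ** 3 / (n * (n - 1) * (n - 2) * mttr_hours ** 2)


# Illustrative assumptions: 14 data disks, 1,000,000-hour MTTF, 24-hour rebuild.
DATA_DISKS, MTTF_HOURS, MTTR_HOURS = 14, 1_000_000, 24

single = mttdl_single_parity(DATA_DISKS, MTTF_HOURS, MTTR_HOURS)
double = mttdl_double_parity(DATA_DISKS, MTTF_HOURS, MTTR_HOURS)
print(f"N+2 outlasts N+1 by roughly {double / single:,.0f}x under these assumptions")
```

With the same number of data disks, the ratio in this model simplifies to MTTF / ((data_disks + 2) × MTTR), so longer rebuilds and bigger groups both shrink the advantage, while more reliable drives widen it.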

The only downside to RAID6 is performance. If you want the protection of RAID6 but with extremely high performance, then look at NetApp: the RAID-DP that NetApp employs by default has, in many cases, better performance than even RAID10. Oracle has several PB of DBs running on NetApp RAID-DP. Can’t be all that bad.

See here for some info…