Vendor FUD-slinging: at what point should legal action be taken? And who do you believe as a customer?

I’m all for a good fight, but in the storage industry it seems that all too many creative liberties are taken when competing.

Let’s assume, for a moment, that we’re talking about the car industry instead. I like cars, and I love car analogies. So we’ll use that, and it illustrates the absurdity really well.

The competitors in this example will be BMW and Mercedes. Few would dispute that they are two of the most prominent names in luxury cars today.

BMW has the high-performance M-series. Let’s take as an example the M6 – a 500HP performance coupe. Looks nice on paper, right?

Let’s say that Mercedes has this hypothetical new marketing campaign to discredit BMW, with the following claims (I need to, again, clarify that this campaign is entirely fictitious, and used only to illustrate my point, lest I get attacked by their lawyers):

  1. Claim the M6 doesn’t really have 500HP, but more like 200HP.
  2. Claim the M6 does 0-60 in under 5 seconds only with 5% of the gas tank filled, a 50lb driver, downhill, with a tail wind and help from nitrous.
  3. Claim that if you fill the gas tank past 50%, performance will drop so the M6 does 0-60 in more like 30 seconds. Downhill.
  4. Claim that it breaks like clockwork past 5K miles.
  5. Claim that they have one, they tested it, and it performs as they say.
  6. Claim that, since they are Mercedes, the top name in the automotive industry, you should trust them implicitly.

Imagine Mercedes, at all levels, going to market with this kind of information – official company announcements, messages from the CEO, company blogs, engineers, sales reps, dealer reps and mechanics.

Now, imagine BMW’s reaction.

How quickly do you think they’d start suing Mercedes?

How quickly would they have 10 independent authorities testing 10 different M6 cars, full of gas, in uphill courses, with overweight drivers, just to illustrate how absurd Mercedes’ claims are?

How quickly would Mercedes issue a retraction?

And, to the petrolheads among us: wouldn’t such a stunt look like Mercedes is really, really afraid of the M6? And don’t we all know better?

More to the point – do you ever see Mercedes pulling such a stunt?

Ah, but you can get away with stuff like that in the storage industry!

Unfortunately, the storage industry is rife with vendors claiming all kinds of stuff about each other. Some of it is or was true, much of it is blown all out of proportion, and some is blatant fabrication.

For instance, take the claim that XIV breaks if you pull 2 disks out: as I stated in a previous post, it’s possible if the right 2 drives fail within a few minutes of each other. I think that’s unacceptable, even though it’s highly unlikely to happen in real life. But I’ve seen sales campaigns against the XIV use this as the mantra, to the point that the fallacy is finally stated: “ANY 2 drive failure will bring down the system”.

Obviously this is not true and IBM can demonstrate how untrue that is. Still, it may slow down the IBM campaign.
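
To put a number on “highly unlikely”, here is a quick back-of-envelope sketch (my own toy model, with assumed figures for drive count, failure rate and exposure window, not IBM’s data):

    # Toy estimate: how often do ANY two drives fail within the same short window?
    # Every figure below is an assumption, for illustration only.
    DRIVES = 180          # assumed drive count for an XIV-class system
    AFR = 0.03            # assumed 3% annualized failure rate per drive
    WINDOW_HOURS = 0.5    # assumed exposure window ("a few minutes", rounded up)

    HOURS_PER_YEAR = 24 * 365
    per_drive_hourly = AFR / HOURS_PER_YEAR
    first_failures_per_year = DRIVES * AFR    # expect ~5.4 first failures/year

    # Probability that at least one OTHER drive fails inside the window:
    p_second = 1 - (1 - per_drive_hourly * WINDOW_HOURS) ** (DRIVES - 1)
    print(first_failures_per_year * p_second)    # ~0.0017/year: once in ~600 years

And even that overstates the risk, since it counts ANY second drive failing – requiring the “right” pair makes it rarer still.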

Other fallacies are far more complicated to prove wrong, unfortunately.

An example: Pillar Data has an asinine yet highly detailed report by Demartek showing NetApp and EMC arrays having significantly lower rebuild speeds than Pillar (as if that’s the most important piece of data management; and anyway, rebuild speed hasn’t helped Pillar sales much, even if the claim is true).

Anyone who knows how to configure NetApp and EMC gear would see that the Pillar box was configured correctly, whereas the others were intentionally made to look 4x worse (in the case of NetApp, the testers went not just against best practices but blatantly against system defaults in order to make the box slower). However, some CIOs might read the report and give it credence, since they don’t know the details and don’t read past the first graph.

For EMC and NetApp to dispute this, they have to go to the trouble of properly configuring a similar system, running similar tests, and then writing a detailed and coherent response. It’s like wounding an enemy soldier instead of killing him: his squadmates have to carry him out, tying up manpower. I get it – it’s effective in war. But is it legal in the business world?

Last but not least: EMC and HP, at the very least, have anti-NetApp reports, blogs, PPTs etc. that literally look just like the absurd Mercedes/BMW example above, sometimes worse. Some of it was true a long time ago (the famous FUD “2x + snap delta” space requirement for LUNs is really “1x + snap delta” and has been for years), some of it is pure fabrication (“it slows down to 5% of its original speed if you fill it up!”). See here for a good explanation.
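
To make the space arithmetic concrete, here is a trivial sketch with made-up numbers (as I understand it, the difference boils down to whether the LUN’s overwrite reservation, the fractional reserve, is left at 100% or set to 0, which NetApp has allowed for years):

    lun_gb = 100
    snap_delta_gb = 20    # assumed rate of change held in snapshots

    fud_claim = 2 * lun_gb + snap_delta_gb    # "2x + snap delta" = 220 GB
    reality   = 1 * lun_gb + snap_delta_gb    # "1x + snap delta" = 120 GB
    print(fud_claim, reality)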

Of course, again that’s like wounding the enemy soldiers: NetApp engineers have to go and defend their honor, show all kinds of reports, customer examples, etc etc. Even so, at some point many CIOs will just say “I trust EMC/HP, I’ve been buying their stuff forever, I’ll just keep buying it, it works”. The FUD is enough to make many people that were just about to consider something else, go running back to mama HP.

Should NetApp sue? I’ve seen some of the FUD examples and they are literally not just a bit wrong but magnificently, spectacularly, outrageously wrong. Is that slander? Tortious interference? Simply a mistake? I’m sure some lawyer, somewhere, knows the answer. Maybe that lawyer needs to talk to some engineers and marketing people.

Let’s turn the tables:

If NetApp went ahead and simply claimed an EMC CX4-960 can only really hold 450TB, what would EMC do?

I can only imagine the insanity that would ensue.

I’ll finish with something simple from the customer standpoint:

NetApp sold 1 exabyte of enterprise storage last year. If it were as bad as the other (obviously worried) vendors are saying, does that mean all those customers buying it by the truckload, and getting all those efficiencies and performance, are stupid and wasted their money?

D

44 Replies to “Vendor FUD-slinging: at what point should legal action be taken? And who do you believe as a customer?”

  1. Great post — thanks for pointing out what needs to be said.

    As an employee of one of the companies (EMC) I can speak for many of us that we really, really hate it when anyone does that sort of thing, including our own employees.

    There are a lot of good storage products in the marketplace today. Customers have lots of choices. That’s good. And to the extent we can encourage all the participants to focus on their unique strengths, so much the better.

    My personal pet peeve isn’t so much the competition bashing, but the egregious overstatement of capabilities. The first one is annoying, the second one is downright dangerous.

    Thanks for sharing!

    — Chuck

  2. Excellent post.

    From a customer perspective, all the aggro that EMC give NetApp just makes me think worse of EMC. They should shout about the positives of their technology, not the supposed negatives of a competitor’s.

  3. Great post. I agree with you completely. The FUD has gotten to a point where it is almost comical. I just find it interesting that you did not mention any NetApp FUD examples…..

  4. Great post, but it’s not just the storage industry. Even in the server industry, “Big Blue” is doing a good job of creating a marketing campaign to show how good their new 2-processor server based on the upcoming Intel Nehalem-EX processor is, but they compare it to the competition’s current-generation offerings (based on the 2-processor Xeon 5500). The Big Blue field sales team sucks it up and then spouts it out to the partner community, not realizing how it makes them look. I wish all of these technology vendors would stop the bashing and just start talking about their VALUES. I’d respect them much more.

  5. FUD might be considered excessively negative claims about a competitor’s products. The flip side, of course, is excessively positive claims made by the vendors themselves – and (as Chuck points out) that’s just as much of a problem in our industry. It’s as though the M6 really does have only 350HP and really does need nitrous to achieve that 0-60 number and really does have poor reliability, but BMW claims it’s better and Mercedes claims it’s worse. The two phenomena go hand in hand. If one were to abate, the other probably would (somewhat) as well. When “BMW” sticks to claims that are within some reasonable percentage of reality, maybe “Mercedes” wouldn’t feel such a strong need to counter those claims.

  6. Chuck,

    I can’t believe you wrote that; you are the most regular NetApp slammer in the blogosphere. Otherwise, the sort of discussion Dimitris writes about here appears with some regularity as arguments on Barry Burke’s blog – straight from the guts of EMC’s competitive analysis bureau. If you really mean what you write then do something about it.

    (for people that don’t know me, I work for 3PAR – a company that competes with both EMC and NetApp)

  7. Interesting comment on the Demartek report. How exactly was it not fair? Each system had 5+1 single-parity RAID groups. This was done to guarantee that the rebuilds were “fair”. At the time of the test, each Pillar “brick” had two 5+1 RAID 5 disk groups. Rebuilds will be fast because of the dedicated RAID controllers and the small number of drives in the RAID group. That’s storage 101. If Demartek had used bigger RAID groups than Pillar’s for EMC and NetApp then it would not have been a fair, apples-to-apples comparison. Demartek went out of its way to make sure this test was fair and above board. I think Dimitris is upset that the performance of the NetApp box looked bad. That wasn’t the point of the test. The system was configured to give NetApp the fastest rebuild times, not the best system performance.

    For those who don’t know why Dimitris is posting this, here’s a little background:
    last week @valb00 was tweeting his normal stream of NetApp marketing information and tried to get a little technical about RAID DP
    — I made a comment about NetApp rebuilds being slow and that’s why NetApp NEEDS RAID DP
    — He made a comment about FUD
    — I tweeted him a link to the Demartek report and said it’s fact, not FUD.
    — He never replied.
    At this point I imagine that he sent the link to Dimitris and asked him to blog about it. I find it hard to believe that Dimitris just happened to find this report that was published in December of 2007 on his own.

  8. @Jim: the NetApp guys just love sticking the FUD badge onto everyone when their arguments get weak. Of course their single-processor-no-RAID-controller filers will rebuild slowly, especially under heavy load… To be fair to them, I don’t think that they are being nasty or anything like that. At NetApp, there is this widespread dogmatic belief that the 1992 WAFL/RAID-4 code can do it all, for everyone.

  9. Uh, Jim. Your point was that NetApp rebuilds are slow, and you offered a motive.
    The NetApp FAS system doesn’t support RAID-5; it uses RAID-4. Maybe that is why they allowed the NetApp box to use something other than RAID-5. But you don’t use RAID-DP to increase performance. How does double parity increase performance?
    You use RAID-DP or RAID-6 to increase reliability. In fact, I feel for anyone out there still running RAID-5. Move to RAID 10 or RAID 6 if you care about your data.
    Also, his lack of response to you doesn’t confirm your point; it just means he might be a very busy guy.
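
    To put a number on the RAID-5 point, here’s a back-of-envelope sketch (my own, with assumed spec-sheet figures):

        URE_PER_BIT = 1e-14      # assumed unrecoverable read error rate, SATA class
        DISK_BITS = 500e9 * 8    # a 500GB drive
        SURVIVORS = 5            # a 5+1 RAID-5 group rebuilding its failed member

        # Every surviving bit must be read to reconstruct the dead disk; with
        # single parity, one unrecoverable read error during that pass = data loss.
        p_loss = 1 - (1 - URE_PER_BIT) ** (SURVIVORS * DISK_BITS)
        print(f"{p_loss:.0%}")   # roughly 18% per rebuild, under these assumptions

    A second parity drive turns that into a recoverable event instead of lost data.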

    I guess I should also disclose I work for NetApp.

    The real problem here is that most of us are experts on performance tools, and know how to make our own devices sing. We also know where the competition falls off. It’s natural for marketing people to show off a product where it works best. If you have a dragster, and I have a Formula 1 car, you want to race me on a straight strip from a stop, while I want to race around corners. Both are high-performance cars, but they are built on different premises. No matter which track we use, someone is going to feel cheated.

  10. We hear that Dimitris Krekoukias thinks the drive rebuild comparison test that we conducted in the latter half of 2007 and published in April 2008 is somehow flawed and qualifies as FUD. I disagree. The storage systems were all configured as 5+1 single parity RAID groups. The exact same stress tests were run from the same host server onto each of the storage systems, with nothing else running on the host server and nothing else running on the storage systems. We ran the tests and reported the results. Has Dimitris run similar tests in his lab and received different results?

    In the interest of full disclosure, we have performed various tests for EMC, NetApp, Pillar Data Systems and other storage and system vendors. In every test we run where there are competing systems involved, we attempt to configure the hardware identically or as close to identically as possible, and run the exact same tests on each. By doing this, it is our goal to eliminate FUD and simply present the facts.

    We have found that all vendors in a competitive marketplace, whether they be automobile manufacturers, cell phone service providers or computer storage vendors, try to position their product as the best solution. This is fair and you would expect nothing less. However, sometimes marketing claims get exaggerated. In the case of technology products, these claims can be tested in the lab and validated or refuted.

  11. Dimitris was merely pointing out an example with the Demartek report. What happens is a customer gets this report. It sits on their desk, maybe on top of a pile of other vendor foo, while they go about putting out fires and trying to advance projects – their day job. They get a call from Pillar and the guy on the other end of the line asks if the customer had a chance to read the report. The only reason the customer answered the phone is they brought lunch back to their desk and, between bites, they were cleaning out their Inbox, doing expenses, checking some football scores or other odds & ends. It is their lunch hour after all. The customer patronizes the Pillar guy by saying, yeah, they’re looking at it right now (glancing over at the report and maybe flipping open some pages). The Pillar guy then launches into the “facts” of the report; the customer goes back to checking their Inbox. Lunch over, the customer walks to their next meeting where someone suggests they buy NetApp or EMC for the next project. The customer starts to repeat some of the “factoids” they just heard on the phone, and thinks maybe they should research this a little deeper.

    Now, I’m all for research but the Demartek report is marketing malpractice. It doesn’t come remotely close to configuring a NetApp system according to NetApp best practices or even the default values. Customers don’t use Pillar manuals to configure NetApp systems so I’m not sure how this is Storage 101. Storage 101 would say, turn on the system; accept defaults; tune if needed. It should go without saying that customers shouldn’t use Pillar defaults when configuring NetApp systems. The bottom line is we see this all the time and there seems to be a level of tolerance in the storage industry that you don’t see anywhere else.

    Personally, I love it when a customer trots this stuff out, and I’ll tell you a tactic I use. I briefly look over the doc, kindly refute one or two of the more glaring factual errors and then ask the customer if they wouldn’t mind inviting in the vendor that supplied them the report or list of competitive points. I’d be more than happy to discuss it with the other vendor in the room. In fact, I’d like to see the other vendor give a detailed technical explanation of why they believe this stuff to be true. There’s a 99% shot your competitor won’t come to that meeting. (I’ve only had one competitor take me up on this offer. Self-immolation is horrifying but you just can’t take your eyes off it.) It helps short-circuit this tactic and then lets you focus on the customer to see if you can help them or not. And, yes, there are times where I have recommended a non-NetApp solution because I don’t want to shoehorn NetApp into a project where we don’t fit and ruin the prospects of helping in areas where I know we can excel.

  12. Mike, please point me to NetApp’s best practices for configuring a NetApp box to achieve fast drive rebuilds. The NetApp box could have been configured to achieve optimal performance, but that would have made the drive rebuilds even worse. Every effort was made to make sure this test was done fairly. The report is very clear and there’s no FUD in it at all.

  13. @ Jim McKinstry:

    Nobody told me to write this. I was competing against Pillar recently, and the VAR showed my prospect this report, claiming EMC and NetApp are super-slow and that nobody in their right mind should consider buying such archaic technology; instead, they should be buying something modern (they said the lack of market penetration over several years is because people are too afraid to switch to Pillar).

    I’m sure the data in the Demartek report is 100% accurate: it’s what the systems could do, configured as such.

    But configuring the systems as identically as possible is counter-intuitive – not all systems have the same best practices and architectures. The goal is to set up each system properly, the way its vendor would, provision the requisite amount of storage, THEN test.

    Looking at the Demartek report, some basic observations were:

    1. No RAID-DP (which is the default RAID for NTAP; someone had to actually override that)
    2. Small aggregates (not best practice: the goal is to aim for as few aggregates in a system as possible, and with 500GB drives there could be 32-drive aggregates). EMC could do something similar with MetaLUNs.
    3. No multipathing on the back end per RAID group, due to how it was done (again, going against best practices and defaults). Can be done on both NTAP and EMC.
    4. What was the reconstruct priority? Tweakable on both NTAP and EMC.
    5. Someone specified which disks to use on NTAP instead of letting the system pick automatically – this assumes that this someone knows better…

    D

  14. Dimitris
    1. No RAID-DP – right. Wanted to keep this apples-to-apples and give NetApp the best shot at good performance.

    2. Small aggregates — again, larger disk groups would slow rebuild performance. I wanted to give NetApp the best chance to succeed.

    3. No multipathing on the back end — this was not intentionally left out. It shouldn’t impact the rebuild performance though. Or are you saying that a 6-drive disk group can saturate the back-end of a NetApp filer?

    4. What was the reconstruct priority? — we used the default which, I believe, was “ASAP”

    5. Someone specified what disks to use — not done intentionally (didn’t even realize this was done). Shouldn’t impact the rebuild performance though.

    Are you saying that NetApp can rebuild a 32-disk RAID-DP group faster than a 6-disk RAID 4 group? I’d love to see those results.

  15. Oh Jim…

    All I’m saying is very very very simple:

    You should have run the test with the defaults, seen what you got, then monkeyed with the config after reading EMC’s and NetApp’s best practices for performance.

    That’s typical in many tests: You want to show results with the defaults and then, if possible, with a tuned config.

    You went out of your way to NOT use the defaults.

    I’d say the onus is on Pillar and Demartek, NOT EMC and NetApp, to re-run the tests with the default configs.

    What you did assumes you know exactly how EMC, NetApp and Pillar all work. I’ll grant you know 100% how Pillar works, but unless you worked as an engineer at both EMC and NetApp I’d advise you to make no assumptions.

    Your assumptions in the above post are tragically wrong, BTW.

    What if you were testing XIV or 3Par – how would you carve up those arrays to make it “apples to apples”?

    XIV and 3Par rebuild performance is second to none BTW.

    And – look how much time we’re wasting talking about rebuild performance, which illustrates EXACTLY the point of the original post – who cares and under what circumstances?

    I’d say you need to really really care if you have RAID5 and don’t do RAID10, RAID-DP or RAID-6.

    A complete data management solution involves a little bit more than fast rebuild speeds.

    D

  16. Awesome blog. I am a NetApp employee (full disclosure) and was just having the same thoughts this morning. What an awesome analogy here Dimitris!

    Chuck, I agree about the overstatement of capabilities, but who is to say they are overstated? The competition, who may or may not know how said technology works or how to configure it? Not in my mind.

    If a vendor grossly overstates a product’s capability, the customer will know! How likely do you think that customer will be to buy from that vendor again? Do you not think customers often do proofs of concept? Of course they do. A vendor has to be able to back up its claims or it will run out of customers really fast.

    Keith

  17. @ whoever “formerly netapp” is (pinco@yahoo.com):

    At least let everyone know who you are… I don’t moderate; indeed, your comment somehow showed up as spam and I had to unspam it. But at least keep it more intelligent, please.

    D

  18. Dimitris, I’ve spoken to enough NetApp customers to understand that a failed drive kills their performance and takes a long time to rebuild. The Demartek report simply points that out in no uncertain terms. Ship me a 3170 and I’ll be happy to do some tests for you using your criteria. The only non-default that I’m aware of was the number of drives per RAID group.

    XIV does have very impressive rebuild times. They don’t actually rebuild a drive in the classic sense but they do recreate the copy of data that was on the failed drive very quickly. I think 3PAR does a similar thing. Kudos to them both.

  19. I can sell you a 3170 with all the trimmings, at list price, and you can test all you like…

    I have documents showing ZERO performance loss even with 2 failed drives (heavy-duty archival workload). So here we are again: you say you “have spoken to enough NetApp customers” that have this alleged problem (sounds like a big problem too – “kills their performance”), yet I have documentation proving the problem doesn’t exist in a properly configured system.

    You can of course MAKE the problem appear with some configs.

    Are you saying that a small Pillar box with under 20 drives will still rebuild drives like a champ?

    Parallelization is the key to storage performance.

    And why are you still so hung up on rebuild speeds?

    Look at the last sentence in my original post.

    D

  20. @ Jim again:

    You asked “Are you saying that NetApp can rebuild a 32-disk RAID-DP group faster than a 6-disk RAID 4 group? I’d love to see those results”

    Again – it really helps to understand how the “other guys” do it, otherwise you will make really bad assumptions that waste everyone’s time.

    There ain’t no such thing as a 32-disk RAID group, but there is such a thing as a 16-disk RAID-DP group, plus you can have several RAID groups in an aggregate (parallelization).

    You do NOT put each LUN in a single RAID group with NetApp (or 3Par or XIV).

    In a nutshell, this is part of why the Demartek test was flawed.

    A LUN is normally spread across many RAID groups by default, kinda like Pillar does, and like XIV and 3Par do (and EMC with MetaLUNs).
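
    A toy sketch of the idea (mine, not any vendor’s actual allocator):

        RAID_GROUPS = 8
        LUN_EXTENTS = 64

        # round-robin the LUN's extents across every RAID group in the aggregate/pool
        placement = [extent % RAID_GROUPS for extent in range(LUN_EXTENTS)]

        rebuilding = 0    # say group 0 loses a drive
        touched = placement.count(rebuilding)
        print(f"{touched} of {LUN_EXTENTS} extents sit on the degraded group")    # 8 of 64

    So a rebuild degrades only a slice of any given LUN instead of all of it.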

    D

  21. I’ve been thinking about this all day, and I’ve come to an interesting conclusion — vendor FUD happens because customers allow it to happen.

    Look at the evidence. Customers don’t tolerate shoddy products, so there are very few. Customers don’t tolerate overpaying, so prices are usually in line. Customers don’t tolerate crummy customer service, so that’s usually pretty good as well.

    In our economy, paying customers are God. And if they tell us “we won’t tolerate any more of this FUD”, by gosh, you’d see a helluva lot less of it.

    Interesting note: now that I think of it, I know several of our customers who handle the issue with all of us vendors the same way:

    If you claim something about your product, be prepared to back it up. If you claim something about someone else’s product, be prepared to back that up as well. Otherwise, don’t bother saying anything at all.

    Very effective, I would think.

    Thoughts?

  22. @Dimitris

    Disclaimer: Pillar Data Systems Advanced Support Engineer.

    The Pillar Axiom doesn’t put a LUN onto one array either; that’s been the foundation of our system from the very beginning. We separate our drives into 6-disk RAID 5 arrays on SATA systems. Each storage enclosure contains two arrays, 1 global spare and two RAID controllers (1 per array). We then aggregate the space from those arrays into a central storage pool and place pieces of LUNs on separate arrays to maximize spindle count and therefore performance.

    The reason the rebuild test was fair is because the Axiom has one RAID controller for every six disk array. We can’t define a 16 disk DP RAID array, even if we wanted to, which we don’t, so to make a test that was close to what the Axiom has, Demartek had to define a 5+1 RAID4 array. The test was run and the Axiom was faster. That’s the end of the story. This was ONE test; rebuild times. Not performance, rebuild.

    Now, you may be right that this isn’t a typical config for a NetApp system and you may bemoan the need for fast rebuilds when you have RAID6/DP, but for Pillar it is important that the maximum redundancy of the Axiom be maintained at all times, and if we lose a layer, the faster we get it back the better. The fact that we have one RAID Controller for every six disks is a plus for that stability and performance reliability.

    Facts speak for themselves and trying to label this test as a distortion is why I’m not in marketing 🙂

  23. @ Jim – Why would you dumb down the capabilities of a box just to do a like-for-like comparison? It’s not like for like if something is capable of more.

    For example, using the car analogy, would you test an M6 without the M-mode button turned on, with EDC set to the softest setting and DSC turned on, and call that a like-for-like comparison? Would you compare a car on snow tires vs one running on R-compounds on the track? It’s a dumb and useless comparison, one that is done only for competitive FUD-throwing rather than serving your customers’ interests. I would ask what the point of that test is. So what if rebuild times are slow, when you can handle a double disk failure? Moot point.

  24. Dimitris, you are correct. I misread “32 disk aggregate” as “32 disk RAID group”. At the time of the test you supported 16 disks in a group. I just assumed you had increased it. Again, a 16-drive disk group will rebuild slower than a 6-disk group. I didn’t want to stack this test against NetApp.

    “Are you saying that a small Pillar box with under 20 drives will still rebuild drives like a champ?”
    A Pillar box rebuilds a drive in the same amount of time whether there are 12 disks or 832 disks. It’s always the same. That’s one of the many advantages of our distributed RAID architecture.

  25. [Full Disclosure]
    I’m a customer. I work for SAP and we use all kinds of storage vendors.
    [Full Disclosure]

    @Jesse –

    That’s not talking about facts. What you guys did is (going back to the car example) take a look at what you can do with your car and say something like “we have the best-performing 12-valve engine”. Since we don’t use a 24-valve engine like the competition, we changed their engine so it would only use 12 valves and then compared the performance data.

    That is just a non-useful comparison to me. Who in their right mind would buy a NetApp box and tell the engineers to go against best practices and carve up the box into 5+1 RAID4 arrays? Name me one customer who would do something like that.

    Or, to turn the question around: say your box supported a 16-disk DP RAID array but the 5+1 RAID4 was faster in rebuilds. Do you actually think a customer would go against your recommendations and configure a 16-disk DP array “because this was the way they configured it in a benchmark”?

    As a customer I want an array that best suits my overall environment and workload, and that is not just performance; that is reliability, maintenance, support, backup, and most likely a lot more. Rebuild times are just a part of that, and with the 2TB SATA drives I don’t even care about DP configs: they suck in rebuild times for larger environments. Give me TP or define something else.
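
    (A quick back-of-envelope on why the big drives hurt, with assumed numbers:

        drive_bytes = 2 * 10**12     # a 2TB SATA drive
        rebuild_Bps = 50 * 10**6     # assumed sustained rebuild rate, bytes/sec
        print(drive_bytes / rebuild_Bps / 3600)    # ~11 hours, best case, zero user load

    and real systems throttle rebuilds under user load, so the window is usually longer.)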

    I don’t give a hoot about the rebuild times; I just don’t want a big performance impact during a rebuild, and I don’t want to run the risk of losing data during the rebuild. How you do it is your thing; the result is the only thing I as a customer am interested in.

    Bas

  26. @ Bas and Josh – thanks, that’s the entire point.

    Pillar, NetApp and EMC architectures are so vastly different that the Demartek test is even more pointless.

    @ Jim’s latest: the whole point is that, since data on NetApp is normally spread out (which you did not do), a drive rebuild doesn’t affect speeds; and with multipathed RAID groups on multiple loops (a default setting you also did not use), rebuild times are a fraction of what was reported.

    @ Jesse – since Pillar spreads the LUNs over all disks, why not let NetApp do the same (which is the default) and EMC as well (MetaLUNs)?

    The Demartek test clearly illustrates why I even wrote the post in the first place.

    Unless someone KNOWS how EMC and NetApp arrays can be configured, they might read it and say “well – this makes sense, maybe I need to look into this Pillar company, I hear Larry gives them all the money he finds under his sofa cushions”.

    Then, EMC and NetApp engineers have to spend their time explaining why the test is asinine, and sometimes the customer believes them, sometimes not. Anyway, the FUD was successful in wounding the soldiers. Look how much time we’re spending on this!

    I suggest Pillar find better ways to show true value than go into accounts with a 3-year-old report that wasn’t done right in the first place. If Demartek wants the full explanation on why the report is useless, they can contact me off-line and I’ll send them a multi-page document explaining why their reasoning was flawed, plus detailed data showing NetApp systems suffering NO performance degradation with multiple disk failures (interestingly, with an IDENTICAL system to the one Demartek used). I will then expect a loud, public retraction, and never to see this document in the wild ever again.

    Look at 3Par: their gear is arguably way cooler than Pillar’s and far superior when it comes to performance (both actual performance and rebuild time). They have the SPC record. Yet they’re relegated to being a niche player since, while they take care of some problems well, they don’t tackle most of the problems. But still, I’ve never seen them use such silly tools in order to succeed – in a similar test they would utterly annihilate the Pillar box.

    On the other hand, NetApp and EMC keep growing, each for different reasons: NetApp for having a single solution that takes care of most problems (simpler), EMC for being able to cobble together multiple boxes to take care of most problems. Either way, the customer’s problems are being addressed.

    If I’m buying, parlor tricks aside, I need something that will help me deal with my data management issues holistically, not just a point solution.

    The customers are voting with their orders, and NetApp is deploying 2.8PB of enterprise storage PER DAY (and I’m sure EMC some huge number as well, Chuck will probably tell us).

    So – you either need better FUD to fight excellence, or you need to build excellence and win fair and square. FUD just makes you annoying. Excellence makes you liked and respected.

    Which would you rather be?

    D

  27. @ Dimitris

    To quote Sigmund Freud, “Analogies, it is true, decide nothing but they can make one feel more at home.”

    If everyone is so fond of analogies, how about this one: a car has as much in common with a storage system as a monkey does with a whale; they are both mammals, but beyond that there is nothing. Likewise, the car and the storage system are both machines, but that’s all they have in common. So if we can drop the tortured and ineffective analogy and get down to the crux of the point, that would be super.

    The point of the Demartek test (not ours, theirs) was to gain metrics on rebuild times comparing storage systems to each other. I fully understand that each storage system has a different architecture and implementation in a fully functional environment, but that is NOT what is being tested here. The commonality between all (or at least most) storage systems is that we all use RAID standards to create the first level of data redundancy. Since the Axiom, at the time, had fixed 6-disk RAID5 arrays per controller, the NetApp box was set up as close to that as possible. This was based on the fact that having more disks in the array or using DP would make rebuilds on NetApp LONGER. If this assumption is incorrect, then Demartek should be presented with evidence to illustrate that; otherwise the tester will go with what practical experience tells him is correct. Of course he may have just read the top of page 112 in this document: http://www.redbooks.ibm.com/redbooks/pdfs/sg247129.pdf

    Keep in mind that during these tests there were several iterations of I/O, going from none to heavy. While there was a noted performance degradation, this is ancillary to the point of the tests. This was to test one metric and only one metric: rebuild times. I understand that this is not a practical NetApp configuration for production environments but, as stated before, it was used in order to give the NetApp box a comparable configuration, using as many defaults as possible to match the Axiom and not hurt its rebuild times by adding DP or more disks.

    What you are failing to realize is that the test is not crippling the NetApp box, but giving it a fair shot. If this same box were to be put into production, they would most likely not use RAID4 5+1 arrays and would follow the best practices. THAT IS A DIFFERENT TEST.

    You can argue the practicality or applicability of the test as it pertains to your architectures, but the facts are the facts. In rebuild times, the Axiom was faster. End of story.

    That isn’t the only virtue of the Axiom (e.g. 80% utilization), but it is one, and the suggestion that we not tout it as a positive attribute is, frankly, ridiculous. It just points out one of the benefits of Pillar’s distributed RAID architecture. Honestly, I’m wondering why Dimitris is putting so much effort into smearing the results if his sales figures are so dazzling. Perhaps he should extend NetApp’s “excellence” into rebuild times and the whole test becomes moot.

    BTW Dimitris, Larry funds us with the money under his mattress.

  28. @ AD: Why would you even post that? Don’t make me start deleting inane comments.

    @ Jesse: Page 112 of that manual is absolutely correct in general: the more disks in a group, the longer it takes to rebuild. That’s a universal truth – AS LONG AS YOU ARE COMPARING IDENTICAL RAID TYPES. Which RAID-DP and RAID-4 ARE NOT.

    What you fail to realize (which is probably entirely NetApp’s fault for not explaining it better; NetApp utterly sucks at marketing) is that:

    * RAID-DP actually helps a lot with rebuild times vs RAID4
    * On a system with multiple shelves, RAID groups would belong to multiple back-end loops on many shelves, further helping with throughput
    * Shelves themselves need to be multipathed – 2 paths per shelf per controller
    * A typical aggregate could have multiple RAID groups
    * A LUN would be all over that aggregate
    * Ergo, a workload would be so spread over the disks that performance hits will be nil, and rebuild performance will be high
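
    To see how those last points interact, here’s a deliberately crude toy model (mine; it ignores the spare’s write limit, CPU and user load, any of which can cap the real numbers):

        def rebuild_hours(disk_gb, survivors, per_disk_mbps, loops, loop_mbps):
            # reconstruction is fed by parallel reads from the surviving disks,
            # capped by the total back-end loop bandwidth available
            feed_mbps = min(survivors * per_disk_mbps, loops * loop_mbps)
            return disk_gb * 1024 / feed_mbps / 3600

        print(rebuild_hours(500, 5, 40, 1, 200))     # 5+1 group on one loop: ~0.7h
        print(rebuild_hours(500, 27, 40, 2, 200))    # two 14+2 groups, two loops: ~0.36h

    Crude as it is, it shows why “small groups rebuild faster” isn’t automatic once parallel spindles and multipathed loops feed the reconstruction.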

    All you need to do is let the system build the aggregate itself; you don’t need to specify anything besides the size of the aggregate to automatically get all of the above. The process literally takes a few seconds and you’re done.

    I totally get that the purpose was noble, but there’s a reason all that automation is there in the first place 🙂

    Demartek chose to totally bypass this foolproof and 100% automated process, and instead laboriously made the RAID groups as close to Pillar as possible. All based on the flawed premise that RAID-DP would be slower. But wouldn’t it be a great proof point to show it both ways? Again – I fully accept that Demartek’s results were 100% correct in the AS-TESTED configs. But since the test was done this way, it can only mean 2 things:

    1. Either Demartek has no idea how to properly set up NetApp (and of the performance ramifications of what they did), in which case they are not qualified to test NetApp, or
    2. Demartek absolutely knows NetApp best practices and chose to set things up so that the system would behave as poorly as possible. This is the sinister option; I’d much rather go with #1 and give the benefit of the doubt.

    Either #1 or #2 invalidates the test thoroughly.

    It’s not just me saying this is flawed; everyone here but Pillar and Demartek (and not everyone was from NetApp) sees this test as flawed.

    The rebuild times would be much faster otherwise, and there would be no performance hit.

    I’m about as good at letting things go as a rabid rottweiler, and I’ll say it until I’m blue in the face: The original post’s purpose was not to target the Demartek report specifically, but since you and your buddies so enthusiastically responded, I’m responding back.

    The test is POINTLESS.

    Forget flawed – merely POINTLESS.

    Sure – show how a NetApp system, configured like NO OTHER NETAPP SYSTEM IN THE ENTIRE WORLD, EVER, has poor rebuild times vs Pillar.

    It’s a great example of using a meaningless corner case in sales campaigns. I’ve only encountered it ONCE in my life and once was enough to infuriate me.

    Kinda like proving that one can freely jump off a cliff. Sure, you can jump off the cliff, but should you? I mean, unless you REALLY want to.

    Why was the test done in the first place? Who at marketing thought this up and said “wow, we will sell tons of boxes this way!”

    3Par and XIV would destroy anyone in this test yet they don’t make such a big deal of it.

    Why not have normal tests instead? Exchange with umpteen thousands of users, big SQL, big Oracle, SAP, high concurrency access, VMware… you know, stuff people care about.

    If you need more info (including proof that rebuild times on large RAID-DP groups are faster than RAID-4), why don’t you read these, they might help you compete (or apply for a job):

    http://media.netapp.com/documents/tr-3574.pdf
    http://media.netapp.com/documents/tr-3298.pdf
    http://media.netapp.com/documents/tr-3437.pdf

    And, as a special gift to anyone that still believes the 2x + delta space requirement, page 16 here:

    http://media.netapp.com/documents/tr-3578.pdf

    All documents are freely accessible without needing to be a customer, there’s tons more if you go to netapp.com, look at library -> technical reports. So nobody can say that best practices guides “weren’t available” or somesuch nonsense.

    Regarding utilization: I can prove you can have 80% utilization WITH RAID-DP! Including spares and all overheads. Not in a disk-limited system (you need over a certain number of drives), but that’s still pretty impressive for that level of protection and functionality. BTW, what will happen if you DO simultaneously lose 2 disks in one of your RAID5 groups, for whatever reason? Do you lose access to all LUNs in the system? (Honestly, I don’t know; I’m not trying to steer the conversation. I’m curious, as I’m curious to see how 3Par does it. Marc, chime in pls.)

    And, finally, @ Jim:

    The SPC-1 “stunt” was at the invitation of Chuck, EMC’s CTO. When asked why EMC has no SPC-1 results, he invited anyone to go ahead and buy a CX, then post the results. So NetApp did.

    BTW – NetApp doesn’t have anywhere near a top SPC-1 score; the honors go to IBM’s SVC & 3Par. The aim of the “stunt” was to test like-for-like EMC and NetApp systems in SPC-1.

    SPC-1 submissions are pretty thoroughly scrutinized, and something that public, while ballsy, couldn’t have the holes the Demartek/Pillar report had. Everyone and their mother pored over that report. Which, actually, showed the EMC and NetApp systems as pretty darn close in performance, until a snapshot was taken on CX. Even EMC admits that CX snaps, by nature, are not the fastest, so the test makes sense.

    When I first saw the post-snapshot CX result a few years back I had my doubts, but I found out who conducted the tests and the guy used EMC best practices docs after all.

    If you look at the Veritest report, you see that, unlike Demartek, they actually went to the trouble of configuring the CX with both MetaLUNs and straight-up, and posted both kinds of results (with MetaLUNs being much faster since the spanned LUNs lent themselves to the workload).

    Here’s how it goes: If you post some kind of test and I explain to you, logically, quite a few areas where the test was flawed, then there may be some merit to what I’m saying.

    Looking at the Demartek report, it took 5 minutes for a few engineers to say “that’s not how you do that – wait, WHAT? HELL NO that’s not how you do that – and that – and THAT?????”

    If there are that many holes, then it just isn’t a good doc, FUD or not. Frankly, it’s amusing Pillar is still using it today after almost 3 years.

    And, ultimately, @ the EMC and HP folks and whoever else is guilty of FUD:

    I know you’re keeping quiet and are probably happy to see the discussion focusing on the Pillar guys but remember, Pillar is actually not guilty of the worst infractions regarding FUD… I have seen all the competitive documents used by the various vendors and guys, things need a rewrite.

    D

  29. @Jesse

    Well ok, if you dislike analogies so much I won’t use them anymore.

    Fact is, the guys performing the test didn’t configure the NetApp box for its optimal config, which is what was done with the other box. The NetApp array was configured to match the Pillar box, perhaps under the assumption that this would be the fairest comparison, since generic, non-array-specific logic would lead you there.

    Problem is that if you want to have a fair shot, you need both boxes running with optimal metrics for a rebuild test.

    Varying RAID levels don’t mean sh*t! Who gives a hoot if in theory RAID-4 is faster in a rebuild than RAID-DP? If that was all it boiled down to, we wouldn’t see variations in cache, controllers, interconnects, shelves and all the rest.

    If you want a fair comparison, have each array configured according to their own best practices and then perform the same test. And you know what? That might even mean that in a rebuild a RAID-DP configuration can be faster than a RAID-4 config.

    But that might mean that you would lose a benchmark, so instead assumptions are used to configure a competitor’s box, and the configuration must match my own. That is, after all, a fair comparison, since both arrays were built in the same way and only differ in their software implementation…?

  30. @Jim and Jesse,

    I would much rather see you guys do the same test with RAID 4. That way we can all truly compare apples with apples.

  31. @ Bas: RAID-DP is actually faster in rebuilds than RAID-4. There are tons of pluses to going RAID-DP and almost no drawbacks, which is why it’s the default RAID on NetApp systems.

    And, since someone will ask: in a same-sized RAID group, RAID-DP has a few percentage points less performance than RAID-4. Problem is, RAID-DP groups are never sized like RAID-4 groups; in fact, they’re 2x larger (so the parity overhead is the same ratio: 2 parity drives in 16 versus 1 in 8), and in practice RAID-DP always ends up being faster.

    D

  32. So, in the spirit of full disclosure – I’m part of the SE team at NetApp. I drank the NetApp KoolAid (must have done as I’ve been here nearly 10 years) and if you want to comment, my blog is at http://blogs.netapp.com/doneright.

    This is a very entertaining thread which has caused much debate for lots of people. However, at the risk of sounding like an old git, I’m losing the will to live.

    Guess what: RAID rebuilds happen because disks fail. It’s a fact of life, and why all of us are in the business in the first place – to stop customers losing data when a drive inevitably goes to the great scrap heap in the back of the server room.

    The problem I have with this whole thread is that it is largely irrelevant. The FUD piece is an interesting discussion – let’s just say that if everybody’s technology was as bad as the other vendors claim – would we really be able to sell it to the poor deluded punters you think we’re dealing with? I doubt it.

    So each vendor has some really good points and ideas (that’s how you got through VC funding in the first place…)

    BUT – this whole RAID rebuild thing? Please…

    Can we just kill the thread on it – here are at least 3 reasons (there are more, but off the top of my head…)

    1) No exec I ever talked to (who has budgetary power and the money to afford this kind of storage) gives a crap about RAID rebuild times – as long as his users are happy (that’s why we NICE things so user performance is unaffected – if you really want to know, check out the raid.reconstruct.perf_impact option, which can be set to low, medium or high).
    2) We know RAID rebuilds are “expensive”, so a long time ago we introduced “sick-disk copy” to reduce how often we need to do a full read of all blocks (i.e. straight-copy the good stuff off the failing disk and rebuild only the missing pieces). This means that the chances of users being affected are reduced even further.
    3) Based on (2), the reason for RAID-DP is RAID rebuilds – when are you most likely to hit a soft error on a disk? When you have to read every single other block in the RAID group to rebuild one that has already failed. Hitting a second error whilst rebuilding the first is what got people concerned in the first place, and it’s why we can use wide stripes (good perf) but maintain integrity and availability to users. That’s why we always use RAID-DP as a default.

    I’m sure this could run and run for ever but I think it’s now at the stage of pointlessness.

    Good work everyone on the proof points but probably time to go help some customers solve their problems now?

    Richard

  33. The SPC “stunt”, as you mentioned, went against EMC best practices, doing things like turning off cache at various points in the test to make the results seem negative.

    Had that cache been enabled the results would have been entirely different.

    There’s FUD right there.

    Hence the reason it was a stunt and the submission was technically worthless.

    As for lawyering up, feel free, but be aware that once it becomes a legal matter it’s completely out of the hands of those involved.

  34. @ Zilla and Jim:

    Regarding the SPC tests:

    SPC, like SPEC, is not a commissioned test like Veritest or Demartek. Plus, NetApp knew that everyone and their monkey would doubt the benchmark, so all attempts were made to be accurate. SPC, like SPEC, is as public as benchmarking gets.

    NetApp had the tests heavily scrutinized by SPC auditors that made sure all efforts were made to get good results and that best practices were followed, and only the best results were allowed to be published. It’s not like NetApp gave SPC a single set of CX results and said “here you go”.

    You and I both know really well that EMC disables the CX write cache for several other applications. SPC-1 simply seems to be another one of those. It’s OK.

    SPC told EMC they had the right to challenge the benchmarks for 45 days before they were published.

    EMC also had the right to challenge the benchmarks for up to 60 days after they were published.

    Again – this is NOT the same as Veritest or Demartek, where they can test what they like and publish at any time, unchallenged (just look at the crazy discussion I had with the Pillar/Demartek guys).

    EMC did NOT challenge the results in any way and therefore they were not pulled.

    Anyway, the CX wasn’t much slower than the NetApp box it was compared with (25K vs 30K). It only slowed down with snapshots, which makes sense due to their nature on the CX.

    This all happened 2 years ago. EMC at the time said SPC was meaningless, yet today I am glad to see EMC participating more actively at least in SPEC. I guess SPEC is not meaningless?

    I strongly advise EMC to participate in SPC in general, since, as ’zilla said on Twitter, the Vmax is an “I/O monster”, you might even be able to crush the 3Par and latest SVC results, which would be really cool!

    BTW @ the IBM guys: congrats, awesome result, just saw it!

    @ Marc Farley – the IBM gear cost a cool $7m, is that the “lab queen” definition? 🙂

    Still, an amazingly good result.

    D

  35. @Zilla .. you said “The SPC stunt as you mentioned went against EMC best practices doing things like turning off cache at various points due to the test to make the results seem negative.”

    There’s a fair degree of inaccuracy in that statement which I think needs to be addressed.

    1. “went against EMC best practices” – EMC has a number of different best practices regarding cache settings for various workloads on various platforms. Some of them, such as this one, recommend turning off write cache for VMware NFS workloads (http://blog.scottlowe.org/2010/01/31/emc-celerra-optimizations-for-vmware-on-nfs/). Can you refer to the one you’re talking about that would be relevant to the SPC workload?

    2. “turning off cache at various points” – The SPC rules say you’re not allowed to change the configuration of the system during the benchmark, so it’s either on _all the time_, or off _all the time_. As it turns out, write cache was turned off to _improve_ performance.

    3. “make the results seem negative” – The SPC rules say you have to submit the maximum-performing benchmark, which means, amongst other things, that we had to show the auditor that we had tried various configurations (including cache settings) and had submitted the one that performed the best. If we got it wrong, EMC or Dell had plenty of time to challenge the results and force us to pull the submission. Notably, that did not happen.

    4. “Had that cache been enabled the results would have been entirely different.” – Yes, they would have been slower, otherwise we would have published with write caching turned on.

  36. Full Disclosure – I am unemployed at this moment – but Chuck would call me a NetApp fanboi, as I am going to say something in their favour.
    This FUDfest is typical of what happens when you try to have a shootout but insist on holding the other guy’s gun. As has been pointed out, only independent, third-party-controlled and peer-reviewed testing really has credibility; the rest is useful, but take it with a fistful of salt.
    The critical comment I would highlight is from @Jesse: “for Pillar it is important that the maximum redundancy of the Axiom be maintained at all times, and if we lose a layer, the faster we get it back the better.”
    Here he clearly articulates that Pillar NEEDS fast rebuild… or… oh dear, it’s gone.
    In contrast, NetApp has RAID-DP and should not care if rebuilds take longer, so you guys are spending too much effort reacting to the misdirection.
    FUD is aimed at creating uncertainty – the only defense is creating confidence, and getting into tech mode does not do this.
    @Richard – well said

  37. Sorry for being late. I think that with XIV everything is about… timing and failure rates => just statistics. The questions here are: how often does this “commodity hardware” fail, how long has IBM tested this system, and how reliable are the disks?
    XIV in a clustered way of life, 6 to 15 nodes, sounds really great, and it is as secure as a NetApp controller cluster (SW-based, not appliance-based).
    The “magic” here is: how safe is the 30-minute window in which 2 disks in different nodes can fail simultaneously? How often does a controller fail? And 2 controllers, before HW support arrives? How often does a misconfiguration put a system down?
    I think XIV needs a chance to show how reliable it is.
