NetApp posts SPC-1 results

NetApp posted some SPC results showing their 3040 box performing pretty well in SPC-1 relative to an EMC box.

There have been rumors that when you run multiple features on a NetApp box, performance suffers. Which kinda negates the whole value prop of NetApp (since that’s typically why people choose NetApp – they want one box to do everything).

A realistic test would be to have OTHER apps sharing the array (on other spindles), as is usually the case. Almost nobody dedicates an entire array of that size to a single app.

Have the box do CIFS, NFS, iSCSI AND FC.

Show performance over a significant period of time (another point NetApp detractors use – that performance declines over time due to WAFL fragmentation).

THEN show the performance delta as each feature is enabled.

Obviously it’s hard to do all that and still keep the SPC results kosher, but it would be a worthwhile addendum and, if successful, would shut up the NetApp detractors (since that’s a usual technique for selling against NetApp). I’d also show performance in degraded mode.
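
To make this concrete, here’s roughly the kind of harness I have in mind – a purely hypothetical Python sketch, not anything the SPC would sanction; toggle_feature() and run_workload() are placeholders for whatever array CLI and load generator you’d actually use:

    # Hypothetical sketch: enable one feature at a time, then hammer the box for a
    # long stretch so any aging effects show up, logging IOPS and latency as you go.
    import csv
    import time

    FEATURES = ["baseline", "snapshots", "replication", "dedupe", "multiprotocol"]

    def toggle_feature(name):
        """Placeholder: enable the named feature on the array under test."""
        print("enabling %s (stub)" % name)

    def run_workload(duration_s):
        """Placeholder: drive a mixed OLTP-style workload, return (iops, avg_latency_ms)."""
        time.sleep(0)        # a real harness would block for duration_s
        return 0.0, 0.0      # stub numbers

    with open("feature_delta.csv", "w", newline="") as f:
        out = csv.writer(f)
        out.writerow(["features_enabled", "hour", "iops", "avg_latency_ms"])
        enabled = []
        for feature in FEATURES:
            toggle_feature(feature)
            enabled.append(feature)
            for hour in range(72):   # days per step, to catch decay, not a fresh-box burst
                iops, latency = run_workload(duration_s=3600)
                out.writerow(["+".join(enabled), hour, iops, latency])

The numbers themselves don’t matter here; the point is the shape of the test – cumulative features, long runs, and a latency column sitting next to the IOPS column.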

Anyone have any data on NetApp performing either way when used as a multi-role box?

A note on the EMC config and interpreting those benchmarks in general, be they SPC or SPEC or whatever: ALWAYS READ THE FULL DISCLOSURE regarding the test, don’t just look at the graph. If you’re not technical, get a techie to explain it to you.

For instance, looking at the way the EMC box was set up, I highly doubt it was done using EMC’s best practices. To wit:

  1. They didn’t maximize the write cache
  2. They don’t seem to have used separate spindles for the snapshot area (a differentiator since, unlike NetApp, EMC not only allows such a thing but actually encourages it)
  3. They could have used MetaLUNs instead of striping with Windows.

I’d be willing to bet dollars to doughnuts that the NetApp box was set up properly 🙂

Another thing: look at the response times in the graphs.

Like they say, “only believe 50% of the statistics you read”.

EDIT added Feb 22, 2013:

I wrote this before I knew how NetApp tested the EMC gear. The tests for the Clariion were audited by the SPC-1 auditors, and only the best results were shown. MetaLUNs and write caching were both tried but resulted in slower results. EMC was given an opportunity to have the results not published, and after publication they were again given an opportunity to pull the test, neither of which were done. SPC-1 just proved to be an unfriendly workload for the Clariion, that’s all.

To this day EMC is a member of SPC yet has no submissions as of Feb 22, 2013.

D

18 Replies to “NetApp posts SPC-1 results”

  1. 1) Various read/write cache settings were used, and what was ultimately submitted was the best result we could obtain after several runs and configs.

    2) Separate spindles were used for the snapshot area. This is shown towards the end of the full disclosure document, where the LUNs and RAID groups are created.

    3) NetApp tested MetaLUNs as well, but the results did not provide material changes compared to the ones submitted. As you may or may not know, configuring a box full of MetaLUNs is NOT an EMC best practice; more importantly, as stated on pg. 16 of EMC’s FLARE 26 best practices guide, MetaLUNs are not used to provide a performance boost, since they do not increase parallelism or multi-threading.

    Geia sou
    Nick

  2. Hi Nick,

    Thanks for tackling one part of my question. The bigger question still remains – what I’m most interested in is finding out how NetApp boxes degrade over time depending on what they’re doing (i.e. when used with all the features they’re capable of).

    Regarding the EMC config part:

    A CX3-40F should probably have been used (double the back-end throughput, exact same price, but you lose iSCSI) given the number of drives. If you talk to EMC employees who are part of their performance group, their recommendations typically don’t quite match what one may find online or in the manuals. The “best practices” are pretty generic. The EMC people I’ve met would not have configured the box this way. Maybe it would be a good idea to invite them, let them configure it, and re-run the benchmarks? I’m sure they’d be game. That way, each box will have been configured by the company that makes it. Any other way is suspect in the eyes of consumers.

    How does that saying go again? “Caesar’s wife must not only be virtuous, she must also appear virtuous.” The less reason you give anyone to critique your results, the more successful you’ve been.

    Another question: exactly how many spindles were used for the EMC snapshot area? How many for the NetApp snapshot area?

    Reading the disclosure, it seems that 30 LUNs were used in EMC’s case – but are these on separate SPINDLES from the rest of the data?

    Was the striping done properly so spindles were not stepping on each other?

    What bears mentioning is that EMC and NetApp do snapshots entirely differently, as well. NetApp doesn’t do a copy-on-first-write, but their approach (EMC and others say) screws up performance long-term due to fragmentation. (Rough arithmetic on the copy-on-first-write overhead in the P.S. below.)

    Take care,

    D
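
    P.S. To put rough numbers on the copy-on-first-write overhead I’m talking about – my own back-of-the-envelope assumptions, not either vendor’s figures:

        # First overwrite of a block under copy-on-first-write: read the old block,
        # copy it into the snapshot area, then write the new data = ~3 back-end I/Os.
        # A redirect-style (WAFL) snapshot just writes the new block somewhere else
        # = ~1 back-end I/O (metadata updates amortized). Ballpark only.
        def cow_backend_ios(first_overwrites):
            return first_overwrites * 3

        def redirect_backend_ios(first_overwrites):
            return first_overwrites * 1

        blocks = 1_000_000
        print("copy-on-first-write:", cow_backend_ios(blocks), "back-end I/Os")
        print("redirect-on-write:  ", redirect_backend_ios(blocks), "back-end I/Os")

    Whether the redirect approach then costs you later because of fragmentation is exactly the open question above.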

  3. Dimitri,

    The number of spindles used in the CX is part of the full disclosure report, towards the end. All the data is there.

    Furthermore, it is important to note that as part of the SPC-1 rules, when publishing someone else’s results there’s a provision requiring a 60-day review period during which the vendor – or *any* member – can challenge the results *before* the result is officially accepted. EMC was notified of this by the auditor himself and *did not* challenge.

    I don’t know if you are aware of this, but for the better part of 2 years now NetApp has provided FlexShare (part of ONTAP), which is used to provide safe multi-tenancy (i.e. CIFS/NFS/iSCSI/FC).

    The best practices came right out of EMC’s documentation, and if that’s “generic” and not applied to production environments, then there’s no reason for it to exist.

    Furthermore, remember what I said above regarding the fact that EMC did not officially challenge the results although they were given the opportunity to do so. That speaks for itself, so any comments regarding the configuration are, in my opinion, nothing more than a waste of time.

    Cheers

  4. Nick,

    There are a number of problems here:

    Is it now OK for me to start sending NetApp a thousand benchmark runs daily for them to challenge? Because if they don’t challenge the thousand a day (or 60 thousand within the 60-day review period), then they must be valid – at least that’s your argument. See where the SPC guys really messed up here? If you are a member of the SPC and a party to the voting process, then I say you should be held to those rules (which are completely stupid to ever agree to) until you pull out of the SPC group; but to randomly put out hack-attack jobs against non-members and require them to respond is, in my opinion, very, very low (and speaks volumes about the SPC & NetApp). As a customer of both companies for years, it used to be the opposite – NetApp played the moral high ground and EMC was on the bottom – but now it’s flipped. NetApp is wallowing on the ground and EMC is morally much higher.

    There are best practices and then there are best practices. I’d like to see the document where it says to create things the way NetApp did. Sure, there’s the concept of SAME (stripe and mirror everything), but then there’s stupidity. The way it was configured, the 3x Windows stripes go across every single spindle. That means moving from one storage unit to another requires *massive* head movements (like almost half the drive). My understanding of the SPC benchmark is that it tries to emulate a form of OLTP database where you are doing lots of I/O to different storage units (redo logs, data volumes and user volumes), so the way NetApp configured it guaranteed that this would perform poorly relative to the way it could actually perform (see the rough seek-distance sketch at the end of this comment).

    I think the SPC and its members have no idea the can of worms they have opened up. They used this to try and force EMC into their benchmark club; the whole allow-non-member-testing provision was targeted at only one vendor. My prediction is this: in their haste to attack a single vendor, the SPC members will start getting attacked by all the other members of the SPC, and since a result stays published for 120 days even after being withdrawn, members will realize just what kind of animal they’ve created. The SPC members will be angered that their “EMC-specific” provision is now coming back to attack them, and they will be forced to re-evaluate their decision and remove that provision.
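
    To put a rough number on the head-movement point above – a toy model with made-up, uniform-random assumptions, not anything measured on a real Clariion:

        # Treat a disk as LBA range [0, 1). "log" I/O is sequential-ish and confined to a
        # narrow band; "data" I/O is uniformly random across the disk.
        # Shared layout: the same heads service both, so every log write seeks back from
        # wherever the last data I/O left the head.
        # Dedicated layout: log disks see only log I/O, data disks see only data I/O.
        import random

        def avg_seek_shared(n=100_000, log_band=(0.0, 0.05)):
            head, total = 0.5, 0.0
            for _ in range(n):
                data_pos = random.random()            # random data I/O
                total += abs(head - data_pos)
                head = data_pos
                log_pos = random.uniform(*log_band)   # next log write
                total += abs(head - log_pos)
                head = log_pos
            return total / (2 * n)

        def avg_seek_dedicated_data(n=100_000):
            head, total = 0.5, 0.0
            for _ in range(n):
                pos = random.random()
                total += abs(head - pos)
                head = pos
            return total / n                          # log disks: ~0, it's sequential

        print("shared spindles, avg seek (fraction of stroke): %.2f" % avg_seek_shared())
        print("dedicated data spindles:                        %.2f" % avg_seek_dedicated_data())

    The shared case lands right around half the stroke – that’s the “almost half the drive” I mean – while with dedicated spindles the data heads travel about a third of the stroke and the log heads barely move at all.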

  5. Dimitri,

    We can sit and argue about this all day long, but the BOTTOM line is that we published results based on a widely accepted, industry-standard benchmark, with verifiable, audited configurations and COMPLETE transparency – not based on some benchmark created in the sanctity of our own labs.

    And speaking of “high moral ground”, let me point you to this EMC report on NetApp, published not long ago in the sanctity of their OWN labs…

    http://www.mediafire.com/?50ddff9a5rc

    NetApp’s best practices are readily available in the technical library on http://www.netapp.com

    And BTW, we published 2 reports (with and without snapshots), not “thousands”, and I would certainly expect that a large company like EMC, which invests “billions”, should easily be able to respond in 60 days…

    Have a good day

  6. Nick,

    You’ll note I said NetApp *used* to play on a high moral ground compared to EMC, and now NetApp is wallowing in the same muck. Yes, two years ago EMC made a report that said NetApp performance wasn’t as good in their own tests, but are you actually trying to say that a fully published SPC benchmark is the same as a vendor claim against another? Are you really saying they carry the same weight with customers, and in marketing on the front page of websites (though that tit-for-tat stuff from EMC pisses me off as well)? No matter what, two wrongs don’t make a right; saying “he did it too” doesn’t mean much.

    I’m not sure you quite understand what I was getting at when I said thousands. It was an example trying to make blatantly obvious what this situation has a very high probability of spinning into: everybody testing each other’s storage and publishing results, and if you don’t respond, the results must be valid. It doesn’t take long to get to exponential numbers. See what I mean – it wasn’t that there are thousands today, but the Pandora’s box that has now been opened by the SPC officially sanctioning it, which in my mind is much worse than any trash talk between NetApp & EMC (correct benchmark or not). Now I’m going to get inundated by vendors trying to push the “numbers” they got from the competition; the whole “if I don’t refute it then it must be valid” concept is wrong and reflects very poorly in my eyes. What gives you or anybody else the right to force someone to respond to something incorrect? I could start a company today and start producing things against NetApp; heck, 10 of my friends could do it as well, and 10 of their friends could too. This is a very dangerous precedent that the SPC is officially sanctioning, and it’s only going to annoy me, the customer, that much more.

    I’ve noticed that nobody has refuted the statement that the way it was configured required massive head movements – the opposite of how most storage admins who know anything about the storage would configure it. I’m not a member of the SPC so I don’t know for sure, but if the benchmark runs as I expect it to, the way it was configured was set up to fail. Just like if I didn’t run the “waffle iron” on our filers for 3 years and then ran a benchmark – it would run like crap. Transparency doesn’t change the fact that it was set up to fail, since someone who isn’t intimately familiar with that storage vendor would have no idea about it and would never be able to discern it. Seriously, do you think a normal person who has never touched a Clariion would be able to pick up that it was slapping the head back and forth, without years of knowledge of the Clariion?

    There is a reason why I truly dislike storage benchmark numbers: THEY ALL SUCK. I’ve never been able to achieve the published numbers out of EMC, NetApp, Sun, etc.; they are not real world, they’re only used to try and get an unrealistic edge on the competition. Only when the stuff hits my floor and *I* run it does the benchmark count. Vendors who push benchmarks or IOPS numbers under my nose get an immediate minus when I’m evaluating them, because it’s all unrealistic, but they don’t want to own up to it. I’ve gotten into actual arguments with people trying to sell me things about how the benchmark “couldn’t” be cooked, that it’s real world; but in the end they won’t stand behind that level of performance.

    Don’t think I’m picking only on NetApp here; it’s just that NetApp has fired the first volley from an officially sanctioned benchmark, which to me is very disappointing, as I have a lot of their storage on the floor and thought they acted a bit differently. But it’s obvious they have no problem lowering themselves.

  7. Dimitri,

    Would you have felt any better had NetApp done this in their own labs like EMC did? The first volley, officially or unofficially, was fired by EMC. All we did was respond with a benchmark regarded by most as an industry standard – and the only one available.

    Furthermore, some folks have never considered the fact that it was in NetApp’s best interest to publish the best results out of both boxes, for obvious reasons…

    Benchmarks… I’m not fond of benchmarks myself, and the reason we did this one was NOT to trash the Clariion. The reason was to solidify our position in the block space by comparing our array against the most widely deployed array in the industry, as well as prove what we’ve been saying for years: that COW snapshots have a profound performance effect. Furthermore, our intent has never been to break world records in benchmarking. Our goal is always to be close enough to the top that performance is off the table, at which point the conversation revolves around things that matter, like flexibility, options, functionality and data management. And in that space, nothing comes close to a NetApp box.

    Cheers

  8. Nick,

    I’m not the same person as “InsaneGeek”, BTW. It’s an open forum and I won’t mod anyone unless they get really obnoxious 🙂

    I do understand you completely, and I believe your intentions are good – otherwise you’re just putting yourself in a bad situation.

    What I was trying to say regarding special tunings for benchmarks is that, just like with SPEC, boxes typically do NOT follow the “normal” best practices in the docs. Just look at ANY SPEC config disclosure: I’ve seen things like massive RAID 0 arrays just to make SPEC SFS go faster. That’s insane and nobody would ever do it in real life. It’s fast, though.

    Especially when carving out an array that will only ever be doing ONE thing in life 🙂

    As to why EMC didn’t respond, don’t forget how big they are – who did you contact? It’s like saying “I contacted Italy and they didn’t respond”. Did you contact their performance team? Marketing? PR? CEO? Just curious.

    I kinda understand what “InsaneGeek” is trying to say regarding the striping config – I’m not so sure it was done right (I’m not saying it was intentionally done in a way that would hurt performance, mind).

    Also, consider my CX3-40F comment. I do believe the wrong model was chosen from the get-go – if I were going to get a box whose only goal in life would be to be fast at fiber-connected database work, I’d get the -40F with double the back-end throughput for sure. Doesn’t affect the price any.

    When it comes to straight small-block random IOPS, it mostly comes down to spindle count anyway, regardless of manufacturer, unless the CPUs are maxed out and/or a silly setup has been implemented (rough arithmetic in the P.S. below).

    In the end, I think ALL NetApp needs to do is amend the disclosure – not just show the command lines, but include a little chart of the drives showing how the LUNs are laid out, where the snaps reside, and so on (this is how most people would prefer to see it; nobody but hard-core CX admins will understand the command-line arguments).

    Just make it clear for the layman to get it, that’s all.

    Take care,

    D
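
    P.S. The spindle-count arithmetic I mean, using rule-of-thumb per-disk numbers (ballpark figures of mine, not vendor specs):

        # Ballpark random-I/O capability per spindle; real numbers vary by drive and workload.
        PER_DISK_IOPS = {"15k_fc": 180, "10k_fc": 140, "7.2k_sata": 80}

        def raw_random_iops(spindles, disk_type="15k_fc"):
            return spindles * PER_DISK_IOPS[disk_type]

        def host_write_iops(spindles, disk_type="15k_fc", raid_write_penalty=2):
            # RAID 1/10 costs ~2 back-end writes per host write; RAID 5 is ~4
            return raw_random_iops(spindles, disk_type) // raid_write_penalty

        # e.g. a hypothetical 144 x 15K FC spindle config:
        print(raw_random_iops(144))    # ~25,920 random read IOPS from the spindles
        print(host_write_iops(144))    # ~12,960 host write IOPS behind RAID 10

    Cache, write coalescing and CPU limits move these numbers around, but for small random I/O the spindle count sets the ceiling either way.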

  9. Dimitri,
    My apologies, I thought you were InsaneGreek.

    EMC was contacted by the SPC-1 auditor. In fact, he publicly stated so in a recent interview with SearchStorage.

    You’re assuming that the additional front-end targets and back-end loops on the CX3-40F would have made a difference… I would *not* assume that at all in this scenario, given the SPC-1 workload… 🙂

    The SPC rules dictate that vendors publish the steps used, and that these steps be verified by the auditor and go through peer review. Can’t amend that stuff, as it’s part of the requirements.

    BTW… you used to work for MTI, right? I did too, back in 1999-2000.

    Like I’ve alluded to before, we tried several configs. Even if one could somehow get lucky and squeeze out 5-10% more, that wouldn’t be enough to alter the results given where the bottlenecks were, and it certainly wouldn’t change the COW effect either.

  10. Hey Nick,

    I’m Greek, and only a little insane, but never use pseudonyms. Plus, I spell way better than “InsaneGeek” (sorry dude, whoever you are).

    Again – WHO at EMC was contacted by the Auditor? If they just sent a letter to EMC HQ, good luck 😉

    I’m not familiar with the SPC-1 workload, if it’s pure IOPS then more back-end ports won’t necessarily help, true.

    OK, if amending the disclosure won’t work, then post the amendments in a blog or something – I’m sure you understand what I’m looking for. Stating that you tried it all and it wouldn’t go faster is fine but still doesn’t quite prove the point, and there’s still the issue that maybe the way it was all striped was horribly wrong.

    It’s not surprising NetApp snaps are faster: they’re a function of the filesystem and by definition don’t do the data copying of a CoW snap. I’d still like to see if there’s degradation over time due to WAFL fragmentation, though. EMC always says there is, and NetApp customers know they have to defrag the boxes from time to time, but just how big is the slowdown? Every graph I’ve seen shows NetApp performance as great in the beginning, then falling sharply after capacity is used up and snaps come and go.

    Seems like everyone worked at MTI at one time or other 🙂

    D

  11. Dimitri,

    At NetApp there are a considerable number of folks with very good knowledge of the Clariion and what it takes to configure it to achieve representative results. Clearly, we didn’t just grab the array, carve up some LUNs and run the benchmark.

    PS. I don’t know which person the auditor contacted, and frankly I couldn’t care less. It’s not our job to contact EMC. The SPC makes the rules and is responsible for enforcing them. Anyone who wants specifics needs to contact the SPC.

    Over the years the old “WAFL fragmentation” candy has worn thin. It has become an urban legend of similar proportions to the Neiman Marcus cookie recipe: some people claim to have it, but nobody ever does. Same with the WAFL FUD. “Enlightening” whitepapers, concocted how-to procedures, hearsay, innuendo – yet no data from production environments. Real environments, as you called them…

    It’s always the same thing when you start peeling back the proverbial onion… “I have a friend of a friend whose father’s brother has experienced it…” I say it’s time for a new gig…

    Frankly, if this is the only thing the competition has been able to regurgitate since NetApp’s inception, we’re in great shape…

    Take care

  12. Wow, this “geek” talk always cracks me up. Broken down to its simplest terms it goes like this: “I work for vendor A, and these tests we did show why vendor A’s product is better than vendor B’s product.” Certainly, in short order you will get vendor B’s response, which will look just like vendor A’s with the letters swapped.

    The bottom line is this: when you get into enterprise-level storage, software, etc., you can’t go wrong. No one would be buying this stuff if it didn’t work, and work well at that. Having used many products from both vendors, it all boils down to cost and functionality. What are you trying to do with the product? For us, we wanted zero-impact snapshots, which NetApp does so well because of how it writes data. We also wanted the option to do NFS since we run so much VMware, and there are benefits to NFS in large ESX deployments.

    Plain and simple, more IOPS doesn’t mean anything to me. Does it work, does it do what I need, is it reliable, is it cost-effective, and does it help keep my company running and me employed? EMC, NetApp, IBM, HP, and Hitachi are all fine choices, and someone somewhere running storage from one of those vendors will, I’m sure, be pleased with the results, benchmark or no benchmark.

  13. Sometimes the “geek talk” is all about bragging rights.

    Like high-performance cars that are otherwise similar but one may have 5% more horsepower and the other 5% more torque. Entire forums are dedicated to discussing such issues 🙂

    In reality both cars will be plenty for almost any reasonable driving demand.

    With storage it’s about more than the technology; a lot of it is politics.

    Do you want to be a shop with a single vendor or as close to that as possible? Then not many companies can satisfy that requirement – you’re looking at EMC, IBM, HP, maybe Sun.

    The problem is that going for the single vendor is almost never the right choice if you truly want the best technology.

    It also matters how much you want to push your gear. WAFL fragmentation or EMC taking snapshots slower are only issues in very few environments. But the issues do exist.

    And BTW NFS for ESX is nice but you do lose certain features. But it’s probably the easiest way to run ESX, by far. No messing with FC or iSCSI.

    Of course, a Sun cluster with some decent JBOD and 10 Gbit NICs and ZFS at the back end is one of the fastest and CHEAPEST ways to do NFS for ESX 🙂

    D

  14. Well, I guess I should say that we wanted flexibility in our environment. To us, NetApp provided the most flexible options for configuring storage. FCP, iSCSI, NFS, and CIFS are all just software licenses. In fact, the software is already installed on all their appliances; I just need to enter a license key to activate it. We also liked that we could grow into larger FAS models without doing a data migration. I remember going from a CX-300 to a CX-500, and it was a royal pain to have to copy that data between the SANs.

    To be honest with you, for what we are doing we won’t be able to push any of these new SANs to the point where we need to worry about performance tuning. Heck, we are moving from an IBM DS4700, which doesn’t support any storage virtualization anyway (MetaLUNs, aggregates, disk groups, etc.), and performance is rocking on that thing. The FAS 3040 from NetApp was the perfect choice for us, and I hope we don’t ever run into the fragmentation issues I keep seeing EMC people reference (I haven’t heard the same complaints from EVA or IBM users).

  15. Upgrading from CX to CX typically doesn’t mean you need to copy ANY data; it’s usually a data-in-place upgrade. If you do have to copy, you can use SAN Copy – not a pain, really.

    How did you do it?

    BTW there is a defrag command in the NetApp CLI, check it out. In the past I think you had to take it offline to do the defrag but maybe now you don’t.

    A 3040 is a decently sized box; I hope you have the clustered version (3040c), since otherwise if the controller fails you have no redundancy (the less-advertised fact about both NetApp and EMC Celerra boxes: they’ll let you buy just one controller if you don’t have enough cash. I say never do it).

    BTW, NetApp normally charges you for 2x the software and controller hardware, whereas I’ve noticed EMC gives you breaks. Of course, it all depends on the individual deal you cut.

    D

  16. Yes, we did use SAN Copy, but with the NetApp you just replace the head. We do have a full cluster, unlike what we had with the Clariion. Here is a point that a lot of people don’t understand: there is a big difference between having redundant controllers (SPA and SPB) and having a clustered SAN. Cost-wise, EMC couldn’t come close once we started talking about a fully clustered SAN and adding things like replication and snapshot capabilities.

    Additionally, the replication stuff with the NetApp is licensed per SAN, not per LUN you want to replicate, à la EMC.

  17. Oh, the “reallocate” command is the defrag you are referring to. It does not require any downtime, and it can be scheduled. It is really only necessary to run if you don’t set your aggregates up the right way the first time, or if you add more storage and want to spread the I/O across all the spindles – similar to the HP process of “leveling” on the EVA.

  18. You can also replace the head on EMC (they call it a data-in-place upgrade). Typically it doesn’t involve much more than recabling and sometimes upgrading FLARE.

    You’d use SAN Copy if you want to completely discard your old box; if you want to keep the old trays, then the data-in-place upgrade works.

    Not sure what you mean about EMC charging per LUN for replication; EMC has various products and none of them are priced per LUN. Who’s giving you this information? Either your EMC rep was trying to fleece you, or someone didn’t quite understand what’s going on.

    Clustering can happen in different areas to solve different problems:

    Controllers (redundant) – SPA and SPB are indeed clustered (mirrored RAM, proper failover etc)
    Switches (dual)
    The entire array (replicate the array somewhere else)
    Data movers (in the case of a Celerra) provide clustering for the NAS services

    In an ideal world you need 2 or more of everything.

    I fail to see how the controllers in a Clariion are not clustered. I don’t really get what you mean…

    D
