EMC’s VNX2 forklift: The importance of hardware re-use, slowing down obsolescence, and maximizing your investment

It was with interest that I watched the launch of EMC’s VNX refresh. The updated boxes got some long-awaited features, EMC talked a lot about how some pretty severe single-threaded bottlenecks were removed, more CPU and memory was put in, and there was much rejoicing.

Not really trying to pick on the new boxes (that will be in a future post, relax :)), but what I thought was interesting was that the code that makes most of the new features a possibility cannot be loaded on current-gen VNX boxes (not even the biggest, the 7500, which has plenty of CPU and RAM juice even compared to the next gen boxes).

Software-defined storage indeed.

The existing VNX disk shelves also seemingly can’t be re-used at the moment (correct me if I’m wrong please).

This forced obsolescence has been a theme with EMC: Clariion -> VNX -> VNX2 are all complete forklift upgrades. When the original VNX was released, it was utterly incompatible with the CX (Clariion) shelves (SAS replaced FC connectivity). Despite using the exact same code (FLARE).

Other vendors are guilty of this too – a new controller is released, and all prior investments on disk shelves are rendered useless (HDS did this with several iterations of AMS, maybe even with AMS -> HUS, HP with EVA…)

I understand that as technology progresses we sometimes have to abandon the old to make room for the new but, at the same time, customers make significant investments in N-1 technology – and often want to be able to re-use some of their investment with N (and sometimes N+1).

I just had this conversation with a customer, and he said “well, I throw away my gear every 3 years, why should I care?

Let’s try a thought experiment.

Imagine you just bought a system that’s running the fastest controllers a company sells today, and you got 1PB of storage behind it.

Now, imagine that a mere month after you purchased your controllers, new ones are released that are significantly faster. OK, that stuff happens.

Your gear is not 3 years old yet. It’s 1 month old. It’s running OK.

Now, imagine that your array runs out of steam 6 months later due to unprecedented performance growth. Your system is now 7 months old. You can’t just throw it away and start fresh.

You could buy a new storage system and migrate some of the data to share the load. However, you don’t need more space – you just ran out of controller headroom. Indeed, you still have tons of free space.

But what if you could replace the controllers with the new, beefier ones? And maintain your investment in the 1PB of storage, cache etc? Wouldn’t that be nice?

Or at least be able to move some of the storage pools you may have to the new family of controllers? Even if you had to reformat the disk?

Well – most vendors will tell you “sorry, no, you need to migrate your data to the new box”.

Let’s try another thought experiment.

You bought a storage system a year ago. It performs fine, but it lacks true deduplication capabilities. You have determined it would save you a lot of storage space (= money) if your array had deduplication.

The vendor you purchased the system from announces the refreshed storage OS that finally includes deduplication. And that same vendor made a truly gigantic fuss about software-defined storage, which made everyone feel software would be the big enabler and that coolness was a mere firmware upgrade away.

However, you are eventually told they will not allow the code that enables deduplication to be loaded to your array, and, instead, they ask you to migrate to the refreshed array that supports deduplication. Since the updated code somehow only runs on the new box. Something to do with unicorn milk.

But your array has plenty of CPU headroom to handle deduplication… and you could reformat the disks given some swing storage shelves if the underlying disk format is the issue. But the option is not provided.

How NetApp does things instead

At NetApp we sort of take it for granted so we don’t make a big fuss about software-defined storage, but hardware was always considered an enabler for the software and not the other way around. 

For instance: deduplication was released as a free software upgrade for Data ONTAP (the OS for our main line of storage). Back in 2007. For all storage protocols.

In general, we try to let systems be able to load at a minimum N+1 software releases, but most of the time we utterly spoil customers and go far above and beyond, unless we’re talking about the smallest boxes, which naturally have less headroom.

For example, the now aging FAS 3070 I have in the local lab (the bigger of the older midrange boxes, released in 2006) supports anything from ONTAP 7.2.1 (what it was released with) to ONTAP 8.1.3 (released in mid-2013).

This spans multiple major ONTAP releases – huge changes in the code have happened during those releases: 7.3, 8.0, 8.1… Multiple newer arrays were also released as replacements for the 3070: 3170, 3270, 3250.

Yet the 3070 soldiers on with a fully supported, modern OS, 7 years later.

What arrays did our competitors have back then? What is the most modern OS those same arrays can run today? What is that OS missing vs the OS that competitor’s more modern arrays have?

Let’s talk disk shelves.

We used FC loop connectivity for the older shelves (DS14). We then switched to fancy multi-channel SAS and totally different shelves and disks, but never stopped supporting the older shelf technology, even with newer controllers.

That’s the big one. I have customers with DS14 shelves that they purchased for a 3070 that they now have on a 3270 running 8.2. It all works, all supported. Other vendors cut support off after the transition from FC to SAS.

Will we support those older shelves forever? No, that’s impossible, but at least we give our customers a lot of leeway and let them stretch their hardware investments significantly longer than any other major storage vendor I can think of.

Think long term

I encourage customers to always think long term. Try to stop thinking in 3-year increments. Start thinking of other ways you can stretch your investment, other ways to deploy older gear while still keeping it interoperable with newer hardware.

And start thinking about what will happen to your investment once newer gear is released.


Technorati Tags: , , ,

How to decipher EMC’s new VNX pre-announcement and look behind the marketing.

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab only etc).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to prove that the increases are not proportionally as much as EMC claims they are, and/or they’re not so much because of software, and, moreover, that some planned obsolescence might be coming the way of the VNX for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

MCX core util arrow

I added the arrows for clarification.

Notice that the chart above left shows the current VNX using, according to EMCmaybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we’ll come back to it. But, currently, if true, this shows extremely poor multi-core utilization. Seems like there is a dedication of processes to cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software,  they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then, comes the interesting part. They did an IOmeter test for the new box only.

They mentioned how the current VNX 7500 would max out at 170,000 8K random reads from SSD (this in itself a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS). And that the current model’s relative lack of performance is due to the fact its software can’t take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O. Which is impressive, indeed, even though that’s hardly a realistic way to prove performance, but I accept the fact they were trying to show how much more read-only speed they could get out of extra cores, plus it’s a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn’t as dramatic as EMC claims… right? :)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremeIO if the new VNX is that fast? What about VMAX? :)

Point #4 above is important. For instance, EMC has been touting multi-core enhancements for years now. The current VNX FLARE release has 50% better core efficiency than the one before, supposedly. And, before that, in 2008, multi-core was advertised as getting 2x the performance vs the software before that. However, the chart above shows extremely poor core efficiency. So which is it? 

Or is it maybe that the box demonstrated is getting most of its speed increase not so much by the magic of better software, but mostly by vastly faster hardware – the fastest Intel CPUs (more clockspeed, not just more cores, plus more efficient instruction processing), latest chipset, faster memory, faster SSDs, faster buses, etc etc. A potential 3-5x faster box by hardware alone.

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1. Since it’s all about the software after all:)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe that there’s some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.



Technorati Tags: , , , , , , ,

More EMC VNX caveats

Lately, when competing with VNX, I see EMC using several points to prove they’re superior (or at least not deficient).

I’d already written this article a while back, and today I want to explore a few aspects in more depth since my BS pain threshold is getting pretty low. The topics discussed:

  1. VNX space efficiency
  2. LUNs can be served by either controller for “load balancing”
  3. Claims that autotiering helps most workloads
  4. Claims that storage pools are easier
  5. Thin provisioning performance (this one’s interesting)
  6. The new VNX snapshots

References to actual EMC documentation will be used. Otherwise I’d also be no better than a marketing droid.

VNX space efficiency

EMC likes claiming they don’t suffer from the 10% space “tax” NetApp has. Yet, linked here are the best practices showing that in an autotiered pool, at least 10% free space should be available per tier in order for autotiering to be able to do its thing (makes sense).

Then there’s also a 3GB minimum overhead per LUN, plus metadata overhead, calculated with a formula in the linked article. Plus possibly more metadata overhead if they manage to put dedupe in the code.

My point is: There’s no free lunch. If you want certain pool-related features, there is a price to pay. Otherwise, keep using the old-fashioned RAID groups that don’t offer any of the new features but at least offer predictable performance and capacity utilization.

LUNs can be served by either controller for “load balancing”

This is a fun one. The claim is that LUN ownership can be instantly switched over from one VNX controller to another in order to load balance their utilization. Well – as always, it depends. It’s also important to note that VNX as of this writing does not do any sort of automatic load balancing of LUN ownership based on load.

  1. If it’s using old-fashioned RAID LUNs: Transferring LUN ownership is indeed doable with no issues. It’s been like that forever.
  2. If the LUN is in a pool – different story. There’s no quick way to shift LUN ownership to another controller without significant performance loss.

There’s copious information here. Long story short: You don’t change LUN ownership with pools, but rather need to do a migration of the LUN contents to the other controller (to another LUN, you can’t just move the LUN as-is – this also creates issues), otherwise there will be a performance tax to pay.

Claims that autotiering helps most workloads

Not so FAST. EMC’s own best practice guides are rife with caveats and cautions regarding autotiering. Yet this feature is used as a gigantic differentiator at every sales campaign.

For example, in the very thorough “EMC Scaling performance for Oracle Virtual Machine“, the following graph is shown on page 35:


The arrows were added by me. Notice that most of the performance benefit is provided once cache is appropriately sized. Adding an extra 5 SSDs for VNX tiering provides almost no extra benefit for this database workload.

One wonders how fast it would go if an extra 4 SSDs were added for even more cache instead of going to the tier… :)

Perchance the all-cache line with 8 SSDs would be faster than 4 cache SSDs and 5 tier SSDs, but that would make for some pretty poor autotiering marketing.

Claims that storage pools are easier

The typical VNX pitch to a customer is: Use a single, easy, happy, autotiered pool. Despite what marketing slicks show, unfortunately, complexity is not really reduced with VNX pools – simply because single pools are not recommended for all workloads. Consider this typical VNX deployment scenario, modeled after best practice documents:

  1. RecoverPoint journal LUNs in a separate RAID10 RAID group
  2. SQL log LUNs in a separate RAID10 RAID group
  3. Exchange 2010 log LUNs in a separate RAID10 RAID group
  4. Exchange 2010 data LUNs can be in a pool as long as it has a homogeneous disk type, otherwise use multiple RAID groups
  5. SQL data can be in an autotiered pool
  6. VMs might have to go in a separate pool or maybe share the SQL pool
  7. VDI linked clone repository would probably use SSDs in a separate RAID10 RAID group

OK, great. I understand that all the I/O separation above can be beneficial. However, the selling points of pooling and autotiering are that they should reduce complexity, reduce overall cost, improve performance and improve space efficiency. Clearly, that’s not the case at all in real life. What is the reason all the above can’t be in a single pool, maybe two, and have some sort of array QoS to ensure prioritization?

And what happens to your space efficiency if you over-allocate disks to the old-fashioned RAID groups above? How do you get the space back?

What if you under-allocated? How easy would it be to add a bit more space or performance? (not 2-3x – let’s say you need just 20% more). Can you expand an old-fashioned VNX RAID group by a couple of disks?

And what’s the overall space efficiency now that this kind of elaborate split is necessary? Hmm… ;)

For more detail, check these Exchange and SQL design documents.

Thin provisioning performance

This is just great.

VNX thin provisioning performs very poorly relative to thick and even more poorly relative to standard RAID groups. The performance issue makes complete sense due to how space is allocated when writing thin on a VNX, with 8KB blocks assigned as space is being used. A nice explanation of how pool space is allocated is here. A VNX writes to pools using 1GB slices. Thick LUNs pre-allocate as many 1GB slices as necessary, which keeps performance acceptable. Thin LUNs obviously don’t pre-allocate space and currently have no way to optimize writes or reads – the result is fragmentation, in addition to the higher CPU, disk and memory overhead to maintain thin LUNs :)

From the Exchange 2010 design document again, page 23:


Again, I added the arrows to point out a couple of important things:

  1. Thin provisioning is not recommended for high performance workloads on VNX
  2. Indeed, it’s so slow that you should run your thin pools with RAID10!!!

But wait – thin provisioning is supposed to help me save space, and now I have to run it with RAID10, which chews up more space?

Kind of an oxymoron.

And what if the customer wants the superior reliability of RAID6 for the whole pool? How fast is thin provisioning then?

Oh, and the VNX has no way to fix the fragmentation that’s rampant in its thin LUNs. Short of a migration to another LUN (kind of a theme it seems).

The new VNX snapshots

The VNX has a way to somewhat lower the traditionally extreme impact of FLARE snapshots by switching from COFW (Copy On First Write) to ROFW (Redirect On First Write).

The problem?

The new VNX snapshots need a pool, and need thin LUNs. It makes sense from an engineering standpoint, but…

Those are exactly the 2 VNX features that lower performance.

There are many other issues with the new VNX snapshots, but that’s a story for another day. It’s no wonder EMC pushes RecoverPoint far more than their snaps…

The takeaway

There’s marketing, and then there’s engineering reality.

Since the VNX is able to run both pools and old-fashioned RAID groups, marketing wisely chooses to not be very specific about what works with what.

The reality though is that all the advanced features only work with pools. But those come with significant caveats.

If you’re looking at a VNX – at least make sure you figure out whether the marketed features will be usable for your workload. Ask for a full LUN layout.

And we didn’t even talk about having uniform RAID6 protection in pools, which is yet another story for another day.


Technorati Tags: , , , , , , , , ,

NetApp posts world-record SPEC SFS2008 NFS benchmark result

Just as NetApp dominated the older version of the SPEC SFS97_R1 NFS benchmark back in May of 2006 (and was unsurpassed in that benchmark with 1 million SFS operations per second), the time has come to once again dominate the current version, SPEC SFS2008 NFS.

Recently we have been focusing on benchmarking realistic configurations that people might actually put in their datacenters, instead of lab queens with unusable configs focused on achieving the highest result regardless of cost.

However, it seems the press doesn’t care about realistic configs (or to even understand the configs) but instead likes headline-grabbing big numbers.

So we decided to go for the best of both worlds – a headline-grabbing “big number” but also a config that would make more financial sense than the utterly crazy setups being submitted by competitors.

Without further ado, NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode. Click here for the specific result. There are other results in the page showing different size clusters so you can get some idea of the scaling possible.

See below table for a high-level analysis (including the list pricing I could find for these specific performance-optimized configs for whatever that’s worth). The comparison is between NetApp and the nearest scale-out competitor result (one of many EMC’s recent acquisitions –  Isilon, the niche, dedicated NAS box – nothing else is close enough to bother including in the comparison).

BTW – the EMC price list is publicly available from here (and other places I’m sure): http://www.emc.com/collateral/emcwsca/master-price-list.pdf

From page 422:

S200-6.9TB & 200GB SSD, 48GB RAMS200-6.9TB & 200GB SSD, 48GB RAM, 2x10GE SFP+ & 2x1G $84,061. Times 140…

Before we dive into the comparison, an important note since it seems the competition doesn’t understand how to read SPEC SFS results:

Out of 1728 450GB disks (the number includes spares and OS drives, otherwise it was 1632 disks), the usable capacity was 574TB (73% of all raw space – even more if one considers a 450GB disk never actually provides 450 real GB in base2). The exported capacity was 288TB. This doesn’t mean we tried to short-stroke or that there is a performance benefit exporting a smaller filesystem – the way NetApp writes to disk, the size of the volume you export has nothing to do with performance. Since SPEC SFS doesn’t use all the available disk space, the person doing the setup thought like a real storage admin and didn’t give it all the available space. 

Lest we be accused of tuning this config or manually making sure client accesses were load-balanced and going to the optimal nodes, please understand this:  23 out of 24 client accesses were not going to the nodes owning the data and were instead happening over the cluster interconnect (which, for any scale-out architecture, is worst-case-scenario performance). Look under the “Uniform Access Rules Compliance” in the full disclosure details of the result in the SPEC website here. This means that, compared to the 2-node ONTAP 7-mode results, there is a degradation due to the cluster operating (intentionally) through non-optimal paths.

EMC NetApp Difference
Cost (approx. USD List) 11,800,000 6,280,000 NetApp is almost half the cost while offering much higher performance
SPEC SFS2008 NFS operations per second 1,112,705 1,512,784 NetApp is over 35% faster, while using potentially better RAID protection
Average Latency (ORT) 2.54 1.53 NetApp offers almost 40% better average latency without using costly SSDs, and is usable for challenging random workloads like DBs, VMs etc.
Space (TB) 864 (out of which 128889GB was used in the test) 574 (out of which 176176GB was used in the test) Isilon offers about 50% more usable space (coming from a lot more drives, 28% more raw space and potentially less RAID protection – N+2 results from Isilon would be different)
$/SPEC SFS2008 NFS operation 10.6 4.15 Netapp is less than half the cost per SPEC SFS2008 NFS operation
$/TB 13,657 10,940 NetApp is about 20% less expensive than EMC per usable TB
RAID Per-file protection. Files < 128K are at least mirrored. Files over 128K are at a 13+1 level protection in this specific test. RAID-DP Ask EMC what 13+1 protection means in an Isilon cluster (I believe 1 node can be completely gone but what about simultaneous drive failures that contain the sameprotected file?)NetApp RAID-DP is mathematically analogous to RAID6 and has a parity drive penalty of 2 drives every 16-20 drives.
Boxes needed to accomplish result 140 nodes, 3,360 drives (incl. 25TB of SSDs for cache), 1,120 CPU cores, 6.7TB RAM. 24 unified controllers, 1,728 drives, 12.2TB Flash Cache, 192 CPU cores, 1.2TB RAM. NetApp is far more powerful per node, and achieves higher performance with a lotless drives, CPUs, RAM and cache.In addition, NetApp can be used for all protocols (FC, iSCSI, NFS, CIFS) and all connectivity methods (FC 4/8Gb, Ethernet 1/10Gb, FCoE).


Notice the response time charts:


NetApp exhibits traditional storage system behavior – latency is very low initially and gradually gets higher the more the box is pushed, as one would expect. Isilon on the other hand starts out slow and gets faster as more metadata gets cached, until the controllers run out of steam (SPEC SFS is very heavy in NAS metadata ops, and should not be compared to heavy-duty block benchmarks like SPC-1).

This is one of the reasons an Isilon cluster is not really applicable for low-latency DB-type apps, or low-latency VMs. It is a great architecture designed to provide high sequential speeds for large files over NAS protocols, and is not a general-purpose storage system. Kudos to the Isilon guys for even getting the great SPEC result in the first place, given that this isn’t what the box is designed to do (the extreme Isilon configuration needed to run the benchmark is testament to that). The better application for Isilon would be capacity-optimized configs (which is what the system is designed for to begin with).


Some important points:

  1. First and foremost, the cluster-mode ONTAP architecture now supports all protocols, it is the only unified scale-out architecture available. Any competitors playing in that space only have NAS or SAN offerings but not both in a single architecture.
  2. We didn’t even test with the even faster 6280 box and extra cache (that one can take 8TB cache per node). The result is not the fastest a NetApp cluster can go :) With 6280s it would be a healthy percentage faster, but we had a bunch of the 6240s in the lab so it was easier to test them, plus they’re a more common and less expensive box, making for a more realistic result.
  3. ONTAP in cluster-mode is a general-purpose storage OS, and can be used to run Exchange, SQL, Oracle, DB2, VMs, etc. etc. Most other scale-out architectures are simply not suitable for low-latency workloads like DBs and VMs and are instead geared towards high NAS throughput for large files (IBRIX, SONAS, Isilon to name a few – all great at what they do best).
  4. ONTAP in cluster mode is, indeed, a single scale-out cluster and administered as such. It should not be compared to block boxes with NAS gateways in front of them like VNX, HDS + Bluearc, etc.
  5. In ONTAP cluster mode, workloads and virtual interfaces can move around the cluster non-disruptively, regardless of protocol (FC, iSCSI, NFS and yes, even CIFS can move around non-disruptively assuming you have clients that can talk SMB 2.1 and above).
  6. In ONTAP cluster mode, any data can be accessed from any node in the cluster – again, impossible with non-unified gateway solutions like VNX that have individual NAS servers in front of block storage, with zero awareness between the NAS heads aside from failover.
  7. ONTAP cluster mode can allow certain cool things like upgrading storage controllers from one model to another completely non-disruptively, most other storage systems need some kind of outage to do this. All we do is add the new boxes to the existing cluster :)
  8. ONTAP cluster mode supports all the traditional NetApp storage efficiency and protection features: RAID-DP, replication, deduplication, compression, snaps, clones, thin provisioning. Again, the goal is to provide a scale-out general-purpose storage system, not a niche box for only a specific market segment. It even supports virtualizing your existing storage.
  9. There was a single namespace for the NFS data. Granted, not the same architecture as a single filesystem from some competitors.
  10. Last but not least – no “special” NetApp boxes are needed to run Cluster Mode. In contrast to other vendors selling a completely separate scale-out architecture (different hardware and software and management), normal NetApp systems can enter a scale-out cluster as long as they have enough connectivity for the cluster network and can run ONTAP 8. This ensures investment protection for the customer plus it’s easier for NetApp since we don’t have umpteen hardware and software architectures to develop for and support :)
  11. Since people have been asking: The SFS benchmark generates about 120MB per operation. The slower you go, the less space you will use on the disks, regardless of how many disks you have. This creates some imbalance in large configs (for example, only about 128TB of the 864TB available was used on Isilon).

Just remember – in order to do what ONTAP in Cluster Mode does, how many different architectures would other vendors be proposing?

  • Scale-out SAN
  • Scale-out NAS
  • Replication appliances
  • Dedupe appliances
  • All kinds of management software

How many people would it take to keep it all running? And patched? And how many firmware inter-dependencies would there be?

And what if you didn’t need, say, scale-out SAN to begin with, but some time after buying traditional SAN realized you needed scale-out? Would your current storage vendor tell you you needed, in addition to your existing SAN platform, that other one that can do scale-out? That’s completely different than the one you bought? And that you can’t re-use any of your existing stuff as part of the scale-out box, regardless of how high-end your existing SAN is?

How would that make you feel?

Always plan for the future…

Comments welcome.


PS: Made some small edits in the RAID parts and also added the official EMC pricelist link.


Technorati Tags: , , , , , , , , ,


NetApp vs EMC usability report: malice, stupidity or both?

Most are familiar with Hanlon’s Razor:

Never attribute to malice that which is adequately explained by stupidity.

A variation of that is:

Never attribute to malice that which is adequately explained by stupidity, but don’t rule out malice.

You see, EMC sponsored a study comparing their systems to ones from the company they look up to and try to emulate. The report deals with ease-of-use (and I’ll be the first to admit the current iteration of EMC boxes is far easier to use than in the past and the GUI has some cool stuff in it). I was intrigued, but after reading the official-looking report posted by Chuck Hollis, I wondered who in their right mind will lend it credence, and ignored it since I have a real day job solving actual customer problems and can’t possibly respond to every piece of FUD I see (and I see a lot).

Today I’m sitting in a rather boring meeting so I thought I’d spend a few minutes to show how misguided the document is.

In essence, the document tackles the age-old dilemma of which race car to get by comparing how easy it is to change the oil, and completely ignores the “winning the race with said car” part. My question would be: “which car allows you to win the race more easily and with the least headaches, least cost and least effort?”

And if you think winning a “race” is just about performance, think again.

It is also interesting how the important aspects of efficiency, reliability and performance are not tackled, but I guess this is a “usability” report…

Strange that a company named “Strategic Focus” reduces itself to comparing arrays by measuring the number of mouse clicks. Not sure how this is strategic for customers. They were commissioned by EMC, so maybe EMC considers this strategic.

I’ll show how wrong the document is by pointing at just some of the more glaring issues, but I’ll start by saying a large multinational company has many PB of NetApp boxes around the globe and 3 relaxed guys to manage it all. How’s that for a real example? :)

  1. Page 2, section 4, “Methodology”: EMC’s own engineers set up the VNX properly. No mention of who did the NetApp testing, what their qualifications are, and so on. So, first question: “Do these people even know what they’re doing? Have they really used a NetApp system before?”
  2. Page 10, Table A, showing the configurations. A NetApp FAS3070 was used, running the latest code at this moment (8.01). Thanks EMC for the unintended compliment – you see, that system is 2 generations old (the current one is 3270 and the previous one is 3170) yet it can still run the very latest 64-bit ONTAP code just fine. What about the EMC CX3? Can it run FLARE31? Or is that a forklift upgrade? Something to be said for investment protection :)
  3. Page 3 table 5-1, #1. Storage pools on all modern arrays would typically be created upfront, so the wording is very misleading. In order to create a new LUN one does NOT NEED to create a pool. Same goes for all vendors.
  4. Same table and section (also mentioned in section 7): Figuring out the space available is as simple as going to the aggregate page, where the space is clearly shown for the aggregates. So, unsure what is meant here.
  5. Regarding LUN creation… Let me ask you a question: After you create a LUN on any array, what do you need to do next? You see, the goal is to attach the LUN to a host, do alignment, partition creation, multipathing and create a filesystem and write stuff to it. You know, use it. NetApp largely automates end-to-end creation of host filesystems and, indeed, does not need an administrator to create a LUN on the array at all. Clearly the person doing the testing is either not aware of this or decided to omit this fact.
  6. Page 4, item 4 (thin provisioning). Asinine statement – plus, any NetApp LUN can be made thick or thin with a single click, whereas a VNX needs to do a migration. Indeed, NetApp does not complicate things whether thin or thick is required, does not differentiate between thin and thick when writing, and therefore does not incur a performance penalty, whereas EMC does (according to EMC documentation).
  7. Page 4, item 5 (Creation of virtual CIFS servers). The Multistore feature is free of charge on all new systems and allows one to create fully segregated, secure multitenancy virtual CIFS, NFS and iSCSI NetApp “partitions” – far beyond the capabilities of EMC. Again, misleading.
  8. Page 4, item 6 (growing storage elements). No measurable difference? Kindly show all the steps to grow a LUN until the new space is visible from the host side. End-to-end is important to real users since they want to use the storage. Or maybe not, for the authors of this document.
  9. Page 5, Item 1. We are really talking here about EMC snapshots? Seriously? Versus NetApp? To earn the right to do so assumes your snapshots are a usable and decent feature and that you can take a good number of them without the box crumbling to pieces. Ask any vendor about a production array with the most snaps and ask to talk to the customer using it. Then compare the number of snaps to a typical NetApp customer’s. Don’t be surprised if one number is a few hundred times less than the other.
  10. Page 5, item 3 (storage tiering): part of a longer conversation but this assumes all arrays need to do tiering. If my solution is optimized to the level that it doesn’t need to do this but yours is not optimized so it needs tiering, why on earth am I being penalized for doing storage more efficiently than you? (AKA the “not invented here” syndrome).
  11. Page 6 item 1 (VMware awareness): NetApp puts all the awareness inside vCenter and, indeed, datastore creation (including volume/LUN/NAS creation and resizing), VM cloning etc. all from within vCenter itself. Ask for a demo and prepare to be amazed.
  12. Page 6, item 2 – (dedupe/compress individual VMs): This one had my blood boiling. You see, EMC cannot even dedupe individual VMs, (impossible, given the fact that current DART code only does dedupe at the file and not block level and no two active VMs will ever be exactly the same), can’t dedupe at all for block storage (maybe in the future but not today), and in general doesn’t recommend compression for VMs! Ask to see the best practices guide that states all this is supported and recommended for active production VMs, and to talk to a customer doing it at scale (not 10 VMs). A feature you can theoretically turn on but that will never work is not quite useful, you see…
  13. Page 8, entire table: Too much to comment on, suffice it to say that NetApp systems come with tools not mentioned in this report that go so far beyond what Unisphere does that it’s not even funny (at no additional cost). Used by customers that have thousands of NetApp systems. That’s how much those tools scale. EMC would need vast portions of the Ionix suite to do anything remotely similar (at $$$). Of course, mentioning that would kinda derail this document… and the piece about support and upgrades is utterly wrong, but I like to keep the surprise for when I do the demos and not share cool IP ideas here :)
  14. Page 11, Table B1: In the end, the funniest one of all! If you add up the total number of mouse clicks, NetApp needed 92 vs EMC’s 111. Since the whole point of this usability report is to show overall ease of use by measuring the total number of clicks to do stuff, it’s interesting that they didn’t do a simple total to show who won in the end… :)

I could keep going but I need to pay attention to my meeting now since it suddenly became interesting.

Ultimately, when it comes to ease of use, it’s simple to just do a demo and have the customer decide for themselves which approach they like best. Documents such as this one mean less than nothing for actual end users.

I should have another similar list showing clicks and TIME needed to do certain other things. For instance, using RecoverPoint (or any other method), kindly show the number of clicks and time (and disk space) for creating 30 writable clones of a 10TB SQL DB and mounting them on 30 different DB servers simultaneously. Maintaining unique instance names etc. Kinda goes a bit beyond LUN creation, doesn’t it? :)

All this BTW doesn’t mean any vendor should rest on their laurels and stop working on improving usability. It’s a never-ending quest. Just stop it with the FUD, please…

Finishing with something funny: Check this video for a good demonstration of something needing few clicks yet not being that easy to do.

Comments welcome.


Technorati Tags: , , , ,