
How to decipher EMC’s new VNX pre-announcement and look behind the marketing.

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab-only, etc.).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to show that the increases are not proportionally as large as EMC claims, and/or that they're not so much because of software, and, moreover, that some planned obsolescence might be coming the way of the VNX for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

[Chart from EMC's public video: per-core utilization on the current VNX (left) vs. the upcoming box (right).]

I added the arrows for clarification.

Notice that the left-hand chart above shows the current VNX using, according to EMC, maybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we'll come back to it. But, currently, if true, this shows extremely poor multi-core utilization. It looks like processes are dedicated to specific cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software, they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then comes the interesting part. They did an IOmeter test for the new box only.

They mentioned how the current VNX7500 would max out at 170,000 8K random read IOPS from SSD (this in itself is a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS), and that the current model's relative lack of performance is due to the fact its software can't take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O, which is impressive indeed, even though an 8K random-read-only test is hardly a realistic way to prove overall performance. I accept that they were trying to show how much more read speed they could get out of the extra cores, and it makes for a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn’t as dramatic as EMC claims… right? :)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremeIO if the new VNX is that fast? What about VMAX? :)

Point #4 above is important. For instance, EMC has been touting multi-core enhancements for years now. The current VNX FLARE release has 50% better core efficiency than the one before, supposedly. And, before that, in 2008, multi-core was advertised as getting 2x the performance vs the software before that. However, the chart above shows extremely poor core efficiency. So which is it? 

Or is it maybe that the box demonstrated is getting most of its speed increase not so much from the magic of better software, but mostly from vastly faster hardware – the fastest Intel CPUs (more clock speed, not just more cores, plus more efficient instruction processing), the latest chipset, faster memory, faster SSDs, faster buses, and so on? A potential 3-5x faster box by hardware alone.
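Just to sanity-check that, here's a rough back-of-the-envelope sketch in Python, using only the figures EMC themselves showed; the "over 5x" claim is taken at face value, and the 2.5-core figure is my reading of their chart, so treat all of it as approximate:

# Back-of-the-envelope using only EMC's own figures (assumed/approximate values)
vnx7500_iops = 170000         # 8K random read IOPS from SSD, per EMC's own slide
vnx7500_cores = 6
cores_actually_used = 2.5     # rough total implied by EMC's core-utilization chart

new_box_iops = 5 * vnx7500_iops   # "over 5x" -> at least ~850,000

# If software were the only problem, fixing it should let the VNX7500
# scale across all 6 of its cores:
software_fixed_7500 = vnx7500_iops * (vnx7500_cores / cores_actually_used)
print("VNX7500 with fixed software: ~%d IOPS" % software_fixed_7500)   # ~408,000

# Whatever remains beyond that estimate has to come from hardware, not software:
print("Gap left for hardware: ~%.1fx" % (new_box_iops / software_fixed_7500))  # ~2.1x

By this admittedly crude math, roughly half the headline improvement would have to come from the hardware, which is exactly the point of #4 above.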

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1, since it's all about the software, after all. :)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe there's some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.




Are SSDs reliable enough? The importance of extensive testing under adverse conditions.

Recently, some interesting research (see here) from Ohio State was presented at USENIX.

To summarize, they tested 15 SSDs, several of them “Enterprise” grade, and subjected them to various power fault conditions. 

Almost all the drives suffered data loss that should not have occurred, and some were so corrupt as to be rendered utterly unusable (could not even be seen on the bus). It’s worth noting that spinning drives used in enterprise arrays would not have suffered the same way.

It’s not just an issue of whether or not the SSD has some supercapacitors in order to de-stage the built-in RAM contents to flash – a certain very prominent SSD vendor was hit with this issue even though the SSDs in question had the supercapacitors, generous overprovisioning and internal RAID. A firmware issue is suspected and this is not fixed yet.

You might ask, why am I mentioning this?

Several storage systems try to lower SSD costs by using cheap SSDs (often consumer models found in laptops, not even eMLC) and then try to get more longevity out of said SSDs with clever write techniques: minimizing the amount of data written (dedupe, compression) and making the most of wear leveling by writing in flash-friendly ways (more appends, fewer overwrites, moving data around as needed, and more).

However, all those (perfectly valid) techniques have a razor-sharp focus on the fact that cheaper flash has a very limited number of write/erase cycles; they do nothing about things like massive corruption stemming from weird power failures or firmware bugs (and, having lived through multiple UPS and generator failures, I don't accept those as a complete answer, either).

On the other hand, the Tier 1 storage vendors typically do pretty extensive component testing, including various power failure scenarios, from the normal to the very strange. The system has to withstand those, then come up no matter what. Edge cases are tested as a matter of course – a main reason people buy enterprise storage is how edge cases are handled… :)

At NetApp, when we certify SSDs, they go through an extra-rigorous process since we are paranoid and they are still a relatively new technology. We also offer our standard dual-parity RAID, along with multiple ways to safeguard against lost writes, for all media. The last thing one needs is multiple drives failing due to a strange power failure or a firmware bug.

Protection against failures is even more important in storage systems that lack the extra integrity checks NetApp offers. Non-NetApp systems that use SSDs either as their only storage or as part of a pool can suffer catastrophic failures if the integrity of the SSDs is compromised badly enough: by definition, if part of the pool fails, the whole pool fails, which could mean the entire storage system has to be restored from backup.

For those systems where cheap SSDs are merely used as an acceleration mechanism, catastrophic performance failures are a very real potential outcome. 1000 VDI users calling the helpdesk is not my idea of fun.

Such component behavior is clearly unacceptable.

Proper testing takes intelligence and talent, but also experience and extensive battle scarring. Back when NetApp was young, we didn't know the things we know today, and couldn't handle some of the fault conditions we can handle today. Test harnesses at most Tier 1 vendors become more comprehensive as new problems are discovered, and sometimes the only way to discover the really weird problems is through sheer numbers (selling many millions of units of a certain component provides some pretty solid statistics regarding its reliability and failure modes).

“With age comes wisdom”.





Are some flash storage vendors optimizing too heavily for short-lived NAND flash?

I really resisted using the “flash in the pan” phrase in the title… first, because the term is overused and second, because I don’t believe solid state is of limited value. On the contrary.

However, I am noticing an interesting trend among some newcomers in the array business, desperate to find a flash niche to compete in:

Writing their storage OS around very specific NAND flash technologies. Almost as bad as writing an entire storage OS to support a single hypervisor technology, but that’s a story for another day.

Solid state technology is still too fluid. Unlike spinning disk technology, which is overall very reliable and mature and likely won't see huge advances in the years to come, solid state technology seems to advance almost weekly. New SSD controllers come out almost too frequently, and new kinds of solid state storage are either out now (Triple Level Cell, anyone?) or coming in the future (MRAM, ReRAM, FeRAM, PCM, PMC, and probably a lot more that I'm forgetting).

My point is:

How far ahead are certain vendors thinking if they are writing an entire storage OS around the limitations of a class of storage that may look very different in just a year or two?

Some of them go really deep and try to do all kinds of clever optimizations to ensure good wear leveling for the flash chips. Some write their own controller software and use bare NAND flash chips, not even off-the-shelf SSDs. Which is great, but what if you don’t need to do that in two years? Or what if the optimizations need to be drastically different for the new technologies? How long will coding for the new flash technologies take? Or will they be stuck using old technologies? Food for thought.

I guess some of us are in it for the long haul, and some aren’t. “Can’t see the forest for the trees” comes to mind. “Gold rush” also seems relevant.

I strongly believe general-purpose storage OSes need to be flexible enough to be reasonably adaptable to different underlying media. And storage OSes that are specifically designed for solid state storage need to be especially flexible regarding the underlying SSD technology to avoid the problems outlined above, and to avoid the relative lack of reliability of current SSD solutions (another story for another day).

At the moment I don’t see clear winners yet. I see a few great short-term stories, but who has the most flexible architecture to be able to deal with different kinds of technologies for years to come?



OS X and SSD – tunings plus performance with and without TRIM

I finally decided to spring for an SSD for my laptop since I hammer it heavily with a lot of mostly random I/O. It was money well spent.

I went for an Intel 320 model, since it includes extra capacitors for flushing the cache in the event of power failure, and has RAID-4 onboard for protection beyond sparing (there are other, faster SSDs but I need the reliability and can’t afford large-sized SLC).

I used the trusty postmark (here’s a link to the OS X executable) to generate a highly random workload with varying file sizes, using these settings:

set buffering false
set size 500 100000
set read 4096
set write 4096
set number 10000
set transactions 20000
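For anyone wanting to reproduce the runs below, here's a minimal Python sketch of one way to script it; the ./postmark path is just an assumption (point it at wherever you saved the executable linked above), and "run"/"quit" are appended so postmark actually executes the workload and exits:

import subprocess, tempfile

# The settings from this post, plus "run" and "quit" so postmark executes and exits
config = ("set buffering false\n"
          "set size 500 100000\n"
          "set read 4096\n"
          "set write 4096\n"
          "set number 10000\n"
          "set transactions 20000\n"
          "run\n"
          "quit\n")

with tempfile.NamedTemporaryFile("w", suffix=".pmrk", delete=False) as f:
    f.write(config)
    cfg = f.name

# Most postmark builds take a config file as the only argument and print the report;
# if yours doesn't, pipe the commands to its stdin instead.
out = subprocess.run(["./postmark", cfg], capture_output=True, text=True)
print(out.stdout)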

All testing was done on OS X 10.6.7.

Here’s the result with the original 7200 RPM HDD:

198 seconds total
186 seconds of transactions (107 per second)

20163 created (101 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10163 files (54 per second)
10053 read (54 per second)
9945 appended (53 per second)
20163 deleted (101 per second)
Deletion alone: 10326 files (3442 per second)
Mixed with transactions: 9837 files (52 per second)

557.87 megabytes read (2.82 megabytes per second)
1165.62 megabytes written (5.89 megabytes per second)

I then replaced the internal drive with SSD, popped the old internal drive into an external caddy, plugged it into the Mac, reinstalled OS X and simply told it to move the user and app stuff from the old drive to the new (Apple makes those things so easy – on a PC you’d probably need something like an imaging program but that wouldn’t take care of very different hardware). I spent a ton of time testing to make sure it was all OK, in disbelief it was that easy. Kudos, Apple.

Here are the results with SSD (2/3rds full FWIW):

19 seconds total
13 seconds of transactions (1538 per second)

20163 created (1061 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (781 per second)
10053 read (773 per second)
9945 appended (765 per second)
20163 deleted (1061 per second)
Deletion alone: 10326 files (5163 per second)
Mixed with transactions: 9837 files (756 per second)

557.87 megabytes read (29.36 megabytes per second)
1165.62 megabytes written (61.35 megabytes per second)

A fair bit of improvement… :) The perceived difference is amazing. For some things I’ve caught it doing over 200MB/s sustained writes.

I also disabled the sudden motion sensor since there’s no point stopping I/O to a SSD if one shakes the laptop. From the command line:

sudo pmset -a sms 0 (this disables it)
sudo pmset -g (to verify it was done)

And since I don't need hotfile adaptive clustering on an SSD, I decided to disable access time updates (noatime in UNIX parlance).

You need to put the script from here: http://dl.dropbox.com/u/5875413/Tools/com.my.noatime.plist

into /Library/LaunchDaemons

And make sure it has the right permissions:

sudo chown root:wheel com.my.noatime.plist

Then reboot, type mount from the command line, and see if the root filesystem shows noatime as one of the mount arguments.

For example, mine shows:

/dev/disk0s2 on / (hfs, local, journaled, noatime)

I then re-ran postmark, here are the results with noatime:

16 seconds total
11 seconds of transactions (1818 per second)

20163 created (1260 per second)
Creation alone: 10000 files (2500 per second)
Mixed with transactions: 10163 files (923 per second)
10053 read (913 per second)
9945 appended (904 per second)
20163 deleted (1260 per second)
Deletion alone: 10326 files (10326 per second)
Mixed with transactions: 9837 files (894 per second)

557.87 megabytes read (34.87 megabytes per second)
1165.62 megabytes written (72.85 megabytes per second)

Even better.

Now here comes the part that I hoped would work better than it did:

OS X doesn't support the TRIM command for SSDs yet (unless you have a really new Mac with an Apple SSD). Fortunately, some enterprising users found out that it is possible to enable TRIM on OS X. There are various ways to do it, but someone already automated the process. Be sure to do a backup first (both a system backup and through the TRIM enabler application).

The process does work. However, it seems to run TRIM too aggressively, interfering with the random-access optimizations some drives have.

Benchmark after TRIM enabled:

39 seconds total
31 seconds of transactions (645 per second)

20163 created (517 per second)
Creation alone: 10000 files (3333 per second)
Mixed with transactions: 10163 files (327 per second)
10053 read (324 per second)
9945 appended (320 per second)
20163 deleted (517 per second)
Deletion alone: 10326 files (2065 per second)
Mixed with transactions: 9837 files (317 per second)

557.87 megabytes read (14.30 megabytes per second)
1165.62 megabytes written (29.89 megabytes per second)

This kind of performance loss is unacceptable to me, so I restored the kext file through the TRIM app, rebooted and re-ran the benchmark and all was fine again.
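To put the four runs side by side, here's a small Python sketch with the headline numbers transcribed from the postmark outputs above (the TRIM run is the one from just before I reverted):

# Total time, transactions/sec and write throughput transcribed from the runs above
runs = [
    ("7200 RPM HDD",       198,  107,  5.89),
    ("SSD",                 19, 1538, 61.35),
    ("SSD + noatime",       16, 1818, 72.85),
    ("SSD + TRIM enabled",  39,  645, 29.89),
]

hdd_tps = runs[0][2]
for name, total_s, tps, write_mbps in runs:
    print("%-20s %4ds total  %5d tps (%4.1fx HDD)  %6.2f MB/s writes"
          % (name, total_s, tps, tps / float(hdd_tps), write_mbps))

Even with TRIM enabled the SSD is still about 6x the spinning disk on transactions, but it gives up more than half of its advantage, which is why I reverted.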

My recommendations:

  1. Always test before and after the tweaks – my results may only apply to Intel drives. Please post your results with other drives.
  2. Always do backups before serious tweaks.
  3. If TRIM seems to slow down random I/O on your Mac SSD, don't keep it running; maybe enable it once a month, go to Disk Utility, and ask it to erase the free space. This will ensure the drive stays in good shape without adversely affecting normal random I/O.


EMC conclusively proves that VNX bottlenecks NAS performance

A bit of a controversial title, no?

Allow me to elaborate.

EMC posted a new SPEC SFS result as part of a marketing stunt (which is working, look at what I’m doing – I’m talking about them, if only to clear the air).

In simple terms, EMC got almost 500,000 SPEC SFS NFS IOPS (not to be confused with, say, block-based SPC-1 IOPS) with the following configuration:

  1. Four (4) totally separate VNX arrays, each loaded with SSD storage, utterly unaware of each other (8 total controllers since each box has 2)
  2. Five (5) Celerra VG8 NAS heads/gateways (1 spare), one on top of each VNX box
  3. 2 Control Stations
  4. 8 exported filesystems (2 per VG8 head/VNX system)
  5. Multiple pools of storage (at least 1 per VG8) – not shared among the various boxes, no data mobility between boxes
  6. Only 60TB NAS space with RAID5 (or 15TB per box)

Now, this post is not about whether this configuration is unrealistic and expensive (almost nobody would pay $6m for merely 60TB of NAS, not today). I get it that EMC is trying to publish the best possible number by loading a bunch of separate arrays with SSD. It’s OK as long as everyone understands the details.

My beef has to do with how it’s marketed.

EMC is very vague about the configuration, unless you look at the actual SPEC website. In the marketing materials they just mention VNX, as in “The EMC VNX performed at 497,623 SPECsfs2008_nfs.v3 operations per second”. Kinda like saying it’s OK to take 3 5-year olds and a 6-year old to a bar because their age adds up to 21.

No – the far more accurate statement is “four separate VNXs working independently and utterly unaware of each other did 124,405 SPECsfs2008_nfs.v3 operations per second each”.

All EMC did was add up the result of 4 boxes.

Heck, that’s easy to do!

NetApp already has a result for the 6240 (just 2 controllers doing a respectable 190,675 SPEC NFS ops taking care of NAS and RAID all at once since they’re actually unified, no cornucopia of boxes there) without using Solid State Drives (common SAS drives plus a large cache were used instead – a standard, realistic config we sell every day, and not a “lab queen”).

If all we’re doing is adding up the result of different boxes, simply multiply this by 4 (plus we do have Cluster-Mode for NAS so it would count as a single clustered system with failover etc. among the nodes) and end up with the following result:

  1. 762,700 SPEC SFS NFS operations
  2. 8 exported filesystems
  3. 343TB usable with RAID-DP (thousands of times more resilient than RAID5)

So, which one do you think is the better deal? More speed, 343TB and better protection, or less speed, 60TB and far less protection? :)

Customers curious about other systems can do the same multiplication trick for other configs, the sky is the limit!
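For the arithmetic-inclined, the entire “multiplication trick” fits in a few lines of Python (figures taken from the published SPEC SFS 2008 submissions discussed above):

# EMC's number: four separate VNX/VG8 pairs, results simply added together
emc_total_ops = 497623
emc_boxes = 4
print("Per VNX/VG8 pair: ~%d ops" % (emc_total_ops // emc_boxes))       # ~124,405

# NetApp's published 6240 result: one 2-controller system, SAS disks plus cache
netapp_6240_ops = 190675
print("Four 6240s, same trick: %d ops" % (netapp_6240_ops * emc_boxes))  # 762,700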

The other, more serious part, and what prompted me to title the post the way I did, is that EMC's benchmarking made it pretty clear that the VNX is the bottleneck: it can only really support a single VG8 head at top speed, necessitating 4 separate VNX systems to accomplish the final result. So, the fact that a VNX can have up to 8 Celerra heads on top of it means nothing, since the back-end is your limiting factor. You might as well stick to a dual-head VG8 config (1 active, 1 passive) since that's all it can comfortably drive (otherwise why benchmark it that way?).

But with only 1 active NAS head you’d be limited to just 256TB max NAS capacity, since that’s how much total space a Celerra head can address as of the time of this writing. Which is probably enough for most people.

I wonder if the NAS heads that can be bought as a package with VNX are slower than VG8 heads, and by how much. You see, most people buying the VNX will be getting the NAS heads that can be packaged with it since it’s cheaper that way. How fast does that go? I’m sure customers would like to know, since that’s what they will typically buy.

I also wonder how fast it would be with RAID6.

Here’s a novel idea: benchmark what customers will actually buy!

So apples-to-apples comparisons can become easier instead of something like this:


[Photo, for the curious: an “Autumn Glory” Malus floribunda – a miniature apple. Courtesy of John Fullbright.]

