
How to decipher EMC’s new VNX pre-announcement and look behind the marketing.

It was with interest that I watched some of EMC’s announcements during EMC World. Partly due to competitor awareness, and partly due to being an irrepressible nerd, hoping for something really cool.

BTW: Thanks to Mark Kulacz for assisting with the proof points. Mark, as much as it pains me to admit so, is quite possibly an even bigger nerd than I am.

So… EMC did deliver something. A demo of the possible successor to VNX (VNX2?), unavailable as of this writing (indeed, a lot of fuss was made about it being lab only etc).

One of the things they showed was increased performance vs their current top-of-the-line VNX7500.

The aim of this article is to show that the increases are not proportionally as large as EMC claims, and/or that they're not so much because of software, and, moreover, that some planned obsolescence might be coming the VNX's way for no good reason. Aside from making EMC more money, that is.

A lot of hoopla was made about software being the key driver behind all the performance increases, and how they are now able to use all CPU cores, whereas in the past they couldn’t. Software this, software that. It was the theme of the party.

OK – I’ll buy that. Multi-core enhancements are a common thing in IT-land. Parallelization is key.

So, they showed this interesting chart (hopefully they won’t mind me posting this – it was snagged from their public video):

[Image: MCX core util arrow]

I added the arrows for clarification.

Notice that the chart above, left, shows the current VNX using, according to EMC, maybe a total of 2.5 out of the 6 cores if you stack everything up (for instance, Core 0 is maxed out, Core 1 is 50% busy, Cores 2-4 do little, Core 5 does almost nothing). This is important and we'll come back to it. But, if true, it shows extremely poor multi-core utilization today. It looks like processes are dedicated to specific cores – Core 0 does RAID only, for example. Maybe a way to lower context switches?
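
If that is indeed the design, the behavior would look roughly like the sketch below – purely my speculation about how static core dedication might be done, with made-up service names; nothing here comes from EMC documentation:

    import os

    # Hypothetical illustration of static process-to-core dedication (my guess,
    # not anything EMC has published). Pinning a service to one core avoids
    # cross-core migration and context-switch overhead, at the cost of leaving
    # the other cores idle when that one service is the bottleneck.
    # Linux-only: os.sched_setaffinity is not available on all platforms.
    SERVICE_AFFINITY = {
        "raid_engine": {0},    # back-end RAID work locked to Core 0
        "front_end_io": {1},   # front-end protocol handling on Core 1
        "housekeeping": {5},   # low-priority tasks on the mostly idle core
    }

    def pin_current_process(service_name: str) -> None:
        """Pin the calling process to the core set dedicated to this service."""
        os.sched_setaffinity(0, SERVICE_AFFINITY[service_name])  # 0 = current PID

    if __name__ == "__main__":
        pin_current_process("raid_engine")
        print("Running on cores:", os.sched_getaffinity(0))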

Then they mentioned how the new box has 16 cores per controller (the current VNX7500 has 6 cores per controller).

OK, great so far.

Then they mentioned how, By The Holy Power Of Software,  they can now utilize all cores on the upcoming 16-core box equally (chart above, right).

Then comes the interesting part: they did an IOmeter test – for the new box only.

They mentioned how the current VNX7500 would max out at 170,000 8K random reads from SSD (this in itself is a nice nugget when dealing with EMC reps claiming insane VNX7500 IOPS). And that the current model's relative lack of performance is due to the fact that its software can't take advantage of all the cores.

Then they showed the experimental box doing over 5x that I/O. Which is impressive indeed, even though an all-read IOmeter run is hardly a realistic way to prove performance. I accept that they were trying to show how much more read-only speed they could get out of the extra cores – plus it makes for a cooler marketing number.

Writes are a whole separate wrinkle for arrays, of course. Then there are all the other ways VNX performance goes down dramatically.

However, all this leaves us with a few big questions:

  1. If this is really all about just optimized software for the VNX, will it also be available for the VNX7500?
  2. Why not show the new software on the VNX7500 as well? After all, it would probably increase performance by over 2x, since it would now be able to use all the cores equally. Of course, that would not make for good marketing. But if with just a software upgrade a VNX7500 could go 2x faster, wouldn’t that decisively prove EMC’s “software is king” story? Why pass up the opportunity to show this?
  3. So, if, with the new software, the VNX7500 could do, say, 400,000 read IOPS in that same test, the difference between new and old isn't as dramatic as EMC claims… right? :) (The back-of-the-envelope sketch after this list shows where a number like 400,000 comes from.)
  4. But, if core utilization on the VNX7500 is not as bad as EMC claims in the chart (why even bother with the extra 2 cores on a VNX7500 vs a VNX5700 if that were the case), then the new speed improvements are mostly due to just a lot of extra hardware. Which, again, goes against the “software” theme!
  5. Why do EMC customers also need XtremIO if the new VNX is that fast? What about VMAX? :)
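
To put some rough numbers on question 3, here is a back-of-the-envelope sketch. The only figures from EMC are the 170,000 IOPS, the roughly 2.5 busy cores out of 6, and the 16 cores in the new box; everything derived from them is my own assumption, including the optimistic idea that IOPS scale linearly with usable cores:

    # Back-of-the-envelope scaling estimate. Only the EMC-quoted numbers are
    # real; the linear-scaling assumption and everything derived are mine.
    vnx7500_iops = 170_000       # EMC's stated 8K random read max from SSD
    vnx7500_busy_cores = 2.5     # rough total utilization from EMC's own chart
    vnx7500_total_cores = 6
    new_box_cores = 16

    iops_per_core = vnx7500_iops / vnx7500_busy_cores           # ~68,000

    # If the new software simply let the old box use all 6 cores:
    vnx7500_with_new_sw = iops_per_core * vnx7500_total_cores   # ~408,000

    # What 16 cores of the same speed would give from core count alone:
    new_box_core_scaling = iops_per_core * new_box_cores        # ~1,088,000

    print(f"VNX7500 with the new software (hypothetical): {vnx7500_with_new_sw:,.0f} IOPS")
    print(f"New box from extra cores alone:               {new_box_core_scaling:,.0f} IOPS")

If the old hardware really could hit something like 400,000 IOPS with nothing but a software update, the "over 5x" headline shrinks to roughly 2x and change – and that remainder is explained by more and faster cores, not software magic.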

Point #4 above is important. EMC has been touting multi-core enhancements for years now. The current VNX FLARE release supposedly has 50% better core efficiency than the one before. And, before that, in 2008, multi-core was advertised as delivering 2x the performance of the software before it. However, the chart above shows extremely poor core efficiency. So which is it?

Or is it maybe that the box demonstrated is getting most of its speed increase not so much from the magic of better software, but mostly from vastly faster hardware – the fastest Intel CPUs (more clock speed, not just more cores, plus more efficient instruction processing), the latest chipset, faster memory, faster SSDs, faster buses, etc. A potential 3-5x faster box by hardware alone.

It doesn’t quite add up as being a software “win” here.

However – I (or at least current VNX customers) probably care more about #1. Since it's all about the software, after all :)

If the new software helps so much, will they make it available for the existing VNX? Seems like any of the current boxes would benefit since many of their cores are doing nothing according to EMC. A free performance upgrade!

However… If they don’t make it available, then the only rational explanation is that they want to force people into the new hardware – yet another forklift upgrade (CX->VNX->”new box”).

Or maybe there’s some very specific hardware that makes the new performance levels possible. Which, as mentioned before, kinda destroys the “software magic” story.

If it’s all about “Software Defined Storage”, why is the software so locked to the hardware?

All I know is that I have an ancient NetApp FAS3070 in the lab. The box was released ages ago (2006 vintage), and yet it’s running the most current GA ONTAP code. That’s going back 3-4 generations of boxes, and it launched with software that was very, very different to what’s available today. Sometimes I think we spoil our customers.

Can a CX3-80 (the beefiest of the CX3 line, similar vintage to the NetApp FAS3070) take the latest code shown at EMC World? Can it even take the code currently GA for VNX? Can it even take the code available for CX4? Can a CX4-960 (again, the beefiest CX4 model) take the latest code for the shipping VNX? I could keep going. But all this paints a rather depressing picture of being able to stretch EMC hardware investments.

But dealing with hardware obsolescence is a very cool story for another day.

D

 


Are some flash storage vendors optimizing too heavily for short-lived NAND flash?

I really resisted using the “flash in the pan” phrase in the title… first, because the term is overused and second, because I don’t believe solid state is of limited value. On the contrary.

However, I am noticing an interesting trend among some newcomers in the array business, desperate to find a flash niche to compete in:

Writing their storage OS around very specific NAND flash technologies. Almost as bad as writing an entire storage OS to support a single hypervisor technology, but that’s a story for another day.

Solid state technology is still too fluid. Unlike spinning disk technology that is overall very reliable and mature and likely won’t see huge advances in the years to come, solid state technology seems to advance almost weekly. New SSD controllers are coming out almost too frequently, and new kinds of solid state storage are either out now (Triple Level Cell, anyone?) or coming in the future (MRAM, ReRAM, FeRAM, PCM, PMC, and probably a lot more that I’m forgetting).

My point is:

How far ahead are certain vendors thinking if they are writing an entire storage OS around the limitations of a class of storage that may look very different in just a year or two?

Some of them go really deep and try to do all kinds of clever optimizations to ensure good wear leveling for the flash chips. Some write their own controller software and use bare NAND flash chips, not even off-the-shelf SSDs. Which is great, but what if you don’t need to do that in two years? Or what if the optimizations need to be drastically different for the new technologies? How long will coding for the new flash technologies take? Or will they be stuck using old technologies? Food for thought.

I guess some of us are in it for the long haul, and some aren’t. “Can’t see the forest for the trees” comes to mind. “Gold rush” also seems relevant.

I strongly believe general-purpose storage OSes need to be flexible enough to be reasonably adaptable to different underlying media. And storage OSes that are specifically designed for solid state storage need to be especially flexible regarding the underlying SSD technology to avoid the problems outlined above, and to avoid the relative lack of reliability of current SSD solutions (another story for another day).
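
To make that point concrete, here is a minimal sketch – entirely my own illustration, not any vendor's actual code – of what keeping media-specific logic behind a narrow interface might look like, so a new class of solid state doesn't force a rewrite of the whole storage OS:

    from abc import ABC, abstractmethod

    class MediaDriver(ABC):
        """Narrow interface the rest of a (hypothetical) storage OS codes against.
        Media-specific quirks (wear leveling, erase blocks, etc.) stay below it."""

        @abstractmethod
        def read(self, lba: int, length: int) -> bytes: ...

        @abstractmethod
        def write(self, lba: int, data: bytes) -> None: ...

    class NandMlcDriver(MediaDriver):
        """Today's NAND: wear leveling and erase-block awareness live here."""
        def read(self, lba, length):
            return b"\x00" * length   # placeholder for the real media access
        def write(self, lba, data):
            pass                      # wear-leveling logic would go here

    class FutureNvmDriver(MediaDriver):
        """Hypothetical byte-addressable media (MRAM/PCM/etc.): very different
        rules, but the layers above never notice the swap."""
        def read(self, lba, length):
            return b"\x00" * length
        def write(self, lba, data):
            pass

    def flush_to_media(driver: MediaDriver, lba: int, data: bytes) -> None:
        # Upper-layer code depends only on the interface, not the media generation.
        driver.write(lba, data)

    flush_to_media(NandMlcDriver(), lba=0, data=b"hello")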

At the moment I don’t see clear winners yet. I see a few great short-term stories, but who has the most flexible architecture to be able to deal with different kinds of technologies for years to come?

D


Buyer beware: is your storage vendor sizing properly for performance, or are they under-sizing technologies like Megacaching and Autotiering?

With the advent of performance-altering technologies (notice the word choice), storage sizing is just not what it used to be.

I’m writing this post because, more and more, I see some vendors not using scientific methods to size their solutions, instead aiming to reach a price point and hoping the technology will achieve the requisite performance (and if it doesn’t, the box is sold anyway – they can give away some free gear to make the problem go away, or the customer can always buy more, right?).

Back in the “good old days”, with legacy arrays one could (and still can) get fairly deterministic performance by knowing the workload required and, given a RAID type, know roughly how many disks would be needed to maintain the required performance in a sustained fashion, as long as the controller and buses were not overloaded.

With modern systems, there is now a plethora of options that can be used to get more performance out of the array, or, alternatively, get the same average performance as before, using less hardware (hopefully for less money).

If anything, advanced technologies have made array sizing more complex than before.

For instance, Megacaches can be used to dramatically change the I/O reaching the back-end disks of the array. NetApp FAS systems can have up to 16TB of deduplication-aware, ultra-granular (4K) and intelligent read cache. Truly a gigantic size, bigger than the vast majority of storage users will ever need (and bigger than many customers’ entire storage systems). One could argue that with such an enormous amount of cache, one could dispense with most disk drives and instead save money by using SATA (indeed, several customers are doing exactly that). Other vendors are following NetApp’s lead and starting to implement similar technologies — simply because it makes a lot of sense.

However…

It is crucial that, when relying on caching, extra care is taken to size the solution properly, if a reduction in the number and speed of the back-end disks is desired.

You see, caches only work well if they can cache the majority of what’s called the active working set.

Simply put, the working set is not all your data, but the subset of the data you’re “touching” constantly over a period of time. For a customer that has, say, a 20TB Database, the true working set may only be something as small as 5% — enabling most of the active data to fit in 1TB of cache. So, during daily use, a 1TB cache could satisfy most of the I/O requirements of the DB. The back-end disks could comfortably be just enough SATA to fit the DB.

But what about the times when I/O is not what’s normally expected? Say, during a re-indexing, or a big DB export, or maybe month-end batch processing. Such operations could vastly change the working set and temporarily raise it from 5% to something far larger — at which point, a 1TB cache and a handful of back-end SATA may not be enough.
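
A quick worked example (all numbers hypothetical) of why both the steady-state and the exceptional working set matter when sizing a cache:

    # Hypothetical numbers for illustration only.
    db_size_tb = 20.0
    cache_tb = 1.0

    steady_state_ws_tb = 0.05 * db_size_tb   # 5% working set -> 1 TB
    month_end_ws_tb    = 0.25 * db_size_tb   # assume a batch job touches 25% -> 5 TB

    def cacheable_fraction(working_set_tb, cache_tb):
        """Rough upper bound on how much of the active data can live in cache."""
        return min(1.0, cache_tb / working_set_tb)

    print(f"Steady state: {cacheable_fraction(steady_state_ws_tb, cache_tb):.0%} of the working set fits in cache")
    print(f"Month-end:    {cacheable_fraction(month_end_ws_tb, cache_tb):.0%} of the working set fits in cache")
    # During month-end the miss rate explodes, and the handful of back-end SATA
    # drives sized for the steady state becomes the bottleneck.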

Which is why, when sizing, multiple measurements need to be taken, and not just average or even worst-case.

Let’s use a database as an example again (simply because the I/O can change so dramatically with DBs). You could easily have the following I/O types:

  1. Normal use – 20,000 IOPS, all random, 8K I/O size, 80% reads
  2. DB exports — high MB/s, mostly sequential write, large I/O size, relatively few IOPS
  3. Sequential read after random write — maybe data is added to the DB randomly, and then a big sequential read (or many parallel ones) is launched.

You see, the I/O profile can change dramatically. If you only size for case #1, you may not have enough back-end disk to sustain the DB exports or the parallel sequential table scans. If you size for case #2, you may think you don’t need much cache, since the I/O is mostly sequential (and most caches are bypassed for sequential I/O). But that would be totally wrong during normal operation.
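
One way to keep sizing honest is to treat each workload phase as its own line item and check the proposed configuration against every one of them, not just the “biggest”. A simplified sketch, with made-up numbers for the three cases above:

    # Made-up profiles for the DB example above. The point: each phase stresses
    # a different resource, so a single "peak IOPS" figure hides the phases
    # that are MB/s-bound or cache-unfriendly.
    workloads = [
        {"name": "normal OLTP", "iops": 20_000, "mb_s": 160,   "pattern": "random 8K, 80% read"},
        {"name": "DB export",   "iops": 2_000,  "mb_s": 1_200, "pattern": "sequential write, large I/O"},
        {"name": "table scans", "iops": 4_000,  "mb_s": 900,   "pattern": "sequential read after random write"},
    ]

    # A proposed configuration must satisfy the worst case of every column,
    # not just the one that looks most impressive in the proposal.
    required_iops = max(w["iops"] for w in workloads)
    required_mb_s = max(w["mb_s"] for w in workloads)

    print(f"Size for at least {required_iops:,} IOPS AND {required_mb_s:,} MB/s sustained")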

If your storage vendor has told you they sized for what generates the most I/O, then the question is, what kind of I/O was it?

The other new trendy technology (and the most likely to be under-sized) is Autotiering.

Autotiering, simply put, allows moving chunks of data around the array depending on their “heat index”. Chunks that are very active may end up on SSD, whereas chunks that are dormant could safely stay on SATA.

Different arrays do different kinds of Autotiering, mostly based on various underlying architectural characteristics and limitations. For example, on an EMC Symmetrix the chunk size is about 7.5MB. On an HDS VSP, the chunk is about 40MB. On an IBM DS8000, SVC or EMC Clariion/VNX, it’s 1GB.

With Autotiering, just like with caching, the smaller the chunk size, the more efficient the end result will ultimately be. For instance, a 7.5MB chunk could need as little as 3-5% of ultra-fast disk as a tier, whereas a 1GB chunk may need as much as 10-15%, due to the larger chunk containing not-very-active data mixed together with the active data.

Since most arrays write data with geometric locality of reference (in contrast, NetApp uses geometric and temporal), with large-chunk autotiering you end up with “hot” pieces of data permanently occupying the same chunk as neighboring “cool” pieces of data. This explains why the smaller the chunk, the better off you are.
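
A rough model of the effect (the hot-spot count and sizes are purely illustrative assumptions of mine; only the chunk sizes come from the vendors above):

    import math

    # Purely illustrative: the LUN's hot data is clustered into "hot spots", and
    # autotiering must promote every chunk a hot spot touches, dragging along
    # whatever cold data happens to share those chunks.
    lun_gb    = 1_000
    hot_spots = 200      # assumed number of distinct hot regions
    spot_mb   = 256      # assumed size of each hot region (~50 GB hot in total)

    def fast_tier_gb(chunk_mb):
        chunks_per_spot = math.ceil(spot_mb / chunk_mb)   # ignores boundary straddling
        promoted_gb = hot_spots * chunks_per_spot * chunk_mb / 1024
        return min(promoted_gb, lun_gb)

    for label, chunk_mb in (("~7.5MB (Symmetrix)", 7.5),
                            ("~40MB (HDS VSP)", 40),
                            ("1GB (Clariion/VNX)", 1024)):
        print(f"{label:>20} chunks -> ~{fast_tier_gb(chunk_mb):.0f} GB promoted for ~50 GB of truly hot data")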

So, with a large chunk, this can happen:

[Image: Slide1]

The array will try to cache as much as it can, then migrate chunks if they are consistently busy or not. But the whole chunk has to move, not just the active bits within the chunk… which may be just fine, as long as you have enough of everything.

So what can you do to ensure correct sizing?

There are a few things you can do to make sure you get accurate sizing with modern technologies.

  1. Provide performance statistics to vendors — the more detailed the better. If we don’t know what’s going on, it’s hard to provide an engineered solution.
  2. Provide performance expectations — i.e. “I want Oracle queries to finish in 1/4th the time compared to what I have now” — and tie those expectations to business benefits (makes it easier to justify).
  3. Ask vendors to show you their sizing tools and explain the math behind the sizing — there is no magic!
  4. Ask vendors if they are sizing for all the workloads you have at the moment (not just different apps but different workloads within each app) — and how.
  5. Ask them to show you what your working set is and how much of it will fit in the cache.
  6. Ask them to show you how your data would be laid out in an Autotiered environment and what bits of it would end up on what tier. How is that being calculated? Is the geometry of the layout taken into consideration?
  7. Do you have enough capacity for each tier? On Autotiering architectures with large chunks, do you have 10-15% of total storage being SSD?
  8. Have the controller RAM and CPU overheads due to caching and autotiering been taken into account? Such technologies do need extra CPU and RAM to work. Ask to see the overhead (the smaller the Autotiering chunk size, the more metadata overhead, for example). Nothing is free.
  9. Beware of sizings done verbally or on cocktail napkins, calculators, or even spreadsheets – I’ve yet to see a spreadsheet model storage performance accurately.
  10. Beware of sizings of the type “a 15K disk can do 180 IOPS” — it’s a lot more complicated than that!
  11. Understand the difference between sequential, random, reads, writes and I/O size for each proposed architecture — the differences in how I/O is done depending on the platform are staggering and can result in vastly different disk requirements — making apples-to-apples comparisons challenging.
  12. Understand the extra I/O and capacity impact of certain CDP/Replication devices — it can be as much as 3x, and needs to be factored in.
  13. What RAID type is each vendor using? That can have a gigantic performance impact on write-intensive workloads (in addition to the reliability aspect).
  14. If you are getting unbelievably low pricing — ask for a contract ensuring upgrade pricing will be along the same lines. “The first hit is free” is true in more than one line of business.
  15. And, last but by no means least — ask how busy the proposed solution will be given the expected workload! It surprises me that people will try to sell a box that can do the workload but will be 90% busy doing so. Are you OK with that kind of headroom? Remember – disk arrays are just computers running specialized software and hardware, and as such their CPU can run out of steam just like anything else. (The short sketch after this list shows why 90% busy is asking for trouble.)
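
On that last point, a simple queueing sketch shows why 90% busy leaves no real headroom. M/M/1 is a crude model for a disk array, and the service time below is an arbitrary assumption, but the shape of the curve is what matters:

    # Crude M/M/1 illustration: average response time = service_time / (1 - utilization).
    # The absolute numbers are made up; the blow-up near 100% busy is the point.
    service_time_ms = 2.0

    for utilization in (0.50, 0.70, 0.90, 0.95):
        response_ms = service_time_ms / (1.0 - utilization)
        print(f"{utilization:.0%} busy -> ~{response_ms:.0f} ms average response time")

A box quoted at 90% busy for the expected workload has nowhere to go when the workload grows – or even just spikes.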

If this all seems hard — it’s because it is. But see it as due diligence — you owe it to your company, plus you probably don’t want to be saddled with an improperly-sized box for the next 3-5 years, just because the offer was too good to refuse…

D

 


Stack Wars: The Clone Wars

It seems that everyone and their granny is trying to create some sort of stack offering these days. Look at all the brouhaha – HP buying 3Par, Dell buying Compellent, all kinds of partnerships being formed left and right. Stacks are hot.

To the uninitiated, a stack is what you can get when a vendor is able to offer multiple products under a single umbrella. For instance, being able to get servers, an OS, a DB, an email system, storage and network switches from a single manufacturer (not a VAR) is an example of a single-sourced stack.

The proponents of stacks maintain that with stacks, customers potentially get simpler service and better integration – a single support number to call, “one throat to choke”, no finger-pointing between vendors. And that’s partially true. A stack potentially provides simpler access to support. On the “better integration” part – read on.

My main problem with stacks is that nobody really offers a complete stack, and those that are more complete than others, don’t necessarily offer best-of-breed products (not even “good enough”), nor do they offer particularly great integration between the products within the stack.

Personal anecdote: a few years ago I had the (mis)fortune of being the primary backup and recovery architect for one of the largest airlines in the world. Said airline had a very close relationship with a certain famous vendor. Said vendor only flew with that airline, always bought business class seats, the airline gave the vendor discounts, the vendor gave the airline discounts, and in general there was a lot of mutual back-scratching going on. So much so that the airline would give that vendor business before looking at anyone else, and only considered alternative vendors if the primary vendor didn’t have anything that even smelled like what the airline was looking for.

All of which resulted in the airline getting a backup system that was designed for small businesses since that’s all that vendor had to offer (and still does).

The problem is, that backup product simply could not scale to what I needed it to. I ended up having to stop file-level logging and could only restore entire directories since the backup database couldn’t handle the load, despite me running multiple instances of the tool for multiple environments. Some of those directories were pretty large, so you can imagine the hilarity that ensued when trying to restore stuff…

The vendor’s crack development team came over from overseas and spent days with me trying to figure out what they needed to change about the product to make it work in my environment (I believe it was the single largest installation they had).

Problem is, they couldn’t deliver soon enough, so, after much pain, the airline moved to a proper enterprise backup system from another vendor, which fixed most problems, given the technology I had to work with at the time.

Had the right decision been made up front, none of that pain would have been experienced. The world would have been one IT anecdote short, but that’s a small price to pay for a working environment. And this is but just one way that single vendor stacks can fail.

How does one decide on a stack?

Let’s examine a high-level view of a few stack offerings. By no means an all-inclusive list.

Microsoft: They offer an OS (covering everything from servers to phones), a virtualization engine, a DB, a mail system, a backup tool and the most popular office apps in the world, among many other things. Few will argue that all the bits are best-of-breed, despite being hugely popular. Microsoft doesn’t like playing in the hardware space, so they’re a pure software stack. Oh, there’s the Xbox, too.

EMC: Various kinds of storage platforms (7-10 depending on how you count), all but one (Symmetrix) coming from acquisition. A virtualization engine (80% owner of VMware – even though it’s run as a totally separate entity). DB, many kinds of backup, document management, security and all other kinds of software. Some bits are very strong, others not so much.

Oracle: They offer an OS, a virtualization engine, a DB, middleware, servers and storage. An office suite. No networking. Oracle is a good example of an incomplete software/hardware stack. Aside from the ultra-strong DB and good OS, few will say their products are all best-of-breed.

Dell: They offer servers, desktops, laptops, phones, various flavors of storage, switches. Dell is an example of a pure hardware stack. Not many software aspirations here. Few will claim any of their products are best-of-breed.

HP: They offer servers, desktops, laptops, phones, even more flavors of storage, a UNIX OS with its own type of virtualization (can’t run x86), switches, backup software, a big services arm, printers, calculators… All in all, great servers, calculators and printers; not so sure about the rest. Fairly complete stack.

IBM: Servers, 2 strong DBs, at least 3 different OSes, CPUs, many kinds of storage, an email system, middleware, backup software, an immense services arm. No x86 virtualization (they do offer virtualization for their other platforms). Very complete stack, albeit without networking.

Cisco: All kinds of networking (including telephony), servers. Limited stack if networking is not your bag, but what it offers can be pretty good.

Apple: Desktops, laptops, phones, tablets, networking gear, software. Great example of a consumer-oriented hardware and software stack. They used to offer storage and servers but they exited that business.

Notice anything common about the various single-vendor stacks? Did you find a stack that can truly satisfy all your IT needs without giving anything up?

The fact of the matter is that none of the above companies, as formidable as they are, offers a complete stack – software and hardware. You can get some of the way there, but it’s next to impossible to single-source everything without shooting yourself in the foot. At a minimum, you’re probably looking at some kind of Microsoft stack + server/storage stack + networking stack – mixing a minimum of 3 different vendors to get what you need without sacrificing much (the above example assumes you either don’t want to virtualize or are OK with Hyper-V).

Most companies have something like this: Microsoft stack + virtualization + DB + server + storage + networking – 6 total stacks.

So why do people keep pushing single-vendor stacks?

Only a few valid reasons (aside from it being fashionable to talk about). One of them is control – the more stuff you have from a company, the tighter their hold on you. The other is that it at least limits the support points you have to deal with, and can potentially get you better pricing (theoretically). For instance, Dell has “given away” many an Equallogic box. Guess what – the cost of that box was blended into everything else you purchased; it’s all a shell game. But if someone does buy a lot of gear from a single vendor, there are indeed ways to get better deals. You just won’t necessarily get best-of-breed or even good-enough gear.

What about integration?

One would think that buying as much as possible from a single vendor gets you better integration between the bits. Not necessarily. For instance, most large vendors acquire various technologies and/or have OEM deals – if one looks just at storage as an example, Dell has their own storage (Equallogic and Compellent – two different acquisitions) plus an OEM deal with EMC. There’s not much synergy between the various technologies.

HP has their own storage (EVA, Lefthand, Ibrix, Polyserve, 3Par – four different acquisitions) and two OEM deals with Dot Hill for the MSA boxes and HDS for the high-end XP systems. That’s a lot of storage tin to keep track of (all of which comes from 7 different places and 7 totally different codebases), and any HP management software needs to be able to work with all of those boxes (and doesn’t).

IBM has their own storage (XIV, DS6000, DS8000, SONAS, SVC – I believe three homegrown and two acquisitions) and two different OEM deals (NetApp for N Series and LSI Logic for DS5000 and below). The integration between those products and the rest of the IBM landscape should be examined on a case-by-case basis.

EMC’s challenge is that they have acquired too much stuff, making it difficult to provide proper integration for everything. Supporting and developing for that plethora of systems is not easy and teams end up getting fragmented and inconsistent. Far too many codebases to keep track of. This dilutes the R&D dollars available and prolongs development cycles.

Aren’t those newfangled special-purpose multi-vendor stacks better?

There’s another breed of stack, the special-purpose one where a third party combines gear from different vendors, assembles it and sells it as a supported and pre-packaged solution for a specific application. Such stacks are actually not new – they have been sold for military, industrial and healthcare applications for the longest time. Recently, NetApp and EMC have been promoting different versions of what a “virtualization stack” should be (as usual, with very different approaches – check out FlexPod and Vblock).

The idea behind the “virtualization stack” is that you sell the customer a rack that has inside it network gear, servers, storage, management and virtualization software. Then, all the customer has to do is load up the gear with VMs and off they go.

With such a stack, you don’t limit the customer by making them buy their gear all from one vendor, but instead you limit them by pre-selecting vendors that are “best of breed”. Not everyone will be OK with the choice of gear, of course.

Then there’s the issue of flexibility – some of the special-purpose stacks literally are like black boxes – you are not supposed to modify them or you lose support. To the point where you’re not allowed to add RAM to servers or storage to arrays, both limitations that annoy most customers, but are viewed as a positive by some.

Is it a product or a “kit”?

Back to the virtualization-specific stacks: this is the main argument – do you buy a ready-made “product”, or a “kit” some third party assembles after following a detailed design guide? As of this writing (there have been multiple changes to how this is marketed), Vblock is built by a company known as VCE – effectively a third party that puts together a custom stack made of different kinds of EMC storage, Cisco switches and servers, VMware, and a management tool called UIM. It is not built by EMC, VMware or Cisco. VCE then resells the assembled system to other VARs or directly to customers.

NetApp’s FlexPod is built by VARs. The difference is that more than one VAR can build FlexPods (as long as they meet some specific criteria) and don’t need to involve a middleman (also translating to more profits for VARs). All VARs building FlexPods need to follow specific guidelines to build the product (jointly designed by VMware, Cisco and NetApp), use components and firmware tested and certified to work together, and add best-of-breed management software to manage the stack.

The FlexPod emphasis is on sizing and performance flexibility (from tiny to gigantic), Secure Multi Tenancy (SMT – a unique differentiator), space efficiency, application integration, extreme resiliency and network/workload isolation – all highly important features in virtualized environments. In addition, it supports non-virtualized workloads.

Ultimately, in both cases the customer ends up with a pre-built and pre-tested product.

What about support?

This has been both the selling point and the drawback of such multi-vendor stacks. In my opinion, it has been the biggest selling point for Vblock, since a customer calls VCE for support. VCE has support staff that is trained on Cisco, VMware and EMC and can handle many support cases via a single support number – obviously a nice feature.

Where this breaks down a bit: VCE has to engage VMware, EMC and Cisco for anything that’s serious. Furthermore, Vblock support doesn’t support the entire stack but stops at the hypervisor.

For instance, if a customer hits an Enginuity (Symmetrix OS) bug, then the EMC Symm team will have to be engaged, and possibly write a patch for the customer and communicate with the customer. VCE support simply cannot fix such issues, and is best viewed as first-level support. Same goes for Cisco or VMware bugs, and in general deeper support issues that the VCE support staff simply isn’t trained enough to resolve. In addition, Vblocks can be based on several different kinds of EMC storage, that itself requires different teams to support it.

Finally – ask VCE if they are legally responsible for maintaining the support SLAs. For instance, who is responsible if there is a serious problem with the array and the vendor takes 2 days to respond instead of 30 minutes?

FlexPod utilizes a cooperative support model between NetApp, Cisco and VMware, and cases are dealt with by experts from all three companies working in concert. The first company to be called owns the case.

When the going gets tough, both approaches work similarly. For easy cases that can be resolved by the actual VCE support personnel, Vblock probably has an edge.

Who needs the virtualization stack?

I see several kinds of customers that have a need for such a stack (combinations are possible):

  1. The technically/time constrained. For them, it might be easier to get a somewhat pre-integrated solution. That way they can get started more quickly.
  2. The customers needing to hide behind contracts for legal/CYA reasons.
  3. The large. They need such huge amounts of servers and storage, and so frequently, that they simply don’t have the time to put it in themselves, let alone do the testing. They don’t even have time to have a PS team come onsite and build it. They just want to buy large, ready-to-go chunks, shipped as ready to use as possible.
  4. The rapidly growing.
  5. Anyone wanting pre-tested, repeatable and predictable configurations.

Interesting factoid for the #3 case (large customer): They typically need extensive customizations and most of the time would prefer custom-built infrastructure pods with the exact configuration and software they need. For instance, some customers might prefer certain management software over others, and/or want systems to come preconfigured with some of their own customizations in-place – but they are still looking for the packaged product experience. FlexPod is flexible enough to allow that without deviating from the design. Of course, if the customer wants to dramatically deviate (i.e. not use one of the main components like Cisco switches or servers, for instance) – then it stops being a FlexPod and you’re back to building systems the traditional way.

What customers should really be looking for when building a stack?

In my opinion, customers should be looking for a more end-to-end experience.

You see – even with the virtualization stack in place, you will need to add:

  • OSes
  • DBs
  • Email
  • File Services
  • Security
  • Document Management
  • Chargeback
  • Backup
  • DR
  • etc etc.

You should partner with someone that can help you not just with the storage/virtualization/server/network stack, but also with:

  • Proper alignment of VMs to maintain performance
  • Application-level integration
  • Application protection
  • Application acceleration

In essence, treat things holistically. The vendor that provides your virtualization stack needs to be able to help you all the way to, say, configuring Exchange properly so it doesn’t break best practices, and ensuring that firmware revs on the hardware don’t clash with, say, software patch levels.

You still won’t avoid support complexity. Sure, maybe you can have Microsoft do a joint support exercise with the hardware stack VAR, but, no matter how you package it, you are going to be touching the support of the various vendors making up the entire stack.

And you know what?

It’s OK.

D

 


 

Single wire and single OS: yet another way to tell true unified storage from the rest

This is going to be a mercifully short entry. I’m saving the big one for another day :)

One of the features of NetApp storage is that, by using Converged Network Adapters (CNAs), one can use a single wire and transport FC, iSCSI, NFS and CIFS over it, all at the same time.

You see, since NetApp storage is truly unified, we don’t need cables coming out of 5 different boxes running 3 or more different OSes to do something (which is what, say, a certain competitor’s “unified” box is like – actually it’s even more boxes if one counts the external replication devices).

You might say “OK, that’s cool but how does it affect my bottom line?”

Just a few benefits that immediately come to mind:

  1. Far fewer cables to run in your datacenter for both storage and all the servers (each server needs 2 cables for redundancy vs 4 or more)
  2. No compromises since there’s no need to be forced to choose between iSCSI, NAS and FC – each server can happily use whatever’s best for the task at hand yet retain the exact same connectivity
  3. Fewer switches (no need for both FC and Ethernet switches)
  4. Less OpEx since it’s a simpler solution to manage
  5. Very high speeds (each link is 10Gbit) and low latency (FCoE is similar to FC – no need to do iSCSI if the same link can do both)
  6. Overall a far simpler and cleaner Datacenter

The other part is also important: single OS. Inherently, something running a single OS has 3x fewer moving parts than something running 3 totally different OSes, regardless of packaging.

Here are some cool throughput results. Line speeds :)

One can dance around such concepts with marchitecture and fancy Powerpoint slides, but, in the end, just use your head. It’s pretty simple.

Food for thought…

D

NetApp posts new SPEC SFS NFS results – far faster than V-Max with Celerra VG8

Following the new NetApp block-based SPC-1 results yesterday, here is some NAS benchmark action. This page contains all the SPEC SFS results. SPEC SFS is the NAS equivalent of SPC-1.

SPEC SFS is more cache-friendly than the brutal SPC-1, click here for some more information regarding this industry-standard NAS benchmark. The idea is that thousands of CIFS and NFS servers have been profiled and the benchmark reflects real-life NAS usage patterns.

In the same vein as the SPC-1 benchmarks, the configurations we submit to the standard benchmarking authorities are based on realistic systems customers could buy, not $7m lab queens. So, NetApp SPEC and SPC submissions:

  • Are always tested with RAID-DP (RAID-6 protection equivalent) – other vendors test with RAID10 most of the time, and never with RAID-6 (ask them why this is, BlueArc gets respect for being the only other one in the list doing our level of protection)
  • Have a target of using the most cost-effective configuration possible
  • Provide not just high IOPS but also very low latency
  • Are a realistic, deployable configuration, not just the fastest box we have (we still have the 1 million SPEC ops record for a 24-node system, that’s kind of pricy plus the result is old and can’t be compared with the current benchmark code – still, look at the rankings).

So, with those lofty goals in mind, we have 3 new submissions:

  1. CIFS benchmark, 3210 w/ SATA drives – typical low/mid-range system
  2. NFS benchmark, 3270 w/ SAS drives – typical mid-range system, no Flash Cache used in this one.
  3. NFS benchmark, 6240 w/ SAS drives – typical high-end (but not highest) system.

All NetApp systems included some Flash Cache memory boards to provide further acceleration (EDIT: aside from the 3270). We have an even faster system (6280) that we will be submitting later on as a special treat (there’s a certain degree of red tape and ceremony to even do one submission…)

Here’s an abbreviated chart in easily digestible form – showing the most recent results from perennial rivals NetApp and EMC (BTW – of all the systems in the chart, only one of them is truly unified and can provide block and NAS on the same architecture without the need for contortions).

System | Result (higher is better) | Overall Response Time (lower is better) | # Disks | Exported Capacity in TB | RAID | Protocol
NetApp 3210 | 64292 | 1.50 | 144x 1TB SATA | 87 | RAID-DP | CIFS
NetApp 3270 | 101183 | 1.66 | 360x 15K RPM 450GB SAS | 110 | RAID-DP | NFS
NetApp 6240 | 190675 | 1.17 | 288x 15K RPM 450GB SAS | 85 | RAID-DP | NFS
EMC NS-G8 on V-Max | 118463 | 1.92 | Bunch o’ SSD (96 fancy STEC 400GB ZeusIOPS) | 17 | RAID-10 | CIFS
EMC NS-G8 on V-Max | 110621 | 2.32 | Bunch o’ SSD (96 fancy STEC 400GB ZeusIOPS) | 17 | RAID-10 | NFS
EMC VG8 on V-Max | 135521 | 1.92 | 312x 15K RPM 450GB FC | 19 | RAID-10 | NFS

Guide to reading the chart, and lessons learned:

  • A “puny” NetApp 3210 with SATA gets better overall response time than an all-SSD V-Max costing well over 10x
  • Notice the amount of usable space on NetApp systems, with even better protection than RAID10
  • The 6240 scored far higher even though it had fewer disks
  • The NetApp systems have “just” 2 controllers that do everything, vs. the EMC submissions with 4 V-Max engines, plus extra Celerra Data Movers and Control Stations on top. What do you think is more efficient?


I do have some questions to ask certain other vendors as a parting shot:

  1. Sun/Oracle – you keep saying your new boxes are a cheaper way to get NetApp-type functionality, you’ve had them for a while, why not submit to SPEC or SPC? (there is not a single SPEC result from Sun).
  2. EMC – maybe show the world how a system not based on V-Max runs? With RAID-6? (Even V-Max with RAID6, no problem…)
  3. EMC: What’s the deal with the exported capacity, even with 312x drives?
  4. All of you with large striped pools of RAID5 – have you bothered explaining to your customers what will happen to the pool if you have a dual-drive failure in any RAID group? Unacceptable.

D

A look at EMC’s FASTv2, FAST Cache and FLARE30 – EMC giveth, EMC taketh away

[Update: some grammar mistakes fixed and a few questions added]

Before anyone starts frothing at the mouth, notice that one of the categories this post is filed under is FUD :) Always do your own analysis… I just wanted to give people some food for thought, like I did when FASTv1 came out. I didn’t make this up; it’s all based on various EMC documents available online. I advise people looking at this technology to ask for extensive documentation regarding best practices before taking the leap.

As a past longtime user and sometimes pusher of EMC gear, some of the enhancements in FASTv2 seemed pretty cool to me, and potentially worrisome from a competitive standpoint. So I decided to do some reading to see how cool the new technology really is.

Summary of the new features:

  • Large heterogeneous pools (a single pool can consist of different drive types and encompass all drives in a box minus the vault and spares)
  • FASTv2 – sub-LUN movement of hot or cold chunks of data for auto-tiering between drive types
  • FAST Cache – add plain SSDs as cache
  • Much-touted feature: ability to use SSD as a write cache
  • LUN compression
  • Thin LUN space reclamation

It all sounds so good and, if functional, could bring Clariions to parity with some of the more advanced storage arrays out there. However, some examination of the features reveals a few things (I’m sure my readers will correct any errors). In no particular order:

EMC now uses a filesystem

It finally had to happen: thin LUN pools, at the very least, live on a filesystem laid on top of normal RAID groups (and I suspect all new pools on both Symm and CX now live on a filesystem). So is it real FC or some hokey emulation? Not that it matters if it provides useful functionality impossible to achieve otherwise – it’s just an about-face. But how mature is this new filesystem? Does it automatically defragment itself, or at least provide tools for manual defragmentation? Filesystem design is not trivial.

LUN compression

  1. Best practices indicate compression should not be used for anything I/O intensive and is best suited for static workloads (i.e. not good for VMs or DBs). However, new data is compressed as a post-process, which theoretically doesn’t penalize new writes – which I find interesting. Also: What happens with overwrites? Do compressed blocks that need to be overwritten get uncompressed and re-laid down uncompressed until the next compression cycle? Do the blocks to be overwritten get overwritten in their original place or someplace new? What happens with fragmentation? It all sounds so familiar :)
  2. The read performance hit is reported to be about 25% – makes sense since the CPU has to work harder to uncompress the data.
  3. Turning on compression for an existing traditional LUN means the LUN will need to be migrated to a thin LUN in a pool (not converted, migrated – indeed, you need to select where the new LUN will go). Not an in-place operation, it seems.
  4. Does data need to be migrated to a lower tier in order to be compressed?
  5. It follows that you need enough space for the conversion to take place… (can you do more than one in parallel? If so, quite a bit of extra space will be needed).
  6. How does this work with external replication engines like RecoverPoint? Does data need to be uncompressed? (probably counts as a normal “read” operation which will uncompress the data).
  7. Does this kind of compression mess with alignment of, say, VMs? This could have catastrophic consequences regarding performance of such workloads…

Thin LUN space reclamation

  1. Another case where migration from thick to thin takes place (doesn’t seem like the LUN is converted in-place to thin)
  2. Unclear whether an already thin LUN that has temporarily ballooned in size can have its space reclaimed (NetApp and a few other arrays can actually do this). You see, LUNs don’t only grow in size… several operations (i.e. MS Exchange checking) can cause a LUN to temporarily expand in space consumption, then go back down to its original size. Thin provisioning is only truly useful if it can help the LUN remain thin :)

Dual-drive ownership, especially when it pertains to pool LUNs

Dual-drive ownership is not strictly a new feature, but the best practice is for a single CX controller (SP) to own a drive, and not have it shared. Furthermore, with pool LUNs, if you change the controller ownership of a pool LUN, I/O will see much higher latencies – it’s recommended to do a migration to a new LUN controlled by the other SP (yet another scenario that needs migration). I’m mentioning this since EMC likes to make a big deal about how both controllers can use all drives at the same time… obviously this is not nearly as clean as it’s made to appear. The Symmetrix does it properly.

Metadata used per thin LUN

3GB is the minimum space a thin LUN will occupy due to metadata and other structures. Indeed, LUN space is whatever you allocate plus another 3GB. Depending on how many LUNs you want to create, this can add up, especially if you need many small LUNs – 200 small LUNs, for example, would burn an extra 600GB on metadata alone.

Loss of performance with thin LUNs and pools in general

It’s not recommended to use pools and especially thin LUNs for performance-sensitive apps, and in general old-style LUNs are recommended for the highest performance, not pools. Which is interesting, since most of the new features need pools in order to work… I heard 30% losses if thin LUNs are used in particular, but that’s unconfirmed. I’m sure someone from EMC can chime in.

Expansion, RAID and scalability caveats with pools

  1. To maintain performance, you need to expand the pool by adding as many drives as the pool already has – I suspect this has something to do with the way data is striped. This could cause issues as the system gets larger (who will really expand a CX4-960 by 180 drives a pop? Because best practices state that’s what you have to do if you start with 180 drives in the pool).
  2. Another thing that’s extremely unclear is how data is load-balanced among the pool drives. Most storage vendors are extremely open about such things. All I could tell is that there are maximum increments at which you can add drives to a pool, ranging from 40 on a CX4-120 to 180 on a CX4-960. Since a pool can theoretically encompass all drives aside from vault and spares, does this mean that striping happens in groups of 180 in a CX4-960 and if you add another 180 that’s another stripe and the stripes concatenated?
  3. What if you don’t add drives by the maximum increment, and you only add them, say, 30 at a time? What do you give up, if anything?
  4. RAID6 is recommended for large pools (which makes total sense since it’s by far the most reliable RAID at the moment when many drives are concerned). However, RAID6 on EMC gear has a serious write performance penalty. Catch-22?

FAST Cache (includes being able to cache writes)

  1. Cache only on or off for the entire pool, can’t tune it per LUN (can only be turned on/off per LUN if old-style RAID groups and LUNs are used).
  2. 64KB block size (which means that a hot 4K block will still take 64K in cache – somewhat inefficient; see the sketch after this list).
  3. A block will only be cached if it’s hit more than twice. Is that really optimal for the best hit rate? Can it respond quickly to a rapidly changing working set?
  4. Unclear set associativity (important for cache efficiency).
  5. No option to automatically optimize for sequential read after random write workloads (many DB workloads are like that).
  6. Flash drives aren’t that fast for writes as confirmed by EMC’s Barry Burke (the Storage Anarchist) in his comment here and by Randy Loeschner here. Is the write benefit really that significant? Maybe for Clariions with SATA, possibly due to the heavy RAID write penalties, especially with RAID6.
  7. It follows that highly localized overwrites could be significantly optimized since the Clariion RAID suffers a great performance degradation with overwrites, especially with RAID6 (something other vendors neatly sidestep).
  8. EMC Clariions don’t do deduplication so the cache isn’t deduplicated itself, but is it at least aware of compression? Or do blocks have to be uncompressed in cache? Either way, it’s a lot less efficient than NetApp Flash Cache for environments where there’s a lot of block duplication.
  9. The use of standard SSDs versus a custom cache board is a mixed blessing – by definition, there will be more latency. At the speeds these devices are going, those latencies add up (since it’s added latency per operation, and you’re doing way more than one operation). All high-end arrays add cache in system boards, not with drives…
  10. Smaller Clariions have severely limited numbers of flash drives that can be used for caching (2-8 depending on the model, with the smaller ones only able to use very small cache drives). Only the CX4-960 can do 20 mirrored cache drives, which I predict will provide good performance even for fairly heavy write workloads. However, that will come at a steep price. The idea behind caches like NetApp’s Flash Cache is to reduce costs.
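
To quantify the granularity point from item 2 above – worst case, with an assumed cache size and hot blocks scattered so that each one lands in its own cache line:

    # Worst-case illustration of cache granularity (assumed cache size):
    # every hot 4K block occupies its own 64K cache line, so only 1/16th of
    # the cache ends up holding data anyone actually asked for.
    cache_gb      = 1_000
    cache_line_kb = 64
    hot_block_kb  = 4

    cache_lines      = cache_gb * 1024 * 1024 // cache_line_kb
    effective_hot_gb = cache_lines * hot_block_kb / 1024 / 1024

    print(f"{cache_gb} GB of cache at {cache_line_kb}K granularity holds "
          f"~{effective_hot_gb:.0f} GB of scattered 4K hot blocks (worst case)")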

For a very detailed discussion regarding megacaches in general read here.

I can see FAST Cache helping significantly on a system with lots of SATA in a well-configured CX4-960. And I can definitely see it helping with heavy read workloads that have good locality of reference, since SSDs are very good for reads.

And finally, the pièce de résistance,

FASTv2

This is EMC’s sub-LUN auto-tiering feature. Meaning that a LUN is chopped up into 1GB chunks, and that the 1GB chunks move to slower or faster disks depending on how heavily accessed they are. The idea being that, after a little while, you will achieve steady state and end up with the most appropriate data on the most appropriate drives.

Other vendors (most notably Compellent and now also 3Par, IBM and HDS) have some form of that feature (Compellent pioneered this technology and has the smallest possible chunks I believe at 512KB).

The issues I can see with the CX approach of FASTv2:

  1. Gigantic 1GB slice to be moved. EMC admits this was due to the Clariion not being fast enough to deal with the increased metadata of many smaller slices (the far more capable Symmetrix can do 768KB per slice, offering far more granularity). It follows that the bigger the slice the less optimal the results are from an efficiency standpoint.
  2. All RAID groups within the pool have to be of the same RAID type (i.e, RAID6). So you can’t have, say, SATA as RAID6 and SSD as RAID5 in the same pool. Important since RAID6 on most arrays has a big performance impact.
  3. Unknown performance impact for keeping track of the slices (possibly the same as using thin provisioning – 30% or so?)
  4. The most important problem in my opinion: Too much data can end up in expensive drives. For instance, imagine a 1TB DB LUN. That LUN will be sliced into 1,000x 1GB chunks. Unless the hotspots of the DB are extremely localized, even if a few hundred blocks are busy per slice, that entire slice will get migrated to SSD the next day (it’s a scheduled move). Now imagine if, say, half the slices have blocks that are deemed busy enough – half the LUN (512GB in this example) will be migrated to SSD, even if the hot data in those slices were more like 5GB (say a 10% working set size, quite typical). Clearly, this is not the most effective use of fast disks. EMC has hand-waved this objection away in the past, but if it’s not important, why does the Symmetrix go with the smaller slice?
  5. Extremely slow transactional performance for the data that has been migrated to SATA, especially with RAID6 – EMC says you need to pair this with FAST Cache, which makes sense… Of course, come next day that data will move to SSD or FC drives, but will that be fast enough? Policies will have to be edited and maintained per application (often removing the auto-tiering by locking an app at a tier), which removes much of the automation on offer.
  6. The migration is I/O intensive, and we’re talking about migrations of 1GB slices (on a large array, many thousands of them). What does that mean for the back-end? After all, once a day all the migrations need to be processed… and will need to contend with normal I/O activity. (Some rough arithmetic follows this list.)
  7. Doesn’t support compressed blocks, data needs to be uncompressed in order to be moved.
  8. I still think this technology is most applicable to fairly predictable, steady workloads with good locality of reference.
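
Putting some rough numbers on the migration overhead from point 6 (every figure below is an assumption of mine, just to show the order of magnitude):

    # Assumed figures only: back-end bandwidth consumed by scheduled slice
    # relocations. Each relocated 1GB slice is a 1GB read from the source tier
    # plus a 1GB write to the destination tier, on top of production I/O.
    slices_moved_per_day = 2_000     # assumed churn on a mid-size pool
    slice_gb             = 1
    migration_window_hrs = 8         # assumed nightly relocation window

    data_moved_gb   = slices_moved_per_day * slice_gb * 2   # read + write
    throughput_mb_s = data_moved_gb * 1024 / (migration_window_hrs * 3600)

    print(f"~{data_moved_gb:,} GB shuffled -> ~{throughput_mb_s:.0f} MB/s sustained "
          f"for {migration_window_hrs} hours, competing with normal I/O")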

Messaging inconsistencies

As I’ve mentioned before, I don’t have an issue with EMC’s technology, merely with the manner in which the capabilities and restrictions are messaged (or not, as the case may be). For instance, I’ve seen marketing announcements and blog entries talking about doing VMware on thin LUNs with compression etc. – sure, that could be space-efficient, but will it also be fast?

Now that the limitations of the new features are more understood, EMC’s marketing message loses some of its punch.

  • Will compression really work with busy VMware or DB setups?
  • Will thin LUNs be OK for busy systems?
  • Unless 20 disks are used for FAST Cache (only with a CX4-960), is the performance really enough to accelerate highly random writes on large systems?
  • What is the performance impact of thin LUNs for highly-intensive workloads?
  • What is the performance of a large system all running RAID6?
  • Last but not least – does the filesystem EMC uses allow defragmentation? By definition, features such as thin provisioning, compression and FAST will create fragmentation.

Moreover – what all the messaging lacks is some comparison to other people’s technology. Showing a video booting 1000 VMs in 50 minutes where before it took 100 is cool until you realize others do it in 12.

And why is EMC (I’m picking on them since they’re the most culpable in this aspect) ridiculing technologies such as NetApp’s Flash Cache and Compellent’s Data Motion only to end up implementing similar technologies and presenting things to the world as if they are unique in figuring this out? “You see, none of the other guys did it right, now that we did it it’s safe”.

Too many of the new features are extremely obscure in their design, if storage professionals can’t easily figure them out, how is the average consumer expected to? I think more openness is in order, otherwise it just looks like you have something to hide.

Ultimately – the devil is in the details, so why would you have to choose between space OR performance, and not be able to optimize space utilization AND performance?

I think it has to do with the original design of your storage system. Not all systems lend themselves to advanced features because of the way they started.

But that’s a subject for another day.

D

Pillar claiming their RAID5 is more reliable than RAID6? Wizardry or fiction?

Competing against Pillar at an account. One of the things they said: that their RAID5 is superior in reliability to RAID6. I wanted to put this in the public domain and, if true, invite Pillar engineers to comment here and explain how it works for all to see. If untrue, again I invite the Pillar engineers to comment and explain why it’s untrue.

The way I see it: very simply, RAID5 is N+1 protection, RAID6 is N+2. Mathematically, RAID5 is about 4,000 times more likely to lose data than a RAID6 group with the same number of data disks. Even RAID10 is about 160 times more likely to lose data than RAID6.
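
A simplified way to see where ratios of that magnitude come from – this ignores unrecoverable read errors during rebuild (which make RAID5 look even worse), and every input below is an illustrative assumption:

    # Simplified data-loss model for one RAID group, ignoring read errors
    # during rebuild. Given a first drive failure, RAID5 loses data if any one
    # of the surviving drives fails during the rebuild window; RAID6 needs yet
    # another failure in that same window.
    drives_per_group = 8          # assumed group size
    mtbf_hours       = 500_000    # assumed per-drive MTBF
    rebuild_hours    = 24         # assumed rebuild time on a loaded system

    p_fail_in_rebuild = rebuild_hours / mtbf_hours

    p_raid5_loss = (drives_per_group - 1) * p_fail_in_rebuild
    p_raid6_loss = p_raid5_loss * (drives_per_group - 2) * p_fail_in_rebuild

    print(f"RAID5 is roughly {p_raid5_loss / p_raid6_loss:,.0f}x more likely "
          f"to lose data than RAID6 (given a first drive failure)")

Change the MTBF, rebuild time or group size and the exact ratio moves around, but it stays in the thousands – which is why a claim that RAID5 beats RAID6 on reliability needs some serious explaining.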

The only downside to RAID6 is performance – if you want the protection of RAID6 but with extremely high performance, then look at NetApp: the RAID-DP that NetApp employs by default has, in many cases, better performance than even RAID10. Oracle has several PB of DBs running on NetApp RAID-DP. Can’t be all that bad.

See here for some info…

D

What if you could dramatically improve your application testing times? What would happen to your productivity and to the company’s bottom line?

So, let’s say the DBA (or insert some other discipline) wants to do some testing for a new product (known to happen occasionally) – and the way he would really like to test is to create 20 test cases, which requires 20 copies of the main database. He would then automate the test and therefore get results very quickly.

He approaches the storage admin with the problem, only to be told this isn’t possible since there isn’t enough space on the array. The DBA goes back to his cube frustrated, and figures out some ghetto way of creating at least 1 copy of the database, which creates the following problems:

  1. He has to figure out a way to do it (takes time)
  2. He can only test 1 case at a time (time)
  3. He cannot easily compare what-if scenarios between test cases (lack of flexibility)
  4. His ghetto way of doing it may involve single 1TB disks in a workstation (lack of reliability, time)

Ultimately, the testing takes longer, is error-prone, and the DBA’s productivity level goes way down.

What if the storage admin could, instead, tell the DBA that he can even take hundreds of copies of the DB, there’s no issue doing that?

What would happen to the DBA’s productivity?    

What new ideas would he be able to come up with?

How would that affect the quality of the product?

How would that affect the company’s bottom line? Being able to go to market with improved quality and quicker than the competition?

You see, intelligent storage – intelligently deployed – can solve many more problems than just “give me some space” or “give me more performance”.

There aren’t many technologies out there that can comfortably do this, which is probably why most storage people aren’t aware of this. But an array that can create space- and performance-efficient application-consistent DB clones is the ticket. Being able to create full copies and/or virtual space-efficient copies that end up being unusably slow doesn’t count… :)
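
For a sense of scale, here is the generic space math behind block-sharing clones versus full copies – illustrative numbers and a made-up change rate, not a claim about any specific vendor’s implementation:

    # Illustrative only: space cost of 20 test copies of a 2 TB database.
    db_tb       = 2
    test_copies = 20
    change_rate = 0.03   # assume each test case modifies ~3% of the blocks

    full_copies_tb  = db_tb * test_copies                 # 40 TB of new space
    clone_copies_tb = db_tb * test_copies * change_rate   # ~1.2 TB of new blocks

    print(f"Full copies:          {full_copies_tb} TB")
    print(f"Block-sharing clones: ~{clone_copies_tb:.1f} TB of extra space")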

The only vendor I know of that can pull this off (properly) is NetApp with their FlexClone technology. One can even use it to deploy thousands of identical VMs… there are some use cases for that, too :)

Activision (the company that makes the famous Guitar Hero game) is a good example of using this technology to rapidly accelerate development – and ended up making the Christmas deadline, which resulted in several more millions in sales. See here.

Oracle is another small company that uses this technology pervasively.

If anyone else knows of more vendors that can do this (properly) please chime in.

D

New ext4 vs XFS benchmarks using Fedora 11 Leonidas

What a difference a kernel rev and/or distribution make. If you recall from a previous post, I was unable to complete postmark testing on Ubuntu 9.04 using ext4, and had to recommend against ext4. Now, with the release of Fedora 11 “Leonidas”, a new kernel seems to make a big difference in performance and stability of ext4.

Some other observations before I show any numbers:

  • This is NOT the same computer as was used in the previous test, so don’t use these numbers to compare Ubuntu and Fedora. It’s a desktop with a 64-bit Athlon and 1GB RAM. I know, I know… I didn’t have access to the other box. Look at Phoronix.com for a comparison of the two.
  • The 2.6.29 kernel seems to have a much better implementation of the CFQ I/O elevator; I only noticed a slight decrease in performance using deadline, instead of the increase I usually get with XFS (ext3 and ext4 have always been tuned for CFQ).
  • In this version, using my usual (and sometimes unsafe and daring) mount switches didn’t seem to make a huge difference on XFS, and none in ext4 or even ext3; Fedora 11 is really a distribution that the developers want you to be able to use without much fussing.
  • On all tests, I created XFS with mkfs.xfs -f -l lazy-count=1 -l size=128m /dev/… – this enables the 2 main (and safe) tunings that I believe everyone should follow with XFS. Kinda hard to do while installing a distribution; the Fedora 11 installer wasn’t happy about it. Ubuntu is more forgiving: it lets you boot into the LiveCD and manually create partitions before you let the installer do its thing. Convenient for single-root-partition installs…
  • “XFS tuned” means mounted with noatime,logbsize=256k,nobarrier (nobarrier is unsafe unless you’re on a UPS).
  • “ext3 tuned” means barrier=0,noatime,data=writeback. Used to make a big difference…
  • The same disk area was used for all tests
  • Scribefire on Firefox sucks compared to Mac- or Windows-based offline blog editors. There are some KDE-based ones but I didn’t want to download 100s of MB of KDE support infrastructure to run a 600K blog program…

Postmark numbers:

Filesystem | Read MB/s | Write MB/s | IOPS
XFS defaults | 4.9 | 10.34 | 215
XFS tuned | 6.23 | 13.16 | 263
XFS noatime,logbsize | 6.38 | 13.47 | 263
ext4 noatime | 9.62 | 20.32 | 416
ext3 noatime | 5.71 | 12.06 | 238
ext3 “tuned” | 5.32 | 11.24 | 219
ext3 writeback,noatime | 4.73 | 9.98 | 192

Bonnie++ numbers:

Filesystem | IOPS | Block writes KB/s | Rewrite KB/s
XFS defaults | 328.4 | 116600 | 52066
XFS tuned | 328.6 | 119981 | 51639
XFS noatime,logbsize | 333 | 119781 | 50519
ext4 noatime | 335.1 | 117285 | 48797
ext3 noatime | 294.6 | 100771 | 43033

Verdict

  • Ext4 shows great promise!
  • For sheer MB/s on large files, XFS is still better by a small margin
  • If you want to be doing operations on many small files, ext4 is great
  • The reworked CFQ scheduler rocks

D