Category Archives: Speculation

Are some flash storage vendors optimizing too heavily for short-lived NAND flash?

I really resisted using the “flash in the pan” phrase in the title… first, because the term is overused and second, because I don’t believe solid state is of limited value. On the contrary.

However, I am noticing an interesting trend among some newcomers in the array business, desperate to find a flash niche to compete in:

Writing their storage OS around very specific NAND flash technologies. Almost as bad as writing an entire storage OS to support a single hypervisor technology, but that’s a story for another day.

Solid state technology is still too fluid. Unlike spinning disk technology that is overall very reliable and mature and likely won’t see huge advances in the years to come, solid state technology seems to advance almost weekly. New SSD controllers are coming out almost too frequently, and new kinds of solid state storage are either out now (Triple Level Cell, anyone?) or coming in the future (MRAM, ReRAM, FeRAM, PCM, PMC, and probably a lot more that I’m forgetting).

My point is:

How far ahead are certain vendors thinking if they are writing an entire storage OS around the limitations of a class of storage that may look very different in just a year or two?

Some of them go really deep and try to do all kinds of clever optimizations to ensure good wear leveling for the flash chips. Some write their own controller software and use bare NAND flash chips, not even off-the-shelf SSDs. Which is great, but what if you don’t need to do that in two years? Or what if the optimizations need to be drastically different for the new technologies? How long will coding for the new flash technologies take? Or will they be stuck using old technologies? Food for thought.

I guess some of us are in it for the long haul, and some aren’t. “Can’t see the forest for the trees” comes to mind. “Gold rush” also seems relevant.

I strongly believe general-purpose storage OSes need to be flexible enough to be reasonably adaptable to different underlying media. And storage OSes that are specifically designed for solid state storage need to be especially flexible regarding the underlying SSD technology to avoid the problems outlined above, and to avoid the relative lack of reliability of current SSD solutions (another story for another day).

At the moment I don’t see clear winners yet. I see a few great short-term stories, but who has the most flexible architecture to be able to deal with different kinds of technologies for years to come?

D

Has NetApp sold more flash than any other enterprise disk vendor?

NetApp has been selling custom cache boards with flash chips for a while now. We have sold over 3PB of usable cache this way.

The question was raised in public forums such as Twitter – someone mentioned that this figure may represent more usable solid state storage than all other enterprise disk vendors have sold combined (whether used for caching or normal storage – I know we have greatly outsold anyone else that does it for caching alone :) ).

I don’t know if it is – maybe the boys from the other vendors can chime in on this and tell us how much usable SSD (after RAID) they’ve sold – but the facts remain:

  • NetApp has demonstrated thought leadership in pioneering the pervasive use of Megacaches
  • The market has widely adopted the NetApp Flash Cache technology (I’d say 3PB of usable cache is pretty wide adoption)
  • The performance benefits in the real world are great, due to the extra-granular nature of the cache (4KB blocks vs 64+ KB for others) and extremely intelligent caching algorithms
  • The cost of entry is extremely reasonable
  • It’s a very easy way to add extra performance without forcing data into faster tiers.

Comments welcome…

D

A look at EMC’s FASTv2, FAST Cache and FLARE30 – EMC giveth, EMC taketh away

[Update: some grammar mistakes fixed and a few questions added]

Before anyone starts frothing at the mouth, notice that this post is filed under the FUD category :) Always do your own analysis… I just wanted to give people some food for thought, like I did when FASTv1 came out. I didn’t make this up – it’s all based on various EMC documents available online. I advise people looking at this technology to ask for extensive documentation regarding best practices before taking the leap.

As a past longtime user and sometimes pusher of EMC gear, some of the enhancements in FASTv2 seemed pretty cool to me, and potentially worrisome from a competitive standpoint. So I decided to do some reading to see how cool the new technology really is.

Summary of the new features:

  • Large heterogeneous pools (a single pool can consist of different drive types and encompass all drives in a box minus the vault and spares)
  • FASTv2 – sub-LUN movement of hot or cold chunks of data for auto-tiering between drive types
  • FAST Cache – add plain SSDs as cache
  • Much-touted feature: ability to use SSD as a write cache
  • LUN compression
  • Thin LUN space reclamation

It all sounds so good and, if functional, could bring Clariions to parity with some of the more advanced storage arrays out there. However, some examination of the features reveals a few things (I’m sure my readers will correct any errors). In no particular order:

EMC now uses a filesystem

It finally had to happen: thin LUN pools, at the very least, live on a filesystem laid on top of normal RAID groups (and I suspect all new pools on both Symm and CX now live on a filesystem). So is it real FC or some hokey emulation? Not that it matters if it provides useful functionality impossible to achieve otherwise – it’s just an about-face. But how mature is this new filesystem? Does it automatically defragment itself, or at least provide tools for manual defragmentation? Filesystem design is not trivial.

LUN compression

  1. Best practices indicate compression should not be used for anything I/O intensive and is best suited for static workloads (i.e. not good for VMs or DBs). However, new data is compressed as a post-process, which theoretically doesn’t penalize new writes – which I find interesting. Also: What happens with overwrites? Do compressed blocks that need to be overwritten get uncompressed and re-laid down uncompressed until the next compression cycle? Do the blocks to be overwritten get overwritten in their original place or someplace new? What happens with fragmentation? It all sounds so familiar :)
  2. The read performance hit is reported to be about 25% – makes sense since the CPU has to work harder to uncompress the data.
  3. Turning on compression for an existing traditional LUN means the LUN will need to be migrated to a thin LUN in a pool (not converted, migrated – indeed, you need to select where the new LUN will go). Not an in-place operation, it seems.
  4. Does data need to be migrated to a lower tier in order to be compressed?
  5. It follows that you need enough space for the conversion to take place… (can you do more than one in parallel? If so, quite a bit of extra space will be needed).
  6. How does this work with external replication engines like RecoverPoint? Does data need to be uncompressed? (probably counts as a normal “read” operation which will uncompress the data).
  7. Does this kind of compression mess with alignment of, say, VMs? This could have catastrophic consequences regarding performance of such workloads…

Thin LUN space reclamation

  1. Another case where migration from thick to thin takes place (doesn’t seem like the LUN is converted in-place to thin)
  2. Unclear whether an already thin LUN that has temporarily ballooned in size can have its space reclaimed (NetApp and a few other arrays can actually do this). You see, LUNs don’t only grow in size… several operations (e.g. MS Exchange checking) can cause a LUN to temporarily expand in space consumption, then go back down to its original size. Thin provisioning is only truly useful if it can help the LUN remain thin :)

Dual-drive ownership, especially when it pertains to pool LUNs

Dual-drive ownership is not strictly a new feature, but the best practice is for a single CX controller (SP) to own a drive, not to have it shared. Furthermore, with pool LUNs, if you change the controller ownership of a pool LUN, I/O will see much higher latencies – the recommendation is to migrate to a new LUN controlled by the other SP (yet another scenario that needs migration). I’m mentioning this since EMC likes to make a big deal about how both controllers can use all drives at the same time… obviously this is not nearly as clean as it’s made to appear. The Symmetrix does it properly.

Metadata used per thin LUN

3GB is the minimum space a thin LUN will occupy due to metadata and other structures. Indeed, LUN space is whatever you allocate plus another 3GB. Depending on how many LUNs you want to create, this can add up, especially if you need many small LUNs.
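
To see how this can add up, here is a quick back-of-the-envelope sketch in Python (the 3GB-per-LUN figure is the one cited above; the LUN counts and sizes are purely hypothetical):

# Rough overhead estimate for thin LUN metadata (illustrative numbers only).
METADATA_GB_PER_LUN = 3  # minimum per-LUN overhead cited above

def pool_overhead(lun_count, lun_size_gb):
    """Return total metadata overhead in GB and as a percentage of allocated space."""
    overhead = lun_count * METADATA_GB_PER_LUN
    allocated = lun_count * lun_size_gb
    return overhead, 100.0 * overhead / allocated

# 200 small 20GB LUNs: 600GB of overhead, i.e. 15% on top of what you asked for.
print(pool_overhead(200, 20))   # -> (600, 15.0)
# 20 large 500GB LUNs: 60GB of overhead, well under 1%.
print(pool_overhead(20, 500))   # -> (60, 0.6)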

Loss of performance with thin LUNs and pools in general

It’s not recommended to use pools and especially thin LUNs for performance-sensitive apps, and in general old-style LUNs are recommended for the highest performance, not pools. Which is interesting, since most of the new features need pools in order to work… I heard 30% losses if thin LUNs are used in particular, but that’s unconfirmed. I’m sure someone from EMC can chime in.

Expansion, RAID and scalability caveats with pools

  1. To maintain performance, you need to expand the pool by adding as many drives as the pool already has – I suspect this has something to do with the way data is striped. This could cause issues as the system gets larger (who will really expand a CX4-960 by 180 drives a pop? Because best practices state that’s what you have to do if you start with 180 drives in the pool).
  2. Another thing that’s extremely unclear is how data is load-balanced among the pool drives. Most storage vendors are extremely open about such things. All I could tell is that there are maximum increments at which you can add drives to a pool, ranging from 40 on a CX4-120 to 180 on a CX4-960. Since a pool can theoretically encompass all drives aside from vault and spares, does this mean that striping happens in groups of 180 in a CX4-960, and if you add another 180 that’s another stripe, with the stripes concatenated?
  3. What if you don’t add drives by the maximum increment, and you only add them, say, 30 at a time? What do you give up, if anything?
  4. RAID6 is recommended for large pools (which makes total sense since it’s by far the most reliable RAID at the moment when many drives are concerned). However, RAID6 on EMC gear has a serious write performance penalty. Catch-22?

FAST Cache (includes being able to cache writes)

  1. Cache only on or off for the entire pool, can’t tune it per LUN (can only be turned on/off per LUN if old-style RAID groups and LUNs are used).
  2. 64KB block size (which means that a hot 4K block will still take 64K in cache – somewhat inefficient; see the sketch after this list).
  3. A block will only be cached if it’s hit more than twice. Is that really optimal for the best hit rate? Can it respond quickly to a rapidly changing working set?
  4. Unclear set associativity (important for cache efficiency).
  5. No option to automatically optimize for sequential read after random write workloads (many DB workloads are like that).
  6. Flash drives aren’t that fast for writes as confirmed by EMC’s Barry Burke (the Storage Anarchist) in his comment here and by Randy Loeschner here. Is the write benefit really that significant? Maybe for Clariions with SATA, possibly due to the heavy RAID write penalties, especially with RAID6.
  7. It follows that highly localized overwrites could be significantly optimized since the Clariion RAID suffers a great performance degradation with overwrites, especially with RAID6 (something other vendors neatly sidestep).
  8. EMC Clariions don’t do deduplication so the cache isn’t deduplicated itself, but is it at least aware of compression? Or do blocks have to be uncompressed in cache? Either way, it’s a lot less efficient than NetApp Flash Cache for environments where there’s a lot of block duplication.
  9. The use of standard SSDs versus a custom cache board is a mixed blessing – by definition, there will be more latency. At the speeds these devices are going, those latencies add up (since it’s added latency per operation, and you’re doing way more than one operation). All high-end arrays add cache in system boards, not with drives…
  10. Smaller Clariions have severely limited numbers of flash drives that can be used for caching (2-8 depending on the model, with the smaller ones only able to use very small cache drives). Only the CX4-960 can do 20 mirrored cache drives, which I predict will provide good performance even for fairly heavy write workloads. However, that will come at a steep price. The idea behind caches like NetApp’s Flash Cache is to reduce costs.
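
Regarding point 2 above, here is a rough sketch of why cache page size matters (the working-set size is a hypothetical worst case where every hot 4KB block lands in a different 64KB page; only the 4KB and 64KB page sizes come from the discussion above):

# How much cache a scattered 4KB-hot working set really consumes,
# depending on the cache page size. Illustrative worst case only.
hot_blocks = 1_000_000   # hypothetical working set: one million hot 4KB blocks (~4GB of hot data)
for page_kb in (4, 64):
    cache_needed_gb = hot_blocks * page_kb / (1024 * 1024)
    print(f"{page_kb}KB pages: ~{cache_needed_gb:.1f}GB of cache to hold ~4GB of hot data")
# 4KB pages:  ~3.8GB of cache
# 64KB pages: ~61.0GB of cache - up to 16x less efficient in this worst case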

For a very detailed discussion regarding megacaches in general read here.

I can see FAST Cache helping significantly on a system with lots of SATA in a well-configured CX4-960. And I can definitely see it helping with heavy read workloads that have good locality of reference, since SSDs are very good for reads.

And finally, the pièce de résistance,

FASTv2

This is EMC’s sub-LUN auto-tiering feature. Meaning that a LUN is chopped up into 1GB chunks, and that the 1GB chunks move to slower or faster disks depending on how heavily accessed they are. The idea being that, after a little while, you will achieve steady state and end up with the most appropriate data on the most appropriate drives.

Other vendors (most notably Compellent and now also 3Par, IBM and HDS) have some form of that feature (Compellent pioneered this technology and has the smallest possible chunks I believe at 512KB).

The issues I can see with the CX approach of FASTv2:

  1. Gigantic 1GB slice to be moved. EMC admits this was due to the Clariion not being fast enough to deal with the increased metadata of many smaller slices (the far more capable Symmetrix can do 768KB per slice, offering far more granularity). It follows that the bigger the slice the less optimal the results are from an efficiency standpoint.
  2. All RAID groups within the pool have to be of the same RAID type (e.g. RAID6). So you can’t have, say, SATA as RAID6 and SSD as RAID5 in the same pool. Important since RAID6 on most arrays has a big performance impact.
  3. Unknown performance impact for keeping track of the slices (possibly the same as using thin provisioning – 30% or so?)
  4. The most important problem in my opinion: Too much data can end up on expensive drives. For instance, imagine a 1TB DB LUN. That LUN will be sliced into roughly 1,000 1GB chunks. Unless the hotspots of the DB are extremely localized, even if a few hundred blocks are busy per slice, that entire slice will get migrated to SSD the next day (it’s a scheduled move). Now imagine if, say, half the slices have blocks that are deemed busy enough – half the LUN (512GB in this example) will be migrated to SSD, even if the hot data in those slices were more like 5GB (say a 10% working set size, quite typical; see the sketch after this list). Clearly, this is not the most effective use of fast disks. EMC has hand-waved this objection away in the past, but if it’s not important, why does the Symmetrix go with the smaller slice?
  5. Extremely slow transactional performance for the data that has been migrated to SATA, especially with RAID6 – EMC says you need to pair this with FAST Cache, which makes sense… Of course, come next day that data will move to SSD or FC drives, but will that be fast enough? Policies will have to be edited and maintained per application (often removing the auto-tiering by locking an app at a tier), which removes much of the automation on offer.
  6. The migration is I/O intensive, and we’re talking about migrations of 1GB slices (on a large array, many thousands of them). What does that mean for the back-end? After all, once a day all the migrations need to be processed… and will need to contend with normal I/O activity.
  7. Doesn’t support compressed blocks, data needs to be uncompressed in order to be moved.
  8. I still think this technology is most applicable to fairly predictable, steady workloads with good locality of reference.
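
To put some numbers behind point 4, here is a small sketch of that scenario (the per-slice hot-data figure is an assumption chosen to reproduce the 512GB-versus-5GB example above, not a measurement):

# If a slice is promoted whenever it contains "enough" hot blocks, the amount
# moved to SSD is promoted_slices * slice_size, regardless of how little data
# inside each slice is actually hot. Hypothetical numbers throughout.
lun_gb = 1024            # ~1TB LUN carved into 1GB slices
slice_gb = 1
slices = lun_gb // slice_gb

promoted_slices = slices // 2      # say half the slices look "busy"
hot_gb_per_promoted_slice = 0.01   # but only ~10MB per promoted slice is truly hot

moved_to_ssd_gb = promoted_slices * slice_gb
truly_hot_gb = promoted_slices * hot_gb_per_promoted_slice
print(f"moved to SSD: {moved_to_ssd_gb}GB, actually hot: ~{truly_hot_gb:.0f}GB")
# -> moved to SSD: 512GB, actually hot: ~5GB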

Messaging inconsistencies

As I’ve mentioned before, I don’t have an issue with EMC’s technology, merely with the manner in which the capabilities and restrictions are messaged (or not, as the case may be). For instance, I’ve seen marketing announcements and blog entries talking about doing VMware on thin LUNs with compression etc. – sure, that could be space-efficient, but will it also be fast?

Now that the limitations of the new features are more understood, EMC’s marketing message loses some of its punch.

  • Will compression really work with busy VMware or DB setups?
  • Will thin LUNs be OK for busy systems?
  • Unless 20 disks are used for FAST Cache (only with a CX4-960), is the performance really enough to accelerate highly random writes on large systems?
  • What is the performance impact of thin LUNs for highly-intensive workloads?
  • What is the performance of a large system all running RAID6?
  • Last but not least – does the filesystem EMC uses allow defragmentation? By definition, features such as thin provisioning, compression and FAST will create fragmentation.

Moreover – what all the messaging lacks is some comparison to other people’s technology. Showing a video booting 1000 VMs in 50 minutes where before it took 100 is cool until you realize others do it in 12.

And why is EMC (I’m picking on them since they’re the most culpable in this aspect) ridiculing technologies such as NetApp’s Flash Cache and Compellent’s Data Motion only to end up implementing similar technologies and presenting things to the world as if they are unique in figuring this out? “You see, none of the other guys did it right, now that we did it it’s safe”.

Too many of the new features are extremely obscure in their design; if storage professionals can’t easily figure them out, how is the average consumer expected to? I think more openness is in order, otherwise it just looks like you have something to hide.

Ultimately – the devil is in the details, so why would you have to choose between space OR performance, and not be able to optimize space utilization AND performance?

I think it has to do with the original design of your storage system. Not all systems lend themselves to advanced features because of the way they started.

But that’s a subject for another day.

D

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place because most array architectures are older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, sticking to just 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.
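
As a rough illustration of the capacity math (the drive count, drive size and 2-LUNs-per-group guideline are from the text above; the per-application sizes are hypothetical):

# Usable capacity of a 5-disk RAID5 group and how many "typical" LUNs
# end up being carved from it. Illustrative numbers only.
drives = 5
drive_gb = 450   # marketing GB; formatted capacity is lower (the "bit over 1.5TB" above)
raid5_usable_gb = (drives - 1) * drive_gb   # one drive's worth of capacity goes to parity
print(raid5_usable_gb)                      # -> 1800 raw GB before formatting

# If applications only need 50-300GB each, sticking to 2 LUNs per group is hard:
app_needs_gb = [50, 300, 100, 200, 150, 250, 80, 120, 300, 200]   # hypothetical requests
print(sum(app_needs_gb), "GB requested across", len(app_needs_gb), "LUNs on the same 5 spindles")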

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s examine first how FAST (Fully Automated Storage Tiering) is implemented. Since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs, this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic since you have to approve the move! Ergo, it should not be sold under the name “FAST” since the “A” stands for “Automated” – aren’t there laws against false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.

The real problem

First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – both the Compellent approach as well as FASTv2 and, even worse, v1, suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of Titanic.

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But no need to go nuts with several tiers of fast disk, a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).

And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

Pillar claiming their RAID5 is more reliable than RAID6? Wizardry or fiction?

Competing against Pillar at an account. One of the things they said: that their RAID5 is superior in reliability to RAID6. I wanted to put this in the public domain and, if true, invite Pillar engineers to comment here and explain how it works for all to see. If untrue, again I invite the Pillar engineers to comment and explain why it’s untrue.

The way I see it: very simply, RAID5 is N+1 protection, RAID6 is N+2. Mathematically, RAID5 is about 4,000 times more likely to lose data than a RAID6 group with the same number of data disks. Even RAID10 is about 160 times more likely to lose data than RAID6.
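
For the curious, here is a crude sketch of where a ratio of that order of magnitude comes from (the MTBF, rebuild window and group size are assumptions for illustration only; proper models also account for unrecoverable read errors, which make RAID5 look even worse):

# Back-of-the-envelope data-loss probability per rebuild event.
# RAID5 (N+1): loses data if 1 more drive fails while rebuilding.
# RAID6 (N+2): loses data only if 2 more drives fail during that window.
mtbf_hours = 500_000      # assumed drive MTBF
rebuild_hours = 24        # assumed rebuild window
n = 7                     # surviving drives in the group after the first failure

p_one_more = n * rebuild_hours / mtbf_hours                       # ~chance of a 2nd failure
p_two_more = p_one_more * (n - 1) * rebuild_hours / mtbf_hours    # ~chance of a 2nd AND a 3rd

print(f"RAID5 loss per rebuild ~ {p_one_more:.2e}")
print(f"RAID6 loss per rebuild ~ {p_two_more:.2e}")
print(f"ratio ~ {p_one_more / p_two_more:,.0f}x")   # a few thousand x with these inputs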

The only downside to RAID6 is performance – if you want the protection of RAID6 but with extremely high performance, then look at NetApp: the RAID-DP that NetApp employs by default has, in many cases, better performance than even RAID10. Oracle has several PB of DBs running on NetApp RAID-DP. Can’t be all that bad.

See here for some info…

D

About the Data Domain acquisition – and is EMC really the best place for Data Domain?

Much has already been written about this imminent acquisition of Data Domain by either NetApp or EMC and, since opinions are like you-know-what, and I have one, here it is… if I ramble, forgive me. I have too much to say and I’m trying to be PC… I wrote and subsequently erased all kinds of stuff that could probably get me in trouble (the more you work with a company the more dirt you uncover, and I have several earth movers’ worth).

I do think that both companies waited too long to try and acquire Data Domain – frankly, it’s staggering to me that other companies that make decent products, like CommVault, haven’t been acquired yet (I mean, seriously, if EMC wants to compete in the backup software space it should just drop Networker and buy CommVault). Consolidation is the trend…

Maybe both NetApp and EMC thought their in-house deduplication would work out for everything, maybe they thought Data Domain wouldn’t become a contender. Maybe they thought it was just a phase. Either way, the backup market is still strong, most people don’t want to move en masse to something like Avamar, not everyone needs VTL, and Data Domain does provide a very convenient way to keep using your existing backup product, make next to no changes, and get better efficiencies.

The simple truth is that EMC needed SOMETHING to combat Data Domain so they signed the agreement with Quantum and rushed the product to market. And then tried to strong-arm the resellers into forgetting about Data Domain and instead selling the new and amazing DL3D (that backfired BTW).

As far as EMC is concerned, the attempt to acquire Data Domain is a slap in the face for Quantum and all the customers that have been pitched/sold DL3D (the OEM’ed Quantum DXi product). EMC has spent quite a bit of time belittling Data Domain and instead pushing a product that has seen very limited testing (I know, I’ve been burned personally by it several times). A good example: EMC recently released a patch to allow backups done with EMC’s Networker to actually be deduplicated (talk about a reason to return a product if there ever was one – like a car that can’t go faster than 10 mph or that gets 2 mpg instead of 20 mpg). You see, there was an issue with the filter that figures out what backup app you’re using, and Networker backups were getting only plain old compression, NO deduplication. This is no secret, if anyone bothers to read the release notes of the recent patches they’ll see this info. Maybe if you’re a DL3D customer you should insist on reading the release notes if they’re not easily available? After all, you have a right to know what’s changing!

Think about this: EMC’s own backup product was not tested with DL3D. Yet EMC happily sold DL3D to customers with Networker. To me, this is a sales-driven company, not a customer-driven company.

Not to mention other crippling bugs, slow startup times (especially in the case of unclean shutdowns) and the abysmal performance which simply stems from how the product is designed – it’s spindle-happy and needs about 2 trays of drives to work well. Oh, and don’t EVER fill it beyond 80% capacity. You’re also not supposed to use it as a normal CIFS/NFS share for archiving anything like email or normal files (arguably a great place for dedup).

So, EMC knew about the DL3D issues (well, some of them, it’s not their product after all, indeed I helped them identify some of the bugs) and played coy with customers. Then, they saw NetApp making a move for Data Domain and realized that by buying Data Domain EMC could accomplish several things:

  • Minimize NetApp’s cash reserves if NetApp does in the end succeed in acquiring Data Domain (but is that necessarily a bad thing for NetApp?)
  • Remove the flailing DL3D and replace it with a product that actually works and is selling very well
  • Get a bunch of solid deduplication and consistency checking algorithms
  • Assimilate a competitor that’s been a huge thorn in EMC’s side in that space
  • Reduce the efficiency of NetApp as a competitor

But think from the customer standpoint for a minute (most of the analysts so far seem to miss the most important player here – and that’s certainly not EMC, NetApp or Data Domain, but the customer). You’ve been pitched DL3D, and now you must forget about that and all the bad things you were told about Data Domain – it’s all good now that it belongs to EMC, you’ll be taken care of. Or you can buy the DL3D if you still want it (and I don’t see EMC derailing ANY existing DL3D campaign, no matter what).

If I were a DL3D prospect/customer, I’d be worried no matter what.

Let’s talk about the best place for Data Domain to end up. As far as investors go of course, if they want to make a quick buck and run, the EMC cash offer is tantalizing. But for Data Domain employees, EMC can be a black hole and the added complexity and bureaucracy anything but fun. EMC has become almost too diversified – let’s look at just some of EMC’s storage solutions (I won’t mention the software since then it’d be a REALLY long and weird post):

  • Symmetrix
  • Clariion
  • Celerra
  • Centera
  • Atmos
  • EDL
  • DL3D
  • RecoverPoint
  • Avamar (that’s both a software solution and an appliance)

What’s interesting is that, by and large, the teams in charge of the above products don’t talk much, if at all, with each other. Talk about islands! And, when it comes to sales, EMC has internally competing groups of people that sell the above products – for instance, “NAS overlay” guys only get paid on Celerra sales, and I’ve seen them screw up campaigns that were clearly a pure Clariion play just so they could somehow get some Celerra in so they get paid. The basic EMC sales guy you meet can sell them all and indeed doesn’t care, but the people he relies on for support cannot sell them all and do care about what gets sold. It’s all very fragmented and, again, not a model that operates with the customer’s best interests always in mind. It always baffled me why EMC would allow so much fluff in their sales organization.

So, if Data Domain got absorbed, they’d probably not be enjoying all the “melting pot” advantages the EMC corporate bloggers seem so keen on advertising, and the “large startup” feel (maybe it’s like that in MA for a few chosen people – in most other locations it’s decidedly not like that). They’d just be another acquired unit, internally competing with other units, dealing with large-company politics and other inefficiencies. The EMC stock wouldn’t really become much higher than it is now, if at all. It’s been about the same for quite some time now.

Let’s examine the scenario of NetApp buying Data Domain:

  • NetApp is much more focused than EMC – indeed they have literally less than a handful of major offerings that don’t really compete with each other
  • The NetApp sales force is unified and doesn’t internally compete about what to sell
  • NetApp culture is much closer to Data Domain culture
  • It’s not good for innovation to have one company hoarding 3 dedup technologies, NetApp + Data Domain will actually push EMC more and be better for the customers
  • Data Domain could make NetApp much stronger against EMC, in turn driving NetApp’s stock price up significantly. Which, in turn, would give investors back much more than $2bn, thereby making this the better deal.

The only drawback I see (as do most writing about this) is NetApp’s relatively poor history in managing the few acquisitions they’ve made. But I believe that as long as they leave Data Domain alone and slowly try to integrate the technology in the other products it will all work out.

Hopefully all this made some sense…

D

The true XIV fail condition finally revealed (?)

I just got this information:

For XIV to be in jeopardy you need to lose 1 drive from one of the host-facing ingest nodes AND 1 drive from the normal data nodes within a few minutes (so there’s no time to rebuild) while writing to the thing.

I have no way of confirming this, but it did come from a reliable source.

A customer recently tried pulling random drives and XIV didn’t shut down and was working fine, but they were from the data nodes.

Why can’t anyone post something concrete here? I’m sure IBM won’t post since the confusion serves them well.

For what it’s worth, the customer is really happy with the simplicity of the XIV GUI.

D

So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by using normal servers running Linux and the XIV “sauce” and coupling them together via an Ethernet backbone. A few of the nodes get FC cards and can become FC targets. A few more of the features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no global spares – spare space is instead reserved on each drive, which allows faster rebuilds)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers for the XIV (“it will destroy DMX/USP!” — sure). But let’s take a look at how it actually shapes up:

  • 180 drives maximum (or minimum) – you can get a half config, but I think you always get the 180 drives and just license half (I might be mistaken – I believe you have to commit to buying the whole thing within 1 year)
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or Infiniband (much, much higher latency is incurred by Ethernet vs the other technologies)

The way IBM claims it can sustain high speed is by not letting the SATA drives become bound by their low transactional performance versus 15K FC drives or, even worse, SSDs. From what I understand (and IBM employees, feel free to chime in), XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes (the simple rule being that a 1MB chunk cannot have its mirror on the same server/shelf of drives!) – see the sketch right after this list
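
Here is a toy sketch of what such a placement scheme might look like based on the description above – the node count and the placement policy are my assumptions, not IBM’s actual code:

import random

# Toy model of the placement rule described above: each 1MB chunk gets two
# copies on two pseudo-randomly chosen nodes, never both on the same node.
NODES = list(range(15))   # assumed number of nodes, for illustration only

def place_chunk(rng=random):
    """Pick the pair of distinct nodes that will hold the two copies of a chunk."""
    primary = rng.choice(NODES)
    mirror = rng.choice([n for n in NODES if n != primary])   # enforce the "different node" rule
    return primary, mirror

print([place_chunk() for _ in range(10)])   # placement of the first ten 1MB chunks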

Obviously, by effectively turning as much of the workload as possible into large-block writes to the SATA drives and using the cache to great effect, one should be able to see the 180 SATA drives perform pretty much as fast as possible (ideally, the drives should be seeing streaming instead of random data). However (there’s always that little word…)

  1. There is no magic!
  2. If the incoming random IOPS are coming at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk, I don’t care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens – which I think is optimistic at 111 IOPS/drive! At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where’s the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 can be configured with up to 960 drives. Even assuming that not all are used due to spares etc., you are left with a system with over 5 times the number of physical disks of an XIV, tons more capacity, etc. Even if the “magic” of XIV makes it more efficient, are those XIV SATA drives really 5 times more efficient? Remember, 5 times would merely make it EQUAL to the 960 – XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to beat the 960.
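
The arithmetic behind both of those numbers, for reference (the 20,000 IOPS figure is IBM’s claim from the list above; everything else is simple division):

# XIV cache-saturated claim per drive, plus the spindle-count gap to a maxed-out CX4-960.
xiv_claimed_iops = 20_000
xiv_drives = 180
print(xiv_claimed_iops / xiv_drives)    # -> ~111 IOPS per SATA drive, optimistic for random I/O

cx4_960_max_drives = 960
print(cx4_960_max_drives / xiv_drives)  # -> ~5.3x more physical disks in the big CX4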

Let’s put it that way:

If my system were as efficient as IBM claims, and I had IBM’s money, I’d buy all the competitive arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can’t find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood there are other far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So, what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester – I can hardly believe that IBM bought the XIV IP without seeing some kind of roadmap, otherwise the purchase would have been kinda stupid! If you are a beta tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from not previously-accessed addresses
  • XIV will be super-fast with 1-2 hosts pushing it, push it realistically with a real number of hosts
  • Try to load up the box since if it’s not full enough you’ll get an extremely skewed view of performance – put even dummy data inside but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of IBM drops off the box without giving you even a ballpark figure. I think that’s insane.

And last, but not least: I keep hearing and reading about the following being true, I’d love IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system, you will suffer a catastrophic failure (it logically makes sense looking at how the chunks get allocated, but I’d love to know how it works in real life). And before someone tells me that this never happens in real life: it’s personally happened to me at least once (I lost 2 drives in rapid succession), and to many other people I know with serious real-world experience…
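
Here is a back-of-the-envelope sketch of why losing two drives on different nodes is very likely to take out both copies of some chunks – the drive count, drive size, node layout, chunk size and usable capacity are my assumptions based on the description earlier in this post:

from math import comb

# Assumed layout: 180 x 1TB drives across 15 nodes of 12 drives each,
# data in 1MB chunks, each chunk mirrored on two drives on different nodes.
drives, nodes, drives_per_node = 180, 15, 12
usable_tb = 79                          # roughly the usable capacity of a full box (assumption)
mirrored_chunks = usable_tb * 1024 * 1024   # number of 1MB chunks, each with two copies

cross_node_drive_pairs = comb(drives, 2) - nodes * comb(drives_per_node, 2)
chunks_per_drive_pair = mirrored_chunks / cross_node_drive_pairs

print(f"~{cross_node_drive_pairs} cross-node drive pairs")
print(f"~{chunks_per_drive_pair:,.0f} chunks (~{chunks_per_drive_pair / 1024:.1f}GB) "
      f"have both copies on any given pair of drives")
# With these assumptions, pulling any two drives on different nodes, with no time
# to rebuild, loses both copies of thousands of chunks.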

D

Postmark on late 2008 Macbook Pro

So I’m now the proud owner of a tricked-out 2.8GHz MBP.

Naturally I’ve been tinkering with it already (only had it for 2 days). I’ve disabled swapfile encryption, for instance, and I think it makes it have teh snappy.

I compiled postmark for it with -O3 -m64 and ran the usual. Before doing so though I did disable the Spotlight indexer like this:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.metadata.mds.plist

PostMark v1.5 : 3/27/01
pm>set number 10000
pm>set transactions 20000
pm>set subdirectories 5
pm>set size 500 100000
pm>set read 4096
pm>set write 4096
pm>run

Time:
        273 seconds total
        256 seconds of transactions (78 per second)

Files:
        20092 created (73 per second)
                Creation alone: 10000 files (833 per second)
                Mixed with transactions: 10092 files (39 per second)
        9935 read (38 per second)
        10064 appended (39 per second)
        20092 deleted (73 per second)
                Deletion alone: 10184 files (2036 per second)
                Mixed with transactions: 9908 files (38 per second)

Data:
        548.25 megabytes read (2.01 megabytes per second)
        1158.00 megabytes written (4.24 megabytes per second)

I then enabled spotlight and re-ran the benchmark:

Time:
        483 seconds total
        468 seconds of transactions (42 per second)

Files:
        20092 created (41 per second)
                Creation alone: 10000 files (909 per second)
                Mixed with transactions: 10092 files (21 per second)
        9935 read (21 per second)
        10064 appended (21 per second)
        20092 deleted (41 per second)
                Deletion alone: 10184 files (2546 per second)
                Mixed with transactions: 9908 files (21 per second)

Data:
        548.25 megabytes read (1.14 megabytes per second)
        1158.00 megabytes written (2.40 megabytes per second)

Obviously Spotlight is very aggressive in its indexing and tries to do it ASAP – you lose half your performance when doing metadata-intensive processing. The results, though, while sucky for the specs of the box, are far removed from (and much better than) what an old colleague got on his beastie: http://recoverymonkey.net/wordpress/?p=62 – granted, my box is faster, but it shouldn’t be THAT much faster.

I urge my newfound Mac brethren to help out in determining the cause.

More benchmarks to follow.

D

ZFS in OSX

Not amazing news but an official announcement nonetheless: Saw this (www.macnn.com/articles/07/06/06/zfs.in.leopard/) and I couldn’t resist posting. This means a few things:

  1. Sun figured out how to make ZFS bootable (at least on OSX)
  2. Someone figured out how to deal with ZFS and resource forks (I can’t believe they are willing to break compatibility with so much software otherwise).

Now I just need a Mac so I can run some benchmarks before and after. I have some buddies that might oblige… finally the Macs get a decent FS.

Now if only Apple could lose the silly Mach legacy. It’s a common misconception that the kernel in OSX is FreeBSD – it ain’t. Run lmbench (www.bitmover.com/lmbench/) on different platforms and compare results such as context switching, thread creation and whatnot. Then you’ll see why OSX can’t always make a decent server OS.

D