Et tu, Brute? EMC offering capacity guarantees? The sky is falling! Will Chuck resign?

It came to my attention that EMC is offering a 20% efficiency guarantee vs the competition (they seem to be focusing on NetApp as usual but that’s beside the point in this post). See here.

Now, I won’t go ahead and attack their guarantee. Good luck with that, more power to you etc etc. They need all the competitive edge they can get.

No, what I’ll do is expose yet more EMC messaging inconsistency. If you’ve been following the posts in my site you’ll notice that I have absolutely nothing against EMC products – but I do have issues with how they’re sold and marketed and what they’ll say about the competition.

First and foremost: most major storage players, with the notable exception of EMC, have been offering some kind of efficiency guarantee. Sure, you needed to read the fine print to see if your specific use case would be covered (like with every binding document), but at least the guarantees were there. NetApp was first with our 50% efficiency guarantee, then came others (HDS and 3Par are just some that come to mind). We even offer a 35% guarantee if we virtualize EMC arrays :)

We all have different ways of getting the efficiency. NetApp has a combo of deduplication, thin provisioning, snapshots, highly efficient RAID and thin cloning, for instance. Others have a subset (3Par has their really good thin provisioning, for example). Regardless, we all tried to offer some measure of extra efficiency in these hard economic times.

And it’s not just marketing: I have multiple customers that, especially on virtualized environments, save at least 70% (that’s a real 70%, not 70% because we switched them from RAID10 to RAID-DP – literally, a 10TB data set is occupying 3TB). And for deployments like VDI, the savings are in the extreme range.

EMC’s stance was to, at a minimum, ridicule said guarantees. The inimitable Barry Burke (the storage anarchist) had this pretty funny post.

Chuck Hollis has been far more polemic about this – the worst was when he said he’d quit if EMC tried to do something similar (see here in the comments). BTW – we are all waiting for that resignation :) (on a more serious note, Chuck, if you don’t resign because of this, at least refrain from promising next time).

He also called other guarantees “shenanigans” here. I guess he’s really against the idea of guarantees.

But now it’s all good you see, EMC is offering a blanket 20% efficiency guarantee versus the competition! I.e. they will be able to provide 20% more actual usable storage or else they’ll give you free drives to cover the difference. You see, this guarantee is real, not like what all the other companies offer :)

Kidding aside, methinks they’re missing the point – this (to go back to my favorite car analogies) is like saying: “Both our car and your car have a 3-liter engine, but yours has twin turbos and a racing intercooler and 3 times the horsepower but we won’t take any of that into account, we will strictly examine whether you indeed have a 3-liter engine, and we’ll bore ours out to make it 3.6 liters for free”. Alrighty then. I’ll keep my turbos. But how will they deal with an existing NetApp customer that’s getting something like 3x efficiency already? Fulfilling the guarantee terms could get mighty expensive.

If a NetApp customer is getting 3x the usable storage due to deduplication and other means, will EMC come up with the difference or will they just make sure they offer 20% more raw storage?

To the customer, all that matters is how much effective storage they’re able to use, not how much raw storage is in the box.
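To put rough numbers on the raw-versus-effective distinction, here is a minimal sketch. The function name and the figures (100TB usable, a 3x efficiency factor) are illustrative assumptions, not terms taken from anyone's guarantee.

```python
def effective_tb(usable_tb, efficiency_factor=1.0):
    """Effective capacity: usable space multiplied by whatever
    dedupe/cloning/thin-provisioning factor the customer really sees."""
    return usable_tb * efficiency_factor

# Hypothetical existing customer: 100TB usable, already getting 3x efficiency.
existing = effective_tb(100, 3.0)

# A guarantee of 20% more usable space, with no efficiency features counted.
guaranteed = effective_tb(100 * 1.2)

shortfall = existing - guaranteed
print(f"existing: {existing:.0f}TB effective, "
      f"guaranteed: {guaranteed:.0f}TB effective, "
      f"shortfall: {shortfall:.0f}TB")
```

With these assumed inputs, 20% more usable space still leaves a large gap in effective space, which is the point of the car analogy.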

But, still, this is not what this post is about.

Throughout the years, NetApp and other vendors have offered true innovation on different fronts. Each time that happens, EMC (that also innovates – through acquisition mostly – but likes to act as if nobody else does) employs their usual “minimize and divert” technique. Either they will trivialize the innovation (“who’d want to do that?”) or they will proclaim it false, then divert attention to something they already do (or will do in a few years).

This is even the case for technologies EMC eventually acquired, like Data Domain. Before EMC acquired Data Domain, they disparaged the product, claimed it was the worst kind of device you’d ever want in your datacenter, then tried to sell you the execrable DL3D (AKA Quantum DXi – don’t get me started, the first release was an utter mess).

We all know what happened to that story eventually: at the moment, EMC is offering to swap out existing DL3Ds for free in many cases, and put Data Domain in their place since it’s infinitely better. But wait, weren’t they saying how terrible Data Domain was compared to DL3D?

Some will say this is fine since they’re just trying to compete, and “all is fair”. Personally, if I were approached by sales teams with those about-face tactics, I’d be annoyed.

So, without further ado, I present you with a slide a colleague created. Some of the timing may be a bit off, but the gist should be fairly clear… :)

I could have added a few more lines (Flash Cache, for instance) but it would have made for too busy a slide.

EDIT: I’ll add something I posted as a comment on someone else’s blog that I think is germane.

Since, to provide apples-to-apples protection, EMC HAS to be configured with RAID6, where are the public benchmarks showing EMC RAID6? As you well know, ALL NetApp benchmarks (SPEC, SPC) are with RAID-DP. Any EMC benchmarks around are with RAID10.

Maybe another guarantee is needed:

Provide no worse protection, functionality, space and performance than X competitor.

Otherwise, you’re only tackling a relatively unimportant part of the big picture.

D


NetApp usable space – beyond the FUD

I come across all kinds of FUD, and some of the most ridiculous claims against NetApp regard usable space. I won’t post screenshots from competitive docs since who knows who’ll complain, but suffice it to say that one of the usual strategies against NetApp is to claim the system has something like well under 50% space efficiency using a variety of calculations, anecdotes and obsolete information. In one case, 34% usable space :) Right…

The purpose of this post is to outline the state of the art regarding NetApp usable space as of Spring of 2010.

Since NetApp systems can use free space in various ways instead of just for LUNs, there is frequent confusion regarding what each space-related parameter means, and what the best practices are. NetApp’s recommendations have changed over the years as the technology matured – my goal is to bring everybody up to speed.

Executive summary

Depending on the number and type of drives and the design, aside from edge cases dealing with small systems with a very low number of disks, the real usable space in NetApp systems can easily exceed 75% of the real usable space in the drives. I’ve seen it as high as about 78% of the actual space on the drives. That’s amazingly efficient for something with double-parity protection as default and includes spares. This number is the same whether it represents NAS or SAN data and doesn’t include deduplication, compression or space-efficient clones, which could inflate it to over 1000%. Indeed, NetApp systems are used in the biggest storage installations on the planet partly because they’re so space-efficient. Now, on to the details.

What’s space good for anyway?

Legacy arrays use space in very simple terms – you create RAID groups, then you create LUNs on them and those LUNs pretend they’re normal disks, and that’s that. Figuring out where your space goes is easy – there’s a 1:1 relationship between LUN size and space used on the array. You buy an array that can provide 10TB after RAID and spares, and that’s all you ever get – nothing more, nothing less.

Legacy arrays can sometimes use features such as snapshots, but frequently there are so many caveats around their use (performance being a big one) that either they’re never implemented, or their number is too small to make them really useful.

Since NetApp gear doesn’t suffer from those limitations, customers invariably end up using snapshots a lot, and for various reasons, not just backup. I have customers with over 10,000 snapshots in their arrays – they replicate all those snapshots to another array, can retrieve data that’s several months old, and have stopped relying on legacy backup software, saving money and achieving far faster and easier DR in the process, since with snapshots there’s no restore needed.

What’s your effective space with NetApp gear?

If you consider that each snapshot looks like a complete copy of your data, without factoring in any deduplication at all, the effective logical space could be many, many times more than the physical space. A large law firm I deal with manages to fit about 2.5PB of data into 8TB of snapshot delta space – which is pretty efficient by anyone’s standards. We’re not talking about backups done on deduplicated disk here that need to be restored to become useful – we’re talking about many thousands of straight-up, application-consistent, “full” copies of LUNs, CIFS and NFS shares that you can mount at full speed instantly, without needing to restore from another medium or backup application.
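As a quick sanity check on the law-firm figures, this sketch computes the logical-to-physical ratio. It assumes 1PB = 1000TB (Base10) and uses a made-up function name; it only divides the two numbers quoted above.

```python
def logical_to_physical_ratio(logical_tb, physical_tb):
    """How many TB of logical (snapshot-visible) data each physical TB backs."""
    return logical_tb / physical_tb

# Figures from the example above; 1PB is taken as 1000TB (an assumption).
logical_tb = 2.5 * 1000
delta_tb = 8

ratio = logical_to_physical_ratio(logical_tb, delta_tb)
print(f"roughly {ratio:.1f} TB of logical data per TB of snapshot delta space")
```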

Once you add deduplication and thin cloning, the storage efficiency goes even higher.

It’s not the size of your disk that matters, it’s how you use it

If you use a NetApp system like a legacy disk array, without taking advantage of any of the advanced features (maybe you just care for the multi-protocol functionality, with great performance and reliability) then your usable space falls right within norms. Once you start using the advanced snapshot features, they start eating space of course – but giving you something in return. What you need to figure out is if the tradeoffs are worth it: for instance, if I can keep a month’s worth of Exchange backups with a nominal capacity increase, what is that worth for me? Maybe:

  • I can eliminate backup software licenses
  • I can shrink my storage footprint
  • Avoid purchasing external disk for backups
  • I don’t need to buy external CDP hardware/software and a bunch of extra disk
  • My restores take seconds
  • DR becomes trivial

Or, if I can create 150 clones of my SQL database that my developers can simultaneously use and only chew up a small fraction of the space I’d otherwise need, what is that worth? With other systems, I’d need 150x the space…
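A back-of-the-envelope sketch of the clone math follows. The database size and the fraction of data each developer changes are assumptions picked for illustration; only the 150 clones come from the example above.

```python
def full_copy_space_gb(db_size_gb, copies):
    """Space needed when every copy is a complete physical copy."""
    return db_size_gb * copies

def thin_clone_space_gb(db_size_gb, clones, changed_fraction):
    """Space needed when clones share unchanged blocks with the parent and
    only changed blocks consume new space. Excludes the parent itself."""
    return db_size_gb * clones * changed_fraction

db_size_gb = 500          # hypothetical database size
clones = 150              # from the example above
changed_fraction = 0.02   # assumption: each developer rewrites 2% of the data

full = full_copy_space_gb(db_size_gb, clones)
thin = thin_clone_space_gb(db_size_gb, clones, changed_fraction)
print(f"full copies: {full:.0f}GB, thin clones: {thin:.0f}GB")
```

The real savings depend entirely on how much each clone diverges from the parent, so treat `changed_fraction` as the knob to measure in your own environment.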

Or, create thousands of VM clones for VDI…

How much money are you saving?

What do simplicity and speed mean to your business from an OpEx savings standpoint?

Another way to look at it:

How much more efficient would your business be if you weren’t hampered by the limitations of legacy technology? It’s all about becoming aware of the expanded possibilities.

What you buy

FYI, and to clear any misconceptions in case you can’t be bothered to read the rest: if you ask me for a 10TB usable system, you’ll get a system that will truly provide 10TB usable, honest-to-goodness Base2 space protected against dual-drive failure (no RAID5 silliness), and after all overheads, spares etc. have been taken out. If you want snapshot space we’ll have to add some (like you’d need to with any other vendor). It’s as simple as that.

Right-sized, real space vs raw capacity

Others have explained some of this before but, for completeness, I’ll take a stab:

  • The real usable size of, say, a 450GB drive is not really 450GB regardless of the manufacturer.
  • The real usable capacity quoted depends on whether it’s Base2 or Base10 math and a bunch of other factors
  • All vendors that source drives from multiple manufacturers that use RAID groups need to right-size their drives – meaning that, if manufacturer A offers a tad more space in the drive than manufacturer B, in order to use both kinds of drives in the same RAID group, you kinda need to make them seem like the exact same size, meaning you go for the lowest common denominator between drive vendors.
  • Using our 450GB example above, the real addressable right-sized Base10 space in that drive is 438.3GB, and even less in Base2 (402.2). Base2 math simply means 1024 bytes in 1K, not 1000, and the rest follows.
  • Beware of analysis, comparisons or quotes showing Base10 from one vendor and Base2 from another, or raw disk space from one vendor vs right-sized from another! Always ask which base you’re seeing and whether the numbers reflect right-sized drives! If you look at the right-sized drive Base2 space from various vendors, it’s usually pretty close. Base your % usable calculations on that number and not the marketing 450GB number that’s not real for any vendor anyway.
  • Everyone pretty much buys the same drives from the same drive manufacturers
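The Base10 vs Base2 difference alone is easy to check. This sketch only does the unit conversion (10^9 bytes per GB vs 2^30 bytes per GB); it does not model right-sizing, since right-sized values come from each vendor's own tables. The function name is mine.

```python
def base10_gb_to_base2_gb(gb):
    """Convert decimal gigabytes (10**9 bytes) to binary gigabytes (2**30 bytes)."""
    return gb * 10**9 / 2**30

marketing_gb = 450
converted = base10_gb_to_base2_gb(marketing_gb)
print(f"{marketing_gb}GB in Base10 is about {converted:.1f}GB in Base2")
```

So a quote in Base10 looks roughly 7% bigger than the same space in Base2 before any other overhead is considered, which is why mixing bases between vendors is misleading.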

Some space reservation axioms

Any system that allows snapshots, clones etc. typically needs some space for those advanced operations. For instance, if you completely fill up a system and then want to take a snapshot, it may let you but if you modify any data then it won’t have space to store the writes and the snapshot will be invalidated and deleted – kinda pointless.

As usual, there is no magic. If you expect to be able to store multiple snapshots, the system needs space to store the data changed between snapshots, regardless of array vendor!
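For a feel of how much space that is, here is a rough upper-bound estimate. The data size, change rate and retention are hypothetical, and the model assumes every day's changed blocks are unique, which overstates real-world usage when the same blocks get rewritten repeatedly.

```python
def snapshot_delta_tb(data_tb, daily_change_rate, retained_days):
    """Rough upper bound on snapshot space: every day's changed blocks are
    assumed unique and are kept for the whole retention period."""
    return data_tb * daily_change_rate * retained_days

# Hypothetical: 10TB of data, 2% daily change, 30 daily snapshots retained.
needed = snapshot_delta_tb(data_tb=10, daily_change_rate=0.02, retained_days=30)
print(f"plan for up to about {needed:.1f}TB of snapshot space")
```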

And, out of curiosity – how many man-made devices do you own that you max out all the time? Not leaving breathing room is a recipe for trouble for any piece of equipment.

Explanation of the NetApp data organization

For the uninitiated, here’s a hierarchical list of NetApp structures:

  1. Disks
  2. RAID groups – made of multiple disks. Default RAID is RAID-DP. The system automatically makes them, you don’t need to define them or worry about back-end balancing etc. NetApp RAID groups are typically large, 16 disks or so. RAID-DP ensures better protection than RAID10 (the math shows 163x better than RAID10 and 4,000x better than RAID5).
  3. Parity drives – drives containing extra information that can be used to rebuild data. RAID-DP uses 2 parity drives per RAID group.
  4. Spares – drives that can replace failed or failing drives (no need to wait until the drive is truly dead)
  5. Aggregates – a collection of RAID groups and the basic unit from which space is allocated. That’s really what you define, then the system figures out automatically how to allocate disks and create RAID groups for you (can even expand RAID groups on the fly as you add more disks to the aggregate, even 1 disk at a time).
  6. Volumes – a container that takes space from an Aggregate. A volume can be NAS or SAN. A volume can only belong to one Aggregate, and there will typically be many volumes within an Aggregate. Most people will enable the automatic growing of Volumes.
  7. LUNs – they are placed inside the Volumes. One or more per volume, depending on what you’re trying to do. Usually one.
  8. Snapshots – logical, space-efficient copies of either entire Volumes or structures within volumes. There are 3 kinds depending on what you’re trying to do (Snapshot, Snapvault and Flexclone) but they all use similar underlying technology. I might get into the differences in a future post. Briefly: Snapshot -shorter term, Snapvault – longer term, Flexclone – writeable Snapshot.

Explanation of the NetApp space allocations

  1. Snapshot Reserve – an accounting feature that sets aside a logical percentage of space on a Volume. For instance, if you create a 10TB volume and set a 10% Snap Reserve, the client system will see 9TB usable. Most people will enable automatic deletion of Snapshots. The percentage to set aside is at your discretion and is variable on the fly. The actual amount of space consumed is related to your rate of change between snapshots. See here for some real averages across thousands of systems.
  2. Aggregate Snap Reserve – this is pretty unique. One can actually roll back an entire Aggregate on a NetApp system – can come in handy if you accidentally deleted whole Volumes or in general did some gigantic boo-boo. Rolling back the entire Aggregate will undo whatever was done to that aggregate to break it! This feature is enabled by default and has a 5% reservation. It is not mandatory unless you are running Syncmirror (mostly in Metrocluster setups). Depending on what you want to do, you could disable this altogether or set it to a small number like 1% (my recommendation).
  3. Fractional Reserve – The one that confuses everyone. In a nutshell: it’s a legacy safety net in case you want to modify all the data within a LUN yet still keep the snapshots. Think about it: Let’s say you took a snapshot and you then went ahead and modified every single block of your data. Your snap delta would balloon to the total size of the LUN – regardless of whether you use NetApp, EMC, XIV, Compellent, 3Par, HDS, HP etc etc. The data has to go someplace! There’s a great explanation in this document and I suggest you read it since it covers quite a bit more, too. This one is great, too. Long story short: With snapshot autodelete, and/or volume autogrow, you can set it to zero. If you use the SnapManager products, they take care of snapshot deletion themselves.
  4. System reserve – this is the only one that’s not optional. It’s set to 10% by default. You can actually change it but I’m not telling you how. That space is there for a reason, and changing it will potentially cause problems with high write rate environments. That 10% is used for various operations and has been found to be a good percentage to maintain good performance. All NetApp sizing takes this into account. BTW – ask other vendors if it’s perfectly safe to fill their systems at 100% all the time and whether that impacts performance or prevents them from being able to do certain things. And finally, that 10% lost is gained back in spades with the other NetApp efficiency methodologies (starting at the low level with RAID-DP – please do some simple math based on our 16+ drive RAID group vs typical RAID group sizes) so it doesn’t even matter.

Bottom line: Aside from the 10% system reserve, the rest is all usable space.
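The "simple math" can be sketched as below. This is a simplified model, not how the system does its accounting: it assumes identical right-sized drives, 2 parity drives per RAID group, and reserves applied as simple multipliers. The disk counts are hypothetical.

```python
import math

def usable_fraction(total_disks, spares, raid_group_size=16,
                    parity_per_group=2, system_reserve=0.10,
                    aggr_snap_reserve=0.0):
    """Fraction of total (right-sized) disk space left for data after
    spares, parity, the system reserve and the aggregate snap reserve."""
    in_groups = total_disks - spares
    groups = math.ceil(in_groups / raid_group_size)
    data_disks = in_groups - groups * parity_per_group
    after_reserves = data_disks * (1 - system_reserve) * (1 - aggr_snap_reserve)
    return after_reserves / total_disks

# Hypothetical system: 100 disks, 4 spares, 16-disk RAID-DP groups.
print(f"aggregate snap reserve off: {usable_fraction(100, 4):.1%}")
print(f"aggregate snap reserve 5%:  "
      f"{usable_fraction(100, 4, aggr_snap_reserve=0.05):.1%}")
```

Changing `raid_group_size` and `parity_per_group` lets you compare against smaller RAID groups; the larger the group, the smaller the share of disks spent on parity.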

The NetApp defaults and some advice

So, here’s where it can get interesting (and confusing) and where the competition gets all their ammunition. Depending on the age of the documentation and firmware, different best practices and defaults apply.

So, if you look at competitive docs from other vendors, they claim that if you use NetApp for LUNs you waste double the space for fractional reserve. That recommendation was true many years ago and it was a safety precaution regarding fractional reserve. The documentation was updated years ago with zero fractional reserve as the recommendation, but of course that doesn’t help competitors so they left the old messaging. So here’s a basic list of quick recommendations for LUNs:

  1. Snap reserve – 0
  2. Fractional reserve – 0
  3. Snap autodelete on (unless you have SnapManager products managing the snap deletion)
  4. Volume autogrow on
  5. Leave at least a little space available in your volumes, don’t let a LUN 100% fill a volume (the LUN space can be thick but the volume space can be thin-provisioned). This space is needed for deduplication and other processes temporarily
  6. Do consider embracing thin provisioning, even if you don’t want to oversubscribe your disk. It’s much more flexible long-term, and allows for storage elasticity.
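If you keep an inventory of volume settings, the first four recommendations above can be checked mechanically. This is only a sketch: the dictionary keys are names I made up for illustration, not actual system option names, and how you collect the settings is up to you.

```python
# Recommended values for volumes holding LUNs, per the list above.
# Key names are hypothetical, not real option names.
RECOMMENDED_LUN_VOLUME_SETTINGS = {
    "snap_reserve_pct": 0,
    "fractional_reserve_pct": 0,
    "snap_autodelete": True,
    "volume_autogrow": True,
}

def deviations(volume_settings, snapmanager_managed=False):
    """Return {setting: (actual, recommended)} for values that differ."""
    diffs = {}
    for key, recommended in RECOMMENDED_LUN_VOLUME_SETTINGS.items():
        if key == "snap_autodelete" and snapmanager_managed:
            continue  # SnapManager products handle snapshot deletion themselves
        actual = volume_settings.get(key)
        if actual != recommended:
            diffs[key] = (actual, recommended)
    return diffs

# Hypothetical older volume still carrying legacy defaults.
old_volume = {"snap_reserve_pct": 20, "fractional_reserve_pct": 100,
              "snap_autodelete": False, "volume_autogrow": True}

for setting, (actual, recommended) in deviations(old_volume).items():
    print(f"{setting}: is {actual}, recommended {recommended}")
```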

So, look at the defaults and ask your engineer if it’s OK to change them if they don’t agree with the settings above. Especially on older systems, I notice that the fractional reserve is still 100%, even after getting updated with the latest software (the update doesn’t change your config). Nothing like giving someone a bunch of disk space back with a few clicks…

If you want to do thin provisioning, depending on the firmware, you may see that using thin provisioning on a volume forces the fractional reserve to 100% – but, ultimately, no real space is being consumed. Was OK in 7.2x, changed to the 100% behavior in 7.3.1, fixed in 7.3.3 since it was confusing everyone.

The bottom line

Ultimately, I want you thinking of how you can use your storage as a resource that enables you to do more than just storing your LUNs. And, finally, I wanted to dispel notions that NetApp storage has less storage efficiency than legacy systems. Comments are always appreciated!

D

What exactly is Unified Storage and who can sell it to you?

It’s come to my attention that pretty much every storage manufacturer is trying to imitate NetApp’s thought leadership and keeps announcing “Unified Storage” products. Everyone can do it now, it seems :)

Now, this post is not going to be bashing them or claiming they don’t work.

This post is about arguing what “Unified Storage” really means. And, more importantly, whether you should care about the differences.

Now, NetApp has been shipping Unified Storage for 8+ years now, and has shipped 150,000 Unified Storage systems to date. See here and here. So, I’d think nobody can argue that NetApp has quite a bit of experience in the technology and, indeed, were the very first to do it. Depending on your definition of “Unified”, NetApp may still be the only one doing it, but read on.

The crazy success of NetApp’s Unified Storage (just look at the company’s growth) has forced the other vendors, who initially dismissed the concept, to take a harder look – imagine that, customers actually like the idea of a Unified Storage System!

Here’s how most (if not all) other vendors approach “Unified Storage”:

  • Start with your legacy Fiber Channel Array, use that to serve FC and maybe iSCSI. It’s probably a decent box, no reason to re-invent the wheel.
  • Connect some kind of Windows, Linux or UNIX server(s) to it that will then serve CIFS and NFS and maybe iSCSI (this is the NAS part)
  • Replicate them using different mechanisms for the FC and NAS parts

Pretty simple, really. You end up with the base legacy array, plus more boxes on top (ideally 2+ to ensure redundancy, plus some of them need an extra box or two called a “Control Station” in one implementation).

It all works – after all, it’s just like putting servers in front of your storage, you’re doing that anyway. You are able to serve FC, iSCSI, NFS and CIFS out of the same rack, if we assume that the rack is the termination point for the cables and that you don’t care much about exactly what happens within. So, most C-level execs are OK with it – the rack can serve out all those protocols, ergo the “Unified Storage” claim seems justified.

Here are some potentially business-impacting issues with this approach:

  1. Aside from a couple of exceptions, the add-on boxes used by the storage vendors to add the NAS protocols aren’t even made by that vendor (neither the OS nor the hardware). Obviously that raises some concerns with interoperability, manageability and the longevity of whatever NAS vendor was chosen. Support is now maybe not as robust since you are relying on using tech someone licensed from someone else.
  2. Replication gets complicated since you need to do it a few different ways depending on what protocol you’re replicating.
  3. Patching is more time-consuming since, apart from the legacy array, you need to also patch all the NAS paraphernalia.
  4. Management is frequently totally separate and laborious – you might have to take care of the legacy array separately from the NAS part
  5. Certain important features are only available to one part of the solution (file-level single-instancing/dedupe, for example, only available for CIFS and NFS and not for iSCSI or FC).
  6. And, finally, what I think is the biggest problem: Space allocation is split between the FC and NAS parts and you can’t reduce one to increase the other. For instance, if you started with a 50/50 split, once you’ve allocated the space to the NAS (that always has its own Volume Manager and now owns that 50% chunk of array space), and you realize you’re only using 10% of that space after all, you can’t go ahead and return the remainder of the space to the FC part. This can cause serious inefficiency, inflexibility, cost and manageability issues.
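The 50/50 example in the last point works out as follows. The 100TB array size is an assumption for illustration; the split and the 10% utilization are the figures from the example.

```python
def stranded_tb(array_tb, nas_share, nas_utilization):
    """Space carved out for the NAS side but unused, which a split
    design cannot hand back to the FC side."""
    nas_tb = array_tb * nas_share
    return nas_tb * (1 - nas_utilization)

# Hypothetical 100TB array, 50/50 split, NAS side only 10% used.
wasted = stranded_tb(array_tb=100, nas_share=0.5, nas_utilization=0.10)
print(f"{wasted:.0f}TB stranded on the NAS side")
```

With a single shared pool that same space would simply remain free for whichever protocol needs it next.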

The NetApp approach

NetApp decided to do things a bit differently. Maybe by virtue of how the original systems started out, it turned out it was easier for NetApp to create what is effectively a protocol engine. Maybe “Protocol Engine with Integrated Disk Control, Space Efficiency Technologies and Protection” is more appropriate than “Unified Storage” but it’s a bit wordy…

Effectively, a single NetApp box, without external hangers-on, allows you to:

  • Connect using a variety of methods – FC, 1GbE, 10GbE, FCoE
  • Use the proprietary NetApp RAID-DP protection for great performance and better protection than RAID10
  • Provision FC, iSCSI, CIFS and NFS out of the same pool of physical disk space
  • Reclaim space from FC, iSCSI, CIFS and NFS and put it back in the pool of space
  • Deduplicate FC, iSCSI, CIFS and NFS workloads
  • Perform application-aware replication regardless of protocol
  • Take application-aware snapshots regardless of protocol
  • Clone VMs, DBs and indeed, anything you like, without chewing up space and without impacting performance
  • Virtualize legacy arrays and impart on them the NetApp features
  • Perform workload and cache prioritization
  • Auto-tier hot blocks to gigantic cache to increase speeds (at a super-efficient 4K granularity)

As you can see, everything happens within one system, there’s no separate RAID controller or NAS box or replication box. And, like it or not, that’s a pretty impressive list of capabilities that a single architecture provides.

The potential business benefits with a true Unified Storage system:

  1. Single product, single OS, single architecture – you’re not relying on the marriage of completely different boxes.
  2. Better reliability, fewer things to break.
  3. Better support – no finger-pointing, it’s a single system from a single company.
  4. Consistent replication – one way to replicate things, yet still application-aware for 100% recoverability, improved CapEx and OpEx.
  5. Management simplicity – lower OpEx.
  6. All performance-enhancing and efficiency features are available to all protocols – Improved CapEx.
  7. There’s no dichotomy between FC, iSCSI and NAS space – allocations are fluid -  Improved CapEx and OpEx.
  8. Protect your existing investment by virtualizing existing legacy disk arrays – improved CapEx and OpEx.
  9. Overall lower OpEx and CapEx – in addition to the significant space-saving features (avoid purchasing as much storage long-term), there’s significant cost avoidance since you potentially don’t need to purchase: Backup software, deduplication appliances, replication appliances, fileservers, OS licenses.

So, should you care how “Unified Storage” is architected?

Beyond the philosophical debate (one box vs multiple), given what you read, what do you think? I believe that the multi-box approach has some inherent drawbacks that are difficult to overcome. Comments welcome as always.

D

Filesystem benchmark extravaganza – Win, Linux, NTFS, EXT4, XFS, BFS scheduler impact and more…


It’s been a while since I checked the status of Linux-land regarding filesystems and CPU and I/O schedulers, so I thought I’d post some results.

A bunch of new distributions are coming out with Linux kernel 2.6.32 as standard (Ubuntu 10.04LTS one of them), and one distribution (PC Linux OS) will have Con Kolivas’ BFS as default process scheduler.

Since I’m a scheduler and filesystem aficionado, I was intrigued by the simplicity of the BFS and wanted to see how it might affect I/O performance. I don’t believe that there’s ever a single scheduling algorithm that fits all possible use cases, and, furthermore, I think the default Linux CFS scheduler is getting a bit unwieldy in its complexity (though the goal is to make it scale to very large numbers of cores, not all machines have that).

So I grabbed a spare machine that doesn’t have a ton of horsepower on purpose, and tested using the same benchmark (NetApp’s venerable postmark) on the same part of the disk, with Windows, Ubuntu 10.04LTS beta, and PCLOS 2010 beta2.

Here’s some of the averaged data (did multiple runs):

 

Configuration              Time  IOPS  Creation  Reads  Appends  Deletes  Read MB/s  Write MB/s
win tuned                   208   244       156    121      123      183       2.68       5.658
win supercache              197   224       200    111      113      175       2.78       5.88
win uptempo                 172   281       277    139      141      156       3.19       6.73
Ubuntu 10.04b               154   169       454     84       85      727       3.56       7.52
Ubuntu 10.04b deadline      136   173       500     86       87    10184       4.03       8.51
PCLOS ext4 CFQ              105   239      1011    118      120     4478       5.25      11.09
PCLOS ext4 deadline          97   240      1158    119      121     7828       5.73      12.21
PCLOS XFS tuned deadline    187   136       476     68       68      509       2.93       6.19

Some explanation on the fields:

  • Time: Total time to run
  • IOPS: mixed transactions per second (maybe the most useful number)
  • Creation: files created per second (not mixed with transactions)
  • Reads: reads per second
  • Appends: appends per second
  • Deletes: pure file deletions per second (not mixed with transactions)
  • Read and write MB/s: the effective MB/s

Some notes:

  • Windows was just vanilla XP with all service packs, tuned as a server (box was too weak to take Win7)
  • Supercache was using a portion of memory for the Supercache filter driver
  • Uptempo is a similar filter driver that provides caching (see previous posts here and here)
  • Ubuntu was the latest available beta of 10.04
  • Whenever you see “deadline” the deadline scheduler was used instead of CFQ
  • PCLOS is the latest available beta of PCLOS 2010.
  • I mounted ext4 with barrier=0,noatime
  • I mounted XFS with nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k and made it with mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=128m
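If you want to confirm which I/O scheduler a disk is using before repeating a test like this, kernels of this generation expose it in /sys/block/<device>/queue/scheduler, with the active scheduler shown in square brackets. A small sketch follows; the device name "sda" is a placeholder, and the helper names are mine.

```python
def active_scheduler(sysfs_text):
    """The kernel lists the available schedulers and brackets the active
    one, e.g. 'noop anticipatory deadline [cfq]'."""
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None

def read_active_scheduler(device="sda"):
    # 'sda' is a placeholder device name; adjust for the disk under test.
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())

print(active_scheduler("noop anticipatory deadline [cfq]"))
```

Writing a scheduler name to the same file as root switches the scheduler for that device at runtime, which makes it easy to compare CFQ and deadline back to back.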

Some pretty pictures for the ADD among us:

[Four charts of the results above]

Some observations:

  • Some metadata-heavy operations, like deletion, can get heavily cached in Linux, which skews the MB/s and total time results when 10,000 files can be deleted in an instant. I wanted to show the numbers because, depending on what you do, that may or may not be important.
  • Intelligent caching still helps tremendously, note the results for Uptempo on Windows and the crazy file creation times on Linux (also skews the MB/s but is a useful number to know if you’re creating a lot of files constantly)
  • Depending on what you’re doing, Windows can be Just Fine…
  • The BFS seems to have been an excellent scheduler choice for PCLOS and something other Linux vendors should start looking at seriously – unless there are serious other I/O tweaks in the kernel that I don’t know about, the difference in performance is staggering between Ubuntu and PCLOS (Phoronix figured that out as well here – they have tons of additional benchmarks showing things other than I/O).
  • And, last but by no means least – the deadline scheduler is ahead of CFQ again, both for Ubuntu and PCLOS. Not by a huge margin but it’s a safe choice especially if you’re running Linux on enterprise storage that has its own decent I/O scheduler built-in. There have been cases of CFQ dramatically lowering performance with external arrays. Remember – most of the guys writing code for Linux don’t have access to enterprise storage… using the defaults could be harming your Linux performance!
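To put numbers on the last two observations, this sketch computes percentage differences from the time and IOPS columns of the results table. The values are copied from the table; nothing else is measured or assumed.

```python
# (total time, IOPS) pairs copied from the results table.
results = {
    "Ubuntu 10.04b": (154, 169),
    "Ubuntu 10.04b deadline": (136, 173),
    "PCLOS ext4 CFQ": (105, 239),
    "PCLOS ext4 deadline": (97, 240),
}

def pct_change(old, new):
    """Percentage change going from old to new."""
    return (new - old) / old * 100

# Deadline vs the default CFQ scheduler on each distribution.
for base, tuned in [("Ubuntu 10.04b", "Ubuntu 10.04b deadline"),
                    ("PCLOS ext4 CFQ", "PCLOS ext4 deadline")]:
    time_delta = pct_change(results[base][0], results[tuned][0])
    iops_delta = pct_change(results[base][1], results[tuned][1])
    print(f"{tuned}: time {time_delta:+.1f}%, IOPS {iops_delta:+.1f}%")

# Both on CFQ: the difference between the two distributions.
distro_delta = pct_change(results["Ubuntu 10.04b"][1],
                          results["PCLOS ext4 CFQ"][1])
print(f"PCLOS vs Ubuntu (both CFQ): IOPS {distro_delta:+.1f}%")
```

The deadline gain shows up mainly in total run time, while the gap between the two distributions is far larger than the gap between the two I/O schedulers.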

D


FUD and The Invention of Lying

I watched “The Invention of Lying” movie the other day. Fairly entertaining, and it had an interesting concept:

Imagine a society where nobody can lie – the very concept of lying is alien and never even enters anyone’s mind. Obviously, tons of jokes can be made using that premise, and the movie is riddled with them – such as their fictional Pepsi ad: “Pepsi: when they’re out of Coke!”

In the movie, a single man stumbles upon the concept of lying, and realizes he can do whatever he wishes since nobody else can tell he’s lying.

Obviously, in our society lying is quite prevalent – a large percentage of the population wouldn’t have jobs or offspring without lying.

I thought – what if, just for fun, we applied “The Invention of Lying” movie concept to IT sales? (I guess this is another take on comparing vendors to cars or wines and whatnot). I’m going for an alphabetical, non-comprehensive list (and added a few non-storage entries). I’ll leave it to the reader to figure out if this is more accurate from the standpoint of a rep that cannot lie, or vice versa… :)

  • 3Par: Our best asset is Marc Farley, his highly entertaining blog is what sells our gear. Our gear is pretty fast, though the software is not as good as others’. Unsure how we are still in business. Also unsure why nobody has bought us yet. We do have a handful of very large, loyal customers.
  • Apple: Our stuff is prettier but inside it’s all the same, actually often slower than others. Oh, and it’s a lot more expensive. But the software is cool (when you can find it). You’ll probably need to run Windows in a VM anyway to get the full functionality. Did we mention our stuff is prettier?
  • Bluearc: We have limited-functionality NAS with good sequential and random read speeds but not so much for random writes. Oh, and no application integration. But it’s good for certain workloads. Why is nobody acquiring us?
  • Compellent: Data Progression is the coolest thing we do, and we’ll probably go under now that the big vendors can do it. Oh, and it never did much in the real world, especially for performance. Hopefully we’ll get acquired, but if our technology is that good, why did nobody acquire us yet? We’re extremely affordable!
  • Equallogic: We’ll give you free storage (the first hit is free) if/since you also buy Dell servers. We might even throw in a free laptop and a projector. And a mouse pad. Make sure you convert everything to iSCSI since that’s all we do. Oh, you wanted to know specifics about the storage? Well – it’s free! If you buy some servers. You really want to know about the storage? Well, it’s free if… What? You want to understand the failure math of RAID 50? It’s atrocious, but the box is free if…
  • EMC: We buy companies since innovating is kinda hard and time-consuming, so our solutions end up being a mish-mash of technologies. It all mostly works, though interoperability between platforms sucks. Regarding storage, you should really only buy Symmetrix since all our other stuff doesn’t even come close to that quality, we have the other boxes just to meet price points and plug portfolio holes. We trash competitors until we acquire them or until we build something good enough that’s similar. We also sell futures. Hard. We focus too much on NetApp.
  • HDS: We don’t know how to write software but our high-end gear hardware is pretty solid. The cheaper stuff is OK, severely lacks in functionality but we’ll just drop the price enough that you’ll buy it anyway. Capisce?
  • HP: Seems that buying companies works for EMC, we’ll do the same, let’s see what happens. We used to make the best calculators in the world. Oh, and our best array is actually made by HDS. Our servers are great! Please, also buy some printers, they’re pretty good.
  • IBM: We used to be some of the best in storage, now our only 2 products are SVC and DS8K (oops, and now XIV), everything else we resell after we put our faceplates on it. Our biggest sellers are products made by LSI and NetApp. Oh, and we internally compete with the XIV team we acquired. Our storage solutions don’t talk to one another since they’re all made by different people. But SVC can tie it all together! Well, some of it, anyway.
  • Intel: We are so big that even if AMD has better stuff, eventually we catch up. Just you wait. In the meantime, buy more Intel to keep us going. Resistance is futile.
  • Isilon: We are decent for bulk sequential-access NAS, just don’t do any kind of random workload on our gear.
  • LeftHand: If you want any reasonable storage efficiency plus resiliency you need to buy a bunch of boxes (5 or so), since each box is essentially an HP server with internal disks, and the whole server can die. Oh, and we only do iSCSI. So you better make sure you only do iSCSI.
  • NetApp: We probably have some of the worst marketing of all vendors, and often can’t clearly articulate what makes our systems better to C-level execs, focusing almost entirely on techies. We also have issues with making some acquisitions pan out. ONTAP 8 is taking us forever to release, and until then you won’t have very wide striping (update: GA’d 3/19/10). We complicate sales because our engineers are too technical and insist on explaining how the boxes work at a low level, frequently confusing customers, who seldom care about understanding Row-Diagonal Parity equations. Too much good information is tribal knowledge, including performance tuning and the gigantic customers we have. We focus too much on EMC.
  • Pillar: We cry ourselves to sleep because all we have is Larry Ellison and QoS. Maybe Larry will finally force Oracle to buy some of his^H^H^H our gear? I wonder how that will go down since Oracle is already using a superior technology and achieving great savings… but we do make a fairly fast box if you’re OK with limited functionality and RAID50.
  • Sun: We can sell you some LSI storage, but even that may be going away. You can also get the exact same storage from IBM that also resells LSI. How about a Thumper? We may also have some leftover HDS gear that we can give you real cheap.
  • Xiotech: Our value prop is extremely obscure and only understood well by about 5 engineers. Out of those 5 engineers, 2 understand the exact failure scenarios of our ISE architecture, and they can’t explain it to anyone else. We are pretty cheap though.
  • XIV: We believe in success through obfuscation. Our box can only do about 17K IOPS if the workload isn’t cache-friendly but we know how to cheat in benchmarks and make it seem faster (make sure your benchmark writes all zeros and/or fits in cache). The box also consumes more power and space than any other storage system. Our reps compete with IBM reps even though we are owned by IBM, since we only get paid on XIV sales, regardless of what the customer’s needs are. Oh, and under certain conditions, a 2-disk failure will bring down the entire system. But don’t you worry about that. BTW, the GUI is amazingly pretty.

Hope you had a chuckle reading some of this!

(minor edits – typo plus some on Twitter complained I was too gentle in the NetApp section :) )

D

Are you using the features of your existing platforms? And, if not, why not?

This is going to be another post that was inspired by sheer frustration…

It’s one thing talking to someone about adopting a totally new platform and meeting with resistance – I get it, it’s not what they’re used to, it’s new stuff, they don’t know if it will work etc. etc.

However, recently I’m encountering an alarming percentage of existing users of technology that are not using a lot of the features available to them – and I don’t mean small things, I’m talking about the features that someone literally buys the equipment for…

I understand if we’re talking about a feature you actually have to pay extra for, there may not be money in the budget for it. But this is not what this post is about…

Do you use the freely available or already paid for features? How do you know?

Consider this (I have more examples but we’ll keep it simple): I have a handful of customers that use our equipment (NetApp) with VMware that steadfastly refuse to even consider:

  • Deduplication
  • Thin Provisioning
  • Snapshots
  • Rapid, thin VM cloning

Those 4 technologies are frequently the reasons someone buys NetApp in the first place for virtualized environments, since they can lead to:

  • Vastly reduced storage footprint
  • Faster performance
  • Easier management
  • Easier and faster backup and recovery
  • Tremendous money savings

In my sample base, those customers absolutely would benefit from those technologies – it’s not a “maybe” or “your mileage may vary”. I know how their data is laid out and what kind of data it is, and the difference will be staggering.

Unjustified anger

I’ve also had customers tell me “where are my promised efficiencies?” They get really irate, and when I tell them exactly what to do in order to get said efficiencies, they start backpedalling and telling me how they can’t turn the features on during production hours. They then promise to turn some on during a maintenance window, then time goes by, they seem to forget about it and call me again, irate, complaining about the lack of features and efficiencies. And the cycle continues.

Is it an education problem? Lack of time?

Maybe it’s just a matter of education, but when someone is presented with the facts, several use cases from other local and global customers (including huge household names everyone recognizes), customers with hundreds of PB of data, all of them using the technology and achieving in many cases more than a 3:1 reduction in storage footprint, and still ignores the advice, there’s something wrong…

The other excuse for “shelfware” (software you never use but you just leave on the shelf) is lack of time to implement the features. For complex software I can see time being an issue, but my example is about things that can be done with a few mouse clicks.

The not invented here syndrome

There’s a term called “the not invented here syndrome”. This is an affliction suffered by professionals in all kinds of fields, not just IT. Some symptoms include:

  • Extreme resistance to any new ideas that were not developed within the company (frequently, by that person)
  • Extreme resistance to any kind of change, no matter how benign, low-risk, low-cost and beneficial it might be
  • Dismissing irrefutable proof
  • Thinking that your problems are more challenging than everyone else’s
  • The inability to recognize the real challenges facing their organization (“can’t see the forest for the trees”)

This is a perfectly normal human condition. We each have our world view, and some of us really don’t like having that view challenged. The human mind will actually go to amazing lengths to ensure that the existing worldview stays unmodified. The examples are all around us – people ignore what seems to be common sense all the time. History is full of horrific examples. I don’t want to depress anyone, so here are some humorous examples:

"I don’t trust fire, it can burn you!"

"That wheel thing seems like the devil’s own work!"

"Nobody needs more than 640K RAM in their PC".

Some friendly advice…

Back to the IT world. There are a few simple things you can do in order to make life a bit easier for all.

  1. Please read the documentation suggested by your engineer
  2. Then read it again and take notes and prepare questions
  3. Be open to new ideas – “luddite technologist” is a contradiction in terms
  4. Be flexible – try new things on copies of data or less important data, there’s always a way…
  5. Reach out to your engineer, don’t always wait for them to reach out (our schedules are usually crazy)
  6. Think in terms of the business problems you’re trying to solve, not in terms of technology (you may not know that what you have can already solve your problems)
  7. If your vendor reaches out to you, maybe it’s not just to sell you more stuff… maybe we’re even trying to help out. Imagine that!
  8. Never assume anything (including that you always know better than the vendor, or that everyone’s lying to you, especially if you already own their gear!)
  9. If presented with irrefutable proof of something, consider graciously conceding
  10. Be aware of your shortcomings and prejudices (we all have them)
  11. Accept you don’t know it all (guess what – the customer is not always right!)
  12. And, last but not least: put the business first, and your ego a distant last.

I’ll get off my soapbox now.

D

More tales from the field: Sizing best practices – does Compellent follow them?

Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers – remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was Exchange sizing where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance – there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last for, and see if the caches would be able to accommodate them. If a spike is lasting for 30 min it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)

Then, you have your average performance. That just takes a straight math average across all performance points – so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of performance during normal working periods. It’s actually easy to eyeball if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:

image

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
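If you’d rather not eyeball it, here is a rough Python sketch of the same idea. It is only one possible approach, not a standard method: it throws away the idle samples and takes the median of what’s left, so a brief spike doesn’t drag the number up. The sample data and the idle_threshold value are made up for illustration.

```python
from statistics import mean, median


def steady_state_iops(samples, idle_threshold):
    """Median of the samples taken while the system was actually busy.

    Samples at or below idle_threshold are treated as quiet periods and
    ignored; the median keeps short spikes from dragging the result up.
    """
    busy = [s for s in samples if s > idle_threshold]
    if not busy:
        return 0
    return median(busy)


# Made-up samples: a quiet night, one brief spike, then a working day.
samples = [20] * 10 + [3000] + [500] * 10

print("plain average:", round(mean(samples)))
print("steady state :", steady_state_iops(samples, idle_threshold=50))
```

With this data the plain average lands well below the 500 IOPS the system is really doing during working hours, which is exactly the trap described above.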

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
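To put numbers on that, here is a small Python sketch using the usual rule-of-thumb write penalties (4 back-end I/Os per small random write on RAID5, 2 on RAID10). Those values are my assumption for the sketch, not figures from any particular array, and real systems with cache and full-stripe writes can do better.

```python
# Back-end disk I/Os generated per host write. These are the usual
# rule-of-thumb values, not figures from any particular array.
WRITE_PENALTY = {
    "raid0": 1,   # no protection, no extra I/O
    "raid10": 2,  # every write lands on two mirrored disks
    "raid5": 4,   # read data + read parity, write data + write parity
}


def backend_iops(host_iops, write_fraction, raid_level):
    """Disk IOPS needed to serve host_iops at the given write fraction."""
    if not 0 <= write_fraction <= 1:
        raise ValueError("write_fraction must be between 0 and 1")
    reads = host_iops * (1 - write_fraction)
    writes = host_iops * write_fraction
    return reads + writes * WRITE_PENALTY[raid_level]
```

For example, 1,000 host IOPS at 25% writes works out to 1,750 disk IOPS on RAID5 and 1,250 on RAID10 under these assumptions. The more write-heavy the mix, the bigger the gap.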

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.
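A toy simulation makes the cache starvation point concrete. This is a sketch under heavy assumptions: a plain LRU cache, made-up block counts, a DB that loops over a small working set and a file share that never reads the same block twice. It is not how any vendor’s cache actually works.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, key):
        """Touch a block; return True on a hit, False on a miss."""
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return True
        self.blocks[key] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return False


def db_hit_ratio(db_cache, share_cache, working_set=100, rounds=50,
                 share_per_db=4):
    """Hit ratio seen by a DB looping over its working set while a file
    share streams never-repeated blocks in between the DB reads."""
    hits = total = 0
    share_block = 0
    for _round in range(rounds):
        for block in range(working_set):
            total += 1
            hits += db_cache.access(("db", block))
            for _i in range(share_per_db):
                share_cache.access(("share", share_block))
                share_block += 1
    return hits / total


# Same 200 blocks of cache in both cases.
shared = LRUCache(200)
print("shared cache     :", db_hit_ratio(shared, shared))
print("partitioned cache:", db_hit_ratio(LRUCache(150), LRUCache(50)))
```

In this toy setup the shared cache gives the DB no hits at all, because the file share pushes every DB block out before it is read again. Once part of the cache is reserved for the DB, the working set fits and nearly every read is a hit.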

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
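The arithmetic is simple enough to script. A minimal Python sketch, using only the figures quoted above (220 IOPS per 15K drive, 3,000 steady-state IOPS, 50% RAID5 overhead); the function and parameter names are mine, and cache is deliberately ignored since the working set doesn’t fit in it:

```python
import math

IOPS_PER_15K_DISK = 220   # figure quoted in the post, at under 20ms latency


def drives_needed(steady_state_iops, raid_overhead=1.0, spares=0,
                  iops_per_disk=IOPS_PER_15K_DISK):
    """Minimum spindle count for a spindle-bound workload (cache ignored)."""
    disk_iops = steady_state_iops * raid_overhead
    return math.ceil(disk_iops / iops_per_disk) + spares


print(drives_needed(3000))                               # pre-RAID: 14
print(drives_needed(3000, raid_overhead=1.5))            # RAID5-adjusted: 21
print(drives_needed(3000, raid_overhead=1.5, spares=2))  # plus two spares: 23
```

Note that none of this leaves any headroom for the spikes.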

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives, about half of what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s first examine how FAST (Fully Automated Storage Tiering) is implemented, since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs, this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic since you have to approve the move! Ergo, it should not be sold under the name “FAST” since the “A” stands for “Automated”, aren’t there laws for false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.

The real problem

First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – the Compellent approach, FASTv2 and, even worse, v1 all suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: you have the added complexity and cost of FAST, plus you have to worry about creating exception rules.
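To see why a periodic, look-at-the-past policy gets caught out by the payroll scenario, here is a toy Python simulation. To be clear, this is a hypothetical policy I made up for illustration (demote after 14 idle days, re-evaluate once a night). It is not EMC’s or Compellent’s actual algorithm.

```python
def nightly_tiering(days_idle, demote_after_days=14):
    """Toy policy: demote a LUN to SATA once it has been idle long enough."""
    return "SATA" if days_idle >= demote_after_days else "SSD"


def simulate_month(busy_days, days_in_month=30, demote_after_days=14):
    """Return (day, tier) pairs for the days where the LUN is busy.

    The tier recorded is where the LUN sits when the burst starts, i.e.
    the result of the previous night's policy run.
    """
    tier = "SSD"
    days_idle = 0
    hits = []
    for day in range(1, days_in_month + 1):
        if day in busy_days:
            hits.append((day, tier))
            days_idle = 0
        else:
            days_idle += 1
        tier = nightly_tiering(days_idle, demote_after_days)
    return hits


# Payroll runs on the last two days of the month only.
print(simulate_month(busy_days={29, 30}))
```

With payroll running on days 29 and 30, the first day of crunch time finds the LUN on SATA, and the policy only moves it back after the damage is done. Any scheme that decides based on yesterday’s activity has this lag built in.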

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But there’s no need to go nuts with several tiers of fast disk; a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).
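A toy illustration of why spanning lots of drives helps: with simple round-robin striping, a hot region of a LUN gets spread over however many drives sit underneath it. The stripe size, block numbers and function names below are assumptions for the sketch; real arrays layer RAID and their own layout logic on top of this.

```python
from collections import Counter


def drive_for_block(block_number, num_drives, stripe_blocks=16):
    """Round-robin striping: consecutive stripes land on consecutive drives."""
    stripe_index = block_number // stripe_blocks
    return stripe_index % num_drives


def load_per_drive(block_numbers, num_drives):
    """Count how many of the accessed blocks each drive has to serve."""
    return Counter(drive_for_block(b, num_drives) for b in block_numbers)


# A "hot" region: 4096 consecutive blocks somewhere in the middle of the LUN.
hot_region = range(100_000, 100_000 + 4096)

narrow = load_per_drive(hot_region, num_drives=4)
wide = load_per_drive(hot_region, num_drives=32)

print("busiest drive, 4 drives: ", max(narrow.values()), "blocks")
print("busiest drive, 32 drives:", max(wide.values()), "blocks")
```

The same hot region costs each drive an eighth of the work when it is striped over 32 drives instead of 4, which is the "avoid the hot spot by design" idea in a nutshell.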

And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.
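For the curious, the general idea of a small-block cache can be sketched in a few lines. This is a plain LRU cache keyed on 4K block numbers, which is my stand-in for illustration only; it is not a description of any vendor's actual caching algorithm, and the class and function names are invented.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # the 4K granularity mentioned above


class BlockCache:
    """Tiny LRU read cache keyed by 4K block number (illustrative only)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, byte_offset, backend_read):
        block_no = byte_offset // BLOCK_SIZE
        if block_no in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_no)  # mark as most recently used
            return self.blocks[block_no]
        self.misses += 1
        data = backend_read(block_no)  # go to the (slow) drives
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data


def slow_disk_read(block_no):
    """Stand-in for a read from the slow tier; returns a zero-filled block."""
    return bytes(BLOCK_SIZE)


cache = BlockCache(capacity_blocks=2)
for offset in (0, 100, 8192, 0, 4096, 8192):
    cache.read(offset, slow_disk_read)

print(cache.hits, cache.misses, list(cache.blocks))
```

The thing to notice is that the cache reacts on the very next read of a block, with no analysis window and no migration job, and it only ever holds the 4K pieces that were actually touched.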

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

Protecting your existing legacy storage investment with virtualization “do’s and don’ts”

It’s an undeniable fact that many customers, while they would love to use the highly advanced features of modern disk arrays, have already made a big investment in legacy storage. Sure, it doesn’t have all the great features, but it’s already there, frequently there’s a lot of it, and the maintenance isn’t expiring for another year or two so it’s not economically feasible to get rid of it.

Another issue most enterprises face is data migration – whether that’s to move from old to new on the same vendor, or from vendor to vendor. No matter how you cut it, you’ll have to do it someday.

A third issue is performance on the existing gear – maybe you have a ton of legacy storage but it’s just not performing the way you’d expect.

The final issue is managing disparate arrays. Nobody really wants to do that.

There are storage virtualization products that, conceptually, try to solve some of those issues in a similar way to how VMware, Hyper-V and Xen address similar issues with servers.

The idea is that you virtualize your existing storage behind gear that will give it some extra capabilities and centralized management, thereby extending its service life and maybe even eking some more performance out of it. Your existing hosts will typically address the storage via the virtualizing device, so obviously some assembly is required (rezoning etc).

The devices I’m aware of fall into 3 basic categories:

  1. Devices that encapsulate existing LUNs and don’t need other equipment or much reconfiguration besides dropping them in, zoning and presenting the LUNs to the hosts through them. Examples are: EMC VPLEX, FalconStor NSS, IBM SVC, HDS USP-V, HP SVSP.
  2. Devices that don’t need other equipment, offer some compelling extra features but cannot encapsulate LUNs and therefore need an initial migration besides the zoning. Example: NetApp V-Series.
  3. Devices that need extensive fabric upgrades besides reconfiguration. Example: EMC Invista (I’m not sure if it needs LUN migrations, I don’t think so but I’m sure someone from EMC will chime in).

There are other differences in the devices listed above, so I created a table and highlighted the areas where there’s either the odd man out or there’s some feature not available with the others. I’m aware that the table is nowhere near complete, but as it is I doubt it will fit onto a web page nicely. If there are inaccuracies, let me know and I’ll fix it. I admit I know little about HP’s SVSP. (re-posted with some SVC edits – but HP is canning the product anyway).

It’s a bit of an eyechart, I’ll see if I can make my theme a variable-width one.

 

| Product | Thin Provisioning | Thin Clones | Snapshots | Also an Array | In-Band | Deduplication | Replication | Needs Migration | NAS | Needs fabric Upgrade | FCoE | Perf Acceleration | Can do live FC migrations | Needs some space on array |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMC Invista | N | N | N | N | N | N | N (needs RecoverPoint) | ? (prob N) | N | Y | N | N | Y | N |
| EMC VPLEX | N | N | N | N | Y | N | Y | N | N | N | N | Y (RAM cache but not for writes) | Y | N |
| HP | Y (? perf impact) | ? | Y (? perf impact) | N | split-path | N | Y | ? | N | N | N | N | Y | N |
| FalconStor | Y (? perf impact) | Y (perf impact) | Y (perf impact) | N | Y | N | Y | Y | N | N | N | Y (SSD cache) | Y | N |
| HDS | Y (perf impact) | ? | Y (perf impact) | Y | Y | N | Y | N | N | N | N | Y (huge cache, RAM) | Y | N |
| IBM | Y (no perf impact) | Y (perf impact) | Y (perf impact) | Y (limited 4x SSD per node) | Y | N | Y | N | N | N | N | Y (192GB large cache with 8 nodes) | Y | N |
| NetApp | Y (no perf impact) | Y (no perf impact) | Y (no perf impact) | Y | Y | Y | Y | Y | Y | N | Y (also 10GbE) | Y (gigantic cache, multi-TB) | N (iSCSI, NFS, CIFS only at present) | Y |

 

The design decisions are interesting.

Of the above, IBM and FalconStor take the “pure appliance” approach, using Linux servers with custom code – that’s what those boxes were designed to do from the get-go. The idea is that you either have a bunch of old arrays or you buy a bunch of new, cheap and not very capable arrays, then front them with SVC or NSS, thereby making them decent.

Since IBM and FalconStor were always designed to perform this function, they are also, in my opinion, the best-suited for tasks like migrations. Indeed, I believe one can do a “hit and run” with said boxes, i.e. do the migration then remove the boxes from the fabric, making them popular with certain PS organizations.

On the other hand, HDS and NetApp instead offer the virtualization functionality as an additional feature to their arrays – as in, “you’ll probably buy our disk but we can enhance your legacy box, too”.

EMC took a completely different approach and uses out-of-band control servers and intelligent fabric switches to perform the virtualization trickery.

It’s important to note that NetApp lacks the live migration feature, but instead offers deduplication, application-aware snaps, great replication and NAS, and is arguably the most feature-rich platform (I’m trying to not be biased as I’m writing this). The biggest caveat (a deal-breaker for some) is that it can’t encapsulate your existing LUNs – instead, you need to chop up your RAID groups into LUNs, then present them to the NetApp system, which will then need to reformat said LUNs. This process also takes away some space for extra checksum calculations and other overheads. Arguably, you can make this up (and then some) in the end after using the features on tap (sorry). But you still need to figure time to migrate your stuff over gradually.

I believe EMC offers the fewest features and the most complex implementation – you can do stuff like mirror your LUNs from box to box and do migrations, but your arrays don’t really gain any new features. I have yet to meet a customer that owns this solution. I know there are a few big ones that went that way; it’s just not very common.

Of the devices mentioned above, the SVC is probably the most commonly used, then the USP-V (IBM and HDS always argue on that point since the capability to virtualize comes with HDS boxes whereas virtualization is the only thing the SVC does), then come FalconStor and NetApp, then HP with the relative newcomer SVSP, and last EMC (Invista hasn’t been a particularly successful product for EMC).

Storage Virtualization do’s and don’ts

I’d say that you should only really consider buying a virtualization product if you have well over 10TB of older gear (over 50TB, IMHO) that is not TOO old (i.e. not older than 3-4 years). Quite frequently, if your gear is really old, refreshing it with new just ends up being cheaper. Of course, there’s always eBay.

I’d also recommend not buying new low-end arrays and using virtualization to make them “better”. You are introducing more complexity into the environment, and it won’t necessarily be cheaper, either (something like the SVC has licenses that cost by the TB). Just buy a decent modern array that has all the features you need and be done with it.

Furthermore – don’t get into virtualization just to migrate from your older to your newer arrays. There are other ways.

You should use common sense (imagine that). As you’re not supposed to mix drive types within RAID groups even if you can, you typically don’t want to have an application straddling 5 different arrays, all vastly different in capability, just because you can.

It’s tempting to say “I’ll create a LUN that’s striped among every single disk on 5 different arrays”. Not to say that this should never be done (I’ve RAID-0’d across Symmetrix to get enough performance, long story), but only do it if you know what you’re doing and the exact layout that you’ll end up with. Nothing spells misery like RAID0 across many LUNs in an existing RAID group :)

Finally – figure out what features are the most important to you. If you want dedupe, NAS and tight app integration, NetApp is the ticket. If you prefer ease of migration, you may want to look at the other solutions.
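If you like to be methodical about it, the shortlist exercise is just a filter over the feature table earlier in this post. The sketch below transcribes four of the columns (Y as True, N as False, "?" as None); the key names and the helper function are mine, and you should treat the data as a snapshot of my table rather than anything authoritative.

```python
# A few columns from the comparison table; None means "?" (unknown).
FEATURES = {
    "EMC Invista":     {"dedupe": False, "nas": False, "live_fc_migration": True,  "needs_migration": None},
    "EMC VPLEX":       {"dedupe": False, "nas": False, "live_fc_migration": True,  "needs_migration": False},
    "HP SVSP":         {"dedupe": False, "nas": False, "live_fc_migration": True,  "needs_migration": None},
    "FalconStor NSS":  {"dedupe": False, "nas": False, "live_fc_migration": True,  "needs_migration": True},
    "HDS USP-V":       {"dedupe": False, "nas": False, "live_fc_migration": True,  "needs_migration": False},
    "IBM SVC":         {"dedupe": False, "nas": False, "live_fc_migration": True,  "needs_migration": False},
    "NetApp V-Series": {"dedupe": True,  "nas": True,  "live_fc_migration": False, "needs_migration": True},
}


def shortlist(must_have=(), must_not_have=()):
    """Return products with every must_have feature and none of the must_not_have ones.

    Unknown ("?") entries never satisfy a must_have and never trip a must_not_have.
    """
    matches = []
    for product, feats in FEATURES.items():
        has_all = all(feats[f] is True for f in must_have)
        has_none = all(feats[f] is not True for f in must_not_have)
        if has_all and has_none:
            matches.append(product)
    return matches


dedupe_and_nas = shortlist(must_have=("dedupe", "nas"))
easy_migration = shortlist(must_have=("live_fc_migration",), must_not_have=("needs_migration",))

print(dedupe_and_nas)
print(easy_migration)
```

Swap in whichever columns matter to you; the point is to decide on the must-haves first and let that drive the shortlist, not the other way around.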

The guarantees

In order to entice customers to try their stuff, HDS and NetApp have some space savings guarantees in place regarding virtualization. HDS has a flat 50% guarantee (predicated upon converting from RAID1 to RAID5 + thin provisioning) or 20% guarantee (just thin provisioning).
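A quick back-of-the-envelope check shows how a RAID1-to-RAID5-plus-thin-provisioning conversion can get past 50%. The RAID5 group width (7 data drives plus parity) and the 80% written fraction are my assumptions, not HDS's terms; plug in your own numbers.

```python
def raw_tb_raid1(usable_tb):
    """Mirroring: every usable TB costs two raw TB."""
    return usable_tb * 2


def raw_tb_raid5(usable_tb, data_drives):
    """RAID5: one parity drive's worth of overhead per group of data drives."""
    return usable_tb * (data_drives + 1) / data_drives


def savings_fraction(before_tb, after_tb):
    return 1 - after_tb / before_tb


provisioned_tb = 10.0  # hypothetical amount of provisioned capacity

before = raw_tb_raid1(provisioned_tb)

# RAID conversion only, assuming 7+1 RAID5 groups.
after_raid_only = raw_tb_raid5(provisioned_tb, data_drives=7)
raid_only_savings = savings_fraction(before, after_raid_only)

# Add thin provisioning, assuming only 80% of the provisioned space is actually written.
after_raid_and_thin = raw_tb_raid5(provisioned_tb * 0.8, data_drives=7)
combined_savings = savings_fraction(before, after_raid_and_thin)

print(f"RAID1 -> RAID5 only:        {raid_only_savings:.1%}")
print(f"RAID1 -> RAID5 + thin prov: {combined_savings:.1%}")
```

Under those assumptions the RAID change alone gets you roughly 43%, and thin provisioning pushes the total to about 54%, so the fine print about your starting RAID level and how full your LUNs really are matters a lot.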

NetApp has the ZIP program. It’s a bit different – there’s no hard number in the savings. Rather, the customer’s data is analyzed and the customer is presented with the savings % NetApp guarantees to achieve in their case. If the customer agrees and NetApp achieves the guaranteed savings, then the gear gets purchased. If the savings are not reached, then the customer gets to keep the gear free of charge (that’s right).

Such guarantee programs have been much ridiculed by the vendors that don’t offer them, but I think they show the respective companies believe in their products enough to wrap some kind of guarantee around them.

In conclusion

Properly deployed, storage virtualization can be effective in increasing the efficiencies of legacy storage footprints lacking in functionality. Just be careful and examine your motives for virtualization before making the move. Sometimes it’s a decidedly false economy.

D

So, are there any independent bloggers? Really?

There was some weird backlash against my site and my person recently – see here and here and in the comments here. Chuck Hollis got all uppity about whether I work at NetApp (with, for) or not.

I find it interesting that this only came up when I wrote something pro-NetApp. Wasn’t even anti-EMC.

It never came up when I was extolling the virtues of RecoverPoint (which I still think is awesome). I didn’t see anyone from NetApp or any EMC competitor start questioning where I worked, where the full disclosure was etc etc. Maybe they all just assumed I worked for EMC. Well – not directly, I was selling a ton of EMC gear, which was in turn paying my mortgage, which is as good as. But, ultimately, I just like the product since, properly deployed, it can solve some real problems.

So why is NetApp the company everyone loves to hate? Is it fear? Disrespect? Lack of understanding? All the above? But, I digress. NetApp customers love the product, and the company’s recent earnings announcement, as well as the fact we sold 1 Exabyte of enterprise storage last year, tells the real story. The People want their highly-functional, space-efficient, simple-to-use, application-aware storage, not 50 different products that are loosely integrated. Volkslagerung! Is that right, German-speaking readers? (Edit: Volksdatenspeicher seems better as “storage for the people”.)

So, I clarified things in the About page (upper left), I thought it was already clear but apparently not. Chuck is still not satisfied, so I think I’ll have to figure out a way to show some fancy animation of me in some NetApp uniform, hugging Hitz, Lau, Georgens and Mendoza and receiving my MVP award. Plus another animation showing the super-secret initiation ceremony and the extensive branding on my left buttock. Right.

What was most interesting in this ad hominem attack was that the important discussion topics were largely ignored, a very efficient tactic to lure the unsuspecting reader’s mind away from the real issues.

Which brings us to the subject of this post.

There seems to be this cute, romantic notion that there is such a thing as a truly independent blogger, and if I’m not independent, then what I say is tainted.

Well – let me break it to you and disabuse you of this notion: There ain’t no such thing as an independent blogger.

We are all biased, one way or another, about everything. Our past experiences shape our biases and the automatic stories our brains will create to explain any information we are presented with.

It doesn’t matter whether we work for a storage vendor or are customers – indeed, customers are typically among the most biased IT folks around! (Storage vendor employees are usually crusty, jaded, cynical, have been around the block and typically have the dirt on multiple technologies.)

I’ve been in customer meetings where I was told the customer doesn’t ever want to talk to EMC again because they treated him badly 10 years ago, or that he doesn’t want to talk to NetApp because he read in Barry’s blog that it only has 30% usable space, another that has FC queuing issues with HDS gear and wants to get rid of it at all costs, yet another that has had some controller panics with IBM gear and wants to get off of that and never touch IBM ever again, the list goes on. Those guys become zealots.

Then you have the other customer type, the one that receives Rolexes and other cool gifts in order to say whatever he’s told to say. Some actually will demand it (I’ve been in one of those meetings, too – “if you give me your watch we may have a deal”. I chose to assume he was kidding, lest I completely lose my faith in mankind).

You then have your “analyst” type that’s an independent industry “expert” – most of those guys haven’t touched the products they’re writing about, ever, and are just rehashing whatever they read in other publications or are told by their vendor drinking buddy. Yet they’re among the most trusted and read. They, too, have their personal favorite horses they’re backing…

Finally you have your VAR bloggers. People – those guys make money selling the stuff. Yes, they know the tech, but don’t exactly expect an impartial discussion – plus, they get all kinds of incentives from vendors.

So, who do you trust, when you can’t even trust yourself? Since, by definition, you are also biased, gentle reader…

I wish I could tell you. Ultimately, everyone has an agenda, whether conscious or subconscious. You just need to become shrewd enough to see through the agenda.

Maybe a good starting point is a truly intelligent, fact-based discussion bereft of ad hominem attacks?

D