7-Mode to Clustered ONTAP Transition

I normally deal with different aspects of storage (arguably far more exciting) but I thought I would write something to provide some common-sense perspective on the current state of 7-Mode to cDOT adoption.

I will tackle the following topics:

  1. cDOT vs 7-Mode capabilities
  2. Claims that not enough customers are moving to cDOT
  3. 7-Mode to cDOT transition is seen by some as difficult and expensive
  4. Some argue it might make sense to look at competitors and move to those instead
  5. What programs and tools are offered by NetApp to make transition easy and quick
  6. Migrating from competitors to cDOT

cDOT vs 7-Mode capabilities

I don’t want to make this into a dissertation or take a trip down memory lane. Suffice it to say that while cDOT has most of the 7-Mode features, it is internally very different but also much, much more powerful than 7-Mode – cDOT is a far more capable and scalable storage OS in almost every possible way (and the roadmap is utterly insane).

For instance, cDOT is able to nondisruptively do anything, including crazy stuff like moving SMB shares around cluster nodes (moving LUNs around is much easier than dealing with the far more quirky SMB protocol). Some reasons to move stuff around the cluster could be node balancing, node evacuation and replacement… all done on the fly in cDOT, regardless of protocol.

cDOT also handles Flash much better (cDOT 8.3.x+ can be several times faster than 7-Mode for Flash on the exact same hardware). Even things like block I/O (FC and iSCSI) are completely rewritten from the ground up in cDOT. Cloud integration. Automation. How failover is done. Or how CPU cores are used, how difficult edge conditions are handled… I could continue but then the ADD-afflicted would move on, if they haven’t already…

In a nutshell, cDOT is a more flexible, forward-looking architecture that respects the features that made 7-Mode so popular with customers, but goes much further. There is no competitor with the breadth of features available in cDOT, let alone the features coming soon.

cDOT is quite simply the next logical step for a 7-Mode customer.

Not enough customers moving to cDOT?

The reality is actually pretty straightforward.

  • Most new customers simply go with cDOT, naturally. 7-Mode still has a couple of features cDOT doesn’t, and if those features are really critical to a customer, that’s when someone might go with 7-Mode today. With each cDOT release the feature delta list gets smaller and smaller. Plus, as mentioned earlier, cDOT has a plethora of features and huge enhancements that will never make it to 7-Mode, with much more coming soon.
  • The things still missing in cDOT (like WORM) aren’t even offered by the majority of storage vendors… many large customers use our WORM technology.
  • Large existing customers, especially ones running critical applications, naturally take longer to cycle technologies. The average time to switch major technologies (irrespective of vendor) is around 5 years. The big wave of cDOT transitions hasn’t even hit yet!
  • Given that cDOT 8.2 with SnapVault was the appropriate release for many of our customers and 8.3 the release for most of our customers, a huge number of systems are still within that 5-year window prior to converting to cDOT, given when those releases came out.
  • Customers with mission-critical systems will typically not convert an existing system – they will wait for the next major refresh cycle. Paranoia rules in those environments (that’s a general statement regardless of vendor). And we have many such customers.

7-Mode to cDOT transition is seen by some as difficult and expensive

This is a fun one, and a favorite FUD item for competitors and so-called “analysts”. I sometimes think we confused people by calling cDOT “ONTAP”. I bet expectations would be different if we’d called it “SuperDuper ClusterFrame OS”.

You see, cDOT is radically different in its internals versus 7-Mode – however, it’s still officially also called “ONTAP”. As such, customers are conditioned to super-easy upgrades between ONTAP releases (just load the new code and you’re done). cDOT is different enough that we can’t just do that.

I lobbied for the “ClusterFrame” name but was turned down BTW. I still think it rocks.

The fact that you can run either 7-Mode or cDOT on the same physical hardware confuses people even further. It’s a good thing to be able to reuse hardware (software-defined and all that). Some vendors like to make each new rev of the same family line (and its code) utterly incompatible with the last one… we don’t do that.

And for the startup champions: Startups haven’t been around long enough to have seen a truly major hardware and/or software change! (another thing conveniently ignored by many). Nor do they have the sheer amount of features and ancillary software ONTAP does. And of course, some vendors forget to mention what even a normal tech refresh looks like for their fancy new “built from the ground up” box with the extremely exciting name.

We truly know how to do upgrades… probably better than any vendor out there. For instance: What most people don’t know is that WAFL (ONTAP’s underlying block layout abstraction layer) has been quietly upgraded many, many times over the years. On the fly. In major ways. With a backout option. Another vendor’s product (again the one with the extremely exciting name) needed to be wiped twice by as many “upgrades” in one year in order to have its block layout changed.

Here’s the rub:

Transition complexity really depends on how complex your current deployment is, your appetite for change and tolerance of risk. But transition urgency depends on how much you need the fully nondisruptive nature of cDOT and all the other features it has vs 7-Mode.

What I mean by that:

We have some customers that lose upwards of $4m per hour of downtime. The long-term benefits of a truly nondisruptive architecture make any arguments regarding migration efforts effectively moot.

If you are using a lot of the 7-Mode features and companion software (and it has more features than almost any other storage OS), specific tools written only for 7-Mode, older OS clients only supported on 7-Mode, tons of snaps and clones going back to several years’ retention etc…

Then, in order to retain a similarly elaborate deployment in cDOT, the migration effort will also naturally be a bit more complex. But still doable. And we can automate most of it. Including moving over all the snapshots and archives!

On the other hand, if you are using the system like an old-fashioned device and aren’t taking advantage of all the cool stuff, then moving to anything is relatively easy. And especially if you’re close to 100% virtualized, migration can be downright trivial (though simply moving VMs around storage systems ignores any snapshot history – the big wrinkle with VM migrations).

Some argue it might make sense to look at competitors and move to those instead

Looking at options is something that makes business sense in general. It would be very disingenuous of me to say it’s foolish to look at options.

But this holds firmly true: If you want to move to a competitor platform, and use a lot of the 7-Mode features, it would arguably be impossible to do cleanly and maintain full functionality (at a bare minimum you’d lose all your snaps and clones – and some customers have several years’ worth of backup data in SnapVault – try asking them to give that up).

This is true for all competitor platforms: Someone using specific features, scripts, tools, snaps, clones etc. on any platform, will find it almost impossible to cleanly migrate to a different platform. I don’t care who makes it. Doesn’t matter. Can you cleanly move from VMware + VMware snaps to Hyper-V and retain the snaps?

Backup/clone retention is really the major challenge here – for other vendors. Do some research and see how frequently customers switch backup platforms… :) We can move snaps etc. from 7-Mode to cDOT just fine :)

The fewer features you use, the easier the migration and acclimatization to new stuff becomes, but the less value you are getting out of any given product.

Call it vendor lock-in if you must, but it’s merely a side effect of using any given device to its full potential.

The reality: it is far easier to move from 7-Mode to cDOT than from 7-Mode to other vendor products. Here’s why…

What programs and tools are offered by NetApp to make transition easy and quick?

Initially, migration of a complex installation was harder. But we’ve been doing this a while now, and can do the following to make things much easier:

  1. The very cool 7MTT (7-Mode Transition Tool). This is an automation tool we keep rapidly enhancing that dramatically simplifies migrations of complex environments from 7-Mode to cDOT. Any time and effort analysis that ignores how this tool works is quite simply a flawed and incomplete analysis.
  2. After migrating to a new cDOT system, you can take your old 7-Mode gear and convert it to cDOT (another thing that’s impossible with a competitor – you can’t move from, say, a VNX to a VMAX and then convert the VNX to a VMAX).
  3. As of cDOT 8.3.0: We made SnapMirror replication work for all protocols from 7-Mode to cDOT! This is the fundamental way we can easily move over not just the baseline data but also all the snaps, clones etc. Extremely important, and something that moving to a competitor would simply be impossible to carry forward.
  4. As of cDOT 8.3.2 we are allowing something pretty amazing: CFT (Copy Free Transition). It does exactly what the name suggests: no data has to be moved over to cDOT! It’s a combination ONTAP and 7MTT feature that allows disconnecting the shelves from 7-Mode controllers and re-attaching them to cDOT controllers, thereby converting even a gigantic system in practically no time. See here for a quick guide, here for a great blog post.

And before I forget…

What about migrating from other vendors to cDOT?

It cuts both ways – anything less would be unfair. Not only is it far easier to move from 7-Mode to cDOT than to a competitor, it’s also easy to move from a competitor to cDOT! Since it’s all about growth, and that’s the only way real growth happens.

As of version 8.3.1 we have what’s called Online Foreign LUN Import (FLI). See link here. It’s all included – no special licenses needed.

With Online FLI we can migrate LUNs from other arrays and maintain maximum uptime during the migration (a cutover is needed at some point of course but that’s quick).

And all this we do without external “helper” gear or special software tools.

In the case of NAS migrations, we have the free and incredibly cool XCP software that can migrate things like high file count environments 25-100x faster than traditional methods. Check it out here.

In summary

I hardly expect to change the minds of people suffering from acute confirmation bias (I wish I could say “you know who you are”, but not knowing you are afflicted is the major problem), but hopefully the more level-headed among us should recognize by now that:

  • 7-Mode to cDOT migrations are extremely straightforward in all but the most complex and custom environments
  • Those same complex environments would find it impossible to transparently migrate to anything else anyway
  • Backups and clones are among the things that complicate migrations for any vendor – ONTAP happens to be used by a lot of customers to handle backups as part of its core value prop
  • NetApp provides extremely powerful tools to help with migrations from 7-Mode to cDOT and from competitors to cDOT (with amazing tools for both SAN and NAS!) that will also handle the backups/clones/archives!
  • The grass isn’t always greener on the other side – The transition from 7-Mode to cDOT is the first time NetApp has asked customers to do anything that major in over 20 years. Other, especially younger, vendors haven’t even seen a truly major code change yet. How will they react to such a thing? NetApp is handling it just fine :)



Architecture has long term scalability implications for All Flash Appliances

Recently, NetApp announced the availability of a 3.84TB SSD. It’s not extremely exciting – it’s just a larger storage medium. Sure, it’s really advanced 3D NAND, it’s fast and ultra-reliable, and will allow some nicely dense configurations at a reduced $/GB. Another day in Enterprise Storage Land.

But, ultimately, that’s how drives roll – they get bigger. And in the case of SSD, the roadmaps seem extremely aggressive regarding capacities.

Then I realized that several of our competitors don’t have this large SSD capacity available. Some don’t even have half that.

But why? Why ignore such a seemingly easy and hugely cost-effective way to increase density?

In this post I will attempt to explain why certain architectural decisions may lead to inflexible design constructs that can have long-term flexibility and scalability ramifications.

Design Center

Each product has its genesis somewhere. It is designed to address certain key requirements in specific markets and behave in a better/different way than competitors in some areas. Plug specific gaps. Possibly fill a niche or even become a new product category.

This is called the “Design Center” of the product.

Design centers can evolve over time. But, ultimately, every product’s Design Center is an exercise in compromise and is one of the least malleable parts of the solution.

There’s no such thing as a free lunch. Every design decision has tradeoffs. Often, those tradeoffs sacrifice long term viability for speed to market. There’s nothing wrong with tradeoffs as long as you know what those are, especially if the tradeoffs have a direct impact on your data management capabilities long term.

It’s all about the Deduplication/RAM relationship

Aside from compression, scale up and/or scale out, deduplication is a common way to achieve better scalability and efficiencies out of storage.

There are several ways to offer deduplication in storage arrays: Inline, post-process, fixed chunk, variable chunk, per volume scope, global scope – all are design decisions that can have significant ramifications down the line and various business impacts.

Some deduplication approaches require very large amounts of memory to store metadata (hashes representing unique chunk signatures). This may limit scalability or make a product more expensive, even with scale-out approaches (since many large, costly controllers would be required).

There is no perfect answer, since each kind of architecture is better at certain things than others. This is what is meant by “tradeoffs” in specific Design Centers. But let’s see how things look for some example approaches (this is not meant to be a comprehensive list of all permutations).

I am keeping it simple – I’m not showing how metadata might get shared and compared between nodes (in itself a potentially hugely impactful operation as some scale-out AFA vendors have found to their chagrin). In addition, I’m not exploring container vs global deduplication or different scale-out methods – this post would become unwieldy… If there’s interest drop me a line or comment and I will do a multi-part series covering the other aspects.

Fixed size chunk approach

In the picture below you can see the basic layout of a fixed size chunk deduplication architecture. Each data chunk is represented by a hash value in RAM. Incoming new chunks are compared to the RAM hash store in order to determine where and whether they may be stored:

[Figure: hashes, fixed chunk]

The benefit of this kind of approach is that it’s relatively straightforward from a coding standpoint, and it probably made a whole lot of sense a couple of years ago when small SSDs were all that was available and speed to market was a major design decision.
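To make the "relatively straightforward" part concrete, here is a minimal Python sketch of fixed size chunk deduplication with the hash store held in memory. It is a toy and not how any particular array is implemented: the class name `FixedChunkStore`, the 4KB chunk size and the use of SHA-256 are all my own assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, purely illustrative


class FixedChunkStore:
    """Toy inline deduplicating store with an in-memory hash index."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.index = {}   # hash digest -> position in self.chunks (the "RAM hash store")
        self.chunks = []  # unique chunks only, standing in for the SSDs

    def write(self, data):
        """Store data and return the chunk positions needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            position = self.index.get(digest)
            if position is None:
                # New chunk: store it once and remember its hash.
                position = len(self.chunks)
                self.chunks.append(chunk)
                self.index[digest] = position
            recipe.append(position)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its list of chunk positions."""
        return b"".join(self.chunks[position] for position in recipe)


store = FixedChunkStore()
payload = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
recipe = store.write(payload)
print(f"{len(recipe)} chunks written, {len(store.chunks)} actually stored")
```

Note that every unique chunk costs one entry in `index`, which is exactly where the memory pressure comes from.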

The tradeoff is that a truly exorbitant amount of memory is required in order to store all the hash metadata values in RAM. As SSD capacities increase, the linear relationship of SSD size vs RAM size results in controllers with multi-TB RAM implementations – which gets expensive.
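A quick back-of-the-envelope calculation shows the linear relationship. The numbers below are assumptions I picked for illustration (24 drives, 4KB chunks, 32 bytes of metadata per chunk, every chunk unique); real products will differ, but the shape of the curve will not.

```python
def index_ram_bytes(capacity_bytes, chunk_bytes, entry_bytes):
    """RAM needed to hold one index entry per stored chunk."""
    chunk_count = capacity_bytes // chunk_bytes
    return chunk_count * entry_bytes


SSD_BYTES = 3_840_000_000_000  # the 3.84TB SSD mentioned earlier
SSD_COUNT = 24                 # assumed number of drives
CHUNK_BYTES = 4096             # assumed fixed chunk size
ENTRY_BYTES = 32               # assumed metadata per chunk (hash plus location)

today = index_ram_bytes(SSD_COUNT * SSD_BYTES, CHUNK_BYTES, ENTRY_BYTES)
doubled = index_ram_bytes(SSD_COUNT * 2 * SSD_BYTES, CHUNK_BYTES, ENTRY_BYTES)

print(f"{today / 1e9:.0f} GB of RAM for {SSD_COUNT} x 3.84TB SSDs")
print(f"{doubled / 1e9:.0f} GB of RAM if each SSD doubles in size")
```

Under these assumptions the index alone needs 720 GB of RAM, and doubling the drive size doubles that figure. Nothing in the arithmetic lets you escape the ratio of entry size to chunk size.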

It follows that systems using this type of approach will find it increasingly difficult (if not impossible) to use significantly larger SSDs without either a major architectural change or the cost of multiple TB of RAM dropping dramatically. You should really ask the vendor what their roadmap is for things like 10+TB SSDs… and whether you can expand by adding the larger SSDs into a current system without having to throw everything you’ve already purchased away.

Variable size chunk approach

This one is almost identical to the previous example, but instead of a small, fixed block, the architecture allows for variable size blocks to be represented by the same hash size:

[Figure: hashes, variable chunk]

This one is more complex to code, but the massive benefit is that metadata space is hugely optimized since much larger data chunks are represented by the same hash size as smaller data chunks. The system does this chunk division automatically. Fewer hashes are needed with this approach, leading to better utilization of memory.

Such an architecture needs far less memory than the previous example. However, it is still plagued by the same fundamental scaling problem – only at a far smaller scale. On the plus side, it allows a less expensive system to be manufactured than in the previous example since less RAM is needed for the same amount of storage. By combining multiple inexpensive systems via scale-out, significant capacity at scale can be achieved at a lesser cost than with the previous example.
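One common way to get variable size chunks is content-defined chunking, where a rolling value computed over the data decides where each chunk ends. I am not claiming any specific product works this way; the sketch below is a toy, and the function name `variable_chunks`, the size bounds and the shift-and-add rolling value are assumptions chosen for brevity rather than quality.

```python
import hashlib
import random

MIN_CHUNK = 2048    # assumed bounds, illustrative only
AVG_CHUNK = 8192    # must be a power of two for the mask test below
MAX_CHUNK = 65536


def variable_chunks(data):
    """Split data at content-defined boundaries using a toy rolling value."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Shift-and-add: older bytes are gradually shifted out of the 32-bit window.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = (rolling & (AVG_CHUNK - 1)) == 0 and length >= MIN_CHUNK
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # whatever is left becomes the final chunk
    return chunks


rng = random.Random(7)
data = bytes(rng.getrandbits(8) for _ in range(200_000))
chunks = variable_chunks(data)
digests = [hashlib.sha256(chunk).digest() for chunk in chunks]

sizes = sorted(len(chunk) for chunk in chunks)
print(f"{len(chunks)} chunks, smallest {sizes[0]} bytes, largest {sizes[-1]} bytes")
print(f"every digest is {len(digests[0])} bytes regardless of chunk size")
```

The point to take away is the last line: a 2KB chunk and a 64KB chunk cost the same 32 bytes of hash, so the bigger the chunks, the fewer index entries per TB.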

Fixed chunk, metadata both in RAM and on-disk

An approach to lower the dependency on RAM is to have some metadata in RAM and some on SSD:

[Figure: hashes, fixed chunk, metadata on disk]

This type of architecture finds it harder to do full speed inline deduplication since not all metadata is in RAM. However, it also offers a more economical way to approach hash storage. SSD size is not a concern with this type of approach. In addition, being able to keep dedupe metadata on cold storage aids in data portability and media independence, including storing data in the cloud.
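A rough sketch of the idea, again as a toy under my own assumptions: a small least-recently-used cache of hashes in RAM sits in front of a complete index that lives on SSD (a plain dictionary stands in for the SSD here). The class name `TieredHashIndex` and the counter `ssd_lookups` are hypothetical.

```python
from collections import OrderedDict


class TieredHashIndex:
    """Toy hash index: a small LRU cache in RAM in front of a full index on SSD."""

    def __init__(self, ram_entries):
        self.ram_entries = ram_entries
        self.ram = OrderedDict()  # hot subset of the index, in LRU order
        self.ssd = {}             # stands in for the complete on-SSD index
        self.ssd_lookups = 0      # counts trips to the slower tier

    def insert(self, digest, location):
        self.ssd[digest] = location
        self._cache(digest, location)

    def lookup(self, digest):
        if digest in self.ram:
            self.ram.move_to_end(digest)  # mark as recently used
            return self.ram[digest]
        self.ssd_lookups += 1             # RAM miss: pay for the slower lookup
        location = self.ssd.get(digest)
        if location is not None:
            self._cache(digest, location)
        return location

    def _cache(self, digest, location):
        self.ram[digest] = location
        self.ram.move_to_end(digest)
        if len(self.ram) > self.ram_entries:
            self.ram.popitem(last=False)  # evict the least recently used entry


index = TieredHashIndex(ram_entries=2)
for name in ("a", "b", "c"):
    index.insert(name, f"location-of-{name}")

first = index.lookup("a")   # evicted from RAM earlier, so this goes to SSD
second = index.lookup("a")  # now cached, served from RAM
print(first, second, index.ssd_lookups)
```

Every RAM miss is extra latency on the write path, which is why full speed inline deduplication is harder with this design, and why the hit rate of the RAM tier matters so much.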

Variable chunk, multi-tier metadata store

Now that you’ve seen examples of various approaches, it starts making logical sense what kind of architectural compromises are necessary to achieve both high deduplication performance and capacity scale.

For instance, how about variable blocks and the ability to store metadata on multiple tiers of storage? Upcoming, ultra-fast Storage Class Memory technologies are a good intermediate step between RAM and SSD. Lots of metadata can be placed there yet retain high speeds:

[Figure: hashes, variable chunk, metadata on 3 tiers]

Coding for this approach is of course complex since SCM and SSD have to be treated as a sort of Level 2/Level 3 cache combination but with cache access time spans in the days or weeks, and parts of the cache never going “cold”. It’s algorithmically more involved, plus relies on technologies not yet widely available… but it does solve multiple problems at once. One could of course use just SCM for the entire metadata store and simplify implementation, but that would somewhat reduce the performance afforded by the approach shown (RAM is still faster). But if the SCM is fast enough… :)

However, being able to embed dedupe metadata in cold storage can still help with data mobility and being able to retain deduplication even across different types of storage and even cloud. This type of flexibility is valuable.

Why should you care?

Aside from the academic interest and nerd appeal, different architecture approaches have a business impact:

  • Will the storage system scale large enough for significant future growth?
  • Will I be able to use significantly new media technologies and sizes without major disruption?
  • Can I use extremely large media sizes?
  • Can I mix media sizes?
  • Can I mix controller types in a scale-out cluster?
  • Can I use cost-optimized hardware?
  • Does deduplication at scale impact performance negatively, especially with heavy writes?
  • If inline efficiencies aren’t comprehensive, how does that affect overall capacity sizing?
  • Does the deduplication method enforce a single large failure domain? (single pool – meaning that any corruption would result in the entire system being unusable)
  • What is the interoperability with Cloud and Disk technologies?
  • Can data mobility from All Flash to Disk to Cloud retain deduplication savings?
  • What other tradeoffs is this shiny new technology going to impose now and in the future? Ask to see a 5-year vision roadmap!

Always look beyond the shiny feature and think of the business benefits/risks. Some of the above may be OK for you. Some others – not so much.

There’s no free lunch.



Proper Testing vs Real World Testing

The idea for this article came from seeing various people attempt product testing. Though I thought about storage when writing this, the ideas apply to most industries.

Three different kinds of testing

There are really three different kinds of testing.

The first is incomplete, improper testing that is useless in almost any situation. Typically done by people that have little training on the subject. Almost always ends in misleading results and is arguably dangerous, especially if used to make purchasing decisions.

The second is what’s affectionately and romantically called “Real World Testing”. Typically done by people that will try to simulate some kind of workload they believe they encounter in their environment, or use part of their environment to do the testing. Much more accurate than the first kind, if done right. Usually the workload is decided arbitrarily :)

The third and last kind is what I term “Proper Testing”. This is done by professionals (that usually do this type of testing for a living) that understand how complex testing for a broad range of conditions needs to be done. It’s really hard to do, but pays amazing dividends if done thoroughly.

Let’s go over the three kinds in more detail, with some examples.

Useless Testing

Hopefully after reading this you will know if you’re a perpetrator of Useless Testing and never do it again.

A lot of what I deal with is performance, so I will draw examples from there.

Have you or someone you know done one or more of the following after asking to evaluate a flashy, super high performance enterprise storage device?

  • Use tiny amounts of test data, ensuring it will all be cached
  • Try to test said device with a single server
  • With only a couple of I/O ports
  • With a single LUN
  • With a single I/O thread
  • Doing only a certain kind of operation (say, all random reads)
  • Using only a single I/O size (say, 4K or 32K) for all operations
  • Not looking at latency
  • Using extremely compressible data on an array that does compression

I could go on but you get the idea. A good move would be to look at the performance primer here before doing anything else…
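The compressible data item in the list above is easy to check before you even touch the array. The short Python sketch below compares a buffer of zeros (what many quick tests end up writing) with random bytes. It uses zlib purely as a stand-in; an array's own compression will behave differently, and the helper name `compression_ratio` is my own.

```python
import os
import zlib


def compression_ratio(data):
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


SAMPLE_BYTES = 1024 * 1024

zero_fill = bytes(SAMPLE_BYTES)         # all zeros, extremely compressible
random_fill = os.urandom(SAMPLE_BYTES)  # effectively incompressible

print(f"zeros:  {compression_ratio(zero_fill):.1f}:1")
print(f"random: {compression_ratio(random_fill):.2f}:1")
```

If your test data looks like the first case, the array is barely writing anything to media and the numbers you get tell you very little. Real data usually sits somewhere between the two extremes, so test with something representative.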

Pitfalls of Useless Testing

The main reason people do such poor testing is usually that it’s easy to do. Another reason is that it’s easy to use it to satisfy a Confirmation Bias.

The biggest problem with such testing is that it doesn’t tell you how the system behaves if exposed to different conditions.

Yes, you see how it behaves in a very specific situation, and that situation might even be close to a very limited subset of what you need to do in “Real Life”, but you learn almost nothing about how the system behaves in other kinds of scenarios.

In addition, you might eliminate devices that would normally behave far better for your applications, especially under duress, since they might fail Useless Testing scenarios (since they weren’t designed for such unrealistic situations).

Making purchasing decisions after doing Useless Testing will invariably result in making bad purchasing decisions unless you’re very lucky (which usually means that your true requirements weren’t really that hard to begin with).

Part of the problem of course is not even knowing what to test for.

“Real World” Testing

This is what most sane people strive for.

The goal is to test something in conditions that most closely approximate how it will be used in real life.

Some examples of how people might try to perform “Real World” testing:

  • One could use real application I/O traces and advanced benchmarking software that can replay them, or…
  • Spin synthetic benchmarks designed to simulate real applications, or…
  • Put one of their production applications on the system and use it like they normally would.

Clearly, any of this would result in more accurate results than Useless Testing. However…

Pitfalls of “Real World” Testing

The problem with “Real World” testing is that it invariably does not reproduce the “Real World” closely enough to be comprehensive and truly useful.

Such testing addresses only a small subset of the “Real World”. The omissions dictate how dangerous the results are.

In addition, extrapolating larger-scale performance isn’t really possible. How the system ran one application doesn’t mean you know how it will run ten applications in parallel.

Some examples:

  • Are you testing one workload at a time, even if that workload consists of multiple LUNs? Shared systems will usually have multiple totally different workloads hitting them in parallel, often wildly different and conflicting, coming and going at different times, each workload being a set of related LUNs. For instance, an email workload plus a DSS plus an OLTP plus a file serving plus a backup workload in parallel… :)
  • Are you testing enough workloads in parallel? True Enterprise systems thrive on crunching many concurrent workloads.
  • Are you injecting other operations that you might be doing in the “Real World”? For instance, replicate and delete large amounts of data while doing I/O on various applications? Replications and deletions have been known to happen… 😉
  • Are you testing how the system behaves in degraded mode while doing all the above? Say, if it’s missing a controller or two, or a drive or two… also known to happen.

That’s right, this stuff isn’t easy to do properly. It also takes a very long time. Which is why serious Enterprise vendors have huge QA organizations going through these kinds of scenarios. And why we get irritated when we see people drawing conclusions based on incomplete testing 😉

Proper Testing

See, the problem with the “Real World” concept is that there are multiple “Real Worlds”.

Take car manufacturers for instance.

Any car manufacturer worth their salt will torture-test their cars in various wildly different conditions, often rapidly switching from one to the other if possible to make things even worse:

  • Arctic conditions
  • Desert, dusty conditions plus harsh UV radiation
  • Extremely humid conditions
  • All the above in different altitudes

You see, maybe your “Real World” is Alaska, but another customer’s “Real World” will be the very wet Cherrapunji, and a third customer’s “Real World” might be the Sahara desert. A fourth might be a combination (cold, very high altitude, harsh UV radiation, or hot and extremely humid).

These are all valid “Real World” conditions, and the car needs to be able to deal with all of them while remaining fully functional until beyond the warranty period.

Imagine if moving from Alaska to the Caribbean meant your car would cease functioning. People have been known to move, too… :)

Storage manufacturers that actually know what they’re doing have an even harder job:

We need to test multiple “Real Worlds” in parallel. Oh, and we do test the different climate conditions as well, don’t think we assume everyone operates their hardware in ideal conditions… especially folks in the military or any other life-or-death situation.

Closing thoughts

If I’ve succeeded in stopping even one person from doing Useless Testing, this article was a success :)

It’s important to understand that Proper Testing is extremely hard to do. Even gigantic QA organizations miss bugs, so imagine how much slips past someone doing incomplete testing. There are just too many permutations.

Another sobering thought is that very few vendors are actually able to do Proper Testing of systems. The know-how, sheer time and numbers of personnel needed are enough to make most smaller vendors skimp on the testing out of pure necessity. They test what they can, otherwise they’d never ship anything.

A large number of data points helps. Selling millions of units of something provides a vendor with way more failure data than selling a thousand units of something possibly can. Action can be taken to characterize the failure data and see what approach is best to avoid the problem, and how common it really is.

One minor failure in a million may not be even worth fixing, whereas if your sample size is ten, one failure means 10%. It might be the exact same failure, but you don’t know that. Or you might have zero failures in the sample size of ten, which might make you think your solution is better than it really is…
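The "zero failures in ten" trap is simple to quantify. If each unit fails independently with some true rate p, the chance of seeing no failures at all in n units is (1 - p)^n. The 5% rate below is a number I made up for illustration.

```python
def chance_of_zero_failures(failure_rate, sample_size):
    """Probability that none of sample_size independent units fail."""
    return (1.0 - failure_rate) ** sample_size


ASSUMED_RATE = 0.05  # hypothetical true failure rate, for illustration only

for units in (10, 100, 1000):
    chance = chance_of_zero_failures(ASSUMED_RATE, units)
    print(f"{units:>5} units: {chance:.1%} chance of seeing no failures at all")
```

With ten units you have roughly a 60% chance of seeing a perfectly clean result from a product that actually fails one time in twenty. With a thousand units that illusion is essentially impossible.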

But I digress into other, admittedly interesting areas.

All I ask is… think really hard before testing anything in the future. Think what you really are trying to prove. And don’t be afraid to ask for help.



Are some flash storage vendors optimizing too heavily for short-lived NAND flash?

I really resisted using the “flash in the pan” phrase in the title… first, because the term is overused and second, because I don’t believe solid state is of limited value. On the contrary.

However, I am noticing an interesting trend among some newcomers in the array business, desperate to find a flash niche to compete in:

Writing their storage OS around very specific NAND flash technologies. Almost as bad as writing an entire storage OS to support a single hypervisor technology, but that’s a story for another day.

Solid state technology is still too fluid. Unlike spinning disk technology that is overall very reliable and mature and likely won’t see huge advances in the years to come, solid state technology seems to advance almost weekly. New SSD controllers are coming out almost too frequently, and new kinds of solid state storage are either out now (Triple Level Cell, anyone?) or coming in the future (MRAM, ReRAM, FeRAM, PCM, PMC, and probably a lot more that I’m forgetting).

My point is:

How far ahead are certain vendors thinking if they are writing an entire storage OS around the limitations of a class of storage that may look very different in just a year or two?

Some of them go really deep and try to do all kinds of clever optimizations to ensure good wear leveling for the flash chips. Some write their own controller software and use bare NAND flash chips, not even off-the-shelf SSDs. Which is great, but what if you don’t need to do that in two years? Or what if the optimizations need to be drastically different for the new technologies? How long will coding for the new flash technologies take? Or will they be stuck using old technologies? Food for thought.

I guess some of us are in it for the long haul, and some aren’t. “Can’t see the forest for the trees” comes to mind. “Gold rush” also seems relevant.

I strongly believe general-purpose storage OSes need to be flexible enough to be reasonably adaptable to different underlying media. And storage OSes that are specifically designed for solid state storage need to be especially flexible regarding the underlying SSD technology to avoid the problems outlined above, and to avoid the relative lack of reliability of current SSD solutions (another story for another day).

At the moment I don’t see clear winners yet. I see a few great short-term stories, but who has the most flexible architecture to be able to deal with different kinds of technologies for years to come?



NetApp delivers 2TB/s performance to giant supercomputer for big data

(Edited: My bad, it was 2TB/s, up from 1.3TB/s. The solution has been getting bigger and upgraded. Also, this post talks about the E5400; the newer E5600 is much faster.)

What do you do when you need so much I/O performance that no one single storage system can deliver it, no matter how large?

To be specific: What if you needed to transfer data at over 1TB per second (or 2TB/s, as it eventually turned out to be)?

That was the problem faced by the U.S. Department of Energy (DoE) and their Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL), one of the fastest supercomputing systems on the planet.

You can read the official press release here. I wanted to get more into the technical details.

People talk a lot about “big data” these days, but no clear definition seems to exist. In my opinion it’s something that has some of the following properties:

  • Too much data to be processed by a “normal” computer or cluster
  • Too much data to work with using a relational DB
  • Too much data to fit in a single storage system for performance and/or capacity reasons – or maybe just simply:
  • Too much data to process using traditional methods within an acceptable time frame

Clearly, this is a bit loose – how much is “too much”? How long is “too long”? For someone only armed with a subnotebook computer, “too much” does not have the same meaning as for someone rocking a 12-core server with 256GB RAM and a few TB of SSD.

So this definition is relative… but in some cases, such as the one we are discussing, absolute – given the limitations of today’s technology.

For instance, the amount of storage LLNL required was several tens of PB in a single storage pool that could provide unprecedented I/O performance to the tune of 2TB/s. Both size and performance needed to be scalable. It also needed to be reliable and fit within a reasonable budget and not require extreme space, power and cooling. A tall order indeed.

This created some serious logistics problems regarding storage:

  • No single disk array can hold that amount of data
  • No single disk array can perform anywhere close to 2TB/s

Let’s put this in perspective: The storage systems that scale the biggest are typically scale-out clusters from the usual suspects of the storage world (we make one, for example). Even so, they max out at fewer PB than the deployment required.

The even bigger problem is that a single large scale-out system can’t really deliver more than a few tens of GB/s under optimal conditions – more than fast enough for most “normal” uses but utterly unacceptable for this case.

The only realistic solution to satisfy the requirements was massive parallelization, specifically using the NetApp E-Series for the back-end storage and the Lustre cluster filesystem.


A bit about the solution…

Almost a year ago NetApp purchased the Engenio storage line from LSI. That storage line is resold by several companies like IBM, Oracle, Quantum, Dell, SGI, Teradata and more. IBM also resells the ONTAP-based FAS systems and calls them “N-Series”.

That purchase has made NetApp the largest provider of OEM arrays on the planet by far. It was a good deal – very rapid ROI.

There was a lot of speculation as to why NetApp would bother with the purchase. After all, the ONTAP-based systems have a ton more functionality than pretty much any other array and are optimized for typical mostly-random workloads – DBs, VMs, email, plus megacaching, snaps, cloning, dedupe, compression, etc – all with RAID6-equivalent protection as standard.

The E-Series boxes on the other hand don’t do thin provisioning, dedupe, compression, megacaching… and their snaps are the less efficient copy-on-first-write instead of redirect-on-write. So, almost the anti-ONTAP :)

The first reason for the acquisition was that, on purely financial terms, it was a no-brainer deal even if one sells shoes for a living, let alone storage. Even if there were no other reasons, this one would be enough.

Another reason (and the one germane to this article) was that the E-Series has a tremendous sustained sequential performance density. For instance, the E5400 system can sustain about 4GB/s in 4U (real GB/s, not out of cache), all-in. That’s 4U total for 60 disks including the controllers. Expandable, of course. It’s no slouch for random I/O either, plus you can load it with SSDs, too… :) (Update: the newer E5600 can go up to 12GB/s in 2U with SSDs!)

Again, note – 60 drives per 4U shelf and that includes the RAID controllers, batteries etc. In addition, all drives are front-loading and stay active while servicing the shelf – as opposed to most (if not all) dense shelves in the market that need the entire (very heavy) shelf pulled out and/or several drives offlined in order to replace a single failed drive… (there’s some really cool engineering in the shelf to do this without thermal problems, performance loss or vibrations). All this allows standard racks and no fear of the racks tipping over while servicing the shelves :) (you know who you are!)

There are some vendors that purely specialize in sequential I/O and tipping racks – yet they have about 3-4x less performance density than the E5400, even though they sometimes have higher per-controller throughput. In a typical marketing exercise, some of our more usual competitors have boasted 2GB/s/RU for their controllers, meaning that the controllers (which take up 4U in that example) can do 8GB/s. But that requires all kinds of extra rack space to achieve (extra UPSes, several shelves, etc.), making their actual resulting throughput well under 1GB/s/RU. Not to mention the cost (those systems are typically more expensive than an E5400), which is important with projects of the scale we are talking about.
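To make the density argument concrete, here is a rough Python sketch of the arithmetic. The E5400 and controller-only numbers come from the paragraphs above; the amount of extra rack space for the competitor is a made-up value (HYPOTHETICAL_EXTRA_RU), since I am not quoting any specific vendor configuration.

```python
def density_gb_per_s_per_ru(throughput_gb_s: float, rack_units: float) -> float:
    """Sustained throughput divided by ALL the rack units needed to deliver it."""
    if rack_units <= 0:
        raise ValueError("rack_units must be positive")
    return throughput_gb_s / rack_units


# E5400 figures from the text: about 4GB/s in 4U, controllers and 60 disks included.
e5400 = density_gb_per_s_per_ru(4.0, 4)

# Controller-only marketing figure from the text: 8GB/s from 4U of controllers.
controllers_only = density_gb_per_s_per_ru(8.0, 4)

# ASSUMPTION: 16U of extra shelves and UPSes. This number is hypothetical and
# only illustrates how the per-RU figure drops once everything is counted.
HYPOTHETICAL_EXTRA_RU = 16
whole_system = density_gb_per_s_per_ru(8.0, 4 + HYPOTHETICAL_EXTRA_RU)

print(f"E5400, all-in:          {e5400:.2f} GB/s/RU")
print(f"Controllers only:       {controllers_only:.2f} GB/s/RU")
print(f"Same system, all-in:    {whole_system:.2f} GB/s/RU")
```

The point is simply that the denominator has to include every rack unit the system needs, not just the controllers.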

Most importantly, what we accomplished at the LLNL was no marketing exercise…


The benefits of truly high performance density

Clearly, if your requirements are big enough, you end up spending a lot less money and needing a lot less rack space, power and cooling by going with a highly performance-dense solution.

However, given the requirements of the LLNL, it’s clear that you can’t use just a single E5400 to satisfy the performance and capacity requirements of this use case. What you can do though is use a bunch of them in parallel… and use that massive performance density to achieve about 40GB/s per industry-standard rack with 600 high-capacity disks (1.8PB raw per rack).

For even higher performance per rack, the E5400 can use the faster SAS or SSD drives – 480 drives per rack (up to 432TB raw), providing 80GB/s reads/60GB/s writes.
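As a back-of-the-envelope check on the per-rack numbers above, the sketch below works out how many racks an idealized deployment would need to hit a throughput target. It assumes 10 shelves per rack and 3TB disks (both implied by 600 disks and 1.8PB raw per rack, not stated outright), decimal units, and perfectly linear scaling with no filesystem or network overhead. It is not the actual LLNL configuration.

```python
import math

# Figures from the text.
SHELF_THROUGHPUT_GB_S = 4.0   # sustained, per 4U E5400 shelf
DISKS_PER_SHELF = 60

# ASSUMPTIONS implied by "600 disks" and "1.8PB raw" per rack.
SHELVES_PER_RACK = 10
DISK_CAPACITY_TB = 3.0


def rack_profile() -> dict:
    """Per-rack throughput (GB/s), disk count and raw capacity (PB)."""
    disks = SHELVES_PER_RACK * DISKS_PER_SHELF
    return {
        "throughput_gb_s": SHELVES_PER_RACK * SHELF_THROUGHPUT_GB_S,
        "disks": disks,
        "raw_pb": disks * DISK_CAPACITY_TB / 1000,
    }


def racks_needed(target_gb_s: float) -> int:
    """Racks required for a target, assuming ideal linear scaling."""
    return math.ceil(target_gb_s / rack_profile()["throughput_gb_s"])


rack = rack_profile()
racks = racks_needed(2000.0)  # 2TB/s expressed in GB/s
print(f"Per rack: {rack['throughput_gb_s']:.0f} GB/s, "
      f"{rack['disks']} disks, {rack['raw_pb']:.1f} PB raw")
print(f"Racks for 2TB/s: {racks} ({racks * rack['raw_pb']:.0f} PB raw)")
```

Under those assumptions the capacity lands in the "several tens of PB" range mentioned earlier, which is a decent sanity check that the building-block math hangs together.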


Enter the cluster filesystem

So, now that we picked the performance-dense, reliable, cost-effective building block, how do we tie those building blocks together?

The answer: By using a cluster filesystem.

Loosely defined, a cluster filesystem is simply a filesystem that can be accessed simultaneously by the servers mounting it. In addition, it also typically means it can span storage systems and make them look as one big entity.

It’s not a new concept – and there are several examples, old and new: AFS, Coda, GPFS, and the more prevalent StorNext and Lustre are some.

The LLNL picked Lustre for this project. Lustre is a distributed filesystem that breaks apart I/O into multiple Object Storage Servers, each connected to storage (Object Storage Targets). Metadata is served by dedicated servers that are not part of the I/O stream and thus not a bottleneck. See below for a picture (courtesy of the Lustre manual) of how it is all connected:


[Figure: Lustre Scaled Cluster, from the Lustre manual]


High-speed connections are used liberally for lower latency and higher throughput.

A large file can reside on many storage servers, and as a result I/O can be spread out and parallelized.
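Here is a minimal Python sketch of the idea, assuming simple round-robin (RAID-0 style) striping of a file across its storage targets. The function names and the stripe size/count values are illustrative only, and the target index is relative to the file's own stripe layout rather than a real OST identifier.

```python
def target_for_offset(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Index (within the file's layout) of the target holding this byte offset."""
    return (offset // stripe_size) % stripe_count


def split_io(offset: int, length: int, stripe_size: int, stripe_count: int) -> dict:
    """Bytes that each target serves for one contiguous read or write."""
    per_target = {}
    pos, end = offset, offset + length
    while pos < end:
        # The current stripe ends at the next multiple of stripe_size.
        stripe_end = (pos // stripe_size + 1) * stripe_size
        chunk = min(end, stripe_end) - pos
        idx = target_for_offset(pos, stripe_size, stripe_count)
        per_target[idx] = per_target.get(idx, 0) + chunk
        pos += chunk
    return per_target


MIB = 1024 * 1024
# An 8MiB sequential read over 4 targets with 1MiB stripes (illustrative values).
layout = split_io(0, 8 * MIB, stripe_size=MIB, stripe_count=4)
for idx in sorted(layout):
    print(f"target {idx}: {layout[idx] // MIB} MiB")
```

Each target ends up serving an equal share of the request, so the targets can work in parallel. That is the property the whole design leans on.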

Lustre clients see a single large namespace and run a Lustre-specific client protocol to access the cluster.

It sounds good in theory – and it delivered in practice: 1.3TB/s sustained performance was demonstrated to the NetApp block devices. Work is ongoing to finalize the testing with the complete Lustre environment. Not sure what the upper limit would be. But clearly it’s a highly scalable solution.


Putting it all together

NetApp has fully realized solutions for the “big data” applications out there, complete with the products and services needed for each engagement. The Lustre solution employed by the LLNL is just one of the options available. Others include Hadoop, full-motion uncompressed HD video, and more.

So – how fast do you need to go?



