How Nimble Customers Benefit from Big Data Predictive Analytics

In past articles I have covered how Nimble Storage lowers customer risk with technologies internal to the array itself.

Now, it’s time to write a more in-depth article about the immense customer benefits of InfoSight, Nimble’s Predictive Analytics platform.

InfoSight

I’ve been at Nimble Storage for about a year now, and this technology remains the fundamental reason I chose Nimble over the many other opportunities I had.

My opinion hasn’t changed – if anything, I increasingly see that this is the cornerstone of how to fundamentally advance the way infrastructure works. The lack of such a technology has become a major competitive disadvantage for other vendors, since it increases risk for their customers.

I will show how InfoSight is far beyond what any other storage vendor offers in the analytics area, in both scope and capability. Because it’s not just about the storage…

What if you had a technology that could, among many other things, reduce problem resolution time from several days to minutes? How could that potentially help your company if something really hard to diagnose (and well outside storage) affected a critical part of the business? What would the business value of such an ability be?

Why InfoSight was Built

The fundamental reason behind building this technology is to reduce business risk by improving the reliability of the entire infrastructure, not just the reliability of storage.

A secondary goal of the Predictive Analytics technology in InfoSight is to provide statistics, trending analysis, recommendations and overall visibility into the infrastructure.

Most storage vendors are, predictably, very focused on storage. Their reporting, troubleshooting and analytics are focused on the storage piece. However, statistically, when dealing with quality enterprise storage gear, storage is quite simply not the leading cause of overall infrastructure issues affecting the delivery of data to applications.

The App-Data Gap

Any delay in the delivery of data from the application to the end user is called the App-Data Gap.

App data gap

Clearly, App-Data Gap issues could stem from any number of sources.

Business owners care about applications serving data, and arguably many couldn’t care less about the infrastructure details as long as it’s meeting business SLAs.

But what happens when things inevitably go wrong? Because it is a case of when, not if. Especially when something goes wrong in a strange way so that even redundant systems don’t quite help? Or if it’s an intermittent issue that can’t be tracked down? Or an endemic issue that has been plaguing the environment for a long time?

The State of IT Troubleshooting

With the complexity of today’s data centers (whether on-premises or cloud-based), there is so much going on that troubleshooting is hard, even for basic “point me in the right direction” root cause analysis. And if something really arcane is going on, timely troubleshooting may prove impossible.

There is no shortage of data – if anything, there is often too much of it, but usually it is not of high enough quality, and it is presented in ways that are neither helpful nor correlated.

IT Automation

The true value starts once quality data is correlated and troubleshooting becomes proactive and, ultimately, automatic.

Cars are always a good example…

Imagine cars with basic sensors and instrumentation – if something goes wrong, do you get enough information? Do you know the optimal course of action? Quickly enough?

Let’s say the fuel lamp lights up. What does that mean? That I have X amount of fuel left because I think it said so in the car’s manual that I lost years ago when trying to fend off that surprisingly irate koala? Is knowing the amount of fuel left helpful enough?

Would it not be incredibly better if I had this information instead:

  • How much farther can I drive?
  • Where is the next gas station on my route?
  • Is the route to that gas station open?
  • Does that gas station have the fuel type I need? In stock or just in theory?
  • Are they open for business? For sure or do I just know their official hours?
  • Do they accept my form of payment?
  • Can I even make it to that gas station or should I go elsewhere even if it takes me wildly off course?
  • Can all the above information be summarized in a “just go there” type action if I didn’t want to know all the details and wanted to know what to do quickly?

Old Car Sensors

I’m sure most of us can relate – especially those who have had the misfortune of being stranded due to one of the above reasons.

How InfoSight was Built

In order to get insights from data, the data has to be collected in the first place. And quite frequently, a lot of good data is required in order to have enough information to get meaningful insights.

Sensors First

One of the things Nimble Storage did right from the beginning was to ask the developers to put sensors in software functions, even if there was no immediately apparent way to get value from those sensors (we constantly invent new ways to leverage the data).

This has created a coding methodology that cannot easily be retrofitted by other vendors playing catch-up – the Nimble code is full of really deep sensors (about 4,000 as of this publication in March 2017), backed by an entire process that is without equal.

Yet another car example

A good parallel to our rich sensor methodology is Tesla. They have a lot of very advanced sensors in each car, in order to collect extremely detailed information about how the car is used, where, in what conditions, and what’s around it:

Tesla Sensors

Without the correct type of rich sensors in the right places, the ability to know enough about the environment (including context) is severely compromised. Most competitor cars don’t have anything remotely close to Tesla’s sensor quality and quantity, which is one of the reasons it is so hard for those competitors to catch up in the self-driving car game. To my knowledge, Tesla has collected far more sensor data about real-world driving than any other car company.

Quality sensor data by itself is of limited value, but it’s a good (and necessary) start.

Collection Second

Every day, each Nimble Storage system collects and uploads up to 70 million sensor values – many orders of magnitude richer than what competitors collect (since they didn’t build their systems with that in mind to begin with). In addition to sensors, we also collect logs and config variables.

We don’t only collect data from the Nimble Storage systems.

We also want to collect data from the fabric, the hypervisors, hosts, applications, everything we can get our hands on (and the list is expanding).

A nice side effect is that due to the richness of data collected, Nimble Support will not bother customers with endless requests for huge diagnostic uploads to Support, unlike certain other vendors that shall remain nameless…

For the security conscious: this is anonymized metadata; collecting actual customer data would be illegal and stupid to do (yes, we get the question). All this information just tells us how you’re using the system (and how it’s reacting), not what your data contains. In addition, a customer has to opt in; otherwise, no metadata is sent to Nimble (almost everyone opts in, for benefits that will become extremely clear later on). Here’s the InfoSec policy.

All this sensor data is collected from all customer systems and sent to a (really) Big Data system. It’s a massively parallel implementation and it’s all running on Nimble gear (eat your own dog food as they say, or 3-Michelin-Star meal as the case may be).

Fun fact: In a few hours we collect more data than much bigger vendors (with many more systems deployed) collect in several years with their antediluvian “analytics” frameworks.

Fun fact #2: Nimble has collected more data about how real-world infrastructure is used than any other company. This is the infrastructure database that will enable the “self driving car” AI in IT. Imagine the applications and benefits if even complicated tasks can now be automated with AI.

Now that the data is collected, optimized and placed in an area where it can rapidly be examined and manipulated, the fun begins.

Correlation, Context & Causation Third

It is very easy to draw the wrong conclusions if one relies on simple statistics and even advanced correlation without context awareness.

A nice illustration of this is Anscombe’s Quartet: four very different datasets whose simple summary statistics – means, variances, correlations and the linear fit (the blue trend line) – are essentially identical. However, if one simply looks at a plot of the data points, it becomes abundantly clear that the blue trend line does not always tell the right story.

Anscombe Quartet
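If you want to see this for yourself, here is a minimal sketch (plain NumPy, nothing to do with InfoSight internals) that computes those summary statistics for the four classic Anscombe datasets:

```python
# A minimal sketch: Anscombe's Quartet has four very different datasets whose
# simple summary statistics are essentially identical.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8] * 7 + [19] + [8] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.array(x), np.array(y)
    slope, intercept = np.polyfit(x, y, 1)     # the "blue trend line"
    r = np.corrcoef(x, y)[0, 1]                # correlation coefficient
    print(f"{name}: mean_y={y.mean():.2f} var_y={y.var(ddof=1):.2f} "
          f"r={r:.2f} fit: y={slope:.2f}x+{intercept:.2f}")

# All four print roughly mean_y=7.50, var_y=4.12, r=0.82, y=0.50x+3.00 -
# yet a simple scatter plot shows four completely different stories.
```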

Being able to accurately tell what’s really going on is a big part of this technology.

Another good way to see the drawback of lacking situational awareness is playing chess without knowing the board exists. The illustrious Mr. Wardley has a great article on that. Big Data by itself isn’t enough.

Some of the techniques Nimble Storage uses to separate the wheat from the chaff, without giving away too much:

Machine Learning related:

  • Eigenvalue Decomposition
  • Sliding window correlations
  • Differential equation models of IO flux in order to assess workload contention
  • Autoregressive, Bootstrapping and Monte Carlo methods
  • Correlation across the entire customer population
  • Outlier detection and filtering
  • Semi-supervised mixture models, Bayesian logic for probability inference, Random Forests, Support Vector Machines, Multifeature Clustering Techniques
  • Application behavior models

Others:

  • Interactions between different components in the stack such as array code, hypervisors, server types, switches, etc.
  • Advanced visualizations
  • Zero Day Prevention

As you can see, the InfoSight framework is massive in both scope and execution. In addition, the computational power needed to do this properly is pretty serious.
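To make at least one of those techniques concrete: a sliding-window correlation looks for relationships between two sensor streams that only hold during part of the day. The sketch below is purely illustrative – the sensor names, window size and synthetic data are my own assumptions, not InfoSight internals:

```python
# Hypothetical sketch: correlate two per-minute sensor streams over a sliding
# window, so a relationship that only holds for part of the day still shows up.
import numpy as np

def sliding_window_correlation(a, b, window=60):
    """Pearson correlation of a vs. b over each `window`-sample slice."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    out = []
    for start in range(len(a) - window + 1):
        wa, wb = a[start:start + window], b[start:start + window]
        if wa.std() == 0 or wb.std() == 0:      # avoid divide-by-zero on flat windows
            out.append(0.0)
        else:
            out.append(float(np.corrcoef(wa, wb)[0, 1]))
    return np.array(out)

# Synthetic example: latency only tracks host queue depth in the middle of the day.
rng = np.random.default_rng(0)
queue_depth = rng.poisson(8, 1440).astype(float)       # one sample per minute
latency = rng.normal(1.0, 0.2, 1440)
latency[600:900] += 0.05 * queue_depth[600:900]        # contention window
corr = sliding_window_correlation(latency, queue_depth, window=60)
print(f"peak correlation: {corr.max():.2f} at minute {int(corr.argmax())}")
```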

What Benefits do Nimble Customers get from InfoSight?

The quickest way to describe the overall benefits is massively lower business risk.

A good way to show one of the advantages of the complete solution (Nimble Storage plus InfoSight) is to share some real-life examples of where we were able to help customers identify and resolve incredibly challenging problems.

Example #1: Widespread Hypervisor bug. Business impact: complete downtime

There was a bug in a certain hypervisor that would disconnect the hosts from the storage. It affected the storage industry as a whole. Nimble was instrumental in helping the hypervisor vendor understand exactly what the problem was and how it was manifesting. The fix helped the entire storage industry.

Example #2: Insidious NIC issue. Business impact: very erratic extreme latency affecting critical applications

The customer was seeing huge latency spikes but didn’t call Nimble initially, since the array GUI showed zero latency. They called every other vendor; nobody could see anything wrong, and everyone kept pointing the finger at the storage. After calling Nimble support, we were able to figure out that the server NIC was causing the issues (a NIC that had just passed all the server vendor’s diagnostics with flying colors). Replacing it fixed the problem.

Example #3: Network-wide latency spikes. Business impact: lost revenue

A large financial company was experiencing huge latency spikes for everything connected to a certain other vendor’s array. They had a support case open for 6 months with both their network and storage vendors; the case was unresolved. They brought Nimble in for a POC. We hit exactly the same latency issues. The difference was that we identified the real problem in 20 minutes (an obscure setting on the switches). Changing the setting fixed the original storage vendor’s issues. That customer is buying from us now anyway.

Example #4: Climate control issues. Business impact: data center shutdown

A customer had a failing air conditioning system. We proactively identified the issue a full 36 minutes before their own monitoring systems managed to do so. Very serious, and apparently more common than one would think.

We have many more examples, but this is an abnormally long blog post as it is… I hope this is enough to show that the value of InfoSight goes far beyond the storage and extends all the way up the stack (it even tangentially helps with environmental issues, and we didn’t even discuss how it helps identify security issues).

Indeed, due to all the Nimble technology, our average case resolution time is around 30 minutes (from the time you pick up the phone, not from the time you get a call back).

So my question, gentle reader, is: If any of the aforementioned problems happened to you, how long would it take for you to resolve them? What would the process be? How many vendors would you have to call? What would the impact to your business be while having the problems? What would the cost be? And does that cost include intangible things like company reputation?

Being Proactive

Solving difficult problems more quickly than was ever possible in the past is, of course, one of the major benefits of InfoSight.

But would it not be better to not have the problem in the first place?

The Prime Directive

We stand by this adage: If any single customer has a problem, no other customer ever should. So we implement Zero Day Prevention.

It doesn’t just mean we will fix bugs as we find them. It means we will try to automatically prevent the entire customer base from hitting that same problem, even if the problem has affected only one customer.

Only good things can come from such a policy.

Inoculation & Blacklisting

There are two technologies at play here (which can also work simultaneously):

  1. Inoculation: Let’s assume a customer has a problem and that problem is utterly unrelated to any Nimble code. Instead of saying “Not Us”, we will still try to perform root cause analysis, and if we can identify a workaround (even if totally unrelated to us), we will automatically create cases for all customers that may be susceptible to this issue. The case will include instructions on how to proactively prevent the problem from occurring.
  2. Blacklisting: In this case, there is something we can do that is related to our OS. For instance, a hypervisor bug that only affects certain versions of our OS, or some sort of incompatibility. We will then automatically prevent susceptible customers from installing any versions of our OS that clash with their infrastructure. Contrast that to other vendors asking customers to look at compatibility matrices or, worse, letting customers download and install any OS version they’d like…
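Purely as an illustration of the concept (the rule format, version numbers and names below are my own hypothetical assumptions, not how InfoSight actually implements this), a blacklist check boils down to refusing an OS version when it is known to clash with something observed in that customer’s environment:

```python
# Hypothetical sketch of a version-blacklist check: block an OS upgrade when the
# target version is known to clash with something observed in the environment.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlacklistRule:
    os_versions: frozenset      # array OS versions the rule applies to (made-up values)
    component: str              # e.g. "hypervisor", "nic_firmware"
    bad_values: frozenset       # component versions known to clash (made-up values)
    reason: str

RULES = [
    BlacklistRule(frozenset({"3.4.0", "3.4.1"}), "hypervisor",
                  frozenset({"6.0u1"}), "known datastore disconnect bug"),
]

def allowed_upgrade(target_os: str, environment: dict) -> tuple:
    """Return (allowed, reason). `environment` maps component -> observed version."""
    for rule in RULES:
        if target_os in rule.os_versions and environment.get(rule.component) in rule.bad_values:
            return False, f"blocked: {rule.reason}"
    return True, "ok"

print(allowed_upgrade("3.4.1", {"hypervisor": "6.0u1"}))   # blocked
print(allowed_upgrade("3.4.1", {"hypervisor": "6.5"}))     # allowed
```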

Here’s an example of a customer that we won’t let upgrade until we help them resolve the underlying issue:

Blacklisting

Hardware upgrade recommendations

Another proactive thing InfoSight does is to recommend not just software but also hardware upgrades based on proprietary and extremely intelligent analysis of trending (and not based on simple statistical models, per the Anscombe’s Quartet example earlier on).

Things like controller, cache, media and capacity are all examined, and customers see specific, easy-to-follow recommendations:

Upgrade recommendation

Notice we don’t say something vague like “not enough cache”. We explicitly say “add this much more cache.”

There is still enough cache in this example system for now, but with intelligent projections, the recommended amount is calculated and presented. The customer can then easily take action without needing to call support or hire expensive consultants – and take action before things get too tight.
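To give a flavor of the general idea only (Nimble’s actual analysis is far more sophisticated and context-aware than a plain trend line, per the Anscombe caveat above), here is a minimal hypothetical sketch of turning a working-set trend into an “add this much cache” number:

```python
# Hypothetical sketch: fit a trend to daily working-set measurements and recommend
# how much cache to add so the projected working set still fits N days out.
import numpy as np

def cache_recommendation(working_set_gb, installed_cache_gb, horizon_days=180, headroom=1.2):
    """working_set_gb: one measurement per day (GB). Returns GB of cache to add (0 if none)."""
    days = np.arange(len(working_set_gb))
    slope, intercept = np.polyfit(days, working_set_gb, 1)      # naive linear trend
    projected = slope * (days[-1] + horizon_days) + intercept   # working set at the horizon
    needed = projected * headroom                               # keep some headroom on top
    return max(0.0, needed - installed_cache_gb)

rng = np.random.default_rng(0)
history = 800 + 2.5 * np.arange(90) + rng.normal(0, 20, 90)     # ~90 days of measured growth
print(f"recommended additional cache: {cache_recommendation(history, 1500):.0f} GB")
```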

Other InfoSight Benefits

There are many other benefits, but this post is getting dangerously long so I will describe just a few of them in a list instead of showing screenshots and detailed descriptions for everything.

  • Improved case automation: About 90% of all cases are opened automatically, and a huge percentage of them are closed automatically
  • Accelerated code evolution: Things that take other companies years to detect and fix/optimize are now tackled in a few weeks or even days
  • App-aware reporting: Instead of reporting generically, be able to see detailed breakdowns by application type (and even different data types for the same application, for example Oracle data vs Oracle logs)
  • Insight into how real applications work: check here for an excellent paper.
  • Sizing: Using the knowledge of how applications work across the entire customer base, help array sizing become more accurate, even in the face of incomplete requirements data
  • Protection: tell customers what is protected via snaps/replication, including lag times
  • Visualizations: Various ways of showing heretofore difficult or impossible to see behaviors and relationships
  • Heuristics-based storage behavior: Using knowledge from InfoSight, teach the storage system how to behave in the face of complex issues, even when it’s not connected to InfoSight
  • Extension into cloud consumers: Provide visibility into AWS and Azure environments using Nimble Cloud Volumes (NCV).

And we are only getting started. Think of what kind of infrastructure automation could be possible with this amount of intelligence, especially at large scale.

Regarding vendors that claim they will perform advanced predictive analytics inside their systems…

Really? They can spare that much computational power on their systems, and dedicate incredible amounts of spare capacity, and have figured out all the advanced algorithms necessary, and built all the right sensors in their code, and accurately cross-correlate across the entire infrastructure, and automatically consult with every other system on the planet, and help identify weird problems that are way outside the province of that vendor? Sold! Do they sell bridges, too?

There is only so much one can do inside the system, or even by adding a dedicated “monitoring server” to the infrastructure. This is why we elected to offload all this incredibly heavy processing to a massively parallel infrastructure in our cloud instead of relying on weak, solipsistic techniques.

A Common Misconception

Instead of the usual summary, I will end this post rather differently.

Often, Nimble customers, resellers, competitors (and sometimes even a few of the less technical Nimble employees) think that the customer-visible GUI of InfoSight is all of InfoSight.

OneDoesNotSimplySeeInfoSight

InfoSight actually is the combination of the entire back end, machine learning, heuristics, proactive problem prevention, automation, the myriad of internal visualization, correlation and data massaging mechanisms, plus the customer-visible part.

Arguably, what we let customers see is a tiny, tiny sliver of the incredible richness of data we have available in our Big Data system. Internally we see all the data, and use it to provide the incredible reliability and support experience Nimble customers have come to love.

There’s a good reason we don’t expose the entire back end to customers.

For normal humans without extensive training it would be a bit like seeing The Matrix code, hence the easy visualizations we provide for a subset of the data, with more coming online all the time (if you’re a customer see if you have access yet to the cool new Labs portion, found under the Manage tab).

So this (for instance, an app-aware histogram proving that SQL DBs do almost zero I/O at 32KB):

IO Histogram SQL

Or this (a heatmap showing a correlation of just sequential I/O, just for SQL, for the whole week – easy to spot patterns, for example at 0600 every day there’s a burst):

Heatmap

Instead of this:

Xmatrix

So, next time you see some other vendor showing a pretty graph, telling you “hey, we also have InfoSight! Ours looks even prettier!” – remember this section. It’s like saying “all cars have steering wheels, including the rusty Pinto on cinder blocks”.

They’re like kids arguing about a toy, while we are busy colonizing other planets…

KidsVscolonization

And if you read all the way to this point, congratulations, and apologies for the length. If it’s any consolation, it’s about a third of what it should have been to cover the subject properly…

D

Going Green: Why I Joined Nimble Storage

I am proud to announce that, as of today, I am a member of the Nimble Storage team.

Nimble Logo

This marks the end of an era – I spent quite a bit of time at NetApp: learned a lot, did a lot – by the end I had my hands in all kinds of sausage making… 🙂

I wish my friends at NetApp the best of luck for the future. The storage industry is a very tough arena, and one that will only become harder and less forgiving than ever before.

Why?

I compared Nimble Storage with many competitors before making my decision. Quite simply, Nimble’s core values agree with mine. It goes without saying that I wouldn’t choose to move to a company unless I believed they had the best technology (and the best support), but the core values are where it all starts. The product is built upon those core values.

I firmly believe that modern storage should be easy to consume. Indeed, it should be a joy to consume, even for complex multi-site environments. It should not be a burden to deal with. Nor should it be a burden to sell.

Systems that are holistically easy to consume have several business benefits, some of which are: lower OPEX and CAPEX, increased productivity, less risk, easier planning, faster problem resolution.

It’s important to understand that easy to consume is not at all the same as easy to use – that is but a very small subset of easy consumption.

The core value of easy consumption encompasses several aspects, many of which are ignored by most storage vendors. Most modern players will focus on ease of use, show demos of a pretty GUI and suchlike. “Look how easy it is to install” or “look how easy it is to create a LUN”. Well – there’s a lot more to worry about in real life.

The lifecycle of a storage system

Beyond initial installation and simple element creation, there is a multitude of things users and vendors need to be concerned with. Here’s a partial list:

  • Installation
  • Migration to/from
  • Provisioning
  • Host/fabric configuration
  • Backups, restores, replication
  • Scaling up/out
  • Upgrading from a smaller/older version
  • Firmware updates for all components (including drives)
  • Tech refresh
  • Support

What about more advanced topics?

A storage solution cannot exist in a vacuum. There are several ancillary (but extremely important) services needed in order to help consume storage, especially at scale. Services that typically cannot (and many that should not) reside on the storage system itself. How about…

  • Initial and future sizing
  • Capacity planning based on long-term usage data
  • Performance analysis and profiling
  • Performance issue resolution/recommendations
  • Root cause analysis
  • What-if scenario modeling
  • Support case resolution
  • Comprehensive end-to-end monitoring and alerting
  • Comprehensive reporting (including auditing)
  • Security (including RBAC and delegation)
  • Upgrade planning
  • Pervasive automation (including host-side)
  • Ensuring adherence to best practices

If a storage solution doesn’t make all or most of the above straightforward, then it is not truly easy to consume.

The problem:

Storage vendors will typically either be lacking in many of the above areas, or may need many different tools, and significant manual effort, in order to provide even some of these services.

Not having the tools creates an obvious problem – the customer and vendor simply can’t perform these functions, or the implementation is too basic. Most smaller vendors are in this camp. Not much functionality beyond what’s inside the storage device itself. Long-term consumption, especially at scale, becomes challenging.

On the other hand, having a multitude of tools to help with these areas also makes the solution hard to consume overall. Larger vendors fall into this category.

For instance: Customers may need to access many different tools just to monitor and alert on various metrics. One tool may provide certain information, another tool provides different metrics (often with significant overlap with the first tool), and so on. And not all tools work with all versions of the product. This increases administrative complexity and overall time and money spent. And the end result is often compromised and incredibly hard to support.

Vendors that need many different tools also create a problem for themselves: Almost nobody on staff will have the expertise to deal with the plethora of tools necessary to do certain things like sizing, performance troubleshooting or even a tech refresh. Or optimizing a product for specific workloads. Deep expertise is often needed to interpret the results of the tools. This causes interminable delays in problem resolution, lengthens sales cycles, complicates product development, creates staffing challenges, increases costs, and in general makes life miserable.

RG autojack

How?

What always fascinated me about Nimble Storage is that not only did they recognize these challenges, they actually built an entire infrastructure and innovative approach in order to solve the problem.

Nimble recognized the value of Predictive Analytics.

The challenge: How to use Big Data to solve the challenges faced by storage customers and storage vendors. And how to do this in a way that achieves a dramatically better end result.

While most vendors have call-home features, and some even have rudimentary capacity, configuration and maybe even performance telemetry being sent to some central repository (usually very infrequently), Nimble elected instead to send extremely comprehensive sensor telemetry to a huge analytics engine. A difficult undertaking, but one that would define the company in the years to come.

Nimble also recognized the need to do this from the very beginning. Each Nimble array sends 30-70 million data points back to Nimble every day. Trying to retrofit telemetry of this scope would be extremely difficult if not impossible to achieve effectively.

This wealth of data (the largest storage-related analytics engine in the world, by far) is used to help customers with the challenges mentioned previously, while at the same time lowering complexity.

It also, crucially, helps Nimble better support customers and design better products without having to bother customers for data dumps.

For example: What if a Nimble engineer trying to optimize SQL I/O performance wants to see detailed I/O statistics only for SQL workloads on all Nimble arrays in the world? Or on one array? Or on all arrays at a certain customer? It’s only a simple query away… and that’s just scratching the surface of what’s possible. It certainly beats trying to design storage based on arbitrary synthetic benchmarks, or manually grabbing performance traces from customer gear…
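To give a feel for what “a simple query away” might look like conceptually – the schema, column names and pandas framing below are purely my own illustrative assumptions, not Nimble’s actual tooling:

```python
# Hypothetical sketch: aggregate per-array I/O telemetry for SQL workloads only.
# Column names (app_type, io_size_kb, read_pct, ...) are assumptions for illustration.
import pandas as pd

telemetry = pd.DataFrame({
    "array_id":   ["a1", "a1", "a2", "a3", "a3"],
    "customer":   ["acme", "acme", "acme", "globex", "globex"],
    "app_type":   ["sql", "exchange", "sql", "sql", "vdi"],
    "io_size_kb": [8, 32, 64, 8, 4],
    "iops":       [12000, 3000, 900, 8000, 15000],
    "read_pct":   [70, 55, 90, 65, 80],
})

sql_only = telemetry[telemetry.app_type == "sql"]

# "Worldwide" view: I/O size distribution for SQL, weighted by IOPS.
print(sql_only.groupby("io_size_kb")["iops"].sum())

# Narrowing to one customer (or one array) is just another filter away.
print(sql_only[sql_only.customer == "acme"].groupby("array_id")["read_pct"].mean())
```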

What?

Enter InfoSight. That’s the name of the gigantic analytics engine currently ingesting trillions of anonymized sensor data points every week. And growing. Check some numbers here

Nimble Storage customers do not need to install custom monitoring tools to perform highly advanced storage analytics, performance troubleshooting, and even hardware upgrade recommendations based on automated performance analysis heuristics.

No need to use the CLI, no need to manually send data dumps to the vendor, no need to use 10 different tools.

All the information customers need is available through a browser GUI. Even the vast majority of support cases are automatically handled by InfoSight, and I’m not talking about simply sending replacement hardware (that’s trivial).

I always saw InfoSight as the core offering of Nimble Storage, the huge differentiator that works hand in hand with the hardware and helps customers consume storage easily. Yes, Nimble Storage arrays are fast, reliable, easy to use, have impressive data reduction abilities, scale nicely, have great features, are cost-effective etc. But other vendors can claim they can satisfy at least some of those attributes.

Nobody else has anything even remotely approaching the depth and capability of InfoSight. This is why Nimble calls their offering the Predictive Flash Platform. InfoSight Predictive Analytics + great hardware = Predictive Flash.

I will be covering this fascinating topic in a lot more depth in the future. An AI expert system powered by a behemoth analytics engine, helping reduce complexity and making the solution Easy To Consume, is a pretty impressive piece of engineering.

Watch this space…

D


NetApp delivers 2TB/s performance to giant supercomputer for big data

(Edited: My bad, it was 2TB/s, up from 1.3TB/s – the solution has been getting bigger and has been upgraded. Also, the post talks about the E5400; the newer E5600 is much faster.)

What do you do when you need so much I/O performance that no one single storage system can deliver it, no matter how large?

To be specific: What if you needed to transfer data at over 1TB per second (or 2TB/s, as it eventually turned out to be)?

That was the problem faced by the U.S. Department of Energy (DoE) and their Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL), one of the fastest supercomputing systems on the planet.

You can read the official press release here. I wanted to get more into the technical details.

People talk a lot about “big data” these days – no clear definition seems to exist; in my opinion it’s something that has some of the following properties:

  • Too much data to be processed by a “normal” computer or cluster
  • Too much data to work with using a relational DB
  • Too much data to fit in a single storage system for performance and/or capacity reasons – or maybe just simply:
  • Too much data to process using traditional methods within an acceptable time frame

Clearly, this is a bit loose – how much is “too much”? How long is “too long”? For someone only armed with a subnotebook computer, “too much” does not have the same meaning as for someone rocking a 12-core server with 256GB RAM and a few TB of SSD.

So this definition is relative… but in some cases, such as the one we are discussing, absolute – given the limitations of today’s technology.

For instance, the amount of storage LLNL required was several tens of PB in a single storage pool that could provide unprecedented I/O performance to the tune of 2TB/s. Both size and performance needed to be scalable. It also needed to be reliable and fit within a reasonable budget and not require extreme space, power and cooling. A tall order indeed.

This created some serious logistics problems regarding storage:

  • No single disk array can hold that amount of data
  • No single disk array can perform anywhere close to 2TB/s

Let’s put this in perspective: the storage systems that scale the biggest are typically scale-out clusters from the usual suspects of the storage world (we make one, for example). Even so, they max out at fewer PB than this deployment required.

The even bigger problem is that a single large scale-out system can’t really deliver more than a few tens of GB/s under optimal conditions – more than fast enough for most “normal” uses but utterly unacceptable for this case.

The only realistic solution to satisfy the requirements was massive parallelization, specifically using the NetApp E-Series for the back-end storage and the Lustre cluster filesystem.

 

A bit about the solution…

Almost a year ago NetApp purchased the Engenio storage line from LSI. That storage line is resold by several companies like IBM, Oracle, Quantum, Dell, SGI, Teradata and more. IBM also resells the ONTAP-based FAS systems and calls them “N-Series”.

That purchase has made NetApp the largest provider of OEM arrays on the planet by far. It was a good deal – very rapid ROI.

There was a lot of speculation as to why NetApp would bother with the purchase. After all, the ONTAP-based systems have a ton more functionality than pretty much any other array and are optimized for typical mostly-random workloads – DBs, VMs, email, plus megacaching, snaps, cloning, dedupe, compression, etc – all with RAID6-equivalent protection as standard.

The E-Series boxes on the other hand don’t do thin provisioning, dedupe, compression, megacaching… and their snaps are the less efficient copy-on-first-write instead of redirect-on-write. So, almost the anti-ONTAP 🙂

The first reason for the acquisition was that, on purely financial terms, it was a no-brainer deal even if one sells shoes for a living, let alone storage. Even if there were no other reasons, this one would be enough.

Another reason (and the one germane to this article) was that the E-Series has a tremendous sustained sequential performance density. For instance, the E5400 system can sustain about 4GB/s in 4U (real GB/s, not out of cache), all-in. That’s 4U total for 60 disks including the controllers. Expandable, of course. It’s no slouch for random I/O either, plus you can load it with SSDs, too… 🙂 (Update: the newer E5600 can go up to 12GB/s in 2U with SSDs!)

Again, note – 60 drives per 4U shelf and that includes the RAID controllers, batteries etc. In addition, all drives are front-loading and stay active while servicing the shelf – as opposed to most (if not all) dense shelves in the market that need the entire (very heavy) shelf pulled out and/or several drives offlined in order to replace a single failed drive… (there’s some really cool engineering in the shelf to do this without thermal problems, performance loss or vibrations). All this allows standard racks and no fear of the racks tipping over while servicing the shelves 🙂 (you know who you are!)

There are some vendors that purely specialize in sequential I/O and tipping racks – yet they have about 3-4x less performance density than the E5400, even though they sometimes have higher per-controller throughput. In a typical marketing exercise, some of our more usual competitors have boasted 2GB/s/RU for their controllers, meaning that in 4U the controllers (which take up 4U in that example) can do 8GB/s – but that requires all kinds of extra rack space to achieve (extra UPSes, several shelves, etc.), making their resulting actual throughput well under 1GB/s/RU. Not to mention the cost (those systems are typically more expensive than a 5400), which matters for projects of the scale we are talking about.

Most importantly, what we accomplished at the LLNL was no marketing exercise…

 

The benefits of truly high performance density

Clearly, if your requirements are big enough, you end up spending a lot less money and needing a lot less rack space, power and cooling by going with a highly performance-dense solution.

However, given the requirements of the LLNL, it’s clear that a single E5400 can’t satisfy the performance and capacity needs of this use case. What you can do, though, is use a bunch of them in parallel… and use that massive performance density to achieve about 40GB/s per industry-standard rack with 600x high-capacity disks (1.8PB raw per rack).

For even higher performance per rack, the E5400 can use the faster SAS or SSD drives – 480 drives per rack (up to 432TB raw), providing 80GB/s reads/60GB/s writes.
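Back-of-the-envelope, using the per-rack figures quoted above and treating them as ideal, perfectly parallel numbers (real deployments have filesystem and network overhead), the rack math for a 2TB/s target looks roughly like this:

```python
# Rough rack math using the per-rack figures quoted above (idealized; assumes
# perfect parallel scaling and ignores filesystem/network overhead).
import math

target_tb_s = 2.0              # 2 TB/s aggregate target
rack_gb_s = 40.0               # ~40 GB/s per rack with 600 high-capacity disks
rack_pb = 1.8                  # ~1.8 PB raw per such rack

racks = math.ceil(target_tb_s * 1000 / rack_gb_s)
print(f"racks needed for {target_tb_s} TB/s: {racks}")                    # 50
print(f"raw capacity at that scale: {racks * rack_pb:.0f} PB")            # ~90 PB raw

# With the faster SAS/SSD racks (~80 GB/s reads quoted above), the rack count
# for reads would roughly halve, at the cost of raw capacity per rack.
```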

 

Enter the cluster filesystem

So, now that we picked the performance-dense, reliable, cost-effective building block, how do we tie those building blocks together?

The answer: By using a cluster filesystem.

Loosely defined, a cluster filesystem is simply a filesystem that can be accessed simultaneously by the servers mounting it. In addition, it also typically means it can span storage systems and make them look as one big entity.

It’s not a new concept – there are several examples, old and new: AFS, Coda, GPFS, and the more prevalent StorNext and Lustre, to name a few.

The LLNL picked Lustre for this project. Lustre is a distributed filesystem that spreads I/O across multiple Object Storage Servers, each connected to its own storage (the Object Storage Targets). Metadata is served by dedicated servers that are not part of the I/O stream and thus not a bottleneck. See below for a picture (courtesy of the Lustre manual) of how it is all connected:

 

Lustre Scaled Cluster

 

High-speed connections are used liberally for lower latency and higher throughput.

A large file can reside on many storage servers, and as a result I/O can be spread out and parallelized.

Lustre clients see a single large namespace and run a proprietary protocol to access the cluster.
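To make the striping idea a bit more concrete, here is a tiny sketch of how round-robin striping maps a file’s byte ranges onto Object Storage Targets – the stripe size and count are arbitrary example values, not the LLNL configuration:

```python
# Illustrative sketch of round-robin file striping: which OST (object storage
# target) serves a given byte offset, for an example stripe layout.
def ost_for_offset(offset_bytes, stripe_size=1 << 20, stripe_count=8):
    """Stripe N of the file lives on OST (N mod stripe_count)."""
    stripe_index = offset_bytes // stripe_size
    return stripe_index % stripe_count

# A 16 MB sequential read with 1 MB stripes over 8 OSTs touches every OST twice,
# so the work (and the bandwidth) is spread across many storage servers in parallel.
reads = [ost_for_offset(mb << 20) for mb in range(16)]
print(reads)   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```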

It sounds good in theory – and it delivered in practice: 1.3TB/s sustained performance was demonstrated to the NetApp block devices. Work is ongoing to finalize the testing with the complete Lustre environment. Not sure what the upper limit would be. But clearly it’s a highly scalable solution.

 

Putting it all together

NetApp has fully realized solutions for the “big data” applications out there – complete with the products and services needed to deliver each engagement. The Lustre solution employed by the LLNL is just one of the options available. There is Hadoop, full-motion uncompressed HD video, and more.

So – how fast do you need to go?

D

 

 
