How Nimble Customers Benefit from Big Data Predictive Analytics

In past articles I have covered how Nimble Storage lowers customer risk with technologies internal to the array itself.

Now, it’s time to write a more in-depth article about the immense customer benefits of InfoSight, Nimble’s Predictive Analytics platform.

InfoSight

I’ve been at Nimble Storage for about a year now, and this technology remains the fundamental reason I chose to come to Nimble over the many other opportunities I had.

My opinion hasn’t changed – if anything, I increasingly see this technology as the cornerstone of fundamentally advancing how infrastructure works. Not having something like it has become a major competitive disadvantage for other vendors, because its absence increases risk for their customers.

I will show how InfoSight is far beyond what any other storage vendor has in the analytics area in both scope and capability. Because it’s not about just the storage…

Imagine a technology that could, among many other things, reduce problem resolution time from several days to minutes. How could that help your company when something really hard to diagnose (and well outside storage) affects a critical part of the business? What would the business value of such an ability be?

Why InfoSight was Built

The fundamental reason behind building this technology is to reduce business risk by improving the reliability of the entire infrastructure, not just the reliability of storage.

A secondary goal for the Predictive Analytics technology in InfoSight is to provide statistics, trending analysis, recommendations and overall visibility in the infrastructure.

Most storage vendors are very focused on, predictably, storage. Their reporting, troubleshooting and analytics – they’re focused on the storage piece. However, statistically, when dealing with quality enterprise storage gear, storage is quite simply not the leading cause of overall infrastructure issues affecting the delivery of data to applications.

The App-Data Gap

Any delay between the application and the delivery of its data to the end user is called the App-Data Gap.

App data gap

Clearly, App-Data Gap issues could stem from any number of sources.

Business owners care about applications serving data, and arguably many couldn’t care less about the infrastructure details as long as it’s meeting business SLAs.

But what happens when things inevitably go wrong? Because it is a case of when, not if. What if something goes wrong in such a strange way that even redundant systems don’t quite help? Or if it’s an intermittent issue that can’t be tracked down? Or an endemic issue that has been plaguing the environment for a long time?

The State of IT Troubleshooting

With the complexity of today’s data centers (whether on-premises or cloud-based), there is so much going on that troubleshooting is hard, even for basic “point me in the right direction” root cause analysis. And if something really arcane is going on, timely troubleshooting may prove impossible.

There is no shortage of data – if anything, there is often too much of it, but it is usually not of high enough quality, and it is often presented in ways that are neither helpful nor correlated.

IT Automation

The true value starts once quality data is correlated and troubleshooting becomes proactive and, ultimately, automatic.

Cars are always a good example…

Imagine cars with basic sensors and instrumentation – if something goes wrong, do you get enough information? Do you know the optimal course of action? Quickly enough?

Let’s say the fuel lamp lights up. What does that mean? That I have X amount of fuel left because I think it said so in the car’s manual that I lost years ago when trying to fend off that surprisingly irate koala? Is knowing the amount of fuel left helpful enough?

Would it not be incredibly better if I had this information instead:

  • How much farther can I drive?
  • Where is the next gas station on my route?
  • Is the route to that gas station open?
  • Does that gas station have the fuel type I need? In stock or just in theory?
  • Are they open for business? For sure or do I just know their official hours?
  • Do they accept my form of payment?
  • Can I even make it to that gas station or should I go elsewhere even if it takes me wildly off course?
  • Can all the above information be summarized in a “just go there” type action if I didn’t want to know all the details and wanted to know what to do quickly?

Old Car Sensors

I’m sure most of us can relate – especially those who have had the misfortune of being stranded for one of the above reasons.

How InfoSight was Built

In order to get insights from data, the data has to be collected in the first place. And quite frequently, a lot of good data is required in order to have enough information to get meaningful insights.

Sensors First

One of the things Nimble Storage did right from the beginning was to ask the developers to put sensors in software functions, even if there was no immediately apparent way to get value from those sensors (we constantly invent new ways to leverage the data).

This created a coding methodology that vendors playing catch-up cannot easily retrofit – the Nimble code is full of really deep sensors (about 4,000 as of this publication in March 2017), backed by an entire process that is without equal.

Yet another car example

A good parallel to our rich sensor methodology is Tesla. They have a lot of very advanced sensors in each car, in order to collect extremely detailed information about how the car is used, where, in what conditions, and what’s around it:

Tesla Sensors

Without the correct type of rich sensors in the right places, the ability to know enough about the environment (including context) is severely compromised. Most competing cars don’t have anything remotely close to Tesla’s sensor quality and quantity, which is one of the reasons it is so hard for Tesla’s competitors to catch up in the self-driving car game. To my knowledge, Tesla has collected far more sensor data about real-world driving than any other car company.

Quality sensor data by itself is of limited value, but it’s a good (and necessary) start.

Collection Second

Every day, each Nimble Storage system collects and uploads up to 70 million sensor values, which is many orders of magnitude richer than what competitors do (since they didn’t build the system with that in mind to begin with). In addition to sensors, we also collect logs and config variables.

We don’t only collect data from the Nimble Storage systems.

We also want to collect data from the fabric, the hypervisors, hosts, applications, everything we can get our hands on (and the list is expanding).

A nice side effect: due to the richness of the data collected, Nimble Support will not bother customers with endless requests for huge diagnostic uploads, unlike certain other vendors that shall remain nameless…

For the security conscious: this is anonymized metadata. Collecting actual customer data would be both illegal and foolish (yes, we get the question). All this information tells us is how you’re using the system (and how it’s reacting), not what your data contains. In addition, a customer has to opt in; otherwise no metadata is sent to Nimble (almost everyone opts in, for benefits that will become extremely clear later on). Here’s the InfoSec policy.

All this sensor data is collected from all customer systems and sent to a (really) Big Data system. It’s a massively parallel implementation and it’s all running on Nimble gear (eat your own dog food as they say, or 3-Michelin-Star meal as the case may be).

Fun fact: In a few hours we collect more data than much bigger vendors (with many more systems deployed) collect in several years with their antediluvian “analytics” frameworks.

Fun fact #2: Nimble has collected more data about how real-world infrastructure is used than any other company. This is the infrastructure database that will enable the “self driving car” AI in IT. Imagine the applications and benefits if even complicated tasks can now be automated with AI.

Now that the data is collected, optimized and placed in an area where it can rapidly be examined and manipulated, the fun begins.

Correlation, Context & Causation Third

It is very easy to draw the wrong conclusions if one relies on simple statistics and even advanced correlation without context awareness.

A nice illustration of this is Anscombe’s Quartet. If someone does a simple statistical analysis for all four datasets, the results are identical (the blue trend line). However, if one simply looks at a plot of the data points, it becomes abundantly clear that the blue trend line does not always tell the right story.

Anscombe Quartet
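
For a concrete illustration, here is a minimal Python sketch (standard library only, Python 3.10+ for statistics.correlation and statistics.linear_regression) that runs the usual summary statistics over the four published quartet datasets. The numbers come out essentially identical, even though plots of the data look nothing alike:

    from statistics import mean, variance, correlation, linear_regression

    # The four published Anscombe datasets (x, y pairs).
    x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
    x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
    quartet = {
        "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
        "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
        "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
        "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
    }

    for name, (x, y) in quartet.items():
        slope, intercept = linear_regression(x, y)
        print(f"{name}: mean(y)={mean(y):.2f}  var(y)={variance(y):.2f}  "
              f"r={correlation(x, y):.2f}  fit: y={slope:.2f}x+{intercept:.2f}")
    # All four datasets print essentially the same statistics and the same trend line;
    # only a plot of the points reveals how different they really are.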

Being able to accurately tell what’s really going on is a big part of this technology.

Another good way to see the drawback of lacking situational awareness is playing chess without knowing the board exists. The illustrious Mr. Wardley has a great article on that. Big Data by itself isn’t enough.

Some of the techniques Nimble Storage uses to separate the wheat from the chaff, without giving away too much (a toy sketch of one of them, sliding-window correlation, follows the list):

Machine Learning related:

  • Eigenvalue Decomposition
  • Sliding window correlations
  • Differential equation models of IO flux in order to assess workload contention
  • Autoregressive, Bootstrapping and Monte Carlo methods
  • Correlation across the entire customer population
  • Outlier detection and filtering
  • Semi-supervised mixture models, Bayesian logic for probability inference, Random Forests, Support Vector Machines, Multifeature Clustering Techniques
  • Application behavior models

Others:

  • Interactions between different components in the stack such as array code, hypervisors, server types, switches, etc.
  • Advanced visualizations
  • Zero Day Prevention
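
Purely as an illustration (none of this is Nimble’s actual code), here is the toy sketch of the sliding-window correlation idea promised above: correlate two sensor streams over a moving window, so that a relationship which only holds intermittently – and which a single whole-series correlation would wash out – still stands out.

    from statistics import correlation, StatisticsError

    def sliding_window_correlation(a, b, window=12, threshold=0.9):
        """Return (start_index, r) for every window where |r| exceeds the threshold.

        a, b: equal-length sensor time series (e.g., array latency and host queue depth).
        """
        hits = []
        for start in range(len(a) - window + 1):
            wa, wb = a[start:start + window], b[start:start + window]
            try:
                r = correlation(wa, wb)
            except StatisticsError:   # a constant window has no defined correlation
                continue
            if abs(r) >= threshold:
                hits.append((start, round(r, 3)))
        return hits

    # Hypothetical example: latency only tracks queue depth during samples 20-40.
    latency = [1.0] * 20 + [1.0 + 0.5 * i for i in range(20)] + [1.0] * 20
    qdepth = [4] * 20 + [4 + i for i in range(20)] + [4] * 20
    print(sliding_window_correlation(latency, qdepth))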

As you can see, the InfoSight framework is massive in both scope and execution. In addition, the computational power needed to do this properly is pretty serious.

What Benefits do Nimble Customers get from InfoSight?

The quickest way to describe the overall benefits is massively lower business risk.

A good way to show one of the advantages of the complete solution (Nimble Storage plus InfoSight) is to share some real-life examples of where we were able to help customers identify and resolve incredibly challenging problems.

Example #1: Widespread Hypervisor bug. Business impact: complete downtime

There was a bug in a certain hypervisor that would disconnect the hosts from the storage. It affected the storage industry as a whole. Nimble was instrumental in helping the hypervisor vendor understand exactly what the problem was and how it was manifesting. The fix helped the entire storage industry.

Example #2: Insidious NIC issue. Business impact: very erratic extreme latency affecting critical applications

The customer was seeing huge latency spikes but didn’t call Nimble initially, since the array GUI showed zero latency. They called every other vendor; nobody could see anything wrong, and everyone kept pointing the finger at the storage. Once they called Nimble Support, we were able to determine that the server NIC was causing the issues (a NIC that had just passed all the server vendor’s diagnostics with flying colors). Replacing it fixed the problem.

Example #3: Network-wide latency spikes. Business impact: lost revenue

A large financial company was experiencing huge latency spikes for everything connected to a certain other vendor’s array. They had a support case open for 6 months with both their network and storage vendors, and the case remained unresolved. They brought Nimble in for a POC. We hit exactly the same latency issues – the difference was that we identified the real problem in 20 minutes (an obscure setting on the switches). Changing that setting fixed the original storage vendor’s issues. That customer is buying from us now anyway.

Example #4: Climate control issues. Business impact: data center shutdown

A customer had a failing air conditioning system. We proactively identified the issue a full 36 minutes before their own monitoring systems managed to do so. Very serious, and apparently more common than one would think.

We have many more examples, but this is already an abnormally long blog post… I hope this is enough to show that the value of InfoSight goes far beyond the storage and extends all the way up the stack (it even tangentially helps with environmental issues, and we didn’t even discuss how it helps identify security issues).

Indeed, due to all the Nimble technology, our average case resolution time is around 30 minutes (from the time you pick up the phone, not from the time you get a call back).

So my question, gentle reader, is: If any of the aforementioned problems happened to you, how long would it take for you to resolve them? What would the process be? How many vendors would you have to call? What would the impact to your business be while having the problems? What would the cost be? And does that cost include intangible things like company reputation?

Being Proactive

Solving difficult problems more quickly than was ever possible in the past is, of course, one of the major benefits of InfoSight.

But would it not be better to not have the problem in the first place?

The Prime Directive

We stand by this adage: If any single customer has a problem, no other customer ever should. So we implement Zero Day Prevention.

It doesn’t just mean we will fix bugs as we find them. It means we will try to automatically prevent the entire customer base from hitting that same problem, even if the problem has affected only one customer.

Only good things can come from such a policy.

Inoculation & Blacklisting

There are two technologies at play here (which can also work simultaneously):

  1. Inoculation: Let’s assume a customer has a problem and that problem is utterly unrelated to any Nimble code. Instead of saying “Not Us”, we will still try to perform root cause analysis, and if we can identify a workaround (even if totally unrelated to us), we will automatically create cases for all customers that may be susceptible to this issue. The case will include instructions on how to proactively prevent the problem from occurring.
  2. Blacklisting: In this case, there is something we can do that is related to our OS. For instance, a hypervisor bug that only affects certain versions of our OS, or some sort of incompatibility. We will then automatically prevent susceptible customers from installing any versions of our OS that clash with their infrastructure (a toy sketch of such a compatibility gate follows this list). Contrast that with other vendors asking customers to look at compatibility matrices or, worse, letting customers download and install any OS version they’d like…
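
Here is the toy sketch of such a compatibility gate, purely hypothetical and with made-up version strings and component names (it is not Nimble’s implementation): simply withhold any OS version that is known to clash with something observed in the customer’s environment.

    from fnmatch import fnmatch

    # Hypothetical known-bad combinations: (array OS pattern, component, component version pattern).
    BLACKLIST = [
        ("3.6.*", "hypervisor",   "ESXi 5.5*"),   # e.g., a hypervisor bug triggered only by 3.6.x
        ("4.0.0", "nic_firmware", "7.10.*"),
    ]

    def allowed_upgrades(candidate_os_versions, environment):
        """Drop OS versions that clash with anything observed in this environment.

        environment: dict such as {"hypervisor": "ESXi 5.5 U2", "nic_firmware": "7.13.1"}.
        """
        ok = []
        for osv in candidate_os_versions:
            clash = any(
                fnmatch(osv, os_pat) and fnmatch(environment.get(comp, ""), comp_pat)
                for os_pat, comp, comp_pat in BLACKLIST
            )
            if not clash:
                ok.append(osv)
        return ok

    # This environment runs the affected hypervisor, so 3.6.x is withheld until the issue is resolved.
    print(allowed_upgrades(["3.6.2", "4.0.0", "4.0.1"],
                           {"hypervisor": "ESXi 5.5 U2", "nic_firmware": "7.13.1"}))
    # -> ['4.0.0', '4.0.1']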

Here’s an example of a customer that we won’t let upgrade until we help them resolve the underlying issue:

Blacklisting

Hardware upgrade recommendations

Another proactive thing InfoSight does is recommend not just software but also hardware upgrades, based on proprietary and highly intelligent trending analysis (and not on simple statistical models, per the Anscombe’s Quartet example earlier on).

Things like controller, cache, media and capacity are all examined, and customers see specific, easy-to-follow recommendations:

Upgrade recommendation

Notice we don’t say something vague like “not enough cache”. We explicitly say “add this much more cache.”

There is still enough cache in this example system for now, but with intelligent projections the recommended amount is calculated and presented. The customer can then easily take action without needing to call support or hire expensive consultants – and can act before things get too tight.
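
For intuition only, here is a toy Python sketch of how a trend can be turned into an actionable recommendation: project cache working-set growth forward and convert the shortfall into a concrete amount of cache to add. The numbers and the straight-line model are made up; InfoSight’s actual projections are far more sophisticated (per the Anscombe discussion earlier).

    from statistics import linear_regression

    def cache_recommendation(days, working_set_gb, installed_cache_gb,
                             horizon_days=180, headroom=1.2):
        """Project working-set growth and recommend how much cache to add (toy linear model)."""
        slope, intercept = linear_regression(days, working_set_gb)
        projected = slope * (days[-1] + horizon_days) + intercept
        shortfall = projected * headroom - installed_cache_gb
        return max(0, round(shortfall))

    # Hypothetical system: working set measured weekly, 960 GB of cache installed today.
    days = [0, 7, 14, 21, 28, 35, 42]
    working_set = [610, 640, 655, 690, 720, 745, 770]   # GB
    print(f"Add about {cache_recommendation(days, working_set, 960)} GB of cache")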

Other InfoSight Benefits

There are many other benefits, but this post is getting dangerously long so I will describe just a few of them in a list instead of showing screenshots and detailed descriptions for everything.

  • Improved case automation: About 90% of all cases are opened automatically, and a huge percentage of them are closed automatically
  • Accelerated code evolution: Things that take other companies years to detect and fix/optimize are now tackled in a few weeks or even days
  • App-aware reporting: Instead of reporting generically, be able to see detailed breakdowns by application type (and even different data types for the same application, for example Oracle data vs Oracle logs)
  • Insight into how real applications work: check here for an excellent paper.
  • Sizing: Using the knowledge of how applications work across the entire customer base, help array sizing become more accurate, even in the face of incomplete requirements data
  • Protection: tell customers what is protected via snaps/replication, including lag times
  • Visualizations: Various ways of showing heretofore difficult or impossible to see behaviors and relationships
  • Heuristics-based storage behavior: Using knowledge from InfoSight, teach the storage system how to behave in the face of complex issues, even when it’s not connected to InfoSight
  • Extension into cloud consumers: Provide visibility into AWS and Azure environments using Nimble Cloud Volumes (NCV).

And we are only getting started. Think of the kind of infrastructure automation that could become possible with this amount of intelligence, especially at large scale.

Regarding vendors that claim they will perform advanced predictive analytics inside their systems…

Really? They can spare that much computational power on their systems, and dedicate incredible amounts of spare capacity, and have figured out all the advanced algorithms necessary, and built all the right sensors in their code, and accurately cross-correlate across the entire infrastructure, and automatically consult with every other system on the planet, and help identify weird problems that are way outside the province of that vendor? Sold! Do they sell bridges, too?

There is only so much one can do inside the system, or even if adding a dedicated “monitoring server” in the infrastructure. This is why we elected to offload all this incredibly heavy processing to a massively parallel infrastructure in our cloud instead of relying on weak solipsistic techniques.

A Common Misconception

Instead of the usual summary, I will end this post rather differently.

Often, Nimble customers, resellers, competitors (and sometimes even a few of the less technical Nimble employees) think that the customer-visible GUI of InfoSight is all of InfoSight.

OneDoesNotSimplySeeInfoSight

InfoSight is actually the combination of the entire back end – machine learning, heuristics, proactive problem prevention, automation, the myriad internal visualization, correlation and data-massaging mechanisms – plus the customer-visible part.

Arguably, what we let customers see is a tiny, tiny sliver of the incredible richness of data we have available in our Big Data system. Internally we see all the data, and we use it to provide the reliability and support experience Nimble customers have come to love.

There’s a good reason we don’t expose the entire back end to customers.

For normal humans without extensive training, it would be a bit like seeing The Matrix code – hence the easy visualizations we provide for a subset of the data, with more coming online all the time (if you’re a customer, see whether you have access yet to the cool new Labs portion, found under the Manage tab).

So this (for instance, an app-aware histogram proving that SQL DBs do almost zero I/O at 32KB):

IO Histogram SQL

Or this (a heatmap showing a correlation of just sequential I/O, just for SQL, for the whole week – easy to spot patterns, for example at 0600 every day there’s a burst):

Heatmap

Instead of this:

Xmatrix

So, next time you see some other vendor showing a pretty graph, telling you “hey, we also have something like InfoSight! Ours looks even prettier!” – remember this section. It’s like saying “all cars have steering wheels, including the rusty Pinto on cinder blocks”.

They’re like kids arguing about a toy, while we are busy colonizing other planets…

KidsVscolonization

InfoSight is the largest extant database of IT infrastructure behavior. We have been collecting data for years, and we have more sensors than any other storage system I’m aware of. We have data on behavior as a function of all kinds of applications, workloads, types of client systems, hypervisor versions… and this existing data is part of what allows us to make intelligent decisions even when a specific situation doesn’t provide enough data on its own.

Even if one hypothetically assumes other vendors somehow manage to:

  1. Build the same quality and quantity of sensors and
  2. Build an equivalent back end and
  3. Manage to catch up with all the insane R&D we’ve done when it comes to intelligently analyzing things and automatically figuring out solutions to problems…

They still won’t have the sheer history, database size and quality Nimble has, simply because they haven’t been doing it as long and/or don’t have enough systems collecting data. It’s hard to extrapolate without a good starting point…

This stuff isn’t easy to do, folks.

And if you read all the way to this, congratulations, and apologies for the length. If it’s any consolation, it’s about a third of what it should have been to cover the subject properly…

D

Progress Needs Trailblazers

I got the idea for this stream-of-consciousness (and possibly too obvious) post after reading some comments regarding new ways to do high speed I/O. Not something boring like faster media or protocols, but rather far more exotic approaches that require drastically rewriting applications to get the maximum benefits of radically different architectures.

The comments, in essence, stated that such advancements would be an utter waste of time since 99% of customers have absolutely no need for such exotica, but rather just need cheap, easy, reliable stuff, and can barely afford to buy infrastructure to begin with, let alone custom-code applications to get every last microsecond of latency out of gear.

Why Trailblazers Matter

It’s an interesting balancing game: There’s a group of people that want progress and innovation at all costs, and another group of people that think the status quo is OK, or maybe just “give me cheaper/easier/more reliable and I’m happy”.

Both groups are needed, like Yin and Yang.

The risk-averse are needed because without them it would all devolve into chaos.

But without people willing to take risks and do seemingly crazy and expensive things, progress would simply be much harder.

Imagine if the people who prevailed were the ones who thought trepanning and the balancing of humours were state-of-the-art medicine needing no further advancement. Perchance the invention of MRI machines might have taken longer?

Trepanning

What if traditional planes were deemed sufficient, would a daring endeavor like the SR-71 have been undertaken?

Planes

Advanced Tech Filters Down to the Masses

It’s interesting how the crazy, expensive and “unnecessary” eventually manages to become part of everyday life. Think of such commonplace technologies like:

  • ABS brakes
  • SSDs
  • 4K TVs
  • High speed networking
  • Smartphones
  • Virtualization
  • Having more than 640KB of RAM in a PC

They all initially cost a ton of money and/or were impractical for everyday use (the first ABS brakes were in planes!).

What is Impractical or Expensive Now Will be Normal Tomorrow…

…but not every new tech will make it. Which is perfectly normal.

  • Rewriting apps to take advantage of things like containers, microservices and fast object storage? It will slowly happen. Most customers can simply wait for the app vendors to do it.
  • In-memory databases? Not a niche any more, when even the ubiquitous SQL Server is doing it…
  • Using advanced and crazy fast byte-addressable NVDIMM in storage solutions? Some mainstream vendors like Nimble and Microsoft are already doing it. Others are working on some really interesting implementations.
  • AI, predictive analytics, machine learning? Already happening and providing huge value (for example Nimble’s InfoSight).

Don’t Belittle the Risk Takers!

Even if you think the crazy stuff some people are working on isn’t a fit for you, don’t belittle it. It can be challenging to hold your anger, I understand, especially when the people working on crazy stuff are pushing their viewpoint as if it will change the world.

Because you never know, it just might.

If you don’t agree – get out of the way, or, even better, help find a better way.

Just Because it’s New to You, Doesn’t Mean it’s Really New, or Risky

Beware of turning into an IT Luddite.

Just because you aren’t using a certain technology, it doesn’t make the technology new or risky. If many others are using it, and reaping far more benefits than you with your current tech, then maybe they’re onto something… 🙂

At what point does your resistance to change become a competitive disadvantage?

Tempered by this question: is the cost of adopting the new technology outweighed by the benefits?

Stagnation is Death

…or, at a minimum, boring. Τα πάντα ρει (“everything flows”), as Heraclitus said. I, for one, am looking forward to the amazing innovations coming our way, and I actively participate in trying to make them come to fruition faster.

What will you do?

D

Practical Considerations for Implementing NVMe Storage

Before we begin, something needs to be clear: Although dual-ported NVMe drives are not yet cost effective, the architecture of Nimble Storage is NVMe-ready today. And always remember that in order to get good benefits from NVMe, one needs to implement it all the way from the client. Doing NVMe only at the array isn’t as effective.

In addition, Nimble already uses technology far faster than NVMe: Our write buffers use byte-addressable NVDIMM-N, sitting in a DIMM slot next to the CPU, instead of slower NVRAM HBAs or NVMe drives that other vendors use. Think about it: I/O happens at DDR4 RAM speeds, which makes even the fastest NVMe drive (or even NVDIMM accessed through NVMe) seem positively glacial.

nvdimm-n

I did want to share my personal viewpoint of where storage technology in general may be headed if NVMe is to be mass-adopted in a realistic fashion and without making huge sacrifices.

About NVMe

Lately, a lot of noise is being made about NVMe technology. The idea being that NVMe will be the next step in storage technology evolution. And, as is the natural order of things, new vendors are popping up to take advantage of this perceived opening.

For the uninitiated: NVMe is a relatively new standard that was created specifically for devices connected over a PCI bus. It has certain nice advantages vs SCSI such as reduced latency and improved IOPS. Sequential throughput can be significantly higher. It can be more CPU-efficient. It needs a small and simple driver, the standard requires only 13 commands, and it can also be used over some FC or Ethernet networks (NVMe over Fabrics). Going through a fabric only adds a small amount of extra latency to the stack compared to DAS.

NVMe is strictly an optimized block protocol, and not applicable to NAS/object platforms unless one is talking about their internal drives.

Due to the additional performance, NVMe drives are a no-brainer in systems like laptops and as internal/DAS devices in servers. Usually there is only a small number of devices (often just one), and no fancy data services run on something like a laptop… replacing the media with a better media+interface combination is simply a good idea.

For enterprise arrays though, the considerations are different.

NVMe Performance

Marketing has managed to confuse people regarding NVMe’s true performance. Tests illustrating NVMe performance show a single NVMe device being faster than a single SAS or SATA SSD. But storage arrays usually don’t have just a single device, so drive performance isn’t the bottleneck the way it is in systems with very low media counts.

In addition, most tests and research papers comparing NVMe to other technologies use wildly dissimilar SSD models – for instance, pitting a modern, ultra-high-end NVMe SSD against an older consumer SATA SSD with a totally different internal controller. This makes proper performance comparisons difficult: how much of the performance boost is due to NVMe, and how much is because the expensive, fancy SSD is simply a much better engineered device?

For instance, consider this chart of NVMe device latency, courtesy of Intel:

3dxpoint 

As you can see, NVMe as a drive connection protocol does offer better latency than SAS or SATA, but the difference is on the order of a few microseconds. The protocol differences become truly important only with next-gen technologies like 3D Xpoint, which ideally needs a memory interconnect to shine (or, at a minimum, PCI), since the media is so much faster than the usual NAND. But such media will be prohibitively expensive to use as the entire storage within an array for the foreseeable future, and at scale it would quickly be bottlenecked by the array CPUs.

NVMe over Fabrics

Additional latency savings will come from connecting clients using NVMe over Fabrics. By doing I/O over an RDMA network, a latency reduction of around 100 microseconds is possible versus encapsulated SCSI protocols like iSCSI, assuming all the right gear is in place (HBAs, switches, host drivers). Doing NVMe at the client side also helps with lowering CPU utilization, which can make client processing overall more efficient.

Where are the Bottlenecks?

The reality is that the main bottleneck in today’s leading modern AFAs is the controller itself and not the SSDs (simply because a couple of dozen modern SAS/SATA SSDs provide enough performance to saturate most systems). Moving to competent NVMe SSDs means those same controllers will now be saturated by maybe 10 NVMe SSDs. For example, a single NVMe drive may be able to read sequentially at 3GB/s, whereas a single SATA drive manages 500MB/s; putting 24 NVMe drives behind the controller doesn’t mean the controller will now magically deliver 72GB/s. In the same way, a single SATA SSD might do 100,000 small-block random read IOPS and an NVMe SSD with better innards 400,000 IOPS – again, it doesn’t mean that same controller with 24 devices will all of a sudden do 9.6 million IOPS!
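
A back-of-the-envelope sketch (with made-up but representative numbers) makes the point: delivered throughput is the minimum of what the media can supply and what the controllers can process, so multiplying per-drive specs by drive count says very little about array performance.

    def delivered_throughput(drive_count, per_drive_gb_s, controller_limit_gb_s):
        """Aggregate throughput is capped by whichever is lower: the media or the controller."""
        media_capability = drive_count * per_drive_gb_s
        return min(media_capability, controller_limit_gb_s)

    # Hypothetical controller that can process about 12 GB/s of sequential reads:
    print(delivered_throughput(24, 0.5, 12.0))   # 24 SATA SSDs: 12 GB/s, already controller-bound
    print(delivered_throughput(24, 3.0, 12.0))   # 24 NVMe SSDs: still 12 GB/s, not 72 GB/s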

How Tech is Adopted

Tech adoption comes in waves until a significant technology advancement is affordable and reliable enough to become pervasive. For instance, ABS brakes were first used in planes in 1929 and were too expensive and cumbersome to use in everyday cars. Today, most cars have ABS brakes and we take for granted the added safety they offer.

But consider this: What if someone told you that in order to get a new kind of car (that has several great benefits) you would have to utterly give up things like airbags, ABS brakes, all-wheel-drive, traction control, limited-slip differential? Without an equivalent replacement for these functions?

You would probably realize that you’re not that excited about the new car after all, no matter how much better than your existing car it might be in other key aspects.

Storage arrays follow a similar paradigm. There are several very important business reasons that make people ask for things like HA, very strong RAID, multi-level checksums, encryption, compression, data reduction, replication, snaps, clones, hot firmware updates. Or the ability to dynamically scale a system. Or comprehensive cross-stack analytics and automatic problem prevention.

Such features evolved over a long period of time, and help mitigate risk and accelerate business outcomes. They’re also not trivial to implement properly.

NVMe Arrays Today

The challenge I see with the current crop of ultra-fast NVMe over Fabrics arrays is that they’re so focused on speed that they ignore the aforementioned enterprise features in favor of sheer performance. I get it: it takes great skill, time and effort to reliably implement such features, especially in a way that doesn’t sap the performance potential of a system.

There is also a significant cost challenge in safely utilizing NVMe media en masse. Dual-ported SSDs are crucial for delivering proper HA, and current dual-ported NVMe SSDs tend to be very expensive per TB versus current SAS/SATA SSDs. In addition, due to the much higher speed of the NVMe interface, even with future CPUs that include FPGAs, many CPUs and PCI switches are needed to create a highly scalable system that can fully utilize such SSDs (while maintaining enterprise features), which further explains why most NVMe solutions using the more interesting devices tend to be rather limited.

There are also client-side challenges: Using NVMe over Fabrics can often mean purchasing new HBAs and switches, plus dealing with some compromises. For instance, in the case of RoCE, DCB switches are necessary, end-to-end congestion management is a challenge, and routability is not there until v2.

There’s a bright side: There actually exist some very practical ways to give customers the benefits of NVMe without taking away business-critical capabilities.

Realistic Paths to NVMe Adoption

We can divide the solution into two pieces; the direction chosen will then depend on customer readiness and component availability. All of the following assumes no loss of important enterprise functionality (as we discussed, giving up all the enterprise functionality is the easy way out when it comes to speed):

Scenario 1: Most customers are not ready to adopt host-side NVMe connectivity:

If this is the case, a good option would be a fast, byte-addressable device inside the controller to massively augment the RAM buffers (like 3D Xpoint in a DIMM) or, failing that, some next-gen NVMe drives to act as cache. That would provide an overall speed boost to clients without needing any client-side modifications. This approach would be the friendliest to an existing infrastructure (and a relatively economical enhancement for arrays), without requiring all internal drives to be NVMe or extensive array modifications.

You see, part of any competent array’s job is using intelligence to hide any underlying media issues from the end user. A good example: even super-fast SSDs can suffer from garbage collection latency incidents. A good system will smooth out the user experience so users won’t see extreme latency spikes. The chosen media and host interface are immaterial for this, but I bet if you were used to 100μs latencies and they suddenly spiked to 10ms for a while, it would be a bad day. Having an extra-large buffer in the array would help do this more easily, yet not need customers to change anything host-side.

An evolutionary second option would be to change all internal drives to NVMe, but making this practical requires wide availability of cost-effective dual-ported devices. Note that with low SSD counts (fewer than 12) this would provide speed benefits even if the customer doesn’t adopt a host-side NVMe interface, but it becomes a diminishing-returns endeavor at larger scale unless the controllers are significantly modified.

Scenario 2: Large numbers of customers are ready and willing to adopt NVMe over Fabrics.

In this case, the first thing that needs to change is the array connectivity to the outside world. That alone will boost speeds on modern systems even without major modifications. Of course, to be most effective this will often mean client and networking changes, and such changes can be costly.

The next step depends on the availability of cost-effective dual-ported NVMe devices. But in order for very large performance benefits to be realized, pretty big boosts to CPU and PCI switch counts may be required, necessitating bigger changes to storage systems (and increased costs).

Architecture Matters

In the quest for ultra-low latency and high throughput without sacrificing enterprise features (yet remaining reasonably cost-effective), overall architecture becomes extremely important.

For instance, how will one do RAID? Even with NVMe over Fabrics, approaches like erasure coding and triple mirroring can be costly from an infrastructure perspective. Erasure coding remains CPU-hungry (even more so when trying to hit ultra-low latencies), and triple mirroring across an RDMA fabric would mean massive extra traffic on that fabric.

Localized CPU:RAID domains remain more efficient, and mechanisms such as Nimble NCM can fairly distribute the load across multiple storage nodes without relying on a cluster network for heavy I/O. This technology is available today.

Next Steps

In summary, I urge customers to carefully consider the overall business impact of their storage decisions, especially when it comes to new technologies and protocols. Understand the true benefits first, carefully balance risk against desired outcome, and consider the overall system rather than just the components – hence this article. Just make sure that, in the quest for a certain ideal, you don’t give up critical functionality that you’ve been taking for granted.

Uncompromising Resiliency

(cross-posted at https://www.nimblestorage.com/blog/uncompromising-resiliency/)

The cardinal rule for enterprise storage systems is to never compromise when it comes to data integrity and resiliency.  Everything else, while important, is secondary.

Many storage consumers are not aware of what data integrity mechanisms are available or which ones are necessary to meet their protection expectations and requirements. It doesn’t help that a lot of the technologies and the errors they prevent are rather esoteric. However, if you want a storage system that safely stores your data and always returns it correctly, no measure is too extreme.

The Golden Rules of Storage Engineering

When architecting enterprise storage systems, there are three Golden Rules to follow.

In order of criticality:

  1. Integrity: Don’t ever return incorrect data
  2. Durability: Don’t ever lose any data
  3. Availability: Don’t ever lose access to data

To better understand the order, ask yourself, “what is preferred, temporary loss of access to data or the storage system returning the wrong data without anyone even knowing it’s wrong?”

Imagine life-or-death situations, where the wrong piece of information could have catastrophic consequences. Interestingly, vendors exist that focus heavily on Availability (even offering uptime “guarantees”) but are lacking in Integrity and Durability. Being able to access the array but getting corrupted data back is almost entirely useless. Consider modern storage arrays with data deduplication and/or multi-petabyte storage pools: the effects of corruption are far more severe now that a single stored block can represent the data of 1 to 100+ logical blocks, and data is spread across tens to hundreds of drives instead of a few.

The Nimble Storage Approach

Nimble Storage has taken a multi-stage approach to satisfy the Golden Rules, and in some cases, the amount of protection offered verges on being paranoid (but the good kind of paranoid).

Simply, Nimble employs these mechanisms:

  1. Integrity: Comprehensive multi-level checksums
  2. Durability: Hardened RAID protection and resilience upon power loss
  3. Availability: Redundant hardware coupled with predictive analytics

We will primarily focus on the first two, as they are often glossed over, assumed, or not well understood. Availability will be discussed in a separate blog post; however, it is good to mention a few details here.

To start, Nimble has greater than six nines measured uptime (more info here). This is measured across more than 9,000 customers using multiple generations of hardware and software. A key aspect of Nimble’s availability comes from InfoSight which continually improves and learns as more systems are used. Each week, trillions of data points are analyzed and processed with the goal of predicting and preventing issues, not just in the array, but across the entire infrastructure. 86% of issues are detected and automatically resolved before the customer is even aware of the problem. To further enhance this capability, Nimble’s Technical Support Engineers can resolve issues faster as they have all the data available when an issue arises. This bypasses the hours-days-weeks often required to collect data, send to support, analyze, repeat – until a solution can be found.

Data Integrity Mechanisms in Detail

The goal is simple: What is read must always match what was written. And, if it doesn’t, we fix it on the fly.

What many people don’t realize is there are occasions where storage media will lose a write, corrupt it or place it at the wrong location on the media. RAID (including 3-way mirroring) or Erasure Coding are not enough to protect against such issues. The older T10 PI employed by some systems is also not enough to protect against all eventualities.

The solution involves checksums, which get more computationally intensive the more paranoid one is. Because checksums are computationally expensive, certain vendors either don’t employ them or employ them only minimally in order to gain more performance or a faster time to market. Unfortunately, that trade-off can lead to data corruption.

Broadly, Nimble creates a checksum and a “self-ID” for each piece of data. The checksum protects against data corruption. The self-ID protects against lost/misplaced writes and misdirected reads (incredible as it may seem, these things happen enough to warrant this level of protection).

For instance, if the written data has a checksum, and corruption occurs, when the data is read and checksummed again, the checksums will not match. However, if instead the data was placed at an incorrect location on the media, the checksums will match, but the self-IDs will not match.

checksums
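
Here is a minimal sketch of the concept – not Nimble’s on-disk format, and using a simple CRC where a real array would use something stronger. Each block is stored with a checksum of its payload plus a self-ID recording where the block was supposed to land, so a read can distinguish “the data is corrupt” from “the data is intact but it isn’t the block I asked for”:

    import zlib

    def write_block(media, lba, payload):
        """Store payload together with a checksum and a self-ID (the intended LBA)."""
        media[lba] = {"payload": payload, "crc": zlib.crc32(payload), "self_id": lba}

    def read_block(media, lba):
        blk = media.get(lba)
        if blk is None:
            raise IOError(f"lost write: nothing at LBA {lba}")
        if zlib.crc32(blk["payload"]) != blk["crc"]:
            raise IOError(f"corruption at LBA {lba}: checksum mismatch")
        if blk["self_id"] != lba:
            raise IOError(f"misplaced write: LBA {lba} holds block intended for {blk['self_id']}")
        return blk["payload"]

    media = {}
    write_block(media, 100, b"some application data")
    media[200] = media.pop(100)      # simulate a write landing at the wrong location
    try:
        read_block(media, 200)
    except IOError as e:
        print(e)                     # the checksum passes, but the self-ID check catches it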

Where it gets interesting:

Nimble doesn’t just do block-level checksums/IDs. These multi-level checksums are also performed:

  1. Per segment in each write stripe
  2. Per block, before and after compression
  3. Per snapshot (including all internal housekeeping snaps)
  4. For replication
  5. For all data movement within a cluster
  6. All data and metadata in NVRAM

This way, every likely data corruption event is covered, including metadata consistency and replication issues, which are often overlooked.

Durability Mechanisms in Detail

There are two kinds of data on a storage system and both need to be protected:

  1. Data in flight
  2. Data on persistent storage

One may differentiate between user data and metadata but we protect both with equal paranoid fervor. Some systems try to accelerate operations by not protecting metadata sufficiently, which greatly increases risk. This is especially true with deduplicating systems, where metadata corruption can mean losing everything!

Data in flight is data that is not yet committed to persistent storage. Nimble ensures all critical data in flight is checksummed and committed to both RAM and an ultra-fast byte-addressable NVDIMM-N memory module sitting right on the motherboard. The NVDIMM-N is mirrored to the partner controller and both controller NVDIMMs are protected against power loss via a supercapacitor. In the event of a power loss, the NVDIMMs simply flush their contents to flash storage. This approach is extremely reliable and doesn’t need inelegant solutions like a built-in UPS.

Data on persistent storage is protected by what we call Triple+ Parity RAID, which is three orders of magnitude more resilient than RAID6 (and RAID6, for comparison, is three orders of magnitude more resilient than RAID5). The “+” sign means there is extra intra-drive parity that can safeguard against entire sectors being lost even if three whole drives fail in a single RAID group.

Some might say this is a bit much; however, with drive sizes increasing rapidly (especially for SSDs) and drive read error rates increasing as drives age, it was the architecturally correct choice to make.
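
For intuition only, here is a toy reliability model. It assumes independent drive failures and a made-up per-drive failure probability over some exposure window, and it ignores rebuild windows, correlated failures and unrecoverable read errors, so treat the output as illustrative rather than as Nimble’s published figures. The point it makes: each additional tolerated failure pushes the probability of losing data down by roughly another order of magnitude or more.

    from math import comb

    def p_too_many_failures(n_drives, tolerated, p_drive):
        """P(more than `tolerated` of n drives fail), assuming independent failures."""
        return sum(comb(n_drives, k) * p_drive**k * (1 - p_drive)**(n_drives - k)
                   for k in range(tolerated + 1, n_drives + 1))

    # Hypothetical 24-drive group, 1% chance a given drive fails within the exposure window:
    for parity, label in [(1, "single parity"), (2, "dual parity"), (3, "triple parity")]:
        print(f"{label}: {p_too_many_failures(24, parity, 0.01):.2e}")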

In Summary

Users frequently assume that all storage systems will safely store their data. And they will, most of the time. But when it comes to your data, “most of the time” isn’t good enough. No measure should be considered too extreme. When looking for a storage system, it’s worth taking the time to understand all situations where your data could be compromised. And, if nothing else, it’s worth choosing a vendor who is paranoid and goes to extremes to keep your data safe.

D

The Importance of SSD Firmware Updates

I wanted to bring this crucial issue to light since I’m noticing several storage vendors being either cavalier about this or simply unaware.

I will explain why solutions that don’t offer some sort of automated, live SSD firmware update mechanism are potentially extremely risky propositions. Yes, this is another “vendor hat off, common sense hat on” type of post.

Modern SSD Architecture is Complex

The increased popularity and lower costs of fast SSD media are good things for storage users, but there is some inherent complexity within each SSD that many people are unaware of.

Each modern SSD is, in essence, an entire pocket-sized storage array that includes, among other things:

  • An I/O interface to the outside world (often two)
  • A CPU
  • An OS
  • Memory
  • Sometimes Compression and/or Encryption
  • What is, in essence, a log-structured filesystem, complete with complex load balancing and garbage collection algorithms
  • An array of flash chips driven in parallel through multiple channels
  • Some sort of RAID protection for the flash chips, including sparing, parity, error checking and correction…
  • A supercapacitor to safely flush cache to the flash chips in case of power failure.

Sound familiar?

With Great Power and Complexity Come Bugs

To make something clear: This discussion has nothing to do with overall SSD endurance & hardware reliability. Only the software aspect of the devices.

All this extra complexity in modern SSDs means that an increased number of bugs compared to simpler storage media is a statistical certainty. There is just a lot going on in these devices.

Bugs aren’t necessarily the end of the world. They’re something understood, a fact of life, and there’s this magical thing engineers thought of called… Patching!

As a fun exercise, go to the firmware download pages of various popular SSDs and check the release notes for some of the bugs fixed. Many fixes address some rather abject gibbering horrors… 🙂

Even costlier enterprise SSDs have been afflicted by some really dangerous bugs – usually latent defects (as in: they don’t surface until you’ve been using something for a while, which may explain why these bugs were missed by QA).

I fondly remember a bug that hit some arrays at a previous place of employment: the SSDs would work great, but after a certain number of hours of operation, if you shut your machine down, the SSDs would never come up again. Or another bug, hitting a very popular SSD, that would downsize the drive to an awesome 8MB of capacity (losing all existing data, of course) once certain conditions were met.

Clearly, these are some pretty hairy situations. And, what’s more, RAID, checksums and node-level redundancy wouldn’t protect against all such bugs.

For instance, think of the aforementioned power off bug – all SSDs of the same firmware vintage would be affected simultaneously and the entire array would have zero SSDs that functioned. This actually happened, I’m not talking about a theoretical possibility. You know, just in case someone starts saying “but SSDs are reliable, and think of all the RAID!”

It’s all about approaching correctness from a holistic point of view. Multiple lines of defense are necessary.

The Rules: How True Enterprise Storage Deals with Firmware

Just like with Fight Club, there are some basic rules storage systems need to follow when it comes to certain things.

  1. Any firmware patching should be a non-event. Doesn’t matter what you’re updating, there should be no downtime.
  2. ANY firmware patching should be a NON-EVENT. Doesn’t matter what you’re updating, there should be NO downtime!
  3. Firmware updates should be automated even when dealing with devices en masse.
  4. The customer should automatically be notified of important updates they need to perform.
  5. Different vintage and vendor component updates should be handled automatically and centrally. And, most importantly: Safely.

If these rules are followed, bug risks are significantly mitigated and higher uptime is possible. Enterprise arrays typically will follow the above rules (but always ask the vendor).

Why Firmware Updating is a Challenge with Some Storage Solutions

Certain kinds of solutions make it inherently harder to manage critical tasks like component firmware updates.

You see, being able to hot-update different kinds of firmware in any given set of hardware means that the mechanism doing the updating must be intimately familiar with the underlying hardware & software combination, however complex.

Consider the following kind of solution, maybe for someone sold on the idea that white box approaches are the future:

  • They buy a bunch of diskless server chassis from Vendor A
  • They buy a bunch of SSDs from Vendor B
  • They buy some Software Defined Storage offering from Vendor C
  • All running on the underlying OS of Vendor D…

Now, let’s say Vendor B has an emergency SSD firmware fix they made available, easily downloadable on their website. Here are just some of the challenges:

  1. How will that customer be notified by Vendor B that such a critical fix is available?
  2. Once they have the fix located, which Vendor will automate updating the firmware on the SSDs of Vendor B, and how?
  3. How does the customer know that Vendor B’s firmware fix doesn’t violently clash with something from Vendor A, C or D?
  4. How will all that affect the data-serving functionality of Vendor C?
  5. Can any of Vendors A, B, C or D orchestrate all the above safely?
  6. With no downtime?

In most cases I’ve seen, the above chain of events will not even progress past #1. The user will simply be unaware of any update, because component vendors don’t usually have a mechanism that alerts individual customers about available firmware.
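
As a purely hypothetical sketch (no real vendor feed, API or firmware versions are implied), this is the kind of automated advisory matching an integrated vendor can perform on the customer’s behalf, and which rarely happens in a multi-vendor DIY stack: compare the installed drive inventory against known critical firmware advisories and flag anything affected.

    # Hypothetical advisory feed: (drive model, affected firmware versions, fixed version).
    ADVISORIES = [
        ("ACME-SSD-1600", {"FW101", "FW102"}, "FW110"),   # e.g., a power-cycle data-loss bug
    ]

    def affected_drives(inventory):
        """inventory: list of dicts like {"slot": 3, "model": "ACME-SSD-1600", "firmware": "FW102"}."""
        flagged = []
        for drive in inventory:
            for model, bad_versions, fixed in ADVISORIES:
                if drive["model"] == model and drive["firmware"] in bad_versions:
                    flagged.append((drive["slot"], model, drive["firmware"], fixed))
        return flagged

    inventory = [
        {"slot": 1, "model": "ACME-SSD-1600", "firmware": "FW110"},
        {"slot": 2, "model": "ACME-SSD-1600", "firmware": "FW102"},
    ]
    for slot, model, fw, fixed in affected_drives(inventory):
        print(f"slot {slot}: {model} on {fw} needs a non-disruptive update to {fixed}")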

You could inject a significant permutation here: what if you buy the servers pre-built, including SSDs, from Vendor A, with full certification for Vendors C and D?

Sure – but it still does not materially change the steps above. One of Vendors A, C or D still needs to somehow:

  1. Automatically alert the customer about the critical SSD firmware fix being available
  2. Be able to non-disruptively update the firmware…
  3. …While not clashing with the other hardware and software from Vendors A, C and D

I could expand this type of conversation to other things like overall environmental monitoring and checksums but let’s keep it simple for now and focus on just component firmware updates…

Always Remember – Solve Business Problems & Balance Risk

Any solution is a compromise. Always make sure you are comfortable with the added risk certain areas of compromise bring (and that you are fully aware of said risk).

The allure of certain approaches can be significant (at the very least because of lower promised costs). It’s important to maintain a balance between increased risk and business benefit.

In the case of SSDs specifically, the utter criticality of certain firmware updates means that it’s crucially important for any given storage solution to be able to safely and automatically address the challenge of updating SSD firmware.

D