Practical Considerations for Implementing NVMe Storage

Before we begin, something needs to be clear: Although dual-ported NVMe drives are not yet cost-effective, the architecture of Nimble Storage is NVMe-ready today. And always remember that to get the full benefit of NVMe, one needs to implement it end to end, starting at the client. Doing NVMe only at the array isn’t as effective.

In addition, Nimble already uses technology far faster than NVMe: Our write buffers use byte-addressable NVDIMM-N, instead of slower NVRAM HBAs or NVMe drives that other vendors use. Think about it: I/O happens at DDR4 RAM speeds, which makes even the fastest NVMe drive seem positively glacial.

I did want to share my personal viewpoint of where storage technology in general may be headed if NVMe is to be mass-adopted in a realistic fashion and without making huge sacrifices.

About NVMe

Lately, a lot of noise is being made about NVMe technology, the idea being that NVMe will be the next step in storage technology evolution. And, as is the natural order of things, new vendors are popping up to take advantage of this perceived opening.

For the uninitiated: NVMe is a relatively new standard that was created specifically for devices connected over a PCI Express (PCIe) bus. It has certain nice advantages vs SCSI, such as reduced latency and improved IOPS. Sequential throughput can be significantly higher. It can be more CPU-efficient. It needs only a small and simple driver, the standard requires only 13 commands, and it can also be used over some FC or Ethernet networks (NVMe over Fabrics). Going through a fabric adds only a small amount of extra latency to the stack compared to DAS.

NVMe is strictly an optimized block protocol, and not applicable to NAS/object platforms unless one is talking about their internal drives.

Due to the additional performance, NVMe drives are a no-brainer in systems like laptops and as DAS/internal storage in servers. Usually there is only a small number of devices (often just one) and no fancy data services are running on something like a laptop… replacing the media with better media+interface is a good idea.

For enterprise arrays though, the considerations are different.

NVMe Performance

Marketing has managed to confuse people regarding NVMe’s true performance. It’s important to note that tests illustrating NVMe performance show a single NVMe device being faster than a single SAS or SATA SSD. But storage arrays usually don’t have a single device and so drive performance isn’t the bottleneck as it is with low media count systems.

In addition, most tests and research papers comparing NVMe to other technologies use wildly dissimilar SSD models. For instance, pitting a modern, ultra-high-end NVMe SSD against an older consumer SATA SSD with a totally different internal controller. This can make proper performance comparisons difficult. How much of the performance boost is due to NVMe and how much because the expensive, fancy SSD is just a much better engineered device?

For instance, consider this chart of NVMe device latency, courtesy of Intel:

[Figure: NVMe device latency chart, courtesy of Intel]

As you can see, NVMe as a drive connection protocol will offer better latency than SAS or SATA, but the difference is on the order of a few microseconds. The protocol differences become truly important only with next-gen technologies like 3D XPoint, which ideally needs a memory interconnect to shine (or, at a minimum, PCIe) since the media is so much faster than the usual NAND. But such media will be too expensive to use as the entire storage within an array in the foreseeable future, and would quickly be bottlenecked by the array CPUs at scale.

NVMe over Fabrics

Additional latency savings will come from connecting clients using NVMe over Fabrics. By doing I/O over an RDMA network, a latency reduction of around 100 microseconds is possible versus encapsulated SCSI protocols like iSCSI, assuming all the right gear is in place (HBAs, switches, host drivers). Doing NVMe at the client side also helps with lowering CPU utilization, which can make client processing overall more efficient.

Where are the Bottlenecks?

The reality is that the main bottleneck in today’s leading modern AFAs is the controller itself and not the SSDs (simply because there is enough performance in just a couple of dozen modern SAS/SATA SSDs to saturate most systems). Moving to competent NVMe SSDs will mean that those same controllers will now be saturated by maybe 10 NVMe SSDs. For example, a single NVMe drive may be able to read sequentially at 3GB/s, whereas a single SATA drive manages 500MB/s. Putting 24 NVMe drives in the controller doesn’t mean that the controller will magically deliver 72GB/s. In the same way, a single SATA SSD might be able to do 100,000 small-block random read IOPS and an NVMe drive with better innards 400,000 IOPS. Again, it doesn’t mean that the same controller with 24 devices will all of a sudden do 9.6 million IOPS!
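To make the arithmetic concrete, here is a minimal sketch in Python. The per-drive figures are the ones quoted above; the controller ceilings (`controller_gbps`, `controller_iops`) are made-up placeholders purely for illustration, not measurements of any real array.

```python
def deliverable(drive_count, per_drive, controller_ceiling):
    """Return (raw media total, what the controller can actually deliver)."""
    raw = drive_count * per_drive
    return raw, min(raw, controller_ceiling)

# Per-drive numbers from the text above.
NVME_GBPS = 3.0
NVME_IOPS = 400_000

# Hypothetical controller ceilings, for illustration only.
controller_gbps = 20.0
controller_iops = 1_500_000

raw_bw, real_bw = deliverable(24, NVME_GBPS, controller_gbps)
raw_iops, real_iops = deliverable(24, NVME_IOPS, controller_iops)

print(f"throughput: media {raw_bw:.0f} GB/s, deliverable {real_bw:.0f} GB/s")
print(f"IOPS: media {raw_iops:,}, deliverable {real_iops:,}")
```

Whatever the real ceiling is for a given controller, the shape of the result is the same: once the media total exceeds it, adding faster drives changes the first number and not the second.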

How Tech is Adopted

Tech adoption comes in waves until a significant technology advancement is affordable and reliable enough to become pervasive. For instance, ABS brakes were first used in planes in 1929 and were too expensive and cumbersome to use in everyday cars. Today, most cars have ABS brakes and we take for granted the added safety they offer.

But consider this: What if someone told you that in order to get a new kind of car (that has several great benefits) you would have to utterly give up things like airbags, ABS brakes, all-wheel-drive, traction control, limited-slip differential? Without an equivalent replacement for these functions?

You would probably realize that you’re not that excited about the new car after all, no matter how much better than your existing car it might be in other key aspects.

Storage arrays follow a similar paradigm. There are several very important business reasons that make people ask for things like HA, very strong RAID, multi-level checksums, encryption, compression, data reduction, replication, snaps, clones, hot firmware updates. Or the ability to dynamically scale a system. Or comprehensive cross-stack analytics and automatic problem prevention.

Such features evolved over a long period of time, and help mitigate risk and accelerate business outcomes. They’re also not trivial to implement properly.

NVMe Arrays Today

The challenge I see with the current crop of ultra-fast NVMe over Fabrics arrays is that they’re so focused on speed that they ignore the aforementioned enterprise features in favor of sheer performance. I get it: it takes great skill, time and effort to reliably implement such features, especially in a way that doesn’t strip the performance potential of a system.

There is also a significant cost challenge in order to safely utilize NVMe media en masse. Dual-ported SSDs are crucial in order to deliver proper HA. Current dual-ported NVMe SSDs tend to be very expensive per TB vs current SAS/SATA SSDs. In addition, due to the much higher speed of the NVMe interface, even with future CPUs that include FPGAs, many CPUs and PCI switches are needed to create a highly scalable system that can fully utilize such SSDs (and maintain enterprise features), which further explains why most NVMe solutions using the more interesting devices tend to be rather limited.

There are also client-side challenges: Using NVMe over Fabrics can often mean purchasing new HBAs and switches, plus dealing with some compromises. For instance, in the case of RoCE, DCB switches are necessary, end-to-end congestion management is a challenge, and routability is not there until v2.

There’s a bright side: There actually exist some very practical ways to give customers the benefits of NVMe without taking away business-critical capabilities.

Realistic Paths to NVMe Adoption

We can divide the solution into two scenarios; the direction chosen will then depend on customer readiness and component availability. All the following assumes no loss of important enterprise functionality (as we discussed, giving up on all the enterprise functionality is the easy way out when it comes to speed):

Scenario 1: Most customers are not ready to adopt host-side NVMe connectivity:

If this is the case, a good option would be to have something like an ultra-fast byte-addressable device inside the controller to massively augment the RAM buffers (like 3D XPoint in a DIMM), or, if not available, some next-gen NVMe drives to act as cache. That would provide an overall speed boost to the clients and not need any client-side modifications. This approach would be the most friendly to an existing infrastructure (and a relatively economical enhancement for arrays) without needing all internal drives to be NVMe or extensive array modifications.

You see, part of any competent array’s job is using intelligence to hide any underlying media issues from the end user. A good example: even super-fast SSDs can suffer from garbage collection latency incidents. A good system will smooth out the user experience so users won’t see extreme latency spikes. The chosen media and host interface are immaterial for this, but I bet if you were used to 100μs latencies and they suddenly spiked to 10ms for a while, it would be a bad day. Having an extra-large buffer in the array would help do this more easily, yet not need customers to change anything host-side.

An evolutionary second option would be to change all internal drives to NVMe, but making this practical would require wide availability of cost-effective dual-ported devices. Note that with low SSD counts (fewer than 12) this would provide speed benefits even if the customer doesn’t adopt a host-side NVMe interface, but it will be a diminishing-returns endeavor at larger scale, unless the controllers are significantly modified.

Scenario 2: Large numbers of customers are ready and willing to adopt NVMe over Fabrics.

In this case, the first thing that needs to change is the array connectivity to the outside world. That alone will boost speeds on modern systems even without major modifications. Of course, this will often mean client and networking changes to be most effective, and often such changes can be costly.

The next step depends on the availability of cost-effective dual-ported NVMe devices. But in order for very large performance benefits to be realized, pretty big boosts to CPU and PCI switch counts may be necessary, necessitating bigger changes to storage systems (and increased costs).

Architecture Matters

In the quest for ultra-low latency and high throughput without sacrificing enterprise features (yet remaining reasonably cost-effective), overall architecture becomes extremely important.

For instance, how will one do RAID? Even with NVMe over Fabrics, approaches like erasure coding and triple mirroring can be costly from an infrastructure perspective. Erasure coding remains CPU-hungry (even more so when trying to hit ultra-low latencies), and triple mirroring across an RDMA fabric would mean massive extra traffic on that fabric.

Localized CPU:RAID domains remain more efficient, and mechanisms such as Nimble NCM can fairly distribute the load across multiple storage nodes without relying on a cluster network for heavy I/O. This technology is available today.

Next Steps

In summary, I urge customers to carefully consider the overall business impact of their storage decisions, especially when it comes to new technologies and protocols. Understand the true benefits first. Carefully balance risk with desired outcome, and consider the overall system and not just the components. Of course, one needs to understand the risks vs rewards first, hence this article. Just make sure that, in order to achieve a certain ideal, you don’t give up on critical functionality that you’ve been taking for granted.

Uncompromising Resiliency

(cross-posted at https://www.nimblestorage.com/blog/uncompromising-resiliency/)

The cardinal rule for enterprise storage systems is to never compromise when it comes to data integrity and resiliency.  Everything else, while important, is secondary.

Many storage consumers are not aware of what data integrity mechanisms are available or which ones are necessary to meet their protection expectations and requirements. It doesn’t help that a lot of the technologies and the errors they prevent are rather esoteric. However, if you want a storage system that safely stores your data and always returns it correctly, no measure is too extreme.

The Golden Rules of Storage Engineering

When architecting enterprise storage systems, there are three Golden Rules to follow.

In order of criticality:

  1. Integrity: Don’t ever return incorrect data
  2. Durability: Don’t ever lose any data
  3. Availability: Don’t ever lose access to data

To better understand the order, ask yourself, “what is preferred, temporary loss of access to data or the storage system returning the wrong data without anyone even knowing it’s wrong?”

Imagine life-or-death situations, where the wrong piece of information could have catastrophic consequences. Interestingly, vendors exist that focus a lot on Availability (even offering uptime “guarantees”) but are lacking in Integrity and Durability. Being able to access an array whose data is corrupt is almost entirely useless. Consider modern storage arrays with data deduplication and/or multi-petabyte storage pools. The effects are far more severe now that a single block represents the data for 1-100+ blocks and data is spread across tens to hundreds of drives instead of a few drives.

The Nimble Storage Approach

Nimble Storage has taken a multi-stage approach to satisfy the Golden Rules, and in some cases, the amount of protection offered verges on being paranoid (but the good kind of paranoid).

Simply, Nimble employs these mechanisms:

  1. Integrity: Comprehensive multi-level checksums
  2. Durability: Hardened RAID protection and resilience upon power loss
  3. Availability: Redundant hardware coupled with predictive analytics

We will primarily focus on the first two as they are often glossed over, assumed, or not well understood. Availability will be discussed in a separate blog post, but it is worth mentioning a few details here.

To start, Nimble has achieved greater than 99.9997% measured uptime since 2014. This is measured across more than 9,000 customers using multiple generations of hardware and software. A key aspect of Nimble’s availability comes from InfoSight which continually improves and learns as more systems are used. Each week, trillions of data points are analyzed and processed with the goal of predicting and preventing issues, not just in the array, but across the entire infrastructure. 86% of issues are detected and automatically resolved before the customer is even aware of the problem. To further enhance this capability, Nimble’s Technical Support Engineers can resolve issues faster as they have all the data available when an issue arises. This bypasses the hours-days-weeks often required to collect data, send to support, analyze, repeat – until a solution can be found.

Data Integrity Mechanisms in Detail

The goal is simple: What is read must always match what was written. And, if it doesn’t, we fix it on the fly.

What many people don’t realize is there are occasions where storage media will lose a write, corrupt it or place it at the wrong location on the media. RAID (including 3-way mirroring) or Erasure Coding are not enough to protect against such issues. The older T10 PI employed by some systems is also not enough to protect against all eventualities.

The solution involves using checksums, which get more computationally intensive the more paranoid one is. Because of that cost, certain vendors don’t employ them, or employ them only minimally, to gain more performance or a faster time to market. Unfortunately, the trade-off can lead to data corruption.

Broadly, Nimble creates a checksum and a “self-ID” for each piece of data. The checksum protects against data corruption. The self-ID protects against lost/misplaced writes and misdirected reads (incredible as it may seem, these things happen enough to warrant this level of protection).

For instance, if the written data has a checksum, and corruption occurs, when the data is read and checksummed again, the checksums will not match. However, if instead the data was placed at an incorrect location on the media, the checksums will match, but the self-IDs will not match.
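Here is a minimal sketch of that idea in Python. It is purely illustrative and not Nimble’s implementation: the choice of CRC32, the `(volume, lba)` tuple used as a self-ID, and the names `StoredBlock`, `write_block` and `read_block` are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass

@dataclass
class StoredBlock:
    payload: bytes
    checksum: int   # CRC32 of the payload, computed at write time
    self_id: tuple  # (volume, logical block address) the data belongs to

def write_block(media, location, volume, lba, payload):
    """Store the payload together with its checksum and self-ID."""
    media[location] = StoredBlock(payload, zlib.crc32(payload), (volume, lba))

def read_block(media, location, volume, lba):
    """Classify what we find at `location` when we expect (volume, lba)."""
    block = media[location]
    if zlib.crc32(block.payload) != block.checksum:
        return "corrupt"     # the data changed after it was written
    if block.self_id != (volume, lba):
        return "misplaced"   # intact data, but it belongs somewhere else
    return "ok"

media = {}
write_block(media, 7, "vol1", 100, b"hello")
status_ok = read_block(media, 7, "vol1", 100)

# Simulate corruption: the payload changes but the stored checksum does not.
media[7].payload = b"hellp"
status_corrupt = read_block(media, 7, "vol1", 100)

# Simulate a misplaced write: intact data for LBA 200 lands where LBA 100 lives.
write_block(media, 7, "vol1", 200, b"other")
status_misplaced = read_block(media, 7, "vol1", 100)

print(status_ok, status_corrupt, status_misplaced)
```

The third case is the interesting one: the checksum alone is perfectly happy, and only the self-ID reveals that the wrong data is sitting at that location.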

Where it gets interesting:

Nimble doesn’t just do block-level checksums/IDs. These multi-level checksums are also performed:

  1. Per segment in each write stripe
  2. Per block, before and after compression
  3. Per snapshot (including all internal housekeeping snaps)
  4. For replication
  5. For all data movement within a cluster
  6. All data and metadata in NVRAM

This way, every likely data corruption event is covered, including metadata consistency and replication issues, which are often overlooked.

Durability Mechanisms in Detail

There are two kinds of data on a storage system and both need to be protected:

  1. Data in flight
  2. Data on persistent storage

One may differentiate between user data and metadata but we protect both with equal paranoid fervor. Some systems try to accelerate operations by not protecting metadata sufficiently, which greatly increases risk. This is especially true with deduplicating systems, where metadata corruption can mean losing everything!

Data in flight is data that is not yet committed to persistent storage. Nimble ensures all critical data in flight is checksummed and committed to both RAM and an ultra-fast byte-addressable NVDIMM-N memory module sitting right on the motherboard. The NVDIMM-N is mirrored to the partner controller and both controller NVDIMMs are protected against power loss via a supercapacitor. In the event of a power loss, the NVDIMMs simply flush their contents to flash storage. This approach is extremely reliable and doesn’t need inelegant solutions like a built-in UPS.

Data on persistent storage is protected by what we call Triple+ Parity RAID, which is three orders of magnitude more resilient than RAID6. For comparison, RAID6 is three orders of magnitude more resilient than RAID5. The “+” sign means that there is extra intra-drive parity that can safeguard against entire sectors being lost even if three whole drives fail in a single RAID group.

Some might say this is a bit much. However, with drive sizes increasing rapidly (especially SSDs) and drive read error rates increasing as drives age, it was the architecturally correct choice to make.

In Summary

Users frequently assume that all storage systems will safely store their data. And they will, most of the time. But when it comes to your data, “most of the time” isn’t good enough. No measure should be considered too extreme. When looking for a storage system, it’s worth taking the time to understand all situations where your data could be compromised. And, if nothing else, it’s worth choosing a vendor who is paranoid and goes to extremes to keep your data safe.

D

The Importance of SSD Firmware Updates

I wanted to bring this crucial issue to light since I’m noticing several storage vendors being either cavalier about this or simply unaware.

I will explain why solutions that don’t offer some sort of automated, live SSD firmware update mechanism are potentially extremely risky propositions. Yes, this is another “vendor hat off, common sense hat on” type of post.

Modern SSD Architecture is Complex

The increased popularity and lower costs of fast SSD media are good things for storage users, but there is some inherent complexity within each SSD that many people are unaware of.

Each modern SSD is, in essence, an entire pocket-sized storage array, that includes, among other things:

  • An I/O interface to the outside world (often two)
  • A CPU
  • An OS
  • Memory
  • Sometimes Compression and/or Encryption
  • What is, in essence, a log-structured filesystem, complete with complex load balancing and garbage collection algorithms
  • An array of flash chips driven in parallel through multiple channels
  • Some sort of RAID protection for the flash chips, including sparing, parity, error checking and correction…
  • A supercapacitor to safely flush cache to the flash chips in case of power failure.

Sounds familiar?

With Great Power and Complexity Come Bugs

To make something clear: This discussion has nothing to do with overall SSD endurance & hardware reliability. Only the software aspect of the devices.

All this extra complexity in modern SSDs means that an increased number of bugs compared to simpler storage media is a statistical certainty. There is just a lot going on in these devices.

Bugs aren’t necessarily the end of the world. They’re something understood, a fact of life, and there’s this magical thing engineers thought of called… Patching!

As a fun exercise, go to the firmware download pages of various popular SSDs and check the release notes for some of the bugs fixed. Many fixes address some rather abject gibbering horrors… 🙂

Even costlier enterprise SSDs have been afflicted by some really dangerous bugs – usually latent defects (as in: they don’t surface until you’ve been using something for a while, which may explain why these bugs were missed by QA).

I fondly remember a bug that hit some arrays at a previous place of employment: the SSDs would work great but after a certain number of hours of operation, if you shut your machine down, the SSDs would never come up again. Or, another bug that hit a very popular SSD that would downsize itself to an awesome 8MB of capacity (losing all existing data of course) once certain conditions were met.

Clearly, these are some pretty hairy situations. And, what’s more, RAID, checksums and node-level redundancy wouldn’t protect against all such bugs.

For instance, think of the aforementioned power off bug – all SSDs of the same firmware vintage would be affected simultaneously and the entire array would have zero SSDs that functioned. This actually happened, I’m not talking about a theoretical possibility. You know, just in case someone starts saying “but SSDs are reliable, and think of all the RAID!”

It’s all about approaching correctness from a holistic point of view. Multiple lines of defense are necessary.

The Rules: How True Enterprise Storage Deals with Firmware

Just like with Fight Club, there are some basic rules storage systems need to follow when it comes to certain things.

  1. Any firmware patching should be a non-event. Doesn’t matter what you’re updating, there should be no downtime.
  2. ANY firmware patching should be a NON-EVENT. Doesn’t matter what you’re updating, there should be NO downtime!
  3. Firmware updates should be automated even when dealing with devices en masse.
  4. The customer should automatically be notified of important updates they need to perform.
  5. Different vintage and vendor component updates should be handled automatically and centrally. And, most importantly: Safely.

If these rules are followed, bug risks are significantly mitigated and higher uptime is possible. Enterprise arrays typically will follow the above rules (but always ask the vendor).
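As a rough illustration of what rules 3 and 5 imply, here is a small Python sketch of a rolling update that touches one drive at a time and refuses to proceed if redundancy is already used up. Everything in it is hypothetical: the drive dictionaries, the `is_compatible` callback and the `parity_drives` budget are stand-ins for whatever a real array tracks, and the actual flashing step is reduced to a single assignment.

```python
def rolling_update(drives, target_fw, is_compatible, parity_drives=1):
    """Update drives one at a time without exceeding the redundancy budget."""
    log = []
    for drive in drives:
        if drive["fw"] == target_fw:
            continue  # already current, nothing to do
        if not is_compatible(drive["model"], target_fw):
            log.append((drive["id"], "skipped: incompatible"))
            continue
        offline = sum(1 for d in drives if d["state"] != "online")
        if offline >= parity_drives:
            # Taking another drive away now could cost availability.
            log.append((drive["id"], "deferred: redundancy at limit"))
            continue
        drive["state"] = "updating"
        drive["fw"] = target_fw  # stand-in for the real flash step
        drive["state"] = "online"
        log.append((drive["id"], "updated"))
    return log

drives = [
    {"id": 0, "model": "X", "fw": "1.0", "state": "online"},
    {"id": 1, "model": "X", "fw": "1.0", "state": "online"},
    {"id": 2, "model": "Y", "fw": "1.0", "state": "online"},
]
# Hypothetical compatibility rule: the fix only applies to model X.
result = rolling_update(drives, "1.1", lambda model, fw: model == "X")
print(result)
```

The loop itself is trivial. The hard part is knowing, for every drive model and firmware combination, what `is_compatible` should return, and that knowledge has to come from somewhere.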

Why Firmware Updating is a Challenge with Some Storage Solutions

Certain kinds of solutions make it inherently harder to manage critical tasks like component firmware updates.

You see, being able to hot-update different kinds of firmware in any given set of hardware means that the mechanism doing the updating must be intimately familiar with the underlying hardware & software combination, however complex.

Consider the following kind of solution, maybe for someone sold on the idea that white box approaches are the future:

  • They buy a bunch of diskless server chassis from Vendor A
  • They buy a bunch of SSDs from Vendor B
  • They buy some Software Defined Storage offering from Vendor C
  • All running on the underlying OS of Vendor D…

Now, let’s say Vendor B has an emergency SSD firmware fix they made available, easily downloadable on their website. Here are just some of the challenges:

  1. How will that customer be notified by Vendor B that such a critical fix is available?
  2. Once they have the fix located, which Vendor will automate updating the firmware on the SSDs of Vendor B, and how?
  3. How does the customer know that Vendor B’s firmware fix doesn’t violently clash with something from Vendor A, C or D?
  4. How will all that affect the data-serving functionality of Vendor C?
  5. Can any of Vendors A, B, C or D orchestrate all the above safely?
  6. With no downtime?

In most cases I’ve seen, the above chain of events will not even progress past #1. The user will simply be unaware of any update, simply because component vendors don’t usually have a mechanism that alerts individual customers regarding firmware.

You could inject a significant permutation here: What if you buy the servers pre-built, including SSDs, from Vendor A, including full certification with Vendors C and D? 

Sure – it still does not materially change the steps above. One of Vendors A, C or D still needs to somehow:

  1. Automatically alert the customer about the critical SSD firmware fix being available
  2. Be able to non-disruptively update the firmware…
  3. …While not clashing with the other hardware and software from Vendors A, C and D

I could expand this type of conversation to other things like overall environmental monitoring and checksums but let’s keep it simple for now and focus on just component firmware updates…

Always Remember – Solve Business Problems & Balance Risk

Any solution is a compromise. Always make sure you are comfortable with the added risk certain areas of compromise bring (and that you are fully aware of said risk).

The allure of certain approaches can be significant (at the very least because of lower promised costs). It’s important to maintain a balance between increased risk and business benefit.

In the case of SSDs specifically, the utter criticality of certain firmware updates means that it’s crucially important for any given storage solution to be able to safely and automatically address the challenge of updating SSD firmware.

D

The Well-Behaved Storage System: Automatic Noisy Neighbor Avoidance

This topic is very near and dear to me, and is one of the big reasons I came over to Nimble Storage.

I’ve always believed that storage systems should behave gracefully and predictably under pressure. Automatically. Even under complex and difficult situations.

It sounds like a simple request and it makes a whole lot of sense, but very few storage systems out there actually behave this way. This creates business challenges and increases risk and OpEx.

The Problem

The simplest way to state the problem is that most storage systems can enter conditions where workloads can suffer from unfair and abrupt performance starvation under several circumstances.

OK, maybe that wasn’t the simplest way.

Consider the following scenarios:

  1. A huge sequential I/O job (backup, analytics, data loads etc.) happening in the middle of latency-sensitive transaction processing
  2. Heavy array-generated workloads (garbage collection, post-process dedupe, replication, big snapshot deletions etc.) happening at the same time as user I/O
  3. Failed drives
  4. Controller failover (due to an actual problem or simply a software update)

#3 and #4 are more obvious – a well-behaved system will ensure high performance even during a drive failure (or three), and after a controller fails over. For instance, if total system headroom is automatically kept at 50% for a dual-controller system (or, simplistically, 100/n, where n is the controller count for shared-everything architectures), even after a controller fails, performance should be fine.
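The 100/n figure falls out of simple arithmetic: if each of n controllers runs at a fraction L of its capacity, the surviving n-1 controllers can absorb the full load only if n·L ≤ n-1, so the reserve must be at least 1/n of the total. A tiny Python sketch (the function name is made up for the example):

```python
def failover_headroom_pct(controller_count):
    """Percent of total performance to hold in reserve so the surviving
    controllers can absorb the loss of one controller."""
    if controller_count < 2:
        raise ValueError("need at least two controllers for failover")
    return 100 / controller_count

for n in (2, 4, 8):
    print(f"{n} controllers: keep {failover_headroom_pct(n):.1f}% headroom")
```

As the text says, this is the simplistic version; it assumes load is spread evenly and that any controller can pick up any other’s work.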

#1 and #2 are a bit more complicated to deal with. Let’s look at this in more detail.

The Case of Competing Workloads During Hard Times

Inside every array, at any given moment, a balancing act occurs. Multiple things need to happen simultaneously.

Several user-generated workloads, for instance:

  • DB
  • VDI
  • File Services
  • Analytics

Various internal array processes – they also are workloads, just array-generated, and often critical:

  • Data reduction (dedupe, compression)
  • Cleanup (object deletion, garbage collection)
  • Data protection (integrity-related)
  • Backups (snaps, replication)

If the system has enough headroom, all these things will happen without performance problems.

If the system runs out of headroom, that’s where most arrays have challenges with prioritizing what happens when.

The most common way a system may run out of headroom is the sudden appearance of a hostile “bully” workload. This is also called a “noisy neighbor”. Here’s an example of system behavior in the presence of a bully workload:

[Figure: bully vs. victim workload]

In this example, the latency-sensitive workload will greatly and unfairly suffer after the “noisy neighbor” suddenly appears. If the latency-sensitive workload is a mission-critical application, this could cause a serious business problem (slow processing of financial transactions, for instance).

This is an extremely common scenario. A lot of the time it’s not even a new workload. Often, an existing workload changes behavior (possibly due to an application change – for instance a patch or a modified SQL query). This stuff happens.

In addition, the noisy neighbor doesn’t have to be a different LUN/Volume. It could be a competing workload within the same Volume! In such cases, simple fair sharing will not work.

How some vendors have tried to fix the issue with Manual “QoS”

As always, there is more than one way to skin a cat, if one is so inclined. Here are a few of the methods vendors use to fix workload contention:

  • Some arrays have a simple IOPS or throughput limit that an administrator can manually adjust in order to fix a performance problem. This is an iterative and reactive method and hard to automate properly in real time. In addition, if the issue was caused by an internal array-generated workload, there is often no tooling available to throttle those processes.
  • Other arrays insist on the user setting up minimum, maximum and burst IOPS values for every single volume in the system, upon volume creation. This assumes the user knows in advance what performance envelope is required, in detail, per volume. The reality is that almost nobody knows these things beforehand, and getting the numbers wrong can itself cause a huge problem with latencies. Most people just want to get on with their lives and have their stuff work without babysitting.
  • A few modern arrays do “fair sharing” among LUNs/Volumes. This can help a bit overall but doesn’t address the issue of competing workloads within the same Volume.

Manual mechanisms for fixing the “bully” workload challenge result in systems that are hard to consume and complex to support while under performance pressure. Moreover, when a performance issue occurs, speed of resolution is critical. The issue needs to be resolved immediately, especially for latency-sensitive workloads. Manual methods will simply not be fast enough. Business will be impacted.
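To show what a manual limit boils down to, here is a sketch of a static per-volume IOPS cap written as a token bucket in Python. The token bucket is just one common way to build such a limiter and is my choice for the example, not a description of any particular vendor’s code; `IopsLimiter`, `iops_limit` and `burst` are hypothetical names.

```python
class IopsLimiter:
    """Static per-volume IOPS cap implemented as a token bucket."""

    def __init__(self, iops_limit, burst):
        self.rate = iops_limit       # tokens added per second
        self.capacity = burst        # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if one I/O may be issued at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = IopsLimiter(iops_limit=100, burst=10)

# 50 I/Os arrive at the same instant: only the burst allowance gets through.
admitted = sum(limiter.allow(now=0.0) for _ in range(50))

# One second later the bucket has refilled, but only up to the burst size.
admitted_later = sum(limiter.allow(now=1.0) for _ in range(50))

print(admitted, admitted_later)
```

Notice that the limiter knows nothing about what the I/O is for or how busy the array is. Somebody has to pick `iops_limit` and `burst` in advance, per volume, and revisit them whenever the workload changes.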

How Nimble Storage Fixed the Noisy Neighbor Issue

No cats were harmed in the process. Nimble engineers looked at the extensive telemetry in InfoSight, used data science, and neatly identified areas that could be massively automated in order to optimize system behavior under a wide variety of adverse conditions. Some of what was done:

  • Highly advanced Fair Share disk scheduling (separate mechanisms that deal with different scenarios)
  • Fair Share CPU scheduling
  • Dynamic Weight Adjustment (the most important innovation listed) – automatically adjust priorities in various ways under different resource contention conditions, so that the system can always complete critical tasks and not fall dangerously behind. For instance, even within the same Volume, preferentially prioritize latency-sensitive I/O (like a DB redo log write or random DB access) vs things like a large DB table scan (which isn’t latency-sensitive).
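To give a feel for what adjusting weights under contention can look like, here is a toy Python sketch. It is emphatically not Nimble’s algorithm, which isn’t described in this post: it uses generic stride-style weighted scheduling, and the two I/O classes, the 0.8 load threshold and the 4:1 weights are invented for the example.

```python
from collections import deque

def weights_for(load):
    """Hypothetical policy: equal weights normally, favor
    latency-sensitive I/O once the system is under contention."""
    if load > 0.8:
        return {"latency": 4, "bulk": 1}
    return {"latency": 1, "bulk": 1}

def dispatch(queues, weights, slots):
    """Stride-style scheduling: a class advances its 'pass' by 1/weight each
    time it is served, and the ready class with the lowest pass goes next."""
    passes = {name: 0.0 for name in queues}
    order = []
    for _ in range(slots):
        ready = [name for name in queues if queues[name]]
        if not ready:
            break
        # Ties go to the class with the higher weight.
        name = min(ready, key=lambda n: (passes[n], -weights[n]))
        order.append(queues[name].popleft())
        passes[name] += 1.0 / weights[name]
    return order

def make_queues():
    return {
        "latency": deque(f"L{i}" for i in range(1, 9)),  # e.g. redo log writes
        "bulk": deque(f"B{i}" for i in range(1, 9)),     # e.g. table scan reads
    }

relaxed = dispatch(make_queues(), weights_for(load=0.5), slots=10)
contended = dispatch(make_queues(), weights_for(load=0.9), slots=10)

print(relaxed)
print(contended)
```

With plenty of headroom the two classes simply alternate. Under contention, the latency-sensitive class gets four dispatch slots for every one the bulk class gets, yet the bulk work still makes progress instead of being starved outright.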

The end result is a system that:

  • Lets system latency increase gracefully and progressively as load increases
  • Carefully and automatically balances user and system workloads, even within the same Volume
  • Achieves I/O deadlines and preemption behavior for latency-sensitive I/O
  • Eliminates the Noisy Neighbor problem without the need for any manual QoS adjustments
  • Allows latency-sensitive small-block I/O to proceed without interference from bully workloads even at extremely high system loads.
Such automation achieves a better business result: less risk, less OpEx, easy supportability, simple and safe overall consumption even under difficult conditions.

What Should Nimble Customers do to get this Capability?

As is typical with Nimble systems and their impressive Ease of Consumption, nothing fancy needs to be done apart from simply upgrading to a specific release of the code (in this case 3.1 and up – 2.3 did some of the magic but 3.1 is the fully realized vision).

A bit anticlimactic, apologies… if you like complexity, watching this instead is probably more fun than juggling QoS manually.

D


7-Mode to Clustered ONTAP Transition

I normally deal with different aspects of storage (arguably far more exciting) but I thought I would write something to provide some common sense perspective on the current state of 7-Mode to cDOT adoption.

I will tackle the following topics:

  1. cDOT vs 7-Mode capabilities
  2. Claims that not enough customers are moving to cDOT
  3. 7-Mode to cDOT transition is seen by some as difficult and expensive
  4. Some argue it might make sense to look at competitors and move to those instead
  5. What programs and tools are offered by NetApp to make transition easy and quick
  6. Migrating from competitors to cDOT

cDOT vs 7-Mode capabilities

I don’t want to make this into a dissertation or take a trip down memory lane. Suffice it to say that while cDOT has most of the 7-Mode features, it is internally very different and much, much more powerful than 7-Mode – cDOT is a far more capable and scalable storage OS in almost every possible way (and the roadmap is utterly insane).

For instance, cDOT is able to nondisruptively do anything, including crazy stuff like moving SMB shares around cluster nodes (moving LUNs around is much easier than dealing with the far more quirky SMB protocol). Some reasons to move stuff around the cluster could be node balancing, node evacuation and replacement… all done on the fly in cDOT, regardless of protocol.

cDOT also handles Flash much better (cDOT 8.3.x+ can be several times faster than 7-Mode for Flash on the exact same hardware). Even things like block I/O (FC and iSCSI) are completely rewritten from the ground up in cDOT. Cloud integration. Automation. How failover is done. Or how CPU cores are used, how difficult edge conditions are handled… I could continue but then the ADD-afflicted would move on, if they haven’t already…

In a nutshell, cDOT is a more flexible, forward-looking architecture that respects the features that made 7-Mode so popular with customers, but goes much further. There is no competitor with the breadth of features available in cDOT, let alone the features coming soon.

cDOT is quite simply the next logical step for a 7-Mode customer.

Not enough customers moving to cDOT?

The reality is actually pretty straightforward.

  • Most new customers simply go with cDOT, naturally. 7-Mode still has a couple of features cDOT doesn’t, and if those features are really critical to a customer, that’s when someone might go with 7-Mode today. With each cDOT release the feature delta list gets smaller and smaller. Plus, as mentioned earlier, cDOT has a plethora of features and huge enhancements that will never make it to 7-Mode, with much more coming soon.
  • The things still missing in cDOT (like WORM) aren’t even offered by the majority of storage vendors… many large customers use our WORM technology.
  • Large existing customers, especially ones running critical applications, naturally take longer to cycle technologies. The average time to switch major technologies (irrespective of vendor) is around 5 years. The big wave of cDOT transitions hasn’t even hit yet!
  • cDOT 8.2 with SnapVault was the appropriate release for many of our customers, and 8.3 is the release for most of them. Given when those releases came out, a huge number of systems are still within that 5-year window and simply haven’t reached the point of converting to cDOT.
  • Customers with mission-critical systems will typically not convert an existing system – they will wait for the next major refresh cycle. Paranoia rules in those environments (that’s a general statement regardless of vendor). And we have many such customers.

7-Mode to cDOT transition is seen by some as difficult and expensive

This is a fun one, and a favorite FUD item for competitors and so-called “analysts”. I sometimes think we confused people by calling cDOT “ONTAP”. I bet expectations would be different if we’d called it “SuperDuper ClusterFrame OS”.

You see, cDOT is radically different in its internals versus 7-Mode – however, it’s still officially also called “ONTAP”. As such, customers are conditioned to super-easy upgrades between ONTAP releases (just load the new code and you’re done). cDOT is different enough that we can’t just do that.

I lobbied for the “ClusterFrame” name but was turned down BTW. I still think it rocks.

The fact that you can run either 7-Mode or cDOT on the same physical hardware confuses people even further. It’s a good thing to be able to reuse hardware (software-defined and all that). Some vendors like to make each new rev of the same family line (and its code) utterly incompatible with the last one… we don’t do that.

And for the startup champions: Startups haven’t been around long enough to have seen a truly major hardware and/or software change! (another thing conveniently ignored by many). Nor do they have the sheer amount of features and ancillary software ONTAP does. And of course, some vendors forget to mention what even a normal tech refresh looks like for their fancy new “built from the ground up” box with the extremely exciting name.

We truly know how to do upgrades… probably better than any vendor out there. For instance: What most people don’t know is that WAFL (ONTAP’s underlying block layout abstraction layer) has been quietly upgraded many, many times over the years. On the fly. In major ways. With a backout option. Another vendor’s product (again the one with the extremely exciting name) needed to be wiped twice by as many “upgrades” in one year in order to have its block layout changed.

Here’s the rub:

Transition complexity really depends on how complex your current deployment is, your appetite for change and tolerance of risk. But transition urgency depends on how much you need the fully nondisruptive nature of cDOT and all the other features it has vs 7-Mode.

What I mean by that:

We have some customers that lose upwards of $4m per hour of downtime. For them, the long-term benefits of a truly nondisruptive architecture make any arguments regarding migration efforts effectively moot.

If you are using a lot of the 7-Mode features and companion software (and it has more features than almost any other storage OS), specific tools written only for 7-Mode, older OS clients only supported on 7-Mode, tons of snaps and clones going back to several years’ retention etc…

Then, in order to retain a similarly elaborate deployment in cDOT, the migration effort will also naturally be a bit more complex. But still doable. And we can automate most of it. Including moving over all the snapshots and archives!

On the other hand, if you are using the system like an old-fashioned device and aren’t taking advantage of all the cool stuff, then moving to anything is relatively easy. And especially if you’re close to 100% virtualized, migration can be downright trivial (though simply moving VMs around storage systems ignores any snapshot history – the big wrinkle with VM migrations).

Some argue it might make sense to look at competitors and move to those instead

Looking at options is something that makes business sense in general. It would be very disingenuous of me to say it’s foolish to look at options.

But this holds firmly true: If you want to move to a competitor platform, and use a lot of the 7-Mode features, it would arguably be impossible to do cleanly and maintain full functionality (at a bare minimum you’d lose all your snaps and clones – and some customers have several years’ worth of backup data in SnapVault – try asking them to give that up).

This is true for all competitor platforms: Someone using specific features, scripts, tools, snaps, clones etc. on any platform, will find it almost impossible to cleanly migrate to a different platform. I don’t care who makes it. Doesn’t matter. Can you cleanly move from VMware + VMware snaps to Hyper-V and retain the snaps?

Backup/clone retention is really the major challenge here – for other vendors. Do some research and see how frequently customers switch backup platforms… 🙂 We can move snaps etc. from 7-Mode to cDOT just fine 🙂

The fewer features you use, the easier the migration and acclimatization to new stuff becomes, but the less value you are getting out of any given product.

Call it vendor lock-in if you must, but it’s merely a side effect of using any given device to its full potential.

The reality: it is far easier to move from 7-Mode to cDOT than from 7-Mode to other vendor products. Here’s why…

What programs and tools are offered by NetApp to make transition easy and quick?

Initially, migration of a complex installation was harder. But we’ve been doing this a while now, and can do the following to make things much easier:

  1. The very cool 7MTT (7-Mode Transition Tool). This is an automation tool we keep rapidly enhancing that dramatically simplifies migrations of complex environments from 7-Mode to cDOT. Any time and effort analysis that ignores how this tool works is quite simply a flawed and incomplete analysis.
  2. After migrating to a new cDOT system, you can take your old 7-Mode gear and convert it to cDOT (another thing that’s impossible with a competitor – you can’t move from, say, a VNX to a VMAX and then convert the VNX to a VMAX).
  3. As of cDOT 8.3.0: We made SnapMirror replication work for all protocols from 7-Mode to cDOT! This is the fundamental way we can easily move over not just the baseline data but also all the snaps, clones etc. Extremely important, and something that would simply be impossible to carry forward when moving to a competitor.
  4. As of cDOT 8.3.2 we are allowing something pretty amazing: CFT (Copy Free Transition). It does exactly what the name suggests: no data has to be moved over to cDOT! It’s a combined ONTAP and 7MTT feature that allows disconnecting the shelves from 7-Mode controllers and re-attaching them to cDOT controllers, thereby converting even a gigantic system in practically no time. See here for a quick guide, here for a great blog post.
And before I forget…

What about migrating from other vendors to cDOT?

It cuts both ways – anything less would be unfair. Not only is it far easier to move from 7-Mode to cDOT than to a competitor, it’s also easy to move from a competitor to cDOT! After all, it’s all about growth, and that’s the only way real growth happens.

As of version 8.3.1 we have what’s called Online Foreign LUN Import (FLI). See link here. It’s all included – no special licenses needed.

With Online FLI we can migrate LUNs from other arrays and maintain maximum uptime during the migration (a cutover is needed at some point of course but that’s quick).

And all this we do without external “helper” gear or special software tools.

In the case of NAS migrations, we have the free and incredibly cool XCP software that can migrate things like high file count environments 25-100x faster than traditional methods. Check it out here.

In summary

I hardly expect to change the minds of people suffering from acute confirmation bias (I wish I could say “you know who you are”, but not knowing you are afflicted is the major problem). Hopefully, though, the more level-headed among us should recognize by now that:

  • 7-Mode to cDOT migrations are extremely straightforward in all but the most complex and custom environments
  • Those same complex environments would find it impossible to transparently migrate to anything else anyway
  • Backups and clones are among the things that complicate migrations for any vendor – ONTAP happens to be used by a lot of customers to handle backups as part of its core value prop
  • NetApp provides extremely powerful tools to help with migrations from 7-Mode to cDOT and from competitors to cDOT (with amazing tools for both SAN and NAS!) that will also handle the backups/clones/archives!
  • The grass isn’t always greener on the other side – The transition from 7-Mode to cDOT is the first time NetApp has asked customers to do anything that major in over 20 years. Other, especially younger, vendors haven’t even seen a truly major code change yet. How will they react to such a thing? NetApp is handling it just fine 🙂

D
