Practical Considerations for Implementing NVMe Storage

Before we begin, something needs to be clear: Although dual-ported NVMe drives are not yet cost effective, the architecture of Nimble Storage is NVMe-ready today. And always remember that in order to get good benefits from NVMe, one needs to implement it all the way from the client. Doing NVMe only at the array isn’t as effective.

In addition, Nimble already uses technology far faster than NVMe: Our write buffers use byte-addressable NVDIMM-N, instead of slower NVRAM HBAs or NVMe drives that other vendors use. Think about it: I/O happens at DDR4 RAM speeds, which makes even the fastest NVMe drive seem positively glacial.


I did want to share my personal viewpoint of where storage technology in general may be headed if NVMe is to be mass-adopted in a realistic fashion and without making huge sacrifices.

About NVMe

Lately, a lot of noise is being made about NVMe technology. The idea being that NVMe will be the next step in storage technology evolution. And, as is the natural order of things, new vendors are popping up to take advantage of this perceived opening.

For the uninitiated: NVMe is a relatively new standard that was created specifically for devices connected over a PCI bus. It has certain nice advantages vs SCSI such as reduced latency and improved IOPS. Sequential throughput can be significantly higher. It can be more CPU-efficient. It needs a small and simple driver, the standard requires only 13 commands, and it can also be used over some FC or Ethernet networks (NVMe over Fabrics). Going through a fabric only adds a small amount of extra latency to the stack compared to DAS.

NVMe is strictly an optimized block protocol, and not applicable to NAS/object platforms unless one is talking about their internal drives.

Due to the additional performance, NVMe drives are a no brainer in systems like laptops and DASD/internal to servers. Usually there is only a small number (often just one device) and no fancy data services are running on something like a laptop… replacing the media with better media+interface is a good idea.

For enterprise arrays though, the considerations are different.

NVMe Performance

Marketing has managed to confuse people regarding NVMe’s true performance. It’s important to note that tests illustrating NVMe performance show a single NVMe device being faster than a single SAS or SATA SSD. But storage arrays usually don’t have a single device and so drive performance isn’t the bottleneck as it is with low media count systems.

In addition, most tests and research papers comparing NVMe to other technologies use wildly dissimilar SSD models. For instance, pitting a modern, ultra-high-end NVMe SSD against an older consumer SATA SSD with a totally different internal controller. This can make proper performance comparisons difficult. How much of the performance boost is due to NVMe and how much because the expensive, fancy SSD is just a much better engineered device?

For instance, consider this chart of NVMe device latency, courtesy of Intel:


As you can see, regarding latency, NVMe as a drive connection protocol will offer better latency than SAS or SATA but the difference is in the order of a few microseconds. The protocol differences become truly important only with next gen technologies like 3D Xpoint, which ideally needs a memory interconnect to shine (or, at a minimum, PCI) since the media is so much faster than the usual NAND. But such media will be prohibitively expensive to be used as the entire storage within an array in the foreseeable future, and would quickly be bottlenecked by the array CPUs at scale.

NVMe over Fabrics

Additional latency savings will come from connecting clients using NVMe over Fabrics. By doing I/O over an RDMA network, a latency reduction of around 100 microseconds is possible versus encapsulated SCSI protocols like iSCSI, assuming all the right gear is in place (HBAs, switches, host drivers). Doing NVMe at the client side also helps with lowering CPU utilization, which can make client processing overall more efficient.

Where are the Bottlenecks?

The reality is that the main bottleneck in today’s leading modern AFAs is the controller itself and not the SSDs (simply because there is enough performance in just a couple of dozen modern SAS/SATA SSDs to saturate most systems). Moving to competent NVMe SSDs will mean that those same controllers will now be saturated by maybe 10 NVMe SSDs. For example, a single NVMe drive may be able to read sequentially at 3GB/s, whereas a single SATA drive 500MB/s. Putting 24 NVMe drives in the controller doesn’t mean that magically the controller will now deliver 72GB/s. In the same way, a single SATA SSD might be able to do 100000 read small block random IOPS and an NVMe with better innards 400000 IOPS. Again, it doesn’t mean that same controller with 24 devices will all of a sudden now do 9.6 million IOPS!

How Tech is Adopted

Tech adoption comes in waves until a significant technology advancement is affordable and reliable enough to become pervasive. For instance, ABS brakes were first used in planes in 1929 and were too expensive and cumbersome to use in everyday cars. Today, most cars have ABS brakes and we take for granted the added safety they offer.

But consider this: What if someone told you that in order to get a new kind of car (that has several great benefits) you would have to utterly give up things like airbags, ABS brakes, all-wheel-drive, traction control, limited-slip differential? Without an equivalent replacement for these functions?

You would probably realize that you’re not that excited about the new car after all, no matter how much better than your existing car it might be in other key aspects.

Storage arrays follow a similar paradigm. There are several very important business reasons that make people ask for things like HA, very strong RAID, multi-level checksums, encryption, compression, data reduction, replication, snaps, clones, hot firmware updates. Or the ability to dynamically scale a system. Or comprehensive cross-stack analytics and automatic problem prevention.

Such features evolved over a long period of time, and help mitigate risk and accelerate business outcomes. They’re also not trivial to implement properly.

NVMe Arrays Today

The challenge I see with the current crop of ultra-fast NVMe over Fabrics arrays is that they’re so focused on speed that they ignore the aforementioned enterprise features in lieu of sheer performance. I get it: it takes great skill, time and effort to reliably implement such features, especially in a way that they don’t strip the performance potential of a system.

There is also a significant cost challenge in order to safely utilize NVMe media en masse. Dual-ported SSDs are crucial in order to deliver proper HA. Current dual-ported NVMe SSDs tend to be very expensive per TB vs current SAS/SATA SSDs. In addition, due to the much higher speed of the NVMe interface, even with future CPUs that include FPGAs, many CPUs and PCI switches are needed to create a highly scalable system that can fully utilize such SSDs (and maintain enterprise features), which further explains why most NVMe solutions using the more interesting devices tend to be rather limited.

There are also client-side challenges: Using NVMe over Fabrics can often mean purchasing new HBAs and switches, plus dealing with some compromises. For instance, in the case of RoCE, DCB switches are necessary, end-to-end congestion management is a challenge, and routability is not there until v2.

There’s a bright side: There actually exist some very practical ways to give customers the benefits of NVMe without taking away business-critical capabilities.

Realistic Paths to NVMe Adoption

We can divide the solution into two pieces, the direction chosen will then depend on customer readiness and component availability. All the following assumes no loss of important enterprise functionality (as we discussed, giving up on all the enterprise functionality is the easy way out when it comes to speed):

Scenario 1: Most customers are not ready to adopt host-side NVMe connectivity:

If this is the case, a good option would be to have something like a fast byte-addressable ultra-fast device inside the controller to massively augment the RAM buffers (like 3D Xpoint in a DIMM), or, if not available, some next-gen NVMe drives to act as cache. That would provide an overall speed boost to the clients and not need any client-side modifications. This approach would be the most friendly to an existing infrastructure (and a relatively economical enhancement for arrays) without needing all internal drives to be NVMe nor extensive array modifications.

You see, part of any competent array’s job is using intelligence to hide any underlying media issues from the end user. A good example: even super-fast SSDs can suffer from garbage collection latency incidents. A good system will smooth out the user experience so users won’t see extreme latency spikes. The chosen media and host interface are immaterial for this, but I bet if you were used to 100μs latencies and they suddenly spiked to 10ms for a while, it would be a bad day. Having an extra-large buffer in the array would help do this more easily, yet not need customers to change anything host-side.

An evolutionary second option would be to change all internal drives to NVMe, but to make this practical would require wide availability of cost-effective dual-ported devices. Note that with low SSD counts (less than 12) this would provide speed benefits even if the customer doesn’t adopt a host-side NVMe interface, but it will be a diminishing returns endeavor at larger scale, unless the controllers are significantly modified.

Scenario 2: Large numbers of customers are ready and willing to adopt NVMe over Fabrics.

In this case, the first thing that needs to change is the array connectivity to the outside world. That alone will boost speeds on modern systems even without major modifications. Of course, this will often mean client and networking changes to be most effective, and often such changes can be costly.

The next step depends on the availability of cost-effective dual-ported NVMe devices. But in order for very large performance benefits to be realized, pretty big boosts to CPU and PCI switch counts may be necessary, necessitating bigger changes to storage systems (and increased costs).

Architecture Matters

In the quest for ultra-low latency and high throughput without sacrificing enterprise features (yet remaining reasonably cost-effective), overall architecture becomes extremely important.

For instance, how will one do RAID? Even with NVMe over Fabrics, approaches like erasure coding and triple mirroring can be costly from an infrastructure perspective. Erasure coding remains CPU-hungry (even more so when trying to hit ultra-low latencies), and triple mirroring across an RDMA fabric would mean massive extra traffic on that fabric.

Localized CPU:RAID domains remain more efficient, and mechanisms such as Nimble NCM can fairly distribute the load across multiple storage nodes without relying on a cluster network for heavy I/O. This technology is available today.

Next Steps

In summary, I urge customers to carefully consider the overall business impact of their storage making decisions, especially when it comes to new technologies and protocols. Understand the true benefits first. Carefully balance risk with desired outcome, and consider the overall system and not just the components. Of course, one needs to understand the risks vs rewards first, hence this article. Just make sure that, in order to achieve a certain ideal, you don’t give up on critical functionality that you’ve been taking for granted.

Leave a comment for posterity...