There are a lot of myths and misinformation, plus more than a modicum of misunderstanding, regarding how storage systems can use available bandwidth, especially with certain newer kinds of media.
I wanted to explain some of the harsh facts of storage system design in the real world, and why one shouldn’t just add up drive speeds to estimate performance.
A Common Myth – Additive Drive Performance
“This awesome new drive can read at 4GB/s!”
Sure. By itself. To get that performance, the drive needs 4x PCIe 3.0 lanes, or 2x PCIe 4.0 lanes. All the time. Which, ironically, is something one can easily do on a laptop 🙂
However, arrays don’t have a single SSD. They have potentially hundreds.
The next statement often is:
“So, 100 of those babies can read at 400GB/s! This is incredible!”
Not even close. What would be incredible is if that figure could actually be achieved through an array controller.
Why is that?
Even a smaller, dual controller array may have 24x SSDs.
For those fancy fast SSDs to be able to be read at full speed, one would need 24 * 4 = 96 PCIe 3.0 lanes going from all the SSDs to the array itself, and eventually the outside world.
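That lane math is worth sanity-checking in a couple of lines. Here's a quick Python sketch using only the example figures from this post (24 drives, x4 PCIe 3.0 each, ~1GB/s per lane):

```python
# Example figures from the text: 24 SSDs, each wanting a full x4 PCIe 3.0 link
ssd_count = 24
lanes_per_ssd = 4
gbps_per_pcie3_lane = 1  # ~1 GB/s per direction per PCIe 3.0 lane

lanes_needed = ssd_count * lanes_per_ssd
raw_read_bw = lanes_needed * gbps_per_pcie3_lane

print(f"Lanes needed just for the SSDs: {lanes_needed}")      # 96
print(f"Raw aggregate read bandwidth: ~{raw_read_bw} GB/s")   # ~96 GB/s
```

That's the entire PCIe budget of a dual-socket Skylake-SP system consumed by the drives alone, before a single host-facing port is connected.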
Why does PCIe lane count matter so much?
CPUs and PCIe Lanes
You see, the ultimate data mover in most modern storage systems is the CPU.
For non-memory traffic, server-class CPUs (like the common Intel Skylake-SP used in many storage systems) connect to the outside world via PCIe. Things like expansion slots on a system, or connectivity to drives, ultimately have to go through PCIe.
A single Intel Skylake-SP has 48x PCIe 3.0 lanes. A dual-socket Skylake-SP system has 96x PCIe 3.0 lanes.
Everything has to go through those 96x CPU PCIe lanes.
PCIe lanes have a maximum achievable bandwidth per lane. PCIe 3.0 can transfer about 1GB/s in each direction simultaneously (so you could have 1GB/s reads and 1GB/s writes happening simultaneously per PCIe 3.0 lane). PCIe 4.0 neatly doubles that to 2GB/s – the problem being that most CPUs today (with the notable exception of AMD) don’t support PCIe 4.0.
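The per-lane and per-link numbers scale roughly like this (ballpark figures, ignoring protocol overhead; real-world throughput is a bit lower):

```python
# Approximate per-direction bandwidth per lane, in GB/s (ballpark figures,
# ignoring encoding/protocol overhead). Each PCIe generation doubles it.
per_lane_gbps = {"PCIe 3.0": 1.0, "PCIe 4.0": 2.0}

for gen, gbps in per_lane_gbps.items():
    for width in (1, 4, 16):
        print(f"{gen} x{width}: ~{width * gbps:.0f} GB/s each direction")
```

So the same x4 link that carries ~4GB/s on PCIe 3.0 carries ~8GB/s on PCIe 4.0, which is exactly why generation matters as much as lane count.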
PCIe Lane Balancing in Storage Systems
So, let’s use a simple example and see if we have anywhere near enough PCIe lanes to satisfy merely 24x fancy modern SSDs that have 4x PCIe lanes each.
In this simplified block diagram of a single array controller (which doesn’t quite match any specific array but should be sufficiently generic to explain things), notice that I had to divide up my 96x total PCIe lane allotment and split it among:
- The outside world: things like host-facing HBAs for FC or Ethernet
- The outside world: things like an expansion shelf
- Any internal media in the system
- Last but by no means least, mirroring to the rest of the array’s controllers.
All of a sudden, my hitherto plentiful-seeming 96 lanes turn out to be nowhere near enough to cover modern media PCIe lane requirements.
The way CPUs talk to something like a set of 24x internal SSDs is via aggregators such as PCIe switches, and things are always oversubscribed. Those 24x SSDs may not even have all their PCIe lanes connected (often it will be just 1-2 per drive), and those lanes fan in, so the drives talk to the CPUs over vastly less bandwidth than the devices could deliver in aggregate.
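To put a number on that fan-in, here is a hedged sketch of the oversubscription through a hypothetical PCIe switch. The x16 uplink width is an assumption for illustration only; real array designs vary:

```python
# Hypothetical PCIe switch: 24 SSDs at x4 downstream, one x16 uplink to the CPU
# (the uplink width is an illustrative assumption, not from any real array)
ssd_count = 24
downstream_lanes = ssd_count * 4   # 96 lanes the drives could use
uplink_lanes = 16                  # lanes the CPU actually sees (assumed)

oversubscription = downstream_lanes / uplink_lanes
print(f"Oversubscription: {oversubscription:.0f}:1")  # 6:1
```

At 6:1, each drive effectively gets one sixth of the bandwidth it was marketed with whenever all drives are busy at once.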
Let’s Talk Expansion and Outside I/O
If I now have 48 SSDs (say, 24 internal and 24 in an expansion shelf, not an unreasonable scenario), I need 192 PCIe 3.0 lanes just for the SSDs to send data at their top speed to the CPUs of the array. This doesn’t include taking that data and sending it to hosts!
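A small sketch makes the shortfall for the 48-SSD scenario concrete. The way the 96-lane budget is split below is an illustrative assumption, not the layout of any real array:

```python
# Lane demand vs. supply for the 48-SSD scenario in the text
cpu_lanes = 96                      # dual-socket Skylake-SP budget
ssd_demand = 48 * 4                 # 48 SSDs wanting x4 each = 192 lanes

# Assumed illustrative split of the 96-lane budget (not any real array):
budget = {"host HBAs": 32, "expansion shelf": 16,
          "internal media": 32, "controller mirroring": 16}
assert sum(budget.values()) == cpu_lanes

media_lanes = budget["internal media"] + budget["expansion shelf"]
shortfall = ssd_demand - media_lanes
print(f"SSDs want {ssd_demand} lanes; media-facing budget is {media_lanes}; "
      f"shortfall: {shortfall} lanes")
```

However you shuffle that split, the drives want twice the controller's entire lane budget before a single byte heads toward a host.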
No matter how much I try to balance the I/O, the fact remains that I need some PCIe lanes to read from media and some PCIe lanes to transfer what I read to the outside world.
I could try to save some lanes by dedicating only 8x lanes to mirroring to the other controller, but that would mean my write speeds are forever limited by that lane count: about 8GB/s maximum with PCIe 3.0.
I could try to give more lanes to the SSDs, but that would mean not having enough lanes for host interface cards and would therefore impact the host-side I/O capability.
As you can see, there is no easy way to satisfy modern SSDs with current controller architectures since the ultimate limiting factor is the PCIe lane count of the array CPUs.
All this means that any modern storage system has to resort to some extremely heavy overprovisioning of resources, no matter how much marketing would like you to believe that certain media types are magical and will solve all your storage performance problems.
Solutions to the PCIe Lane Shortage Problem
There is no magic, but there are some viable options to alleviate PCIe lane shortage:
- Have more PCIe lanes – perhaps through a different CPU vendor that has a higher lane count, like AMD.
- Have faster PCIe lanes. Switching from PCIe 3.0 to PCIe 4.0 doubles bandwidth per lane. Again, AMD is the main CPU vendor today offering PCIe 4.0.
- Have more active controllers – and, in turn, ensure those controllers have more total lanes to the media, especially if it’s shared between the controllers.
- Employ a shared-nothing (or shared-little) architecture, where each controller owns a very limited number of drives, ensuring more bandwidth for a given number of drives. This is similar to SDS/Grid architectures, and the main challenge is having a fast enough interconnect between nodes to satisfy the more exotic media (otherwise, what’s the point of having yet another bottleneck?). Think 200Gb interconnects between nodes, for instance, and you are on the right path. Yes, it gets very expensive to do this right.
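As a rough check on that last point, a 200Gb/s link moves about 25GB/s, which just a handful of the 4GB/s drives from the opening example can saturate (drive figures are the example numbers from this post):

```python
# How many 4 GB/s NVMe drives can a 200 Gb/s node interconnect actually feed?
interconnect_gbit = 200
interconnect_gbyte = interconnect_gbit / 8   # ~25 GB/s
drive_read_gbyte = 4                         # the SSD from the opening example

drives_to_saturate = interconnect_gbyte / drive_read_gbyte
print(f"~{interconnect_gbyte:.0f} GB/s link; "
      f"~{drives_to_saturate:.1f} drives saturate it")
```

In other words, even a very expensive 200Gb fabric is fully consumed by six or seven such drives reading flat out, which is why node drive counts have to stay small in these designs.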
This article isn’t even touching on things like Path Length – all the bandwidth in the world is meaningless if there is too much software in between the data and the outside world – RAID, checksums, compression, dedupe… I just wanted to give you all a better understanding of how storage systems have to balance resources.
So, next time some vendor is claiming all kinds of amazing things about how the media they use changes the game and so on, ask them how many PCIe lanes are going from the SSDs to the CPU root complex, and see what they say… 🙂