(Edited: My bad, it was 1.3TB/s, not 1TB/s).
What do you do when you need so much I/O performance that no one single storage system can deliver it, no matter how large?
To be specific: What if you needed to transfer data at 1TB per second? (or 1.3TB/s, as it eventually turned out to be)?
That was the problem faced by the U.S. Department of Energy (DoE) and their Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL), one of the fastest supercomputing systems on the planet.
You can read the official press release here. I wanted to get more into the technical details.
People talk a lot about “big data” recently – no clear definition seems to exist, in my opinion it’s something that has some of the following properties:
- Too much data to be processed by a “normal” computer or cluster
- Too much data to work with using a relational DB
- Too much data to fit in a single storage system for performance and/or capacity reasons – or maybe just simply:
- Too much data to process using traditional methods within an acceptable time frame
Clearly, this is a bit loose – how much is “too much”? How long is “too long”? For someone only armed with a subnotebook computer, “too much” does not have the same meaning as for someone rocking a 12-core server with 256GB RAM and a few TB of SSD.
So this definition is relative… but in some cases, such as the one we are discussing, absolute – given the limitations of today’s technology.
For instance, the amount of storage LLNL required was several tens of PB in a single storage pool that could provide unprecedented I/O performance to the tune of 1TB/s. Both size and performance needed to be scalable. It also needed to be reliable and fit within a reasonable budget and not require extreme space, power and cooling. A tall order indeed.
This created some serious logistics problems regarding storage:
- No single disk array can hold that amount of data
- No single disk array can perform anywhere close to 1TB/s
Let’s put this in perspective: The storage systems that scale the biggest are typically scale-out clusters from the usual suspects of the storage world (we make one, for example). Even so, they max out at less PB than the deployment required.
The even bigger problem is that a single large scale-out system can’t really deliver more than a few tens of GB/s under optimal conditions – more than fast enough for most “normal” uses but utterly unacceptable for this case.
The only realistic solution to satisfy the requirements was massive parallelization, specifically using the NetApp E-Series for the back-end storage and the Lustre cluster filesystem.
A bit about the solution…
Almost a year ago NetApp purchased the Engenio storage line from LSI. That storage line is resold by several companies like IBM, Oracle, Quantum, Dell, SGI, Teradata and more. IBM also resells the ONTAP-based FAS systems and calls them “N-Series”.
That purchase has made NetApp the largest provider of OEM arrays on the planet by far. It was a good deal – very rapid ROI.
There was a lot of speculation as to why NetApp would bother with the purchase. After all, the ONTAP-based systems have a ton more functionality than pretty much any other array and are optimized for typical mostly-random workloads – DBs, VMs, email, plus megacaching, snaps, cloning, dedupe, compression, etc – all with RAID6-equivalent protection as standard.
The E-Series boxes on the other hand don’t do thin provisioning, dedupe, compression, megacaching… and their snaps are the less efficient copy-on-first-write instead of redirect-on-write. So, almost the anti-ONTAP
The first reason for the acquisition was that, on purely financial terms, it was a no-brainer deal even if one sells shoes for a living, let alone storage. Even if there were no other reasons, this one would be enough.
Another reason (and the one germane to this article) was that the E-Series has a tremendous sustained sequential performance density. For instance, the E5400 system can sustain about 4GB/s in 4U (real GB/s, not out of cache), all-in. That’s 4U total for 60 disks including the controllers. Expandable, of course. It’s no slouch for random I/O either, plus you can load it with SSDs, too…
Again, note – 60 drives per 4U shelf and that includes the RAID controllers, batteries etc. In addition, all drives are front-loading and stay active while servicing the shelf – as opposed to most (if not all) dense shelves in the market that need the entire (very heavy) shelf pulled out and/or several drives offlined in order to replace a single failed drive… (there’s some really cool engineering in the shelf to do this without thermal problems, performance loss or vibrations). All this allows standard racks and no fear of the racks tipping over while servicing the shelves (you know who you are!)
There are some vendors that purely specialize in sequential I/O and tipping racks – yet they have about 3-4x less performance density than the E5400, even though they sometimes have higher per-controller throughput. In a typical marketing exercise, some of our more usual competitors have boasted 2GB/s/RU for their controllers, meaning that in 4U the controllers (that take up 4U in that example) can do 8GB/s, but that requires all kinds of extra rack space to achieve (extra UPSes, several shelves, etc). Making their resulting actual throughput number well under 1GB/s/RU. Not to mention the cost (those systems are typically more expensive than a 5400). Which is important with projects of the scale we are talking about.
Most importantly, what we accomplished at the LLNL was no marketing exercise…
The benefits of truly high performance density
Clearly, if your requirements are big enough, you end up spending a lot less money and needing a lot less rack space, power and cooling by going with a highly performance-dense solution.
However, given the requirements of the LLNL, it’s clear that you can’t use just a single E5400 to satisfy the performance and capacity requirements of this use case. What you can do though is use a bunch of them in parallel… and use that massive performance density to achieve about 40GB/s per industry-standard rack with 600x high-capacity disks (1.8PB raw per rack).
For even higher performance per rack, the E5400 can use the faster SAS or SSD drives – 480 drives per rack (up to 432TB raw), providing 80GB/s reads/60GB/s writes.
Enter the cluster filesystem
So, now that we picked the performance-dense, reliable, cost-effective building block, how do we tie those building blocks together?
The answer: By using a cluster filesystem.
Loosely defined, a cluster filesystem is simply a filesystem that can be accessed simultaneously by the servers mounting it. In addition, it also typically means it can span storage systems and make them look as one big entity.
It’s not a new concept – and there are several examples, old and new: AFS, Coda, GPFS, and the more prevalent Stornext and Lustre are some.
The LLNL picked Lustre for this project. Lustre is a distributed filesystem that breaks apart I/O into multiple Object Storage Servers, each connected to storage (Object Storage Targets). Metadata is served by dedicated servers that are not part of the I/O stream and thus not a bottleneck. See below for a picture (courtesy of the Lustre manual) of how it is all connected:
High-speed connections are used liberally for lower latency and higher throughput.
A large file can reside on many storage servers, and as a result I/O can be spread out and parallelized.
Lustre clients see a single large namespace and run a proprietary protocol to access the cluster.
It sounds good in theory – and it delivered in practice: 1.3TB/s sustained performance was demonstrated to the NetApp block devices. Work is ongoing to finalize the testing with the complete Lustre environment. Not sure what the upper limit would be. But clearly it’s a highly scalable solution.
Putting it all together
NetApp has fully realized solutions for the “big data” applications out there – complete with the product and services needed to complete each engagement. The Lustre solution employed by the LLNL is just one of the options available. There is Hadoop, Full Motion uncompressed HD video, and more.
So – how fast do you need to go?