At HPE Discover Barcelona 2024, HPE released the Alletra Storage MP X10000, the latest in our new line of shared hardware platform storage offerings.
It’s an innovative new platform specially made for unstructured data, and a long time in the making. This is HPE tech, not a partnership.
The initial workloads this solution is aimed at are anything requiring fast S3 performance, including AI workloads, data lakes, cloud-native app development, and high-speed backup and restore.
It has several innovations, such as RDMA for object, and is highly differentiated. It also makes this kind of technology available at a much smaller starting capacity, instead of focusing only on the huge end of the scale.
As usual, my aim is not to regurgitate basic information but rather to explain the true technical differentiation and get people excited about the possibilities on offer here.
In summary, the X10000 benefits are:
- Disaggregation flexibility for separately expanding compute and/or capacity
- Ability to scale down and not need huge capacities to get good performance
- Balanced read/write performance and low latency for all workloads
- Flexible, fully container-based architecture that opens up tons of possibilities for running customer code inside the storage solution.
Let’s get to it:
Why Make This?
Other products in this space have one or more of these deficiencies:
- Built with a specific protocol as the foundation. For example, under the covers being truly file, object or block, and running other protocols as emulations – like object on top of file (very common), or block on top of object, etc. This inevitably results in some sort of inefficiency or inflexibility.
- Inability to independently scale performance and capacity.
- Needing multiple “buckets” to get maximum performance. This leads to management complexity.
- Needing special media or special memory to get good performance.
- Being severely compromised in one or more performance dimensions. For example, good for large object throughput but high latency for small objects, or poor for writes but good for reads.
- Needing extremely large capacities and controller counts to achieve good performance.
- Very high minimum starting points – putting even the smallest solutions out of reach for most customers.
- Inability to provide computational storage.
I will tackle these and explain how the X10000 innovations avoid these issues and result in better business outcomes (and fewer headaches).
RDMA & GPUDirect for S3
First things first though: We are working with NVIDIA to add support to the X10000 for RDMA so customers can enjoy GPUDirect for object workloads. Once that feature is finalized, it will be one of the very few object solutions with that ability (and full certification).
This technology will both greatly improve performance vs TCP and significantly reduce CPU demands, allowing customers to get a lot more out of their infrastructure – and eliminate NAS bottlenecks from GPUDirect pipelines.
X10000 Architecture Fundamentals
Under the covers, the entire system is built using containers. This is useful from a flexibility and scalability standpoint – for instance, what if we let users also run their own containers? Or offer extra HPE services as containers? Lots of possibilities open up for computational storage (and very fast nodes with ample performance are provided to explore those possibilities).
The name is a bit of a clue: B10000 is the block system. X can be – many things 🙂
Resiliency & Data Integrity
Extreme resiliency and data integrity are ensured with Triple+ Erasure Coding and Cascade Multistage Checksums (an evolution of the protection first seen in Nimble and then Alletra 5000 and 6000 – if you want to learn more about those fundamentals and why they matter, check this document).
The major difference compared to Nimble is that the write buffer now goes directly to the disks instead of mirrored NVDIMM – eliminating the HA pair restriction and allowing similar concepts of cluster resiliency as the B10000 but at even larger scale (for the benefits of eliminating HA pairs, check this post).
The other big difference is that Triple+ RAID isn’t done on whole disks but rather each disk is carved into “disklets” (small logical disks – the smallest size is 1GB) for granularity and flexibility.
These disklet RAID groups can be confined to a single JBOF or span JBOFs. RAID groups within a JBOF allow the X10000 to scale down efficiently. RAID groups that span JBOFs can protect against the failure of an entire JBOF in larger clusters.
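To make the disklet idea concrete, here is a toy sketch of carving disks into 1GB disklets and counting how many 24-wide RAID groups a single JBOF could host. The group width (24) and 1GB disklet size come from this article; the placement rule (one disklet per distinct disk per group) is my simplifying assumption, not HPE's actual allocator.

```python
# Illustrative sketch only - not HPE's real allocator.
DISKLET_GB = 1    # smallest disklet size per the architecture description
GROUP_WIDTH = 24  # 24-disklet RAID groups

def disklets_per_disk(disk_capacity_gb):
    """How many 1GB disklets a single disk can be carved into."""
    return disk_capacity_gb // DISKLET_GB

def groups_within_jbof(disk_count, disk_capacity_gb):
    """RAID groups a single JBOF can host, assuming each group takes
    one disklet from each of 24 distinct disks (failure isolation)."""
    if disk_count < GROUP_WIDTH:
        return 0  # not enough distinct disks for a full-width group
    return disklets_per_disk(disk_capacity_gb)

# Example: a JBOF with 24 disks of ~3840 GB each can host ~3840
# groups of 24 x 1 GB disklets under this simplified policy.
```

The fine granularity is what lets RAID groups be allocated, grown, and moved flexibly without re-striping whole physical disks.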
A Different Kind of DSP
The X10000 uses DSPs – sharded Data Service Partitions – to vertically slice up the workload between controller nodes. DSPs are automatically generated, portable and allow capabilities like resiliency in case of extreme loss (for example, if you lose more than one node – the DSPs just get nicely and evenly redistributed among the remaining controllers).
Disklet RAID slices are dynamically allocated to DSPs as needed (notice the color coding in the image denoting an example of slice ownership by DSP).
Upon addition of one or more nodes, DSPs are rebalanced. Because all state is only persisted within JBOFs and nodes are completely stateless, this movement of DSPs takes a few seconds and there is no data movement involved. To expand the performance capability of a cluster, all it requires is addition of controllers and redistribution of DSPs. Since objects are distributed across DSPs based on a hash, performance is always load balanced across the nodes of a cluster.
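The hash-based distribution and stateless rebalancing described above can be sketched as follows. This is a teaching model of assumed behavior, not HPE code: the shard count, hash scheme, and round-robin assignment are all invented for illustration.

```python
# Toy model of DSP placement and rebalancing (assumed behavior).
import hashlib

NUM_DSPS = 64  # fixed shard count, illustrative

def dsp_for_object(bucket, key):
    """Objects hash to a DSP, so load spreads evenly across shards."""
    h = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(h[:8], "big") % NUM_DSPS

def assign_dsps(nodes):
    """Spread DSPs round-robin over the available controllers.
    Because state is persisted only in the JBOFs and nodes are
    stateless, reassignment changes ownership, not data location."""
    return {dsp: nodes[dsp % len(nodes)] for dsp in range(NUM_DSPS)}

before = assign_dsps(["n1", "n2", "n3", "n4"])
after = assign_dsps(["n1", "n2", "n3"])  # n4 lost: DSPs redistribute
```

The key point the sketch captures: losing (or adding) a node only rewrites the DSP-to-node map, which is why the operation takes seconds rather than requiring data movement.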

Every Protocol is a First Class Citizen
In X10000, a log-structured Key-Value store implements protocol-agnostic storage of data and metadata chunks and is the foundational data layer. It is optimized for flash access, reducing write amplification with a log-structured and extent-based approach.
On top of the KV store are native protocol-specific namespace layers, such as Object. These protocol layers are optimized for the semantics of a specific Protocol, treating each as a first-class citizen. This allows X10000 to take advantage of the strengths of each protocol, without inheriting the downsides of a second protocol or running protocols on top of each other (like Object on top of File or vice versa).
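A minimal model of that layering might look like the sketch below: a protocol-agnostic KV data layer underneath, with an object namespace that maps bucket/key to data chunks plus a metadata entry on top. The chunk size, key scheme, and class names are invented for illustration and are not the real implementation.

```python
# Toy layering model - not the actual X10000 data path.

class KVStore:
    """Stands in for the log-structured, extent-based KV data layer."""
    def __init__(self):
        self._kv = {}
    def put(self, k, v):
        self._kv[k] = v
    def get(self, k):
        return self._kv[k]

class ObjectNamespace:
    """Object-protocol layer: maps bucket/key to data chunks in the
    KV store plus a metadata entry - native object semantics, rather
    than object emulated on top of a file layout."""
    CHUNK = 4  # bytes; tiny on purpose for the demo
    def __init__(self, kv):
        self.kv = kv
    def put_object(self, bucket, key, data):
        chunks = [data[i:i + self.CHUNK] for i in range(0, len(data), self.CHUNK)]
        for n, c in enumerate(chunks):
            self.kv.put(f"data/{bucket}/{key}/{n}", c)
        self.kv.put(f"meta/{bucket}/{key}", len(chunks))
    def get_object(self, bucket, key):
        n = self.kv.get(f"meta/{bucket}/{key}")
        return b"".join(self.kv.get(f"data/{bucket}/{key}/{i}") for i in range(n))
```

Because the namespace layer owns the protocol semantics and the KV layer owns placement, another protocol layer could sit beside Object on the same data layer without one being emulated on the other.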
Independently Scale Performance and Capacity
A very common problem is systems that end up having too much compute and not enough capacity, and vice versa.
The X10000 (just like its block cousin, the B10000) allows compute and capacity to be scaled separately, so that the optimal blend of performance and capacity is achieved not just initially but long-term. This reduces TCO and eliminates waste.
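Disaggregation means sizing becomes two independent calculations: pick nodes from the performance target, pick JBOFs from the capacity target. Here is a back-of-the-envelope sketch; the per-node performance figure is an invented placeholder, while the 3-node/1-JBOF minimum and ~92TB raw per starting JBOF come from this article.

```python
# Back-of-the-envelope sizing sketch. PERF_PER_NODE is a made-up
# placeholder, not a real X10000 performance figure.
import math

PERF_PER_NODE = 10_000   # hypothetical ops/s per controller node
RAW_TB_PER_JBOF = 92     # ~92 TB raw per JBOF (smallest config)

def size_cluster(target_ops, target_raw_tb, min_nodes=3, min_jbofs=1):
    """Compute and capacity are sized independently of each other."""
    nodes = max(min_nodes, math.ceil(target_ops / PERF_PER_NODE))
    jbofs = max(min_jbofs, math.ceil(target_raw_tb / RAW_TB_PER_JBOF))
    return nodes, jbofs
```

A performance-hungry, capacity-light workload gets more nodes without paying for unused JBOFs, and vice versa - that independence is the TCO win.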
A Single Bucket Can Get All the Speed
S3 buckets are a useful structure. With certain object solutions, however, it’s common to need multiple S3 buckets to get the most performance out of a system. Yet typical unstructured workloads such as analytics and data protection assume a single bucket, or a small number of buckets, per application unit such as a single warehouse or backup chain.
The X10000 doesn’t have this problem. Even a single bucket is enough to get the maximum performance out of the hardware. This frees the administrators from unnecessary complexity to gain speed – instead, buckets can be used for the true utility purpose they serve instead of an inelegant performance hack.
This is especially felt on writes. Versus some competitors, we may be 60x faster for small object PUT operations when using a single bucket.
X10000’s ability to scale a single bucket linearly means individual applications benefit from X10000’s scale out ability just the same as a large number of applications or tenants.
No Need for Special Media to Get Good Performance
Continuing the paradigm of the B10000 system, there’s no need for special drives or exotic memory to get good performance out of the system. Standard enterprise SSDs are used. This helps reduce TCO and removes the reliance on uncommon components that may be affected by a shift in supplier strategy. It also helps performance scale better when more SSDs are added since all SSDs are used for all aspects of performance in parallel, eliminating a specific component being a bottleneck.
Unstructured Workload Performance Needs Vary Greatly
Unstructured data workloads are extremely varied, and even within a single use case category like Artificial Intelligence, workload characteristics can differ widely. The picture below characterizes typical Machine Learning and Deep Learning workloads. While many Object architectures prioritize bandwidth-oriented performance, the X10000 Object namespace and the rest of the data path stack deliver high IOPS-oriented performance for small objects, high bandwidth for larger objects, and low latency (< 2ms) for GETs and PUTs.

Varied Needs Require Performance in Every Dimension
The X10000 is designed to provide balanced read vs write performance, both for high throughput and small transactional operations. This means that for heavy write workloads, one does not need a massive cluster. This results in an optimized performance experience regardless of workload and provides the ability to reach performance targets without waste.
Performance scales linearly as the cluster is expanded.
There are two key architectural decisions that enable X10000 to deliver high IOPS performance. First, X10000’s log-structured Key-Value store is extent-based. Extents are variable sized. Extent-based metadata and layout allow X10000 to adapt metadata and data accesses to application boundaries.
Second, X10000’s write buffer and indexes are optimized for small objects. X10000 implements a write buffer to which a small PUT is first committed, before it is destaged to a log-structured store and metadata updates are merged into a Fractal Index Tree for efficient updates.
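The batching idea behind fractal-tree-style indexes can be sketched in a few lines: small metadata updates accumulate in a cheap in-memory buffer and are merged into the sorted index in bulk, amortizing the update cost. This is a teaching model of the general technique, not HPE's index; the flush threshold and class name are invented.

```python
# Simplified model of buffered (fractal-tree-style) index updates.

class BufferedIndex:
    def __init__(self, flush_at=4):
        self.buffer = {}      # recent updates, cheap to absorb
        self.index = {}       # stands in for the sorted on-flash index
        self.merges = 0       # bulk merges performed so far
        self.flush_at = flush_at
    def put(self, key, loc):
        self.buffer[key] = loc
        if len(self.buffer) >= self.flush_at:
            # One bulk merge instead of N individual index writes.
            self.index.update(self.buffer)
            self.buffer.clear()
            self.merges += 1
    def get(self, key):
        # Recent updates win; fall back to the merged index.
        return self.buffer.get(key, self.index.get(key))
```

Batching is what keeps per-PUT metadata cost low even when the workload is millions of small objects.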
Write Path
The write path is interesting: Small object PUTs are committed to X10000’s write buffer prior to destaging them to the log-structured, erasure-code-protected store. The commit to a write buffer reduces the latency of small PUTs and reduces write amplification. The write buffer is stored on the same SSDs as the log-structured store and formed out of a collection of disklets.
Using SSDs for the write buffer was first implemented on B10000, which showed that an SSD-based write buffer can deliver the same high reliability and low latency as prior approaches such as NVDIMM, even for latency-sensitive structured data workloads.
Additionally, X10000 takes advantage of Object semantics to completely bypass the write buffer beyond a certain object size threshold, and instead directly writes the large object as part of a RAID stripe. This reduces write amplification, improves write performance and is part of the collection of techniques used to deliver X10000’s high write performance.
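The size-threshold routing described above reduces to a simple decision; a minimal sketch follows. The threshold value and return labels are illustrative assumptions - the article does not state the actual cutoff.

```python
# Sketch of the size-threshold write path (threshold value invented).

SMALL_OBJECT_THRESHOLD = 256 * 1024  # bytes, assumed for the demo

def route_put(size_bytes):
    if size_bytes <= SMALL_OBJECT_THRESHOLD:
        # Small PUT: commit to the disklet-backed write buffer for low
        # latency, destage to the log-structured store later.
        return "write_buffer_then_destage"
    # Large PUT: write directly as part of a RAID stripe, skipping the
    # buffer to avoid double-write amplification.
    return "direct_full_stripe"
```

Small objects get the latency benefit of the buffer; large objects get the efficiency of a single full-stripe write.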
Great Performance Efficiency Even With a Small Deployment
A design goal of the X10000 was to provide high initial performance even with a relatively small deployment.
The minimum starting point is 3 nodes and 1 JBOF (all nodes are active). This also reflects the Node:JBOF performance ratio potential: With the drives used today, a single JBOF is enough to get high performance across all metrics with 3 nodes. Adding a 4th node won’t add much more performance until a 2nd JBOF is added, then performance will linearly scale until you hit 6 nodes, etc.
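That Node:JBOF ratio behaves like a step function, which a toy model makes obvious: performance is capped by whichever resource runs out first. The 3-nodes-per-JBOF ratio comes from this article; the per-node performance unit is arbitrary.

```python
# Toy model of the node:JBOF scaling behavior (arbitrary units).

NODES_PER_JBOF = 3   # ratio described in the article
PERF_PER_NODE = 1.0  # arbitrary performance unit per node

def cluster_perf(nodes, jbofs):
    """Performance is capped by whichever resource runs out first."""
    effective_nodes = min(nodes, jbofs * NODES_PER_JBOF)
    return effective_nodes * PERF_PER_NODE

# With 1 JBOF, a 4th node adds nothing until a 2nd JBOF arrives;
# 6 nodes + 2 JBOFs doubles the 3-node, 1-JBOF figure.
```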
Ability to Start Small and Scale Later (Scale Down)
It’s always cool to talk about Exascale systems and scaling up: HPE makes the largest by far, and the second largest has been measured at a cool 11TB/s of storage throughput (yes, Eleven TeraBytes Per Second).
But what about scaling down? The X10000 doesn’t require lots of large-capacity drives to get good performance – the smallest recommended config uses 3.84TB SSDs, about 92TB raw. In the space this solution is aimed at, that’s a small enough amount of capacity for most customers. At that capacity point, competing solutions would either not exist or would typically be severely constrained in at least one performance and/or TCO dimension.
Space Efficiency and Data Reduction
Incoming data is compressed, and good space efficiency is ensured by using 24-disklet RAID groups.
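The space-efficiency benefit of wide RAID groups is simple arithmetic. Assuming "Triple+" means 3 parity disklets per 24-disklet group (my assumption - the article doesn't spell out the data:parity split), the usable fraction works out as:

```python
# Quick space-efficiency arithmetic for a 24-disklet RAID group.
GROUP_WIDTH = 24
PARITY = 3  # assumed parity count for triple-parity erasure coding

efficiency = (GROUP_WIDTH - PARITY) / GROUP_WIDTH
# 21/24 = 0.875: under this assumption, 87.5% of raw capacity is
# usable before compression, while still surviving three concurrent
# disklet failures per group.
```

Narrower groups at the same protection level would spend a much larger fraction of capacity on parity, which is why wide groups matter for TCO.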
For backup applications, HPE’s Rapid Restore solution for the Alletra Storage MP X10000 combines exabyte scale and performance designed for faster object data writes and retrieval with HPE’s StoreOnce Catalyst technology, maximizing storage efficiency and data security. Benefits include fully encrypted backups, storage efficiency improved by up to 3x over competitors, and rapid data recovery. The X10000 is also partner-certified as a backup target when backing up directly from Commvault and Veeam. See more info here.
Last but Not Least: Manageability
Deploying and using advanced storage shouldn’t be a science project. The X10000 is managed using the same common framework as all of HPE’s modern storage, server and network solutions. No need to learn a new interface or go to a different management portal. You can see how easy the management is here.
Summary
The HPE Alletra Storage MP X10000 offers a new and innovative approach in the space of unstructured data storage solutions. From the containerized architecture that provides the possibility of computational storage, to the efficient design that allows customers to have balanced performance even at a smaller starting point, the X10000 is an exciting solution that’s aimed at solving practical problems in a practical way.