Some time ago I wrote about the dangers of taking certain things for granted with new technologies.
This time I wanted to use a more specific enterprise application example to show that customers need to be extra careful when comparing solutions, especially for mission-critical apps.
Sometimes being too high-level means missing the unspeakable horrors lurking under the covers. And ignorance doesn’t mean bliss… just nasty surprises.
To summarize: Avoid bait-and-switch so you avoid surprise costs and pain.
- Ensure all components in any sizing tools reflect your business requirements.
- The simpler the infrastructure, the more reliable it tends to be. If one must do things like stripe across many volumes just to get decent performance even for medium-sized solutions, that may be a warning sign that the solution is lacking.
- Ensure all the underlying components in any pricing you see would fit your company’s mission-critical needs. For instance – what reliability and resiliency are the storage components rated for? And is that sufficient for your needs?
- Ensure you are accounting for the right number of systems (Production-spec vs. non-production, multiplied by the number of applications, etc.). This can quickly add up with certain apps.

SAP HANA Solutions
There are multiple ways to consume certified SAP HANA solutions. The key is combining certified components that can together address a certain performance requirement (like HANA node count, as shown in this article) in a solution that meets the resiliency and uptime requirements of the end user.
So, one could combine certified storage and use the TDI approach (most flexible) or opt for the Appliance model that delivers the entire solution as a package (best for truly high performance requirements). Both of those can be consumed as CapEx or OpEx of course.
Another way is to use public cloud offerings, which means picking from certified compute configurations (each with a different CPU/RAM capability) and then attaching cloud storage to those compute options.
I will focus on the storage part, since that’s where I noticed the need for extra caution.
FYI: SAP HANA Certification Regarding Performance
The storage subsystem for SAP HANA needs to meet certain minimum KPIs – for instance, sub-millisecond write latency for the redo log volumes, and a minimum of 400MB/s throughput per data volume (fairly pedestrian with modern all-flash systems). Of course, depending on the final spec, you may need a lot more than that (for instance, if you want large servers with 24TB RAM, you’d need much higher throughput per server, otherwise it would take forever for the database to start up).
Then multiply all that by the number of nodes needed.
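To make the math concrete, here’s a quick back-of-the-envelope sketch. All the figures (RAM sizes, per-server throughput) are illustrative assumptions for this example, not official SAP certification numbers:

```python
# Back-of-the-envelope HANA storage throughput sizing.
# All figures below are illustrative assumptions, not official SAP KPIs.

def startup_minutes(ram_tb: float, throughput_mb_s: float) -> float:
    """Rough time to load a database roughly the size of RAM from storage."""
    ram_mb = ram_tb * 1024 * 1024
    return ram_mb / throughput_mb_s / 60

# At the 400 MB/s per-volume KPI minimum, loading a 24 TB instance takes hours:
print(f"{startup_minutes(24, 400):.0f} minutes")   # ~1049 minutes (~17.5 hours)

# At an assumed 5,000 MB/s per server, it becomes far more tolerable:
print(f"{startup_minutes(24, 5000):.0f} minutes")  # ~84 minutes

# The aggregate requirement then scales with node count:
nodes, per_node_mb_s = 8, 2000
print(f"Cluster needs {nodes * per_node_mb_s / 1000:.0f} GB/s total")  # 16 GB/s
```

The point of the exercise: per-volume KPI minimums say little about how a large, multi-node landscape will actually behave at startup or recovery time.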
Fit-For-Purpose
When comparing solutions for production workloads (especially mission-critical), apart from performance, a couple of other things arguably matter even more:
Uptime and Resiliency (for example, resistance to data corruption and hardware failures)
A solution may be certified for performance, but certification unfortunately doesn’t consider reliability or data integrity. Those are aspects the end customer needs to perform due diligence on in order to ensure fitness for purpose in a mission-critical environment.
Scalability vs Complexity vs Price/Performance
What do you think is simpler and more resilient:
- A single incredibly resilient volume able to hit very high performance levels without further complexity or
- Multiple smaller volumes, each with performance limitations and lower resiliency, striped together using LVM to create a larger device that’s still slower than the previous example.
Clearly, #2 has some issues. Any LVM striping creates a less resilient solution and increases configuration, management and upkeep complexity.
The risk is now multiplied times the number of underlying volumes – a problem with any of the underlying volumes brings the whole stripe set down.
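The multiplication of risk is easy to see with a little probability math. The per-volume availability figures below are hypothetical, chosen only to illustrate how quickly a stripe set erodes resiliency:

```python
# Sketch: availability of an LVM stripe set vs. a single volume.
# A stripe set fails if ANY member volume fails, so availabilities multiply.
# The per-volume figures are hypothetical, for illustration only.

def stripe_availability(per_volume_availability: float, volumes: int) -> float:
    """All volumes must be up for the stripe set to be up."""
    return per_volume_availability ** volumes

single = 0.99999                           # a hypothetical "five nines" volume
striped = stripe_availability(0.999, 10)   # ten "three nines" volumes striped

print(f"single volume: {single:.5%}")
print(f"10-way stripe: {striped:.5%}")     # ~99.0% – roughly two nines
```

In other words, striping ten “three nines” volumes together yields a device that’s meaningfully *less* available than any single member, never mind a properly resilient single volume.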
The reason I mention all this is because several offerings have severe limitations regarding things like:
- IOPS/GB
- Throughput per volume
- Volume resiliency
- Volume latency
- Uptime
Those limitations are what create the complications and increase risk.
Public Cloud Sizing Example
For instance, let’s say you have a sizable, mission-critical environment and need to hit 15GB/s. Part of the need for the speed is for a large DB to be able to start quickly, not just for the normal runtime performance. A large database without enough throughput behind it would be very slow every time it started.
- A certain public cloud provider’s more affordable offerings may only be rated for a couple hundred MB/s maximum throughput per volume, 3 nines uptime, 3 nines durability(!), and more than 1ms latency. To hit the target speeds, one would need to stripe across many of those devices (further reducing durability) and still wouldn’t be able to get decent latency or durability appropriate for mission-critical environments.
- Their “medium” tier may be able to get around 1GB/s throughput per volume but the same weak latency, uptime and resiliency as their affordable tier.
- Their “fancy” tier may provide 1GB/s throughput per volume, over 1ms latency but at least the durability is 5 nines.
- Their “ultimate” tier may be the only one that can do things like sub-millisecond latency, and over 1GB/s throughput per volume.
Clearly then, for this example 15GB/s workload, one would absolutely need to go with their “ultimate” tier to get the right latency, resiliency and throughput for an important workload. Anything else would mean complications, low performance and less resiliency. And one would still need to stripe across several of those “ultimate” tier volumes but at least not as many as with the other options.
But, of course, the “ultimate” tier is far, far costlier than the other ones… (FYI I’m not calling out names or providing links in order to keep this generic. I may have noticed all this with a particular large vendor but the concepts and concerns extend to everyone providing AaaS solutions).
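Here’s the volume-count arithmetic behind that example. The per-volume throughput figures are placeholders mirroring the hypothetical tiers above, not any specific vendor’s published limits:

```python
import math

# How many volumes must be striped to hit the example 15 GB/s target?
# Per-volume throughput (GB/s) is hypothetical, mirroring the tiers above.
TARGET_GB_S = 15

tiers = {
    "affordable": 0.25,   # "a couple hundred MB/s" per volume
    "medium":     1.0,
    "fancy":      1.0,
    "ultimate":   1.2,    # assumed slightly over 1 GB/s per volume
}

volumes_needed = {name: math.ceil(TARGET_GB_S / t) for name, t in tiers.items()}
for name, count in volumes_needed.items():
    print(f"{name:10s}: stripe across {count} volumes")
```

With these assumed numbers, the “affordable” tier needs a 60-way stripe (with the durability consequences shown earlier), while even the “ultimate” tier still needs a double-digit stripe set.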
Comparing Price/Performance/Fitness for Purpose of HANA Solutions
Which finally brings us to the main reason I wrote this article.
As we discussed, for mission-critical workloads, only the most expensive storage tier of this public cloud provider would suffice. In fact, they absolutely say so themselves, which is perfectly fine and in line with common sense.
Here’s the tricky part:
Public cloud providers have online pricing tools that can give you a rough estimate of costs.
What’s interesting is that in their pricing tool for HANA configurations, the instances rated for production default to the cheapest storage option – which is not recommended by the cloud vendor.
Not even the “fancy” option was the default (I understand maybe not wanting to make the “ultimate” tier the default).
No, the sizer defaulted to the cheapest one – a tier that would really only be used for test/dev purposes.
Now, I’m not saying they did that on purpose to mislead anyone – perhaps the configuration tool is simply buggy. All I’m saying is: do your due diligence and examine the underlying components carefully so you don’t get lulled into a false sense of security.
Otherwise one of two things may happen:
- You won’t have the right performance or resiliency, which will put your business at risk, or
- Some architect will catch the error and correct it, which will vastly increase the pricing – perhaps well beyond what you were budgeting. And by the time the error is caught it will be too late to change course.
Be Aware of All Applicable Charges (Number of Systems/Network Egress Fees/Backup)
System Count
Continuing with the SAP HANA example, knowing the right number of “fancy” vs “cheaper” systems needed is key since it will massively affect total cost. To simplify things, when I say “systems” I mean HANA systems, since that’s where the heavy storage, CPU and memory requirements reside. You’d need application servers as well.
Typical SAP deployments have 2-3 applications, like BW, ECC, CRM, SCM, etc.
For each SAP application, multiple systems are typically needed; here’s an assortment:
- Production
- Dev
- QA
- UA (sometimes shared duty with QA)
- DR (sometimes uses the UA/QA system to save cost)
- Sandbox
- Training
Typically, the Production, UA and QA systems need to be of identical spec, so one would be looking at the “fancy” Mission Critical storage back-end for at least 2 of the systems (assuming one uses a triple-duty system for QA/UA/DR). The rest can usually be of lesser spec, and therefore less expensive.
This means that, per application, about 3 systems are needed at a minimum, at least 2 of which are the “fancy” type. Now, multiply that by the number of applications, and you have a decent estimate of what is needed.
If one wants proper performance and reliability, all these systems need to be considered.
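A quick sketch of that multiplication, using the landscape described above. The three-application assumption and the per-application counts are illustrative:

```python
# Rough system-count math per the landscape above. The application count
# and per-application system counts are illustrative assumptions.

apps = 3            # e.g., BW, ECC, CRM
fancy_per_app = 2   # Production + identical-spec QA/UA(/DR) system
cheap_per_app = 1   # Dev (sandbox/training optional, not counted here)

fancy_total = apps * fancy_per_app
cheap_total = apps * cheap_per_app
print(f"{fancy_total} mission-critical-spec systems, "
      f"{cheap_total} lesser-spec systems, "
      f"{fancy_total + cheap_total} total")
```

Even this conservative count puts you at six mission-critical-spec systems – each needing the expensive storage tier – before a single sandbox or training system is added.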
Backup
Backup is another important consideration. How many months of backups will need to be kept? Where are they stored? How efficient is that store? How quickly can they be recovered? And how much does this all cost?
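Even a crude retention model shows how quickly backup capacity adds up. The schedule and change rate below are assumptions, not a recommended backup policy:

```python
# Crude backup retention capacity estimate. The schedule (weekly fulls,
# daily incrementals) and 5% daily change rate are illustrative assumptions.

def backup_storage_tb(db_size_tb: float, months_retained: int,
                      fulls_per_month: int = 4, change_rate: float = 0.05) -> float:
    """Capacity retained, before any deduplication or compression."""
    fulls = db_size_tb * fulls_per_month * months_retained
    incrementals = db_size_tb * change_rate * 30 * months_retained
    return fulls + incrementals

# e.g., a 10 TB database with 6 months of retention:
print(f"{backup_storage_tb(10, 6):.0f} TB retained")  # ~330 TB
```

That’s over 30x the database size retained, which is why the efficiency of the backup store (dedupe, compression) and its per-GB pricing matter so much.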
Egress
Several public cloud providers charge data egress costs (i.e. data coming out of the cloud and into the outside world incurs a capacity-based fee). Ensure that is factored into all the calculations.
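A simple sketch of how that fee accumulates. The per-GB rate below is a made-up placeholder; real rates vary by provider, region and volume tiers:

```python
# Simple egress-cost sketch. The $0.09/GB rate is a made-up placeholder;
# real rates vary by provider, region, and volume tiers.

def monthly_egress_cost(gb_out_per_month: float, rate_per_gb: float = 0.09) -> float:
    return gb_out_per_month * rate_per_gb

# e.g., pulling 5 TB out of the cloud each month (restores, replication, etc.):
print(f"${monthly_egress_cost(5 * 1024):,.2f} per month")  # ~$460.80
```

Data-heavy workflows like off-cloud DR replication or large restores can turn this from a rounding error into a significant line item.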
Call to Action and Options to Consider
If you’d rather consume infrastructure as a service – that’s perfectly fine and really where the industry is headed.
Just always be acutely aware of the underlying component capabilities of whatever solution you’re willing to consume.
- Is it reliable enough?
- Is it resilient enough?
- Is the latency low enough?
- Is the performance enough?
- Do you need to do things like LVM striping that would take away from the simplicity and resiliency of the solution?
- What about backups?
- What about network egress fees?
For instance – in addition to completely turnkey SAP solutions, HPE has a GreenLake for Block Storage AaaS solution that achieves sustained throughput (not cached hero numbers) of up to 55GB/s in just 4U and is certified for up to 120 SAP HANA nodes, with 100% guaranteed availability (public cloud uptime is typically under 3 nines) and latencies well below 1ms (indeed, 75% of all I/O completes in under 250 microseconds). Extremely well-suited for Mission Critical environments.
Indeed, the only time there would be the need for striping with such a solution is in leviathan HANA Appliance setups that stripe across entire arrays to provide Ludicrous Speed (few customers worldwide need this today of course – but know that it’s already available).
And, ironically, this Mission-Critical storage offering may end up being far more cost-effective than the “ultimate” storage tier from the public cloud vendor example given before.
Even for smaller deployments, the HPE Business Critical tier offers 6 nines uptime guarantees and incredible resiliency (and still very sizable HANA deployments of up to 54 nodes in 4U).
Without LVM complications, and with ease of consumption and management from a single pane of glass (and, of course, APIs if that’s your thing).
Update: check this document for an overview of HPE GreenLake for SAP HANA solutions.

