Ransomware is at the forefront of many discussions today, and for good reason: ransomware gangs make a ton of money by causing massive problems for businesses, which in turn are losing billions – and, most importantly, time.
So eventually, as with any problem, people tried to find solutions.
The challenge becomes finding which solutions truly address the problem in a realistic way, instead of being mostly marketing designed to show that a vendor isn’t behind in this area.
Some of you may remember the awesome Chrysler ads with Ricardo Montalban talking about “rich Corinthian Leather”. There is no such thing – the leather came from New Jersey. Corinth in Greece was never known for its prowess in anything leather-related, but the name sounded cool and different, so marketing went with it, as is their wont. I’ll explain how HPE’s Zerto ransomware detection & recovery is truly useful in both detecting modern ransomware and rapidly recovering with a tight RPO. I’ll also show which types of protection are more like Corinthian Leather 🙂
A good example of Corinthian Leather: “Immutable Snapshots”. Practically every serious storage system from the major vendors has this technology. It mostly means locking snaps so that even if ransomware has infected the backup system (and therefore has the permissions to delete snaps – which is the least of the many things ransomware will try to do), the storage system won’t allow the deletion to happen.
Techniques like locking snapshots are, at best, a supplemental form of defense. Some ransomware does indeed try to delete snaps before the hackers demand the ransom – but by that point, they may have been encrypting your data for months, so your snaps are infected too…
So if you can’t detect, with accuracy, when encryption started happening, you have no defense and no safe recovery point.
To summarize: Aside from prevention, what’s most useful if you have been infected is:
- Real-time detection but also…
- …the ability to detect modern kinds of ransomware that fool methods like standard Shannon entropy detection (for example, encryption that results in compressible data) but also…
- …the ability to very quickly recover, and with a minimal loss of data (tight RPO and RTO) – not in hours/days but seconds/minutes. Time is money and all that.
Let’s get started:
A Brief Explanation of Ransomware
Modern ransomware is a way to extort money from end users and businesses. It has become incredibly sophisticated, intelligently (and silently) attacking backup software and antivirus tools, and then transparently corrupting user data.
The corruption happens via encryption. The ransomware injects a filter driver that transparently encrypts and decrypts data: think of it as an extra layer that works somewhat like Windows BitLocker or Apple’s FileVault – the big difference being that you don’t have the key to unlock the data, the hackers do.
Because it’s a filter driver, users are unaware of it. Eventually the ransomware operators remove the transparency from the encryption and demand money in exchange for the key to unlock the data.
Some ransomware will also try to delete your backups/snaps on top of the encryption, but often the backups don’t even need to be touched: if an attacker encrypts slowly and waits, say, a few months before demanding money, chances are you don’t want to recover from a backup that’s several months old. Plus, you wouldn’t even know which backup/snapshot was clean!
Most businesses simply wouldn’t survive this kind of data loss.
Of course, the ideal situation is to not be hit by ransomware to begin with – that is best handled by techniques well outside your servers and storage, with multiple lines of defense needed.
But what if you do get hit by ransomware and it starts affecting your data? How do you quickly detect this and recover instead of losing days or months of work? What does your last line of defense look like?
Problem Description (or why ransomware detection solutions are mostly Corinthian leather):
Bear with me – when fighting any kind of misinformation, it takes a lot longer to counter it with a sufficient step-by-step explanation than it took to spread the misinformation in the first place… so we will go over some foundational material to show what’s really happening.
Most ransomware detection techniques revolve around certain broad areas:
- Measure backup/snapshot rate of change, the assumption being that encryption will increase space utilization for such protection mechanisms.
- Measure data reduction efficiency, the assumption being that encryption will worsen data reduction ratios.
- Measure CPU utilization, the assumption being that the act of mass encryption will radically increase CPU busy %.
- Measure data entropy, the assumption being that encryption will increase overall entropy – then compare against a fixed entropy threshold for non-encrypted data, and alert if that threshold has been breached. The underlying assumption is that encrypted data is random and therefore can’t be compressed…
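As a concrete illustration of the “Generation 2” approach, here is a minimal Python sketch of fixed-threshold entropy detection. The 7.5 bits/byte cutoff is an illustrative assumption, not any vendor’s actual value:

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0 to 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

FIXED_THRESHOLD = 7.5  # hypothetical "this looks encrypted" cutoff

def looks_encrypted(data: bytes) -> bool:
    """Gen-2 style check: flag data whose entropy exceeds a fixed threshold."""
    return shannon_entropy(data) > FIXED_THRESHOLD

print(looks_encrypted(b"the quick brown fox jumps over the lazy dog" * 100))  # False: plain text
print(looks_encrypted(os.urandom(65536)))                                     # True: random bytes mimic ciphertext
```

This works fine against naive mass encryption – and, as we’re about to see, fails against anything cleverer.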
I classify the first three as “Generation 1” methods, and entropy detection as “Generation 2”. The problem is that these techniques are no longer enough.
Modern ransomware has evolved, making classic Gen 1/Gen 2 detection techniques obsolete:
- It doesn’t try to encrypt too quickly, which means CPU load isn’t materially different vs normal operations.
- It doesn’t try to encrypt too much – just a tiny amount of encryption in each file is enough to make the data useless, but it isn’t enough to show up as a meaningful rate of change – so neither backup size nor data reduction ratio calculations work to detect it.
- Entropy detection by itself is not sufficient to detect ransomware that encrypts tiny amounts of data: it normally measures entropy increase against a fixed threshold, and modern ransomware won’t necessarily hit that threshold because it changes each file only a small amount – which could easily be construed as normal behavior.
- Modern ransomware may not even look encrypted: it increasingly uses techniques such as Base64 encoding, which is even more dangerous since it completely confuses standard Shannon entropy detection. Base64 converts data with the entropy of binaries into data with the entropy of text, making it hard to hit any known detection threshold (in fact, this one simple trick defeats all standard entropy detection schemes).
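The Base64 point is easy to demonstrate. In this sketch (random bytes stand in for ciphertext), the same payload drops from roughly 8 bits/byte to roughly 6 bits/byte after encoding – comfortably below any fixed “looks encrypted” cutoff tuned for binary data:

```python
import base64
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

payload = os.urandom(65536)          # stands in for encrypted data
encoded = base64.b64encode(payload)  # same information, now a 64-symbol alphabet

print(f"raw ciphertext: {shannon_entropy(payload):.2f} bits/byte")  # ~8.0
print(f"base64 encoded: {shannon_entropy(encoded):.2f} bits/byte")  # ~6.0
```

Both versions are equally useless to the victim, but only the first one trips a threshold set near 8 bits/byte.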

Additional Challenges
Host-level encryption?
There’s also the aspect of legitimate encryption done at the host level (for instance, Databases encrypted by the DB engine itself).
If a file is encrypted on purpose, how would basic ransomware detection techniques distinguish that encryption from malicious encryption? It would just look like garbled data being replaced with equally garbled data – and result in false positives.
It seems logical, then, that a modern ransomware detection mechanism needs to handle both legitimate host-level activity and modern ransomware, dynamically, without relying on fixed thresholds and without assumptions about data types.
Scanning backups/snapshots?
Scanning backups or snapshots has massive additional disadvantages on top of the aforementioned challenges with basic entropy detection.
Scan time depends on the amount of data written, so this approach can’t easily scale in larger environments. And, of course, recovery would rely on coarse-grained backup schedules, resulting in poor RPO – potentially several days.
Scanning at the individual drive level?
When a certain vendor announced this, I had to check whether it was April 1st. Very little intelligence is possible at the drive level: there is no easy way to tie activity to specific files, coordinate with the rest of the infrastructure, or roll back to a specific operation. It is useful for marketing purposes, though, and for claims of “computational storage” – technically the truth 🙂
Combining detection with recovery: The ideal approach.
Ransomware detection needs to be closely tied to recovery.
- Detection needs to happen in real time or as close to it as possible.
- Recovery needs to be able to roll back data very granularly, so that the RPO is from exactly before the data started being encrypted. One should not rely on backups or snapshots, which could be hours or days old.
The HPE Approach is Gen 3: Data-Adaptive Detection
We decided to create several new inventions to radically change how ransomware detection is implemented, and to gain a huge competitive advantage by building a 3rd-generation anomaly detection engine to combat modern threats. This work has also resulted in several patent filings, since we are doing far more than basic entropy calculations 🙂
In broad terms, the HPE solution accomplishes the following things – if you’re considering other approaches, please ask those vendors if they accomplish this:
- Data-Adaptive techniques allow for dynamic calculation of trigger thresholds, massively enhancing accuracy and reducing false positives.
- Automatically detect an attack in near real-time and recover in minutes, with an incredibly tight RPO.
- Ability to detect ransomware activity even in already compressed data or data made to look like text after base64 encoding and (in the future) legitimate, user-encrypted data.
- Ability to detect encryption even if very small amounts are encrypted in each file.
- The ability to identify the source of the infection (down to volumes in a specific server) – allowing even faster response and quarantining.
- Doesn’t care about dataset size – it relies completely on real-time streams.
- No need to use malware signatures.
- No need to use computationally expensive deduplication/compression engines to determine uniqueness or heavy calculations in general – the latency of the detector is only 1-2 microseconds per sample.
- No need to use snapshots or backups to scan, since those wouldn’t provide enough granularity for recovery.
- Agentless, which apart from being easier to implement, also means it’s impervious to malware that attacks backup agents.
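To see why detecting tiny per-file encryption (one of the bullets above) matters so much, consider this sketch: scrambling just 1% of a large text file (random bytes standing in for ciphertext) is enough to ruin it, yet barely moves its whole-file entropy:

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

original = ("All work and no play makes Jack a dull boy. " * 3000).encode()
corrupted = bytearray(original)
slice_len = len(original) // 100               # touch only 1% of the file...
corrupted[:slice_len] = os.urandom(slice_len)  # ...random bytes stand in for ciphertext

print(f"original:  {shannon_entropy(original):.3f} bits/byte")
print(f"corrupted: {shannon_entropy(bytes(corrupted)):.3f} bits/byte")
# The difference is a fraction of a bit -- invisible to whole-file measurements.
```

This is why whole-file or whole-backup measurements miss modern attacks, while fine-grained, per-write analysis does not.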
I will touch upon a few of the cooler aspects, otherwise this article will become as long as a research paper, which would defeat the purpose of simplifying things (plus I need to be careful not to expose too much of the secret sauce) 🙂
Dynamic calculation of thresholds.
This is one of the key and unique aspects of the HPE approach. You see, the typical way of detecting encryption is to calculate entropy and compare it against a fixed randomness threshold.
There are several challenges that massively complicate having a fixed detection threshold:
- Different data types
- Whether the data is compressed already or not
- Insidious tricks like Base64 encoding, which makes encrypted data seem to have less entropy (it converts binary to text) and breaks fixed-threshold detection systems.
The Shannon entropy equation is this (there’s no test later, relax, we will focus on just one thing here):

H = − Σ pᵢ · log₂(pᵢ), summed for i = 1 to n

where pᵢ is the probability of the i-th symbol and “n” is the number of possible symbols.
The big thing we do is dynamically calculate the “n” above, which other vendors don’t do (they assume a fixed, high value – for instance 256 for a byte, since that’s the total number of possible values using 8 bits). The “n” is simply the cardinality of the data “alphabet”, which would naturally differ between plain text, compressed text, normal binaries, etc.
This is a huge deal, since it enables us to dynamically adjust detection thresholds based on the type of data. For instance, if the cardinality turns out to be that of normal text, we lower the detection threshold accordingly.
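HPE’s actual cardinality-estimation algorithm isn’t public, but the effect of a dynamic “n” can be sketched. Here the observed alphabet size sets the entropy ceiling (log₂ n), and data is flagged only when its entropy sits suspiciously close to its own ceiling – the naive distinct-symbol count and the 0.95 factor are illustrative assumptions, not the real secret sauce:

```python
import base64
import math
import os
from collections import Counter

def entropy_and_cardinality(data: bytes):
    """Return (Shannon entropy in bits/byte, number of distinct symbols seen)."""
    counts = Counter(data)
    total = len(data)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h, len(counts)

def looks_encrypted_dynamic(data: bytes, fraction: float = 0.95) -> bool:
    """Flag data whose entropy nears the maximum possible for its own alphabet."""
    h, n = entropy_and_cardinality(data)
    if n < 2:
        return False
    ceiling = math.log2(n)         # max entropy for an n-symbol alphabet
    return h > fraction * ceiling  # threshold adapts to the data itself

ciphertext = os.urandom(65536)
encoded = base64.b64encode(ciphertext)
text = ("To be, or not to be, that is the question. " * 2000).encode()

print(looks_encrypted_dynamic(ciphertext))  # True  (~8.0 out of 8.0 possible)
print(looks_encrypted_dynamic(encoded))     # True  (~6.0 out of ~6.0 possible)
print(looks_encrypted_dynamic(text))        # False (well below its own ceiling)
```

Note how the Base64 trick no longer helps the attacker: the encoded stream is still nearly maximal for its own 64-symbol alphabet.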
In broad terms, we create a histogram of sampled data, make no assumptions about the type of statistical distribution to expect, and calculate the cardinality of the data using a special algorithm (that’s the secret sauce).
The rest is advanced, yet mostly standard stuff:
We then take the resulting “n” for that specific data and calculate a dynamic threshold using an ideal Bernoulli distribution (typical in research circles; the unique thing here is the constant recalculation of “n”, which drives everything else).
We then use a modified entropy calculation (which normalizes the data in a special way) and a reworked Student’s t-test for hypothesis testing, checking the live streaming data (plus a bit of a historical time window) against that dynamically changing threshold.
This results in extremely high confidence detection, regardless of encoding, data types, etc.
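HPE’s modified entropy calculation and reworked t-test are not public, so the following is only a rough sketch of the general idea using a plain one-sample t-statistic: train a per-stream baseline, then compare a window of live entropy samples against it. The window sizes and the 5.0 alert cutoff are illustrative assumptions:

```python
import math
import os
import random
import statistics
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def t_statistic(window, baseline_mean):
    """One-sample t-statistic of a live window against the trained baseline mean."""
    return (statistics.mean(window) - baseline_mean) / (
        statistics.stdev(window) / math.sqrt(len(window))
    )

random.seed(42)
# "Training": entropy of normal writes for this stream (text-like, 64-symbol alphabet)
baseline = [shannon_entropy(bytes(random.choices(range(64), k=4096))) for _ in range(50)]
baseline_mean = statistics.mean(baseline)

# Live window during a simulated attack: near-random, encrypted-looking writes
attack = [shannon_entropy(os.urandom(4096)) for _ in range(20)]

ALERT_CUTOFF = 5.0  # illustrative significance cutoff
print(t_statistic(attack, baseline_mean) > ALERT_CUTOFF)  # True: the shift is statistically unambiguous
```

Because the test compares each stream against its own learned behavior rather than a universal constant, a quiet stream that suddenly trends random stands out immediately.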
To further increase accuracy, the solution trains itself (the initial training time is configurable). That training is automatically done per stream, which further enhances accuracy (doing it globally for the whole system would mean the granularity of specific app behavior would be lost).
RPO down to the last operation before ransomware started encryption, plus identification of infection source:
The other big thing we do is tie ransomware detection tightly to granular recovery. A major HPE technology is Zerto, which provides rapid DR and log-based continuous data protection.
So we plugged our new detector tech right into Zerto, and it is now available with Zerto v10.
This enables things like identifying which servers/files started being encrypted first, and then rolling back to the last known write operation before the encryption started – allowing businesses to recover and quarantine in the best possible way, with the least amount of risk and disruption.
The Cherry on Top: Zerto Cyber Resilience Vault
To further augment the new detection engine, HPE now offers the Zerto Cyber Resilience Vault: a separate, automated, air-gapped and tightly integrated clean-room recovery environment, complete with servers, networking, storage and software.
Here’s a diagram that illustrates the final solution:

For a lot more information on the Cyber Resilience Vault, this is a good paper – this article is long enough already. Note that you would normally have the Vault and the Replication Target in the same site, with Production physically separated.
Summary
I urge everyone to examine their current ransomware recovery strategy and, in light of the information presented in this article, ask whether their solutions would be effective against modern ransomware – or whether they would be more like Corinthian leather 🙂
Ask how a solution protects against the techniques that defeat other detectors, but more importantly, ask what RPO and RTO are on offer. Would you perhaps be losing hours, days or weeks of data?
How much loss is acceptable?
It is clear that HPE is offering realistic approaches to difficult problems – it is all part of a broader strategy that always aims at solving total infrastructure challenges.
For an official paper on the topic, check this out.
D


