The Importance of SSD Firmware Updates

I wanted to bring this crucial issue to light since I’m noticing several storage vendors being either cavalier about this or simply unaware.

I will explain why solutions that don’t offer some sort of automated, live SSD firmware update mechanism are potentially extremely risky propositions. Yes, this is another “vendor hat off, common sense hat on” type of post.

Modern SSD Architecture is Complex

The increased popularity and lower costs of fast SSD media are good things for storage users, but there is some inherent complexity within each SSD that many people are unaware of.

Each modern SSD is, in essence, an entire pocket-sized storage array, that includes, among other things:

  • An I/O interface to the outside world (often two)
  • A CPU
  • An OS
  • Memory
  • Sometimes Compression and/or Encryption
  • What is, in essence, a log-structured filesystem, complete with complex load balancing and garbage collection algorithms
  • An array of flash chips driven in parallel through multiple channels
  • Some sort of RAID protection for the flash chips, including sparing, parity, error checking and correction…
  • A supercapacitor to safely flush cache to the flash chips in case of power failure.

Sounds familiar?

With Great Power and Complexity Come Bugs

To make something clear: This discussion has nothing to do with overall SSD endurance & hardware reliability. Only the software aspect of the devices.

All this extra complexity in modern SSDs means that an increased number of bugs compared to simpler storage media is a statistical certainty. There is just a lot going on in these devices.

Bugs aren’t necessarily the end of the world. They’re something understood, a fact of life, and there’s this magical thing engineers thought of called… Patching!

As a fun exercise, go to the firmware download pages of various popular SSDs and check the release notes for some of the bugs fixed. Many fixes address some rather abject gibbering horrors… 🙂

Even costlier enterprise SSDs have been afflicted by some really dangerous bugs – usually latent defects (as in: they don’t surface until you’ve been using something for a while, which may explain why these bugs were missed by QA).

I fondly remember a bug that hit some arrays at a previous place of employment: the SSDs would work great but after a certain number of hours of operation, if you shut your machine down, the SSDs would never come up again. Or, another bug that hit a very popular SSD that would downsize itself to an awesome 8MB of capacity (losing all existing data of course) once certain conditions were met.

Clearly, these are some pretty hairy situations. And, what’s more, RAID, checksums and node-level redundancy wouldn’t protect against all such bugs.

For instance, think of the aforementioned power off bug – all SSDs of the same firmware vintage would be affected simultaneously and the entire array would have zero SSDs that functioned. This actually happened, I’m not talking about a theoretical possibility. You know, just in case someone starts saying “but SSDs are reliable, and think of all the RAID!”

It’s all about approaching correctness from a holistic point of view. Multiple lines of defense are necessary.

The Rules: How True Enterprise Storage Deals with Firmware

Just like with Fight Club, there are some basic rules storage systems need to follow when it comes to certain things.

  1. Any firmware patching should be a non-event. Doesn’t matter what you’re updating, there should be no downtime.
  2. ANY firmware patching should be a NON-EVENT. Doesn’t matter what you’re updating, there should be NO downtime!
  3. Firmware updates should be automated even when dealing with devices en masse.
  4. The customer should automatically be notified of important updates they need to perform.
  5. Different vintage and vendor component updates should be handled automatically and centrally. And, most importantly: Safely.

If these rules are followed, bug risks are significantly mitigated and higher uptime is possible. Enterprise arrays typically will follow the above rules (but always ask the vendor).

Why Firmware Updating is a Challenge with Some Storage Solutions

Certain kinds of solutions make it inherently harder to manage critical tasks like component firmware updates.

You see, being able to hot-update different kinds of firmware in any given set of hardware means that the mechanism doing the updating must be intimately familiar with the underlying hardware & software combination, however complex.

Consider the following kind of solution, maybe for someone sold on the idea that white box approaches are the future:

  • They buy a bunch of diskless server chassis from Vendor A
  • They buy a bunch of SSDs from Vendor B
  • They buy some Software Defined Storage offering from Vendor C
  • All running on the underlying OS of Vendor D…

Now, let’s say Vendor B has an emergency SSD firmware fix they made available, easily downloadable on their website. Here are just some of the challenges:

  1. How will that customer be notified by Vendor B that such a critical fix is available?
  2. Once they have the fix located, which Vendor will automate updating the firmware on the SSDs of Vendor B, and how?
  3. How does the customer know that Vendor B’s firmware fix doesn’t violently clash with something from Vendor A, C or D?
  4. How will all that affect the data-serving functionality of Vendor C?
  5. Can any of Vendors A, B, C or D orchestrate all the above safely?
  6. With no downtime?

In most cases I’ve seen, the above chain of events will not even progress past #1. The user will simply be unaware of any update, simply because component vendors don’t usually have a mechanism that alerts individual customers regarding firmware.

You could inject a significant permutation here: What if you buy the servers pre-built, including SSDs, from Vendor A, including full certification with Vendors C and D? 

Sure – it still does not materially change the steps above. One of Vendors A, C or D still need to somehow:

  1. Automatically alert the customer about the critical SSD firmware fix being available
  2. Be able to non-disruptively update the firmware…
  3. …While not clashing with the other hardware and software from Vendors A, C and D
I could expand this type of conversation to other things like overall environmental monitoring and checksums but let’s keep it simple for now and focus on just component firmware updates…

Always Remember – Solve Business Problems & Balance Risk

Any solution is a compromise. Always make sure you are comfortable with the added risk certain areas of compromise bring (and that you are fully aware of said risk).

The allure of certain approaches can be significant (at the very least because of lower promised costs). It’s important to maintain a balance between increased risk and business benefit.

In the case of SSDs specifically, the utter criticality of certain firmware updates means that it’s crucially important for any given storage solution to be able to safely and automatically address the challenge of updating SSD firmware.

D