Status and the Nimble acquisition by HPE

Just a quick post since I’m getting inquiries regarding the status of my vital signs.

Yes, I’m alive, just very busy helping with the ongoing integration of Nimble Storage into HPE and all kinds of cool AI.

All kinds of training to be done all around, positioning guides to be written, strategy to be planned, technology to be designed.

The Acquisition

For anyone not aware, Nimble Storage was recently acquired by HPE.

There are several reasons this is a good thing for both companies, but it is also a sobering event for any remaining independent storage vendors.

You see, most remaining standalone storage vendors are not really differentiated, and VCs aren’t investing in this field like they used to. The vast majority of storage vendors will go bankrupt, a few will be acquired (even that is drying up), and a couple may remain independent.

If you’re a customer looking at storage vendors, and independent XYZ vendor is claiming they’re better than Nimble, ask them why they weren’t acquired by HPE (or anyone else) for a tiny fraction of what Nimble cost. Surely that would have been a better deal? Since their product is so superior? 🙂

Nimble Support Will Remain Awesome

Nimble Support will absolutely not become part of the overall HPE support organization.

Indeed, the final decision has been made, and Nimble Support will stay a part of Nimble Engineering (which is not where most storage vendors keep Support, but maybe that’s part of the reason support problems are so endemic in other companies).

Regarding the rest of HPE Support: The master plan is to use Nimble methodologies and technologies to improve and optimize HPE support for all HPE products.

InfoSight Will Cover Other HPE Products

Our AI & Analytics framework, InfoSight, is one of the ways our Support organization enjoys a massively unfair advantage in resolving difficult problems across the entire infrastructure (not just the arrays – statistically, we resolve and prevent more problems that originate outside storage than within it).

The plan is to have InfoSight go far beyond Nimble systems: to 3PAR, servers, networking… Because, you see, HPE makes all kinds of interesting things that can power an entire data center.

Plus, we will expand InfoSight to applications and the Cloud, maybe even areas such as protection against cybercrime like ransomware.

Having InfoSight be the AI for the Data Center is a pretty exciting future to live in and I’m proud to be a part of this effort.

Nimble Storage Complements the HPE Portfolio

With the acquisition of Nimble, HPE acquired an immensely capable, modern storage platform. A platform universally loved by its customers.

Of course, there are certain things Nimble Storage arrays don’t do today, but other HPE platforms do, and vice versa. Such is the power of a portfolio.

All in all, Nimble/HPE will now be able to create even more comprehensive solutions than before.

The Combined Might of HPE and Nimble is Colossal

HPE is a huge company, and enjoys certain advantages as such:

  • #1 in servers (incl. #1 in HPC)
  • Top 2 in storage, BURA (disk and tape), networking (wired and wireless)
  • Amazing supply chain
  • Innovation (check out The Machine, and Composable Infrastructure for just two cool examples)

I can’t give more details in this public forum but imagine what having access to HPE’s supply chain can do (including buying stuff in huge quantities and all that implies, plus early access to certain exotic technologies to help us improve the Nimble platform).

Plus there is the element of hugely expanded reach. With Nimble, HPE gets the chance to touch customers HPE wouldn’t normally speak to. Conversely, Nimble also gets a chance to touch customers Nimble wouldn’t normally speak to.

It’s About The People

Good acquisitions aren’t about just the technology. People build the technology. People sell the technology. People support the technology (well, AI does a lot of the support in our case, but you get the point).

HPE just hired, en masse, an amazing and aggressive team of IT professionals.

Nimble has around 12,000 customers at the time of this writing (several times more than any of the remaining independent storage vendors, my alma mater excluded), and each of those customers meant taking the sale away from an incumbent storage vendor that had to fight us tooth and nail.

A strong salesforce is necessary to do this, not just a strong product.

This strong salesforce will now have access to the entire HPE portfolio…

Fun times ahead.

Progress Needs Trailblazers

I got the idea for this stream-of-consciousness (and possibly too obvious) post after reading some comments regarding new ways to do high speed I/O. Not something boring like faster media or protocols, but rather far more exotic approaches that require drastically rewriting applications to get the maximum benefits of radically different architectures.

The comments, in essence, stated that such advancements would be an utter waste of time since 99% of customers have absolutely no need for such exotica, but rather just need cheap, easy, reliable stuff, and can barely afford to buy infrastructure to begin with, let alone custom-code applications to get every last microsecond of latency out of gear.

Why Trailblazers Matter

It’s an interesting balancing game: There’s a group of people that want progress and innovation at all costs, and another group of people that think the status quo is OK, or maybe just “give me cheaper/easier/more reliable and I’m happy”.

Both groups are needed, like Yin and Yang.

The risk-averse are needed because without them it would all devolve into chaos.

But without people willing to take risks and do seemingly crazy and expensive things, progress would simply be much harder.

Imagine if the people that prevailed were the ones that thought trepanning and the balancing of humours were state-of-the-art medicine and needed no further advancement. Perchance the invention of MRI machines might have taken longer?

If traditional planes had been deemed sufficient, would a daring endeavor like the SR-71 ever have been undertaken?

Advanced Tech Filters Down to the Masses

It’s interesting how the crazy, expensive and “unnecessary” eventually manages to become part of everyday life. Think of commonplace technologies like:

  • ABS brakes
  • SSDs
  • 4K TVs
  • High speed networking
  • Smartphones
  • Virtualization
  • Having more than 640KB of RAM in a PC

They all initially cost a ton of money and/or were impractical for everyday use (the first ABS brakes were in planes!).

What is Impractical or Expensive Now Will be Normal Tomorrow…

…but not every new tech will make it. Which is perfectly normal.

  • Rewriting apps to take advantage of things like containers, microservices and fast object storage? It will slowly happen. Most customers can simply wait for the app vendors to do it.
  • In-memory databases? Not a niche any more, when even the ubiquitous SQL Server is doing it…
  • Using advanced and crazy fast byte-addressable NVDIMM in storage solutions? Some mainstream vendors like Nimble and Microsoft are already doing it. Others are working on some really interesting implementations.
  • AI, predictive analytics, machine learning? Already happening and providing huge value (for example Nimble’s InfoSight).

Don’t Belittle the Risk Takers!

Even if you think the crazy stuff some people are working on isn’t a fit for you, don’t belittle it. It can be challenging to hold your anger, I understand, especially when the people working on crazy stuff are pushing their viewpoint as if it will change the world.

Because you never know, it just might.

If you don’t agree – get out of the way, or, even better, help find a better way.

Just Because it’s New to You, Doesn’t Mean it’s Really New, or Risky

Beware of turning into an IT Luddite.

Just because you aren’t using a certain technology doesn’t make the technology new or risky. If many others are using it, and reaping far more benefits than you are with your current tech, then maybe they’re onto something… 🙂

At what point does your resistance to change become a competitive disadvantage?

Tempered, of course, by the question: do the benefits of adopting the new technology outweigh its cost?

Stagnation is Death

…or, at a minimum, boring. Τα πάντα ρει (“everything flows”), as Heraclitus said. I, for one, am looking forward to the amazing innovations coming our way, and I actively participate in trying to make them come to fruition faster.

What will you do?

D

Going Green: Why I Joined Nimble Storage

I am proud to announce that, as of today, I am a member of the Nimble Storage team.

This marks the end of an era – I spent quite a bit of time at NetApp: learned a lot, did a lot – by the end I had my hands in all kinds of sausage making… 🙂

I wish my friends at NetApp the best of luck for the future. The storage industry is a very tough arena, and one that will only get tougher and less forgiving than ever before.

Why?

I compared Nimble Storage with many competitors before making my decision. Quite simply, Nimble’s core values agree with mine. It goes without saying that I wouldn’t choose to move to a company unless I believed they had the best technology (and the best support), but the core values are where it all starts. The product is built upon those core values.

I firmly believe that modern storage should be easy to consume. Indeed, it should be a joy to consume, even for complex multi-site environments. It should not be a burden to deal with. Nor should it be a burden to sell.

Systems that are holistically easy to consume have several business benefits, some of which are: lower OPEX and CAPEX, increased productivity, less risk, easier planning, faster problem resolution.

It’s important to understand that easy to consume is not at all the same as easy to use – that is but a very small subset of easy consumption.

The core value of easy consumption encompasses several aspects, many of which are ignored by most storage vendors. Most modern players will focus on ease of use, show demos of a pretty GUI and suchlike. “Look how easy it is to install” or “look how easy it is to create a LUN”. Well – there’s a lot more to worry about in real life.

The lifecycle of a storage system

Beyond initial installation and simple element creation, there is a multitude of things users and vendors need to be concerned with. Here’s a partial list:

  • Installation
  • Migration to/from
  • Provisioning
  • Host/fabric configuration
  • Backups, restores, replication
  • Scaling up/out
  • Upgrading from a smaller/older version
  • Firmware updates for all components (including drives)
  • Tech refresh
  • Support

What about more advanced topics?

A storage solution cannot exist in a vacuum. There are several ancillary (but extremely important) services needed in order to help consume storage, especially at scale. Services that typically cannot (and many that should not) reside on the storage system itself. How about…

  • Initial and future sizing
  • Capacity planning based on long-term usage data
  • Performance analysis and profiling
  • Performance issue resolution/recommendations
  • Root cause analysis
  • What-if scenario modeling
  • Support case resolution
  • Comprehensive end-to-end monitoring and alerting
  • Comprehensive reporting (including auditing)
  • Security (including RBAC and delegation)
  • Upgrade planning
  • Pervasive automation (including host-side)
  • Ensuring adherence to best practices

If a storage solution doesn’t make all or most of the above straightforward, then it is not truly easy to consume.

The problem:

Storage vendors are typically either lacking in many of the above areas, or they need many different tools and significant manual effort in order to provide even some of these services.

Not having the tools creates an obvious problem – the customer and vendor simply can’t perform these functions, or the implementation is too basic. Most smaller vendors are in this camp. Not much functionality beyond what’s inside the storage device itself. Long-term consumption, especially at scale, becomes challenging.

On the other hand, having a multitude of tools to help with these areas also makes the solution hard to consume overall. Larger vendors fall into this category.

For instance: Customers may need to access many different tools just to monitor and alert on various metrics. One tool may provide certain information, another tool provides different metrics (often with significant overlap with the first tool), and so on. And not all tools work with all versions of the product. This increases administrative complexity and overall time and money spent. And the end result is often compromised and incredibly hard to support.

Vendors that need many different tools also create a problem for themselves: Almost nobody on staff will have the expertise to deal with the plethora of tools necessary to do certain things like sizing, performance troubleshooting or even a tech refresh. Or optimizing a product for specific workloads. Deep expertise is often needed to interpret the results of the tools. This causes interminable delays in problem resolution, lengthens sales cycles, complicates product development, creates staffing challenges, increases costs, and in general makes life miserable.

How?

What always fascinated me about Nimble Storage is that not only did they recognize these challenges, they actually built an entire infrastructure and took an innovative approach in order to solve them.

Nimble recognized the value of Predictive Analytics.

The challenge: How to use Big Data to solve the challenges faced by storage customers and storage vendors. And how to do this in a way that achieves a dramatically better end result.

While most vendors have call-home features, and some even have rudimentary capacity, configuration and maybe even performance telemetry being sent to some central repository (usually very infrequently), Nimble elected instead to send extremely comprehensive sensor telemetry to a huge analytics engine. A difficult undertaking, but one that would define the company in the years to come.

Nimble also recognized the need to do this from the very beginning. Each Nimble array sends 30-70 million data points back to Nimble every day. Trying to retrofit telemetry of this scope would be extremely difficult if not impossible to achieve effectively.

This wealth of data (the largest storage-related analytics engine in the world, by far) is used to help customers with the challenges mentioned previously, while at the same time lowering complexity.

It also, crucially, helps Nimble better support customers and design better products without having to bother customers for data dumps.

For example: What if a Nimble engineer trying to optimize SQL I/O performance wants to see detailed I/O statistics only for SQL workloads on all Nimble arrays in the world? Or on one array? Or on all arrays at a certain customer? It’s only a simple query away… and that’s just scratching the surface of what’s possible. It certainly beats trying to design storage based on arbitrary synthetic benchmarks, or manually grabbing performance traces from customer gear…
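
To make that concrete, here is a minimal sketch of what such a query could look like against a flattened export of the telemetry. It is purely illustrative – the file, column names and workload tags below are invented for the example and are not InfoSight’s actual schema or API.

    import pandas as pd

    # Hypothetical flattened telemetry export: one row per array per sensor interval.
    # Every column name here is made up for illustration.
    telemetry = pd.read_parquet("sensor_data.parquet")

    # "Detailed I/O statistics only for SQL workloads, on all arrays in the world":
    sql_io = telemetry[telemetry["workload_tag"] == "sql-server"]

    # ...or narrow it down to one customer, or one array:
    one_customer = sql_io[sql_io["customer_id"] == "ACME"]
    one_array = sql_io[sql_io["array_id"] == "AF-12345"]

    # Summarize the I/O profile per array.
    profile = sql_io.groupby("array_id")[["iops", "io_size_kb", "read_pct", "latency_ms"]].mean()
    print(profile.head())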

What?

Enter InfoSight. That’s the name of the gigantic analytics engine currently ingesting trillions of anonymized sensor data points every week. And growing. Check some numbers here.
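
Those two figures hang together, by the way. Taking the midpoint of the per-array number quoted earlier and an assumed fleet size (my own round number, purely for illustration), you land squarely in the trillions-per-week range:

    # Order-of-magnitude check; the fleet size is an assumption, not an official figure.
    points_per_array_per_day = 50_000_000   # midpoint of the 30-70 million quoted above
    assumed_arrays = 10_000                 # hypothetical fleet size

    per_week = points_per_array_per_day * assumed_arrays * 7
    print(f"{per_week:.1e} data points per week")   # 3.5e+12 – indeed "trillions"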

Nimble Storage customers do not need to install custom monitoring tools to perform highly advanced storage analytics and performance troubleshooting, or even to get hardware upgrade recommendations based on automated performance analysis heuristics.

No need to use the CLI, no need to manually send data dumps to the vendor, no need to use 10 different tools.

All the information customers need is available through a browser GUI. Even the vast majority of support cases are automatically handled by InfoSight, and I’m not talking about simply sending replacement hardware (that’s trivial).

I always saw InfoSight as the core offering of Nimble Storage, the huge differentiator that works hand in hand with the hardware and helps customers consume storage easily. Yes, Nimble Storage arrays are fast, reliable, easy to use, have impressive data reduction abilities, scale nicely, have great features, are cost-effective etc. But other vendors can claim they can satisfy at least some of those attributes.

Nobody else has anything even remotely approaching the depth and capability of InfoSight. This is why Nimble calls their offering the Predictive Flash Platform. InfoSight Predictive Analytics + great hardware = Predictive Flash.

I will be covering this fascinating topic in a lot more depth in the future. An AI Expert System powered by a behemoth analytics engine, helping reduce complexity and making the solution Easy To Consume, is a pretty impressive piece of engineering.

Watch this space…

D

Are you doing a disservice to your company with RFPs?

Whether we like it or not, RFPs (Request For Proposal) are a fact of life for vendors.

It usually works like this: A customer has a legitimate need for something. They decide (for whatever reason) to get bids from different vendors. They then craft an RFP document that is either:

  1. Carefully written, with the best intentions, so that they get the most detailed proposal possible given their requirements, or
  2. Carefully tailored by them and the help of their preferred vendor to box out the other vendors.

Both approaches have merit, even if #2 seems unethical and almost illegal. I understand that some people are just happy with what they have, so they word their document to block anyone from changing their environment, turning the whole RFP process into an exercise in futility. I doubt that whatever I write here will change that kind of mindset.

However – I want to focus more on #1. The carefully written RFP that truly has the best intentions (and maybe some of it will rub off on the #2 “blocking” RFP type folks).

Here’s the major potential problem with the #1 approach:

You don’t know what you don’t know. For example, maybe you are not an expert on how caching works at a very low level, but you are aware of caching and what it does. So – you know that you don’t know about the low-level aspects of caching (or whatever other technology) and word your RFP so that you learn in detail how the various vendors do it.

The reality is – there are things whose existence you can’t even imagine – indeed, most things:

By crafting your RFP around things you are familiar with, you are potentially (and unintentionally) eliminating solutions that may do things that are entirely outside your past experiences.

Back to our caching example – suppose you are familiar with arrays that need a lot of write cache in order to work well for random writes, so you put in your storage RFP requirements about very specific minimum amounts of write cache.

That’s great and absolutely applicable to the vendors that write to disk the way you are familiar with.

But what if someone writes to disk entirely differently than what your experience dictates and doesn’t need large amounts of write cache to do random writes even better than what you’re familiar with? What if they use memory completely differently in general?
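
As one hedged illustration of “entirely differently”: a log-structured, write-coalescing design buffers incoming random writes in a small amount of NVRAM and flushes them to disk as large sequential stripes, so its write-cache needs don’t scale the way a traditional write-in-place array’s do. Here is a toy sketch of the general idea, not any particular vendor’s implementation:

    # Toy write-coalescing sketch: random host writes are appended to a small buffer
    # and flushed to the back end as one large sequential stripe when the buffer fills.
    STRIPE_BYTES = 4 * 1024 * 1024   # example full-stripe size

    class CoalescingWriter:
        def __init__(self, disk_log):
            self.buffer = []          # (block_id, data) pairs, in arrival order
            self.buffered_bytes = 0
            self.disk_log = disk_log  # stand-in for an append-only sequential region

        def write(self, block_id, data):
            self.buffer.append((block_id, data))
            self.buffered_bytes += len(data)
            if self.buffered_bytes >= STRIPE_BYTES:
                self.flush()

        def flush(self):
            # One big sequential write instead of many small random ones.
            self.disk_log.extend(self.buffer)
            self.buffer.clear()
            self.buffered_bytes = 0

    disk = []
    writer = CoalescingWriter(disk)
    writer.write(block_id=42, data=b"x" * 4096)   # many such writes coalesce into one flush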

Another example where almost everyone gets it wrong is specifying performance requirements. Unless you truly understand the various parameters that a storage system needs in order to properly be sized for what you need, it’s almost guaranteed the requirements list will be incomplete. For example, specifying IOPS without an I/O size and read/write blend and latency and sequential vs random – at a minimum – will not be sufficient to size a storage system (there’s a lot more here in case you missed it).
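
A quick back-of-the-envelope sketch shows why a bare IOPS number is meaningless on its own. The figures are made up, but the same IOPS requirement implies wildly different bandwidth depending on I/O size alone:

    # Same "50,000 IOPS" requirement, two very different workloads.
    def throughput_mb_s(iops, io_size_kb):
        return iops * io_size_kb / 1024.0

    iops = 50_000
    for io_size_kb in (4, 64):
        print(f"{iops} IOPS at {io_size_kb}KB = {throughput_mb_s(iops, io_size_kb):,.0f} MB/s")

    # -> 50000 IOPS at 4KB = 195 MB/s
    # -> 50000 IOPS at 64KB = 3,125 MB/s
    # A 16x difference in bandwidth for the "same" requirement – before even considering
    # read/write blend, latency targets, or random vs. sequential access.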

By setting an arbitrary limit to something that doesn’t apply to certain technologies, you are unintentionally creating a Type #2 RFP document – and you are boxing out potentially better solutions, which is ultimately not good for your business. And by not providing enough information, you are unintentionally making it almost impossible for the solution providers to properly craft something for you.

So what to do to avoid these RFP pitfalls?

Craft your RFP by asking questions about solving the business problem, not by trying to specify how the vendor should solve the business problem.

For example: Say something like this about space savings:

“Describe what, if any, technologies exist within the gizmo you’re proposing that will result in the reduction of overall data space consumption. In addition, describe what types of data and what protocols such technologies can work with, when they should be avoided, and what, if any, performance implications exist. Be specific.”

Instead of this:

“We need the gizmo to have deduplication that works this way with this block size plus compression that uses this algorithm but not that other one”.

Or, say something like this about reliability:

“Describe the technologies employed to provide resiliency of data, including protection from various errors, like lost or misplaced writes”.

Instead of:

“The system needs to have RAID10 disk with battery-backed write cache”.

It’s not easy. Most of us try to solve the problem and have at least some idea of how we think it should be solved. Just try to avoid that instinct while writing the RFP…

 

D

NetApp delivers 2TB/s performance to giant supercomputer for big data

(Edited: My bad – it was 2TB/s, up from 1.3TB/s; the solution has grown and been upgraded since. Also, this post talks about the E5400; the newer E5600 is much faster.)

What do you do when you need so much I/O performance that no one single storage system can deliver it, no matter how large?

To be specific: What if you needed to transfer data at over 1TB per second (or 2TB/s, as it eventually turned out to be)?

That was the problem faced by the U.S. Department of Energy (DoE) and their Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL), one of the fastest supercomputing systems on the planet.

You can read the official press release here. I wanted to get more into the technical details.

People have been talking a lot about “big data” recently – no clear definition seems to exist, but in my opinion it’s something that has some of the following properties:

  • Too much data to be processed by a “normal” computer or cluster
  • Too much data to work with using a relational DB
  • Too much data to fit in a single storage system for performance and/or capacity reasons – or maybe just simply:
  • Too much data to process using traditional methods within an acceptable time frame

Clearly, this is a bit loose – how much is “too much”? How long is “too long”? For someone only armed with a subnotebook computer, “too much” does not have the same meaning as for someone rocking a 12-core server with 256GB RAM and a few TB of SSD.

So this definition is relative… but in some cases, such as the one we are discussing, absolute – given the limitations of today’s technology.

For instance, the amount of storage LLNL required was several tens of PB in a single storage pool that could provide unprecedented I/O performance to the tune of 2TB/s. Both size and performance needed to be scalable. It also needed to be reliable and fit within a reasonable budget and not require extreme space, power and cooling. A tall order indeed.

This created some serious logistics problems regarding storage:

  • No single disk array can hold that amount of data
  • No single disk array can perform anywhere close to 2TB/s

Let’s put this in perspective: The storage systems that scale the biggest are typically scale-out clusters from the usual suspects of the storage world (we make one, for example). Even so, they max out at fewer PB than this deployment required.

The even bigger problem is that a single large scale-out system can’t really deliver more than a few tens of GB/s under optimal conditions – more than fast enough for most “normal” uses but utterly unacceptable for this case.

The only realistic solution to satisfy the requirements was massive parallelization, specifically using the NetApp E-Series for the back-end storage and the Lustre cluster filesystem.

 

A bit about the solution…

Almost a year ago NetApp purchased the Engenio storage line from LSI. That storage line is resold by several companies like IBM, Oracle, Quantum, Dell, SGI, Teradata and more. IBM also resells the ONTAP-based FAS systems and calls them “N-Series”.

That purchase has made NetApp the largest provider of OEM arrays on the planet by far. It was a good deal – very rapid ROI.

There was a lot of speculation as to why NetApp would bother with the purchase. After all, the ONTAP-based systems have a ton more functionality than pretty much any other array and are optimized for typical mostly-random workloads – DBs, VMs, email, plus megacaching, snaps, cloning, dedupe, compression, etc – all with RAID6-equivalent protection as standard.

The E-Series boxes on the other hand don’t do thin provisioning, dedupe, compression, megacaching… and their snaps are the less efficient copy-on-first-write instead of redirect-on-write. So, almost the anti-ONTAP 🙂
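
For anyone unfamiliar with that distinction, here is a toy sketch (not either product’s actual implementation) of why copy-on-first-write is the less efficient approach: the first overwrite of a block after a snapshot must copy the old data aside before writing in place, while redirect-on-write simply writes the new data to a fresh location and updates a pointer.

    # Toy model: physical storage is a dict of location -> data.
    # The live volume maps logical block -> physical location.

    def cow_overwrite(store, live_map, snap_copies, block_id, new_data, free_locs):
        """Copy-on-first-write: the first overwrite after a snapshot copies old data aside."""
        if block_id not in snap_copies:                  # first write since the snapshot?
            copy_loc = free_locs.pop()
            store[copy_loc] = store[live_map[block_id]]  # extra read + extra write
            snap_copies[block_id] = copy_loc
        store[live_map[block_id]] = new_data             # then overwrite in place
        # worst case per host write: 1 read + 2 writes

    def row_overwrite(store, live_map, block_id, new_data, free_locs):
        """Redirect-on-write: old data stays put; the snapshot keeps pointing at it."""
        new_loc = free_locs.pop()
        store[new_loc] = new_data                        # 1 write to a fresh location
        live_map[block_id] = new_loc                     # plus a pointer update
        # no copy of the old block is ever needed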

The first reason for the acquisition was that, on purely financial terms, it was a no-brainer deal even if one sells shoes for a living, let alone storage. Even if there were no other reasons, this one would be enough.

Another reason (and the one germane to this article) was that the E-Series has a tremendous sustained sequential performance density. For instance, the E5400 system can sustain about 4GB/s in 4U (real GB/s, not out of cache), all-in. That’s 4U total for 60 disks including the controllers. Expandable, of course. It’s no slouch for random I/O either, plus you can load it with SSDs, too… 🙂 (Update: the newer E5600 can go up to 12GB/s in 2U with SSDs!)

Again, note – 60 drives per 4U shelf and that includes the RAID controllers, batteries etc. In addition, all drives are front-loading and stay active while servicing the shelf – as opposed to most (if not all) dense shelves in the market that need the entire (very heavy) shelf pulled out and/or several drives offlined in order to replace a single failed drive… (there’s some really cool engineering in the shelf to do this without thermal problems, performance loss or vibrations). All this allows standard racks and no fear of the racks tipping over while servicing the shelves 🙂 (you know who you are!)

There are some vendors that specialize purely in sequential I/O (and tipping racks) – yet they have about 3-4x less performance density than the E5400, even though they sometimes have higher per-controller throughput. In a typical marketing exercise, some of our more usual competitors have boasted 2GB/s/RU for their controllers, meaning that the controllers (which take up 4U in that example) can do 8GB/s – but achieving that requires all kinds of extra rack space (extra UPSes, several shelves, etc.), making their resulting actual throughput well under 1GB/s/RU. Not to mention the cost (those systems are typically more expensive than an E5400), which matters for projects of the scale we are talking about.

Most importantly, what we accomplished at the LLNL was no marketing exercise…

 

The benefits of truly high performance density

Clearly, if your requirements are big enough, you end up spending a lot less money and needing a lot less rack space, power and cooling by going with a highly performance-dense solution.

However, given the requirements of the LLNL, it’s clear that you can’t use just a single E5400 to satisfy the performance and capacity requirements of this use case. What you can do though is use a bunch of them in parallel… and use that massive performance density to achieve about 40GB/s per industry-standard rack with 600x high-capacity disks (1.8PB raw per rack).
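
A rough sanity check of those per-rack numbers, using only figures already quoted in this post plus two labeled assumptions (a 40U usable rack and 3TB drives):

    gb_s_per_block = 4          # sustained GB/s per 4U E5400 building block (quoted above)
    drives_per_block = 60       # drives per 4U building block (quoted above)
    blocks_per_rack = 40 // 4   # ten 4U building blocks per 40U rack (assumption)

    print(blocks_per_rack * gb_s_per_block)               # 40 GB/s per rack, as quoted
    print(blocks_per_rack * drives_per_block)             # 600 drives per rack, as quoted
    print(blocks_per_rack * drives_per_block * 3 / 1000)  # ~1.8PB raw, assuming 3TB drives

    print(2048 / 40)  # ~51 racks of arrays to approach 2TB/s (2048GB/s) at the block level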

For even higher performance per rack, the E5400 can use the faster SAS or SSD drives – 480 drives per rack (up to 432TB raw), providing 80GB/s reads/60GB/s writes.

 

Enter the cluster filesystem

So, now that we picked the performance-dense, reliable, cost-effective building block, how do we tie those building blocks together?

The answer: By using a cluster filesystem.

Loosely defined, a cluster filesystem is simply a filesystem that can be accessed simultaneously by the servers mounting it. In addition, it also typically means it can span storage systems and make them look as one big entity.

It’s not a new concept – there are several examples, old and new: AFS, Coda and GPFS, plus the more prevalent StorNext and Lustre.

The LLNL picked Lustre for this project. Lustre is a distributed filesystem that spreads I/O across multiple Object Storage Servers, each connected to its own storage (Object Storage Targets). Metadata is served by dedicated servers that are not part of the I/O stream and thus not a bottleneck. See below for a picture (courtesy of the Lustre manual) of how it is all connected:

 

Lustre Scaled Cluster

 

High-speed connections are used liberally for lower latency and higher throughput.

A large file can reside on many storage servers, and as a result I/O can be spread out and parallelized.

Lustre clients see a single large namespace and run a proprietary protocol to access the cluster.
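
To illustrate how that parallelism works conceptually, here is a tiny sketch of round-robin file striping across Object Storage Targets. The stripe size and count are made-up example values, not the actual LLNL configuration.

    # Conceptual round-robin striping: which OST holds a given byte of a file?
    STRIPE_SIZE = 1 * 1024 * 1024   # 1MB stripes (example value)
    STRIPE_COUNT = 8                # file striped across 8 OSTs (example value)

    def ost_for_offset(offset_bytes):
        stripe_index = offset_bytes // STRIPE_SIZE
        return stripe_index % STRIPE_COUNT   # consecutive stripes land on consecutive OSTs

    # A large sequential read therefore fans out across all 8 OSTs (and the arrays
    # behind them) instead of hammering a single one.
    for offset in range(0, 8 * STRIPE_SIZE, STRIPE_SIZE):
        print("stripe", offset // STRIPE_SIZE, "-> OST", ost_for_offset(offset))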

It sounds good in theory – and it delivered in practice: 1.3TB/s sustained performance was demonstrated to the NetApp block devices. Work is ongoing to finalize the testing with the complete Lustre environment. Not sure what the upper limit would be. But clearly it’s a highly scalable solution.

 

Putting it all together

NetApp has fully realized solutions for the “big data” applications out there – complete with the products and services needed for each engagement. The Lustre solution employed by the LLNL is just one of the options available. There are others for Hadoop, full-motion uncompressed HD video, and more.

So – how fast do you need to go?

D

 

 
