(edit: fixed the images)
After a long hiatus, we return to our regularly scheduled programming with a 2-part series that will address some wild claims Oracle has been making recently.
I’m pleased to introduce Jeffrey Steiner, ex-Oracle employee and all-around DB performance wizard. He helps some of our largest customers with designing high performance solutions for Oracle DBs:
Greetings from a guest-blogger.
I’m one of the original NetApp customers.
I bought my first NetApp in 1995 (I have a 3-digit support case in the system) and it was an F330. I think it came with 512MB SCSI drives, and maxed out at 16GB. It met our performance needs, it was reliable, and it was cost effective. I continued to buy more over the following years at other employers. We must have been close to the first company to run Oracle databases on NetApp storage. It was late 1999. Again, it met our performance needs, it was reliable, and it was cost effective. My employer immediately prior to joining NetApp was Oracle.
I’m now with NetApp product operations as the principal architect for enterprise solutions, which usually means a big Oracle database is involved, but it can also include DB2, SAS, MongoDB, and others.
I normally ignore competitive blogs, and I had never commented on a blog in my life until I ran into something entitled “Why your NetApp is so slow…” and found this statement:
If an application such MS SQL is writing data in a 64k chunk then before Netapp actually writes it on disk it will have to split it into 16 different 4k writes and 16 different disk IOPS
That’s just openly false. I tried to correct the poster, but was met with nothing but other unsubstantiated claims and insults to the product line. It was clear the blogger wasn’t going to acknowledge their false premise, so I asked Dimitris if I could borrow some time on his blog.
Here’s one of the alleged results of this behavior with ONTAP– the blogger was nice enough to do this calculation for a system reading at 2.6GB/s:
I’m not sure how to interpret this. Are they saying that this alleged horrible, awful design flaw in ONTAP leads to customers buying 50X more drives than required, and our evil sales teams have somehow fooled our customer based into believing this was necessary? Or, is this a claim that ZFS arrays have some kind of amazing ability to use 50X fewer drives?
Given the false premise about ONTAP chopping up any and all IO’s into little 4K blocks and spraying them over the drives, I’m guessing readers are supposed to believe the first interpretation.
Ordinarily, I enjoy this type of marketing. Customers bring this to our attention, and it allows us to explain how things actually work, plus it discredits the account team who provided the information. There was a rep in the UK who used to tell his customers that Oracle had replaced all competing storage arrays in their OnDemand centers with Pillar. I liked it when he said stuff like that. The reason I’m responding is not because I care about the existence of the other blog, but rather that I care about openly false information being spread about how ONTAP works.
How does ONTAP really work?
Some of NetApp’s marketing folks might not like this, but here’s my usual response:
Why does it matter?
It’s an interesting subject, and I’m happy to explain write tetrises and NVMEM write coalescence, and core utilization, but what does that have to do with your business? There was a time we dealt with accusations that NetApp was slow because we has 25 nanometer process CPU’s while the state of the art was 17nm or something like that. These days ‘cores’ seems to come up a lot, as if this happens:
That’s the Brawndo approach to storage sales (https://www.youtube.com/watch?v=Tbxq0IDqD04)
“Our storage arrays contain
5 kinds of technology
which make them AWESOME
unlike other storage arrays which are
A Better Way
I prefer to promote our products based on real business needs. I phrase this especially bluntly when talking to our sales force:
When you are working with a new enterprise customer, shut up about NetApp for at least the first 45 minutes of the discussion
I say that all the time. Not everyone understands it. If you charge into a situation saying, “NetApp is AWESOME, unlike EMC who is NOT AWESOME” the whole conversation turns into PowerPoint wars, links to silly blog articles like the one that prompted this discussion, and whoever wins the deal will win it based on a combination of luck and speaking ability. Providing value will become secondary.
I’m usually working in engineeringland, but in major deals I get involved directly. Let’s say we have a customer with a database performance issue and they’re looking for new storage. I avoid PowerPoint and usually request Oracle AWR/statspack data. That allows me to size a solution with extreme accuracy. I know exactly what the customer needs, I know their performance bottlenecks, and I know whatever solution I propose will meet their requirements. That reduces risk on both sides. It also reduces costs because I won’t be proposing unnecessary capabilities.
None of this has anything to do with who’s got the better SPC-2 benchmark, unless you plan on buying that exact hardware described, configuring it exactly the same way, and then you somehow make money based on running SPC-2 all day.
Here’s an actual Oracle AWR report from a real customer using NetApp. I have pruned the non-storage related parameters to make it easier to read, and I have anonymized the identifying data. This is a major international insurance company calculating its balance sheet at end-of-month. I know of at least 9 or 10 customers that have similar workloads and configurations.
Look at the line that says “Physical reads”. That’s the blocks read per second. Now look at “Std Block Size”. That’s the block size. This is 90K physical block reads per second, which is 90K IOPS in a sense. The IO is predominantly db_file_scattered_read, which counter-intuitively is sequential IO. A parameter called db_file_multiblock_read_count is set to 128. This means Oracle is attempting to read 128 blocks at a time, which equates to 1MB block sizes. It’s a sequential IO read of a file.
Here’s what we got:
1) 89K read “IOPS”, sort of.
2) Those 89K read IOPS are actually packaged as units of 8 blocks in a single 64k unit.
3) 3K write IOPS
4) 8MB/sec of redo logging.
The most important point here is that the customer needed about 800MB/sec of throughput, they liked the cost savings of IP, and the storage system is meeting their needs. They refresh with NetApp on occasion, so obviouly they’re happy with the TCO.
To put a final nail in the coffin of the Oracle blogger’s argument, if we are really doing 89K block reads/sec, and those blocks are really chopped up into 4k units, that’s a total of about 180,000 4k IOPS that would need to be serviced at the disk layer, per the blogger’s calculation.
- Our opposing blogger thinks that would require about 1000 disks in theory
- This customer is using 132 drives in a real production system.
There’s also a ton of other data on those drives for other workloads. That’s why we have QoS – it allows mixed workloads to play nicely on a single unified system.
To make this even more interesting, the data would have been randomly written in 8k units, yet they are still able to read at 800MB/sec? How is this possible? For one, ONTAP does NOT break up individual IO’s into 4k units. It tries very, very hard to never break up an IO across disks, although that can happen on occasion, notably if you fill you system up to 99% capacity or do something very much against best practices.
The main reason ONTAP can provide good sequential performance with randomly written data is the blocks are organized contiguously on disk. Strictly speaking, there is a sort of ‘fragmentation’ as our competitors like to say, but it’s not like randomly spraying data everywhere. It’s more like large contiguous chunks of data are evenly distributed across the disks. As long as those contiguous segments are sufficiently large, readahead can ensure good throughput.
That’s somewhat of an oversimplification, but it would take a couple hours and a whiteboard to explain the complete details. 20+ years of engineering can’t exactly be summarized in a couple paragraphs. The document misrepresented by the original blog was clearly dated 2006 (and that was to slightly refresh the original posting back in the nineties), and while it’s still correct as far as I can see, it’s also lacking information on the enhancements and how we package data onto disks.
By the way, this database mentioned above? It’s virtualized in VMware too.
Why did I pick an example of only 90K IOPS? My point was this customer needed 90K IOPS, so they bought 90K IOPS.
If you need this performance:
then let us know. Not a problem. This is from a large SAP environment for a manufacturing company. It beats me what they’re doing, because this is about 10X more IO than what we typically see for this type of SAP application. Maybe they just built a really, really good network that permits this level of IO performance even though they don’t need it.
In any case, that’s 201,734 blocks/sec using a block size of 8k. That’s about 2GB/sec, and it’s from a dual-controller FAS3220 configuration which is rather old (and was the smallest box in its range when it was new).
Sing the bizarro-universe math from the other blog, these 200K IOPS would have been chopped up into 4k blocks and require a total of 400K back-end disk IOPS to service the workload. Divided by 125 IOPS/drive, we have a requirement for 3200 drives. It was ACTUALLY using more like 200 drives.
We can do a lot more, especially with the newer platforms and ONTAP clustering, which would enable up to 24 controllers in the storage cluster. So the performance limits per cluster are staggeringly high.
To put a really interesting (and practical) twist on this, sequential IO in the Oracle realm is probably going to become less important. You know why? Oracle’s new in-memory feature. Me and several others were floored when we got the first debrief on Oracle In-Memory. I couldn’t have asked for a better implementation if I was in charge of Oracle engineering myself. Folks at NetApp started asking what this means for us, and here’s my list:
- Oracle customers will be spending less on storage.
That’s it. That’s my list. The data format on disk remains unchanged, the backup/restore process is the same, the data commitment process is the same. All the NetApp features that earned us around 12,500 Oracle customers are still applicable.
The difference is customers will need smaller controllers, fewer disks, and less bandwidth because they’ll be able to replace a lot of the brute-force full table scan activity with a little In-Memory magic. No, the In-Memory licenses aren’t free, but the benefits will be substantial.
SPC-2 Benchmarks and Engenio Purchases
The other blog demanded two additional answers:
1) Why hasn’t NetApp done an SPC-2 bencharmk?
2) Why did NetApp purchase Engenio?
I personally don’t know why we haven’t done an SPC-2 benchmark with ONTAP, but they are rather expensive and it’s aimed at large sequential IO processing. That’s not exactly the prime use case for FAS systems, but not because they’re weak on it. I’ve got AWR reports well into the GB/sec, so it certainly can do all the sequential IO you want in the real world, but what workloads are those?
I see little point in using an ONTAP system for most (but certainly not all) such workloads because the features overall aren’t applicable. I’m aware of some VOD applications on ONTAP where replication and backups were important. Overall, if you want that type of workload, you’d specify a minimum bandwidth requirement, capacity requirement, and then evaluate the proposals from vendors. Cost is usually the deciding factor.
Again, my personal opinion here on why NetApp acquired Engenio.
Tom Georgens, our CEO, spent 9 years leading Engenio and obviously knew the company and its financials well. I can’t think of any possible way to know you’re getting value for money than having someone in Georgens’ position making this decision.
Here’s the press release about it:
Engenio will enable NetApp to address emerging and fast-growing market segments such as video, including full-motion video capture and digital video surveillance, as well as high performance computing applications, such as genomics sequencing and scientific research.
Yup, sounds about right. That’s all about maximum capacity, high throughput, and low cost. In contrast, ONTAP is about manageability and advanced features. Those are aimed at different sets of business drivers.
Hey, check this out. Here’s an SEC filing:
Since the acquisition of the Engenio business in May 2011, NetApp has been offering the formerly-branded Engenio products as NetApp E-Series storage arrays for SAN workloads. Core differentiators of this price-performance leader include enterprise reliability, availability and scalability. Customers choose E-Series for general purpose computing, high-density content repositories, video surveillance, and high performance computing workloads where data is managed by the application and the advanced data management capabilities of Data ONTAP storage operating system are not required.
Key point here is “where the advanced data management capabilities of Data ONTAP are not required.” It also reflected my logic in storage decisions prior to joining NetApp, and it reflects the message I still repeat to account teams:
- Is there any particular feature in ONTAP that is useful for your customer’s actual business requirements? Would they like to snapshot something? Do they need asynchronous replication? Archival? SnapLock? Scale-out clusters with many nodes? Non-disruptive everything? Think carefully, and ask lots of questions.
- If the answer is “yes”, go with ONTAP.
- If the answer is “no”, go with E-Series.
That’s what I did. I probably influenced or approved around $5M in total purchases. It wasn’t huge, but it wasn’t nothing either. I’d guess we went ONTAP about 70% of the time, but I had a lot of IBM DS3K arrays around too, now known as E-Series.
I’ve annoyed the E-Series team a few times by referring to it as “dumb storage”, but I mean that in the nicest possible way. It’s primary job is to just sit there and work. It needs to do it fast, reliably, and cost effectively, but on a day-to-day basis it’s not generally doing anything all that advanced.
In some ways, the reliability was a weakness. It was so reliable, that we forgot it was there at all, and we’d do something like changing the email server addresses, and forget to update the RAS feature of the E-Series. Without email notification, it can take a couple years before someone notices the LED that indicates a drive needs replacement.