How DBMS Vendors Admit to an Architectural Limitation: Part 3 – EDW on IBM z/OS

This is the 3rd and final example of a vendor admitting, without admitting, to an architectural limitation. The first two parts on Exadata and Teradata are here and here.

Teradata started to get real traction in the EDW space with a shared-nothing architecture in the late 1980’s. At that time the only real competition was DB2 on an IBM mainframe. From those days until just a couple of years ago IBM insisted that for MVS, then z/OS, customers should stick to the mainframe for their data warehouses and marts. There was some dabbling with sharded data in DB2 for z/OS… and Teradata made some in-roads… as did Netezza… but IBM insisted that there was no reason not to stay Blue. DB2 on AIX and then LINUX appeared… and both offered a better price/performance option than DB2 ob Z/OS… but the faithful stayed faithful for the most part.

Then IBM bought Netezza, a pure shared-nothing microprocessor-based machine, and the recommendation changed. Today IBM recommends the Analytics Accelerator, based on Netezza, to mainframe users who want to deploy an EDW. This is an admission, with no admission at all, that there was all along an architectural advantage to shared-nothingness.

If you search this blog for “Netezza” you can get my perspective on that technology. But to be blunt, the Analytics Accelerator is not IBM’s best EDW platform… DB2 LUW is by far… and with BLU LUW is better still.

I have made it clear in my previous posts that I consider it lazy for an IT shop to commit to a vendor or to a product. As engineers we need to embrace change. For IBM z/OS shops this means a realistic look at non-z/OS alternatives to deploy or to re-deploy an EDW. It makes no sense to build a data warehouse or a data mart directly on z/OS. Use the Analytics Accelerator or, better still, open the competition to better products like DB2 LUW, Teradata, Vertica, etc.

References

Database Fog Blog

Other

CPUs and HW for HANA, BLU, Hekaton, and Oracle 12c

CPUs from retired computers waiting for recycl...
(Photo credit: Wikipedia)

This short post is intended to provide a quick warning regarding in-memory columnar and cpu requirements… with a longer post to follow.

When a row is inserted or bulk-loaded into a DBMS, if there are no indexes, the amount of cpu required is very small. The majority of the time is spent committing a transaction is the time to write a log record to persist the data.

When the same record is reformatted into a column the amount of processing required is significantly higher. The data must be parsed into columns, the values must be compressed, dictionaries may be updated, and the breadcrumbs that let the columnar data be regenerated into rows must be laid. Further, if the columnar structure are to be optimized then the data must be ordered… with a sort or some kind of index structure. I have seen academic papers that suggest that for an insert columnar processing may be 100X more than row processing… and you can see why this could be true (I apologize for not finding the reference… I’ll dig it up… as I recall I read it in a post some time back by Daniel Abadi).

Now let’s think about this… several vendors are suggesting that you can deploy their columnar features with no changes required… no new hardware… in-place. But this does not ring true if the new columnar feature requires 100X extra CPU cycles per row… or 50X… or 10X… unless you are running your database on an empty server.

This claim is a shot at SAP who, more honestly, suggests new hardware with high-end processors for their in-memory columnar product… but methinks it is marketing, not architecture, from these other folks.

The Squeeze Hits Teradata

The SqueezeTeradata stock is falling hard due to guidance here that they may miss revenue targets. Analysts are downgrading the stock… but mostly from Buy to Neutral.

Normally stock prices would not be a relevant topic for this blog… but in this case I believe that the Squeeze I suggested first in October of 2012 and then again in February of this year (here and here) has started to affect Teradata revenues.

Note that the Squeeze should affect Netezza, Exadata, and Greenplum as well… but the effect will not be so directly reflected in the stock prices of IBM, Oracle, or EMC/VMWare/GE.

 

BLU Meanies: Data In-memory

Cover of "Sgt. Pepper's Lonely Hearts Clu...
Cover via Amazon

IBM is presenting a DB2 Tech Talk that compares the BLU Accelerator to HANA. There are several mistakes and some odd thinking in the pitch so let me address the issues as a way to explain some things about HANA and about BLU. This blog will consider what data needs to be in-memory.

IBM like several others, continues to repeat a talking point along the lines of: “We believe that you should not have to fit all of you active data in memory…”. Let’s think about this…

Note that in the current release HANA has a constraint that all of the data in a single column, the entire vector that represents the data in that column, must be in-memory before it can be operated on. If the table is partitioned and partition-elimination is applied then the data in the partition for the column must be loaded in-memory. This is a real constraint that will be removed in a subsequent release… but it is not a very severe constraint if you think about it.

But let’s be clear… HANA does not require all data to be in-memory… it will read data from peripheral devices in and out as required just as BLU does.

Now what does this mean? Let’s walk through some scenarios.

First, let’s imagine a customer with 10TB of user data, per the scenario IBM discusses. Let’s not get into a whose product compresses better discussion and assume that both BLU and HANA will get 4X compression… so there is 2.5TB of user data to be processed.

Now let’s imagine a system with only a very little memory available for data. In other words, let’s configure both BLU and HANA so that they are full columnar databases, but not in-memory databases. In this case BLU would operate by doing constant I/O without constraint and HANA would fail whenever it could not fit a required column in memory. Note that HANA might not fail at all… it would depend on whether there was a large single un-partitioned column that was required.

This scenario is really silly though… HANA is an in-memory database, designed to keep data in-memory from the start… so SAP would not support this imaginary configuration. The fact that you could make BLU work out of memory is not really relevant as nowhere does IBM position, or reference, BLU as a disk-based column store add-on… you would just use DB2.

Now let’s configure a system to IBM’s specification with 400GB of memory. IBM does not really say how much of this memory is available to BLU for data… but for the sake of argument let’s ignore the system requirements and assume that BLU uses one-half, 200GB, as work space to process queries so that 200GB is available to store data in-memory. As you will see it does not really matter in this argument whether I am spot on here or not. So using IBM’s recommendation there is now a 200GB cache that can be used as data is paged in and out. Anyone who has ever used a data warehouse knows that caching does not work well for BI queries as each query touches large enough volumes of the data to flush the cache… so BLU will effectively be performing I/O for most queries and is back to being an out-of-memory columnar database. Note that this flushing issue is why the in-memory capabilities from Oracle and Teradata pin certain tables into memory. In this scenario HANA will operate exactly as BLU does with the constraint that any single column that in a compressed form exceeds 200GB will not be able to be processed.

Finally let’s configure a system with 5TB of memory per SAP’s recommendation for HANA. In this case BLU and HANA both fit all of the data in-memory… with 2.5TB of compressed user data in and 2.5TB of work space… and there is no I/O. This is an in-memory DBMS.

But according to the IBM Power 770 spec (here) there is no way to get 5TB of memory on a single p770 node… so to match HANA and eliminate all I/O they would require two nodes… but BLU cannot be deployed on a cluster… so on they would have to deploy on a single node and perform I/O on 20% of the data. The latency for SSD I/O is 200Kns and for disk it is 10Mns… for DRAM it is 100ns and HANA loads full cache lines so that the average latency is under 20ns… so the penalty paid by BLU is severe and it will never keep up with HANA.

There is more bunk around recommendations for the number of cores but I can make no sense of it at all so I do not know where to begin to debunk it. SAP recommends high-end Intel servers to run HANA. In the scenario above we would recommend multiple servers… soon enough there will be Haswell servers with 6TB of DRAM and this case will run on one node.

I have stated repeatedly that anytime a vendor presents a slide comparing their product to their competitors you should immediately throw them out… it will always be twisted. Don’t trust them. And don’t trust me as I work for SAP. But hopefully you can see some logic in my case. If you need an IMDB then you need memory. If you are short of memory then the IMDB operates like a columnar RDBMS with a memory cache. If you are running a BI query workload then you need to pin data in the cache or the system will thrash. Because of this SAP recommends that you get enough memory to get all of the data in… we recommend that you operate our in-memory database product in-memory…

This really the point of the post. The Five Minute Rule informs us about what data should be in-memory (see here). An in-memory database is designed from the bottom up to manage hot data in-memory. The in-memory add-ons being offered over legacy systems are very capable and should not be ignored… and as the price of memory drops the Five Minute Rule will suggest that data in-memory will account for and ever larger percentage of your EDW. But to offer an in-memory capability and recommend that you should keep the bulk of the data on disk is silly… and to state that your product has a competitive advantage because you do not recommend that all of the data managed by your in-memory feature be kept in-memory is silliness squared.

Oracle 12c IMDB Announcement at OOW13

SAP4OOWI changed the picture to show you the billboard SAP bought on US101N right across from the O HQ…

– Rob

Larry Ellison announced a new in-memory capability for Oracle 12c last night. There is little solid information available but taken at face value the new feature is significant… very cool… and fairly capable.

In short it appears that users have the ability to pin a table into memory in a columnar format. The new feature provides level 3 (see here) columnar capabilities… data is stored compressed and processed using vector and SIMD instruction sets. The pinned data is a redundant copy of the table in-memory… so INSERT/UPDATE/DELETE and data loads queries will use the row store and data is copied and converted to the in-memory columnar format.

As you can imagine there are lots of open questions. Here are some… and I’ll try to sort out answers in the next several weeks:

  1. It seems that data is converted row-to-columnar in real-time using a 2-phased commit. This will significantly slow down OLTP performance. LE suggested that there was a significant speed-up for OLTP based on performance savings from eliminating indexes required for analytics. This is a little disingenuous, methinks… as you will most certainly see a significant degradation when you compare OLTP performance without indexes (or with a couple of OLTP-centric indexes) and with the in-memory columnar feature on to OLTP performance without the redundant copy and format to columnar effort. So be careful. The use case suggested: removing analytic indexes and using the in-memory column store is solid and real… but if you have already optimized for OLTP and removed the analytic indexes you are likely to see performance drop.
  2. It is not clear whether the columnar data is persisted to disk/flash. It seems like maybe it is not. This implies that on start-up or recovery data is read from the row store on-disk tables and logs and converted to columnar. This may significantly impact start-up and recovery for OLTP systems.
  3. It is unclear how columnar tables are joined to row tables. It seems that maybe this is not possible… or maybe there is a dynamic conversion from one form to another? Note that it was mentioned that is possible for columnar data to be joined to columnar data. Solving for heterogeneous joins would require some sophisticated optimization. I suspect that when any row table is mentioned in a query that the row join engine is used. In this case analytic queries may run significantly slower as the analytic indexes will have been removed.
  4. Because of this and of item #2 it is unclear how this feature plays with Exadata. For lots of reasons I suspect that they do not play well and that Exadata cannot use the new feature. For example, there is no mention of new extended memory options for the Exadata appliance… and you can be sure that this feature will require more memory.

There was a new hardware system announced that uses this in-memory capability… If you add all of this up it may be that this is a system designed to run SAP applications. In fact, before the presentation on in-memory there was a long (-winded) presentation of a new Fujitsu system and the SAP SD benchmark was specifically mentioned. This was not likely an accident. So… maybe what we have is a counter to HANA for SAP apps… not a data warehouse at all.

As I said… we’ll see as the technical details emerge. If the architectural constraints 1-4 above hold then this will require some work for Oracle to compete with HANA for SAP apps or for data warehouse workloads…

IBM BLU and SAP HANA

Weird blue dot
Weird blue dot (Photo credit: awshots)

As I noted here, I think that the IBM BLU Accelerator is a very nice piece of work. Readers of this blog are in the software business where any feature developed by any vendor can be developed in a relatively short period of time by any other vendor… and BLU certainly moves DB2 forward in the in-memory database space led by HANA… it narrowed the gap. But let’s look at the gap that remains.

First, IBM is touting the fact that BLU requires no proprietary hardware and suggests that HANA does. I do not really understand this positioning? HANA runs on servers from a long list of vendors and each vendor spins the HANA reference architecture a little differently. I suppose that the fact that there is a HANA reference architecture could be considered limiting… and I guess that there is no reference for BLU… maybe it runs anywhere… but let’s think about that.

If you decide to run BLU and put some data in-memory then certainly you need some free memory to store it. Assuming that you are not running on a server with excess memory this means that you need to buy more. If you are running on a blade that only supports 128GB of DRAM or less, then this is problematic. If you upgrade to a 256GB server then you might get a bit of free memory for a little data. If you upgrade to a fat server that supports 512GB of DRAM or more, then you would likely be within the HANA reference architecture set. There is no magic here.

One of the gaps is related: you cannot cluster BLU so the amount of data you can support in-memory is limited to a single node per the paragraphs above. HANA supports shared-nothing clustering and will scale out to support petabytes of data in-memory.

This limit is not so terribly bad if you store some of your data in the conventional DB2 row store… or in a columnar format on-disk. This is why BLU is an accelerator, not a full-fledged in-memory DBMS. But if the limit means that you can get only a small amount of data resident in-memory it may preclude you from putting the sort of medium-to-large fact tables in BLU that would benefit most from the acceleration.

You might consider putting smaller dimension tables in BLU…. but when you join to the conventional DB2 row store the column store tables are materialized as rows and the row database engine executes the join. You can store the facts in BLU in columnar format… but they may not reside in-memory if there is limited availability… and only those joins that do not use row store will use the BLU level 3 columnar features (see here for a description of the levels of columnar maturity). So many queries will require I/O to fetch data.

When you pull this all together: limited available memory on a single node, with large fact tables projecting in and out of disk storage, and joins pushed to the row store you can imagine the severe constraint for a real-world data warehouse workload. BLU will accelerate some stuff… but the application has to be limited to the DRAM dedicated to BLU.

It is only software… IBM will surely add BLU clustering (see here)… and customers will figure out that they need to buy the same big-memory servers that make up the HANA reference architecture to realize the benefits…  For analytics, BLU features will converge over the next 2-3 years to make it ever more competitive with HANA. But in this first BLU release the use of in-memory marketing slogans and of tests that might not reflect a real-world workload are a little misleading.

Right now it seems that HANA might retain two architectural advantages:

  1. HANA real-time support for OLTP and analytics against a single table instance; and
  2. the performance of the HANA platform: where more application logic runs next to the DBMS, in the same address space, across a lightweight thread boundary.

It is only software… so even these advantages will not remain… and the changing landscape will provide fodder for bloggers for years to come.

References

  • Here is a great series of blogs on BLU that shows how joins with the row store materializes columns as rows…

Chuck McDevitt: An Obituary

Teradata Storage Rack
Teradata Storage Rack (Photo credit: pchow98)

My friend and a major contributor to the art of database architecture, Chuck McDevitt, died last week. Five years ago Chuck was diagnosed with an advanced cancer and given six months to live. He fought and endured and worked for most of those five years, teaching us all a little about how we might live our lives.

Chuck’s contributions to database architecture are not so well known. He was employee number fourteen at Teradata and developed, with Dan Holle, the Teradata version of SQL. Chuck invented several foundational parts of any parallel database system. He left Teradata and went to Cogit, a start-up that developed a very early parallel data mining tool. From there Chuck went to Ab Inito as a senior architect, and from there he went to Greenplum where he was the Chief Architect.

At Greenplum Chuck was the brains behind the development of their parallel version of Postgres. It is significant to note that Chuck’s architectural insights led to an extensible and powerful implementation that far exceeded the efforts of others trying to accomplish the same result from a Postgres starting point: Aster Data, Netezza, DataAllegro, and the Postgres community.

To convey Chuck’s contribution to Teradata let me tell a story.

After E.F. Codd published his ground-breaking papers on relational theory two research projects began to develop relational databases. The UC Berkeley project developed a query language that was called QUEL which was followed by an IBM Research project and a query language called Sequel. As Teradata entered the scene they developed a third query language that was called TEQUEL, selecting the best features of each. IBM then delivered DB2 and SQL/DS, and the Oracle Database appeared, all based on Sequel which was by then called SQL. QUEL was forgotten as SQL dominated the market leading to it becoming an ANSI standard in 1986.

One Friday in 1985, the Teradata sales management convened a meeting to request that Teradata support SQL in order to effectively compete in the emerging RDBMS market. They were told by Product Management that it would take three years to develop SQL support… that it was out of the question. Further, the Teradata CTO argued that TEQUEL was superior to SQL and that this was an advantage in any case.

Dan Holle heard about the meeting, called Chuck in and, convinced that the Teradata management team was headed in the wrong direction, started to work. What the technical Execs must have known, but clearly did not appreciate, was that Dan and Chuck had developed the Teradata database using what was at the time, a very advanced concept… compiling the TEQUEL query language into an intermediate language and then processing that. Working non-stop through the weekend Chuck developed an SQL parser that would generate the proper intermediate code for consumption by the optimizer and execution engine… and on Monday morning he demonstrated a functioning SQL version of Teradata.

It is fair to say that, but for Chuck McDevitt, Teradata would not exist… and it is likely that Greenplum would not exist.

As I repeatedly suggest in this blog… architecture counts… and you should all know that the database community lost a great architect last week. Chuck will be missed.

DB2 BLU vs. Netezza… and the Winner is…

Tombstone
Tombstone (Photo credit: Za3tOoOr!)

I wondered here how IBM would position DB2 with BLU versus Netezza. Please have a look before you go on… and let me admit here and now that when I wrote this I chickened out. As I sat down this time I became convinced that I should predict the end of Netezza.

Why?

In the post here Bob Picciano, the general manager of IBM’s Information Management Software Division, made it nearly clear. He said that DB2 BLU is for systems “under 50 terabytes” only because BLU does not cluster. I suspect that if IBM converted all of the Netezza clusters with under 50TB of data to BLU it would knock out 70% or more of the Netezza install base. He states that “most data warehouses are in the under-10-terabyte range”… and so we can assume that Netezza, precluded from anything under 50TB, has a relatively small market left. He suggests that Netezza is for “petabyte-size collections”… but as I suggested here (check out the picture!), Hadoop is going to squeeze the top away from Netezza… while in-memory takes away the bottom… and IBM is very much into Hadoop so the take-away will not require a fight. Finally, we can assume, I think, that the BLU folks are working on a clustered version that will eat more from the bottom of Netezza’s market.

We should pay Netezza some respect as it fades. When they entered the market Teradata was undisputed. Netezza did not knock out the champ but, for the first time, they proved that it was possible to stay in the ring… and this opened the market for Exadata, Greenplum, Vertica and the rest.

Memory Trends and HANA

If the Gartner estimates here are correct… then DRAM prices will fall 50% per year per year over the next several years… and then in 2015 non-volatile RAM (see the related articles below) will become generally available.

It has been suggested that memory prices will fall slower than data warehouses will grow (see here). That does not seem to be the case… and the combination of cheaper memory and then non-volatile memory will make in-memory databases like SAP HANA ever more compelling. In fact, as I predicted… and to their credit, Teradata is adding more memory (see here).

Related articles

MPP, IMDB and Moore’s Law

In the post here I listed the units of parallelism (UoP) applied by various products on a single node. Those findings are summarized in the table below.

Product

Version/HW

Cores per Node

UoP per Node

Notes

Teradata EDW 6700H

16

32

Uses hyper-threads.
Greenplum DCA UAP Edition

16

8

Recommends 1 Segment for each 2 cores. Maybe some multi-threading per query so it could be greater than 8 on the average… and could be 16 with hyper-threads… but not more than 32 for sure.
Exadata X3

12

12-24

Maybe only 12… cannot find if they use hyper-threads.
Netezza Striper

16

16

May use hyper-threads but limited by 16 FPGAs.
HANA Any Xeon E7-4800

40

80

Uses hyper-threads.

A UoP is defined as the maximum number of  instructions that can execute in parallel on a single node for a single query. Note that in the comments there was a lively debate where some readers wanted to count threads or processes or slices that were “active” but in a wait state. Since any program can start threads that wait I do not count these as UoP (later we might devise a new measure named units of waiting that would gauge the inefficiency in any given design by measuring the amount of waiting around required to keep the CPUs fed… maybe the measure would be valuable in measuring the inefficiency of the queue at your doctor’s office or at any government agency).

On some CPUs vendors such as Intel allow two threads to execute instructions in-parallel in a core. This is called hyper-threading and, if implemented, it allows for two UoP on a single core. Rather than constantly qualify the statements for the rest of this blog when I refer to cores I mean to imply hyper-threads.

The lively comments in the blog included some discussion of the sort of techniques used by vendors to try and keep the cores in the CPU on each node fed. It is these techniques that lead to more active I/O streams than cores and more threads than cores.

For several years now Intel and the other CPU manufacturers have been building ever more cores into their products. This has allowed them to continue the trend known as Moore’s Law. Multi-core is now a fact of life and even phones, tablets, and personal computers have multi-core chips.

But if you look at the table  you can see that the database products above, even the newly announced products from Teradata and Netezza, are using CPUs with relatively few cores. The high-end Intel processors have 40 cores and the databases, with the exception of HANA, use Intel products with at most 16 cores. Further, Intel will deliver Ivy Bridge processors to the market this year with 120 cores. These vendors know this… yet they have chosen to deliver appliances with the previous generation CPUs. You might ask why?

I believe that there is an architectural reason for this (also a marketing reason covered here).

It is very hard to keep 80 cores fed with data when you have to perform block I/O. It will be nearly impossible to keep the 240 cores coming with Ivy Bridge fed. One solution is to deploy more nodes in a shared-nothing configuration with fewer cores per node… but this will be expensive requiring more power, floorspace, administration, etc. This is the solution taken by most of the vendors above. Another solution is to solve the problem without I/O with an in-memory database (IMDB) architecture. This is the solution taken by SAP with HANA.

Intel, IBM, and the rest will continue to build out using the multi-core approach for the foreseeable future. IMDB products will be able to fully utilize this product. Other products will struggle to take full advantage as we can see already… they will adapt and adjust and do what they can… but ultimately IMDB will win, I think… because there is just no other way to keep up as Moore’s Law continues to drive technology… no other way to feed the CPU engines with data fast enough.

If I am right then you will see more IMDB offerings from more vendors, including from the major vendors in the near future (note that this does not include the announcements of “database in memory” from Oracle which is not by any measure an in-memory database).

This is the underlying reason why Donald Feinberg (and Timo Elliott) are right on here. Every organization will be running in-memory… and soon.