This short post is intended to provide a quick warning regarding in-memory columnar features and CPU requirements… with a longer post to follow.
When a row is inserted or bulk-loaded into a DBMS with no indexes, the amount of CPU required is very small. The majority of the time spent committing a transaction is the time required to write a log record that persists the data.
When the same record is reformatted into columns the amount of processing required is significantly higher. The data must be parsed into columns, the values must be compressed, dictionaries may be updated, and the breadcrumbs that let the columnar data be regenerated into rows must be laid down. Further, if the columnar structures are to be optimized then the data must be ordered… with a sort or some kind of index structure. I have seen academic papers suggesting that columnar insert processing may be 100X more expensive than row processing… and you can see why this could be true (I apologize for not finding the reference… I’ll dig it up… as I recall I read it in a post some time back by Daniel Abadi).
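To make this concrete, here is a minimal sketch in Python of the two insert paths… the names and structures are mine, for illustration only, and not any vendor’s implementation:

```python
# Hypothetical sketch: the work implied by a single-row insert in a row store
# versus a dictionary-encoding column store. Illustrative only.

def row_insert(log, row):
    # With no indexes a row store mostly just logs the row... CPU cost is tiny.
    log.append(row)

def columnar_insert(log, columns, dictionaries, row):
    log.append(row)                          # the log write is still required
    for col_name, value in row.items():      # parse the row into columns
        d = dictionaries[col_name]
        code = d.setdefault(value, len(d))   # maybe update the column's dictionary
        columns[col_name].append(code)       # compress: store the small code
    # ...plus the breadcrumbs: position N in every column vector belongs to row N...
    # and an optimized store must also keep the vectors sorted or indexed.
```

Every row now touches every column structure… and this is before any sorting or re-compression of the column vectors.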
Now let’s think about this… several vendors are suggesting that you can deploy their columnar features with no changes required… no new hardware… in-place. But this does not ring true if the new columnar feature requires 100X extra CPU cycles per row… or 50X… or 10X… unless you are running your database on an empty server.
This claim is a shot at SAP who, more honestly, suggests new hardware with high-end processors for their in-memory columnar product… but methinks it is marketing, not architecture, from these other folks.
Teradata stock is falling hard due to guidance here that they may miss revenue targets. Analysts are downgrading the stock… but mostly from Buy to Neutral.
Normally stock prices would not be a relevant topic for this blog… but in this case I believe that the Squeeze I suggested first in October of 2012 and then again in February of this year (here and here) has started to affect Teradata revenues.
Note that the Squeeze should affect Netezza, Exadata, and Greenplum as well… but the effect will not be so directly reflected in the stock prices of IBM, Oracle, or EMC/VMWare/GE.
My friend and a major contributor to the art of database architecture, Chuck McDevitt, died last week. Five years ago Chuck was diagnosed with an advanced cancer and given six months to live. He fought and endured and worked for most of those five years, teaching us all a little about how we might live our lives.
Chuck’s contributions to database architecture are not so well known. He was employee number fourteen at Teradata and developed, with Dan Holle, the Teradata version of SQL. Chuck invented several of the foundational pieces of any parallel database system. He left Teradata for Cogit, a start-up that developed a very early parallel data mining tool. From there Chuck went to Ab Initio as a senior architect, and from there to Greenplum, where he was the Chief Architect.
At Greenplum Chuck was the brains behind the development of their parallel version of Postgres. It is significant to note that Chuck’s architectural insights led to an extensible and powerful implementation that far exceeded the efforts of others trying to accomplish the same result from a Postgres starting point: Aster Data, Netezza, DataAllegro, and the Postgres community.
To convey Chuck’s contribution to Teradata let me tell a story.
After E.F. Codd published his ground-breaking papers on relational theory, two research projects began to develop relational databases: a UC Berkeley project that produced a query language called QUEL, and an IBM Research project that produced a query language called Sequel. As Teradata entered the scene they developed a third query language, TEQUEL, selecting the best features of each. IBM then delivered DB2 and SQL/DS, and the Oracle Database appeared, all based on Sequel, which was by then called SQL. QUEL was forgotten as SQL dominated the market, leading to SQL becoming an ANSI standard in 1986.
One Friday in 1985, the Teradata sales management convened a meeting to request that Teradata support SQL in order to effectively compete in the emerging RDBMS market. They were told by Product Management that it would take three years to develop SQL support… that it was out of the question. Further, the Teradata CTO argued that TEQUEL was superior to SQL and that this was an advantage in any case.
Dan Holle heard about the meeting, called Chuck in and, convinced that the Teradata management team was headed in the wrong direction, started to work. What the technical execs must have known, but clearly did not appreciate, was that Dan and Chuck had developed the Teradata database using what was, at the time, a very advanced concept… compiling the TEQUEL query language into an intermediate language and then processing that. Working non-stop through the weekend Chuck developed an SQL parser that would generate the proper intermediate code for consumption by the optimizer and execution engine… and on Monday morning he demonstrated a functioning SQL version of Teradata.
It is fair to say that, but for Chuck McDevitt, Teradata would not exist… and it is likely that Greenplum would not exist.
As I repeatedly suggest in this blog… architecture counts… and you should all know that the database community lost a great architect last week. Chuck will be missed.
I wondered here how IBM would position DB2 with BLU versus Netezza. Please have a look before you go on… and let me admit here and now that when I wrote this I chickened out. As I sat down this time I became convinced that I should predict the end of Netezza.
In the post here Bob Picciano, the general manager of IBM’s Information Management Software Division, made it nearly clear. He said that DB2 BLU is for systems “under 50 terabytes” only because BLU does not cluster. I suspect that if IBM converted all of the Netezza clusters with under 50TB of data to BLU it would knock out 70% or more of the Netezza install base. He states that “most data warehouses are in the under-10-terabyte range”… and so we can assume that Netezza, precluded from anything under 50TB, has a relatively small market left. He suggests that Netezza is for “petabyte-size collections”… but as I suggested here (check out the picture!), Hadoop is going to squeeze the top away from Netezza… while in-memory takes away the bottom… and IBM is very much into Hadoop so the take-away will not require a fight. Finally, we can assume, I think, that the BLU folks are working on a clustered version that will eat more from the bottom of Netezza’s market.
We should pay Netezza some respect as it fades. When they entered the market Teradata was the undisputed champ. Netezza did not knock out the champ but, for the first time, they proved that it was possible to stay in the ring… and this opened the market for Exadata, Greenplum, Vertica and the rest.
The current release of HANA requires that all of the data needed to satisfy a query be in-memory before the query can run. Let’s think about what this means:
HANA compresses tables into bitmap vectors… and then compresses the vectors on write to reduce disk I/O. Disk I/O with HANA? Yup.
Once this formatting is complete all tables and partitions are persisted to disk… and if there are updates to the tables then logs are written to maintain ACIDity and, at some interval, the changed data is persisted asynchronously to disk as blocks. When HANA cold starts, no data is in-memory. There are options to pre-load data at start-up… but the default is to load data as it is used.
When the first query begins execution the data required to satisfy the query is moved into memory and decompressed into vectors. Note that the vector format is still highly compressed and the execution engine operates on this compressed vector data. Also, partition elimination occurs during this data move… so only the partitions required are loaded. The remaining data is on disk until required.
Let us imagine that after several queries all of the available memory is consumed… but there is still user data out-of-memory on peripheral storage… and a new query is submitted that requires this data. At this point HANA frees enough memory to satisfy the new query and processes it. Note that, in the usual DW case (write-once/read-many), the data flushed from memory does not need to be written back… the data is already persisted… otherwise HANA will flush any unwritten changed blocks…
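Here is a sketch of that load-on-first-use/evict pattern… this is my own illustration of the mechanics described above, emphatically not SAP code, and the ColumnCache class and disk object are hypothetical:

```python
# Illustrative only: load columns on first use, evict to make room, and write
# back only dirty data. Not SAP's implementation; all names are hypothetical.

class ColumnCache:
    def __init__(self, capacity_bytes, disk):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = {}   # column name -> (vector, size_bytes, dirty flag)
        self.disk = disk     # stands in for the persisted, compressed columns

    def get(self, name):
        if name not in self.resident:
            vector, size = self.disk.read(name)       # decompress into vector form
            while self.used + size > self.capacity:   # free memory for the new column
                self._evict_one()
            self.resident[name] = (vector, size, False)
            self.used += size
        return self.resident[name][0]

    def _evict_one(self):
        if not self.resident:   # nothing left to flush... the query cannot fit
            raise MemoryError("query working set exceeds available memory")
        name, (vector, size, dirty) = self.resident.popitem()  # pick a victim
        if dirty:               # write-once/read-many data is already persisted
            self.disk.write(name, vector)
        self.used -= size
```

Note how the write-back happens only for changed data… and how the MemoryError case leads directly to the next point.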
If a query is submitted that performs a Cartesian product… or that requires all of the data in the warehouse at once… in other words, where there is not enough memory to fit all of the vectors even after flushing everything else out… the query fails. It is my understanding that this constraint will be fixed in a future release and data will stream into memory and be processed in-stream instead of in-whole. Note that in other databases a query that consumes all of the available memory may never complete, or will seriously affect all other running queries, or will lock the system… so the HANA approach is not all bad… but as noted there is room for improvement and the constraint is real.
This note should remove several silly arguments leveled by HANA’s competitors:
HANA, and most in-memory databases, offer full ACID-compliance. A system failure does not result in lost data.
HANA supports more data than will fit in-memory and it pages data in-and-out in a smart fashion based on utilization. It is not constrained to only data that fits in-memory.
HANA is not useless when it runs out of memory. HANA has a constraint when there is more data than memory… it does not crash the system… but let’s be real… if you page data to disk and run out of disk you are in trouble… and we’ve all seen our DBMSs hit this wall. If you have an in-memory DBMS then you need enough memory to support your workload… if you have a DB2 system you had better not run out of temp space or log space on disk… if you have Teradata you had better not run out of spool space.
I apologize… there is no public reference I know of to support the features I described. It is available to HANA customers in the HANA Blue Book. It is my understanding that a public version of the Blue Book is being developed.
In my last post here I suggested that there were three levels of maturity around column orientation and described the first level, PAX, which provides columnar compression. This apparently is the level Exadata operates at with its Hybrid Columnar Compression.
In this post we will consider the next two levels of maturity: early materialized column processing and late materialized column processing, which provide more I/O avoidance and some processing advantages.
In the previous post I suggested a five-column table and depicted each of those columns oriented on disk in separate file structures. This orientation provides the second level of maturity: columnar projection.
Imagine a query that selects only four of the five columns in the table, leaving out the EmpFirst column. In this case the physical structure that stores EmpFirst does not have to be accessed; 20% less data is read, reducing the I/O overhead by the same amount. Somewhere in the process the magic has to be invoked that returns the columns to a row orientation… but just maybe that overhead costs less than the savings from the reduced I/O?
Better still, imagine a fact table with 100 columns and a query that accesses only 10 of them. This is a very common use case. The result is that 90% less data has to be read… a 10X reduction in I/O and a matching reduction in the cost of weaving columns into rows. This is columnar projection, and its impact far outweighs the small advantage offered by PAX (PAX may provide a .1X-.5X, 10%-50%, compression advantage over full columnar tables). This is the advantage that lets most of the columnar databases beat Exadata in a fair fight.
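A sketch of columnar projection, assuming one file per column (the layout and names are hypothetical):

```python
# Sketch of columnar projection: only the files for the referenced columns are
# opened... columns the query does not touch cost no I/O at all. Illustrative only.

def scan(table_dir, needed_columns):
    columns = {}
    for name in needed_columns:          # e.g. 10 of 100 fact-table columns
        with open(f"{table_dir}/{name}.col", "rb") as f:
            columns[name] = f.read()     # the other 90 files are never read
    return columns

# projected = scan("/data/fact_sales", ["OrderID", "Qty", "Price"])
```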
But Teradata and Greenplum stop here. After data is projected and selected the data is decompressed into rows and processed using their conventional row-based database engines. The gains from more maturity are significant.
The true column stores read compressed columnar data into memory and then operate on the columnar data directly. This provides distinct advantages:
Since data remains compressed DRAM is used more efficiently
Aggregations against a single column access data in contiguous memory improving cache utilization
Since data remains compressed processor caches are used more efficiently
Since data is stored in bit maps it can be processed as vectors using the super-computing instruction sets available in many CPUs
Aggregations can be executed using multiplication against the dictionary instead of table scans (see the sketch below)
Distinct query optimizations are available when columnar dictionaries are available
Column structures behave as built-in indexes, eliminating the need for separate index structures
These advantages can provide 10X-50X performance improvements over the previous level of maturity.
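To illustrate the multiplication trick mentioned in the list above, here is a sketch… it assumes a dictionary-encoded column whose code counts are at hand (real engines often keep such counts as metadata). This is illustrative, not any product’s execution engine:

```python
# SUM over a dictionary-encoded column: one multiply per distinct value rather
# than one add per row. Illustrative only.

def sum_encoded(dictionary, codes):
    counts = {}
    for c in codes:                      # count the codes (often precomputed)
        counts[c] = counts.get(c, 0) + 1
    return sum(dictionary[c] * n for c, n in counts.items())

prices = {0: 9.99, 1: 19.99, 2: 4.99}    # code -> actual value
encoded = [0, 1, 1, 2, 0, 0, 1]          # seven rows, three distinct prices
assert abs(sum_encoded(prices, encoded) - 94.93) < 1e-9
```

With three distinct values the final sum takes three multiplies and three adds… whether the column holds seven rows or seven billion.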
Column Compression provides approximately a 4X performance advantage over row compression (10X instead of 2.5X). This is Column Maturity Level 1.
Columnar Projection includes the advantages of Column Compression and provides a further 5X-10X performance advantage (if your queries touch 1/5-1/10 of the columns). This is Column Maturity Level 2.
Columnar Processing provides a 10X+ performance improvement over just compression and projection. This is Column Maturity Level 3.
Of course your mileage will vary… If your workload tends to touch more than 80% of the columns in your big fact tables then columnar projection will not be useful… and Exadata may win. If your queries do not do much aggregation then columnar processing will be less useful… and a product at Level 2 may win. And of course, this blog has not addressed the complexities of joins and loading and workload management… so please do not consider this as a blanket promotion for Level 3 column stores… but now that you understand the architecture I hope you will be better able to call BS on the marketing…
Included is a table that outlines the maturity level of several products.
There are three forms of column-orientation deployed by database systems today. Each builds upon the last. The simplest form uses column-orientation to provide better data compression. The next level of maturity stores columnar data in separate structures to support columnar projection. The most mature implementations support a columnar database engine that performs relational algebra on column-oriented data. Let me explain…
Imagine a simple table with 1M rows… with the schema and the first several rows depicted in Figure 1. Conceptually, a row-orientation deploys data on disk and in-memory as depicted in Figure 2 and a column-orientation deploys data on disk and in-memory as depicted in Figure 3. The actual deployment may be significantly different, as we will see.
Note that I am going to throw out some indicative numbers around compression. I will suggest that applying compression to rows will provide from 1.5X to 3.5X compression with an average of 2.5X… and that applying compression to columns provides from 3X compression to 50X compression with the average around 10X. These are supportable numbers but the compression you see for any specific data set will vary.
There are two powerful compression techniques that, individually or combined, provide most of the benefits: dictionary-encoding and run-length encoding. For the purposes of this blog I will describe only dictionary-encoding in any depth; and I will do it an injustice by explaining it only briefly and conceptually… just enough that you get the idea.
Dictionary-encoding reduces data to a compressed form by building a map that provides a translation for each cardinal value in the table to a tightly compressed form. For example, if there are indeed only three values possible in the DeptID field above then we might build a dictionary for that column as depicted in Figure 4. You can see… by encoding and storing the data in the minimal number of bits required, significant storage reduction is possible… and the lower the cardinality of a column the smaller the resulting bit representation.
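A tiny sketch of the idea (the data is made up… Figure 4 is the authoritative picture):

```python
# Sketch of dictionary-encoding a low-cardinality column like DeptID.
# Three distinct values need only two bits per row instead of a full field.

import math

def dictionary_encode(values):
    dictionary = {}                              # value -> small integer code
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
    codes = [dictionary[v] for v in values]      # the compressed column
    bits = max(1, math.ceil(math.log2(len(dictionary))))  # bits per encoded value
    return dictionary, codes, bits

depts = ["Sales", "Support", "Sales", "Dev", "Sales", "Support"]
d, codes, bits = dictionary_encode(depts)
# d == {'Sales': 0, 'Support': 1, 'Dev': 2}; codes == [0, 1, 0, 2, 0, 1]; bits == 2
```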
Further compression is possible by encoding runs of similar values as the value plus the number of times it repeats… so that the bit stream 0000000000000000 could be represented as 01111 (0 occurs 2⁴ = 16 times).
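And a minimal run-length encoder for completeness… again just a sketch:

```python
# Run-length encoding: a run of repeated values becomes (value, run length).

def rle_encode(bits):
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1                       # extend the current run
        runs.append((bits[i], j - i))    # the value plus the number of repeats
        i = j
    return runs

assert rle_encode("0000000000000000") == [("0", 16)]   # 2^4 zeros, one pair
```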
You can now also start to see why column-orientation compresses better than a row-orientation. In the row block above there is little opportunity to encode whole rows in a dictionary… the cardinality of rows in a table is too high (note that this may not be true for a dimension table which is, in effect, a dictionary). There is some opportunity to encode the bit runs in a row… as noted, you can expect to get 2X-2.5X from row compression for a fact table. Column-orientation allows dictionary encoding to be applied effectively to low-cardinality columns… and this accounts for the advantage there.
Note that there is no free lunch here. There is a cost to be paid in CPU cycles to compress data and to decompress data… but for a read-optimized data warehouse database compression is cool. Exactly how cool depends on the level of maturity and we will get to that as we go.
It is crucial to remember that column store databases are relational. They ingest rows and emit rows and perform relational algebra in-between. So there has to be some magic that turns tuples into columns and restores them from columns. The integrity of a row has to persist. Again I am going to defer on the details and point you at the references below… but imagine that for each row a bit map is built that, for each column, points to the entry in the column dictionary with the proper value.
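One common approach aligns the vectors positionally… position N in every column vector belongs to row N… so restoring a row is a set of dictionary look-ups. A sketch, with made-up structures:

```python
# Sketch of restoring a row from column vectors plus per-column dictionaries.
# Position N in every column vector belongs to row N. Illustrative only.

def fetch_row(row_id, columns, decode):
    # columns: name -> list of codes; decode: name -> (code -> value)
    return {name: decode[name][codes[row_id]] for name, codes in columns.items()}

columns = {"DeptID": [0, 1, 0], "EmpLast": [2, 0, 1]}
decode = {"DeptID": {0: "Sales", 1: "Dev"},
          "EmpLast": {0: "Jones", 1: "Smith", 2: "Adams"}}
assert fetch_row(0, columns, decode) == {"DeptID": "Sales", "EmpLast": "Adams"}
```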
There is no free lunch with a column store… no free lunch anywhere, it seems. Building this bit map on INSERT is very expensive, and modifying it on UPDATE is fairly expensive. This is why column-orientation is not suitable for OLTP workloads without some extra effort. But the cost is amortized by significant performance gains for READs.
One last concept: since peripheral I/O reads blocks, imagine two approaches to column compression: one that applies the concepts above to an entire table, breaking each table into separate column-oriented files that may be read separately; and one that applies the concepts individually to each large block in a table file. Imagine, in the first case, that Figure 2 represents a picture of the first few rows in our 1M-row table. Imagine, in the second case, that Figure 2 represents the rows in one block of data re-oriented into columns.
This second, block-oriented, approach is called PAX, and it is more-or-less the approach used by Exadata. In the PAX approach each block contains its own mini-column store and a dictionary for dictionary-encoding the values in the block. Because the cardinality for columns within a block will often be less than for an entire table there are some distinct advantages to PAX compression. Compression will be more than a little higher than for full-table columnar compression.
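A sketch of the PAX idea… each block pivots its own rows and carries its own block-local dictionaries (this is illustrative, not Exadata’s actual format):

```python
# PAX sketch: each disk block is a mini column store with block-local
# dictionaries... low within-block cardinality compresses very well.

def pax_encode_block(rows):
    block = {}
    for col in rows[0]:                  # pivot the block's rows into columns
        values = [r[col] for r in rows]
        local_dict = {v: i for i, v in enumerate(dict.fromkeys(values))}
        block[col] = (local_dict, [local_dict[v] for v in values])
    return block                         # one dictionary per column, per block

def pax_decode_block(block):
    # Reading a block decompresses it straight back into rows for row processing.
    n = len(next(iter(block.values()))[1])
    inverted = {c: {i: v for v, i in d.items()} for c, (d, _) in block.items()}
    return [{c: inverted[c][codes[i]] for c, (_, codes) in block.items()}
            for i in range(n)]
```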
When Exadata reads a block from disk it decompresses the data back into rows and performs row-oriented processing to complete the query. This is very cool for Exadata… a great feature. As noted, column compression may be 4X better than row compression on the average. This reduces the storage requirements and reduces the overhead of I/O by 4X… and this is a very significant improvement. But Exadata stops here. It is not a columnar-oriented DBMS and it misses the significant advantages that come from the next two levels of column-orientation… I’ll take these up in the next post.
To be clear, all of the databases that use these more mature techniques (Teradata, HANA, Greenplum, Vertica, ParAccel, DB2, and SQL Server) gain from columnar compression even if the PAX approach provides some small advantage as a compression technique.
It is also worth noting that Teradata does not gain as much as others in this regard. This is not because of poor design, rather it is due to the fact that, to their credit, Teradata implemented a Teradata-specific dictionary-based compression scheme long ago. Columnar compression let others catch up to what Teradata has offered for years.
And before you ask… Netezza offers no columnar orientation… preferring to compress deeply, using an FPGA co-processor to decompress… and to reduce I/O using zone maps rather than the mid-level column projection techniques in the next blog here.
I would like to recommend to you John Appleby’s post here on the HANA blog site. While the title suggests the article is about HANA, in fact it is about trends in computing and processors… and very relevant to posts here past, present, and upcoming…
I would also recommend Curt Monash’s site. His notes on Teradata here mirror my observation that a 30%-50% performance boost per release cycle is the target for most commercial databases… and what wins in the general market. This is why the in-memory capabilities offered by HANA and maybe DB2 BLU are so disruptive. These products should offer way more than that… not 1.5X but 100X in some instances.
Finally I recommend “What Every Programmer Should Know About Memory” by Ulrich Drepper here. This paper provides a great foundation for the deep hardware topics to come.
Database computing is becoming a special case, a commercial case, of supercomputing… high-performance computing (HPC) to those less inclined to superlatives. Over the next few years the differentiation between products will increasingly be due to the use of high-performance computing techniques: in-memory techniques, vector processing, massive parallelism, and use of HPC instruction sets.
I suggested here that Teradata shipped the EDW 6700 series without waiting for Ivy Bridge because they could not use the cores effectively… but it could be that Haswell (see here) fit their release schedule better. It will be interesting to see whether they can use all of the cores then.