Who is How Columnar? Exadata, Teradata, and HANA – Part 1: Column Compression

Basic Table

There are three forms of columnar-orientation currently deployed by database systems today. Each builds upon the next. The simplest form uses column-orientation to provide better data compression. The next level of maturity stores columnar data in separate structures to support columnar projection. The most mature implementations support a columnar database engine that performs relational algebra on column-oriented data. Let me explain…

Imagine a simple table with 1M rows… with the schema and the first several rows depicted in Figure 1. Conceptually, a row-orientation deploys data on disk and in-memory as depicted in Figure 2 and a column-orientation deploys data on disk and in-memory as depicted in Figure 3. The actual deployment may be significantly different, as we will see.

Note that I am going to throw out some indicative numbers around compression. I will suggest that applying compression to rows will provide from 1.5X to 3.5X compression with and average of 2.5X… and that applying compression to columns provides from 3X compression to 50X compression with the average around 10X. These are supportable numbers but the compression you see for any specific data set will vary.

A row oriented block

There are two powerful compression techniques that individually or combined provide most of the benefits: dictionary-encoding and run-length encoding. For the purposes of this blog I will describe only dictionary-encoding; and I will do an injustice to that by explaining it only briefly and conceptually… just enough that you get the idea.

Five column oriented blocks

Further compression is possible by encoding runs of similar values to a value plus the number of times it repeats so that the bit stream 0000000000000000 could be represented as 01111 (0 occurs 24 times).

You can now also start to see why column-orientation compresses better that a row-orientation. In the row block above there is little opportunity to encode whole rows in a dictionary… the cardinality of rows in a table is too high (note that this may not be true for a dimension table which is, in-effect, a dictionary). There is some opportunity to encode the bit runs in a row… as noted, you can expect to get 2X-2.5X from row compression for a fact table. Column-orientation allows dictionary encoding to be applied effectively to low cardinality columns… and this accounts for the advantage there.

Col Dict

Dictionary-encoding reduces data to a compressed form by building a map that provides a translation for each cardinal value in the table to a tightly compressed form. For example, if there are indeed only three values possible in the DeptID field above then we might build a dictionary for that column as depicted in Figure 4. You can see… by encoding and storing the data in the minimal number of bits required, significant storage reduction is possible… and the lower the cardinality of a column the smaller the resulting bit representation.

Note that there is no free lunch here. There is a cost to be paid in CPU cycles to compress data and to decompress data… but for a read-optimized data warehouse database compression is cool. Exactly how cool depends on the level of maturity and we will get to that as we go.

It is crucial to remember that column store databases are relational. They ingest rows and emit rows and perform relational algebra in-between. So there has to be some magic that turns tuples into columns and restores them from columns. The integrity of a row has to persist. Again I am going to defer on the details and point you at the references below… but imagine that for each row a bit map is built that, for each column, points to the entry in the column dictionary with the proper value.

There is no free lunch to column store… no free lunch anywhere, it seems. Building this bit map on INSERT is very expensive, and modifying it on UPDATE is fairly expensive. This is why column-orientation is not suitable for OLTP workloads without some extra effort. But the cost is amortized by significant performance gains for READs.

One last concept: since peripheral I/O reads blocks imagine two approaches to column compression: one applies the concepts above to an entire table breaking each table into separate column-oriented files that may be read separately; and one which applies the concepts individually to each large block in a table file. Imagine, in the first case that Figure 2 represents a picture of the first few rows in our 1M-row table. Imagine, in the second case, that Figure 2 represents the rows in one block of data re-oriented into columns.

This second, block-oriented, approach is called PAX, and it is more-or-less the approach used by Exadata. In the PAX approach each block contains its own mini-column store and a dictionary for dictionary encoding with the values in the block. Because the cardinality for columns within a block will often be less than for an entire table there are some distinct advantages to PAX compression. Compression will be higher by more than a little than for full table columnar compression.

When Exadata reads a block from disk it decompresses the data back into rows and performs row-oriented processing to complete the query. This is very cool for Exadata… a great feature. As noted, column compression may be 4X better than row compression on the average. This reduces the storage requirements and reduces the overhead of I/O by 4X… and this is a very significant improvement. But Exadata stops here. It is not a columnar-oriented DBMS and it misses the significant advantages that come from the next two levels of column-orientation… I’ll take these up in the next post.

To be clear, all of the databases that use these more mature techniques: Teradata, HANA, Greenplum, Vertica, Paraccel, DB2, and SQL Server gain from columnar compression even if the PAX approach provides some small advantage as a compression technique.

It is also worth noting that Teradata does not gain as much as others in this regard. This is not because of poor design, rather it is due to the fact that, to their credit, Teradata implemented a Teradata-specific dictionary-based compression scheme long ago. Columnar compression let others catch up to what Teradata has offered for years.

And before you ask… Netezza offers no columnar orientation… preferring to compress deeply using an FPGA co-processor to decompress… and to reduce I/O using zone maps rather than the using the mid-level column projection techniques in the next blog here.

The Fog is Getting Thicker…

English: San Francisco in fog
San Francisco in fog (Photo credit: Wikipedia)

I renamed this so that Teradata folks would not get here so often… its not really about Intelligent Memory… just prompted by it. The post on Intelligent Memory is here. – Rob

Two quick comments on Teradata’s recent announcement of Intelligent Memory.

First… very very cool. More on this to come.

Next… life is going to become very hard for my readers and for bloggers in this space. The notion of an in-memory database is becoming rightfully blurred… as is the notion of column store.

Oracle blurs the concepts with words like “database in-memory” and “hybrid column compression” which is neither an in-memory database or a column store.

Teradata blurs the concept with a strong offering that uses DRAM as a block-IO device (like the old RAM-disks we used to configure on our PCs).

Teradata and Greenplum blur the idea of a column store by adding columnar tables over their row store database engines.

I’m not a fan of the double-speak… but the ability of companies to apply the 80/20 rule to stretch their architectures and glue on new advanced technologies is a good thing for consumers.

But it becomes very hard to distinguish the products now.

In future blogs I’ll try to point out differences… but we’ll have to go a little deeper into the Database Fog.

HANA Support for OLTP and BI In a Single Table

Aerial view of Hana, Maui
Aerial view of Hana, Maui (Photo credit: Wikipedia)

This is a rehash of my post for SAP here… I thought you might find it interesting as it describes the architecture HANA uses to support OLTP and BI against a single table.

A couple of points to think about:

  • If you have only one database structure you can optimize for only one query; e.g. the OLTP query is fast against a OLTP structure but slow against a BI structure… or visa versa.
  • If you have two structures you have to ETL the data between the two at some cost. There is cost in keeping a replica of the data, cost in developing, administering, and executing the ETL process. In addition there is a lost opportunity cost hidden in the latency of the data. You cannot see the current state of the business by querying the BI data as some data has not yet been ETL’d across.
  • OLTP performance is normally paramount; so the perfect system would not compromise that performance or compromise it only a little.

Let’s look at the HANA approach to this at a high level.

HANA provides a single view of a table to an application or a user, but under-the-covers each table includes a OLTP optimized part, a BI optimized part, and a mechanism for moving data from one part to the other

When a transaction hits the system; inserts, updates, and deletes are processed in the OLTP part with no performance penalty. The read portion of the OLTP query accesses the read-optimized internal structure with no performance penalty. Note that reading a single column in a column store, which is the key for the transaction, is roughly equivalent to reading an index structure on top of a standard disk-based DBMS. Except the column is always in-memory which means I/O is never required. This provides the HANA system with an advantage over a disk-based system. Disk I/O is 120+ times slower than memory access so even an index is unlikely to beat in-memory. See here for some numbers you should know.

After the transaction is committed into the internal, OLTP-optimized part, a process starts that moves the data to the BI optimized part. This is called a delta merge as the OLTP portion holds all of the changes, the delta, in the data set.

When a BI query starts it can limit the scan to only partitions in the BI optimized part, or if real-time data is required it can scan both parts. The small portion of the scan that accesses the OLTP/delta portion is sub-optimal when compared to the scan of the BI part,  but not slow at all as the data is all in-memory.

We can tease the performance apart as follows:

  1. There is a OLTP insert/update/delete “write” portion… and HANA executes this like any OLTP database, as fast as an OLTP RDBMS, with a commit after a write-to-log;
  2. There is a OLTP select “read” portion… and HANA performs this in the in-memory column store faster than many OLTP databases… and scans the delta structure as fast as any OLTP database;
  3. There is a delta merge from the OLTP write-optimized part to the BI read-optimized column store that is hundreds to tens of thousands of times faster than any ETL tool; and
  4. There is a BI select portion that scans the in-memory column store hundreds to thousands of times faster than a disk-based BI database.
  5. If the BI query requires access to real-time data then an in-memory scan of the delta file is required… there is no analogy to this in a system with separate OLTP and BI tables.
The implementation uses MVCC instead of locks.


The Five Minute Rule and In-memory Databases

I was recently reminded of a couple of papers written by Jim Gray and Gianfranco Putzolu  that calculated the cost of keeping data in memory vs the cost of paging it in from disk. I was happy to see that the thread was being kept alive by Goetz Graefe.

These papers used the cost of each media to determine how “hot” data needed to be to be cost-effectively stored in-memory. The 1987  five minute rule (click here to reference the original papers) was so named because at that time and based on the relative costs of CPU, Memory, and Disk; a 1KB  record that was accessed every five minutes could be effectively stored in memory and a 4KB block of data broke-even at two minutes.

In 2009, with CPU prices coming down but the number of instructions executed per second going up, and with memory and prices down, the break-even point between keeping 4KB in memory or on a SATA disk was 90 minutes.

Let’s be clear about what this means. Based solely on the cost of CPUs, RAM, and SATA drives; any data that is accessed more frequently than each 90 minutes should be kept in memory. This does not include any ROI based on the business benefits of a speedy response. It does not adjust for data compression which allows more than 4KB of user data to use 4KB of RAM. Just pure IT economics gets us to this point.

So… if you have data in a data warehouse or a mart that is touched by a query at least once every 90 minutes… it is wasteful to store it on disk. If you have an in-memory database than can compress the data 2X and use it in its compressed form, then the duration goes up to 180 minutes. You do not have to look any further than this to find the ROI for an in-memory data base (IMDB).


The Big Data Bang

There is still an open question over whether, after the Big Bang, there is enough mass in the Universe to slow the expansion and cause the universe to contract. While the Big Data Bang continues to expand the universe of bits and bytes… I would like to ask whether some of these numbers are overstated? I know that the sum of the bits and bytes is expanding but I wonder if the universe of information is expanding as much as we claim?

Note that by “information” I mean a unique combination of bits and bytes representing some new information. In other words, if the same information is copied redundantly over and over does that count?

There is a significant growth industry in deduplication software that can backup data without copying redundant information. The savings from these products is astounding. NetApp claims 70% of the unstructured data may be redundant (see here). Data Domain says that eliminating (and compressing) redundant data reduces storage requirements by 10X-30X (see here).  What’s up with that?

In the data warehouse space it is just as bad. The same data lives in OLTP systems, ETL staging areas, Operational Data Stores, Enterprise Data Warehouses, Data Marts, and now Hadoop clusters. The same information is replicated in aggregate tables, indexes, materialized views, and cubes.  If you go into many shops you can find 50TB of EDW data exploded into 500TB of sandboxes for the data scientists to play with. Data is stored in snapshots on an hourly basis where less than 10% of the data changes from hour to hour. There is redundancy everywhere. There is redundancy everywhere. 🙂

I believe that there is a data explosion… and I believe that it is significant… but  there is also a sort of laziness about copying data.

Soon we will see in production the first systems where a single copy of OLTP and EDW and analytic data can reside in the same platform and be shared. It will be sort of shocking to see the Big Data Bang slow a little…

Stop Tuning and Scan…

After years of tuning data warehouses, queries, data loads, and BI applications, I give up. In the long run it is not really possible anyway… and better still… no longer necessary. A better approach is to build your database and your hardware infrastructure to scan fast and smart. So here’s a blog on why it’s impossible to tune a warehouse… and on why it’s no longer necessary.

My argument against tuning is easy to grasp. By definition a data warehouse serves many constituencies: Marketing and Finance and Customer Support and Distribution; and these business units will each access the data from their unique perspective following a unique path through the warehouse. A designer cannot lay out the data effectively to support each access path… cannot index every column, cannot map more than one zone, cannot replicate the data again and again with aggregates and materialized views, cannot cache the entire warehouse. Even if you get it right changing business requirements will fracture your approach; or worse, the design will not support new queries and constrain your business.

Many readers will be skeptical at this point… suggesting that the software and hardware to eliminate tuning does not exist. So let’s build a model and test the state of the art.

Let us imagine and model a 25TB data warehouse with a 20TB fact table that holds 25 months of daily facts partitioned by day. The fact table is 100 columns wide and we will model two queries that reference 20 of the columns… One that touches every row and one that is date constrained and touches only 14 days of data.

Here are some hardware specs. A server with a single I/O controller can read about 1.5GB/second into the database. With two controllers can read around 2.7GB/sec. Note that these are not the theoretical limits of the hardware but real measurements taken from the current hardware on the market: Dell, HP, and SUN/Oracle.

Now let’s deploy our imaginary warehouse on a strong state of the market multi-core server with, to be conservative, a single controller. This server would scan our fact table in around 222 minutes. Partition elimination would allow the date constrained query to complete in just over 4 minutes. Note that these imaginary queries ignore the effort to join and/or aggregate data. Later I’ll have more to say on this…

If we deploy our warehouse on a shared-nothing cluster with 20 nodes the aggregate I/O bandwidth increases to 30GB/sec and the execution times for our two queries improves to 11 minutes and 12 seconds, respectively. This is the power of parallel I/O.

Now we have to factor in compression. Typical row-based compression yields approximately a 2.5x result… columnar compression varies wildly… But let’s assume 25X in our model. There is a cost to be paid to decompress the data… But since it is paid by everyone and CPU is a relatively inexpensive commodity, we’ll ignore it in our model.

For 2.5X row-based compression our big query now completes in 4.4 minutes and the smaller query completes in 4.8 seconds.

The model is a little more complicated when we throw in columnar compression so let’s consider two columnar models. For an implementation such as Exadata we get the benefit of columnar compression but not the benefit of columnar projection. 25X hybrid columnar compression will execute our two scans in 26 seconds and .5 seconds. Now we are talking! A more complete columnar implementation will only touch the columns required by our query, 20% of the data, providing another 5X improvement. This drops our scan queries to 5.2 seconds and .1 second, respectively. Smoking fast. Note that the more simple columnar compression approach will provide the same fast response when every column is touched and the more complex approach will slow down in that case… so you can make the trade off in your shop as required.

Let me remind you again… This is a full scan of 20TB with no tricks: no indices, no pre-aggregation, no materialized views, no cache and no flash, no pre-sorted zone maps. All that is required is a parallel implementation with partitioning, compression, and a columnar table type… and this implementation works. It is robust.

A note on joins… It is more difficult to model joins… and I’ll attempt a simple model in another post. But you can see that this fast scan approach has solved the costly part of the problem using parallel processing… and you can imagine that a shared-nothing massively parallel approach to joins may hold the key.