I would like to point out a very important section in the paper on Hekaton on the Microsoft Research site here. I will quote the section in total:
2. DESIGN CONSIDERATIONS
An analysis done early on in the project drove home the fact that a 10-100X throughput improvement cannot be achieved by optimizing existing SQL Server mechanisms. Throughput can be increased in three ways: improving scalability, improving CPI (cycles per instruction), and reducing the number of instructions executed per request. The analysis showed that, even under highly optimistic assumptions, improving scalability and CPI can produce only a 3-4X improvement. The detailed analysis is included as an appendix.
The only real hope is to reduce the number of instructions executed but the reduction needs to be dramatic. To go 10X faster, the engine must execute 90% fewer instructions and yet still get the work done. To go 100X faster, it must execute 99% fewer instructions. This level of improvement is not feasible by optimizing existing storage and execution mechanisms. Reaching the 10-100X goal requires a much more efficient way to store and process data.
This is important because it confirms the difference in a Level 3 and a Level 2 columnar implementation as described here. It is just not possible for a Level 2 implementation with a row-based join engine to achieve the performance of a Level 3 implementation. This will allow the Level 3 implementations: HANA, BLU, Hekaton, and Oracle 12c to distance themselves from the Level 2 products: Teradata and Greenplum; by more than 10X… and this is a very significant advantage.
There are three forms of columnar-orientation currently deployed by database systems today. Each builds upon the next. The simplest form uses column-orientation to provide better data compression. The next level of maturity stores columnar data in separate structures to support columnar projection. The most mature implementations support a columnar database engine that performs relational algebra on column-oriented data. Let me explain…
Imagine a simple table with 1M rows… with the schema and the first several rows depicted in Figure 1. Conceptually, a row-orientation deploys data on disk and in-memory as depicted in Figure 2 and a column-orientation deploys data on disk and in-memory as depicted in Figure 3. The actual deployment may be significantly different, as we will see.
Note that I am going to throw out some indicative numbers around compression. I will suggest that applying compression to rows will provide from 1.5X to 3.5X compression with and average of 2.5X… and that applying compression to columns provides from 3X compression to 50X compression with the average around 10X. These are supportable numbers but the compression you see for any specific data set will vary.
There are two powerful compression techniques that individually or combined provide most of the benefits: dictionary-encoding and run-length encoding. For the purposes of this blog I will describe only dictionary-encoding; and I will do an injustice to that by explaining it only briefly and conceptually… just enough that you get the idea.
Further compression is possible by encoding runs of similar values to a value plus the number of times it repeats so that the bit stream 0000000000000000 could be represented as 01111 (0 occurs 24 times).
You can now also start to see why column-orientation compresses better that a row-orientation. In the row block above there is little opportunity to encode whole rows in a dictionary… the cardinality of rows in a table is too high (note that this may not be true for a dimension table which is, in-effect, a dictionary). There is some opportunity to encode the bit runs in a row… as noted, you can expect to get 2X-2.5X from row compression for a fact table. Column-orientation allows dictionary encoding to be applied effectively to low cardinality columns… and this accounts for the advantage there.
Dictionary-encoding reduces data to a compressed form by building a map that provides a translation for each cardinal value in the table to a tightly compressed form. For example, if there are indeed only three values possible in the DeptID field above then we might build a dictionary for that column as depicted in Figure 4. You can see… by encoding and storing the data in the minimal number of bits required, significant storage reduction is possible… and the lower the cardinality of a column the smaller the resulting bit representation.
Note that there is no free lunch here. There is a cost to be paid in CPU cycles to compress data and to decompress data… but for a read-optimized data warehouse database compression is cool. Exactly how cool depends on the level of maturity and we will get to that as we go.
It is crucial to remember that column store databases are relational. They ingest rows and emit rows and perform relational algebra in-between. So there has to be some magic that turns tuples into columns and restores them from columns. The integrity of a row has to persist. Again I am going to defer on the details and point you at the references below… but imagine that for each row a bit map is built that, for each column, points to the entry in the column dictionary with the proper value.
There is no free lunch to column store… no free lunch anywhere, it seems. Building this bit map on INSERT is very expensive, and modifying it on UPDATE is fairly expensive. This is why column-orientation is not suitable for OLTP workloads without some extra effort. But the cost is amortized by significant performance gains for READs.
One last concept: since peripheral I/O reads blocks imagine two approaches to column compression: one applies the concepts above to an entire table breaking each table into separate column-oriented files that may be read separately; and one which applies the concepts individually to each large block in a table file. Imagine, in the first case that Figure 2 represents a picture of the first few rows in our 1M-row table. Imagine, in the second case, that Figure 2 represents the rows in one block of data re-oriented into columns.
This second, block-oriented, approach is called PAX, and it is more-or-less the approach used by Exadata. In the PAX approach each block contains its own mini-column store and a dictionary for dictionary encoding with the values in the block. Because the cardinality for columns within a block will often be less than for an entire table there are some distinct advantages to PAX compression. Compression will be higher by more than a little than for full table columnar compression.
When Exadata reads a block from disk it decompresses the data back into rows and performs row-oriented processing to complete the query. This is very cool for Exadata… a great feature. As noted, column compression may be 4X better than row compression on the average. This reduces the storage requirements and reduces the overhead of I/O by 4X… and this is a very significant improvement. But Exadata stops here. It is not a columnar-oriented DBMS and it misses the significant advantages that come from the next two levels of column-orientation… I’ll take these up in the next post.
To be clear, all of the databases that use these more mature techniques: Teradata, HANA, Greenplum, Vertica, Paraccel, DB2, and SQL Server gain from columnar compression even if the PAX approach provides some small advantage as a compression technique.
It is also worth noting that Teradata does not gain as much as others in this regard. This is not because of poor design, rather it is due to the fact that, to their credit, Teradata implemented a Teradata-specific dictionary-based compression scheme long ago. Columnar compression let others catch up to what Teradata has offered for years.
And before you ask… Netezza offers no columnar orientation… preferring to compress deeply using an FPGA co-processor to decompress… and to reduce I/O using zone maps rather than the using the mid-level column projection techniques in the next blog here.
This post has been thrown at me a couple of times now… so I’ll now take the time to go through it… and try to address the junk.
It starts by suggesting that “the Germans” have started a war… but the next sentence points out that the author tossed grenades at HANA two months before the start he suggests. It also ignores the fact that the HANA post in question was a response to incorrect public statements by a Microsoft product manager about HANA (here).
The author suggests some issue with understanding clustered indexes… Note that “There are 2 implementations of xVelocity columnstore technology: 1. Non clustered index which is read only – this is the version available in SMP (single node) SQL Server 2012. 2. Columnstore as a clustered index that is updateable – This is the version available in MPP or PDW version of SQL 2012.”. The Microsoft documentation I read did not distinguish between the two and so I mistakenly attributed features of one to the other. Hopefully this clears up the confusion.
He suggests that the concept of keeping redundant versions of the data… one for OLTP and one for BI is “untrue”… I believe that the conventional way to deal with OLTP and BI is to build separate OLTP and BI databases… data warehouses and data marts. So I stand by the original comment.
The author rightfully suggests that I did not provide a reference for my claim that there are odd limitations to the SQL that require hand-coding… here they are (see the do’s and dont’s).
He criticizes my statement that shared-nothing gave us the basis for solving “big data”. I do not understand the criticism? Nearly very large database in the world is based on a shared-nothing architecture… and the SQL Server PDW is based on the same architecture in order to allow SQL Server to scale.
He is critical of the fact that HANA is optimized for the hardware and suggests that HANA does not support Intel’s Ivy Bridge. HANA is optimized for Ivy Bridge… and HANA is designed to fully utilize the hardware… If we keep it simple and suggest that using hardware-specific instruction sets and hardware-specific techniques to keep data in cache together provide a 50X performance boost [This ignores the advantages of in-memory and focusses only on hw-specific optimizations… where data in cache is either 15X (L3) or 20X (L2) or 200X (L1) faster than data fetched from DRAM… plus 10X or more using super-computer SIMD instructions], I would ask… would you spend 50X more for under-utilized hardware if you had a choice? SAP is pursuing a distinct strategy that deserves a more thoughtful response than the author provided.
He accuses me of lying… lying… about SQL being architected for single-core x286 processors. Sigh. I am unaware of a rewrite of the SQL Server product since the 286… and tacking on support for modern processors is not re-architecting. If SQL Server was re-architected from scratch since then I would be happy to know that I was mistaken… but until I hear about a re-write I will assume the SQL Server architecture, the architecture, is unchanged from when Sybase originally developed it and licensed it to Microsoft.
He says that HANA is cobbled together from older piece parts… and points to a Wikipedia page. But he does not use the words in the article… that HANA was synthesized from other products and , as stated in the next sentence, built on: “a new application architecture“. So he leaves the reader to believe that there is nothing new… he is mistaken. HANA is more than a synthesis of in-memory, column-store, and shared-nothing. It includes a new execution engine built on algorithms from the search space… columns in the column store are processed as vectors rather than the rote tuple-by-tuple approach from the 1980’s. It includes powerful in-database support for procedural languages with facilities that convert loops to fully parallel set-based processes. It provides, as noted above, a unique approach to supporting OLTP and BI queries in the same instance (see here)… and more. I’m not trying to hype HANA here… time and the market will determine if these new features are important… but there is no doubt that they are new.
I did not find the Business Intelligist post to be very informative or helpful. With the exception of the Wikipedia article mentioned above there is only unsubstantiated opinion in the piece… … and a degree of rudeness that is wholly uncalled for.