In my last post here I suggested that there were three levels of maturity around column orientation and described the first level, PAX, which provides columnar compression. This apparently is the level Exadata operates at with its Hybrid Columnar Compression.
In this post we will consider the next two levels of maturity: early materialized column processing and late materialized column processing which provide more I/O avoidance and some processing advantages.
In the previous post I suggested a five-column table and depicted each of those columns oriented on disk in separate file structures. This orientation provides the second level of maturity: columnar projection.
Imagine a query that selects only 4 of the five columns in the table leaving out the EmpFirst column. In this case the physical structure that stores EmpFirst does not have to be accessed; 20% less data is read, reducing the I/O overhead by the same amount. Somewhere in the process the magic has to be invoked that returns the columns to a row orientation… but just maybe that overhead costs less than the saving from the reduced I/O?
Better still, imagine a fact table with 100 columns and a query that accesses only 10 of the columns. This is a very common use case. The result is a 9X reduction in the amount of data that has to be read and a 9X reduction in the cost of weaving columns into rows. This is columnar projection and the impact of this far outweighs small advantage offered by PAX (PAX may provide a .1X-.5X, 10%-50%, compression advantage over full columnar tables). This is the advantage that lets most of the columnar databases beat Exadata in a fair fight.
But Teradata and Greenplum stop here. After data is projected and selected the data is decompressed into rows and processed using their conventional row-based database engines. The gains from more maturity are significant.
The true column stores read compressed columnar data into memory and then operate of the columnar data directly. This provides distinct advantages:
- Since data remains compressed DRAM is used more efficiently
- Aggregations against a single column access data in contiguous memory improving cache utilization
- Since data remains compressed processor caches are used more efficiently
- Since data is stored in bit maps it can be processed as vectors using the super-computing instruction sets available in many CPUs
- Aggregations can be executed using multiplication instead of table scans
- Distinct query optimizations are available when columnar dictionaries are available
- Column structures behave as built-in indexes, eliminating the need for separate index structures
These advantages can provide 10X-50X performance improvements over the previous level of maturity.
Summary
- Column Compression provides approximately a 4X performance advantage over row compression (10X instead of 2.5X). This is Column Maturity Level 1.
- Columnar Projection includes the advantages of Column Compression and provides a further 5X-10X performance advantage (if your queries touch 1/5-1/10 of the columns). This is Column Maturity Level 2.
- Columnar Processing provides a 10X+ performance improvement over just compression and projection. This is Column Maturity Level 3.
Of course your mileage will vary… If your workload tends to touch more than 80% of the columns in your big fact tables then columnar projection will not be useful… and Exadata may win. If your queries do not do much aggregation then columnar processing will be less useful… and a product at Level 2 may win. And of course, this blog has not addressed the complexities of joins and loading and workload management… so please do not consider this as a blanket promotion for Level 3 column stores… but now that you understand the architecture I hope you will be better able to call BS on the marketing…
Included is a table that outlines the maturity level of several products:
Product |
Columnar Maturity Level |
Notes |
Teradata |
2 |
Columnar tables, Row Engine |
Exadata |
1 |
PAX only |
HANA |
3 |
Full Columnar Support |
Greenplum |
2 |
Columnar tables, Row Engine |
DB2 |
3 |
BLU Hybrid |
SQL Server |
2 |
I think… researching… |
Vertica |
3 |
Full Columnar Support |
Paraccel |
3 |
Full Columnar Support |
Netezza |
n/a |
No Columnar Support |
Hadapt |
2 |
I think… researching… |