Who is How Columnar? Exadata, Teradata, and HANA – Part 2: Column Processing

In my last post here I suggested that there were three levels of maturity around column orientation and described the first level, PAX, which provides columnar compression. This apparently is the level Exadata operates at with its Hybrid Columnar Compression.

In this post we will consider the next two levels of maturity: early-materialized and late-materialized column processing, which provide further I/O avoidance and some processing advantages.

In the previous post I suggested a five-column table and depicted each of those columns oriented on disk in separate file structures. This orientation provides the second level of maturity: columnar projection.

Imagine a query that selects only four of the five columns in the table, leaving out the EmpFirst column. In this case the physical structure that stores EmpFirst does not have to be accessed; 20% less data is read, reducing the I/O overhead by the same amount. Somewhere in the process the magic has to be invoked that returns the columns to a row orientation… but just maybe that overhead costs less than the savings from the reduced I/O?

Better still, imagine a fact table with 100 columns and a query that accesses only 10 of them. This is a very common use case. The result is a 10X reduction in the amount of data that has to be read, and a matching reduction in the cost of weaving columns into rows. This is columnar projection, and its impact far outweighs the small advantage offered by PAX (PAX may provide a 10%-50% compression advantage over fully columnar tables). This is the advantage that lets most of the columnar databases beat Exadata in a fair fight.
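To make projection concrete, here is a toy Python sketch of the one-file-per-column layout described above. The table, column names, and file format are all invented for illustration — real engines use far more sophisticated storage — but the point survives: a query that never names EmpFirst never opens EmpFirst's file.

```python
import os
import tempfile

# Hypothetical five-column employee table, one file per column on disk.
columns = {
    "EmpID":    ["1", "2", "3"],
    "EmpFirst": ["Ann", "Bob", "Cal"],
    "EmpLast":  ["Ames", "Bond", "Cole"],
    "Dept":     ["10", "20", "10"],
    "Salary":   ["100", "200", "150"],
}

tmp = tempfile.mkdtemp()
for name, values in columns.items():
    with open(os.path.join(tmp, name + ".col"), "w") as f:
        f.write("\n".join(values))

def project(wanted):
    """Read only the column files the query names; never touch the rest."""
    data = {}
    for name in wanted:
        with open(os.path.join(tmp, name + ".col")) as f:
            data[name] = f.read().split("\n")
    return data

# A query over 4 of the 5 columns reads 4 of the 5 files: 20% less I/O.
result = project(["EmpID", "EmpLast", "Dept", "Salary"])
```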

But Teradata and Greenplum stop here. After data is projected and selected it is decompressed into rows and processed by their conventional row-based database engines. The gains from the next level of maturity are significant.

The true column stores read compressed columnar data into memory and then operate on the columnar data directly. This provides distinct advantages:

  • Since data remains compressed DRAM is used more efficiently
  • Aggregations against a single column access data in contiguous memory improving cache utilization
  • Since data remains compressed processor caches are used more efficiently
  • Since data is stored in bit maps it can be processed as vectors using the SIMD (vector) instruction sets available in many CPUs
  • Aggregations can be executed using multiplication instead of table scans
  • Distinct query optimizations are available when columnar dictionaries are available
  • Column structures behave as built-in indexes, eliminating the need for separate index structures
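Two of those points — operating on compressed data directly, and aggregating with multiplication instead of a scan of every row value — can be sketched in a few lines of Python. The dictionary encoding and the column values below are invented for illustration; real column stores combine dictionary, run-length, and other encodings:

```python
from collections import Counter

# Dictionary-encoded Dept column: each value is replaced by a small code.
dictionary = {0: 10, 1: 20, 2: 30}    # code -> actual Dept value
codes = [0, 1, 0, 2, 1, 0, 0, 2]      # the compressed column, codes only

# Late materialization: aggregate over the codes and decode once at the end,
# never expanding the column back into full rows.
counts = Counter(codes)               # one pass over the compressed data

# "Aggregation via multiplication": SUM(Dept) = sum over distinct codes of
# (decoded value * occurrence count), instead of adding up every row.
total = sum(dictionary[code] * n for code, n in counts.items())

# COUNT(DISTINCT Dept) falls out of the dictionary essentially for free.
distinct = len(counts)
```

Note how the expensive work is proportional to the number of distinct values rather than the number of rows — that is where much of the 10X-50X claimed below comes from on aggregation-heavy workloads.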

These advantages can provide 10X-50X performance improvements over the previous level of maturity.

Summary

  • Column Compression provides approximately a 4X compression advantage over row compression (10X instead of 2.5X). This is Column Maturity Level 1.
  • Columnar Projection includes the advantages of Column Compression and provides a further 5X-10X performance advantage (if your queries touch 1/5-1/10 of the columns). This is Column Maturity Level 2.
  • Columnar Processing provides a 10X+ performance improvement over just compression and projection. This is Column Maturity Level 3.

Of course your mileage will vary… If your workload tends to touch more than 80% of the columns in your big fact tables then columnar projection will not be useful… and Exadata may win. If your queries do not do much aggregation then columnar processing will be less useful… and a product at Level 2 may win. And of course, this blog has not addressed the complexities of joins and loading and workload management… so please do not consider this as a blanket promotion for Level 3 column stores… but now that you understand the architecture I hope you will be better able to call BS on the marketing…

Included is a table that outlines the maturity level of several products:

Product    | Columnar Maturity Level | Notes
---------- | ----------------------- | ----------------------------
Teradata   | 2                       | Columnar tables, Row Engine
Exadata    | 1                       | PAX only
HANA       | 3                       | Full Columnar Support
Greenplum  | 2                       | Columnar tables, Row Engine
DB2        | 3                       | BLU Hybrid
SQL Server | 2                       | I think… researching…
Vertica    | 3                       | Full Columnar Support
Paraccel   | 3                       | Full Columnar Support
Netezza    | n/a                     | No Columnar Support
Hadapt     | 2                       | I think… researching…

6 thoughts on “Who is How Columnar? Exadata, Teradata, and HANA – Part 2: Column Processing”

    • Hi Mick…

      I’ll do some more reading… but I suspect that Infobright is a level 1 column store. That is, they use column orientation to achieve compression but then store multiple columns in a data pack. The very cool thing they do is that they capture detailed statistics for each data pack which allows them to answer some queries from the stats… and allows them to retrieve and decompress only packs that seem relevant to the query. It is sort of like IBM’s skip processing… only with more intelligent skipping.
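      A toy sketch of that skipping idea in Python — the packs, statistics, and values below are invented for illustration, not Infobright's actual format:

      ```python
      # Each "data pack" holds a chunk of a column plus precomputed statistics.
      packs = [
          {"min": 1,   "max": 50,  "values": [1, 17, 50, 33]},
          {"min": 60,  "max": 90,  "values": [60, 75, 90, 61]},
          {"min": 100, "max": 120, "values": [100, 111, 120, 105]},
      ]

      def scan_gt(packs, threshold):
          """Return values > threshold, reading only packs the stats can't rule out."""
          hits, packs_read = [], 0
          for p in packs:
              if p["max"] <= threshold:
                  continue              # statistics alone prove the pack is irrelevant
              packs_read += 1           # otherwise fetch, decompress, and scan it
              hits.extend(v for v in p["values"] if v > threshold)
          return hits, packs_read

      hits, packs_read = scan_gt(packs, 95)   # only the third pack is touched
      ```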

      Rob


    • Hi Martijn,

      Vectorwise is a self-proclaimed PAX system… and therefore a Level 1 column store. But it does not do them justice to call them a column store… their approach is quite different. I’ll write a little more in some notes I’ll publish next week…

      Rob


  1. Dear Rob,

    I can’t see how Vertica can be classified as true columnar according to your definitions.

    First, Vertica cannot operate directly on compressed data – it is written in its current documentation.

    Second, and much more important, you say “Column structures behave as built-in indexes, eliminating the need for separate index structures”. I am not sure this is true for Vertica. I didn’t see anything that says Vertica stores columns in any way that allows filtering except by a column scan, and I didn’t see anything about Vertica being able to operate on more than one column together using bitwise operations or in any other way.

    As far as I can see, the only way for Vertica to filter by the value of a column without scanning the whole column is to have a projection that fits, and the situation gets worse when you want to filter by a combination of fields, where the only way is to have a projection that fits that specific combination of fields.

    I can’t see in what way these projections are better than the old indexes.

    Thanks, Jacob


    • Hi Jacob…

      I did not mean to imply that operating on compressed data was a requirement for a full, Level 3, column store. The real test is whether the execution engine operates on columns. I think that you will find that this is the case for Vertica.

      Here is a post that might help you to see that Vertica has the right stuff (http://www.vertica.com/2010/04/22/column-store-vs-column-store/).

      I’m not a big fan of projections… I understand their usefulness… But like other materialized data they are expensive to maintain and narrow in scope.

      Rob

