- How parallel are the pipes to move data between the RDBMS and the parallel file system;
- Is there intelligence to push down predicates; and
- Is there more intelligence to push down joins and other relational operators?
I want to be sure that I’ve conveyed the concepts behind these criteria properly… I may have rushed it in the early parts of this series.
Let’s imagine a query that joins a 2,000,000-row table with a 1,000-row dimension table, where both live in HDFS.
If all of the data has to be moved from HDFS to the RDBMS, then 2,001,000 rows must be read and moved before a predicate or any other processing can be applied. For fun, let’s say that the cost of moving this data is 2001K.
If there are 10 parallel pipes then the data movement is completed in one tenth the time… so the cost is roughly 200K.
If a predicate is included that selects only 5% of the data from the big table, and the predicate is pushed down, the cost is reduced to 101K: 100K rows from the big table plus the 1K rows of the dimension table. Add in parallel pipes and the cost is roughly 10K.
Now imagine a query that joins the two tables, with a predicate on one side and predicate push down. You still have to pay 101K to pull the selected data up and do the join in the RDBMS. If there is a join predicate that reduces the final answer set by another 95%, then after the join you return 6K rows. Since everybody returns the same 6K rows as an answer, we won’t add that in.
But if you can push the join down as well as the predicates then only 6K rows are moved up… so you can see how 2001K shrinks to 6K through the effective push down of processing.
Further, you can build arbitrarily complex queries and model them pretty well knowing that most of the cost is in data movement.
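To make the arithmetic concrete, here is a minimal sketch of that model in Python. It assumes, as the example above does, that cost is simply the number of rows moved divided by the number of parallel pipes. The function and variable names are my own for illustration, and the 6K join result is taken straight from the example rather than derived.

```python
# Toy cost model: cost is proportional to the rows moved from HDFS to the
# RDBMS, divided by the number of parallel pipes. All names are illustrative.

BIG_ROWS = 2_000_000      # big table in HDFS
DIM_ROWS = 1_000          # dimension table in HDFS
PIPES = 10                # parallel pipes between HDFS and the RDBMS
SELECTIVITY_PCT = 5       # the predicate keeps 5% of the big table
JOIN_RESULT_ROWS = 6_000  # rows left after the join (figure from the example)


def movement_cost(rows_moved: int, pipes: int = 1) -> float:
    """Cost in row units: rows moved, spread evenly across the pipes."""
    return rows_moved / pipes


# Rows that must cross the wire in each case.
no_pushdown = BIG_ROWS + DIM_ROWS
predicate_pushdown = BIG_ROWS * SELECTIVITY_PCT // 100 + DIM_ROWS

scenarios = {
    "no push down, 1 pipe": movement_cost(no_pushdown),
    "no push down, 10 pipes": movement_cost(no_pushdown, PIPES),
    "predicate push down, 1 pipe": movement_cost(predicate_pushdown),
    "predicate push down, 10 pipes": movement_cost(predicate_pushdown, PIPES),
    "predicate + join push down": movement_cost(JOIN_RESULT_ROWS),
}

for name, cost in scenarios.items():
    print(f"{name:<32} {cost / 1000:>8.1f}K")
```

Running it walks through the same steps as the example: 2001K, about 200K, 101K, about 10K, and finally 6K. To model a more complex query, change the row counts and selectivity and add a scenario for each place where processing could be pushed down.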
So think about how Teradata processes these two tables in Hadoop when you use the specialized SQL constructs, and then again if you build the query from a BI tool. And stay tuned, as I’ll show you how HANA processes the data next… and then talk about several others.
On to Part 6…