Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:
- How parallel are the pipes to move data between the RDBMS and the parallel file system;
- Is there intelligence to push down predicates; and
- Is there more intelligence to push down joins and other relational operators?
The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to effectively move data from the HDFS side to GPDB. In addition Greenplum uses their Scatter-Gather method to load data into the GPDB effectively.
There is no ability to push down predicates. When a query executes all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient and there is not even the crude capability to push down processing provided by Teradata.
Finally, there is no ability to push down joins or aggregation.
Greenplum’s offering is not very advanced. To perform with Greenplum analytics data must move between the two storage layers with no intelligence to mitigate the cost.
On to the last post in the series Part 8 on SQL Server and Polybase.