Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?

Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1Part 2, Part 3, Part 4Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to effectively move data from the HDFS side to GPDB. In addition Greenplum uses their Scatter-Gather method to load data into the GPDB effectively.

There is no ability to push down predicates. When a query executes all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient and there is not even the crude capability to push down processing provided by Teradata.

Finally, there is no ability to push down joins or aggregation.

Greenplum’s offering is not very advanced. To perform with Greenplum analytics data must move between the two storage layers with no intelligence to mitigate the cost.

On to the last post in the series Part 8 on SQL Server and Polybase.

4 thoughts on “Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?”

  1. I think you are not talking about the HAWQ engine, but can you confirm?

    1. Nope… Talking about GPDB, dong Jiang. But HAWQ is similar… HAWQ is basically GPDB on HDFS…. But since HAWQ lives alongside Hadoop the lack of advanced features are slightly less painful (at the cost of performing 2X slower overall).

      Rob

Comments are closed.

Discover more from Database Fog Blog

Subscribe now to keep reading and get access to the full archive.

Continue reading