HAWQ and Hadoop and Open Source and a Wacky Idea

[Image: Juvenile Cooper's Hawk (Accipiter cooperii)]

I want to soften my criticism of Greenplum's announcement of HAWQ a little. This post by Merv Adrian convinced me that part of my blog here looked at the issue of whether HAWQ is Hadoop too simply. I could outline a long chain of logic showing the difficulty of making a rule for what is Hadoop and what is not (simply: MapR is Hadoop and commercial… Hadapt is Hadoop and uses a non-standard file format… so what is the rule?). But it is not really important… and I did not help my readers by getting sucked into the debate. It is not important whether Greenplum is Hadoop or not… whether they have committers or not. They are surely in the game, and when other companies start treating them as competitors by calling them out (here), it proves that this is so.

It is not important, really, whether they have 5 developers or 300 on “Hadoop”. They may have been over-zealous in marketing this… but they were trying to impress us all with their commitment to Hadoop… and they succeeded… we should not doubt that they are “all-in”.

This leaves my concern, discussed here, over the technical sense of deploying Greenplum on HDFS as HAWQ… or deploying Greenplum in native mode with the UAP Hadoop integration features, which include all of the same functionality as HAWQ… and 2X-3X better performance.

It leaves my concern that their open source competition more-or-less matches them in performance when queries are run against non-proprietary, native Hadoop data structures… and my concern that the community will match their performance very soon in every respect.

It is worth highlighting the value of HAWQ's very nearly complete support for the SQL standard against native Hadoop data structures. This differentiates them today. But building out a SQL dialect is not a hard technical problem these days, and I predict that there will be very nearly complete support for SQL in an open source offering in the next 18-24 months.

These technical issues leave me concerned about the viability of Greenplum in the market. But there are two ways to look at the EMC Pivotal Initiative: it could be a cloud play… in which case Greenplum will be an uncomfortable fit; or it could be an open source play… in which case, here comes the wacky idea, Greenplum could be open-sourced alongside Cloud Foundry and then this whole issue of committers and Hadoopiness becomes moot. Greenplum is, after all, Postgres under the covers.

2 thoughts on “HAWQ and Hadoop and Open Source and a Wacky Idea”

  1. Rob, you refer to “HAWQ’s very nearly complete support for the SQL standard against native Hadoop data structures.” But as far as I can tell, in order for HAWQ to access HDFS files, it needs to go through GPXF – which looks to me like GP External Tables (per Chuck Hollis’ blog) – which is the same technique that GP DB used in the past. So it doesn’t seem much different to me – other than sharing the nodes with Hadoop instead of a separate cluster. Am I missing something?

    1. Hi Glen…

      As you point out, HAWQ access to native HDFS data is via external tables. But Greenplum provides full support for external tables… hence they provide comprehensive SQL support against native Hadoop files as external tables. This is important. I’m doing the research now… but to my knowledge only Microsoft’s PolyBase equals this.
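
      To make the mechanism concrete, here is a minimal sketch of the external-table pattern involved. The table name, columns, host, and path below are made up, and I am using Greenplum's documented gphdfs protocol for illustration; GPXF/HAWQ follows the same external-table approach with its own protocol string.

          -- Hypothetical external table over a delimited file in HDFS.
          -- Table, columns, host, port, and path are illustrative only.
          CREATE EXTERNAL TABLE ext_sales (
              sale_id   integer,
              sale_date date,
              amount    numeric(10,2)
          )
          LOCATION ('gphdfs://namenode:8020/data/sales/sales.txt')
          FORMAT 'TEXT' (DELIMITER '|');

          -- Once defined, the full SQL dialect applies as if it were a local table:
          SELECT sale_date, sum(amount)
          FROM ext_sales
          GROUP BY sale_date;

      The point is that the external table is just a catalog entry plus a protocol for reading HDFS… all of the SQL support comes from the database engine behind it.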

      I agree that there is no apparent new functionality provided by HAWQ… which is why I cannot imagine anyone implementing HAWQ instead of native Greenplum. Why would you pay a steep performance penalty for HAWQ with no apparent gain?

      Rob

