My 2 Cents: Greenplum 1Q2013

Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…

Please consider this as you read on…

Summary

From a technical perspective, Greenplum is my favorite data warehouse database. It shares its basic architecture with Teradata (see here): the Greenplum team extended the core of Postgres… first building out a shared-nothing architecture and then adding feature after feature… putting the heat on the other major players. Greenplum was the first row-based RDBMS to add full columnar support… and their data-loading capability is second to none.
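For the record, the columnar support is not a bolt-on: orientation is just a storage option declared when a table is created. A minimal sketch in Greenplum's SQL dialect… the table and column names are hypothetical, and the storage options follow the 4.x-era append-only syntax:

    -- Hypothetical fact table stored column-oriented with compression;
    -- dropping the WITH clause yields an ordinary row-oriented table.
    CREATE TABLE sales_fact (
        sale_id   bigint,
        store_id  int,
        sale_date date,
        amount    numeric(12,2)
    )
    WITH (appendonly=true, orientation=column, compresstype=zlib)
    DISTRIBUTED BY (sale_id);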

Oddly, they do not want to be in the data warehouse space. Their recent announcement (here) does not include any reference to data warehousing or business intelligence. The tweets from @Greenplum, the Greenplum website, and all things marketing are focused on analytics and/or Hadoop. Even their page on data warehousing (here) has no articles on data warehousing. It is just not their target market. That is fine… the product is still a great EDW platform… but it is a worry.

Where They Win

The reason they target analytics is that they excel there. If your warehouse workload clogs because of big, complex queries… Greenplum can win the day. Their data flow architecture, which keeps tuples moving from execution step to execution step without writing to spool, lets them beat the competition on analytics. They provide a very rich set of in-database analytics and some add-on capabilities to improve the productivity of your data science team.
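To give a flavor of the in-database analytics: with the open-source MADlib library that ships alongside Greenplum, a model can be trained without moving data out of the database. A minimal sketch… the table, column, and model names here are hypothetical:

    -- Train a linear regression inside the database with MADlib
    -- (hypothetical table and column names).
    SELECT madlib.linregr_train(
        'store_sales',                      -- source table
        'sales_model',                      -- output table for the model
        'amount',                           -- dependent variable
        'ARRAY[1, footfall, promo_spend]'   -- independent variables (1 = intercept)
    );

    -- The coefficients and fit statistics land in a table, queryable like any other.
    SELECT coef, r2 FROM sales_model;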

Their data-loading architecture, which they call scatter-gather, is a big differentiator. If your problem is that you cannot get data loaded and reports out within your nightly batch window, then the combination of scatter-gather and the ability to run big report queries is unbeatable.
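The mechanics are worth sketching: external tables pointed at gpfdist file servers let every segment pull rows in parallel, so load rates scale with the cluster rather than funneling through a single head node. The host names, port, and file names below are hypothetical:

    -- The "scatter": files served in parallel from multiple gpfdist processes.
    CREATE EXTERNAL TABLE ext_sales (LIKE sales_fact)
    LOCATION (
        'gpfdist://etl-host1:8081/sales_*.dat',
        'gpfdist://etl-host2:8081/sales_*.dat'
    )
    FORMAT 'TEXT' (DELIMITER '|');

    -- The "gather": every segment reads and loads concurrently.
    INSERT INTO sales_fact SELECT * FROM ext_sales;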

Greenplum also has a unique solution for near-real-time. They marry GemFire, an in-memory object-oriented database, with scatter-gather to move small batches of inserted data into Greenplum with a very small time delta. I do not believe this solution supports updates or deletes, as those have to be applied directly to the Greenplum database… but it is a nice capability for a certain class of problems.

Where They Lose

Greenplum, like Teradata, can be beaten when the problem to be solved is narrow. In these cases, when the database supports a single application with a small number of queries, or when it supports a narrowly focused data mart, they are vulnerable to Netezza, Vertica, or even Exadata. It is also sometimes the case that a poorly designed POC narrows the scope enough that Greenplum loses.

Greenplum can also lose when a full EDW is required. The basic architecture of the RDBMS is capable of supporting an EDW… but some of the operational features required… RASR, workload management, incremental backup, etc.… are not mature. This may well be the intentional result of their focus away from these features and toward analytics.
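To be clear, the primitives are there… Greenplum ships resource queues for workload management, for example… it is the depth and maturity relative to Teradata's workload tooling that lags. A minimal sketch, with hypothetical queue and role names:

    -- Cap the concurrency and priority of ad hoc work via a resource queue
    -- (queue and role names are hypothetical).
    CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=5, PRIORITY=LOW);
    ALTER ROLE analyst RESOURCE QUEUE adhoc_queue;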

In the Market

Despite these worries, Greenplum should be included in every POC. They will push Teradata hard in performance and in price/performance.

As noted here… I do not understand their market strategy. It seems that they are competing with themselves by offering Hadoop for analytics… but this cannot be a bad thing for customers, even if it is an odd position in the market. The analytics market they favor is tough… relatively small (compared to the DW space)… emerging… contested by several capable competitors… and haunted by the same problem that killed the data mining market in the mid-1990s… there are just not enough skilled data scientists (see here).

My Guess at the Future

I cannot guess at the future of Greenplum… They are being moved into a new business unit that could be spun off into a new company with a charter to build software for the cloud (see here). This is odd in several dimensions. First, as I noted here, the shared-nothing architecture Greenplum is built on is not a perfect fit for the cloud. There are ways to get around this (maybe the topic for a future post?) but it will require development in a fundamentally new direction. Further, the new division seems to be a software-only venture. This makes the future of the EMC Greenplum Data Computing Appliance uncertain. I suppose that there will be announcements soon to clarify these questions… but the architectural disconnects make it likely that there will be some arm-waving for a while.

Next up… my 2 Cents on The Rest…

4 thoughts on “My 2 Cents: Greenplum 1Q2013”

  1. I agree on almost everything. Analytics, as in the discovery of previously unknown information, is not yet a business application… unless somebody builds one soon.
    They would not be competing with themselves if they were to run the database on the HDFS layer instead of directly on the Linux file system. This, together with the capability to access external tables, would allow a seamless move between the two.

    1. Hi Riccardo,

      If the Greenplum database stores its tables in a generic fashion in HDFS… so that the data is accessible to both Hadoop M-R programs and Greenplum SQL… then there will be a performance penalty paid by both Greenplum and Hadoop… there is no single data structure that will be optimal for both sides.

      If Greenplum stores the data in the same structure it uses today on XFS, then Hadoop will either have no access at all or pay a huge penalty. If Hadoop has no access, this path would be nonsense… GP would pay a penalty and Hadoop would still have no access.

      If the data is stored in a Hadoop-native structure, then the performance of Greenplum will be compromised… and one will wonder: why not just use Impala?
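      For what it is worth… the compromise that exists today is external access rather than native storage: Greenplum can read HDFS-resident files in place through its gphdfs external-table protocol, leaving the data in Hadoop's own format. A sketch… the namenode host, port, path, and columns are hypothetical:

          -- Read HDFS data in place via the gphdfs protocol
          -- (namenode host, port, path, and columns are hypothetical).
          CREATE EXTERNAL TABLE hdfs_clicks (
              user_id bigint,
              url     text,
              ts      timestamp
          )
          LOCATION ('gphdfs://namenode:8020/data/clicks')
          FORMAT 'TEXT' (DELIMITER E'\t');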

      Rob

  2. The Greenplum messaging definitely seems to have shifted completely from traditional DBMS-based data warehousing to Hadoop-based ‘big data’.

    The surprising, and disappointing, failure of the Greenplum DBMS to gain any significant market traction may be part of the rationale behind the shift in emphasis. In the 2.5 years since EMC bought Greenplum we have seen the BI industry swing fully towards analytics and ‘big data’.

    EMC may have had little success selling Greenplum DBMS into its customer base – based on what we’ve seen in the UK/EMEA, although that is slowly changing – but with the change in marketing emphasis EMC can now ride the ‘big data’ wave. Seems to make perfect sense, however disappointing!

    Combining both the DBMS and Hadoop worlds and running GPDB on top of HDFS might be the future for EMC’s offering:

    http://www.youtube.com/watch?v=wb46DnWM3_M

    MPP databases in general may be considered less than ideal candidates for public cloud deployment, as you point out in other articles. That said, we’ve always considered the Greenplum DBMS to be the ‘least bad’ fit as a public cloud offering.

    The fact that Greenplum requires no special server, storage, networking or other hardware, and can be deployed on a variety of OSes and file systems, makes it a good candidate for public cloud deployment where not all of these variables can be controlled.

    We’ve built several Greenplum systems, both SMP and MPP, over the last few years and never run into any blockers. I don’t think we’ve ever used the same hardware, OS or file system on any two Greenplum systems.

    Greenplum certainly gets our vote as a Postgres-based, feature-rich, sensibly priced, ‘roll your own’ MPP DBMS offering. What’s not to like?

    Greenplum may well flourish as a software-only offering, rather than being offered wrapped up in the DCA. It is available as software only at present, but that was never likely to get the EMC sales guys overly excited.

    1. To be fair, Greenplum announced the move from data warehousing to analytics well before the EMC acquisition… and the executive leadership team at Greenplum are the originals… so the move to Hadoop is either their idea or managed with their consent… we should not blame EMC for these strategies.

      As I’ve said… shared-nothing tightly ties CPU, memory, I/O bandwidth, and storage (and therefore data) into scalable components. Cloud computing is based on the ability to move processes (virtual machines) around without moving the data. I do not see the difference between Teradata and Greenplum in this regard… “least bad” is not significant, methinks.

      I think that there will be some effort to find a middle ground that provides some hardware abstraction for shared-nothing to work in the cloud. But it is problematic and likely to negatively impact performance… we will see whether it negatively impacts price/performance as well.

      And… once, over beers, an EMC exec told me that the strategy of tying Greenplum to hardware in an appliance came from the Greenplum side not from the EMC side. This is rumor quality… unsubstantiated… but unsurprising.

      Finally, between the recent announcements regarding improvements to Hive… and the release of Impala… I do not believe that Greenplum on HDFS is compelling. As I noted in a comment earlier… either the GP data will live in HDFS as “native” HDFS data accessible to MapReduce jobs, which will negatively affect GP performance (and place it squarely in competition with the less expensive Hive and Impala options), or it will store data in HDFS in GP-native formats… slowing GP down only a little but adding no real advantage for anyone. This second option would be nearly nonsensical.

      Thanks for the thoughtful comments…

      Rob
