A Short DW DBMS Market History: HANA, Oracle, DB2, Netezza, Teradata, & Greenplum

Here is a quick review of tens years of data warehouse database competition… and a peek ahead…

Maybe ten years ago Netezza shook up the DW DBMS market with a parallel database machine that could compete with Teradata.

About six years ago Greenplum entered the market with a commodity-based product that was competitive… and then added column store to make it a price/performance winner.

A couple of years later Oracle entered with Exadata… a product competitive enough to keep the Oracle faithful on an Oracle product… but nothing really special otherwise.

Teradata eventually added a columnar feature that matched Greenplum… and Greenplum focussed away from the data warehouse space. Netezza could not match the power of columnar and could not get there so they fell away.

At this point Teradata was more-or-less back on top… although Greenplum and the other chipped away based on price. In addition, Hadoop entered the market and ate away at Teradata’s dominance in the Big Data space. The impact of Hadoop is well documented in this blog.

Three-to-four years ago SAP introduced HANA and the whole market gasped. HANA was delivering 1000X performance using columnar formats, memory to eliminate I/O, and bare-metal techniques that effectively loaded data into the processor in full cache lines.

Unfortunately, SAP did not take advantage of their significant lead in the general database markets. They focussed on their large installed base of customers… pricing HANA in a way that generated revenue but did not allow for much growth in market share. Maybe this was smart… maybe not… I was not privy to the debate.

Now Oracle has responded with in-memory columnar capability and IBM has introduced BLU. We might argue over which implementation is best… but clearly whatever lead SAP HANA held is greatly diminished. Further, HANA pricing makes it a very tough sell outside of its implementation inside the SAP Business Suite.

Teradata has provided a memory-based cache under its columnar capabilities… but this is not at the same level of sophistication as the HANA, 12c, BLU technologies which compute directly against compressed columnar data.

Hadoop is catching up slowly and we should expect that barring some giant advance from the commercial space that they will reach parity in the next 5 years or so (the will claim parity sooner… but if we require all of the capabilities offered to be present there is just no way to produce mature software any faster than 5 years).

Interestingly there is one player who seems to be advancing the state of the art. Greenplum has rolled out a best-in-class optimizer with Orca… and now has acquired Quickstep which may provide the state-of-the-art in bare metal columnar computing. When these come together Greenplum could once again bounce to the top of the performance, and the price/performance, stack. In addition, Greenplum has skinnied down and is running on an open source business model. They are very Hadoop-friendly.

It will be interesting to see if this open-source business model provides the revenue to drive advanced development… there is not really a “community” behind Greenplum development. It will also be interesting to see if the skinny business model will allow for the deployment of an enterprise-level sales force… but it just might. If Pivotal combines this new technology with a focus on the large EDW market… they may become a bigger player.

Note that was sort of dumb-luck that I posted about how Hadoop might impact revenues of big database players like Teradata right before Teradata posted a loss… but do not over think this and jump to the conclusion that Teradata is dying. They are the leader in their large space. They have great technology and they more-or-less keep up with the competition. But skinnier companies can afford to charge less and Teradata, who grew up in the days of big enterprise software, will have to skinny down like Greenplum. It will be much harder for Teradata than it was for Greenplum… and both companies will struggle with profitability for a while. But it is these technology and market dynamics that give us all something to think about, blog about, and talk about over beers…

Thinking About the Pivotal Announcements…

Yesterday I provided a model for how business sees open source as a means to be profitable (here). This is the game Pivotal seems to be playing with their release of Hadoop, Gemfire, HAWQ, and Greenplum into open source. I do not know their real numbers… so they may need more or fewer additional customers than the mythical company to get back to break-even. But it is unlikely that any company can turn the corner from a license-based revenue stream to a recurring revenue stream in a year… so Pivotal must be looking at a loss. And when losses come it is usual to cut costs… to cut R&D.

There has already been a brain-drain out of the database ranks at Pivotal as they went “all in” on Hadoop. They likely hope for an open source community to pick up the slack… but there is not a body of success I can see in building a community to engineer a commercial product-turned-open. This is especially problematic for Gemfire, an old technology that has been in the commercial space for a very long time. HAWQ has to compete for database resources with the other Hadoop RDBMS technologies… that will be difficult. Greenplum has a chance as it is based on PostgreSQL… but it is a long way away from the current PostgreSQL code base these days. There is danger here.

The bottom line… Greenplum and HAWQ and Gemfire have become risky propositions for both the current customer base and for new customers. I’ll leave it to you to evaluate the risk as this story unfolds. Still, with the risk comes reward… the cost of acquiring Greenplum will drop dramatically and today Greenplum is a competitive product. In addition, if Greenplum gains some traction, it will put price pressure on the other database products. Note that HAWQ was already marked down to open source price levels… and part of Pivotal’s problem was that HAWQ was eating at the Greenplum market. With these products priced at similar levels there becomes some weirdness in choosing… but the advantage is to customers looking at Greenplum.

One great outcome comes for Pivotal Hadoop customers… the fact that Hortonworks will more-or-less subsume Pivotal Hadoop leaves those folks in a better place than before.

If you consider the thought experiment you would have to ask yourself why a company that was breaking even would take this risky route? It could be that they took the route because they were not breaking even and this was a possible path to get even. Also consider… open sourcing code is the modern graceful way to retire an unprofitable product line.

This is sound thinking by Pivotal… during the creation, EMC gave Pivotal several unprofitable troubled assets and these announcements give Pivotal a path forward. If the database product line cannot carry their weight then they will go into maintenance mode and slowly fade. Too bad… as you know I consider Greenplum a solid product whose potential was wasted. But Pivotal has a very nice product in Cloud Foundry… and they clearly see this as their route to profitability and to an IPO… a route that no longer includes a significant contribution from database products.

Open Source is Not a Market…

This post is more about the technology business than about technology… but it may be relevant as you try to sort out winners and losers… and this sort of sorting is important if you consider new companies who may, or may not, succeed in the long run.

To make my point let us do a little thought experiment. Imagine a company doing $100M in revenue with a commercial, not open source, database product. They win the $100M in revenue by competing with Oracle, IBM, Microsoft, Teradata, et cetera… and maybe competing a little here and there with some open source products.

Let’s assume that they make 50% of their revenue from services and support, and that their average sale is $2M… so they close 25 deals a year competing in this market. Finally, let’s assume that they break-even each year and spend 20% of their revenues on R&D. The industry average for support services is 20%.. so with each $2M sale they add $400K in recurring revenue.

They are considering making their product open source. Let’s assume that they make the base product free… and provide some value-added offering that costs $200K for the average buyer. Further, they offer a support package for the same $400K/year customers currently pay. How does the math work out?

Let’s baseline against the 25 deals/year…

If they make 25 sales and every buyer buys both the support package and the value-added offer the average sale drops from $2M to $200K, sales revenue drops from $50M to $5M, the annual revenue drops from $100M to $55M… and the company loses $45M. So… starting off they need to make 225 more sales just to break even. But now it gets complicated… if they sell 5 extra deals then in the next year they earn $2M extra in support fees… so if they sell 113 extra deals in year one then in year two they have made up the entire $45M difference and they are back to break-even going forward. If it takes them 2 years to get the extra recurring revenue then they lose money in year two… but are back to break-even in year three.

From here it gets even more complicated. The mythical company above sells the baseline of 25 new copies a year with an enterprise sales force that is expensive. There is no way that the same sales force that services 25 sales/year could service 100+ extra deals. So either costs go up or the 100+ extra customers becomes unattainable. We might hope that the cost of sales will drop way off as the sales price moves to $200K. This is not unreasonable… but certainly not guaranteed. Further, if you are one of the existing sales-staff then you have to sell 10X just to make the same commission. Finally these numbers assume that every customer buys the value-add and gets enterprise-level support. Reality will be something less than this.

We might ask: is it even possible to sell 100+ more with the same product in the same market? Let us be clear that the market the database product plays in has not changed. Open Source is not a market. All we have done is reduced the sales price for the product with some hope that price is a significant driver in the market.

This is not meant as an academic exercise. Tomorrow we will consider how this thought experiment applies to Pivotal’s announcements last week… and to the future of Pivotal’s database assets (here).

Hadoop Squeezes Greenplum

For several years now I have been suggesting that Hadoop will squeeze the big data RDBMSs: Teradata, Exadata, Greenplum, and Netezza… squeezing them first out of the big data end of the market and then impinging on the high-end of the EDW space. Further I have suggested that there may be a significant and immediate TCO reduction from using Hadoop with your EDW RDBMS which squeezes these product’s market faster and further.

Originally I suggested that Greenplum and Netezza would feel the squeeze first since they were embracing Hadoop directly and at the expense of their RDBMS offerings. Greenplum took this further by trying to compete on price… cutting the price of the GPDB and then introducing HAWQ, basically GPDB on HDFS, at a Hadoop DBMS price point. These moves coupled with a neglect of the EDW market where Greenplum made its name apparently has allowed Hadoop to squeeze Greenplum out of the commercial market.

My network has been humming with rumors from reliable sources for 4+ weeks now… and I am now getting confirmation from both inside and outside Pivotal that the Greenplum software will move to open source in short order. The details are being worked out… and while there may still be a change of heart… it seems to be a done deal. The buzzness plan that Greenplum embarked on prior to the EMC acquisition in 2010 has not been a commercial success.

No one is sorrier to see this than me. Greenplum had a real shot at success. It was a very solid piece of work leading the space with strong architectural extensions like data flow shared nothingness, hybrid row/columnar capabilities, and into big data applications. The ORCA optimizer had the potential to change the game again.

Greenplum was nearly profitable in 2009 running hard at Teradata and Exadata and Netezza in the EDW space. The EDW market is tough… so we have to be fair and point out that pursuing this market may have led to the same result… but a small-market analytics play was followed by an open-source Hadoop play that could only end in squeezing Greenplum. There was never really a business plan with a win at the end.

Hopefully by open sourcing Greenplum some of the sound software will make it into PostgreSQL… but dishing Greenplum into the open source space with few developers and no community dishes it into the same space that Informix, Red Brick, and others sit. I know that I suggested open sourcing Greenplum over 18 months ago (see the wacky idea here)… but the idea then, as now, amounts to capitualization. I just declared what seemed to me to be inevitable a little sooner than Pivotal.

Teradata has now further embraced Hadoop… and they run the risk of repeating the Greenplum downturn. They have a much stronger market platform to work from… but in the long run this may also be a deadly embrace.

So here is another wacky idea. The only successful business model around open source software to date (which is not to say that there is not some other model to be discovered) generates revenue from support and services and just a little software around the edges. Teradata has a support team and a services business that knows big data and is embedded in the enterprise… Cloudera, Hortonworks, and MapR are not close here. Were Teradata to go after the Hadoop market with their own distribution (not much of a barrier to entry here.. just download the Apache stuff and build a team of committers… they might even be able to pick up the Pivotal team)… they would start from a spot way ahead of the start-ups in several respects… in several hard respects. Further they have Aster IP which could qualify as software around the edges. As a Hadoop player Teradata could more easily manage how Hadoop squeezes their business, mitigate risk, and emerge a big winner in the big data space.

Related Database Fog Blog Posts:

The Greenplum ORCA Optimizer

In January Greenplum rolled out a new query optimizer. This is very cool and very advanced stuff.

Query optimization is a search problem… in a perfect world you would search through the space of all possible plans for any query and choose the least expensive plan. But the time required to iterate through all possible plans would take more time than most queries… so optimizers use rules to cut down the space searched. The rules have been built up over the years and are designed to prune the space quickly to keep performance high for simple queries. But these rules can break down when complex queries are introduced… so Greenplum made the significant investment to build a new optimizer from scratch.

Florian Waas, the leader of this program for Greenplum (now off on another venture) explained it to me this way. If the large rectangle in Figure 1 represents the total search space for a query, a modern query optimizer only searches the area in the small gray square… it looks for the best plan in that small space.

DBFog Query Search Space Fig1You may be surprised to learn that the optimizers used by every major DBMS product are single-threaded… they use only one core of a multi-core processor to search the space and produce a plan. There is no way to effectively search more with a faster single processor (even though you could search more the amount of time you spend as a percentage of the query execution time would stay the same… because the query execution would speed up as well)… so if the optimizer is to search more of the space it will have to use multiple cores and search the space in parallel… and this is exactly what Greenplum has accomplished.

The benchmark results for this are impressive (see here)… several queries in the TPC-DS suite run hundreds of times faster.

ORCA is available to early support customers now and the results map to the benchmark… some queries see an extreme performance boost, while others run significantly slower. This is to be expected from any first release optimizer.

But Greenplum have built another advanced technology into ORCA to reduce the time it will take to mature the software. ORCA includes AMPERe, an optimizer debugging facility that captures the state necessary to recreate problems and fix them. Together these capabilities: parallel search and specialized debugging have advanced the state of the art significantly.

What does it mean to you? It will take some time to shake out ORCA… and HAWQ is still very slow when compared to other analytic databases… and very very slow when compared to the in-memory databases available… and in-memory products like Spark are coming to the Hadoop eco-system. But at the price point HAWQ is a bargain. If you need an inexpensive batch engine that crunches numbers offline then in the next year, as ORCA matures, it may be worth a look.

As a side note… this topic introduces one of the issues related to in-memory databases… when even a very complex query completes with a sub-optimal plan in under a second how much time can you spend searching the plan space? I suspect that applying the parallel optimization principles developed by the Greenplum team will yield similar or even better improvements for in-memory… and these techniques will be a requirement very soon in that space.

References

 

Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?

Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1Part 2, Part 3, Part 4Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to effectively move data from the HDFS side to GPDB. In addition Greenplum uses their Scatter-Gather method to load data into the GPDB effectively.

There is no ability to push down predicates. When a query executes all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient and there is not even the crude capability to push down processing provided by Teradata.

Finally, there is no ability to push down joins or aggregation.

Greenplum’s offering is not very advanced. To perform with Greenplum analytics data must move between the two storage layers with no intelligence to mitigate the cost.

On to the last post in the series Part 8 on SQL Server and Polybase.