Thinking About the Pivotal Announcements…

Yesterday I provided a model for how a business might see open source as a means to profitability (here). This is the game Pivotal seems to be playing with their release of Hadoop, Gemfire, HAWQ, and Greenplum into open source. I do not know their real numbers… so they may need more or fewer additional customers than the mythical company to get back to break-even. But it is unlikely that any company can turn the corner from a license-based revenue stream to a recurring revenue stream in a single year… so Pivotal must be looking at a loss. And when losses come it is usual to cut costs… to cut R&D.

There has already been a brain-drain out of the database ranks at Pivotal as they went “all in” on Hadoop. They likely hope an open source community will pick up the slack… but I can see no track record of success in building a community to engineer a commercial product turned open source. This is especially problematic for Gemfire, an old technology that has been in the commercial space for a very long time. HAWQ has to compete for database resources with the other Hadoop RDBMS technologies… that will be difficult. Greenplum has a chance because it is based on PostgreSQL… but it has drifted a long way from the current PostgreSQL code base. There is danger here.

The bottom line… Greenplum, HAWQ, and Gemfire have become risky propositions for both the current customer base and for new customers. I’ll leave it to you to evaluate the risk as this story unfolds. Still, with the risk comes reward… the cost of acquiring Greenplum will drop dramatically, and today Greenplum is a competitive product. In addition, if Greenplum gains some traction, it will put price pressure on the other database products. Note that HAWQ was already marked down to open source price levels… and part of Pivotal’s problem was that HAWQ was eating into the Greenplum market. With these products priced at similar levels the choice between them gets a little odd… but the advantage is to customers looking at Greenplum.

One clearly good outcome is for Pivotal Hadoop customers… the fact that Hortonworks will more-or-less subsume Pivotal Hadoop leaves those folks in a better place than before.

If you consider the thought experiment, you would have to ask yourself why a company that was breaking even would take this risky route. It could be that they took the route because they were not breaking even and this was a possible path back to break-even. Also consider… open sourcing code is the modern, graceful way to retire an unprofitable product line.

This is sound thinking by Pivotal… at its creation, EMC handed Pivotal several troubled, unprofitable assets, and these announcements give Pivotal a path forward. If the database product line cannot carry its weight then it will go into maintenance mode and slowly fade. Too bad… as you know, I consider Greenplum a solid product whose potential was wasted. But Pivotal has a very nice product in Cloud Foundry… and they clearly see this as their route to profitability and to an IPO… a route that no longer includes a significant contribution from database products.

Open Source is Not a Market…

This post is more about the technology business than about technology… but it may be relevant as you try to sort out winners and losers… and this sort of sorting is important if you consider new companies who may, or may not, succeed in the long run.

To make my point let us do a little thought experiment. Imagine a company doing $100M in revenue with a commercial, not open source, database product. They win the $100M in revenue by competing with Oracle, IBM, Microsoft, Teradata, et cetera… and maybe competing a little here and there with some open source products.

Let’s assume that they make 50% of their revenue from services and support, and that their average sale is $2M… so they close 25 deals a year competing in this market. Finally, let’s assume that they break even each year and spend 20% of their revenues on R&D. The industry average for support services is 20% of the sale price… so with each $2M sale they add $400K in recurring revenue.

They are considering making their product open source. Let’s assume that they make the base product free… and provide some value-added offering that costs $200K for the average buyer. Further, they offer a support package for the same $400K/year that customers currently pay. How does the math work out?

Let’s baseline against the 25 deals/year…

If they make 25 sales and every buyer takes both the support package and the value-added offer, the average sale drops from $2M to $200K, new-sale revenue drops from $50M to $5M, and annual revenue drops from $100M to $55M… so the company loses $45M. Starting off, then, they need to make 225 more sales just to break even. But now it gets complicated… each extra deal also adds $400K in recurring support the following year, so 5 extra deals earn an extra $2M in support fees in year two. That means if they sell 113 extra deals in year one (113 × $400K ≈ $45M), then in year two the new recurring revenue makes up the entire $45M difference and they are back to break-even going forward. If it takes them two years to build up the extra recurring revenue then they lose money in year two… but are back to break-even in year three.
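Here is a minimal sketch of that arithmetic, using only the assumed numbers from the thought experiment above (these are illustrative figures, not any vendor’s actual financials):

```python
# Thought-experiment arithmetic only: the numbers are the assumptions from
# the text above, not any vendor's actual financials.
import math

DEALS_PER_YEAR = 25
OLD_AVG_SALE = 2_000_000            # commercial license sale
NEW_AVG_SALE = 200_000              # value-added offering after open sourcing
RECURRING_SUPPORT = 400_000         # 20% of the old sale price, per customer per year
EXISTING_SUPPORT_BASE = 50_000_000  # 50% of the $100M comes from services/support

old_revenue = DEALS_PER_YEAR * OLD_AVG_SALE + EXISTING_SUPPORT_BASE  # $100M
new_revenue = DEALS_PER_YEAR * NEW_AVG_SALE + EXISTING_SUPPORT_BASE  # $55M
gap = old_revenue - new_revenue                                      # $45M

# Extra $200K deals needed to close the gap in year one...
extra_deals_year_one = math.ceil(gap / NEW_AVG_SALE)          # 225
# ...or extra year-one deals whose recurring support closes it from year two on.
extra_deals_via_support = math.ceil(gap / RECURRING_SUPPORT)  # 113

print(f"Year-one revenue gap: ${gap:,.0f}")
print(f"Extra deals to break even in year one: {extra_deals_year_one}")
print(f"Extra deals whose support closes the gap from year two: {extra_deals_via_support}")
```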

From here it gets even more complicated. The mythical company above sells the baseline of 25 new copies a year with an enterprise sales force that is expensive. There is no way that the same sales force that services 25 sales/year could service 100+ extra deals. So either costs go up or the 100+ extra customers become unattainable. We might hope that the cost of sales drops way off as the sales price moves to $200K. This is not unreasonable… but certainly not guaranteed. Further, if you are one of the existing sales staff then you have to sell 10X the deals just to make the same commission. Finally, these numbers assume that every customer buys the value-add and gets enterprise-level support. Reality will be something less than this.

We might ask: is it even possible to sell 100+ more copies of the same product in the same market? Let us be clear that the market the database product plays in has not changed. Open Source is not a market. All we have done is reduce the sales price of the product, with some hope that price is a significant driver in that market.

This is not meant as an academic exercise. Tomorrow we will consider how this thought experiment applies to Pivotal’s announcements last week… and to the future of Pivotal’s database assets (here).

Hadoop Squeezes Greenplum

For several years now I have been suggesting that Hadoop will squeeze the big data RDBMSs: Teradata, Exadata, Greenplum, and Netezza… squeezing them first out of the big data end of the market and then impinging on the high end of the EDW space. Further, I have suggested that there may be a significant and immediate TCO reduction from using Hadoop with your EDW RDBMS, which squeezes these products’ market faster and further.

Originally I suggested that Greenplum and Netezza would feel the squeeze first since they were embracing Hadoop directly and at the expense of their RDBMS offerings. Greenplum took this further by trying to compete on price… cutting the price of the GPDB and then introducing HAWQ, basically GPDB on HDFS, at a Hadoop DBMS price point. These moves, coupled with neglect of the EDW market where Greenplum made its name, have apparently allowed Hadoop to squeeze Greenplum out of the commercial market.

My network has been humming with rumors from reliable sources for 4+ weeks now… and I am now getting confirmation from both inside and outside Pivotal that the Greenplum software will move to open source in short order. The details are being worked out… and while there may still be a change of heart… it seems to be a done deal. The business plan that Greenplum embarked on prior to the EMC acquisition in 2010 has not been a commercial success.

No one is sorrier to see this than me. Greenplum had a real shot at success. It was a very solid piece of work that led the space with strong architectural extensions: a shared-nothing dataflow architecture, hybrid row/columnar capabilities, and a push into big data applications. The ORCA optimizer had the potential to change the game again.

Greenplum was nearly profitable in 2009, running hard at Teradata and Exadata and Netezza in the EDW space. The EDW market is tough… so, to be fair, pursuing that market might have led to the same result… but a small-market analytics play was followed by an open-source Hadoop play that could only end with Greenplum squeezed. There was never really a business plan with a win at the end.

Hopefully, by open sourcing Greenplum, some of the sound software will make it into PostgreSQL… but dropping Greenplum into the open source space with few developers and no community puts it in the same space where Informix, Red Brick, and others now sit. I know that I suggested open sourcing Greenplum over 18 months ago (see the wacky idea here)… but the idea then, as now, amounts to capitulation. I just declared what seemed to me to be inevitable a little sooner than Pivotal did.

Teradata has now further embraced Hadoop… and they run the risk of repeating the Greenplum downturn. They have a much stronger market platform to work from… but in the long run this may also be a deadly embrace.

So here is another wacky idea. The only successful business model around open source software to date (which is not to say that there is not some other model to be discovered) generates revenue from support and services and just a little software around the edges. Teradata has a support team and a services business that knows big data and is embedded in the enterprise… Cloudera, Hortonworks, and MapR are not close here. Were Teradata to go after the Hadoop market with their own distribution (not much of a barrier to entry here… just download the Apache stuff and build a team of committers… they might even be able to pick up the Pivotal team)… they would start from a spot way ahead of the start-ups in several respects… in several hard respects. Further, they have the Aster IP, which could qualify as software around the edges. As a Hadoop player Teradata could more easily manage how Hadoop squeezes their business, mitigate risk, and emerge a big winner in the big data space.

Related Database Fog Blog Posts:

Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?

Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to move data efficiently from the HDFS side to GPDB. In addition, Greenplum uses its Scatter-Gather method to load data into the GPDB effectively.

There is no ability to push down predicates. When a query executes, all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient, and there is not even the crude push-down capability that Teradata provides.

Finally, there is no ability to push down joins or aggregation.
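To make the mechanism concrete, here is a rough sketch of what this looks like from the client side (a sketch only: the cluster, table, and column names are hypothetical, the gphdfs external-table syntax shown is approximate and varied across Greenplum releases, and psycopg2 is used simply because GPDB speaks the PostgreSQL wire protocol). The point is that the WHERE clause is evaluated only after rows reach the database segments… nothing is filtered on the HDFS side:

```python
# Illustrative sketch: hypothetical names, approximate gphdfs syntax.
import psycopg2

DDL = """
CREATE EXTERNAL TABLE ext_clicks (user_id bigint, url text, ts timestamp)
LOCATION ('gphdfs://namenode:8020/data/clicks/*')
FORMAT 'TEXT' (DELIMITER '|');
"""

# The predicate below is applied by the GPDB segments, not by HDFS:
# every row of the underlying files still flows through the parallel
# pipes before 'ts' is examined... there is no predicate push-down.
QUERY = "SELECT count(*) FROM ext_clicks WHERE ts >= '2013-01-01';"

with psycopg2.connect(host="gpdb-master", dbname="analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)    # one-time definition of the HDFS-backed table
        cur.execute(QUERY)  # full scan of the files behind ext_clicks
        print(cur.fetchone()[0])
```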

Greenplum’s offering is not very advanced. To perform analytics with Greenplum, data must move between the two storage layers with no intelligence to mitigate the cost.

On to the last post in the series, Part 8, on SQL Server and Polybase.

Pivotal GPDB and the 2013 Forrester Wave EDW Report

Forrester regularly provides fodder for bloggers when they report on the EDW space (see Curt Monash’s review of their last report here). They have a 2013 report out now that is quite mysterious (see here).

They report that Pivotal is up there with the leading EDW vendors and positioned to move further up.

Here is the mystery. If you go to the Pivotal site and search on “data warehouse” you get ten hits:

  • Eight talk about analytic data warehouses, not enterprise data warehouses;
  • One talks about using Hive as a data warehouse; and
  • One talks about data and sandboxing.

There are no hits on the term “enterprise data warehouse” and one hit on the term “EDW” which refers to why you should move data off of the EDW to an analytic platform.

As I’ve pointed out… Pivotal does not market into the EDW space. They are not developing product for that space. EDW is not part of their product strategy.

The fact that their product is a capable platform for an EDW is worth noting… and readers of this blog should consider GPDB, aka Greenplum, for EDW projects. But you should be fully aware of the risk that Pivotal is not really backing this use case.

For an analyst to suggest that Pivotal has an industry-leading strategy in a space that they are not pursuing at all is very odd.

Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

(Figure: The First Squeeze)

I have suggested that the big EDW parallel databases (Teradata, Exadata, Greenplum, and Netezza in particular) will be squeezed over time. Colder data will move from those products to Hadoop and hotter data will move in-memory. You can see posts on this here, here, and here.

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

The Squeeze picture assumes that Hadoop consumes more and more “big data” over time as the giant investment in that open source eco-system matures the software and improves both the performance and the feature base. I think that this is a very safe assumption. But the flip side of this assumption is that we recognize that currently the Hadoop eco-system is not particularly mature and that the performance is not top-notch. It is this flip side that provides the opportunity targeted by Pivotal and Teradata.

Here is the situation… Hadoop, even in its newbie state, is lowering the price point for biggish data. Large EDW implementations, let’s say over 100TB, that had no choice but to pick a large EDW database product 4 years ago are considering and selecting Hadoop more often at a price point 10X-20X less than the lowest street price offered by commercial DBMS vendors. But these choices are painful due to the relatively immature state of the Hadoop eco-system. It is this spot that is being targeted by Aster and GPDB… the “big data” spot where Aster and GPDB can charge a price greater than the cost of Hadoop but less than the cost of the EDW DBMS products… while providing performance and maturity worth the modest premium.

This spot, under the EDW and above Hadoop, is a legitimate niche where revenue can be generated. But it is the niche that will be the first to be consumed by Hadoop as the various Hadoop RDBMS features mature. It is a niche that will not be commercially interesting in two years and will be gone in four years. Above is the Squeeze picture updated to position Aster, HAWQ, and GPDB.

What would I do? Pivotal has some options. First, as I have stated before, GPDB is a solid EDW DBMS, and even after their retreat from the EDW space the majority of its market is still there. They could move back up the food chain to the EDW space where they started and have an impact. This impact could be greater still if they could find a way to build a truly effective cloud-based EDW DBMS out of the GPDB. But this is not their current strategy, and they are losing steam as an EDW both technically and in the market. The window to move back up is closing. Their current strategy, which is “all-in” on Hadoop, will steal business from GPDB for low-margin business around HAWQ, and steal business from HAWQ for an even lower-margin business around Pivotal Hadoop. I wonder how long Pivotal can fund this strategy at a loss.

I’m not sure what I would do if I were Teradata. The investment in Aster Data is not likely to pay off before Hadoop consumes the space. Insofar as it is a sunk cost now… and they can leverage the niche described above… their positioning can earn them some revenue and stave off the full effect of the Squeeze for a short time. But Aster was never really a successful EDW play and there is no room for it to move up the food chain at Teradata.

What does this mean? Readers should take note and consider the risk that Hadoop wins in the near term… They might avoid a costly move to Aster or GPDB or HAWQ with a short lifespan. Maybe it is time to bite the bullet now and start introducing Hadoop into your infrastructure?

One final note… it is not my expectation that either a Hadoop DBMS or any NoSQL DBMS product will consume the commercial RDBMS space anytime soon. There are reasons for this… stay tuned and I’ll post on this topic in the new year.

With this post the Database Fog Blog will receive its 100,000th view. I am so grateful for your attention and consideration. And with this last post of my calendar year I wanted to say thanks… to send my regards to all, whether you will be celebrating a holiday season or not… and to wish every reader, regardless of what calendar you follow, all the best in the next year…

– Rob

HANA, BLU, Hekaton, and Oracle 12c vs. Teradata and Greenplum – November 2013

I would like to point out a very important section in the paper on Hekaton on the Microsoft Research site here. I will quote the section in total:

2. DESIGN CONSIDERATIONS 

An analysis done early on in the project drove home the fact that a 10-100X throughput improvement cannot be achieved by optimizing existing SQL Server mechanisms. Throughput can be increased in three ways: improving scalability, improving CPI (cycles per instruction), and reducing the number of instructions executed per request. The analysis showed that, even under highly optimistic assumptions, improving scalability and CPI can produce only a 3-4X improvement. The detailed analysis is included as an appendix. 

The only real hope is to reduce the number of instructions executed but the reduction needs to be dramatic. To go 10X faster, the engine must execute 90% fewer instructions and yet still get the work done. To go 100X faster, it must execute 99% fewer instructions. This level of improvement is not feasible by optimizing existing storage and execution mechanisms. Reaching the 10-100X goal requires a much more efficient way to store and process data. 

This is important because it confirms the difference between a Level 3 and a Level 2 columnar implementation as described here. It is just not possible for a Level 2 implementation with a row-based join engine to achieve the performance of a Level 3 implementation. This will allow the Level 3 implementations (HANA, BLU, Hekaton, and Oracle 12c) to distance themselves from the Level 2 products (Teradata and Greenplum) by more than 10X… and this is a very significant advantage.
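The arithmetic behind the quoted 90% and 99% figures is straightforward: if CPI and clock rate are held constant, throughput is inversely proportional to the number of instructions executed per request, so a target speedup S requires cutting the instruction count by 1 - 1/S. A minimal sketch:

```python
# If CPI and clock rate are fixed, throughput ~ 1 / (instructions per request),
# so a speedup of S requires executing a fraction 1 - 1/S fewer instructions.
def required_instruction_reduction(speedup: float) -> float:
    return 1.0 - 1.0 / speedup

for s in (10, 100):
    print(f"{s}X faster -> execute {required_instruction_reduction(s):.0%} fewer instructions")
```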

HAWQ Performance Marketing

My contacts from Strata read my post here and provided me with the following information:

  • The performance numbers quoted for Greenplum HAWQ versus HIVE and Impala used Greenplum tables implemented over HDFS. In other words, this data is unreadable from outside of the Greenplum database… unreadable by any other program in the Hadoop eco-system… a proprietary format. If the tests were re-run using the same open data structures used by HIVE and Impala you would find the performance of HAWQ to be closer to, or worse than, those Hadoop components.
  • The HAWQ performance numbers quoted represent a 2X-3X performance degradation over the same benchmark run on the native Greenplum RDBMS.

Again… this is from a credible source… but please consider this a rumor… and view this report, and the associated Greenplum marketing… with an appropriate measure of engineering skepticism.

Greenplum is a fantastic product… if I assume the report to be true then I do not understand why they are doing this… what use case is solved by a 300% performance degradation while accessing proprietary data in HDFS? Remember, you could put Greenplum in the same cluster as Hadoop (UAP) and query everything HAWQ could query without the performance degradation. I just do not see the point. Could someone from GP comment and help my readers and me here?

HAWQ and Pivotal HD – Is it Hadoop?

First, from a technical standpoint I like the Greenplum-on-HDFS HAWQ offering. It looks like the GP Team replaced XFS with HDFS and added some native support for several HDFS file types. I will say more on this soon.

But I would like to weigh in on the question raised by HortonWorks here… is HAWQ Hadoop? And I have a question towards the end…

Let me propose an analogy: Hadoop is an eco-system of open source components much like LINUX is an eco-system of open source components. If you think the analogy apt, then HAWQ on HDFS is not Hadoop any more than Microsoft Internet Explorer on LINUX is LINUX. Hive is open source and part of the Hadoop eco-system… as is Impala. Firefox is open source and part of the LINUX eco-system. HAWQ is not Hadoop.

The HortonWorks link points out that Greenplum is not engaged in the Hadoop eco-system as a contributor. They also quote Greenplum as saying that they have 300 developers working on Hadoop. Well… if HAWQ is part of Hadoop and HAWQ is the Greenplum database on HDFS then they have 300 developers on Hadoop. But if, as I suggested, HAWQ is not Hadoop then the number of Greenplum developers on Hadoop might be less. I bumped into a long-time Greenplum employee at Strata who told me that HAWQ was a skunkworks project with 4-5 developers max. This comes from a credible source… but it is still rumor-quality… so take it with a grain of salt.

The bottom line is that Greenplum has marketed very aggressively. They fuzz the definition of Hadoop to claim their commercial database offering running on Hadoop is therefore “Hadoop”. They fuzz the definition of developers working on Hadoop based on this first fuzz.

But does it matter? Greenplum will read and process data stored in HDFS faster than any other SQL-based engine. That is worth something.

But what is it worth? I’m fairly certain that the Greenplum databases will run faster off of Hadoop on XFS than in Hadoop… maybe significantly faster. So the reason for Greenplum on HDFS is faster SQL access to data in HDFS files.

This leads me to my question. I wonder… were the performance numbers quoted, showing a significant performance advantage over both HIVE and Impala, based on queries executed against the Greenplum proprietary table formats or against the same native HDFS file types read by HIVE? If they ran against Greenplum tables then I wonder what the real apples-to-apples comparison would show. Note that I am not being cynical here… I do not know how the tests were set up… only that Greenplum was fast. But if, as I said, “the reason for Greenplum on HDFS is faster SQL access to data in HDFS files” and the data was in Greenplum file structures accessible only by Greenplum, then there is little reason left.

I also wonder if it matters because HIVE and Impala will improve their performance significantly over the next 12-24 months. The sheer amount of human R&D being expended here will allow these SQL engines to catch, or nearly catch, HAWQ in performance. If there is any gap left, the price and the community of open source offerings will defeat HAWQ in the market.

As I have suggested here… there is no apparent commercial opportunity competing against Hadoop at this point. I suggested here that Hadoop would eat Greenplum if they stuck to the analytics space and offered both products… effectively competing with themselves. This new strategy is not likely to work in the medium or long run. Greenplum is, indeed, all-in on Hadoop… but without a winning hand.

March 10: See here for the answers to my questions… – Rob

March 12: See here for a rethink on this subject… – Rob

Will Hadoop Eat Greenplum and Netezza?

If I were the Register I would have titled this: Raging Stuffed Elephant To Devour Two Warehouse Vendors… I love the Register… if you do not read it, have a look.

This post is about the market implications of architecture…

Let us assume that Hadoop matures and finds a permanent place in the market. This is not certain with some folks expressing concern (here) and others boundless enthusiasm (here). So let’s assume… and consider where it might fit.

(Figure 1: The Squeeze)

One place is in the data warehouse market… This view says Hadoop replaces the DBMS for data warehouses. But the very mature BI/DW market requires a high level of operational integrity and Hadoop is not there yet… it is advancing rapidly as an enterprise platform and I believe it will get there… but it will be 3-4 years. This is the thinking I provided here that leads me to draw the picture in Figure 1.

It is not that I believe that Hadoop will consume the entire data warehouse market, but I believe that very large EDWs… those over 1PB, and maybe those over 500TB… will be compelled by the economics of “free” to move big warehouses to Hadoop. So Hadoop will likely move down into the EDW space from the top.

Another option suggests that Big Data will be a platform unto itself. In this view Hadoop will sit beside the existing BI/DW platform and feed that platform the results of queries that derive structure from unstructured data… and/or that aggregate Big Data into consumable chunks. This is where Hadoop sits today.

In data warehouse terms this positions Hadoop as a very large independent analytic data mart. Figure 2 depicts this. Note that an analytics data mart, and a Hadoop cluster, require far less in the way of operational infrastructure… they share very similar technical requirements.

(Figure 2: Hadoop Along Side)

This leads me to the point of this post… if Hadoop becomes a very large analytic data mart then where will Greenplum and Netezza fit in 2-3 years? Both vendors are positioning themselves in the analytic space… Greenplum almost exclusively so. Both vendors offer integrated Hadoop products… Greenplum offers the Greenplum database and Hadoop in the same hardware cluster (see here for their latest announcement)… Netezza provides a Hadoop connector (here). But if you believe in Hadoop… as both vendors ardently do… where do their databases fit in the analytics space once Hadoop matures and fully supports SQL? In the next 3-4 years what will these RDBMSs offer in the big data analytics space that will be compelling enough to make the configuration in Figure 3 attractive?

(Figure 3: Unified Hadoop)

I know that today Hadoop cannot do all that either Netezza or Greenplum can do. I understand that Netezza has two positions in the market… as an analytic appliance and as a data mart appliance… so it may survive in the mart space. But the overlap of technical requirements between Hadoop and an analytic data mart… combined with the enormous human investment in Hadoop R&D, both in the core and in the eco-system… makes me wonder where “Big Data” analytic relational databases will fit.

Note that this is not a criticism of the Greenplum RDBMS. Greenplum is a very fine product, one of the best EDW platforms around. I’ll have more to say about it when I provide my 2 Cents… But if Figure 2 describes the end state for analytics in 2-3 years then where is the place for the Figure 3 architecture? If Figure 3 is the end state then I do not see where the line will be drawn between the analytic workload that requires Greenplum and the workload that will run on Hadoop. I can barely see it now… and I cannot see it at all in the near future.

Both EMC Greenplum and IBM seem to strongly believe in Hadoop… they must see the overlap in functionality and feel the market momentum of Hadoop. They must see, better than most, that Hadoop wins this battle.