Hadoop and Company Financial Performance

I have posted several times about the impact of the Hadoop eco-system on several companies (here, here, here, for example). The topic came up in a tweet thread a few weeks back… which prompts this quick note.

Four years ago the street price for a scalable, parallel, enterprise data warehouse platform was $US25K-$US35K per terabyte. This price point provided vendors like Teradata, Netezza, and Greenplum with reasonable, even lucrative, margins. Hadoop entered the scene and captured the Big Data space from these vendors by offering 20X slower performance at 1/20th the price: $US1K-$US5K per terabyte. The capture was immediate and real… customers who were selecting these products for specialized, very large, 1PB and up deployments switched to Hadoop as fast as possible.

Now, two trends continue to eat at the market share of parallel database products.

First, relational implementations on HDFS continue to improve in performance; they are now only 4X-10X slower than the best parallel databases at 1/10th-1/20th the street price. This puts pressure on the prices and margins of the parallel relational vendors, and that pressure is being felt.

In order to keep their installed base of customers in the fold, these vendors have built ever more sophisticated integration between their relational products and Hadoop. This integration, however, allows customers to significantly reduce expense by moving large parts of their EDW to an Annex (see here)… and this trend has started. We might argue whether an EDW Annex should store the coldest 80% or the coldest 20% of the data in your EDW… but there is little doubt that some older data could still satisfy its SLAs at 4X-10X slower performance.

In addition, these trends converge. If you can only put 20% of your old, cold data in an Annex that is 10X slower than your EDW platform then you might put 50% of your data into an Annex that is only 4X slower. As the Hadoop relational implementations continue to add columnar, in-memory, and other accelerators… ever more data could move to a Hadoop-based EDW Annex.
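To make the Annex math concrete, here is a back-of-the-envelope sketch in Python using the street prices quoted above. The 500TB warehouse size and the mid-range per-terabyte prices are my own illustrative assumptions… adjust them for your environment.

```python
# Rough EDW Annex cost model using the street prices quoted above.
# The 500TB warehouse size and the mid-range per-TB prices are illustrative
# assumptions, not vendor quotes.

EDW_PRICE_PER_TB = 30_000    # mid-range of the $25K-$35K/TB figure
ANNEX_PRICE_PER_TB = 3_000   # mid-range of the $1K-$5K/TB figure

def annex_split_cost(total_tb, fraction_moved):
    """Platform cost before and after moving a fraction of cold data to an Annex."""
    all_edw = total_tb * EDW_PRICE_PER_TB
    split = ((1 - fraction_moved) * total_tb * EDW_PRICE_PER_TB
             + fraction_moved * total_tb * ANNEX_PRICE_PER_TB)
    return all_edw, split

for fraction in (0.2, 0.5, 0.8):
    before, after = annex_split_cost(total_tb=500, fraction_moved=fraction)
    print(f"move {fraction:.0%} to the Annex: ${before:,.0f} -> ${after:,.0f}")
# move 20% to the Annex: $15,000,000 -> $12,300,000
# move 50% to the Annex: $15,000,000 -> $8,250,000
# move 80% to the Annex: $15,000,000 -> $4,200,000
```

The more data you can move while still meeting SLAs, the bigger the squeeze on the parallel RDBMS vendors.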

I’ll leave it to the gamblers who read this to guess the timing and magnitude of the impact of Hadoop on the relational database markets and on company financial performance. I cannot see how it could fail to have an impact.

Well, actually I can see one way out. If the demand for hot data requiring high performance grows faster than Hadoop's performance advances, then the parallel RDBMS folks will hold their own or advance. Maybe the Internet of Things helps here… but I doubt it.

Thinking About the Pivotal Announcements…

Yesterday I provided a model for how business sees open source as a means to be profitable (here). This is the game Pivotal seems to be playing with their release of Hadoop, Gemfire, HAWQ, and Greenplum into open source. I do not know their real numbers… so they may need more or fewer additional customers than the mythical company to get back to break-even. But it is unlikely that any company can turn the corner from a license-based revenue stream to a recurring revenue stream in a year… so Pivotal must be looking at a loss. And when losses come it is usual to cut costs… to cut R&D.

There has already been a brain-drain out of the database ranks at Pivotal as they went “all in” on Hadoop. They likely hope for an open source community to pick up the slack… but I can see no track record of success in building a community to engineer a commercial product turned open source. This is especially problematic for Gemfire, an old technology that has been in the commercial space for a very long time. HAWQ has to compete for database resources with the other Hadoop RDBMS technologies… that will be difficult. Greenplum has a chance as it is based on PostgreSQL… but it has diverged a long way from the current PostgreSQL code base. There is danger here.

The bottom line… Greenplum, HAWQ, and Gemfire have become risky propositions for both the current customer base and new customers. I’ll leave it to you to evaluate the risk as this story unfolds. Still, with the risk comes reward… the cost of acquiring Greenplum will drop dramatically, and today Greenplum is a competitive product. In addition, if Greenplum gains some traction, it will put price pressure on the other database products. Note that HAWQ was already marked down to open source price levels… and part of Pivotal’s problem was that HAWQ was eating into the Greenplum market. With these products priced at similar levels, choosing between them becomes a little odd… but the advantage goes to customers looking at Greenplum.

One great outcome comes for Pivotal Hadoop customers… the fact that Hortonworks will more-or-less subsume Pivotal Hadoop leaves those folks in a better place than before.

If you consider the thought experiment, you have to ask yourself why a company that was breaking even would take this risky route. It could be that they took the route because they were not breaking even and this was a possible path to get even. Also consider… open sourcing code is the modern, graceful way to retire an unprofitable product line.

This is sound thinking by Pivotal… at its creation, EMC handed Pivotal several troubled, unprofitable assets, and these announcements give Pivotal a path forward. If the database products cannot carry their weight then they will go into maintenance mode and slowly fade. Too bad… as you know I consider Greenplum a solid product whose potential was wasted. But Pivotal has a very nice product in Cloud Foundry… and they clearly see this as their route to profitability and to an IPO… a route that no longer includes a significant contribution from database products.

Open Source is Not a Market…

This post is more about the technology business than about technology… but it may be relevant as you try to sort out winners and losers… and this sort of sorting is important if you consider new companies who may, or may not, succeed in the long run.

To make my point let us do a little thought experiment. Imagine a company doing $100M in revenue with a commercial, not open source, database product. They win the $100M in revenue by competing with Oracle, IBM, Microsoft, Teradata, et cetera… and maybe competing a little here and there with some open source products.

Let’s assume that they make 50% of their revenue from services and support, and that their average sale is $2M… so they close 25 deals a year competing in this market. Finally, let’s assume that they break even each year and spend 20% of their revenues on R&D. The industry average for support services is 20%… so with each $2M sale they add $400K in recurring revenue.

They are considering making their product open source. Let’s assume that they make the base product free… and provide some value-added offering that costs $200K for the average buyer. Further, they offer a support package for the same $400K/year that customers currently pay. How does the math work out?

Let’s baseline against the 25 deals/year…

If they make 25 sales and every buyer buys both the support package and the value-added offer the average sale drops from $2M to $200K, sales revenue drops from $50M to $5M, the annual revenue drops from $100M to $55M… and the company loses $45M. So… starting off they need to make 225 more sales just to break even. But now it gets complicated… if they sell 5 extra deals then in the next year they earn $2M extra in support fees… so if they sell 113 extra deals in year one then in year two they have made up the entire $45M difference and they are back to break-even going forward. If it takes them 2 years to get the extra recurring revenue then they lose money in year two… but are back to break-even in year three.
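If you want to check the arithmetic, here is a small Python sketch of the same model. Every number comes from the assumptions above, and the model is deliberately naive… no cost side, no churn, no ramp-up.

```python
# The mythical company's move to open source, using the assumptions above.
# All figures come straight from the thought experiment in the text.

BASE_DEALS = 25                  # deals per year before open sourcing
NEW_SALE = 200_000               # value-added offering after open sourcing
SUPPORT = 400_000                # recurring support per customer per year
SERVICES_REVENUE = 50_000_000    # the 50% of $100M that is services and support

def year_one_revenue(extra_deals):
    """Revenue in the first year after open sourcing, given extra deals closed."""
    return SERVICES_REVENUE + (BASE_DEALS + extra_deals) * NEW_SALE

print(year_one_revenue(0))      # 55,000,000: the $45M hole
print(year_one_revenue(225))    # 100,000,000: 225 extra deals closes the gap in year one
print(113 * SUPPORT)            # 45,200,000: 113 extra deals in year one add enough
                                # recurring support in year two to get back to break-even
```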

From here it gets even more complicated. The mythical company above sells the baseline of 25 new copies a year with an enterprise sales force that is expensive. There is no way that the same sales force that services 25 sales/year could service 100+ extra deals. So either costs go up or the 100+ extra customers become unattainable. We might hope that the cost of sales will drop way off as the sales price moves to $200K. This is not unreasonable… but certainly not guaranteed. Further, if you are one of the existing sales staff then you have to sell 10X as much just to make the same commission. Finally, these numbers assume that every customer buys the value-add and gets enterprise-level support. Reality will be something less than this.

We might ask: is it even possible to sell 100+ more with the same product in the same market? Let us be clear that the market the database product plays in has not changed. Open Source is not a market. All we have done is reduce the sales price for the product with some hope that price is a significant driver in the market.

This is not meant as an academic exercise. Tomorrow we will consider how this thought experiment applies to Pivotal’s announcements last week… and to the future of Pivotal’s database assets (here).

Some HANA and Intel Videos

FYI, here are two videos of me speaking at the 2013 Intel Developer Forum.

The first has some technical detail:

The second is more of a PR pitch about Intel Hadoop:

I’m working with Intel on a new video with a pretty interesting storyline (at least I hope that you find it interesting)… so stay tuned.

Rob

Hadoop Squeezes Greenplum

For several years now I have been suggesting that Hadoop will squeeze the big data RDBMSs: Teradata, Exadata, Greenplum, and Netezza… squeezing them first out of the big data end of the market and then impinging on the high end of the EDW space. Further, I have suggested that there may be a significant and immediate TCO reduction from using Hadoop alongside your EDW RDBMS, which squeezes these products’ market faster and further.

Originally I suggested that Greenplum and Netezza would feel the squeeze first since they were embracing Hadoop directly and at the expense of their RDBMS offerings. Greenplum took this further by trying to compete on price… cutting the price of the GPDB and then introducing HAWQ, basically GPDB on HDFS, at a Hadoop DBMS price point. These moves, coupled with a neglect of the EDW market where Greenplum made its name, apparently have allowed Hadoop to squeeze Greenplum out of the commercial market.

My network has been humming with rumors from reliable sources for 4+ weeks now… and I am now getting confirmation from both inside and outside Pivotal that the Greenplum software will move to open source in short order. The details are being worked out… and while there may still be a change of heart… it seems to be a done deal. The business plan that Greenplum embarked on prior to the EMC acquisition in 2010 has not been a commercial success.

No one is sorrier to see this than me. Greenplum had a real shot at success. It was a very solid piece of work that led the space with strong architectural extensions like shared-nothing data flow, hybrid row/columnar capabilities, and a reach into big data applications. The ORCA optimizer had the potential to change the game again.

Greenplum was nearly profitable in 2009 running hard at Teradata and Exadata and Netezza in the EDW space. The EDW market is tough… so we have to be fair and point out that pursuing this market may have led to the same result… but a small-market analytics play was followed by an open-source Hadoop play that could only end in squeezing Greenplum. There was never really a business plan with a win at the end.

Hopefully by open sourcing Greenplum some of the sound software will make it into PostgreSQL… but dishing Greenplum into the open source space with few developers and no community puts it into the same space where Informix, Red Brick, and others sit. I know that I suggested open sourcing Greenplum over 18 months ago (see the wacky idea here)… but the idea then, as now, amounts to capitulation. I just declared what seemed to me to be inevitable a little sooner than Pivotal.

Teradata has now further embraced Hadoop… and they run the risk of repeating the Greenplum downturn. They have a much stronger market platform to work from… but in the long run this may also be a deadly embrace.

So here is another wacky idea. The only successful business model around open source software to date (which is not to say that there is not some other model to be discovered) generates revenue from support and services and just a little software around the edges. Teradata has a support team and a services business that knows big data and is embedded in the enterprise… Cloudera, Hortonworks, and MapR are not close here. Were Teradata to go after the Hadoop market with their own distribution (not much of a barrier to entry here… just download the Apache stuff and build a team of committers… they might even be able to pick up the Pivotal team)… they would start from a spot way ahead of the start-ups in several respects… in several hard respects. Further, they have Aster IP, which could qualify as software around the edges. As a Hadoop player Teradata could more easily manage how Hadoop squeezes their business, mitigate risk, and emerge a big winner in the big data space.


A Sidebar on the IoT: Using new Things smartly

This is a sidebar to some thinking on an architecture for the Internet of Things here… – Rob

I was recently prompted to think about a Big Data problem that is in the US papers… the issues around processing US Veterans through the Veteran’s Administration (VA) bureaucracy. I imagine that there are really two problems… I will outline them… but the point will be to try to get you to think about how to use the IoT creatively to mitigate seemingly intractable problems.

One requires some obvious, if not easy-to-implement, information technology… there has to be some fool-proof way to ensure that VA staff cannot game the system and hide problems. This could be solved with an audit system that looks for anomalies in processing patterns in much the same way that other fraud detection software operates.

But fraud is not the problem… it was meant to mask the problem. The fact is that there are more Veterans requiring medical assistance than there are funds and doctors to provide that assistance. All the fraud detection would have accomplished is to show how many Veterans are going without care.

So… the IT folks at the VA are tasked with the impossible job of serving too many Veterans with too few doctors… But military doctors are used to this issue, and the word “triage” comes from the vocabulary of war… not the vocabulary of medicine.

This leads me towards the point… there are new things, in the IoT sense, coming out that can make triage of Veterans who are not hospitalized possible. Rumors are that the new iWatch contains several sensors that can monitor the pulse, temperature, and maybe more, of wearers. What the VA needs to do is immediately put an iWatch on every Veteran who applies for medical help. They then need to monitor the vitals of all of the veterans and schedule in those with the weakest signs. They need to notify individuals to come in immediately when vitals turn bad… Further, they need to track these vitals against medical records and outcomes to make the triage ever more efficient over time.
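As a thought experiment only, here is a minimal Python sketch of what such a triage loop might look like. The vital-sign fields, thresholds, and scoring are invented for illustration and are certainly not medically sound.

```python
# Hypothetical triage scoring over streamed vitals from wearable sensors.
# The field names, thresholds, and weights are invented for illustration only
# and are certainly not medically sound.

from dataclasses import dataclass

@dataclass
class Vitals:
    veteran_id: str
    pulse_bpm: float
    temp_f: float

def triage_score(v: Vitals) -> float:
    """Higher score means schedule this Veteran sooner."""
    score = 0.0
    if v.pulse_bpm > 100 or v.pulse_bpm < 50:     # resting pulse out of range
        score += abs(v.pulse_bpm - 75) / 25
    if v.temp_f > 100.4:                          # fever
        score += (v.temp_f - 98.6) / 2
    return score

readings = [
    Vitals("vet-001", pulse_bpm=72, temp_f=98.4),
    Vitals("vet-002", pulse_bpm=118, temp_f=101.3),
    Vitals("vet-003", pulse_bpm=48, temp_f=98.9),
]

# Schedule those with the weakest signs first; over time, outcomes would be
# fed back to tune the scoring.
for v in sorted(readings, key=triage_score, reverse=True):
    print(v.veteran_id, round(triage_score(v), 2))
```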

By using internet Things to triage, the VA could use their limited medical resources much more efficiently… and mitigate some or all of the problem.

The IoT affords us all opportunities to change the game for our employers. There will be opportunities to disrupt some existing markets. But taking full advantage will require us to be creative and smart. I, for one, am anxious to start… Now back to considering an architecture for things.


An Architecture for the IoT – Part 1

There are so many things in the Internet of Things (IoT) that might record data into your data fabric that a new approach may be required. Let’s think about this… define some terms, see how those terms fit into current data fabric thinking, consider how they fit into a more modern logical data warehouse architecture, and ask whether the IoT might push us to a different approach.

I’m not going to go overboard on terms here… But we do need to distinguish between a sensor and a processor.

To my way of thinking a sensor is a thing. It creates, but does not necessarily process, data. A sensor has some means to communicate with a processor… but if there is no significant processing on the sensor other than communications then we will suggest that there is no “processor” in a meaningful sense. Let me give you four examples:

  • The first is courtesy of Ray Carnes, a chief architect at Boeing. Imagine a brake-pad in your car with 100,000 dust-sized RFID sensors randomly scattered as part of the pad. These sensors do nothing but signal on an interval that they are present. This allows a processor elsewhere to record the signals and determine how much of the brake pad has worn. If only 80,000 sensors report we can assume that 20% of the pad has worn away.
  • A Nest thermostat senses movement and temperature. It uses a network connection to send the results of this sensing to the Nest mother-ship and performs little or no processing on site.
  • Sensors in my Audi detect rotation of the wheels. There is a network that sends the results to a small embedded anti-lock braking processor that monitors all four wheels as well as the pressure on the brake-pedal and sends signals to all five components to allow the car to brake evenly.
  • There is a sensor in the screen on the ATM I used yesterday that detects that I want to request service. This user interface communicates with a powerful general processor which then communicates with the Bank mother-ship to create and process banking transactions.

This last bullet is important… any device that takes user input is a sensor with an embedded processor. It is a “thing” just like the Nest thing. Today we tend to blur the line between sensor and processor as every thing has a powerful processor onboard. The IoT will change this assumption.

A processor then, is a computer that performs some analysis on the data generated by one or more sensors. A processor may also store data… a sensor will not.

Now let’s think about how we might combine sensors and processors in an architecture. To start, let’s consider the context of the data the processor can use for analysis:

  • If the processor has only the last data sensed we would say that the context is immediate and local to one sensor. The processor can see streamed data but can only operate on the last event. We would say that this sensor-processor configuration can provide a simple reflexive response. When you press the lock button in your car a sensor detects this event and signals to all four doors and the boot to lock it up.
  • Another configuration might allow the local processor to store more context from a single sensor over a longer period of time… so the context is historical and local. In the case of the anti-lock brakes… the processor receives signals from a group of sensors and stores a very short historical context. This grouped historical context is very powerful…
  • Another configuration might store the group context and then forward the event details to a bigger server that stores and analyzes a universal context of all things to look for patterns. Further, there could be a hierarchy of groups leading to a universal context.
  • Finally, a server with some group context could summarize the details for that group and pass on only a summary over time up to another group server or to a universal server.

I suspect that you can see where I’m going. There is a trade-off in this picture between the advantages of pushing analytic processing close to the sensor and the associated requirement for more analytic processors, the advantages of intermediate analysis requiring more data movement but fewer analytic processors, and the advantage of a central analytic mother ship where all data is stored and analyzed. In the next part of this thread I’ll try to tease apart the trade-offs.
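To make these configurations a little more concrete, here is a minimal Python sketch of the reflexive, local-historical, and group contexts. The class names and the summarize-and-forward policy are my own inventions for illustration… not a reference design.

```python
# Sketch of the sensor/processor contexts described above. Class names and the
# summarize-and-forward policy are illustrative inventions, not a product design.

from collections import deque
from statistics import mean

class ReflexiveProcessor:
    """Immediate, local context: acts only on the last event from one sensor."""
    def on_event(self, value):
        return "act" if value > 0.9 else "ignore"

class LocalHistoricalProcessor:
    """Historical, local context: keeps a short window of events from one sensor."""
    def __init__(self, window=10):
        self.history = deque(maxlen=window)
    def on_event(self, value):
        self.history.append(value)
        return mean(self.history)

class GroupProcessor:
    """Group context: combines short histories from several sensors and forwards
    only a summary up the hierarchy toward a universal server."""
    def __init__(self, sensor_ids):
        self.locals = {s: LocalHistoricalProcessor() for s in sensor_ids}
    def summarize(self, events):
        # events: {sensor_id: latest_value}; the summary, not the raw events,
        # is what would move up to the next tier.
        return {s: self.locals[s].on_event(v) for s, v in events.items()}

group = GroupProcessor(["wheel_fl", "wheel_fr", "wheel_rl", "wheel_rr"])
print(group.summarize({"wheel_fl": 0.4, "wheel_fr": 0.5, "wheel_rl": 0.4, "wheel_rr": 0.9}))
```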

Part 8 – How Hadooped is SQL Server PDW with Polybase?

Now for SQL Server… continuing the thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7), I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Before we start I will suggest a fourth criterion that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS? Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute the data across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Polybase manages this as part of our first criterion.

Also note… since I started this series Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP.

Now on to Microsoft…

First, Polybase has effective parallel pipes to move data from HDFS to the parallel SQL Server instances in PDW. This matches the best capability of other products like Teradata and Greenplum in this category. But where Teradata and Greenplum move data and then redistribute it, pushing the data over a network twice, Polybase has pushed the PDW hash function down to the HDFS nodes so that data is distributed as it is sent. This very nice feature skips one full move of the data.

Our second criterion considers how smart the connector is in pushing down filters/predicates. Polybase uses a cost-based approach to determine whether it is less expensive to push predicates down or to move all of the data up to the PDW layer. This is a best-in-class capability.

For the third criterion we ask: does the architecture push down advanced functions like joins and aggregates… and does it minimize the data pulled up by using semi-joins? Polybase again provides strong capabilities here, pushing down joins and aggregates. Polybase does not use semi-joins, so there is room to improve… but Microsoft clearly has this capability in their roadmap.
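To see why semi-joins matter for this criterion, here is a toy Python model, not Polybase syntax, that compares pulling the whole HDFS-side table up to the RDBMS against pulling only the rows whose keys exist on the RDBMS side. The table sizes are made up.

```python
# Toy model of the semi-join criterion: how many HDFS-side rows must move up
# to the RDBMS to complete a join. Table sizes and keys are made up.

rdbms_keys = set(range(100))          # 100 customer keys of interest in the RDBMS
hdfs_row_count = 1_000_000            # 1M event rows stored in HDFS
hdfs_keys = (i % 10_000 for i in range(hdfs_row_count))   # simulated HDFS key column

# Without semi-join awareness the whole HDFS table is pulled up to the RDBMS.
moved_naive = hdfs_row_count

# With a semi-join, the small key list ships down and only matching rows move up.
moved_semijoin = sum(1 for k in hdfs_keys if k in rdbms_keys)

print(moved_naive, moved_semijoin)    # 1,000,000 vs 10,000 rows moved
```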

One final note… Polybase works with PDW but not with other SQL Server products. This limitation may be relevant in many cases.

PDW + Polybase is a strong offering… matching HANA in most respects, with HANA holding a slight edge in push-down via semi-joins and SQL Server countering with the most sophisticated parallel data distribution capability.


Part 1: How Hadooped is Your RDBMS?

Sorry for the comic adjective “Hadooped”?

The next few blogs will try to evaluate the different approaches to integrating Hadoop and a standard RDBMS… so the first thing I’ll try in this post is to suggest criteria, based on some architectural choices, for making the evaluation. Further, I’ll inject a little surprise and make the point by using the criteria to say something about a product that is not an integration of an RDBMS and Hadoop.

For the purposes of this let me be clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that queries using the full set of SQL supported by the RDBMS must be available for processing queries that refer to data across the Hadoop-RDBMS divide.

Since we’ve assumed that all SQL functionality is supported, the architectural issue left to solve is performance, and this issue revolves around one topic: how do we minimize the cost of moving data between the two partners for a given query?

Now to get on with it…

The easiest, but not all that easy, problem involves using parallelism to move data from one system to the other… so the first criterion we will evaluate for each product considers how parallel its movement of data is.

The next criterion involves intelligence in the RDBMS to push down some execution operators to the data layer. Of course the RDBMS must scan remote data… so in this part of the evaluation we will grade each product’s ability to push processing down to apply predicates and project the minimal amount of data up to the RDBMS.

Finally, a most intelligent product would push more than just predicates down… it would push down joins and aggregation… and the decisions around splitting processing would be fully optimized. A most intelligent product would fully federate the HDFS data into the RDBMS.

So there you have it… I will start evaluating RDBMS-Hadoop architecture by three criteria:

  • how parallel is the data movement between the RDBMS and Hadoop;
  • is there intelligence to minimize data movement by pushing the least data and the associated query plan to one system or another… this requires parallel pipes in both directions; and
  • is there intelligence to build an optimal query plan that splits steps across both systems to completely minimize the movement of data and/or optimize the compute.

And a final word on the relative strength of each criterion:

  • If we imagine a 10-node Hadoop cluster talking to a 10-node RDBMS with 10 parallel pipes and compared it to the same setup with only 1 pipe (not parallel) then we might suggest that the parallel pipes provide a 10X performance increase.
  • If we imagine intelligence that moved 100K rows rather than 10M then we might suggest that intelligent push down might provide a 100X performance increase…
  • If we had even more intelligence and further optimized processing then another 10X-100X might be possible.

So all three criteria are not equal… intelligent query planning trumps wide pipes…
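As a rough sanity check on those relative weights, here is a tiny sketch using the order-of-magnitude figures from the bullets above… illustrative numbers, not benchmarks.

```python
# Rough relative weight of the three criteria, using the order-of-magnitude
# figures from the bullets above. Illustrative numbers, not benchmarks.

parallel_pipes     = 10   # 10 parallel pipes vs 1 pipe
predicate_pushdown = 100  # move 100K rows instead of 10M
plan_optimization  = 10   # further splitting/optimizing the plan (10X-100X)

print("parallel pipes alone:        ", parallel_pipes)
print("plus predicate push-down:    ", parallel_pipes * predicate_pushdown)
print("plus an optimized split plan:", parallel_pipes * predicate_pushdown * plan_optimization)
# 10, 1000, 10000... which is why intelligent query planning trumps wide pipes
```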

Now for the surprise… in the next blog we’ll look at how Exadata’s architecture maps to these criteria… since it is a two-tiered architecture with an RDBMS tied to a parallel file system…

You can see the rest of the series here: Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8.

Pivotal GPDB and the 2013 Forrester Wave EDW Report


Forrester regularly provides fodder for bloggers when they report on the EDW space (see Curt Monash’s review of their last report here). They have a 2013 report out now that is quite mysterious (see here).

They report that Pivotal is up there with the leading EDW vendors and positioned to move further up.

Here is the mystery. If you go to the Pivotal site and search on “data warehouse” you get ten hits:

  • Eight talk about analytic data warehouses, not enterprise data warehouses;
  • One talks about using Hive as a data warehouse; and
  • One talks about data and sandboxing.

There are no hits on the term “enterprise data warehouse” and one hit on the term “EDW” which refers to why you should move data off of the EDW to an analytic platform.

As I’ve pointed out… Pivotal does not market into the EDW space. They are not developing product for that space. EDW is not part of their product strategy.

The fact that their product is a capable platform for an EDW is worth noting… and readers of this blog should consider GPDB, aka Greenplum, for EDW projects. But you should be fully aware of the risk that Pivotal is not really backing this use case.

For an analyst to suggest that Pivotal has an industry-leading strategy in a space that they are not pursuing at all is very odd.