Hadoop and ETL

My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.

If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect in the ETL space.

I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop eco-system.

Here is the idea from a previous post.

About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes and I suggested they spin the product both into analytics and into ETL… I’ll have to look and see where they are these days…

Hadoop and Company Financial Performance

I have posted several times about the impact of the Hadoop eco-system on a several companies (here, here, here, for example). The topic cam up in a tweet thread a few weeks back… which prompts this quick note.

Fours years ago the street price for a scalable, parallel, enterprise data warehouse platform was $US25K-$US35K per terabyte. This price point provided vendors like Teradata, Netezza, and Greenplum reasonable, lucrative, margins. Hadoop entered the scene and captured the Big Data space from these vendors by offering 20X slower performance at 1/20th the price: $US1K-$US5K per terabyte. The capture was immediate and real… customers who were selecting these products for specialized, very large, 1PB and up deployments switched to Hadoop as fast as possible.

Now, two trends continue to eat at the market share of parallel database products.

First, relational implementations on HDFS continue to improve in performance and they are now 4X-10X slower than the best parallel databases at 1/10th-1/20th the street price. This puts pressure on prices and on margins for these relational vendors and this pressure is felt.

In order to keep their installed base of customers in the fold these vendors have built ever more sophisticated integration between their relational products and Hadoop. This integration, however, allows customers to significantly reduce expense by moving large parts of their EDW to an Annex (see here)… and this trend has started. We might argue whether an EDW Annex should store the coldest 80% or the coldest 20% of the data in your EDW… but there is little doubt that some older data could satisfy SLAs by delivering 4X-10X slower performance.

In addition, these trends converge. If you can only put 20% of your old, cold data in an Annex that is 10X slower than your EDW platform then you might put 50% of your data into an Annex that is only 4X slower. As the Hadoop relational implementations continue to add columnar, in-memory, and other accelerators… ever more data could move to a Hadoop-based EDW Annex.

I’ll leave it to the gamblers who read this to guess the timing and magnitude of the impact of Hadoop on the relational database markets and on company financial performance. I cannot see how it cannot have an impact.

Well, actually I can see one way out. If the requirement for hot data that requires high performance accelerates faster than the high performance advances of Hadoop then the parallel RDBMS folks will hold their own or advance. Maybe the Internet of Things helps here…. but I doubt it.

Some HANA and Intel Videos

Here are two videos of me speaking from the 2013 Intel Developer Forum FYI.

The first has some technical detail:

The second is more of a PR pitch about Intel Hadoop:

I’m working with Intel on a new video with a pretty interesting storyline (at least I hope that you find it interesting?)… so stay tuned.

Rob

Hadoop Squeezes Greenplum

For several years now I have been suggesting that Hadoop will squeeze the big data RDBMSs: Teradata, Exadata, Greenplum, and Netezza… squeezing them first out of the big data end of the market and then impinging on the high-end of the EDW space. Further I have suggested that there may be a significant and immediate TCO reduction from using Hadoop with your EDW RDBMS which squeezes these product’s market faster and further.

Originally I suggested that Greenplum and Netezza would feel the squeeze first since they were embracing Hadoop directly and at the expense of their RDBMS offerings. Greenplum took this further by trying to compete on price… cutting the price of the GPDB and then introducing HAWQ, basically GPDB on HDFS, at a Hadoop DBMS price point. These moves coupled with a neglect of the EDW market where Greenplum made its name apparently has allowed Hadoop to squeeze Greenplum out of the commercial market.

My network has been humming with rumors from reliable sources for 4+ weeks now… and I am now getting confirmation from both inside and outside Pivotal that the Greenplum software will move to open source in short order. The details are being worked out… and while there may still be a change of heart… it seems to be a done deal. The buzzness plan that Greenplum embarked on prior to the EMC acquisition in 2010 has not been a commercial success.

No one is sorrier to see this than me. Greenplum had a real shot at success. It was a very solid piece of work leading the space with strong architectural extensions like data flow shared nothingness, hybrid row/columnar capabilities, and into big data applications. The ORCA optimizer had the potential to change the game again.

Greenplum was nearly profitable in 2009 running hard at Teradata and Exadata and Netezza in the EDW space. The EDW market is tough… so we have to be fair and point out that pursuing this market may have led to the same result… but a small-market analytics play was followed by an open-source Hadoop play that could only end in squeezing Greenplum. There was never really a business plan with a win at the end.

Hopefully by open sourcing Greenplum some of the sound software will make it into PostgreSQL… but dishing Greenplum into the open source space with few developers and no community dishes it into the same space that Informix, Red Brick, and others sit. I know that I suggested open sourcing Greenplum over 18 months ago (see the wacky idea here)… but the idea then, as now, amounts to capitualization. I just declared what seemed to me to be inevitable a little sooner than Pivotal.

Teradata has now further embraced Hadoop… and they run the risk of repeating the Greenplum downturn. They have a much stronger market platform to work from… but in the long run this may also be a deadly embrace.

So here is another wacky idea. The only successful business model around open source software to date (which is not to say that there is not some other model to be discovered) generates revenue from support and services and just a little software around the edges. Teradata has a support team and a services business that knows big data and is embedded in the enterprise… Cloudera, Hortonworks, and MapR are not close here. Were Teradata to go after the Hadoop market with their own distribution (not much of a barrier to entry here.. just download the Apache stuff and build a team of committers… they might even be able to pick up the Pivotal team)… they would start from a spot way ahead of the start-ups in several respects… in several hard respects. Further they have Aster IP which could qualify as software around the edges. As a Hadoop player Teradata could more easily manage how Hadoop squeezes their business, mitigate risk, and emerge a big winner in the big data space.

Related Database Fog Blog Posts:

Using Teradata’s Appliance for Hadoop to Reduce TCO

Teradata has recently announced a very complete Teradata database-to-Hadoop integration. Is this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system.

The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles.
There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).
I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider this…
Before the possibility of a Hadoop annex to Teradata, Teradata customers had no choice but to store cold, old, data in the Teradata database. If, on occasion, you wanted to perform year by year comparisons over ten years of data then you needed to keep ten years of data in the database at a rough cost of $50K/TB (see here) … even if these queries were rarely executed and were not expected to run against a high performance service level requirement. If you wanted to perform some sophisticated predictive analysis against this data it had to be online. If fact, the Teradata mantra… one which I wholeheartedly agree with… suggests that you really should keep the details online forever as the business will almost always find a way to glean value from this history.
This mantra is the basis of what the Hadoop vendors call a data lake. A data warehouse expert would quickly recognize a data lake as a staging area for un-scrubbed detailed data… with the added benefit that a Hadoop-based data lake can store and process data at a $1K/TB price point… and this makes it cost-effective to persist the staged data online forever.
So what does this mean to a Teradata EDW owner? Teradata has published numbers (here) suggesting that 92% of the queries in an EDW only touch 20% of the data. I would suggest that there is some sort of similar ratio that holds for 90% of the remaining queries… they may touch only another 40% of the data. This suggests that the 40% of the data remaining is online to service less than 1% of the queries… and I suggest that these queries can be effectively serviced from the $1K/TB Hadoop annex.
In other words, almost every Teradata shop can immediately benefit from Teradata’s new product announcements by moving 40% of their Teradata database data to Hadoop. Such a move would free Teradata disk space and likely take pressure off to upgrade the cluster. Further, when an upgrade is required, users can reduce the disk footprint of the Teradata database side of the system; add a Hadoop annex, and significantly reduce the TCO of the overall configuration.
Some time back I suggested that Teradata would be squeezed by Hadoop (here and here). To their credit Teradata is going to try and mitigate the squeeze. But the economics remain… and Teradata customers should seriously consider how to leverage the low $/TB of Teradata’s Hadoop offering to reduce costs. Data needs to reside in the lowest cost infrastructure that still provides the required level of service… and the Teradata Hadoop integration provides an opportunity to leverage a new, low-cost, infrastructure.

Part 8 – How Hadooped is SQL Server PDW with Polybase?

Now for SQL Server… continuing the thread on RDBMS-Hadoop integration (Part 1Part 2, Part 3, Part 4Part 5, Part 6, Part 7) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Before we start I will suggest a fourth criteria that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS. Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute the data across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Poybase manages this as part of our first criteria.

Also note… since I started this series Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP.

Now on to Microsoft…

First, Polybase has effective parallel pipes to move data from HDFS to the parallel SQL Server instances in PDW. This matches the best capability of other products like Teradata and Greenplum in this category. But where Teradata and Greenplum move data and then redistribute it, pushing the data over a network twice, Poybase has pushed the PDW hash function down to the HDFS node so that data is distributed as it is sent. This very nice feature skips one full move of the data.

Our second criteria considers how smart the connector is in pushing down filters/predicates. Polybase uses a cost-based approach to determine whether is is less expensive to push predicates down or to move all of the data up to the PDW layer. This is a best-in-class capability.

For the 3rd criteria we ask does the architecture push down advanced functions like joins and aggregates… and does the architecture minimize data pulled up to join with semi-joins? Polybase again provides strong capabilities here pushing down joins and aggregates. Polybase does not use semi-joins, so there is room to improve here… but Microsoft clearly has this capability in their roadmap.

One final note… Polybase works with PDW but not with other SQL Server products. This limitation may be relevant in many cases.

PDW + Polybase is a strong offering… matching HANA in most aspects with HANA having a slight edge in push-down with semi-joins but with SQL Server matching this with the most sophisticated parallel data distribution capability.

References

Part 6: How Hadooped is HANA?

Now for HANA plus Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

As a preface I need to include a note/apology. As you will see HANA may well have the best RDBMS-Hadoop integration in the market. I try hard not to blow foam about HANA in this blog… and I hope that the objective criteria I have devised to evaluate all of the products will keep this post credible… but please look at this post harder than most and push back if you think that I overstep.

First… surprisingly, HANA’s first release has only a single pipe to the Hadoop side. This is worrisome but easily fixed. It will negatively impact performance when large tables/files have to be moved up for processing.

But HANA includes Hadoop as a full partner in a federated data architecture using the Smart Data Access (SDA) engine inside the HANA address space. As a result, HANA not only pushes predicates but it uses cost-based optimization to determine what to push down and what to pull up. HANA interrogates the Hadoop system to gather statistics and uses the HANA optimizer to develop smart execution plans with awareness of both the speed of in-memory and the limited memory resources. When data in HANA is joined with data in Hadoop SDA effectively uses semi-joins to minimize the data pulled up.

Finally, HANA can develop execution plans that executes joins in Hadoop. This includes both joins between two Hadoop tables and joins where small in-memory tables are pushed down to execute the joins in Hadoop. The current limitation is that Hadoop files must be defined as Hive tables.

Here is the HANA execution plan for TPC-H query 19. HANA has pushed down all of the steps behind the Remote Row Scan step… so in this case the entire query including a nested loop join was pushed down. In other queries HANA will push only parts of the plan to Hadoop.

TPCH Q19 Plan

So HANA possesses a very sophisticated integration with Hadoop… with capabilities that minimize the amount of data moved based on the cost of the movement. This is where all products need to go. But without parallel pipes this sophisticated capability provides only a moderate advantage (see Part 5),

Note that this is not the ultimate in integration… there is another level… but I’ll leave some ideas for extending integration even further for my final post in the series.

Next… Part 7… considering Greenplum…

Part 4: How Hadooped is Teradata?

In this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Let’s consider the Teradata SQL-H implementation using these criteria.

First, Teradata has effective parallel pipes to move data from HDFS to the Teradata database with one pipe per node. There does not seem to be any inter-node IO parallelism. This is a solid feature.

There is a limited ability to push down predicates… SQL-H does allow data to be partitioned on the HDFS side and it will perform partition elimination if the query explicitly calls out a predicate within a partionfilter() keyword. In addition there is an ability to project out columns using a columns() keyword to explicitly specify the columns to be returned. These features are klunky but effective. You would expect partitions to be eliminated when the partitioning column is referenced with a predicate in the query like any other query… and you would expect columns to be projected out if they are not referenced. Normal SQL predicates are applied after the data is moved over the network but before every record is written into the Teradata database.

Finally SQL-H provides no advanced capabilities to push down join operators or other functions.

The bottom line: SQL-H is a sort of klunky implementation, requiring non-ANSI-standard and non-Teradata standard SQL syntax. Predicate push down is limited but better than nothing. As you will see when we review other products, SQL-H is a  basic offering. The lack of full predicate push-down and advanced features will negatively and severely impact performance when accessing large volumes of data, Big Data, and the special SQL syntax will limit the ability to access HDFS data from 3rd party tools. This performance penalty will force customers to pre-join and pre-aggregate data in Hadoop rather than access it naturally.

Next Part 5...

References

Teradata Magazine: Hands On Dynamic Access

Doug Frazier: SQL-H Presentation

Part 1: How Hadooped is Your RDBMS?

Sorry for the comic adjective “Hadooped”?

The next few blogs will try to evaluate the different approaches to integrating Hadoop and a standard RDBMS… so the first thing I’ll try in this post is to suggest a criteria based on some architectural  choices for making the evaluation. Further, I’ll inject a little surprise and make the point by using the criteria to say something about a product that is not an integration of an RDBMS and Hadoop.

For the purposes of this let me clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that queries using the full set of SQL supported by the RDBMS must be available for processing queries that refer to data across the Hadoop-RDBMS divide.

Since we’ve assumed that all SQL functionality is supported the architectural issue left to solve is performance and this issue revolves on one topic: how do we minimize the cost of moving data between the two partners for a given query?

Now to get on with it…

The easiest, but not all that easy, problem involves using parallelism to move data from one system to the other… so the first criteria we will evaluate for each product will consider how parallel is their movement of data.

The next criteria involves intelligence in the RDBMS to push down some execution operators to the data layer. Of course the RDBMS must scan remote data… so in this part of the evaluation we will grade each product’s ability to push processing down to apply predicates and project the minimal amount of data up to the RDBMS.

Finally, a most intelligent product would push more than just predicates down… it would push down joins and aggregation… and the decisions around splitting processing would be fully optimized. A most intelligent product would fully federate the HDFS data into the RDBMS.

So there you have it… I will start evaluating RDBMS-Hadoop architecture by three criteria:

  • how parallel is the data movement between the RDBMS and Hadoop;
  • is there intelligence to minimize data movement by pushing the least data and the associated query plan to one system or another… this requires parallel pipes in both directions; and
  • is there intelligence to build an optimal query plan that splits steps across both systems to completely minimize the movement of data and/or optimize the compute.

And a final word on the relative strength of each criteria:

  • If we imagine a 10-node Hadoop cluster talking to a 10-node RDBMS with 10 parallel pipes and compared it to the same setup with only 1 pipe (not parallel) then we might suggest that the parallel pipes provide a 10X performance increase.
  • If we imagine intelligence that moved 100K rows rather than 10M then we might suggest that intelligent push down might provide a 100X performance increase…
  • If we had even more intelligence and further optimized processing then another 10X-100X might be possible.

So all three criteria are not equal… intelligent query planning trumps wide pipes…

Now for the surprise… in the next blog we’ll look at how Exadata’s architecture maps to these criteria… since it is a two-tiered architecture with an RDBMS tied to a parallel file system…

You can see the rest of the series here: Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8.

Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

The First SqueezeI have suggested that the big EDW parallel databases: Teradata, Exadata, Greenplum, and Netezza in particular will be squeezed over time. Colder data will move from those products to Hadoop and hotter data will move in-memory. You can see posts on this here, here, and here.

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

The Squeeze picture assumes that Hadoop consumes more and more “big data” over time as the giant investment in that open source eco-system matures the software and improves both the performance and the feature base. I think that this is a very safe assumption. But the flip side of this assumption is that we recognize that currently the Hadoop eco-system is not particularly mature and that the performance is not top-notch. It is this flip side that provides the opportunity targeted by Pivotal and Teradata.

Here is the situation… Hadoop, even in its newbie state, is lowering the price point for biggish data. Large EDW implementations, let’s say over 100TB, that had no choice but to pick a large EDW database product 4 years ago are considering and selecting Hadoop more often at a price point 10X-20X less than the lowest street price offered by commercial DBMS vendors. But these choices are painful due to the relatively immature state of the Hadoop eco-system. It is this spot that is being targeted by Aster and GPDB… the “big data” spot where Aster and GPDB can charge a price greater than the cost of Hadoop but less than the cost of the EDW DBMS products… while providing performance and maturity worth the modest premium.

This spot, under the EDW and above Hadoop is a legitimate niche where revenue can be generated. But it is the niche that will be the first to be consumed by Hadoop as the various Hadoop RDBMS features mature. It is a niche that will not be commercially interesting in two years and will be gone in four years. Above is the Squeeze picture updated to position Aster, HAWQ and GPDB.

What would I do? Pivotal has some options. First, as I have stated before, GPDB is a solid EDW DBMS and the majority of it’s market even after running from the EDW space is there. They could move up the food chain back to the EDW space where they started and have an impact. This impact could be greater still if they could find a way to build a truly effective cloud-based EDW DBMS out of the GPDB. But this is not their current strategy and they are losing steam as an EDW both technically and in the market. The window to move back up is closing. Their current strategy which is “all-in” on Hadoop will steal business from GPDB for low-margin business around HAWQ and steal business from HAWQ for an even lower-margin business around Pivotal Hadoop. I wonder how long Pivotal can fund this strategy at a loss?

I’m not sure what I would do if I were Teradata? The investment in Aster Data is not likely to pay off before Hadoop consumes the space. Insofar as it is a sunk cost now… and they can leverage the niche described above… their positioning can earn them some revenue and stave off the full effect of the Squeeze for a short time. But Aster was never really a successful EDW play and there is no room for it to move up the food chain at Teradata.

What does this mean? Readers should take note and consider the risk that Hadoop wins in the near term… They might avoid a costly move to Aster or GPDB or HAWQ with a short lifespan. Maybe it is time to bite the bullet now and start introducing Hadoop into your infrastructure?

One final note… it is not my expectation that either the Hadoop DBMS nor any NoSQL DBMS product will consume the commercial RDBMS space anytime soon. There are reasons for this… stay tuned and I’ll post on this topic in the new year.

With this post the Database Fog Blog will receive its 100000th view. I am so grateful for your attention and consideration. And with this last post of my calendar year I wanted to say thanks… to send my regards to all whether you will be celebrating a holiday season or not… and to wish every reader, regardless of what calendar you follow, all the best in the next year…

– Rob