Some HANA and Intel Videos

Here are two videos of me speaking from the 2013 Intel Developer Forum FYI.

The first has some technical detail:

The second is more of a PR pitch about Intel Hadoop:

I’m working with Intel on a new video with a pretty interesting storyline (at least I hope that you find it interesting?)… so stay tuned.

Rob

Hadoop Squeezes Greenplum

For several years now I have been suggesting that Hadoop will squeeze the big data RDBMSs: Teradata, Exadata, Greenplum, and Netezza… squeezing them first out of the big data end of the market and then impinging on the high-end of the EDW space. Further I have suggested that there may be a significant and immediate TCO reduction from using Hadoop with your EDW RDBMS which squeezes these product’s market faster and further.

Originally I suggested that Greenplum and Netezza would feel the squeeze first since they were embracing Hadoop directly and at the expense of their RDBMS offerings. Greenplum took this further by trying to compete on price… cutting the price of the GPDB and then introducing HAWQ, basically GPDB on HDFS, at a Hadoop DBMS price point. These moves coupled with a neglect of the EDW market where Greenplum made its name apparently has allowed Hadoop to squeeze Greenplum out of the commercial market.

My network has been humming with rumors from reliable sources for 4+ weeks now… and I am now getting confirmation from both inside and outside Pivotal that the Greenplum software will move to open source in short order. The details are being worked out… and while there may still be a change of heart… it seems to be a done deal. The buzzness plan that Greenplum embarked on prior to the EMC acquisition in 2010 has not been a commercial success.

No one is sorrier to see this than me. Greenplum had a real shot at success. It was a very solid piece of work leading the space with strong architectural extensions like data flow shared nothingness, hybrid row/columnar capabilities, and into big data applications. The ORCA optimizer had the potential to change the game again.

Greenplum was nearly profitable in 2009 running hard at Teradata and Exadata and Netezza in the EDW space. The EDW market is tough… so we have to be fair and point out that pursuing this market may have led to the same result… but a small-market analytics play was followed by an open-source Hadoop play that could only end in squeezing Greenplum. There was never really a business plan with a win at the end.

Hopefully by open sourcing Greenplum some of the sound software will make it into PostgreSQL… but dishing Greenplum into the open source space with few developers and no community dishes it into the same space that Informix, Red Brick, and others sit. I know that I suggested open sourcing Greenplum over 18 months ago (see the wacky idea here)… but the idea then, as now, amounts to capitualization. I just declared what seemed to me to be inevitable a little sooner than Pivotal.

Teradata has now further embraced Hadoop… and they run the risk of repeating the Greenplum downturn. They have a much stronger market platform to work from… but in the long run this may also be a deadly embrace.

So here is another wacky idea. The only successful business model around open source software to date (which is not to say that there is not some other model to be discovered) generates revenue from support and services and just a little software around the edges. Teradata has a support team and a services business that knows big data and is embedded in the enterprise… Cloudera, Hortonworks, and MapR are not close here. Were Teradata to go after the Hadoop market with their own distribution (not much of a barrier to entry here.. just download the Apache stuff and build a team of committers… they might even be able to pick up the Pivotal team)… they would start from a spot way ahead of the start-ups in several respects… in several hard respects. Further they have Aster IP which could qualify as software around the edges. As a Hadoop player Teradata could more easily manage how Hadoop squeezes their business, mitigate risk, and emerge a big winner in the big data space.

Related Database Fog Blog Posts:

Using Teradata’s Appliance for Hadoop to Reduce TCO

Teradata has recently announced a very complete Teradata database-to-Hadoop integration. Is this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system.

The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles.
There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).
I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider this…
Before the possibility of a Hadoop annex to Teradata, Teradata customers had no choice but to store cold, old, data in the Teradata database. If, on occasion, you wanted to perform year by year comparisons over ten years of data then you needed to keep ten years of data in the database at a rough cost of $50K/TB (see here) … even if these queries were rarely executed and were not expected to run against a high performance service level requirement. If you wanted to perform some sophisticated predictive analysis against this data it had to be online. If fact, the Teradata mantra… one which I wholeheartedly agree with… suggests that you really should keep the details online forever as the business will almost always find a way to glean value from this history.
This mantra is the basis of what the Hadoop vendors call a data lake. A data warehouse expert would quickly recognize a data lake as a staging area for un-scrubbed detailed data… with the added benefit that a Hadoop-based data lake can store and process data at a $1K/TB price point… and this makes it cost-effective to persist the staged data online forever.
So what does this mean to a Teradata EDW owner? Teradata has published numbers (here) suggesting that 92% of the queries in an EDW only touch 20% of the data. I would suggest that there is some sort of similar ratio that holds for 90% of the remaining queries… they may touch only another 40% of the data. This suggests that the 40% of the data remaining is online to service less than 1% of the queries… and I suggest that these queries can be effectively serviced from the $1K/TB Hadoop annex.
In other words, almost every Teradata shop can immediately benefit from Teradata’s new product announcements by moving 40% of their Teradata database data to Hadoop. Such a move would free Teradata disk space and likely take pressure off to upgrade the cluster. Further, when an upgrade is required, users can reduce the disk footprint of the Teradata database side of the system; add a Hadoop annex, and significantly reduce the TCO of the overall configuration.
Some time back I suggested that Teradata would be squeezed by Hadoop (here and here). To their credit Teradata is going to try and mitigate the squeeze. But the economics remain… and Teradata customers should seriously consider how to leverage the low $/TB of Teradata’s Hadoop offering to reduce costs. Data needs to reside in the lowest cost infrastructure that still provides the required level of service… and the Teradata Hadoop integration provides an opportunity to leverage a new, low-cost, infrastructure.

The Greenplum ORCA Optimizer

In January Greenplum rolled out a new query optimizer. This is very cool and very advanced stuff.

Query optimization is a search problem… in a perfect world you would search through the space of all possible plans for any query and choose the least expensive plan. But the time required to iterate through all possible plans would take more time than most queries… so optimizers use rules to cut down the space searched. The rules have been built up over the years and are designed to prune the space quickly to keep performance high for simple queries. But these rules can break down when complex queries are introduced… so Greenplum made the significant investment to build a new optimizer from scratch.

Florian Waas, the leader of this program for Greenplum (now off on another venture) explained it to me this way. If the large rectangle in Figure 1 represents the total search space for a query, a modern query optimizer only searches the area in the small gray square… it looks for the best plan in that small space.

DBFog Query Search Space Fig1You may be surprised to learn that the optimizers used by every major DBMS product are single-threaded… they use only one core of a multi-core processor to search the space and produce a plan. There is no way to effectively search more with a faster single processor (even though you could search more the amount of time you spend as a percentage of the query execution time would stay the same… because the query execution would speed up as well)… so if the optimizer is to search more of the space it will have to use multiple cores and search the space in parallel… and this is exactly what Greenplum has accomplished.

The benchmark results for this are impressive (see here)… several queries in the TPC-DS suite run hundreds of times faster.

ORCA is available to early support customers now and the results map to the benchmark… some queries see an extreme performance boost, while others run significantly slower. This is to be expected from any first release optimizer.

But Greenplum have built another advanced technology into ORCA to reduce the time it will take to mature the software. ORCA includes AMPERe, an optimizer debugging facility that captures the state necessary to recreate problems and fix them. Together these capabilities: parallel search and specialized debugging have advanced the state of the art significantly.

What does it mean to you? It will take some time to shake out ORCA… and HAWQ is still very slow when compared to other analytic databases… and very very slow when compared to the in-memory databases available… and in-memory products like Spark are coming to the Hadoop eco-system. But at the price point HAWQ is a bargain. If you need an inexpensive batch engine that crunches numbers offline then in the next year, as ORCA matures, it may be worth a look.

As a side note… this topic introduces one of the issues related to in-memory databases… when even a very complex query completes with a sub-optimal plan in under a second how much time can you spend searching the plan space? I suspect that applying the parallel optimization principles developed by the Greenplum team will yield similar or even better improvements for in-memory… and these techniques will be a requirement very soon in that space.

References

 

Netezza Zone Maps and I/O Avoidance

A reader recently wrote to me and asked about Netezza: “why does everyone insist that these (zone maps) tell you where ‘not to look’ when hunting for data?”. I’ll provide a direct answer… and a more meaningful answer.

Imagine that you have a list of data blocks with some metadata for each block that tells you the range of data in each block for a given column: FOO as follows:

  • Block 1: FOO 0-173
  • Block 2: FOO 174-323
  • Block 3: FOO 324-500

and a query that selects WHERE FOO=42.

If Netezza scanned the metadata and sent its read routine the list of blocks to not read… it would send 2,3. This is clearly not the case… if there are a million blocks it would not send a list of 999999 block numbers to not read… and force the read routine to figure out what was left to read. So clearly Netezza does not really tell you where ‘not to look’. This is a clever turn-of-phrase.

But I like this particular cleverness. Every DBMS is built with features designed to avoid I/O:

  • indexes are metadata that point to blocks to reduce reading unnecessary blocks;
  • partitions contain metadata to reduce reading unnecessary partitions;
  • column stores are designed to reduce reads of unnecessary columns; and
  • caches hold hot data blocks to avoid re-reading those blocks.

In fact, the highest performance DBMS will almost always be the one that most effectively minimizes I/O. This is why in-memory databases always have the highest performance.

So while zone maps do not really tell the system directly what not to read… the effect is ‘not to look’ at unnecessary data.

Part 8 – How Hadooped is SQL Server PDW with Polybase?

Now for SQL Server… continuing the thread on RDBMS-Hadoop integration (Part 1Part 2, Part 3, Part 4Part 5, Part 6, Part 7) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Before we start I will suggest a fourth criteria that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS. Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute the data across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Poybase manages this as part of our first criteria.

Also note… since I started this series Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP.

Now on to Microsoft…

First, Polybase has effective parallel pipes to move data from HDFS to the parallel SQL Server instances in PDW. This matches the best capability of other products like Teradata and Greenplum in this category. But where Teradata and Greenplum move data and then redistribute it, pushing the data over a network twice, Poybase has pushed the PDW hash function down to the HDFS node so that data is distributed as it is sent. This very nice feature skips one full move of the data.

Our second criteria considers how smart the connector is in pushing down filters/predicates. Polybase uses a cost-based approach to determine whether is is less expensive to push predicates down or to move all of the data up to the PDW layer. This is a best-in-class capability.

For the 3rd criteria we ask does the architecture push down advanced functions like joins and aggregates… and does the architecture minimize data pulled up to join with semi-joins? Polybase again provides strong capabilities here pushing down joins and aggregates. Polybase does not use semi-joins, so there is room to improve here… but Microsoft clearly has this capability in their roadmap.

One final note… Polybase works with PDW but not with other SQL Server products. This limitation may be relevant in many cases.

PDW + Polybase is a strong offering… matching HANA in most aspects with HANA having a slight edge in push-down with semi-joins but with SQL Server matching this with the most sophisticated parallel data distribution capability.

References

Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?

Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1Part 2, Part 3, Part 4Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to effectively move data from the HDFS side to GPDB. In addition Greenplum uses their Scatter-Gather method to load data into the GPDB effectively.

There is no ability to push down predicates. When a query executes all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient and there is not even the crude capability to push down processing provided by Teradata.

Finally, there is no ability to push down joins or aggregation.

Greenplum’s offering is not very advanced. To perform with Greenplum analytics data must move between the two storage layers with no intelligence to mitigate the cost.

On to the last post in the series Part 8 on SQL Server and Polybase.

Part 6: How Hadooped is HANA?

Now for HANA plus Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

As a preface I need to include a note/apology. As you will see HANA may well have the best RDBMS-Hadoop integration in the market. I try hard not to blow foam about HANA in this blog… and I hope that the objective criteria I have devised to evaluate all of the products will keep this post credible… but please look at this post harder than most and push back if you think that I overstep.

First… surprisingly, HANA’s first release has only a single pipe to the Hadoop side. This is worrisome but easily fixed. It will negatively impact performance when large tables/files have to be moved up for processing.

But HANA includes Hadoop as a full partner in a federated data architecture using the Smart Data Access (SDA) engine inside the HANA address space. As a result, HANA not only pushes predicates but it uses cost-based optimization to determine what to push down and what to pull up. HANA interrogates the Hadoop system to gather statistics and uses the HANA optimizer to develop smart execution plans with awareness of both the speed of in-memory and the limited memory resources. When data in HANA is joined with data in Hadoop SDA effectively uses semi-joins to minimize the data pulled up.

Finally, HANA can develop execution plans that executes joins in Hadoop. This includes both joins between two Hadoop tables and joins where small in-memory tables are pushed down to execute the joins in Hadoop. The current limitation is that Hadoop files must be defined as Hive tables.

Here is the HANA execution plan for TPC-H query 19. HANA has pushed down all of the steps behind the Remote Row Scan step… so in this case the entire query including a nested loop join was pushed down. In other queries HANA will push only parts of the plan to Hadoop.

TPCH Q19 Plan

So HANA possesses a very sophisticated integration with Hadoop… with capabilities that minimize the amount of data moved based on the cost of the movement. This is where all products need to go. But without parallel pipes this sophisticated capability provides only a moderate advantage (see Part 5),

Note that this is not the ultimate in integration… there is another level… but I’ll leave some ideas for extending integration even further for my final post in the series.

Next… Part 7… considering Greenplum…

Part 4: How Hadooped is Teradata?

In this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Let’s consider the Teradata SQL-H implementation using these criteria.

First, Teradata has effective parallel pipes to move data from HDFS to the Teradata database with one pipe per node. There does not seem to be any inter-node IO parallelism. This is a solid feature.

There is a limited ability to push down predicates… SQL-H does allow data to be partitioned on the HDFS side and it will perform partition elimination if the query explicitly calls out a predicate within a partionfilter() keyword. In addition there is an ability to project out columns using a columns() keyword to explicitly specify the columns to be returned. These features are klunky but effective. You would expect partitions to be eliminated when the partitioning column is referenced with a predicate in the query like any other query… and you would expect columns to be projected out if they are not referenced. Normal SQL predicates are applied after the data is moved over the network but before every record is written into the Teradata database.

Finally SQL-H provides no advanced capabilities to push down join operators or other functions.

The bottom line: SQL-H is a sort of klunky implementation, requiring non-ANSI-standard and non-Teradata standard SQL syntax. Predicate push down is limited but better than nothing. As you will see when we review other products, SQL-H is a  basic offering. The lack of full predicate push-down and advanced features will negatively and severely impact performance when accessing large volumes of data, Big Data, and the special SQL syntax will limit the ability to access HDFS data from 3rd party tools. This performance penalty will force customers to pre-join and pre-aggregate data in Hadoop rather than access it naturally.

Next Part 5...

References

Teradata Magazine: Hands On Dynamic Access

Doug Frazier: SQL-H Presentation

Part 3: A Quick Follow-on on Parallel Databases

In the thread I’m now working on (Part 1, Part 2) I’ve talked about Exadata and a criteria for evaluating split architectures. It is worth quickly talking about homogeneous systems with no split like Teradata or Greenplum or HANA… systems with no separate query-capable top end.

Every parallel shared-nothing DBMS has big pipes to move data between nodes… and they have all of the advanced intelligence to reduce the amount of data moved both between nodes within a query.

So if we compare Exadata to a split system with an RDBMS and Hadoop it fairs very well. But if we compare Exadata to Teradata or Greenplum or HANA… then the bottlenecks in Exadata look more severe. Exadata may tie, or occasionally win, in a POC against these homogenous competitors when the queries fit their architectural sweet spot. But if any queries are included that expose the bottlenecks or the limitations in query push-down… Exadata’s weak split architecture shows.

Next: Part 4

%d bloggers like this: