In this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3) I have suggested that we could evaluate integration architecture using three criteria:
- How parallel are the pipes to move data between the RDBMS and the parallel file system;
- Is there intelligence to push down predicates; and
- Is there more intelligence to push down joins and other relational operators?
Let’s consider the Teradata SQL-H implementation using these criteria.
First, Teradata has effective parallel pipes to move data from HDFS to the Teradata database, with one pipe per node. This is a solid feature, although there does not seem to be any intra-node IO parallelism.
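To make the one-pipe-per-node point concrete, here is a rough back-of-the-envelope sketch in Python. It is not Teradata code: the function names, the round-robin block assignment, the block sizes, and the pipe bandwidth are all assumptions chosen for illustration. It only shows why elapsed transfer time shrinks with the number of nodes when each node drains its own share of blocks over a single pipe.

```python
from typing import List


def assign_blocks(block_sizes_mb: List[float], node_count: int) -> List[List[float]]:
    """Deal blocks round-robin to nodes (an assumed policy); each node owns one pipe."""
    shares: List[List[float]] = [[] for _ in range(node_count)]
    for index, size in enumerate(block_sizes_mb):
        shares[index % node_count].append(size)
    return shares


def transfer_seconds(block_sizes_mb: List[float], node_count: int,
                     pipe_mb_per_sec: float) -> float:
    """Estimate elapsed time when pipes run side by side, one per node."""
    shares = assign_blocks(block_sizes_mb, node_count)
    # A single pipe reads its blocks one after another, so the busiest
    # pipe determines the elapsed time for the whole transfer.
    return max(sum(share) for share in shares) / pipe_mb_per_sec


# Hypothetical workload: 64 blocks of 128 MB over pipes of 100 MB/s each.
blocks = [128.0] * 64
one_node = transfer_seconds(blocks, node_count=1, pipe_mb_per_sec=100.0)
four_nodes = transfer_seconds(blocks, node_count=4, pipe_mb_per_sec=100.0)
print(one_node, four_nodes)
```

In this toy model, adding nodes is the only way to add pipes, which is why per-node parallelism matters for large transfers.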
There is a limited ability to push down predicates… SQL-H does allow data to be partitioned on the HDFS side and it will perform partition elimination if the query explicitly calls out a predicate within a partitionfilter() keyword. In addition, there is an ability to project out columns using a columns() keyword to explicitly specify the columns to be returned. These features are klunky but effective. You would expect partitions to be eliminated whenever a query predicate references the partitioning column, as in any other query… and you would expect columns to be projected out if they are not referenced. Normal SQL predicates are applied after the data is moved over the network but before each record is written into the Teradata database.
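The difference between the explicit hints and an ordinary predicate can be sketched with a small Python toy model. Nothing here is SQL-H syntax or a Teradata API: scan_remote, run_query, and the sample table are hypothetical names and data, and the model only counts how many rows would cross the network in each case.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


def make_rows(month: str, sales: List[Tuple[str, int]]) -> List[Row]:
    return [{"month": month, "region": region, "amount": amount}
            for region, amount in sales]


# Hypothetical HDFS-side table, partitioned by month.
HDFS_TABLE: Dict[str, List[Row]] = {
    "2013-01": make_rows("2013-01", [("east", 10), ("west", 20)]),
    "2013-02": make_rows("2013-02", [("east", 30), ("west", 40)]),
    "2013-03": make_rows("2013-03", [("east", 50), ("west", 60)]),
}


def scan_remote(table: Dict[str, List[Row]],
                partition_filter: Optional[Callable[[str], bool]] = None,
                columns: Optional[List[str]] = None) -> List[Row]:
    """Return the rows that would be shipped over the network."""
    moved: List[Row] = []
    for partition_key, rows in table.items():
        # Partition elimination: whole partitions are skipped remotely,
        # but only when the caller supplies an explicit filter.
        if partition_filter is not None and not partition_filter(partition_key):
            continue
        for row in rows:
            # Column projection: ship only the explicitly listed columns.
            moved.append(row if columns is None else {c: row[c] for c in columns})
    return moved


def run_query(table: Dict[str, List[Row]],
              predicate: Callable[[Row], bool],
              partition_filter: Optional[Callable[[str], bool]] = None,
              columns: Optional[List[str]] = None) -> Tuple[List[Row], int]:
    moved = scan_remote(table, partition_filter, columns)
    # Ordinary predicates are applied only after the rows have moved.
    kept = [row for row in moved if predicate(row)]
    return kept, len(moved)


# Same question asked two ways: east-region sales for 2013-02.
naive_rows, naive_moved = run_query(
    HDFS_TABLE,
    lambda r: r["month"] == "2013-02" and r["region"] == "east",
)
hinted_rows, hinted_moved = run_query(
    HDFS_TABLE,
    lambda r: r["region"] == "east",
    partition_filter=lambda month: month == "2013-02",
    columns=["region", "amount"],
)
print(naive_moved, hinted_moved)
```

Both calls keep the same single sale, but the unhinted query ships every row before filtering, while the hinted one ships only the matching partition and the requested columns. On real data volumes that gap is the cost of having to spell out the hints by hand.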
Finally SQL-H provides no advanced capabilities to push down join operators or other functions.
The bottom line: SQL-H is a sort of klunky implementation, requiring SQL syntax that is neither ANSI-standard nor Teradata-standard. Predicate push-down is limited but better than nothing. As you will see when we review other products, SQL-H is a basic offering. The lack of full predicate push-down and advanced features will severely hurt performance when accessing large volumes of data (Big Data), and the special SQL syntax will limit the ability to access HDFS data from third-party tools. This performance penalty will force customers to pre-join and pre-aggregate data in Hadoop rather than access it naturally.
Next Part 5...
Teradata Magazine: Hands On Dynamic Access
Doug Frazier: SQL-H Presentation
13 thoughts on “Part 4: How Hadooped is Teradata?”
All of these shortcomings are release 1.0 SQL-H. No one else has gotten to this level yet, so technically it's the best Hadoop data interchange in existence. Furthermore, it is co-developed with the largest data warehouse customer in the world to move data to and from petabyte-class Hadoop clusters. One step at a time. Teradata has plenty of evidence that we know how to handle parallelism, SQL, and grand challenge data handling.
Stay tuned. Version 2 is right around the corner.
You spoke too soon.
Cool, Dan… let me know when v2 comes out and I’ll update this.
As far as who is the best right now… let’s see what it looks like when we evaluate SQL Server, Pivotal, Oracle, and HANA…
OK, Dan… I violated my normal “talk about HANA last” rule to set the bar a little higher for you… see Part 6 here (http://wp.me/p1a7GL-rA).
Teradata Database 15.0 was released in early April in Prague. The facility called Teradata QueryGrid was announced. It is an internal framework for connecting a Teradata or Aster SQL query to other systems. It's a highly controlled form of federation or data virtualization, whichever word you prefer. All connections are parallel data transfers unless the remote system can’t handle it. The SQL can be issued by any BI tool and user through the Teradata or Aster database. The SQL-H connection now allows for full filtering of data on the Hadoop side. It also includes bi-directional data transfer so, for example, a massive Insert(select) can be issued with the materialized data being stored in Hive. Toss in an Infiniband network and we can literally move terabytes between systems in a few minutes. Naturally Teradata supports Hortonworks first, but we are also certifying other Hadoop distros since, after all, HCatalog is in Apache Hadoop. Net: Teradata QueryGrid provides super-high-speed parallel bi-directional data exchange with Apache Hadoop using SQL containing filters and joins submitted by your everyday business user. Anyway, your assessment predates this announcement. I could not divulge its progress or beta customers back then.
Great series of posts. Any thoughts on Teradata’s QueryGrid announcement?
I think that QueryGrid is absolutely the right direction… clearly it is what Dan Graham was suggesting below. Well done, Dan. Kudos to Teradata. As far as Hadoop integration goes this puts Teradata on par with both HANA and SQL Server (the next post in the series).
I guess that next I’ll have to figure out how to evaluate the architecture of federation schemes… there are several ways to go about it and some approaches are significantly better than others…