Part 4: How Hadooped is Teradata?

In this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3) I have suggested that we could evaluate integration architecture using three criteria:

How parallel are the pipes to move data between the RDBMS and the parallel file system;
Is there intelligence to push down predicates; and
Is there more intelligence to push down joins and other relational operators?

Let’s consider the Teradata SQL-H implementation using these criteria.

First, Teradata has effective parallel pipes to move data from HDFS to the Teradata database, with one pipe per node. There does not seem to be any intra-node IO parallelism, but this is still a solid feature.

There is a limited ability to push down predicates. SQL-H does allow data to be partitioned on the HDFS side, and it will perform partition elimination if the query explicitly calls out a predicate within a partitionfilter() keyword. In addition, a columns() keyword lets the query explicitly specify the columns to be returned, projecting out the rest. These features are klunky but effective. You would expect partitions to be eliminated whenever the partitioning column is referenced with a predicate, as in any other query… and you would expect columns to be projected out if they are not referenced. Normal SQL predicates are applied after the data is moved over the network but before the records are written into the Teradata database.
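To make the difference concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not SQL-H itself: the table contents and the names `scan`, `query`, `partition_filter`, and `columns` are all hypothetical and only mimic the roles of the explicit partition and column hints. It counts how many rows cross the "network" with and without the hints, while the ordinary predicate is applied only after the data has been shipped.

```python
# Toy model (all names and data are hypothetical, not the SQL-H API).
# The HDFS-side table is partitioned by month: partition key -> rows.
HDFS_TABLE = {
    "2014-01": [
        {"id": 1, "month": "2014-01", "region": "east", "amount": 10, "note": "a"},
        {"id": 2, "month": "2014-01", "region": "west", "amount": 20, "note": "b"},
    ],
    "2014-02": [
        {"id": 3, "month": "2014-02", "region": "east", "amount": 30, "note": "c"},
        {"id": 4, "month": "2014-02", "region": "west", "amount": 40, "note": "d"},
    ],
    "2014-03": [
        {"id": 5, "month": "2014-03", "region": "east", "amount": 50, "note": "e"},
        {"id": 6, "month": "2014-03", "region": "west", "amount": 60, "note": "f"},
    ],
}


def scan(table, partition_filter=None, columns=None):
    """Simulate the Hadoop side and return the rows shipped over the network."""
    shipped = []
    for partition_key, rows in table.items():
        # Partition elimination happens only when an explicit filter is given.
        if partition_filter is not None and not partition_filter(partition_key):
            continue
        for row in rows:
            # Column projection happens only when columns are listed explicitly.
            if columns is not None:
                row = {name: row[name] for name in columns}
            shipped.append(row)
    return shipped


def query(table, predicate, partition_filter=None, columns=None):
    """Ship rows, then apply the ordinary predicate on the database side."""
    shipped = scan(table, partition_filter, columns)
    result = [row for row in shipped if predicate(row)]
    return result, len(shipped)


def wanted(row):
    # The ordinary WHERE-style predicate, evaluated after the transfer.
    return row["month"] == "2014-03" and row["region"] == "east"


# Without hints: every partition and every column is shipped.
naive_rows, naive_shipped = query(HDFS_TABLE, wanted)

# With explicit hints: one partition and four columns are shipped.
hinted_rows, hinted_shipped = query(
    HDFS_TABLE,
    wanted,
    partition_filter=lambda key: key == "2014-03",
    columns=["id", "month", "region", "amount"],
)

print(naive_shipped, hinted_shipped)  # 6 2
```

Both calls return the same single matching record, but the version without hints ships three times as many rows to get there. Note that the predicate on the month has to be stated twice in the hinted call, once as the hint and once as the normal predicate, which is exactly the klunkiness at issue.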

Finally, SQL-H provides no advanced capabilities to push down join operators or other functions.

The bottom line: SQL-H is a somewhat klunky implementation that requires SQL syntax that is neither ANSI-standard nor Teradata-standard. Predicate push-down is limited but better than nothing. As you will see when we review other products, SQL-H is a basic offering. The lack of full predicate push-down and advanced features will severely hurt performance when accessing large volumes of data (Big Data), and the special SQL syntax will limit the ability to access HDFS data from third-party tools. This performance penalty will force customers to pre-join and pre-aggregate data in Hadoop rather than access it naturally.

Next Part 5...

References

Teradata Magazine: Hands On Dynamic Access

Doug Frazier: SQL-H Presentation

13 thoughts on “Part 4: How Hadooped is Teradata?”

Dan Graham says:

March 4, 2014 at 5:22 am

Rob,
All of these shortcomings are release 1.0 SQL-H. No one else has gotten to this level yet, so technically it's the best Hadoop data interchange in existence. Furthermore, it is co-developed with the largest data warehouse customer in the world to move data to and from petabyte-class Hadoop clusters. One step at a time. Teradata has plenty of evidence that we know how to handle parallelism, SQL, and grand challenge data handling.
Stay tuned. Version 2 is right around the corner.
You spoke too soon.

1. Rob Klopp says:
  
  March 4, 2014 at 8:47 am
  
  Cool, Dan… let me know when v2 comes out and I’ll update this.
  
  As far as who is the best right now… let’s see what it looks like when we evaluate SQL Server, Pivotal, Oracle, and HANA…
  
  Rob
  
2. Rob Klopp says:
  
  March 11, 2014 at 9:55 am
  
  OK, Dan… I violated my normal “talk about HANA last” rule to set the bar a little higher for you… see Part 6 here (http://wp.me/p1a7GL-rA).
  
  Rob
  
  1. Dan Graham says:
    
    May 21, 2014 at 7:32 am
    
    Teradata Database 15.0 was released in early April in Prague. The facility called Teradata QueryGrid was announced. It is an internal framework for connecting a Teradata or Aster SQL query to other systems. It's a highly controlled form of federation or data virtualization, whichever word you prefer. All connections are parallel data transfers unless the remote system can’t handle it. The SQL can be issued by any BI tool and user through the Teradata or Aster database. The SQL-H connection now allows for full filtering of data on the Hadoop side. It also includes bi-directional data transfer so, for example, a massive Insert(select) can be issued with the materialized data being stored in Hive. Toss in an Infiniband network and we can literally move terabytes between systems in a few minutes. Naturally Teradata supports Hortonworks first, but we are also certifying other Hadoop distros since, after all, HCatalog is in Apache Hadoop. Net: Teradata QueryGrid provides super high speed parallel bi-directional data exchange with Apache Hadoop using SQL containing filters and joins submitted by your everyday business user. Anyway, your assessment predates this announcement. I could not divulge its progress or beta customers back then.
    
Jesse says:

April 8, 2014 at 10:42 am

Rob,

Great series of posts. Any thoughts on Teradata’s QueryGrid announcement?

1. Rob Klopp says:
  
  April 10, 2014 at 1:11 pm
  
  Thanks, Jesse…
  
  I think that QueryGrid is absolutely the right direction… clearly it is what Dan Graham was suggesting below. Well done, Dan. Kudos to Teradata. As far as Hadoop integration goes this puts Teradata on par with both HANA and SQL Server (the next post in the series).
  
  I guess that next I’ll have to figure out how to evaluate the architecture of federation schemes… there are several ways to go about it and some approaches are significantly better than others…
  
  Rob
  
