Part 8 – How Hadooped is SQL Server PDW with Polybase?

Now for SQL Server… continuing the thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Before we start I will suggest a fourth criterion that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS? Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute it across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Polybase manages this as part of our first criterion.

Also note… since I started this series Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP.

Now on to Microsoft…

First, Polybase has effective parallel pipes to move data from HDFS to the parallel SQL Server instances in PDW. This matches the best capability of other products like Teradata and Greenplum in this category. But where Teradata and Greenplum move data and then redistribute it, pushing the data over a network twice, Polybase has pushed the PDW hash function down to the HDFS nodes so that data is distributed as it is sent. This very nice feature skips one full move of the data.
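
To make the idea concrete, here is a minimal Python sketch of the general technique… not Polybase's actual code. The hash function, the segment count, and the customer_id distribution key are all assumptions for illustration: rows are bucketed for their target segment while they are read, so each row crosses the network once.

```python
# Illustrative sketch only (not Microsoft's code): apply the target hash
# function while rows are read from HDFS, so each row crosses the network
# once, landing directly on the segment that will own it.
import zlib

SEGMENT_COUNT = 4  # assumed number of target database segments


def target_segment(distribution_key: str, segment_count: int = SEGMENT_COUNT) -> int:
    """Map a row's distribution key to the segment that will own it."""
    return zlib.crc32(distribution_key.encode("utf-8")) % segment_count


def distribute(rows):
    """Bucket rows by target segment in a single pass, instead of landing
    them anywhere and redistributing in a second network hop."""
    buckets = {i: [] for i in range(SEGMENT_COUNT)}
    for row in rows:
        buckets[target_segment(row["customer_id"])].append(row)
    return buckets


if __name__ == "__main__":
    sample = [{"customer_id": f"C{i:03d}", "amount": i * 10} for i in range(12)]
    for segment, rows in distribute(sample).items():
        print(segment, [r["customer_id"] for r in rows])
```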

Our second criterion considers how smart the connector is in pushing down filters/predicates. Polybase uses a cost-based approach to determine whether it is less expensive to push predicates down or to move all of the data up to the PDW layer. This is a best-in-class capability.
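
As a rough illustration of what a cost-based choice can look like… the constants and formulas below are invented for this sketch, not Polybase's actual cost model. The decision simply compares filtering in Hadoop and shipping only the survivors against shipping everything and filtering in the database:

```python
# Toy cost model: push the predicate down only when scanning in Hadoop plus
# moving the filtered rows beats moving everything up and filtering there.
# All constants and formulas are invented for illustration.

def cost_pushdown(rows, selectivity, hadoop_scan_cost=1.0, transfer_cost=2.0):
    return rows * hadoop_scan_cost + rows * selectivity * transfer_cost

def cost_pull_all(rows, transfer_cost=2.0, db_scan_cost=0.5):
    return rows * transfer_cost + rows * db_scan_cost

def should_push_down(rows, selectivity):
    return cost_pushdown(rows, selectivity) < cost_pull_all(rows)

if __name__ == "__main__":
    print(should_push_down(rows=1_000_000, selectivity=0.01))  # True: selective filter
    print(should_push_down(rows=1_000_000, selectivity=0.95))  # False: cheaper to pull it all
```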

For the third criterion we ask whether the architecture pushes down advanced functions like joins and aggregates… and whether it minimizes the data pulled up to join by using semi-joins. Polybase again provides strong capabilities here, pushing down joins and aggregates. Polybase does not use semi-joins, so there is room to improve… but Microsoft clearly has this capability on its roadmap.
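
For reference, here is a minimal sketch of the semi-join idea in general terms (again, not Polybase code): ship the distinct join keys from the database side down to the data and pull back only the rows that will actually survive the join. The table and key names are made up for the example.

```python
# Illustrative semi-join reduction: send the (small) set of join keys to the
# data, and pull up only the rows that will survive the join.

def semi_join_filter(hdfs_rows, db_join_keys):
    """Return only the HDFS rows whose key appears on the database side."""
    keys = set(db_join_keys)
    return [row for row in hdfs_rows if row["order_id"] in keys]

if __name__ == "__main__":
    hdfs_rows = [{"order_id": i, "clicks": i * 3} for i in range(1_000)]
    db_join_keys = [5, 42, 77]  # keys from the RDBMS side of the join
    reduced = semi_join_filter(hdfs_rows, db_join_keys)
    print(len(reduced), "rows pulled up instead of", len(hdfs_rows))
```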

One final note… Polybase works with PDW but not with other SQL Server products. This limitation may be relevant in many cases.

PDW + Polybase is a strong offering… matching HANA in most respects. HANA has a slight edge in push-down thanks to semi-joins, but SQL Server counters with the most sophisticated parallel data distribution capability.


HANA, BLU, Hekaton, and Oracle 12c vs. Teradata and Greenplum – November 2013


I would like to point out a very important section in the Hekaton paper on the Microsoft Research site here. I will quote the section in full:

2. DESIGN CONSIDERATIONS 

An analysis done early on in the project drove home the fact that a 10-100X throughput improvement cannot be achieved by optimizing existing SQL Server mechanisms. Throughput can be increased in three ways: improving scalability, improving CPI (cycles per instruction), and reducing the number of instructions executed per request. The analysis showed that, even under highly optimistic assumptions, improving scalability and CPI can produce only a 3-4X improvement. The detailed analysis is included as an appendix. 

The only real hope is to reduce the number of instructions executed but the reduction needs to be dramatic. To go 10X faster, the engine must execute 90% fewer instructions and yet still get the work done. To go 100X faster, it must execute 99% fewer instructions. This level of improvement is not feasible by optimizing existing storage and execution mechanisms. Reaching the 10-100X goal requires a much more efficient way to store and process data. 
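
The arithmetic behind the quote is easy to verify. As a back-of-the-envelope sketch (my own, not from the paper): if time per request is roughly instructions × CPI ÷ clock rate, then for a fixed clock a target speedup of S with a CPI improvement of c leaves room for only c/S of the original instruction count.

```python
# Back-of-the-envelope version of the Hekaton argument: time per request is
# roughly instructions * CPI / clock_rate, so for a fixed clock the target
# speedup must come from instruction count once CPI gains are exhausted.

def required_instruction_fraction(target_speedup, cpi_improvement=1.0):
    """Fraction of the original instruction count that may remain."""
    return cpi_improvement / target_speedup

for speedup in (10, 100):
    remaining = required_instruction_fraction(speedup)
    print(f"{speedup}X faster -> execute {remaining:.0%} of the instructions "
          f"({1 - remaining:.0%} fewer)")

# Even granting the paper's optimistic 3-4X from scalability and CPI,
# a 100X goal still needs roughly 97% fewer instructions:
print(f"{1 - required_instruction_fraction(100, cpi_improvement=3):.0%} fewer")
```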

This is important because it confirms the difference between a Level 3 and a Level 2 columnar implementation as described here. It is just not possible for a Level 2 implementation with a row-based join engine to achieve the performance of a Level 3 implementation. This will allow the Level 3 implementations (HANA, BLU, Hekaton, and Oracle 12c) to distance themselves from the Level 2 products (Teradata and Greenplum) by more than 10X… and this is a very significant advantage.


A Rebuttal…

This post has been thrown at me a couple of times now… so I'll take the time to go through it… and try to address the junk.

It starts by suggesting that “the Germans” have started a war… but the next sentence points out that the author tossed grenades at HANA two months before the start date he suggests. It also ignores the fact that the HANA post in question was a response to incorrect public statements about HANA by a Microsoft product manager (here).

The author suggests some issue with my understanding of clustered indexes… noting that “There are 2 implementations of xVelocity columnstore technology: 1. Non clustered index which is read only – this is the version available in SMP (single node) SQL Server 2012. 2. Columnstore as a clustered index that is updateable – This is the version available in MPP or PDW version of SQL 2012.” The Microsoft documentation I read did not distinguish between the two, so I mistakenly attributed features of one to the other. Hopefully this clears up the confusion.

He suggests that the concept of keeping redundant versions of the data… one for OLTP and one for BI… is “untrue”. I believe that the conventional way to deal with OLTP and BI is to build separate OLTP and BI databases… data warehouses and data marts. So I stand by the original comment.

The author rightfully suggests that I did not provide a reference for my claim that there are odd limitations to the SQL that require hand-coding… here they are (see the do’s and don’ts).

He criticizes my statement that shared-nothing gave us the basis for solving “big data”. I do not understand the criticism. Nearly every very large database in the world is based on a shared-nothing architecture… and SQL Server PDW is based on the same architecture in order to allow SQL Server to scale.

He is critical of the fact that HANA is optimized for the hardware and suggests that HANA does not support Intel’s Ivy Bridge. HANA is optimized for Ivy Bridge… and HANA is designed to fully utilize the hardware. If we keep it simple and suggest that hardware-specific instruction sets and hardware-specific techniques to keep data in cache together provide a 50X performance boost [this ignores the advantages of in-memory and focuses only on hardware-specific optimizations… where data in cache is either 15X (L3), 20X (L2), or 200X (L1) faster than data fetched from DRAM… plus 10X or more from super-computer SIMD instructions], I would ask: would you spend 50X more for under-utilized hardware if you had a choice? SAP is pursuing a distinct strategy that deserves a more thoughtful response than the author provided.
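
For what it is worth, here is a toy Amdahl-style model of that claim (the cache-hit rates are my own invented assumptions, not a benchmark): blend the cache-vs-DRAM factor by an assumed hit rate, then apply a SIMD factor, and a boost of that rough magnitude falls out.

```python
# Toy Amdahl-style model (invented numbers, not a benchmark): blend a
# cache-vs-DRAM factor by an assumed cache-hit rate, then apply a SIMD
# factor, to see how a large aggregate boost could plausibly arise.

def effective_speedup(cache_hit_rate, cache_factor=20.0, simd_factor=10.0):
    """Amdahl-style blend: misses run at DRAM speed (factor 1.0)."""
    memory_gain = 1.0 / (cache_hit_rate / cache_factor + (1.0 - cache_hit_rate))
    return memory_gain * simd_factor

for hit in (0.80, 0.95, 0.99):
    print(f"assumed cache-hit rate {hit:.0%}: ~{effective_speedup(hit):.0f}X")
```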

He accuses me of lying… lying… about SQL Server being architected for single-core 286 processors. Sigh. I am unaware of a rewrite of the SQL Server product since the 286… and tacking on support for modern processors is not re-architecting. If SQL Server had been re-architected from scratch since then I would be happy to learn that I was mistaken… but until I hear about a rewrite I will assume that the core SQL Server architecture is unchanged from when Sybase originally developed it and licensed it to Microsoft.

He says that HANA is cobbled together from older piece parts… and points to a Wikipedia page. But he does not use the words in the article… that HANA was synthesized from other products and, as stated in the next sentence, built on “a new application architecture”. So he leaves the reader to believe that there is nothing new… he is mistaken. HANA is more than a synthesis of in-memory, column-store, and shared-nothing. It includes a new execution engine built on algorithms from the search space… columns in the column store are processed as vectors rather than with the rote tuple-by-tuple approach from the 1980s. It includes powerful in-database support for procedural languages, with facilities that convert loops to fully parallel set-based processes. It provides, as noted above, a unique approach to supporting OLTP and BI queries in the same instance (see here)… and more. I’m not trying to hype HANA here… time and the market will determine whether these new features are important… but there is no doubt that they are new.
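
To illustrate the vector-at-a-time idea in general terms… this is a generic NumPy sketch, not HANA’s engine: the predicate and the aggregate are each applied to a whole column at once rather than evaluated row by row, which is the style that lends itself to SIMD.

```python
# Illustrative only: process a column as a vector in one pass versus the
# tuple-at-a-time style, to show the difference in how the work is expressed.
# This sketches the general technique, not HANA's actual engine.
import numpy as np

rows = [{"region": "EMEA", "revenue": float(i)} for i in range(100_000)]

# Tuple-at-a-time: touch every row object, evaluate the predicate per row.
total_row_at_a_time = 0.0
for row in rows:
    if row["revenue"] > 50_000:
        total_row_at_a_time += row["revenue"]

# Vectorized: the column is a contiguous array; the predicate and the sum
# are each a single operation over the whole vector (SIMD-friendly).
revenue = np.fromiter((r["revenue"] for r in rows), dtype=np.float64)
total_vectorized = revenue[revenue > 50_000].sum()

assert total_row_at_a_time == total_vectorized
```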

I did not find the Business Intelligist post to be very informative or helpful. With the exception of the Wikipedia article mentioned above there is only unsubstantiated opinion in the piece… and a degree of rudeness that is wholly uncalled for.

Microsoft SQL Server Announcements – November 2012

Here is one I composed for SAP on the HANA blog about the recent Microsoft SQL Server announcements that is not too obnoxiously pro-HANA. It is more about the data architecture required to handle a world where the client is a mobile device and every query must complete sub-second. This, I believe, is where we are headed… taking those BI queries that run in an hour on weak warehouses and improving the response to 10 seconds won’t cut it if your user is on a mobile device… and if the query is customer-facing you will be out of business…

The only way to solve for this is to get lots of silicon between you and your data… and hope that no queries miss the cache… or put it all in-memory.

———

I might have added to that post that anytime a database vendor pre-announces a product due out in 1-2 years (“2014-2015” in this case) it is marketing, not architecture… meant to freeze SQL Server customers in place while Microsoft tries to catch up.

Make sure to have a look at the comments… there is a great link to a Microsoft mouthpiece who suggests that I must have no technical background and that I am a liar. Nice.

30+ Year Old Database Architecture: DB2, Oracle, Postgres, Teradata, Sybase, and More…

As you look at the enterprise RDBMS marketplace today you will find something shocking… almost every product in the market is built on designs and concepts that are over thirty years old. IBM’s System R grew into DB2 and influenced Oracle before 1980. Ingres, developed before 1980, became Postgres, which in turn spawned Netezza and Greenplum and more. Teradata was a fresh start… around 1980.

This is not a bad thing in its own right… but imagine the hardware architectures these systems were designed and optimized for. Maybe DB2 was built for a multi-core mainframe… maybe Oracle too… maybe. Memory was tiny… so memory management was important and memory was used sparingly. Data sizes were tiny. Consider that Teradata named the company on the belief that someday, way beyond the planning horizon, some customers might get to a terabyte of data.

The reality is that these old designs are inefficient. The vendors have hacked the old code to continuously extend their products. I mean this as a compliment… it is not trivial engineering to find tweaks and tack-ons that make old code work on new hardware architectures. Teradata, Netezza, and Greenplum designed ways to use multiple address spaces to take advantage of multiple cores. Oracle tacked a shared-nothing I/O subsystem onto a shared-everything architecture to stretch it.

But these hacks are not efficient.

Yale is working on some new-new stuff (see here). HANA is based on a completely different design (see here). The NoSQL vendors have bent the ACID-tested rules, if not always the fundamental approaches.

I can’t help but believe that in one of these new approaches lies a path forward.

If you would like to read some history of those beginnings, here is a cool link.
