Now for HANA plus Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5) I have suggested that we could evaluate integration architecture using three criteria:
- How parallel are the pipes to move data between the RDBMS and the parallel file system;
- Is there intelligence to push down predicates; and
- Is there more intelligence to push down joins and other relational operators?
As a preface I need to include a note/apology. As you will see HANA may well have the best RDBMS-Hadoop integration in the market. I try hard not to blow foam about HANA in this blog… and I hope that the objective criteria I have devised to evaluate all of the products will keep this post credible… but please look at this post harder than most and push back if you think that I overstep.
First… surprisingly, HANA’s first release has only a single pipe to the Hadoop side. This is worrisome but easily fixed. It will negatively impact performance when large tables/files have to be moved up for processing.
But HANA includes Hadoop as a full partner in a federated data architecture using the Smart Data Access (SDA) engine inside the HANA address space. As a result, HANA not only pushes predicates but it uses cost-based optimization to determine what to push down and what to pull up. HANA interrogates the Hadoop system to gather statistics and uses the HANA optimizer to develop smart execution plans with awareness of both the speed of in-memory and the limited memory resources. When data in HANA is joined with data in Hadoop SDA effectively uses semi-joins to minimize the data pulled up.
Finally, HANA can develop execution plans that executes joins in Hadoop. This includes both joins between two Hadoop tables and joins where small in-memory tables are pushed down to execute the joins in Hadoop. The current limitation is that Hadoop files must be defined as Hive tables.
Here is the HANA execution plan for TPC-H query 19. HANA has pushed down all of the steps behind the Remote Row Scan step… so in this case the entire query including a nested loop join was pushed down. In other queries HANA will push only parts of the plan to Hadoop.
So HANA possesses a very sophisticated integration with Hadoop… with capabilities that minimize the amount of data moved based on the cost of the movement. This is where all products need to go. But without parallel pipes this sophisticated capability provides only a moderate advantage (see Part 5),
Note that this is not the ultimate in integration… there is another level… but I’ll leave some ideas for extending integration even further for my final post in the series.
Next… Part 7… considering Greenplum…
15 thoughts on “Part 6: How Hadooped is HANA?”
Can you clarify the statement “HANA can develop execution plans that executes joins in Hadoop”?
Is Hadoop/Hive(?) running Map-Reduce or SQL joins?
Hi Dong Jiang,
Great question… The answer is that HANA uses HiveQL as a low level interface. HANA does not try to translate the HANA SQL into HiveQL… this would fail as the HANA SQL dialect is far richer than Hive’s… rather it builds a plan using basic HiveQL as execution operators. Think of it this way… an execution plan includes operators like perform Scan and Project, or perform Nested Join, or Aggregate. SDA execute these operators using HiveQL. The published SDA roadmap suggests that HANA will be extended to replace HiveQL operators with MapReduce operators… and then to replace MapReduce with native HDFS operators.
So the vision is HANA(SQL)-on-HDFS, makes perfect sense.
Why is there an intermediate step of MR operations? Go straight to HDFS operators, don’t be late to the party(market).
I understand… I imagine that it is easier to get to the party sooner with MR…
Makes sense. But HANA still relies on Hive and all the other (possibly non-realtime, but batch-processed, like MapReduce) stuff under the hood, right?
It does, Andreas… We would suggest that real-time data needs to be in HANA in-memory… or in an intermediate, lukewarm, store. Smart Data Access is a full federation engine… so you could have hot data in HANA, lukewarm data in Teradata, Exadata, or Sybase IQ; and cold data in Hadoop… and SDA would develop join plans that integrate them all. What this means is that when there is a real-time, non-batch, big data store available we can include it in the architecture (we have an SDA SDK available to extend functionality).
It is also interesting to note that if you put dimension tables in HANA… they are small and push down predicates and big table joins… HANA’s join engine can often execute the dimensional joins faster than the underlying RDBMS. In other words, in the lab we see facts on Teradata and dimensions on HANA often executing queries faster than when all of the data is in Teradata.
Rob, I’m pretty sure that parallel pipes is included in the next release of HANA – I’d need to go back and check but given the multi-engine philosophy of HANA and use of optimisation to do push down it’d be odd if it didn’t.
I’ve seen similar cost based optimisation across distributed systems before, notably in IBM’s Data Integrator , which effectively externalised the cost based optimiser of DB2 across multiple engines and took into account not just the DBMS costs within nodes but also things like network overheads between nodes.
I think your way of categorising these capabiltiies is very useful – there is a danger that this capability is seen in a binary fashion, ie. a particualar product does / does not do it. But in fact, as ever ‘the devil is in the details’ and its a question of degree; Is there a cost based optimiser ? If so what get’s pushed down and when ? (does the optimiser take advantage of pushdown when it can?), and when its swapping data between steps in the overall access paln is that data transfer done efficiently – either in memory or with parallel pipes if an external interface is used?
Thanks, Henry… I knew of HANA’s future parallel pipes… but thought it best to focus on the here and now. If I open the door to futures I end up have to evaluate phantoms.
I don’t think that it is a matter of degree, though. A product that pushes down predicates could be 10X-100X faster than a product that does not. A product that pushes down joins will be 10X-100X faster than a product that does not. Now if we start to evaluate two products who both optimize joins that is a different story and we have to start evaluating the effectiveness of the optimizer… but that is about features more than architecture and I try to stay away from those close calls.
I guess I’m still stuck on “How Parallel” (for all the platforms.) We can assume that the geometry of the two platforms – no matter which vendor – will be different, which leads down the path that in a parallel pipe implementation, one must consider the redistribution to account for the geometry difference as relevant to the definition of “how parallel”.
The question is – which side does this work? The pusher or the puller. It’s not that it needs to be done, it’s how it’s done. Some implementations will do this redistribution in the network, some will persist the state of the new distribution to storage before or after transmission.
Rob: you will likely think this is simply an optimization of the first of the three rules, but because some implementations will result in order of magnitude differences, if not the least 2x the Read/Writes, I think the effectiveness of the redistribution algorithm has to be addressed as a fundamental, much the same way we think of a column store implementation in three basic gradients…
Mostly I agree Michael… Originally I thought that I might try to unravel the different issues associated with networks that connect the two systems and the geometry between the two. You can see some of this in my post in Part 2 on Exadata where I point out the problems of an asymmetric geometry when 3 storage nodes try to feed big data to 1 RAC node. If you have hundreds of nodes in a big data platform it will be tough to ram big data into a small number of RDBMS nodes… So big parallel RDBMS systems who can deploy an RDBMS node for every HDFS node will have some advantage in moving data.
But it is my contention that the best strategy is to minimize the data sent across the network by pushing down as much functionality as is cost-justified, where an optimizer is determining the cost. Further, as the performance and functionality on the big data side improves it will make ever more sense to push down ever more function moving the small data to the big data. Once I looked at the problem I decided that fiddling with the parallel pipe side of the equation missed the bigger opportunity… which was to better use the parallel processing on the big data side of the equation and whittle down the data to be moved rather than to optimize the ability to move data.
Of course, in the end, we need both. Maybe I’ll try and develop a future blog to help readers sort out the different parallel network architectures and the implications of each? Would you be interested in co-authoring a post?
Rob, I had the same thinking, but the more I thought about some of the foobar’d implementations I’ve seen, the more I think it is a consideration when the implementations are so young. The issue is not really about the networking – it’s how the implementations deal with the redistribution required by the differences in geometry.
Reasoning is that getting to a performant interaction is actually a three step optimization process – you have two of them defined (single and multiple data streams), the third one is the redistribution… get that redistribution algo wrong and it won’t matter a bit if it’s parallel or single threaded…
From another perspective, redistribution is a First Class Citizen in every Parallel system and the wider the geometry of the system, the more negatively impactful the potential. Further, the wider the distance disparity of the two geometries, the more likely the implementation performance will go substantially sub-linear.
And yes, happy to help.
This will be an interesting topic…
The redistribution problem is a problem after the RDBMS side of the house receives the data. Teradata will write the data pulled up from Hadoop to spool and then redistribute it… Greenplum will scatter it (write it) and then gather it (redistribute it)… both doing essentially the same thing.
If the asymmetry between the size of the answer set and the receiving RDBMS is too large there will surely be issues. HANA, for example, could not consume a 500TB answer set into a 1TB memory space. But even in the Teradata case there has to be 500TB of spool. This is why push-down is so important… it reduces the data returned… in the case of HANA and SQL Server (stay tuned) the answer is to push all of the processing down rather than pull 500TB up.
I wonder if the problem you are seeing is a combination of asymmetry and lack of push-down?
Sorry for the late reply… So, predicate push down is assumed for all systems, that’s not the optimization I’m thinking of – but you’re close. The essence of the problem is that the movement from a source MPP to a target MPP with dissimilar geometry IS a redistribution.
Question is – can you do the distribution on copy, or is it required to persist and redistribute (or scatter/gather…). In order to do it on copy, some capability must exist to push down the distribution algorithm such that the sending system can send the right data to the right place.
In all cases, except possibly for copy TO HDFS (because it’s typically unordered anyway), the target system will still be required to read and write the data a second time – to put the data in whatever indexing scheme the target is using, be it TD Hashed, or B-Tree – whatever the file structure of the target happens to be.
Hence, the optimization is entirely about removing the – second – network copy, and potentially a second sort and write of the data. So – on top of the pushdown, making the network penalty 1x instead of 2x and the write penalty 1x instead of 2x – that is a high value optimization to me. Hope this explains things a bit better!
Comments are closed.