This post will consider the implications of a full database federation as would be required by a Logical Data Warehouse. I'll build on the concepts introduced in the posts on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8).
Figure 1 summarizes those earlier concepts from simple to advanced.
But the full federation required to implement a logical data warehouse requires a significant step up from this. Simple federation will be a disaster and Basic federation will not be much better. Here is why.
Let's add a database and use Figure 2 to consider the possibilities when we submit a query that joins Table A.One to A.Two to B.One to C.One. Note that in this picture we have included a Governor, independent of any of the DBMSs, to execute the federated queries… this is the usual case for federation.
In the Simple case, where the Governor executes the entire plan, all of the data must come to the Governor. This is clearly unacceptable. Consider the worst case, where a SELECT is issued against only one table… still all of the data must bubble up to the Governor.
In the Basic case the problem is partially mitigated… less data moves once predicates are pushed down and resolved at the sources, but the overhead will still kill query performance. A Governor with Basic capabilities provides the minimal features to make this work. It is useful where slow federation is better than data replication… but that is about all.
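To make the difference concrete, here is a minimal sketch of the data-movement gap between Simple and Basic federation. The tables, row counts, and predicate are invented for illustration; a real federator ships rows over a network rather than filtering Python lists, but the accounting is the same.

```python
# Toy illustration: Simple federation pulls whole tables to the Governor,
# while Basic federation pushes predicates to the sources first.
# Table contents and sizes below are invented for the example.

# Two "remote" tables held by different provincial databases.
a_one = [{"id": i, "region": "EMEA" if i % 10 == 0 else "APAC"}
         for i in range(1000)]
b_one = [{"id": i, "amount": i * 10} for i in range(1000)]

def simple_federation(pred):
    """Simple: the Governor pulls every row, then filters and joins."""
    moved = len(a_one) + len(b_one)          # all of the data bubbles up
    left = [r for r in a_one if pred(r)]     # filtering happens at the Governor
    amounts = {r["id"]: r["amount"] for r in b_one}
    result = [(r["id"], amounts[r["id"]]) for r in left]
    return result, moved

def basic_federation(pred):
    """Basic: predicates are resolved at the sources; only matches move."""
    left = [r for r in a_one if pred(r)]            # filtered at source A
    keys = {r["id"] for r in left}
    right = [r for r in b_one if r["id"] in keys]   # filtered at source B
    moved = len(left) + len(right)                  # far fewer rows cross the wire
    amounts = {r["id"]: r["amount"] for r in right}
    result = [(r["id"], amounts[r["id"]]) for r in left]
    return result, moved

pred = lambda r: r["region"] == "EMEA"
res_simple, moved_simple = simple_federation(pred)
res_basic, moved_basic = basic_federation(pred)
assert sorted(res_simple) == sorted(res_basic)   # same answer...
print(moved_simple, moved_basic)                 # ...very different movement cost
```

The answer is identical either way; only the rows moved differ. This is why Basic federation helps… and why it still pays a per-query coordination overhead that replication does not.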
However, the Advanced case becomes seriously more complicated. The optimizer now has to decide whether table B.One should move to C to join the data, move to A, or whether the data should move to the Governor.
The problem is further complicated by any resource shortage on any node, or by any differences in functional capability. If the cost of data movement suggests moving B data to C, but there is no CPU resource available on C, then maybe a different decision should be made. If C.One is a big table, but C is a column-store and the cost of the SELECT is small because only a few columns are required and the cardinality of those columns is small enough that the data might be fetched from the dictionary, then we might make a different decision. If B is a fast in-memory database but there is no memory available, then the cost changes again. Finally, if there are twenty databases in your logical DW, then the number of plans to consider grows exponentially.
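The placement decision above can be sketched as a tiny cost model. Everything here is a hypothetical simplification… the node names, the per-row costs, and the hard CPU veto are all assumptions for illustration, not how any particular federator actually costs plans.

```python
# Hedged sketch of the Advanced-case placement decision: where should the
# optimizer execute the B.One-to-C.One join? All costs are invented.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_available: bool       # can this node take on join work right now?
    scan_cost_per_row: float  # e.g. tiny for a column-store reading few columns

def placement_cost(rows_moved: int, net_cost_per_row: float, target: Node) -> float:
    """Cost of shipping rows to `target` and joining there."""
    if not target.cpu_available:
        return float("inf")   # a resource shortage vetoes the plan outright
    return rows_moved * (net_cost_per_row + target.scan_cost_per_row)

b   = Node("B", cpu_available=True,  scan_cost_per_row=1.0)
c   = Node("C", cpu_available=False, scan_cost_per_row=0.1)  # column-store, but busy
gov = Node("Governor", cpu_available=True, scan_cost_per_row=0.5)

b_one_rows, c_one_rows = 10_000, 1_000_000   # B.One is far smaller than C.One
net = 0.01                                   # per-row network cost (assumed)

plans = {
    "move B.One to C": placement_cost(b_one_rows, net, c),
    "move C.One to B": placement_cost(c_one_rows, net, b),
    "move both to Governor": placement_cost(b_one_rows + c_one_rows, net, gov),
}
best = min(plans, key=plans.get)
print(best)
```

On data movement alone, shipping the small B.One to C would win easily… but because C has no CPU available, that plan is infeasible and the optimizer falls back to a much more expensive choice. With twenty databases, every pair of tables multiplies this search space, which is the exponential growth described above.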
The point here is clear… data federation over n tiers is a hard problem. There will be severe performance issues when the optimizer picks wrong. This is why the independent Governor model is so attractive… many of the variables around CPU resources and database capabilities are removed… and while the performance will be poor, it will be predictably poor. You should consider the implications carefully… it is just not clear that a high-performance logical data warehouse is feasible when simply laid over an existing architecture. And if you build on a model with a Governor, you must be sure that the Governor and the provincial databases can handle the load. I suspect that the Governor will have to run on a cluster and use a shared-nothing architecture to handle a true enterprise-sized logical EDW.
HANA has a twist on this that is interesting. The Governor lives inside one of the database nodes… so for data in HANA there is no data movement cost unless the optimizer decides to send the data to another node. Further, HANA is very fast… and the performance will mitigate some of the slowness inherent in federation. Finally, HANA is a shared-nothing DBMS… so it is not a problem to move lots of data to HANA in support of big tables and/or thousands of concurrent queries.
I'll try to use this framework: Simple, Basic, and Advanced federation over a governed federator on an in-memory or fast shared-nothing architecture, to evaluate the products on the market that provide federation. This may prove interesting as the Logical Data Warehouse concept catches on and as products like Teradata's QueryGrid come to market.
5 thoughts on “Logical Data Warehouses and the Basics of Database Federation”
Rob, nice post, as usual! The other thing I've noticed re HANA is that it also depends on the depth/complexity of the query, and thus the access plan. What is counter-intuitive is that if you have an access plan that reaches out and gets data at the start of one of its, say, ten steps, and the remaining nine steps do some pretty heavy manipulation, the federated query can go faster than it would on the native system. What's gained on the swings of query execution for the last nine steps more than compensates for what's lost on the slides of external data access. Of course it depends on the query plan, etc.
Obviously a difficult problem. It is questionable whether it will ever be solved with adequate performance, considering that data types need to be translated/matched at some point, that there is network latency, and that the optimizer has to evaluate (the number of possible plan combinations across heterogeneous software and hardware) × (the number of data sources) in a very short time. Also problematic would be the performance impact on source systems during business hours.
Another approach would be to start with a completely different reference architecture: Hadoop as the unifying (central hub) data layer with a fast in-memory front end (the governor, in your lingo) like Spark, or an in-memory database like SAP HANA. Data integration is currently happening via ETL anyway, and Hadoop is the de facto (or future) ETL platform of choice. That is a more pedestrian approach, as it involves data movement from the data sources to Hadoop, but more realistic, I think.
I think that the idea of migrating everything to a new combination of products is too daunting… too expensive… and carries with it the risk that the combination you pick today will not be optimal next year, when the next Google white paper and the follow-on open source products make what you picked obsolete.
A federation layer insulates your applications from the data… and the flexibility this affords lets you migrate slowly or not at all… and lets you add and remove database technology as required. In the next five years this insulation may prove more valuable than the dynamic integration benefits we normally associate with federation technology.
If I were starting from scratch… I would go the route you suggest. It is the right track for sure.
By the way, Ranko… the most insightful point in your comment is the one around the performance impact on source systems during business hours. This is why I advocate a data lake, likely built on Hadoop, that replicates source data out of the production system to remove the secondary workload from those systems. I agree with you that federation against source systems is not a workable solution until all source systems run on a database technology that can support a simultaneous OLTP and analytic query workload.
Thanks again for the comment…
There you go – a selling point for SAP HANA 🙂 It supports both OLTP and OLAP, so you can run queries across a bunch of federated HANA installations and have all enterprise data instantly available at any level of detail (with in-situ, deep drill-down descriptive analytics capability). So Mr. Plattner seems to be right with his OLTAP vision – and it doesn't have to be limited to a single database instance if a good federated query engine is available 🙂
If you need colder, really raw data, or to perform advanced/predictive analytics, then the data can go to Hadoop, but that implies copying data… the old way of doing things. Federated query is definitely the more elegant approach if they can make it work properly.