A New Way of Thinking About EDW Federation

There is a new way to think about data warehouse architecture. The Gartner Group calls it a logical data warehouse and it uses database federation to dynamically integrate a universe of data warehouses, operational data stores, and data marts into a single, united, structure. This blog has suggested that there is a special case of the logical data warehouse that uses Hadoop to provide a modern data warehouse architecture with significant economic advantages (see here, here, and here).

This is the second post inspired by my chats with Bityota… and sort of, but not altogether, commercial in nature (the first post is here). That is, Bityota will use these posts in their collateral… but you won’t see foam about their products in the narrative below.

– Rob

The economics are driven by the ability, through database federation, to place tables on a less expensive database platform. In short, the aim is to place data on the least expensive platform that still provides enough performance to satisfy service level agreements (SLAs). In the case of the modern data warehouse architecture this means placing older, colder, data on Hadoop where the costs may be $1000/TB instead of having all of the data in a single data warehouse platform at a parallel RDBMS price point of $35,000/TB. Federation allows these two layers to seem as one to any program or end-user.

This approach may be thought of as a data life cycle infrastructure that has significantly more economic power that the hardware based life cycles suggested by database vendors to date. Let’s consider some of the trade-offs that define the hardware approach and the Hadoop-based approach.

The power of Hadoop federation comes from its ability to manage data placement at a macro level. Here, data is placed appropriately into a separate database management system running on differentiated hardware so that the economics of the entire infrastructure: hardware, software, and network can be optimized. It is even possible to add a third or fourth level to provide more fine-grained economic optimization. But this approach come at a cost. Each separate database optimizes queries at the database level. Despite advances in federation software technology this split optimization cannot optimize many queries. The optimization is not poor, but it is not optimal optimization. The global optimization provided by a single DBMS will almost always out-perform federated optimization.

The temperature-based optimization touted by some warehouse vendors provides good global optimization. To date a single DBMS must run on a homogeneous hardware platform with a single price point. Queries run optimally but the optimization can only twiddle around hardware details placing data in memory or on an SSD device or on the faster portion of a disk platter.

Figure 7. Federated Elastic Shared-nothing IaaS

Figure 7. Federated Elastic Shared-nothing IaaS

To eliminate this unfortunate trade-off: good optimization over minor hardware capabilities or fair optimization over the complete hardware eco-system we need a single DBMS that can run queries over a heterogeneous mix of hardware. We need a single database management system, with global query optimization, that can execute queries over multiple layers of hardware deployed in the cloud. We can easily imagine a multi-layered data warehouse with queries federated over several AWS offerings with hotter data on fast nodes that are always available, with warm data on less expensive nodes that are always available, and with cold historical data on inexpensive nodes that come online in processing windows so that you pay for the nodes only when you need them. Figure 7 shows a modern data warehouse deployed across an Amazon cloud.

This different way of thinking about a logical data warehouse is exciting… and a great example of how cloud computing may change everything in the database and data warehouse space.

2 thoughts on “A New Way of Thinking About EDW Federation

  1. Hi Rob, you’ve nailed the main issue for me with the statement: “The global optimization provided by a single DBMS will almost always out-perform federated optimization.”.

    Query optimisation (spelled the UK way!) is both far harder – especially when distributed/federated/logical architectures are in play – and far more important than a lot of folks realise. A poor query plan can undo a lot of very fine architecture/engineering in a heartbeat.

    Whilst at Dataupia we spent a *lot* of engineering dollars to turn Oracle SQL queries as submitted by the user into efficient parallel PostgreSQL queries that would execute quickly on the underlying Dataupia MPP stack. The MPP speed-up compared to Oracle was great…when it worked. A poor plan could result in worse performance than simply running on a standard Oracle + SMP + SAN/NAS stack.

    Query optimisation across a non-standard technology landscape is where the brains trust need to focus to make federation a sensible reality.


  2. Pingback: Get month | Nee's Programming Blog

Comments are closed.