There is a new way to think about data warehouse architecture. The Gartner Group calls it a logical data warehouse and it uses database federation to dynamically integrate a universe of data warehouses, operational data stores, and data marts into a single, united, structure. This blog has suggested that there is a special case of the logical data warehouse that uses Hadoop to provide a modern data warehouse architecture with significant economic advantages (see here, here, and here).
This is the second post inspired by my chats with Bityota… and sort of, but not altogether, commercial in nature (the first post is here). That is, Bityota will use these posts in their collateral… but you won’t see foam about their products in the narrative below.
The economics are driven by the ability, through database federation, to place tables on a less expensive database platform. In short, the aim is to place data on the least expensive platform that still provides enough performance to satisfy service level agreements (SLAs). In the case of the modern data warehouse architecture this means placing older, colder, data on Hadoop where the costs may be $1000/TB instead of having all of the data in a single data warehouse platform at a parallel RDBMS price point of $35,000/TB. Federation allows these two layers to seem as one to any program or end-user.
This approach may be thought of as a data life cycle infrastructure that has significantly more economic power that the hardware based life cycles suggested by database vendors to date. Let’s consider some of the trade-offs that define the hardware approach and the Hadoop-based approach.
The power of Hadoop federation comes from its ability to manage data placement at a macro level. Here, data is placed appropriately into a separate database management system running on differentiated hardware so that the economics of the entire infrastructure: hardware, software, and network can be optimized. It is even possible to add a third or fourth level to provide more fine-grained economic optimization. But this approach come at a cost. Each separate database optimizes queries at the database level. Despite advances in federation software technology this split optimization cannot optimize many queries. The optimization is not poor, but it is not optimal optimization. The global optimization provided by a single DBMS will almost always out-perform federated optimization.
The temperature-based optimization touted by some warehouse vendors provides good global optimization. To date a single DBMS must run on a homogeneous hardware platform with a single price point. Queries run optimally but the optimization can only twiddle around hardware details placing data in memory or on an SSD device or on the faster portion of a disk platter.
To eliminate this unfortunate trade-off: good optimization over minor hardware capabilities or fair optimization over the complete hardware eco-system we need a single DBMS that can run queries over a heterogeneous mix of hardware. We need a single database management system, with global query optimization, that can execute queries over multiple layers of hardware deployed in the cloud. We can easily imagine a multi-layered data warehouse with queries federated over several AWS offerings with hotter data on fast nodes that are always available, with warm data on less expensive nodes that are always available, and with cold historical data on inexpensive nodes that come online in processing windows so that you pay for the nodes only when you need them. Figure 7 shows a modern data warehouse deployed across an Amazon cloud.
This different way of thinking about a logical data warehouse is exciting… and a great example of how cloud computing may change everything in the database and data warehouse space.