A Modern Data Warehouse Architecture: Part 3 – Build an EDW Annex

In the first two post of this series (here and here) I first suggested that Hadoop could be effectively used as the platform for staging and then suggested that a modern warehouse would have a federation layer that turned it into a logical data warehouse. Figure 3 depicts this extended architecture.

Figure 3. A Logical EDW

But if we have both Hadoop and a federation layer implemented… and we recognize the economics associated with moving data to Hadoop… Hadoop provides a 5X-50X price advantage over a commercial very large DBMS product… and we can move data from the expensive environment to the low-cost environment without impacting any applications… then we have the opportunity to move governed EDW data to Hadoop and place it into a Hadoop EDW Annex. Figure 4 shows this.

An EDW Annex

Now you might suggest that there is an impact… Hadoop will be significantly slower than a commercial EDW platform (for now…). But experienced EDW architects understand that in the classic architecture we had to co-locate data in a single database to join the data. So, we put all of the data, hot and cold data, in our EDW even though the service levels required for queries that touch old cold historical data did not justify the power and price of the EDW infrastructure. We had to but did not need to. We knew, if only implicitly, that most EDW queries touch a small subset of the data. Following the ratio suggested by Teradata (see the reference here) that 90% of the queries touch only 20% of the data we can imagine a system where 80% of the data resides in Hadoop to service 10% of the queries… and only that 10% experiences Hadoop performance.

I suggested this approach for Teradata here… but an architecture with an EDW Annex to store cleansed governed historical data works for any expensive RDBMS that can federate with Hadoop: Exadata, Netezza, Teradata, or HANA.

This concludes this series… sort of. I’ll post soon to express more about how this architecture provides long-term strategic value. I think that these three concepts: Hadoop as an EDW staging area, federation and logical data warehousing, and Hadoop as an EDW Annex; provide the foundation for a modern EDW… and I imagine that over the next several years this will become the reference architecture most of us will build to.