Database Supercomputing in the Cloud

This post will combine the information in the last two posts to show how the cloud might be used as a database supercomputer at no extra cost. 

To quickly recap: in the post here, you saw a simple model that shows how a cluster dedicated to a single query uses the on-demand resources of cloud computing to provide a 10X-100X performance boost over a multitasking cluster running many queries at once. Note that this model did not account for the start-up time of spinning up a new cluster instance for each query, but we assume that there is still a significant saving for queries that run for several minutes. Further, we believe that queries can be queued and run one at a time on clusters to reduce the start-up costs whenever there is work queued. 

The next post here reminded us of the cost of populating cache from DRAM whenever there is a multitask context switch from one query to the next. You also saw how supercomputing SIMD instructions further accelerate processing. Finally, we discussed how modern columnar databases store data as vectors in a way that allows compressed columnar data to be loaded directly into cache to optimize the use of that memory. 

One last note: all the cores share the L3 cache in a chip. On a 4-CPU processor, this means that you could run 4 parallel processes, all on compressed vectorized data. 

So how might the cloud help here? First, if we dedicate a cluster to query processing where “dedicate” means that the hardware is dedicated to a single queue of queries, then once the query data is loaded in memory and the cache is primed, the performance can be 1000X the speed of an alternative configuration where the hardware is shared both by a multitasking DBMS and that DBMS is sharing the cluster with other cloud workloads dispatched by the cloud operating system. Better still, since in the cloud using fine-grained billing, you pay only for what you use, so this extra performance is nearly free. 

When SAP HANA was introduced, SAP published amazing 1000X benchmarks using in-memory supercomputing instructions. Unfortunately, the technology did not quite exist to deploy queries on-demand on tens or hundreds or thousands of dedicated cloud computers. I imagine that we are not too far from this today. 

The bottom line is that multitasking has never been efficient. When you swap context in and out of cache, you are just wasting resources. It made sense when computing was expensive, and you swapped context to take advantage of the time it took to perform I/O to disk or tape. For analytic workloads where the data is columnar and vectorized, you can fit multiple terabytes of compressed data in memory and stream it to supercomputing instructions quickly.  

I imagine a time when simple, very high performance, single-threaded database processing units, similar in architecture to GPUs, will handle queries in a dedicated manner, fetching vectorized data from products like Apache Arrow, and 1000X speed-up at a very low cost will be the standard. 

Numbers Every Database Professional Should Know

Latency Estimates in a Processor

In this post I will setup the next post by reminding you of these numbers every programmer should know. The picture shows the latency to access data across the three levels of cache in a modern processor, and across the memory bus to DRAM, and then across to an SSD or spinning disk drive. 

These numbers are the key to in-memory database performance. L1 cache is expensive and small. All data and all instructions are fed to the core through the L1 cache and typically the instruction and the data it will operate on are each fetched independently. If the data (or instruction… but I will just talk about the data from here on out) is not found in the small L1 cache the L2 cache is searched. Likewise, if the data is not in the L2 cache the L3 cache is searched and then DRAM. As you can see there is a 200X performance difference between a fetch from L1 cache and a fetch from DRAM. If the data must be fetched from SSD or disk, there is a 1000X penalty. 

Databases work to mitigate the 1000X penalty by pre-fetching data into DRAM. Modern processors mitigate the penalty between cache and DRAM by trying to anticipate the data required and pre-fetching it into L3 cache. This is a gamble and the likelihood of winning the pre-fetch bet varies with your DBMS and with the workload. More simultaneous queries against many tables puts pressure on the memory and lessens the effectiveness of data pre-fetch. 

Databases also mitigate the 1000X penalty by compressing data so that each fetch of a block from disk captures more data. For tabular (not columnar) data each block must be decompressed before it can be used. Decompression steals CPU cycles from query processing but mitigates some of the cost of I/O to disk. Columnar databases fetch compressed data and, if the databases use modern vector processing techniques, they operate directly on the compressed vector data without decompression. This is especially powerful as these vectorized columnar databases can push compressed data into cache and use that memory more effectively. 

Vectorized data can use the supercomputing instruction set that includes SIMD instructions. SIMD stands for Single Instruction; Multiple Data and these instructions fetch a single instruction from L1 cache and reuse it on the vectorized data over and over with pausing to fetch new instructions. Note that in a modern processor waiting for instructions or data to be fetched from cache or DRAM appears as CPU busy time. When the processor is stalled waiting for data to be fetched from disk the query/job gives up the CPU and dispatches a new query/job. 

Finally, for the longer-running queries associated with analytics and business intelligence (where long-running means a few seconds or more… “long” in CPU instruction time), it is highly likely that the L1 and L2 cache in each core will be flushed and new data will be fetched. It is even likely that all the data in L3 cache will be flushed. In this case most of the goodness associated with compression and vectorization is flushed with it. The vectors must be fetched again from L3 and DRAM. Further, note that when you run in a virtual machine or in the cloud you may think that you are the only user but other virtual machines other containers, or other serverless processes may be flushing you out of memory all the time. 

This is important. Until we swapped out a query for a new one it looked like in-memory vector processing might be 200X-1000X faster than a tabular query; but in an environment where tens or hundreds of queries are running concurrently the pressure on memory keeps flushing the cache and the advantage of vectorized query processing is reduced. Reduced, not eliminated. We would still expect the advantages of compressed data in cache and SIMD supercomputing instructions to provide much more than a 10X speedup. 

As mentioned, this post is a detailed review of material I covered years ago when HANA was introduced. In the next post I will add some new thinking. 

Here is more information on processor architecture and cache usage.

DW Cloud Economics – The Model

This post describe the model used to generate the content from the DW Cloud Economics – A Do Over post incase you want to check the numbers…

First, keep in mind that the numbers in this model are relative and indicative, not absolute. Here are the fixed parameters I used to generate these indicative numbers:

Cost per server per hour: $4.00 This could be any number or any currency and it would not matter.

Runtime per job with no contention: 3 hours This could be any duration and it would not change the relative results. If the number was well below an hour, it would increase the hour-based billing costs.

Jobs per Workload: 6 Again any number works.

Baseline Configuration: 24 Servers

Contention between 6 jobs on 24 Servers: 50% This sets the average amount of time a job will wait for resources. So, a job that runs for 3 hours with no contention will run in 4.5 hours with 50% contention. As the number of servers goes up the contention comes down for a fixed workload.

Baseline Performance: 2000 A completely arbitrary number that allows the model to calculate a relative price/performance. Relative “Performance” is calculated by dividing the Baseline Performance by the actual runtime. This means that a workload that takes 5 hours will yield a performance of 400 and a workload that take 2 hours will yield a performance number of 1000.

There are also a series of calculated values:

Total Servers / Workload is equal to Servers/Job for the Multi-tasking and Serial Execution configurations. It is equal to Servers/Job * Number of Jobs for the dedicated server per job configuration.

Scaleup is the ratio between the number of Total Servers / Workload and the Baseline Configuration. It is used to scale down the runtime and the contention.

Runtime / Job is the number of hours per job with no contention scaled down based on the Scaleup plus the time the Job is waiting due to contention. Contention scales down as the number of servers scales up.

Actual Runtime / Workload is Runtime / Job plus contention. Note that in the Serial Configuration and the Dedicated Server configuration there is no contention.

Paid Runtime / Workload is the Actual Runtime / Workload rounded to the nearest hour for the hourly billing table and equal to the Actual Runtime / Workload for the billing per minute table.

Cost / Workload is the Cost per Server per Hour parameter times the Paid Runtime / Hour value times the Total Servers / Workload value.

Price / Performance is the Cost / Workload divided by the Actual Runtime / Workload times the Baseline Performance parameter divided by the Cost / Workload value as noted above.

DB Cloud Economics – A Do Over

Here is a better model that demonstrates how cloudy databases can provide dramatic performance increases without increasing costs.

Mea culpa… I was going to extend the year-old series of posts on the economics of cloud computing and recognized that I poorly described the simple model I used, and worse, made some mistakes in the model. It was very lame. I’m so sorry… That said, I hope that you will like the extension to the model and the new conclusions.

I’ll get to those conclusions in a few short paragraphs and add a new post to explain the model.

If we imagine a workload consisting of 6 “jobs” that would execute on a dedicated cluster of 24 servers in 3 hours each with no contention, we can build a simple model to determine the price performance of the workload for three different configurations:

  1. M1: The usual configuration where all six jobs run together and share the configuration, multitasking and contending for the resources. As the configuration scales up the cost goes up, the runtime goes down, and the contention drops.
  2. M2: A silly model where each job runs serially on the configuration with no contention. As the configuration scales up the costs go up, the runtime goes down.
  3. M3: A model where each job runs independently, simultaneously, on its own cluster with no contention.

If we pay for servers by the hour the model shows that there is an opportunity to deploy many more servers, as many as one per “job”, to reduce the runtime and the cost. Here are the results:

ConfigServers DeployedActual Runtime / Workload (Hours)Paid Runtime / Workload
(Hours)
Cost / WorkloadPrice / Performance
Multitasking244.55$480444
 483.84$768533
 963.44$1,536593
Serial Execution2418.018$1,728111
 489.09$1,728222
 964.54.5$1,920444
Dedicated Server per Job1440.51$5764000
 2880.31$1,1528000
 5760.11$2,30416000
Table 1. Costs with Hourly Billing

Note that the price/performance metric is relative across the models. You can see the dramatic performance and price/performance increases using a configuration where each unit of work gets a dedicated set of resources. But the more dramatic picture is exposed in Table 2.

ConfigurationServers DeployedActual Runtime / Workload (Hours)Paid Runtime / Workload (Hours)Cost / WorkloadPrice / Performance
Multitasking244.54.5$432444
 483.83.8$720533
 963.43.4$1,296593
Serial Execution2418.018.0$1,728111
 489.099.0$1,728222
 964.54.5$1,728444
Dedicated Server per Job1440.50.5$2884000
 2880.30.3$2888000
 5760.10.1$28816000
Table 2. Costs with per Minute Billing

Here the granular billing reduces the total cost with the same dramatic price and price/performance increases. This was the point I was trying to make in my earlier posts. Note that this point, that fine-grained billing can be used to significantly reduce costs, is why deploying work in containers, or better still as serverless transactions, is so cost-effective. It is why deploying virtual machines in the cloud misses the real cost savings. In other words, it is why building cloud-native implementations is so important and why cloud-native databases will quickly overcome databases that cannot get there.

I also was trying to show that work deployed across these configurations, what I call a “job”, can be ETL jobs or single queries or a set of queries or a Spark job. If you have lots of smaller work, it may be best to run them in a multitasking configuration to avoid the cost of tearing down and starting up new configurations. But even here there is a point where the tear down cost can be mitigated across multiple semi-dedicated configurations.

What is a Cloud-native Database?

Before this series is complete, I plan on defining in some detail what the various levels of Cloud-nativeness might be to allow readers to classify products based on architecture, not marketing. In this post, I’ll lay out some general concepts. First, let’s be real about cloud-things that are not cloud-native.

Any database that runs on physical hardware can run on virtual machines and, therefore, can run on virtual machines in the cloud. Databases with no cloud capabilities other than the ability to run on a VM on a cloud-provider are not cloud-native. Worse, there are lots of anecdotal stories that suggest that there are no meaningful savings to be had from moving a database from an on-premise server or VM to a cloud VM with no other change to take advantage of cloud elasticity.

So here are two general definitions for your consideration:

1) A cloud-native database will have one or more features that utilize capabilities found only on a cloud-computing platform, and

2) A cloud-native database will demonstrate economic benefits derived from those cloud-specific features.

Note that the way you pay for services via capital expenses (CapEx) or as operating expenses (OpEx) does not provide economic benefit. If the monetary costs of a subscription are more-or-less equal to the financial costs of a license, then savings are tied to tax law, not to economics. Beware of cloudy subscriptions that change how you pay without clearly adding beneficial cloudy features. It is these subscriptions that often are the source of the no-savings anecdotes mentioned above.

This next point is about the separation of storage from compute. Companies have long ago disconnected their databases from just-a-bunch-of-disks (JBOD) to shared storage such as SAN or S3. Any database today can use shared storage. It is not useful to say that any database that can use shared storage has separated storage from compute. Using the idea that there must be features, not marketing, that allow compute to scale separately and more-or-less dynamically from storage as the definition, we will be able to move forward in this area.

So:

3) For storage to be appropriately separated from Compute, it must be possible to scale Compute up and down dynamically.

Next, when compute scales, it scales at different granularity. An application or database that automatically adds and subtracts virtual machines provides different economics than a database that scales using containers. Apps that add and subtract containers have different economics than applications that use so-called serverless containers to scale. In this dimension, we will try to characterize granularity to account for the associated cloud economics. This topic will be covered in more detail later, so I’ll save the rule for that post.

Note that it is possible for database vendors to develop a granular architecture and to use the associated economics to their advantage. They may charge you for the time when any part of your database is running but be billed by their provider for smaller chunks. This is not an issue unless their overall costs become uncompetitive.

Last, and in some ways least, different products may charge for time in smaller or larger chunks. You might be charged by the hour, by the minute, by the second, or in smaller increments. Think about the scaling economics I suggested in the first posts of this thread. If you are charged by the hour, then there is no financial incentive to scale up to finish jobs to the minute. You will be charged the same for ten minutes or fifty minutes when you are charged by the hour. The rule:

4) Cloud databases that charge in smaller time increments are more economical than those that charge in larger increments.

The rationale here is probably obvious, but I’ll cover it in-depth in a later post.

With these concepts in place, we can discuss how architectural changes affect each aspect of the economics of databases in the cloud.

Note that this last sentence was written assuming that a British computer scientist with an erudite accent would speak it when they create the PBS series from these posts. Not.

By the way, a few posts from now, I am going to go back to some ideas I shared five years ago around the relationship between database processing and the underlying hardware platform. I’ll update this thinking with cloud computing in mind. You can find this thinking here which originally came from Jeff Dean and Peter Norvig (displayed in lots of places but here is one).

A Segue from ETL to DB

This is a short post to segue to point where I’ve been headed all along. Figure 1 recasts the picture from the last post, showing storage separated from compute from ETL/ELT to a data warehouse. It should be a familiar picture to Snowflake architects who may have implemented multiple DW instances against a single storage layer.

Decoupled Multi Instance DW
Figure 1. Multiple DW Compute Instances Decoupled from Shared Storage

I’ll not give away the next article, other than to say that it derives from the same concepts just discussed.

Since this is so short, I will add a tangent just-for-fun.

Here is a post from seven years ago that anticipates how the cloud impacts DW performance. When you combine this with the economics presented in the last two posts (here and here), suggesting that performance is free, you can begin to see why database tuning is no longer an urgent requirement for a data warehouse.

When you tune, you specialize for a particular workload, and if your workload changes, the tuning wears thin. In other words, I now believe that you should build a robust data warehouse with minimal tuning and use cloud compute to get performance. No tuning lets you add a new workload without adjusting. Tuning makes your database fragile in the face of change.

More on Cloud Data Elasticity

The last post (here) demonstrated how scalability in the cloud provides the ability to reduce runtimes from days or hours to minutes without raising the cost. We used a batch ETL service running three ETL scripts as an example. We then showed how the same use of scalability could allow us to break the batch ETL service into discrete jobs and remove contention between the three scripts to further improve throughput and reduce costs.

There is a great deal more to say about how the cloud changes our thinking about applying resources to our big data workloads. I promised to get to a discussion of database work, but that will have to wait another week. Sorry. Let’s carry on.

In the scenario where we ran the three jobs together, we assumed that all three ran in the same amount of time and so we shut down the ETL service when all three scripts completed. We shut down the servers when they were all complete to stop billing at the $1152 price point. It is more likely that each script would take more-or-less time than another.

When we run each script in its own set of servers, we can stop billing as each job ends. If one of the jobs takes only 2.5 hours to complete, and your cloud provider will allow you to bill minute, not hour increments, then the cost of that single job drops from $288 to $240. Over a year, these savings add up, and you can ask for a raise.

So, we have added a third point: by using scalability, and by scheduling discrete workloads on dedicated cloud servers, you can scale performance and significantly reduce costs.

You have no doubt noticed that the scenarios describe scalability without any impact on the data. We assume that compute can scale independently of storage, and re-sharding of the data is not required. Every database has special sauce for managing data across storage. All we assume here is that the file scan of data, be they rows or columns, occurs in compute nodes after the data is read from storage. In a future post, we will see how modern storage systems impact this assumption without changing the economics benefits described here. Figure 1 is a classic, really too simple, depiction of this separation.

Simple Storage Separate from Compute
Figure 1. Compute Separate from Storage

The database logic here is straightforward. When compute and storage are tied, the query planner knows precisely and in advance how many parallel nodes are in play, and the system can spread data across those nodes in every step that requires data distribution. The number of nodes is fixed for both storage (processing IO) and compute. Figure 2 shows a system with storage and logic connected.

Couple Storage
Figure 2. Classic Shared-nothing Connected Storage and Compute

In a system with storage and compute separated, the query planner has to ask how many nodes are available for data distribution. I am using the term “database” here, but any parallel data processing system that shards data with a processing plan has some form of this logic.

In the ETL scenarios, data flows from the storage layer into a compute layer dynamically allocated at the start of the workload. The planner learns the configuration when the system starts up.

Shared Nodes
Figure 3. Multi-processing Compute Nodes

Figure 3 depicts the case where all three ETL scripts run together on a six-node cluster. Figure 4 shows each ETL script running on a dedicated cluster. Note that, just-for-fun, I have adjusted the configurations in Figure 4 to show that in the dedicated system case, it is possible to size systems differently if there is an advantage to do so. In a later post, I’ll discuss why this could be important (as right now, it may seem that in every case, you would want 10,000 severs to complete every job in 10 seconds).

Dedicated Compute Nodes
Figure 4. Dedicated Compute Nodes with Separated Storage

A couple of closing remarks. First, I cannot imagine why anyone would not run ETL scripts on a scalable cloud platform. The ability to scale up to reduce runtimes at no extra cost is remarkable (and I’m not sure that I have ever used that word in one of my blogs). Next, I cannot see why anyone who could run ETL in the cloud would not run each script in a dedicated cloudy configuration. If the issue is that your ETL product is not cloud-native does not separate from compute, then get a product that does (or use a cloud database and ELT).

Finally, here is a post from five years ago that anticipates the separation of storage from compute.

Next time: I’ll take this down a notch to talk about workloads smaller than an ETL batch job and consider how to run big data queries in the cloud.

Cloud-native Computing, Workloads, and Elasticity

Over the next several weeks, I’ll share my perspective of current best practices for big data, which is the term I’ll use to blend thinking about analytic data systems: data lakes, data warehouses, data marts operational data stores. On this journey, I’ll consider how analytic workloads are changing with AI and machine learning, discuss data architecture and virtual database technology, preview new hardware technologies (memory and processor), and, importantly, review the implications of cloud computing on the kit and kaboodle.

In this post, I need to start laying a foundation for discussing the cloud. We will see how scalable cloud-computing makes performance “free,” and then we will see how dedicated resources increase efficiency and further reduce costs. The next post builds on these concepts to describe where cloud database products will evolve.

To start, let’s describe a workload that executes three ETL scripts and consider the result when the three scripts run as separate workloads. Imagine an ETL batch job. Batch jobs are insulated. They read data from one or more source systems, perform a series of integration steps as programs, and then load the results using a data load utility into a lake, warehouse, or mart. By “insulated,” I mean that the compute resources required for the integration steps do not need to interact with other systems.

If you had a dedicated server to run a single ETL script, as long as it could read the raw data from the source and the reference data required for integration, there would be no need for connectivity to other systems. With all of the data and the ETL scripts in hand, the process could execute stand-alone. If you needed to run two ETL scripts at the same time, you could deploy the software on two distinct sets of servers; and three scripts could execute on three individual servers or server clusters. As long as you replicate the ETL software and the required data each time, there would be no issue with any number of distinct ETL systems.

In a cloud environment, you could easily spin up three distinct clusters, run the scripts, and spin them back down, paying for only what you use. The ability to dynamically acquire resources and release them in the cloud is called “elasticity.” It is a characteristic of cloud-native applications and not a characteristic of any application running in the cloud. That is, if you design your ETL software to be self-contained and deploy it using a cloud operating system that manages resources, you can take advantage of cloud elasticity. Tools like Docker containers and Kubernetes make this possible.

To continue, imagine that the three ETL jobs run against large datasets overnight, and they each take three hours to complete on a dedicated cluster of twenty-four servers. If all three jobs run simultaneously, all three jobs complete in twelve hours. This estimate assumes that 25% of the time, the three tasks are competing for the compute resources of the server. If the jobs are CPU-bound, this would be an optimistic assumption, and the runtime might be longer.

The scripts run overnight to gain access to dedicated resources. During the day, the cluster runs queries, and contention between the batch scripts and the queries for CPU is hard to manage. Finally, let’s imagine that the cost of these twenty-four servers in the cloud is $4 per server per hour with software or $1152 per day to run the three ETL scripts, not counting storage server costs ($4/server per hour times 24 servers times 12 hours equals $1152).

If our ETL programs are scalable, we could spin up twice as many servers and complete the jobs in 6 hours. Note that the cost is still $1152 ($4/server * 48 * 6 = $1152) and we could double it again to complete the job in 3 hours at the same price ($4 * 96 * 3 = $1152). This math continues as far as you would like to go as long as your cloud provider will let you pay in ever-smaller increments.

This example makes the first important point: if you have self-contained and scalable workloads, you can scale up in the cloud to reduce runtimes at no extra cost.

Now let’s consider what happens if we run each script on a separate cluster. With dedicated servers, each job takes three hours to complete, and the cost per job is $4/server times 24 servers * 3 hours or $288. If we spin up 72 servers and run each script as a separate process, all three complete in three hours for $864. The savings are the result of removing the contention between the three jobs and giving each job dedicated resources.

Even though it may seem obvious, we are so used to sharing computers that we forget that contention is wasteful. Whether we contend for a disk drive to read or write, for memory, for CPU (L3, L2, and L1) cache, for instruction fetch or instruction execution, the cost of managing contention adds inefficiencies. More on that in a couple of posts, I want to talk about how databases can reduce contention, how processor technology helps, and especially how technology like Intel Optane may play a role in the future.

Let me wrap up with a couple of caveats regarding this made-up scenario.

First, if the scripts are IO-bound, not CPU-bound, then they may execute together with less contention. ETL programs that stream data between steps will be CPU-bound as they do not perform IO to spool intermediate results. The contention will still be there, and the cost will reflect this. If the jobs are more completely CPU-bound when the scripts run in-memory, then the contention will be more significant, and the cost difference will be higher.

Second, there are startup costs associated with cloud clusters. Spinning up a machine will take several minutes, and if there are more servers, then there will be more cost associated with the startup. We will consider this more in the next post.

So far, we have made two essential points:
If we have a scalable system and a self-contained workload, then we can deploy cloud compute at scale to reduce runtimes at no extra cost. There is no reason to ever suffer through long-running batch jobs.
If we have multiple units-of-work running, where in the past, we might run them concurrently and allow the workload to compete for a finite number of computers, with cloud computing, we can provide each workload with discrete resources and scale at a reduced cost.

In the next post, we will discuss smaller units-of-work in a database. With this foundation, we will then be able to talk about the power provided by products like Snowflake, and we will be able to show a path for cloud databases to become even more efficient.

Back in the saddle

In 2015, President Obama appointed me the CIO of the Social Security Administration, and I had to stop posting here. The products and vendors I would have liked to review were often installed at the SSA, and unfortunately, it was not appropriate for me to comment. Now I’m at a place without constraints, so I’ll try my hand again.

I could spend the next year posting stories of my experience at the SSA. It was a fantastic time. I think that I left an impression on the IT programs there. It may be that I even left a good impression.

I will mention two things for fun. First, as CIO, I was in charge of a $1.2B/year IT budget. Who knew that a geeky guy could perform at that level? Not me. Second, in case you think $1.2B on IT per year is an example of wasteful government spending, consider that the operations at the SSA are directly responsible for payments that represent 5% of the GDP of the US. $1.2B for IT that drives 5% of $20,544B is a bargain.

A lot happened during my time away from the Database Fog Blog, and my stint as a CIO provided me with a new perspective. I implemented a data lake, a data warehouse, and utilized the cloud in a very modern manner. I applied agile methods and DevOps effectively. I imagined a unique application of machine learning and prototyped it. I worked to make this all happen in a very large organization. These experiences are the fodder for what I’ll write.

Since I implemented cloud-native apps, I have a lot to share on how cloud computing will impact data warehousing (hint: Snowflake is just a significant first step). You may be surprised at where the cloud could take us. There are significant technologies, besides Snowflake, that will impact how we might architect analytic data going forward.

The first topic (this preface, does not count) considers workload granularity and cloud-native computing. I hope that you will find it interesting and different.

It is great to be back.

A New Way of Thinking About EDW Federation

There is a new way to think about data warehouse architecture. The Gartner Group calls it a logical data warehouse and it uses database federation to dynamically integrate a universe of data warehouses, operational data stores, and data marts into a single, united, structure. This blog has suggested that there is a special case of the logical data warehouse that uses Hadoop to provide a modern data warehouse architecture with significant economic advantages (see here, here, and here).


This is the second post inspired by my chats with Bityota… and sort of, but not altogether, commercial in nature (the first post is here). That is, Bityota will use these posts in their collateral… but you won’t see foam about their products in the narrative below.

– Rob


The economics are driven by the ability, through database federation, to place tables on a less expensive database platform. In short, the aim is to place data on the least expensive platform that still provides enough performance to satisfy service level agreements (SLAs). In the case of the modern data warehouse architecture this means placing older, colder, data on Hadoop where the costs may be $1000/TB instead of having all of the data in a single data warehouse platform at a parallel RDBMS price point of $35,000/TB. Federation allows these two layers to seem as one to any program or end-user.

This approach may be thought of as a data life cycle infrastructure that has significantly more economic power that the hardware based life cycles suggested by database vendors to date. Let’s consider some of the trade-offs that define the hardware approach and the Hadoop-based approach.

The power of Hadoop federation comes from its ability to manage data placement at a macro level. Here, data is placed appropriately into a separate database management system running on differentiated hardware so that the economics of the entire infrastructure: hardware, software, and network can be optimized. It is even possible to add a third or fourth level to provide more fine-grained economic optimization. But this approach come at a cost. Each separate database optimizes queries at the database level. Despite advances in federation software technology this split optimization cannot optimize many queries. The optimization is not poor, but it is not optimal optimization. The global optimization provided by a single DBMS will almost always out-perform federated optimization.

The temperature-based optimization touted by some warehouse vendors provides good global optimization. To date a single DBMS must run on a homogeneous hardware platform with a single price point. Queries run optimally but the optimization can only twiddle around hardware details placing data in memory or on an SSD device or on the faster portion of a disk platter.

Figure 7. Federated Elastic Shared-nothing IaaS
Figure 7. Federated Elastic Shared-nothing IaaS

To eliminate this unfortunate trade-off: good optimization over minor hardware capabilities or fair optimization over the complete hardware eco-system we need a single DBMS that can run queries over a heterogeneous mix of hardware. We need a single database management system, with global query optimization, that can execute queries over multiple layers of hardware deployed in the cloud. We can easily imagine a multi-layered data warehouse with queries federated over several AWS offerings with hotter data on fast nodes that are always available, with warm data on less expensive nodes that are always available, and with cold historical data on inexpensive nodes that come online in processing windows so that you pay for the nodes only when you need them. Figure 7 shows a modern data warehouse deployed across an Amazon cloud.

This different way of thinking about a logical data warehouse is exciting… and a great example of how cloud computing may change everything in the database and data warehouse space.