After years of tuning data warehouses, queries, data loads, and BI applications, I give up. In the long run it is not really possible anyway… and better still… no longer necessary. A better approach is to build your database and your hardware infrastructure to scan fast and smart. So here’s a blog on why it’s impossible to tune a warehouse… and on why it’s no longer necessary.
My argument against tuning is easy to grasp. By definition a data warehouse serves many constituencies: Marketing and Finance and Customer Support and Distribution; and these business units will each access the data from their unique perspective following a unique path through the warehouse. A designer cannot lay out the data effectively to support each access path… cannot index every column, cannot map more than one zone, cannot replicate the data again and again with aggregates and materialized views, cannot cache the entire warehouse. Even if you get it right changing business requirements will fracture your approach; or worse, the design will not support new queries and constrain your business.
Many readers will be skeptical at this point… suggesting that the software and hardware to eliminate tuning does not exist. So let’s build a model and test the state of the art.
Let us imagine and model a 25TB data warehouse with a 20TB fact table that holds 25 months of daily facts partitioned by day. The fact table is 100 columns wide and we will model two queries that reference 20 of the columns… One that touches every row and one that is date constrained and touches only 14 days of data.
Here are some hardware specs. A server with a single I/O controller can read about 1.5GB/second into the database. With two controllers can read around 2.7GB/sec. Note that these are not the theoretical limits of the hardware but real measurements taken from the current hardware on the market: Dell, HP, and SUN/Oracle.
Now let’s deploy our imaginary warehouse on a strong state of the market multi-core server with, to be conservative, a single controller. This server would scan our fact table in around 222 minutes. Partition elimination would allow the date constrained query to complete in just over 4 minutes. Note that these imaginary queries ignore the effort to join and/or aggregate data. Later I’ll have more to say on this…
If we deploy our warehouse on a shared-nothing cluster with 20 nodes the aggregate I/O bandwidth increases to 30GB/sec and the execution times for our two queries improves to 11 minutes and 12 seconds, respectively. This is the power of parallel I/O.
Now we have to factor in compression. Typical row-based compression yields approximately a 2.5x result… columnar compression varies wildly… But let’s assume 25X in our model. There is a cost to be paid to decompress the data… But since it is paid by everyone and CPU is a relatively inexpensive commodity, we’ll ignore it in our model.
For 2.5X row-based compression our big query now completes in 4.4 minutes and the smaller query completes in 4.8 seconds.
The model is a little more complicated when we throw in columnar compression so let’s consider two columnar models. For an implementation such as Exadata we get the benefit of columnar compression but not the benefit of columnar projection. 25X hybrid columnar compression will execute our two scans in 26 seconds and .5 seconds. Now we are talking! A more complete columnar implementation will only touch the columns required by our query, 20% of the data, providing another 5X improvement. This drops our scan queries to 5.2 seconds and .1 second, respectively. Smoking fast. Note that the more simple columnar compression approach will provide the same fast response when every column is touched and the more complex approach will slow down in that case… so you can make the trade off in your shop as required.
Let me remind you again… This is a full scan of 20TB with no tricks: no indices, no pre-aggregation, no materialized views, no cache and no flash, no pre-sorted zone maps. All that is required is a parallel implementation with partitioning, compression, and a columnar table type… and this implementation works. It is robust.
A note on joins… It is more difficult to model joins… and I’ll attempt a simple model in another post. But you can see that this fast scan approach has solved the costly part of the problem using parallel processing… and you can imagine that a shared-nothing massively parallel approach to joins may hold the key.