A Segue from ETL to DB

This is a short post to segue to the point where I’ve been headed all along. Figure 1 recasts the picture from the last post, moving from ETL/ELT to a data warehouse while keeping storage separated from compute. It should be a familiar picture to Snowflake architects who may have implemented multiple DW instances against a single storage layer.

Decoupled Multi Instance DW
Figure 1. Multiple DW Compute Instances Decoupled from Shared Storage

I’ll not give away the next article, other than to say that it derives from the same concepts just discussed.

Since this is so short, I will add a tangent just for fun.

Here is a post from seven years ago that anticipates how the cloud impacts DW performance. When you combine this with the economics presented in the last two posts (here and here), suggesting that performance is free, you can begin to see why database tuning is no longer an urgent requirement for a data warehouse.

When you tune, you specialize for a particular workload, and if your workload changes, the tuning wears thin. In other words, I now believe that you should build a robust data warehouse with minimal tuning and use cloud compute to get performance. No tuning lets you add a new workload without adjusting. Tuning makes your database fragile in the face of change.

More on Cloud Data Elasticity

The last post (here) demonstrated how scalability in the cloud provides the ability to reduce runtimes from days or hours to minutes without raising the cost. We used a batch ETL service running three ETL scripts as an example. We then showed how the same use of scalability could allow us to break the batch ETL service into discrete jobs and remove contention between the three scripts to further improve throughput and reduce costs.
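To make the "faster at the same cost" claim concrete, here is a minimal sketch. It assumes perfectly linear scaling (which real workloads only approximate), and the work size of 48 node-hours and the $2 per node-hour rate are made-up numbers for illustration, not figures from the last post.

```python
def runtime_and_cost(nodes, work_node_hours, rate_per_node_hour):
    """Return (runtime in hours, total cost) for a job spread over `nodes`.

    Assumption: perfectly linear scaling, so doubling nodes halves runtime.
    """
    runtime_hours = work_node_hours / nodes
    cost = nodes * runtime_hours * rate_per_node_hour
    return runtime_hours, cost


# Hypothetical workload: 48 node-hours of work at $2 per node-hour.
WORK_NODE_HOURS = 48
RATE_PER_NODE_HOUR = 2.0

for nodes in (1, 6, 96):
    hours, cost = runtime_and_cost(nodes, WORK_NODE_HOURS, RATE_PER_NODE_HOUR)
    print(f"{nodes:>3} nodes: {hours:5.1f} hours, ${cost:.2f}")
```

Under these assumptions the bill is the same whether the job runs for two days on one node or for thirty minutes on 96 nodes; only the runtime changes.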

There is a great deal more to say about how the cloud changes our thinking about applying resources to our big data workloads. I promised to get to a discussion of database work, but that will have to wait another week. Sorry. Let’s carry on.

In the scenario where we ran the three jobs together, we assumed that all three ran in the same amount of time, so we shut down the servers and stopped billing at the $1152 price point only when all three scripts had completed. It is more likely that each script would take more or less time than the others.

When we run each script in its own set of servers, we can stop billing as each job ends. If one of the jobs takes only 2.5 hours to complete, and your cloud provider bills in minute rather than hour increments, then the cost of that single job drops from $288 to $240. Over a year, these savings add up, and you can ask for a raise.
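The arithmetic behind the $288 and $240 figures can be sketched as follows. The $96 per hour rate is my inference, assuming the $288 job corresponds to three billed hours (which is what the drop to $240 at 2.5 hours implies); the function name and parameters are illustrative only.

```python
import math


def job_cost(runtime_minutes, rate_per_hour, increment_minutes):
    """Cost of one job when the provider rounds runtime up to a billing increment."""
    billed_increments = math.ceil(runtime_minutes / increment_minutes)
    billed_minutes = billed_increments * increment_minutes
    return billed_minutes * rate_per_hour / 60


# Assumption: $288 for 3 billed hours implies $96 per hour for the job's servers.
RATE_PER_HOUR = 96.0
RUNTIME_MINUTES = 150  # the 2.5-hour job

hourly_billing = job_cost(RUNTIME_MINUTES, RATE_PER_HOUR, increment_minutes=60)
minute_billing = job_cost(RUNTIME_MINUTES, RATE_PER_HOUR, increment_minutes=1)

print(f"billed by the hour:   ${hourly_billing:.2f}")
print(f"billed by the minute: ${minute_billing:.2f}")
```

With hourly increments the 2.5-hour job is rounded up to three hours; with minute increments you pay for exactly 150 minutes.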

So, we have added a third point: by using scalability, and by scheduling discrete workloads on dedicated cloud servers, you can scale performance and significantly reduce costs.

You have no doubt noticed that the scenarios describe scalability without any impact on the data. We assume that compute can scale independently of storage, and that re-sharding of the data is not required. Every database has special sauce for managing data across storage. All we assume here is that the file scan of data, be they rows or columns, occurs in compute nodes after the data is read from storage. In a future post, we will see how modern storage systems impact this assumption without changing the economic benefits described here. Figure 1 is a classic, really too simple, depiction of this separation.

Simple Storage Separate from Compute
Figure 1. Compute Separate from Storage

The database logic here is straightforward. When compute and storage are tied, the query planner knows precisely and in advance how many parallel nodes are in play, and the system can spread data across those nodes in every step that requires data distribution. The number of nodes is fixed for both storage (processing IO) and compute. Figure 2 shows a system with storage and logic connected.

Couple Storage
Figure 2. Classic Shared-nothing Connected Storage and Compute

In a system with storage and compute separated, the query planner has to ask how many nodes are available for data distribution. I am using the term “database” here, but any parallel data processing system that shards data with a processing plan has some form of this logic.

In the ETL scenarios, data flows from the storage layer into a compute layer dynamically allocated at the start of the workload. The planner learns the configuration when the system starts up.
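A minimal sketch of that planner logic, under loud assumptions: the `cluster` dictionary, the `available_nodes` helper, and hash distribution on a single key are all hypothetical stand-ins, not any particular product's design. The point is only that the node count is looked up when the workload starts rather than fixed in advance.

```python
import zlib
from collections import defaultdict


def available_nodes(cluster):
    """Hypothetical lookup: how many compute nodes were allocated for this run."""
    return len(cluster["nodes"])


def distribute(rows, key, node_count):
    """Assign each row to a compute node by hashing its distribution key."""
    shards = defaultdict(list)
    for row in rows:
        # crc32 gives a hash that is stable across runs, unlike Python's hash().
        node = zlib.crc32(str(row[key]).encode()) % node_count
        shards[node].append(row)
    return shards


rows = [{"order_id": i} for i in range(12)]

# Separated storage: the planner asks for the node count at startup.
cluster = {"nodes": ["n0", "n1", "n2", "n3", "n4", "n5"]}
shards = distribute(rows, "order_id", available_nodes(cluster))

for node in sorted(shards):
    print(node, [r["order_id"] for r in shards[node]])
```

In a coupled shared-nothing system, `node_count` would be a constant known to the planner ahead of time; here the same plan works for whatever cluster size was allocated.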

Shared Nodes
Figure 3. Multi-processing Compute Nodes

Figure 3 depicts the case where all three ETL scripts run together on a six-node cluster. Figure 4 shows each ETL script running on a dedicated cluster. Note that, just for fun, I have adjusted the configurations in Figure 4 to show that in the dedicated system case, it is possible to size systems differently if there is an advantage to doing so. In a later post, I’ll discuss why this could be important (as right now, it may seem that in every case, you would want 10,000 servers to complete every job in 10 seconds).

Dedicated Compute Nodes
Figure 4. Dedicated Compute Nodes with Separated Storage

A couple of closing remarks. First, I cannot imagine why anyone would not run ETL scripts on a scalable cloud platform. The ability to scale up to reduce runtimes at no extra cost is remarkable (and I’m not sure that I have ever used that word in one of my blogs). Next, I cannot see why anyone who could run ETL in the cloud would not run each script in a dedicated cloudy configuration. If the issue is that your ETL product is not cloud-native and does not separate storage from compute, then get a product that does (or use a cloud database and ELT).

Finally, here is a post from five years ago that anticipates the separation of storage from compute.

Next time: I’ll take this down a notch to talk about workloads smaller than an ETL batch job and consider how to run big data queries in the cloud.

Cloud DBMS < High Performance DBMS

In my post here I suggested that database computing was becoming a special case of high-performance computing. This trend will bump up against the trend towards cloud computing and the bump will be noisy.

In the case of general commercial computing, customers running cloudy virtualized servers paid a 5%-20% performance penalty… but the economics still worked for the cloud side.

For high-performance database computing, it is unclear how large the penalty will be. If a virtualized, cloudy database gives up performance because SIMD becomes problematic, priming the cache becomes hard, CPU stalls become more common, and there is a move from a shared-nothing architecture to SANs or SAN-like shared data devices, then the penalty may be 300%-500% and the cloud databases will likely lose.

As I noted in the series starting here, there are lots of issues around high-performance database computing in the cloud. It will be interesting to see how the database vendors manage the bump and the noise. So keep an eye out. If your database of choice starts to look cloudy… if it becomes virtualized and it starts moving from a shared-nothing cluster to a SAN… then you will know which side of the bump they are betting on. And if they pick the cloudy side then you need to ask how they plan to architect the system to hold the penalty to under 20%…

I also mentioned in that series that in-memory databases had an advantage over peripheral-based databases as they did not have to pay a penalty for de-coupling the IO bandwidth that is part of a shared-nothing cluster. But even those vendors have to manage the fact that the database is abstracted… virtualized… away from the hardware.

If I were King, I would develop a high-performance database that implemented the features of a cloud database (elasticity, easy provisioning, multi-tenancy) over bare metal. Then you might get the best of both worlds.