ETL – Database Fog Blog

The last post (here) demonstrated how scalability in the cloud provides the ability to reduce runtimes from days or hours to minutes without raising the cost. We used a batch ETL service running three ETL scripts as an example. We then showed how the same use of scalability could allow us to break the batch ETL service into discrete jobs and remove contention between the three scripts to further improve throughput and reduce costs.

There is a great deal more to say about how the cloud changes our thinking about applying resources to our big data workloads. I promised to get to a discussion of database work, but that will have to wait another week. Sorry. Let’s carry on.

In the scenario where we ran the three jobs together, we assumed that all three ran in the same amount of time and so we shut down the ETL service when all three scripts completed. We shut down the servers when they were all complete to stop billing at the $1152 price point. It is more likely that each script would take more-or-less time than another.

When we run each script in its own set of servers, we can stop billing as each job ends. If one of the jobs takes only 2.5 hours to complete, and your cloud provider will allow you to bill minute, not hour increments, then the cost of that single job drops from $288 to $240. Over a year, these savings add up, and you can ask for a raise.

So, we have added a third point: by using scalability, and by scheduling discrete workloads on dedicated cloud servers, you can scale performance and significantly reduce costs.

You have no doubt noticed that the scenarios describe scalability without any impact on the data. We assume that compute can scale independently of storage, and re-sharding of the data is not required. Every database has special sauce for managing data across storage. All we assume here is that the file scan of data, be they rows or columns, occurs in compute nodes after the data is read from storage. In a future post, we will see how modern storage systems impact this assumption without changing the economics benefits described here. Figure 1 is a classic, really too simple, depiction of this separation.

Simple Storage Separate from Compute — Figure 1. Compute Separate from Storage

The database logic here is straightforward. When compute and storage are tied, the query planner knows precisely and in advance how many parallel nodes are in play, and the system can spread data across those nodes in every step that requires data distribution. The number of nodes is fixed for both storage (processing IO) and compute. Figure 2 shows a system with storage and logic connected.

Figure 2. Classic Shared-nothing Connected Storage and Compute

In a system with storage and compute separated, the query planner has to ask how many nodes are available for data distribution. I am using the term “database” here, but any parallel data processing system that shards data with a processing plan has some form of this logic.

In the ETL scenarios, data flows from the storage layer into a compute layer dynamically allocated at the start of the workload. The planner learns the configuration when the system starts up.

Figure 3. Multi-processing Compute Nodes

Figure 3 depicts the case where all three ETL scripts run together on a six-node cluster. Figure 4 shows each ETL script running on a dedicated cluster. Note that, just-for-fun, I have adjusted the configurations in Figure 4 to show that in the dedicated system case, it is possible to size systems differently if there is an advantage to do so. In a later post, I’ll discuss why this could be important (as right now, it may seem that in every case, you would want 10,000 severs to complete every job in 10 seconds).

Figure 4. Dedicated Compute Nodes with Separated Storage

A couple of closing remarks. First, I cannot imagine why anyone would not run ETL scripts on a scalable cloud platform. The ability to scale up to reduce runtimes at no extra cost is remarkable (and I’m not sure that I have ever used that word in one of my blogs). Next, I cannot see why anyone who could run ETL in the cloud would not run each script in a dedicated cloudy configuration. If the issue is that your ETL product is not cloud-native does not separate from compute, then get a product that does (or use a cloud database and ELT).

Finally, here is a post from five years ago that anticipates the separation of storage from compute.

Next time: I’ll take this down a notch to talk about workloads smaller than an ETL batch job and consider how to run big data queries in the cloud.

My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.

If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect in the ETL space.

I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop eco-system.

Here is the idea from a previous post.

About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes and I suggested they spin the product both into analytics and into ETL… I’ll have to look and see where they are these days…

Category: ETL

More on Cloud Data Elasticity

Hadoop and ETL

Share this:

Share this: