Over the next several weeks, I’ll share my perspective of current best practices for big data, which is the term I’ll use to blend thinking about analytic data systems: data lakes, data warehouses, data marts operational data stores. On this journey, I’ll consider how analytic workloads are changing with AI and machine learning, discuss data architecture and virtual database technology, preview new hardware technologies (memory and processor), and, importantly, review the implications of cloud computing on the kit and kaboodle.
In this post, I need to start laying a foundation for discussing the cloud. We will see how scalable cloud-computing makes performance “free,” and then we will see how dedicated resources increase efficiency and further reduce costs. The next post builds on these concepts to describe where cloud database products will evolve.
To start, let’s describe a workload that executes three ETL scripts and consider the result when the three scripts run as separate workloads. Imagine an ETL batch job. Batch jobs are insulated. They read data from one or more source systems, perform a series of integration steps as programs, and then load the results using a data load utility into a lake, warehouse, or mart. By “insulated,” I mean that the compute resources required for the integration steps do not need to interact with other systems.
If you had a dedicated server to run a single ETL script, as long as it could read the raw data from the source and the reference data required for integration, there would be no need for connectivity to other systems. With all of the data and the ETL scripts in hand, the process could execute stand-alone. If you needed to run two ETL scripts at the same time, you could deploy the software on two distinct sets of servers; and three scripts could execute on three individual servers or server clusters. As long as you replicate the ETL software and the required data each time, there would be no issue with any number of distinct ETL systems.
In a cloud environment, you could easily spin up three distinct clusters, run the scripts, and spin them back down, paying for only what you use. The ability to dynamically acquire resources and release them in the cloud is called “elasticity.” It is a characteristic of cloud-native applications and not a characteristic of any application running in the cloud. That is, if you design your ETL software to be self-contained and deploy it using a cloud operating system that manages resources, you can take advantage of cloud elasticity. Tools like Docker containers and Kubernetes make this possible.
To continue, imagine that the three ETL jobs run against large datasets overnight, and they each take three hours to complete on a dedicated cluster of twenty-four servers. If all three jobs run simultaneously, all three jobs complete in twelve hours. This estimate assumes that 25% of the time, the three tasks are competing for the compute resources of the server. If the jobs are CPU-bound, this would be an optimistic assumption, and the runtime might be longer.
The scripts run overnight to gain access to dedicated resources. During the day, the cluster runs queries, and contention between the batch scripts and the queries for CPU is hard to manage. Finally, let’s imagine that the cost of these twenty-four servers in the cloud is $4 per server per hour with software or $1152 per day to run the three ETL scripts, not counting storage server costs ($4/server per hour times 24 servers times 12 hours equals $1152).
If our ETL programs are scalable, we could spin up twice as many servers and complete the jobs in 6 hours. Note that the cost is still $1152 ($4/server * 48 * 6 = $1152) and we could double it again to complete the job in 3 hours at the same price ($4 * 96 * 3 = $1152). This math continues as far as you would like to go as long as your cloud provider will let you pay in ever-smaller increments.
This example makes the first important point: if you have self-contained and scalable workloads, you can scale up in the cloud to reduce runtimes at no extra cost.
Now let’s consider what happens if we run each script on a separate cluster. With dedicated servers, each job takes three hours to complete, and the cost per job is $4/server times 24 servers * 3 hours or $288. If we spin up 72 servers and run each script as a separate process, all three complete in three hours for $864. The savings are the result of removing the contention between the three jobs and giving each job dedicated resources.
Even though it may seem obvious, we are so used to sharing computers that we forget that contention is wasteful. Whether we contend for a disk drive to read or write, for memory, for CPU (L3, L2, and L1) cache, for instruction fetch or instruction execution, the cost of managing contention adds inefficiencies. More on that in a couple of posts, I want to talk about how databases can reduce contention, how processor technology helps, and especially how technology like Intel Optane may play a role in the future.
Let me wrap up with a couple of caveats regarding this made-up scenario.
First, if the scripts are IO-bound, not CPU-bound, then they may execute together with less contention. ETL programs that stream data between steps will be CPU-bound as they do not perform IO to spool intermediate results. The contention will still be there, and the cost will reflect this. If the jobs are more completely CPU-bound when the scripts run in-memory, then the contention will be more significant, and the cost difference will be higher.
Second, there are startup costs associated with cloud clusters. Spinning up a machine will take several minutes, and if there are more servers, then there will be more cost associated with the startup. We will consider this more in the next post.
So far, we have made two essential points:
If we have a scalable system and a self-contained workload, then we can deploy cloud compute at scale to reduce runtimes at no extra cost. There is no reason to ever suffer through long-running batch jobs.
If we have multiple units-of-work running, where in the past, we might run them concurrently and allow the workload to compete for a finite number of computers, with cloud computing, we can provide each workload with discrete resources and scale at a reduced cost.
In the next post, we will discuss smaller units-of-work in a database. With this foundation, we will then be able to talk about the power provided by products like Snowflake, and we will be able to show a path for cloud databases to become even more efficient.