Cloud-native Computing, Workloads, and Elasticity

Over the next several weeks, I’ll share my perspective on current best practices for big data, the term I’ll use to blend thinking about analytic data systems: data lakes, data warehouses, data marts, and operational data stores. On this journey, I’ll consider how analytic workloads are changing with AI and machine learning, discuss data architecture and virtual database technology, preview new hardware technologies (memory and processor), and, importantly, review the implications of cloud computing on the whole kit and caboodle.

In this post, I’ll start laying a foundation for discussing the cloud. We will see how scalable cloud computing makes performance “free,” and then we will see how dedicated resources increase efficiency and further reduce costs. The next post builds on these concepts to describe where cloud database products will evolve.

To start, let’s describe a workload that executes three ETL scripts and consider what happens when the three scripts run as separate workloads. Imagine an ETL batch job. Batch jobs are insulated: they read data from one or more source systems, perform a series of integration steps as programs, and then load the results into a lake, warehouse, or mart using a data load utility. By “insulated,” I mean that the compute resources required for the integration steps do not need to interact with other systems.

If you had a dedicated server to run a single ETL script, then as long as it could read the raw data from the source and the reference data required for integration, there would be no need for connectivity to other systems. With all of the data and the ETL scripts in hand, the process could execute stand-alone. If you needed to run two ETL scripts at the same time, you could deploy the software on two distinct sets of servers, and three scripts could execute on three individual servers or server clusters. As long as you replicate the ETL software and the required data each time, any number of distinct ETL systems could run without issue.

In a cloud environment, you could easily spin up three distinct clusters, run the scripts, and spin them back down, paying for only what you use. The ability to dynamically acquire resources and release them in the cloud is called “elasticity.” It is a characteristic of cloud-native applications, not of just any application that happens to run in the cloud. That is, if you design your ETL software to be self-contained and deploy it using a cloud operating system that manages resources, you can take advantage of cloud elasticity. Tools like Docker containers and Kubernetes make this possible.
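To make that concrete, here is a minimal sketch of how the three scripts might be launched as independent, containerized jobs using the Kubernetes Python client. The image name and script paths are hypothetical, and a real deployment would add resource requests so the cluster autoscaler can grow and shrink the node pool, but it illustrates the self-contained, spin-up/spin-down pattern:

```python
# Hypothetical sketch: submit each ETL script as its own Kubernetes Job.
# Assumes the official `kubernetes` Python client and a configured kubeconfig.
from kubernetes import client, config

config.load_kube_config()
batch_api = client.BatchV1Api()

for i in (1, 2, 3):
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"etl-script-{i}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="etl",
                            image="registry.example.com/etl:latest",  # hypothetical image
                            command=["python", f"/app/etl_script_{i}.py"],  # hypothetical path
                        )
                    ],
                )
            )
        ),
    )
    batch_api.create_namespaced_job(namespace="default", body=job)
# When the jobs finish, the pods exit and the autoscaler can release the nodes.
```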

To continue, imagine that the three ETL jobs run against large datasets overnight, and they each take three hours to complete on a dedicated cluster of twenty-four servers. If all three jobs run simultaneously on a single shared cluster, they complete in twelve hours. This estimate assumes that the three jobs spend 25% of their time competing for the compute resources of the cluster, so nine hours of work stretches to twelve. If the jobs are CPU-bound, this would be an optimistic assumption, and the runtime might be longer.

The scripts run overnight to gain access to dedicated resources. During the day, the cluster runs queries, and contention between the batch scripts and the queries for CPU is hard to manage. Finally, let’s imagine that these twenty-four cloud servers cost $4 per server per hour, software included, or $1152 per day to run the three ETL scripts, not counting storage server costs ($4/server per hour × 24 servers × 12 hours = $1152).

If our ETL programs are scalable, we could spin up twice as many servers and complete the jobs in six hours. Note that the cost is still $1152 ($4/server per hour × 48 servers × 6 hours = $1152), and we could double it again to complete the job in three hours at the same price ($4 × 96 × 3 = $1152). This math continues as far as you would like to go, as long as your cloud provider will let you pay in ever-smaller increments.
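A quick back-of-the-envelope check of that math, using the post’s assumed $4 per server per hour rate:

```python
# Cost of the shared 12-hour run versus scaled-out runs at $4 per server per hour.
RATE = 4  # dollars per server per hour (assumed rate from the example)

for servers, hours in [(24, 12), (48, 6), (96, 3)]:
    print(f"{servers:>2} servers x {hours:>2} hours = ${RATE * servers * hours}")

# Every configuration costs $1152; only the elapsed time changes.
```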

This example makes the first important point: if you have self-contained and scalable workloads, you can scale up in the cloud to reduce runtimes at no extra cost.

Now let’s consider what happens if we run each script on a separate cluster. With dedicated servers, each job takes three hours to complete, and the cost per job is $4/server per hour × 24 servers × 3 hours, or $288. If we spin up seventy-two servers as three separate clusters and run each script on its own cluster, all three complete in three hours for $864. The savings are the result of removing the contention between the three jobs and giving each job dedicated resources.
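Extending the same sketch to compare the shared cluster with the three dedicated clusters, again using the example’s assumed numbers:

```python
# Shared cluster (25% of time lost to contention) vs. three dedicated clusters.
RATE, SERVERS, JOB_HOURS, JOBS = 4, 24, 3, 3

shared_hours = (JOBS * JOB_HOURS) / 0.75            # 9 hours of work at 75% efficiency -> 12 hours
shared_cost = RATE * SERVERS * shared_hours          # $1152 on one shared cluster
dedicated_cost = RATE * SERVERS * JOB_HOURS * JOBS   # $864 across three dedicated clusters

print(f"Shared:    {shared_hours:.0f} hours elapsed, ${shared_cost:.0f}")
print(f"Dedicated: {JOB_HOURS} hours elapsed, ${dedicated_cost:.0f}")
```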

Even though it may seem obvious, we are so used to sharing computers that we forget that contention is wasteful. Whether we contend for a disk drive to read or write, for memory, for CPU cache (L3, L2, and L1), or for instruction fetch and execution, the cost of managing contention adds inefficiency. More on that in a couple of posts, where I want to talk about how databases can reduce contention, how processor technology helps, and especially how technology like Intel Optane may play a role in the future.

Let me wrap up with a couple of caveats regarding this made-up scenario.

First, if the scripts are IO-bound rather than CPU-bound, they may execute together with less contention, but the contention will still be there, and the cost will reflect it. ETL programs that stream data between steps will be CPU-bound, as they do not perform IO to spool intermediate results; the more completely CPU-bound the scripts are, running in memory, the more significant the contention and the larger the cost difference.

Second, there are startup costs associated with cloud clusters. Spinning up a machine takes several minutes, and the more servers you start, the more startup cost you incur. We will consider this more in the next post.

So far, we have made two essential points:
If we have a scalable system and a self-contained workload, then we can deploy cloud compute at scale to reduce runtimes at no extra cost. There is no reason ever to suffer through long-running batch jobs.
If we have multiple units-of-work that, in the past, we might have run concurrently and allowed to compete for a finite number of computers, then with cloud computing we can give each workload its own discrete resources and complete the work faster at a reduced cost.

In the next post, we will discuss smaller units-of-work in a database. With this foundation, we will then be able to talk about the power provided by products like Snowflake, and we will be able to show a path for cloud databases to become even more efficient.

HANA, BLU, Hekaton, and Oracle 12c vs. Teradata and Greenplum – November 2013

I would like to point out a very important section in the paper on Hekaton on the Microsoft Research site here. I will quote the section in full:

2. DESIGN CONSIDERATIONS 

An analysis done early on in the project drove home the fact that a 10-100X throughput improvement cannot be achieved by optimizing existing SQL Server mechanisms. Throughput can be increased in three ways: improving scalability, improving CPI (cycles per instruction), and reducing the number of instructions executed per request. The analysis showed that, even under highly optimistic assumptions, improving scalability and CPI can produce only a 3-4X improvement. The detailed analysis is included as an appendix. 

The only real hope is to reduce the number of instructions executed but the reduction needs to be dramatic. To go 10X faster, the engine must execute 90% fewer instructions and yet still get the work done. To go 100X faster, it must execute 99% fewer instructions. This level of improvement is not feasible by optimizing existing storage and execution mechanisms. Reaching the 10-100X goal requires a much more efficient way to store and process data. 

This is important because it confirms the difference between a Level 3 and a Level 2 columnar implementation as described here. It is just not possible for a Level 2 implementation with a row-based join engine to achieve the performance of a Level 3 implementation. This will allow the Level 3 implementations (HANA, BLU, Hekaton, and Oracle 12c) to distance themselves from the Level 2 products (Teradata and Greenplum) by more than 10X… and this is a very significant advantage.
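As a sketch of the quoted arithmetic (my own illustration, assuming CPI and clock rate stay constant): throughput scales with the inverse of the fraction of instructions that remain per request.

```python
# Throughput gain from instruction reduction, assuming CPI and clock rate are unchanged.
for reduction in (0.90, 0.99):
    remaining = 1.0 - reduction
    print(f"{reduction:.0%} fewer instructions -> {1 / remaining:.0f}X throughput")

# 90% fewer instructions -> 10X throughput
# 99% fewer instructions -> 100X throughput
```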

NoCOUG Referral

I would like to point you to two articles in the latest Northern California Oracle Users Group (NoCOUG) Journal here.

The first is an interview with Kevin Closson here. The interview is long and will take some time to get through… so set aside 30 minutes… it will be worth it, as Kevin discusses Exadata, shared-nothingness, and other topics related to database hardware architecture.

The second article I would like to suggest (by the way, there are several other excellent articles) is by Dr. Bert Scalzo. He reminds us that our job as engineers is to build the most cost-effective solution… not to build the perfect solution. He suggests that hardware should be treated as a dynamic resource that can be provisioned easily to solve performance problems.

I have argued that in a scalable, shared-nothing architecture it is often cheaper to add another $20,000 fat server than to spend $100,000 of staff time to tune around a performance problem. This is especially true when the tuning involves building indexes, materialized views, or pre-aggregated tables that make your warehouse fragile and more difficult to tune the next time. See here.

Back to Kevin’s interview, and to tie the two articles together… Kevin suggests that as long as data flows into the CPUs fast enough, there is no reason to pick a shared-nothing architecture over a shared-everything architecture. He insists on symmetry and rightfully points out that a shared-everything system can be symmetrical. But it is more difficult to maintain symmetry as you scale up a shared-everything system… and easy scaling is what is required to treat hardware as a dynamic resource. So… I remain convinced that shared-nothing is the way to go…