Numbers Every Database Professional Should Know

Latency Estimates in a Processor

In this post I will setup the next post by reminding you of these numbers every programmer should know. The picture shows the latency to access data across the three levels of cache in a modern processor, and across the memory bus to DRAM, and then across to an SSD or spinning disk drive. 

These numbers are the key to in-memory database performance. L1 cache is expensive and small. All data and all instructions are fed to the core through the L1 cache and typically the instruction and the data it will operate on are each fetched independently. If the data (or instruction… but I will just talk about the data from here on out) is not found in the small L1 cache the L2 cache is searched. Likewise, if the data is not in the L2 cache the L3 cache is searched and then DRAM. As you can see there is a 200X performance difference between a fetch from L1 cache and a fetch from DRAM. If the data must be fetched from SSD or disk, there is a 1000X penalty. 

Databases work to mitigate the 1000X penalty by pre-fetching data into DRAM. Modern processors mitigate the penalty between cache and DRAM by trying to anticipate the data required and pre-fetching it into L3 cache. This is a gamble and the likelihood of winning the pre-fetch bet varies with your DBMS and with the workload. More simultaneous queries against many tables puts pressure on the memory and lessens the effectiveness of data pre-fetch. 

Databases also mitigate the 1000X penalty by compressing data so that each fetch of a block from disk captures more data. For tabular (not columnar) data each block must be decompressed before it can be used. Decompression steals CPU cycles from query processing but mitigates some of the cost of I/O to disk. Columnar databases fetch compressed data and, if the databases use modern vector processing techniques, they operate directly on the compressed vector data without decompression. This is especially powerful as these vectorized columnar databases can push compressed data into cache and use that memory more effectively. 

Vectorized data can use the supercomputing instruction set that includes SIMD instructions. SIMD stands for Single Instruction; Multiple Data and these instructions fetch a single instruction from L1 cache and reuse it on the vectorized data over and over with pausing to fetch new instructions. Note that in a modern processor waiting for instructions or data to be fetched from cache or DRAM appears as CPU busy time. When the processor is stalled waiting for data to be fetched from disk the query/job gives up the CPU and dispatches a new query/job. 

Finally, for the longer-running queries associated with analytics and business intelligence (where long-running means a few seconds or more… “long” in CPU instruction time), it is highly likely that the L1 and L2 cache in each core will be flushed and new data will be fetched. It is even likely that all the data in L3 cache will be flushed. In this case most of the goodness associated with compression and vectorization is flushed with it. The vectors must be fetched again from L3 and DRAM. Further, note that when you run in a virtual machine or in the cloud you may think that you are the only user but other virtual machines other containers, or other serverless processes may be flushing you out of memory all the time. 

This is important. Until we swapped out a query for a new one it looked like in-memory vector processing might be 200X-1000X faster than a tabular query; but in an environment where tens or hundreds of queries are running concurrently the pressure on memory keeps flushing the cache and the advantage of vectorized query processing is reduced. Reduced, not eliminated. We would still expect the advantages of compressed data in cache and SIMD supercomputing instructions to provide much more than a 10X speedup. 

As mentioned, this post is a detailed review of material I covered years ago when HANA was introduced. In the next post I will add some new thinking. 

Here is more information on processor architecture and cache usage.

DW Cloud Economics – The Model

This post describe the model used to generate the content from the DW Cloud Economics – A Do Over post incase you want to check the numbers…

First, keep in mind that the numbers in this model are relative and indicative, not absolute. Here are the fixed parameters I used to generate these indicative numbers:

Cost per server per hour: $4.00 This could be any number or any currency and it would not matter.

Runtime per job with no contention: 3 hours This could be any duration and it would not change the relative results. If the number was well below an hour, it would increase the hour-based billing costs.

Jobs per Workload: 6 Again any number works.

Baseline Configuration: 24 Servers

Contention between 6 jobs on 24 Servers: 50% This sets the average amount of time a job will wait for resources. So, a job that runs for 3 hours with no contention will run in 4.5 hours with 50% contention. As the number of servers goes up the contention comes down for a fixed workload.

Baseline Performance: 2000 A completely arbitrary number that allows the model to calculate a relative price/performance. Relative “Performance” is calculated by dividing the Baseline Performance by the actual runtime. This means that a workload that takes 5 hours will yield a performance of 400 and a workload that take 2 hours will yield a performance number of 1000.

There are also a series of calculated values:

Total Servers / Workload is equal to Servers/Job for the Multi-tasking and Serial Execution configurations. It is equal to Servers/Job * Number of Jobs for the dedicated server per job configuration.

Scaleup is the ratio between the number of Total Servers / Workload and the Baseline Configuration. It is used to scale down the runtime and the contention.

Runtime / Job is the number of hours per job with no contention scaled down based on the Scaleup plus the time the Job is waiting due to contention. Contention scales down as the number of servers scales up.

Actual Runtime / Workload is Runtime / Job plus contention. Note that in the Serial Configuration and the Dedicated Server configuration there is no contention.

Paid Runtime / Workload is the Actual Runtime / Workload rounded to the nearest hour for the hourly billing table and equal to the Actual Runtime / Workload for the billing per minute table.

Cost / Workload is the Cost per Server per Hour parameter times the Paid Runtime / Hour value times the Total Servers / Workload value.

Price / Performance is the Cost / Workload divided by the Actual Runtime / Workload times the Baseline Performance parameter divided by the Cost / Workload value as noted above.

DB Cloud Economics – A Do Over

Here is a better model that demonstrates how cloudy databases can provide dramatic performance increases without increasing costs.

Mea culpa… I was going to extend the year-old series of posts on the economics of cloud computing and recognized that I poorly described the simple model I used, and worse, made some mistakes in the model. It was very lame. I’m so sorry… That said, I hope that you will like the extension to the model and the new conclusions.

I’ll get to those conclusions in a few short paragraphs and add a new post to explain the model.

If we imagine a workload consisting of 6 “jobs” that would execute on a dedicated cluster of 24 servers in 3 hours each with no contention, we can build a simple model to determine the price performance of the workload for three different configurations:

  1. M1: The usual configuration where all six jobs run together and share the configuration, multitasking and contending for the resources. As the configuration scales up the cost goes up, the runtime goes down, and the contention drops.
  2. M2: A silly model where each job runs serially on the configuration with no contention. As the configuration scales up the costs go up, the runtime goes down.
  3. M3: A model where each job runs independently, simultaneously, on its own cluster with no contention.

If we pay for servers by the hour the model shows that there is an opportunity to deploy many more servers, as many as one per “job”, to reduce the runtime and the cost. Here are the results:

ConfigServers DeployedActual Runtime / Workload (Hours)Paid Runtime / Workload
(Hours)
Cost / WorkloadPrice / Performance
Multitasking244.55$480444
 483.84$768533
 963.44$1,536593
Serial Execution2418.018$1,728111
 489.09$1,728222
 964.54.5$1,920444
Dedicated Server per Job1440.51$5764000
 2880.31$1,1528000
 5760.11$2,30416000
Table 1. Costs with Hourly Billing

Note that the price/performance metric is relative across the models. You can see the dramatic performance and price/performance increases using a configuration where each unit of work gets a dedicated set of resources. But the more dramatic picture is exposed in Table 2.

ConfigurationServers DeployedActual Runtime / Workload (Hours)Paid Runtime / Workload (Hours)Cost / WorkloadPrice / Performance
Multitasking244.54.5$432444
 483.83.8$720533
 963.43.4$1,296593
Serial Execution2418.018.0$1,728111
 489.099.0$1,728222
 964.54.5$1,728444
Dedicated Server per Job1440.50.5$2884000
 2880.30.3$2888000
 5760.10.1$28816000
Table 2. Costs with per Minute Billing

Here the granular billing reduces the total cost with the same dramatic price and price/performance increases. This was the point I was trying to make in my earlier posts. Note that this point, that fine-grained billing can be used to significantly reduce costs, is why deploying work in containers, or better still as serverless transactions, is so cost-effective. It is why deploying virtual machines in the cloud misses the real cost savings. In other words, it is why building cloud-native implementations is so important and why cloud-native databases will quickly overcome databases that cannot get there.

I also was trying to show that work deployed across these configurations, what I call a “job”, can be ETL jobs or single queries or a set of queries or a Spark job. If you have lots of smaller work, it may be best to run them in a multitasking configuration to avoid the cost of tearing down and starting up new configurations. But even here there is a point where the tear down cost can be mitigated across multiple semi-dedicated configurations.

What is a Cloud-native Database?

Before this series is complete, I plan on defining in some detail what the various levels of Cloud-nativeness might be to allow readers to classify products based on architecture, not marketing. In this post, I’ll lay out some general concepts. First, let’s be real about cloud-things that are not cloud-native.

Any database that runs on physical hardware can run on virtual machines and, therefore, can run on virtual machines in the cloud. Databases with no cloud capabilities other than the ability to run on a VM on a cloud-provider are not cloud-native. Worse, there are lots of anecdotal stories that suggest that there are no meaningful savings to be had from moving a database from an on-premise server or VM to a cloud VM with no other change to take advantage of cloud elasticity.

So here are two general definitions for your consideration:

1) A cloud-native database will have one or more features that utilize capabilities found only on a cloud-computing platform, and

2) A cloud-native database will demonstrate economic benefits derived from those cloud-specific features.

Note that the way you pay for services via capital expenses (CapEx) or as operating expenses (OpEx) does not provide economic benefit. If the monetary costs of a subscription are more-or-less equal to the financial costs of a license, then savings are tied to tax law, not to economics. Beware of cloudy subscriptions that change how you pay without clearly adding beneficial cloudy features. It is these subscriptions that often are the source of the no-savings anecdotes mentioned above.

This next point is about the separation of storage from compute. Companies have long ago disconnected their databases from just-a-bunch-of-disks (JBOD) to shared storage such as SAN or S3. Any database today can use shared storage. It is not useful to say that any database that can use shared storage has separated storage from compute. Using the idea that there must be features, not marketing, that allow compute to scale separately and more-or-less dynamically from storage as the definition, we will be able to move forward in this area.

So:

3) For storage to be appropriately separated from Compute, it must be possible to scale Compute up and down dynamically.

Next, when compute scales, it scales at different granularity. An application or database that automatically adds and subtracts virtual machines provides different economics than a database that scales using containers. Apps that add and subtract containers have different economics than applications that use so-called serverless containers to scale. In this dimension, we will try to characterize granularity to account for the associated cloud economics. This topic will be covered in more detail later, so I’ll save the rule for that post.

Note that it is possible for database vendors to develop a granular architecture and to use the associated economics to their advantage. They may charge you for the time when any part of your database is running but be billed by their provider for smaller chunks. This is not an issue unless their overall costs become uncompetitive.

Last, and in some ways least, different products may charge for time in smaller or larger chunks. You might be charged by the hour, by the minute, by the second, or in smaller increments. Think about the scaling economics I suggested in the first posts of this thread. If you are charged by the hour, then there is no financial incentive to scale up to finish jobs to the minute. You will be charged the same for ten minutes or fifty minutes when you are charged by the hour. The rule:

4) Cloud databases that charge in smaller time increments are more economical than those that charge in larger increments.

The rationale here is probably obvious, but I’ll cover it in-depth in a later post.

With these concepts in place, we can discuss how architectural changes affect each aspect of the economics of databases in the cloud.

Note that this last sentence was written assuming that a British computer scientist with an erudite accent would speak it when they create the PBS series from these posts. Not.

By the way, a few posts from now, I am going to go back to some ideas I shared five years ago around the relationship between database processing and the underlying hardware platform. I’ll update this thinking with cloud computing in mind. You can find this thinking here which originally came from Jeff Dean and Peter Norvig (displayed in lots of places but here is one).

A Segue from ETL to DB

This is a short post to segue to point where I’ve been headed all along. Figure 1 recasts the picture from the last post, showing storage separated from compute from ETL/ELT to a data warehouse. It should be a familiar picture to Snowflake architects who may have implemented multiple DW instances against a single storage layer.

Decoupled Multi Instance DW
Figure 1. Multiple DW Compute Instances Decoupled from Shared Storage

I’ll not give away the next article, other than to say that it derives from the same concepts just discussed.

Since this is so short, I will add a tangent just-for-fun.

Here is a post from seven years ago that anticipates how the cloud impacts DW performance. When you combine this with the economics presented in the last two posts (here and here), suggesting that performance is free, you can begin to see why database tuning is no longer an urgent requirement for a data warehouse.

When you tune, you specialize for a particular workload, and if your workload changes, the tuning wears thin. In other words, I now believe that you should build a robust data warehouse with minimal tuning and use cloud compute to get performance. No tuning lets you add a new workload without adjusting. Tuning makes your database fragile in the face of change.

Cloud-native Computing, Workloads, and Elasticity

Over the next several weeks, I’ll share my perspective of current best practices for big data, which is the term I’ll use to blend thinking about analytic data systems: data lakes, data warehouses, data marts operational data stores. On this journey, I’ll consider how analytic workloads are changing with AI and machine learning, discuss data architecture and virtual database technology, preview new hardware technologies (memory and processor), and, importantly, review the implications of cloud computing on the kit and kaboodle.

In this post, I need to start laying a foundation for discussing the cloud. We will see how scalable cloud-computing makes performance “free,” and then we will see how dedicated resources increase efficiency and further reduce costs. The next post builds on these concepts to describe where cloud database products will evolve.

To start, let’s describe a workload that executes three ETL scripts and consider the result when the three scripts run as separate workloads. Imagine an ETL batch job. Batch jobs are insulated. They read data from one or more source systems, perform a series of integration steps as programs, and then load the results using a data load utility into a lake, warehouse, or mart. By “insulated,” I mean that the compute resources required for the integration steps do not need to interact with other systems.

If you had a dedicated server to run a single ETL script, as long as it could read the raw data from the source and the reference data required for integration, there would be no need for connectivity to other systems. With all of the data and the ETL scripts in hand, the process could execute stand-alone. If you needed to run two ETL scripts at the same time, you could deploy the software on two distinct sets of servers; and three scripts could execute on three individual servers or server clusters. As long as you replicate the ETL software and the required data each time, there would be no issue with any number of distinct ETL systems.

In a cloud environment, you could easily spin up three distinct clusters, run the scripts, and spin them back down, paying for only what you use. The ability to dynamically acquire resources and release them in the cloud is called “elasticity.” It is a characteristic of cloud-native applications and not a characteristic of any application running in the cloud. That is, if you design your ETL software to be self-contained and deploy it using a cloud operating system that manages resources, you can take advantage of cloud elasticity. Tools like Docker containers and Kubernetes make this possible.

To continue, imagine that the three ETL jobs run against large datasets overnight, and they each take three hours to complete on a dedicated cluster of twenty-four servers. If all three jobs run simultaneously, all three jobs complete in twelve hours. This estimate assumes that 25% of the time, the three tasks are competing for the compute resources of the server. If the jobs are CPU-bound, this would be an optimistic assumption, and the runtime might be longer.

The scripts run overnight to gain access to dedicated resources. During the day, the cluster runs queries, and contention between the batch scripts and the queries for CPU is hard to manage. Finally, let’s imagine that the cost of these twenty-four servers in the cloud is $4 per server per hour with software or $1152 per day to run the three ETL scripts, not counting storage server costs ($4/server per hour times 24 servers times 12 hours equals $1152).

If our ETL programs are scalable, we could spin up twice as many servers and complete the jobs in 6 hours. Note that the cost is still $1152 ($4/server * 48 * 6 = $1152) and we could double it again to complete the job in 3 hours at the same price ($4 * 96 * 3 = $1152). This math continues as far as you would like to go as long as your cloud provider will let you pay in ever-smaller increments.

This example makes the first important point: if you have self-contained and scalable workloads, you can scale up in the cloud to reduce runtimes at no extra cost.

Now let’s consider what happens if we run each script on a separate cluster. With dedicated servers, each job takes three hours to complete, and the cost per job is $4/server times 24 servers * 3 hours or $288. If we spin up 72 servers and run each script as a separate process, all three complete in three hours for $864. The savings are the result of removing the contention between the three jobs and giving each job dedicated resources.

Even though it may seem obvious, we are so used to sharing computers that we forget that contention is wasteful. Whether we contend for a disk drive to read or write, for memory, for CPU (L3, L2, and L1) cache, for instruction fetch or instruction execution, the cost of managing contention adds inefficiencies. More on that in a couple of posts, I want to talk about how databases can reduce contention, how processor technology helps, and especially how technology like Intel Optane may play a role in the future.

Let me wrap up with a couple of caveats regarding this made-up scenario.

First, if the scripts are IO-bound, not CPU-bound, then they may execute together with less contention. ETL programs that stream data between steps will be CPU-bound as they do not perform IO to spool intermediate results. The contention will still be there, and the cost will reflect this. If the jobs are more completely CPU-bound when the scripts run in-memory, then the contention will be more significant, and the cost difference will be higher.

Second, there are startup costs associated with cloud clusters. Spinning up a machine will take several minutes, and if there are more servers, then there will be more cost associated with the startup. We will consider this more in the next post.

So far, we have made two essential points:
If we have a scalable system and a self-contained workload, then we can deploy cloud compute at scale to reduce runtimes at no extra cost. There is no reason to ever suffer through long-running batch jobs.
If we have multiple units-of-work running, where in the past, we might run them concurrently and allow the workload to compete for a finite number of computers, with cloud computing, we can provide each workload with discrete resources and scale at a reduced cost.

In the next post, we will discuss smaller units-of-work in a database. With this foundation, we will then be able to talk about the power provided by products like Snowflake, and we will be able to show a path for cloud databases to become even more efficient.

Back in the saddle

In 2015, President Obama appointed me the CIO of the Social Security Administration, and I had to stop posting here. The products and vendors I would have liked to review were often installed at the SSA, and unfortunately, it was not appropriate for me to comment. Now I’m at a place without constraints, so I’ll try my hand again.

I could spend the next year posting stories of my experience at the SSA. It was a fantastic time. I think that I left an impression on the IT programs there. It may be that I even left a good impression.

I will mention two things for fun. First, as CIO, I was in charge of a $1.2B/year IT budget. Who knew that a geeky guy could perform at that level? Not me. Second, in case you think $1.2B on IT per year is an example of wasteful government spending, consider that the operations at the SSA are directly responsible for payments that represent 5% of the GDP of the US. $1.2B for IT that drives 5% of $20,544B is a bargain.

A lot happened during my time away from the Database Fog Blog, and my stint as a CIO provided me with a new perspective. I implemented a data lake, a data warehouse, and utilized the cloud in a very modern manner. I applied agile methods and DevOps effectively. I imagined a unique application of machine learning and prototyped it. I worked to make this all happen in a very large organization. These experiences are the fodder for what I’ll write.

Since I implemented cloud-native apps, I have a lot to share on how cloud computing will impact data warehousing (hint: Snowflake is just a significant first step). You may be surprised at where the cloud could take us. There are significant technologies, besides Snowflake, that will impact how we might architect analytic data going forward.

The first topic (this preface, does not count) considers workload granularity and cloud-native computing. I hope that you will find it interesting and different.

It is great to be back.

Hadoop and ETL

My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.

If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect in the ETL space.

I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop eco-system.

Here is the idea from a previous post.

About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes and I suggested they spin the product both into analytics and into ETL… I’ll have to look and see where they are these days…

Hadoop and Company Financial Performance

I have posted several times about the impact of the Hadoop eco-system on a several companies (here, here, here, for example). The topic cam up in a tweet thread a few weeks back… which prompts this quick note.

Fours years ago the street price for a scalable, parallel, enterprise data warehouse platform was $US25K-$US35K per terabyte. This price point provided vendors like Teradata, Netezza, and Greenplum reasonable, lucrative, margins. Hadoop entered the scene and captured the Big Data space from these vendors by offering 20X slower performance at 1/20th the price: $US1K-$US5K per terabyte. The capture was immediate and real… customers who were selecting these products for specialized, very large, 1PB and up deployments switched to Hadoop as fast as possible.

Now, two trends continue to eat at the market share of parallel database products.

First, relational implementations on HDFS continue to improve in performance and they are now 4X-10X slower than the best parallel databases at 1/10th-1/20th the street price. This puts pressure on prices and on margins for these relational vendors and this pressure is felt.

In order to keep their installed base of customers in the fold these vendors have built ever more sophisticated integration between their relational products and Hadoop. This integration, however, allows customers to significantly reduce expense by moving large parts of their EDW to an Annex (see here)… and this trend has started. We might argue whether an EDW Annex should store the coldest 80% or the coldest 20% of the data in your EDW… but there is little doubt that some older data could satisfy SLAs by delivering 4X-10X slower performance.

In addition, these trends converge. If you can only put 20% of your old, cold data in an Annex that is 10X slower than your EDW platform then you might put 50% of your data into an Annex that is only 4X slower. As the Hadoop relational implementations continue to add columnar, in-memory, and other accelerators… ever more data could move to a Hadoop-based EDW Annex.

I’ll leave it to the gamblers who read this to guess the timing and magnitude of the impact of Hadoop on the relational database markets and on company financial performance. I cannot see how it cannot have an impact.

Well, actually I can see one way out. If the requirement for hot data that requires high performance accelerates faster than the high performance advances of Hadoop then the parallel RDBMS folks will hold their own or advance. Maybe the Internet of Things helps here…. but I doubt it.

Database Super-computing

Today I am going to focus on a topic that I’ve suggested previously without the right emphasis: the new database architecture that uses vector processing on compressed columns to significantly accelerate performance.

The term “super-computing” was coined to describe the extreme hardware and software optimization developed to crunch numbers in scientific applications. As these technologies developed super-computer hardware evolved to leverage parallel microcomputers, software evolved to better leverage parallelism. Recently, microcomputers have started to incorporate the specialized instructions that support advanced mathematical applications. These super-computer instructions directly support vector algebra by manipulating strings of bits, vectors, in a single instruction. Finally, application developers recognized that these bit strings, these vectors, could be loaded into the microprocessors in a more effective manner to optimize their applications to the bare metal.

The effect of these optimizations accumulate for these applications as vectors compress and use memory more effectively, vectors load into processor cache more effectively, and vector instructions dramatically outperform integer instructions. The cumulative effect is that super-computer programs may be 10X-100X faster than commercial applications that provide the same result.

As this evolution progressed there was a similar evolution changing the architecture of database technology. Databases actually leveraged microcomputers before the high performance space made the move. But databases focused on the benefits of massively parallel I/O more than on the benefits of parallel compute. The drive to minimize the cost of I/O eventually led database developers to implement column store and then a very interesting discovery was made. Engineers recognized that a highly compressed column, a string of bits, could be processed as a vector.

Let’s see if we can make this 10X-100X number more than marketing foam. We can do this by roughly comparing the low-level processing of a chunk of data in integer and then in vector formats.

Let’s skip I/O processing and just focus on internals. This simplification greatly favors our integer DBMS. Keep in mind that the vector DBMS will process compressed vector data directly while the integer DBMS will expend resources to uncompress data and then take up 4X or more memory. This less efficient memory utilization will increase the chance that an I/O may be required and I/O is very expensive in the scenario we will discuss. Even an I/O on 1% of the time by the integer DBMS will provide a 1000X-100,000X advantage to the vector DBMS (see Figure 8 to gauge the latency to SSD or to disk).

Figure 8. Some Latency Metrics
Figure 8. Some Latency Metrics

So we’ll start with uncompressed integer data versus compressed vector data. We can assume that both databases are effective at populating cache. But the 4X compression advantage means that the vector processor is more likely to find data in the fast Level 1 cache and in the mid-range L2 cache. Given the characteristics outlined in Figure 8 we might suggest that the vector database is 4X more likely of finding data in cache than the integer database and that if we assume the latency of L2 cache as an estimate this results in a 15X-200X performance advantage.

Since data is in a vector form we can perform relational algebra and basic mathematics using vector algebra and vector addition. This provides another 8X-50X boost to the vector side

When we combine these advantages we see that a 10X-100X advantage is conservative. The bottom line is clear. A columnar database that effectively manages vectors into cache and further utilizes super-computing instructions will significantly out-perform an integer-based product.

The era of database super-computing has begun.