Database Supercomputing in the Cloud

This post will combine the information in the last two posts to show how the cloud might be used as a database supercomputer at no extra cost. 

To quickly recap: in the post here, you saw a simple model that shows how a cluster dedicated to a single query uses the on-demand resources of cloud computing to provide a 10X-100X performance boost over a multitasking cluster running many queries at once. Note that this model did not account for the start-up time of spinning up a new cluster instance for each query, but we assume that there is still a significant saving for queries that run for several minutes. Further, we believe that queries can be queued and run one at a time on clusters to reduce the start-up costs whenever there is work queued. 

The next post here reminded us of the cost of populating cache from DRAM whenever there is a multitask context switch from one query to the next. You also saw how supercomputing SIMD instructions further accelerate processing. Finally, we discussed how modern columnar databases store data as vectors in a way that allows compressed columnar data to be loaded directly into cache to optimize the use of that memory. 

One last note: all the cores share the L3 cache in a chip. On a 4-CPU processor, this means that you could run 4 parallel processes, all on compressed vectorized data. 

So how might the cloud help here? First, if we dedicate a cluster to query processing where “dedicate” means that the hardware is dedicated to a single queue of queries, then once the query data is loaded in memory and the cache is primed, the performance can be 1000X the speed of an alternative configuration where the hardware is shared both by a multitasking DBMS and that DBMS is sharing the cluster with other cloud workloads dispatched by the cloud operating system. Better still, since in the cloud using fine-grained billing, you pay only for what you use, so this extra performance is nearly free. 

When SAP HANA was introduced, SAP published amazing 1000X benchmarks using in-memory supercomputing instructions. Unfortunately, the technology did not quite exist to deploy queries on-demand on tens or hundreds or thousands of dedicated cloud computers. I imagine that we are not too far from this today. 

The bottom line is that multitasking has never been efficient. When you swap context in and out of cache, you are just wasting resources. It made sense when computing was expensive, and you swapped context to take advantage of the time it took to perform I/O to disk or tape. For analytic workloads where the data is columnar and vectorized, you can fit multiple terabytes of compressed data in memory and stream it to supercomputing instructions quickly.  

I imagine a time when simple, very high performance, single-threaded database processing units, similar in architecture to GPUs, will handle queries in a dedicated manner, fetching vectorized data from products like Apache Arrow, and 1000X speed-up at a very low cost will be the standard. 

DW Cloud Economics – The Model

This post describe the model used to generate the content from the DW Cloud Economics – A Do Over post incase you want to check the numbers…

First, keep in mind that the numbers in this model are relative and indicative, not absolute. Here are the fixed parameters I used to generate these indicative numbers:

Cost per server per hour: $4.00 This could be any number or any currency and it would not matter.

Runtime per job with no contention: 3 hours This could be any duration and it would not change the relative results. If the number was well below an hour, it would increase the hour-based billing costs.

Jobs per Workload: 6 Again any number works.

Baseline Configuration: 24 Servers

Contention between 6 jobs on 24 Servers: 50% This sets the average amount of time a job will wait for resources. So, a job that runs for 3 hours with no contention will run in 4.5 hours with 50% contention. As the number of servers goes up the contention comes down for a fixed workload.

Baseline Performance: 2000 A completely arbitrary number that allows the model to calculate a relative price/performance. Relative “Performance” is calculated by dividing the Baseline Performance by the actual runtime. This means that a workload that takes 5 hours will yield a performance of 400 and a workload that take 2 hours will yield a performance number of 1000.

There are also a series of calculated values:

Total Servers / Workload is equal to Servers/Job for the Multi-tasking and Serial Execution configurations. It is equal to Servers/Job * Number of Jobs for the dedicated server per job configuration.

Scaleup is the ratio between the number of Total Servers / Workload and the Baseline Configuration. It is used to scale down the runtime and the contention.

Runtime / Job is the number of hours per job with no contention scaled down based on the Scaleup plus the time the Job is waiting due to contention. Contention scales down as the number of servers scales up.

Actual Runtime / Workload is Runtime / Job plus contention. Note that in the Serial Configuration and the Dedicated Server configuration there is no contention.

Paid Runtime / Workload is the Actual Runtime / Workload rounded to the nearest hour for the hourly billing table and equal to the Actual Runtime / Workload for the billing per minute table.

Cost / Workload is the Cost per Server per Hour parameter times the Paid Runtime / Hour value times the Total Servers / Workload value.

Price / Performance is the Cost / Workload divided by the Actual Runtime / Workload times the Baseline Performance parameter divided by the Cost / Workload value as noted above.

What is a Cloud-native Database?

Before this series is complete, I plan on defining in some detail what the various levels of Cloud-nativeness might be to allow readers to classify products based on architecture, not marketing. In this post, I’ll lay out some general concepts. First, let’s be real about cloud-things that are not cloud-native.

Any database that runs on physical hardware can run on virtual machines and, therefore, can run on virtual machines in the cloud. Databases with no cloud capabilities other than the ability to run on a VM on a cloud-provider are not cloud-native. Worse, there are lots of anecdotal stories that suggest that there are no meaningful savings to be had from moving a database from an on-premise server or VM to a cloud VM with no other change to take advantage of cloud elasticity.

So here are two general definitions for your consideration:

1) A cloud-native database will have one or more features that utilize capabilities found only on a cloud-computing platform, and

2) A cloud-native database will demonstrate economic benefits derived from those cloud-specific features.

Note that the way you pay for services via capital expenses (CapEx) or as operating expenses (OpEx) does not provide economic benefit. If the monetary costs of a subscription are more-or-less equal to the financial costs of a license, then savings are tied to tax law, not to economics. Beware of cloudy subscriptions that change how you pay without clearly adding beneficial cloudy features. It is these subscriptions that often are the source of the no-savings anecdotes mentioned above.

This next point is about the separation of storage from compute. Companies have long ago disconnected their databases from just-a-bunch-of-disks (JBOD) to shared storage such as SAN or S3. Any database today can use shared storage. It is not useful to say that any database that can use shared storage has separated storage from compute. Using the idea that there must be features, not marketing, that allow compute to scale separately and more-or-less dynamically from storage as the definition, we will be able to move forward in this area.

So:

3) For storage to be appropriately separated from Compute, it must be possible to scale Compute up and down dynamically.

Next, when compute scales, it scales at different granularity. An application or database that automatically adds and subtracts virtual machines provides different economics than a database that scales using containers. Apps that add and subtract containers have different economics than applications that use so-called serverless containers to scale. In this dimension, we will try to characterize granularity to account for the associated cloud economics. This topic will be covered in more detail later, so I’ll save the rule for that post.

Note that it is possible for database vendors to develop a granular architecture and to use the associated economics to their advantage. They may charge you for the time when any part of your database is running but be billed by their provider for smaller chunks. This is not an issue unless their overall costs become uncompetitive.

Last, and in some ways least, different products may charge for time in smaller or larger chunks. You might be charged by the hour, by the minute, by the second, or in smaller increments. Think about the scaling economics I suggested in the first posts of this thread. If you are charged by the hour, then there is no financial incentive to scale up to finish jobs to the minute. You will be charged the same for ten minutes or fifty minutes when you are charged by the hour. The rule:

4) Cloud databases that charge in smaller time increments are more economical than those that charge in larger increments.

The rationale here is probably obvious, but I’ll cover it in-depth in a later post.

With these concepts in place, we can discuss how architectural changes affect each aspect of the economics of databases in the cloud.

Note that this last sentence was written assuming that a British computer scientist with an erudite accent would speak it when they create the PBS series from these posts. Not.

By the way, a few posts from now, I am going to go back to some ideas I shared five years ago around the relationship between database processing and the underlying hardware platform. I’ll update this thinking with cloud computing in mind. You can find this thinking here which originally came from Jeff Dean and Peter Norvig (displayed in lots of places but here is one).

Database Super-computing

Today I am going to focus on a topic that I’ve suggested previously without the right emphasis: the new database architecture that uses vector processing on compressed columns to significantly accelerate performance.

The term “super-computing” was coined to describe the extreme hardware and software optimization developed to crunch numbers in scientific applications. As these technologies developed super-computer hardware evolved to leverage parallel microcomputers, software evolved to better leverage parallelism. Recently, microcomputers have started to incorporate the specialized instructions that support advanced mathematical applications. These super-computer instructions directly support vector algebra by manipulating strings of bits, vectors, in a single instruction. Finally, application developers recognized that these bit strings, these vectors, could be loaded into the microprocessors in a more effective manner to optimize their applications to the bare metal.

The effect of these optimizations accumulate for these applications as vectors compress and use memory more effectively, vectors load into processor cache more effectively, and vector instructions dramatically outperform integer instructions. The cumulative effect is that super-computer programs may be 10X-100X faster than commercial applications that provide the same result.

As this evolution progressed there was a similar evolution changing the architecture of database technology. Databases actually leveraged microcomputers before the high performance space made the move. But databases focused on the benefits of massively parallel I/O more than on the benefits of parallel compute. The drive to minimize the cost of I/O eventually led database developers to implement column store and then a very interesting discovery was made. Engineers recognized that a highly compressed column, a string of bits, could be processed as a vector.

Let’s see if we can make this 10X-100X number more than marketing foam. We can do this by roughly comparing the low-level processing of a chunk of data in integer and then in vector formats.

Let’s skip I/O processing and just focus on internals. This simplification greatly favors our integer DBMS. Keep in mind that the vector DBMS will process compressed vector data directly while the integer DBMS will expend resources to uncompress data and then take up 4X or more memory. This less efficient memory utilization will increase the chance that an I/O may be required and I/O is very expensive in the scenario we will discuss. Even an I/O on 1% of the time by the integer DBMS will provide a 1000X-100,000X advantage to the vector DBMS (see Figure 8 to gauge the latency to SSD or to disk).

Figure 8. Some Latency Metrics
Figure 8. Some Latency Metrics

So we’ll start with uncompressed integer data versus compressed vector data. We can assume that both databases are effective at populating cache. But the 4X compression advantage means that the vector processor is more likely to find data in the fast Level 1 cache and in the mid-range L2 cache. Given the characteristics outlined in Figure 8 we might suggest that the vector database is 4X more likely of finding data in cache than the integer database and that if we assume the latency of L2 cache as an estimate this results in a 15X-200X performance advantage.

Since data is in a vector form we can perform relational algebra and basic mathematics using vector algebra and vector addition. This provides another 8X-50X boost to the vector side

When we combine these advantages we see that a 10X-100X advantage is conservative. The bottom line is clear. A columnar database that effectively manages vectors into cache and further utilizes super-computing instructions will significantly out-perform an integer-based product.

The era of database super-computing has begun.

Some HANA and Intel Videos

Here are two videos of me speaking from the 2013 Intel Developer Forum FYI.

The first has some technical detail:

The second is more of a PR pitch about Intel Hadoop:

I’m working with Intel on a new video with a pretty interesting storyline (at least I hope that you find it interesting?)… so stay tuned.

Rob

Some Database Performance Concepts

I’m working on a new idea… it may or may not pan out… but here are some concepts for your consideration… with some thoughts on their performance implications.

First a reminder… a reality check. In my experience if you POC two databases at about the same price point…and one is 30% faster than the other, 1.3X, then 50% of the time the faster DBMS will win the business. If one DBMS is 2X faster… then it will win the business 90% of the time. The 10% where the faster product loses will be because of internal politics or, for an existing application, due to the migration costs. Note that IMO it is silly to do a POC if you know up front that you will not pick the winner.

Now to the concepts… Note that these are ballpark numbers to help you think about trade-offs…

102312_2108_HopFestP1Ex1.png

The latency to start fetching data from DRAM is 100 ns… from disk it is 10M ns. If we assume that a smart RDBMS pre-fetches 80% of the data into DRAM then we can assume that an in-memory DBMS has a 200,000X performance advantage over a disk-based system.

The latency to a Flash/SSD device is 100K-200K ns. With the same 80% pre-fetch assumption an in-memory DBMS will be 20,000X faster.

Note that neither of these models include data transfer times which will favor in-memory databases even more.

If we have a hybrid system with both disk and SSD and we assume that 90% of the reads hit the SSD and that both layers in the storage hierarchy achieve 80% pre-fetch then then the in-memory system will be 38,000X faster. If fewer than 90% of the reads hit the SSD, then the latency goes up quickly.

These numbers form the basis for selecting in-memory caches like Teradata’s Intelligent Memory option as well as in-memory offerings from IBM, Microsoft, Oracle and SAP.

For typical data warehouse workloads column compression will provide around a 2.5X performance boost over row compression. This has two implications: you will get 2.5X better performance using column storage and you will get 2.5X more data into the faster levels of your storage hierarchy… more in SSD and more in-memory.

If we assume that a typical query only touches 10% of the columns in the tables addressed… then column projection provides a 9X performance boost over a row store. Exadata does not support column projection in the storage layer… and other hybrid row-or-column systems provide it only for columnar tables.

If we assume that the average latency from the processor caches is 10ns (.5ns L1, 7ns, L2, 15ns L3) and the latency to DRAM is 100ns then an in-memory system which pre-fetches data effectively into the processor caches will be 10X faster than one which goes to DRAM. If we assume that a standard RDBMS which processes uncompressed standard data types (no vector processing) gets a 20% cache hit ratio then the advantage to a cache aware RDBMS which loads full cache lines is around 8X. HANA, BLU, and the Oracle in-memory products are cache aware and get this boost with some caveats.

BLU and the Oracle in-memory option are hybrid systems that often convert data to a row form for processing (see here for some data on Oracle). If we assume that they use the full columnar in-memory vector-based structures 50% of the time then these products will see a 4X performance boost. HANA recommends that all data be stored in a columnar form so it would often see the full 8x boost.

These vector-based processes also avoid the cost of decompression… and since they process compressed vector data they can fit more information into each cache line. There is another 20%-200% (1.2X-2X) boost here but I cannot estimate it closer than that.

Finally, the vector based processes use the high performance computing instruction sets (AVX2) offered on modern CPUs… and this provides another 10X+ boost. Again, BLU and Oracle will utilize the vector form less often than HANA so they will see a boost over products like Teradata… but not see as large a boost as HANA.

There are other features at play here… some products, like HANA, shard data in-memory to get all of the cores busy on each query. I have not been able to determine if BLU or Oracle in-memory are there yet? Note that this powerful feature will allow a single query to run as fast as possible… but the benefit is mitigated when there is a workload of multiple concurrent queries (if I have 4 cores and 4 queries running concurrently, one query per core, then the 4 queries will take only a little more time than if I run the 4 queries serially with each query using all 4 cores).

It is important to note that the Oracle In-memory option does not run in the storage component of an Exadata cluster… only on the RAC layer. It is unclear how this works with the in-memory option.

The bottom line is that in-memory systems are not all alike. You can sort of add up the multipliers to get in the ballpark on how much faster Teradata will be with the Intelligent Memory option… how much faster than that a hybrid row and vector-column system like BLU or Oracle In-Memory… and how much faster a pure in-memory system might be. One thing is for sure… these in-memory options will always make the difference in a POC… in every case including them or not will blow away the 2X rule I started with… and in every case the performance benefit will outweigh the extra cost… the price/performance is very likely to be there.

I know that is skipped my usual referencing so that you can see where I pulled these numbers from… but most of this information is buried in posts here and there on my blog… and as I stated up front… these are ballpark numbers. Hopefully you can see the sense behind them… but if you think I’m off please comment back and I’ll try to adjust…

CPUs and HW for HANA, BLU, Hekaton, and Oracle 12c

CPUs from retired computers waiting for recycl...
(Photo credit: Wikipedia)

This short post is intended to provide a quick warning regarding in-memory columnar and cpu requirements… with a longer post to follow.

When a row is inserted or bulk-loaded into a DBMS, if there are no indexes, the amount of cpu required is very small. The majority of the time is spent committing a transaction is the time to write a log record to persist the data.

When the same record is reformatted into a column the amount of processing required is significantly higher. The data must be parsed into columns, the values must be compressed, dictionaries may be updated, and the breadcrumbs that let the columnar data be regenerated into rows must be laid. Further, if the columnar structure are to be optimized then the data must be ordered… with a sort or some kind of index structure. I have seen academic papers that suggest that for an insert columnar processing may be 100X more than row processing… and you can see why this could be true (I apologize for not finding the reference… I’ll dig it up… as I recall I read it in a post some time back by Daniel Abadi).

Now let’s think about this… several vendors are suggesting that you can deploy their columnar features with no changes required… no new hardware… in-place. But this does not ring true if the new columnar feature requires 100X extra CPU cycles per row… or 50X… or 10X… unless you are running your database on an empty server.

This claim is a shot at SAP who, more honestly, suggests new hardware with high-end processors for their in-memory columnar product… but methinks it is marketing, not architecture, from these other folks.

Cloud DBMS < High Performance DBMS

English: Cloud
English: Cloud (Photo credit: Wikipedia)

In my post here I suggested that database computing was becoming a special case of high-performance computing. This trend will bump up against the trend towards cloud computing and the bump will be noisy.

In the case of general commercial computing customers running cloudy virtualized servers paid a 5%-20% performance penalty… but the economics still worked for the cloud side.

For high-performance database computing it is unclear how much the penalty will be? If a virtualized, cloudy, database gives up performance because SIMD becomes problematic, priming the cache becomes hard, CPU stalls become more common, and there is a move from a shared nothing architecture to SANs or SAN-like shared data devices, then the penalty may be 300%-500% and the cloud databases will likely lose.

As I noted in the series starting here, there are lots of issues around high-performance database computing in the cloud. It will be interesting to see how the database vendors manage the bump and the noise. So keep an eye out. If your database of choice starts to look cloudy… if it becomes virtualized and it starts moving from a shared-nothing cluster to a SAN… then you will know which side of the bump they are betting on. And if they pick the cloudy side then you need to ask how they plan to architect the system to hold the penalty to under 20%…

I also mentioned in that series that in-memory databases had an advantage over peripheral-based databases as they did not have to pay a penalty for de-coupling the IO bandwidth that is part of a shared-nothing cluster. But even those vendors have to manage the fact that the database is abstracted… virtualized… away from the hardware.

If I were King I would develop a high-performance database that implemented the features of a cloud database: elasticity, easy provisioning, multi-tenancy; over bare metal. Then you might get the best of both worlds.

Database Computing is Supercomputing… Some external reading: May 2013

Superman: Doomsday & Beyond
Superman: Doomsday & Beyond (Photo credit: Wikipedia)

I would like to recommend to you John Appleby’s post  here on the HANA blog site. While the title suggests the article is about HANA, in fact it is about trends in computing and processors… and very relevant to posts here past, present, and upcoming…

I would also recommend Curt Monash’s site. His notes on Teradata here mirror my observation that a 30%-50% performance boost per release cycle is the target for most commercial databases… and what wins in the general market. This is why the in-memory capabilities offered by HANA and maybe DB2 BLU are so disruptive. These products should offer way more than that… not 1.5X but 100X in some instances.

Finally I recommend “What Every Programmer Should Know About Memory” by Ulrich Drepper here. This paper provides a great foundation for the deep hardware topics to come.

Database computing is becoming a special case, a commercial case, of supercomputing… high-performance computing (HPC) to those less inclined to superlatives. Over the next few years the differentiation between products will increasingly be due to the use of high-performance computing techniques: in-memory techniques, vector processing, massive parallelism, and use of HPC instruction sets.

This may help you to get ready…

Teradata CPU Planning

I suggested here that Teradata shipped the EDW 6700 series without waiting for Ivy Bridge because they could not use the cores effectively… but it could be that Haswell (see here) fit their release schedule better. It will be interesting to see whether they can use all of the cores then?