In this post I will set up the next post by reminding you of the numbers every programmer should know. The picture shows the latency to access data across the three levels of cache in a modern processor, across the memory bus to DRAM, and then out to an SSD or spinning disk drive.
These numbers are the key to in-memory database performance. L1 cache is expensive and small. All data and all instructions are fed to the core through the L1 cache, and typically the instruction and the data it will operate on are each fetched independently. If the data (or instruction… but I will just talk about the data from here on out) is not found in the small L1 cache, the L2 cache is searched. Likewise, if the data is not in the L2 cache, the L3 cache is searched, and then DRAM. As you can see, there is a 200X performance difference between a fetch from L1 cache and a fetch from DRAM. If the data must be fetched from SSD or disk, there is a 1000X penalty.
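If you want to see the gap yourself, the classic demonstration is a pointer-chasing microbenchmark: every load depends on the one before it, so the hardware cannot overlap the fetches. Here is a minimal sketch in C under assumptions I chose for illustration (the array sizes, the random-cycle trick, POSIX timing). On typical hardware the small working set resolves in a nanosecond or two per hop, while the large one costs on the order of 100ns per hop… roughly the 200X penalty in the picture.

```c
/* Pointer-chasing latency sketch. Each load depends on the previous one,
 * so the latency of every hop is fully exposed. Sizes and hop counts are
 * arbitrary choices for illustration. Compile: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static size_t rnd(size_t bound) {
    /* crude combination of two rand() calls; good enough for a sketch */
    return (((size_t)rand() << 16) ^ (size_t)rand()) % bound;
}

static double chase(size_t n_elems, size_t hops) {
    size_t *next = malloc(n_elems * sizeof(size_t));
    if (!next) return -1.0;
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: a single random cycle the prefetcher can't predict */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = rnd(i);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < hops; i++) p = next[p];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p; (void)sink;           /* keep the loop alive */
    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)hops;
}

int main(void) {
    const size_t hops = 20 * 1000 * 1000;
    /* ~32KB fits in a typical L1; ~256MB forces most hops out to DRAM */
    printf("L1-sized   (~32KB):  %5.1f ns/hop\n",
           chase(32 * 1024 / sizeof(size_t), hops));
    printf("DRAM-sized (~256MB): %5.1f ns/hop\n",
           chase(256 * 1024 * 1024 / sizeof(size_t), hops));
    return 0;
}
```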
Databases work to mitigate the 1000X penalty by pre-fetching data into DRAM. Modern processors mitigate the penalty between cache and DRAM by trying to anticipate the data required and pre-fetching it into L3 cache. This is a gamble, and the likelihood of winning the pre-fetch bet varies with your DBMS and with the workload. More simultaneous queries against many tables put pressure on memory and lessen the effectiveness of data pre-fetch.
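Database code can place the same bet explicitly. Here is a minimal sketch using the __builtin_prefetch hint available in GCC and Clang; the row struct and the prefetch distance are assumptions I made up for the example.

```c
/* Explicit prefetch sketch (GCC/Clang builtin). While we work on row i,
 * we hint the memory system to start fetching row i + DIST, hoping the
 * fetch completes before the row is needed. DIST is a tuning assumption. */
#include <stddef.h>

#define DIST 16  /* prefetch distance in rows; workload-dependent */

typedef struct { long key; long value; } row_t;

long sum_values(const row_t *rows, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&rows[i + DIST], 0 /* read */, 3 /* high locality */);
        total += rows[i].value;
    }
    return total;
}
```

To be fair, a plain sequential scan like this one is exactly what the hardware prefetcher already predicts well; explicit hints tend to pay off on irregular access patterns like hash-join probes, which is part of why the odds of winning the bet vary with the workload.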
Databases also mitigate the 1000X penalty by compressing data so that each fetch of a block from disk captures more data. For tabular (not columnar) data, each block must be decompressed before it can be used. Decompression steals CPU cycles from query processing but mitigates some of the cost of I/O to disk. Columnar databases fetch compressed data and, if the database uses modern vector processing techniques, operate directly on the compressed vector data without decompression. This is especially powerful, as these vectorized columnar databases can push compressed data into cache and use that memory more effectively.
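To make "operate directly on the compressed data" concrete, here is a minimal sketch of a predicate evaluated over a dictionary-encoded column. The names, the 16-bit codes, and the single equality predicate are assumptions for illustration; real engines layer run-length encoding, bit-packing, and more on top. The point is that the scan compares small integer codes and never reconstructs the original values, so each cache line carries far more of the column.

```c
/* Dictionary-encoded column scan sketch. The column stores 16-bit codes;
 * the dictionary maps codes back to strings. To evaluate
 *   WHERE country = 'DE'
 * we look the literal up in the dictionary once, then compare codes only.
 * Names and types here are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char **values;   /* code -> string */
    size_t      count;
} dict_t;

/* Find the code for a literal; -1 means the predicate matches nothing. */
static int dict_lookup(const dict_t *d, const char *literal) {
    for (size_t i = 0; i < d->count; i++)
        if (strcmp(d->values[i], literal) == 0) return (int)i;
    return -1;
}

/* Count matching rows without decompressing a single value. */
size_t count_equal(const uint16_t *codes, size_t n_rows,
                   const dict_t *dict, const char *literal) {
    int target = dict_lookup(dict, literal);
    if (target < 0) return 0;
    size_t hits = 0;
    for (size_t i = 0; i < n_rows; i++)
        hits += (codes[i] == (uint16_t)target);  /* pure integer compare */
    return hits;
}
```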
Vectorized data can take advantage of the supercomputing instruction sets that include SIMD instructions. SIMD stands for Single Instruction, Multiple Data; these instructions fetch a single instruction from the L1 cache and reuse it on the vectorized data over and over without pausing to fetch new instructions. Note that in a modern processor, waiting for instructions or data to be fetched from cache or DRAM appears as CPU busy time. When the processor is stalled waiting for data to be fetched from disk, the query/job gives up the CPU and a new query/job is dispatched.
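Here is what that looks like in code: a sketch of the same code-comparison scan using AVX2 intrinsics (my choice of instruction set for illustration; compile with -mavx2 on GCC or Clang). One compare instruction tests sixteen 16-bit codes at a time, and the loop reuses it over and over without fetching new instructions.

```c
/* SIMD sketch of the dictionary-code scan (AVX2, an illustrative choice).
 * _mm256_cmpeq_epi16 compares 16 16-bit codes with one instruction. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

size_t count_equal_simd(const uint16_t *codes, size_t n_rows, uint16_t target) {
    __m256i needle = _mm256_set1_epi16((short)target);  /* broadcast target */
    size_t hits = 0, i = 0;
    for (; i + 16 <= n_rows; i += 16) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(codes + i));
        __m256i eq = _mm256_cmpeq_epi16(v, needle);     /* 0xFFFF per match */
        /* movemask yields one bit per byte; each 16-bit match sets 2 bits */
        unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
        hits += (size_t)__builtin_popcount(mask) / 2;
    }
    for (; i < n_rows; i++)  /* scalar tail */
        hits += (codes[i] == target);
    return hits;
}
```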
Finally, for the longer-running queries associated with analytics and business intelligence (where long-running means a few seconds or more… "long" in CPU instruction time), it is highly likely that the L1 and L2 cache in each core will be flushed and new data will be fetched. It is even likely that all the data in L3 cache will be flushed. In this case, most of the goodness associated with compression and vectorization is flushed with it, and the vectors must be fetched again from L3 and DRAM. Further, note that when you run in a virtual machine or in the cloud, you may think that you are the only user, but other virtual machines, other containers, or other serverless processes may be flushing you out of memory all the time.
This is important. Until a query is swapped out for a new one, it looks like in-memory vector processing might be 200X-1000X faster than a tabular query; but in an environment where tens or hundreds of queries run concurrently, the pressure on memory keeps flushing the cache, and the advantage of vectorized query processing is reduced. Reduced, not eliminated. We would still expect the advantages of compressed data in cache and SIMD supercomputing instructions to provide much more than a 10X speedup.
As mentioned, this post is a detailed review of material I covered years ago when HANA was introduced. In the next post I will add some new thinking.
Here is more information on processor architecture and cache usage.