Database Supercomputing in the Cloud

This post will combine the information in the last two posts to show how the cloud might be used as a database supercomputer at no extra cost. 

To quickly recap: in the post here, you saw a simple model that shows how a cluster dedicated to a single query uses the on-demand resources of cloud computing to provide a 10X-100X performance boost over a multitasking cluster running many queries at once. Note that this model did not account for the start-up time of spinning up a new cluster instance for each query, but we assume that there is still a significant saving for queries that run for several minutes. Further, we believe that queries can be queued and run one at a time on clusters to reduce the start-up costs whenever there is work queued. 

The next post here reminded us of the cost of populating cache from DRAM whenever there is a multitask context switch from one query to the next. You also saw how supercomputing SIMD instructions further accelerate processing. Finally, we discussed how modern columnar databases store data as vectors in a way that allows compressed columnar data to be loaded directly into cache to optimize the use of that memory. 

One last note: all the cores share the L3 cache in a chip. On a 4-CPU processor, this means that you could run 4 parallel processes, all on compressed vectorized data. 

So how might the cloud help here? First, if we dedicate a cluster to query processing where “dedicate” means that the hardware is dedicated to a single queue of queries, then once the query data is loaded in memory and the cache is primed, the performance can be 1000X the speed of an alternative configuration where the hardware is shared both by a multitasking DBMS and that DBMS is sharing the cluster with other cloud workloads dispatched by the cloud operating system. Better still, since in the cloud using fine-grained billing, you pay only for what you use, so this extra performance is nearly free. 

When SAP HANA was introduced, SAP published amazing 1000X benchmarks using in-memory supercomputing instructions. Unfortunately, the technology did not quite exist to deploy queries on-demand on tens or hundreds or thousands of dedicated cloud computers. I imagine that we are not too far from this today. 

The bottom line is that multitasking has never been efficient. When you swap context in and out of cache, you are just wasting resources. It made sense when computing was expensive, and you swapped context to take advantage of the time it took to perform I/O to disk or tape. For analytic workloads where the data is columnar and vectorized, you can fit multiple terabytes of compressed data in memory and stream it to supercomputing instructions quickly.  

I imagine a time when simple, very high performance, single-threaded database processing units, similar in architecture to GPUs, will handle queries in a dedicated manner, fetching vectorized data from products like Apache Arrow, and 1000X speed-up at a very low cost will be the standard. 

Numbers Every Database Professional Should Know

Latency Estimates in a Processor

In this post I will setup the next post by reminding you of these numbers every programmer should know. The picture shows the latency to access data across the three levels of cache in a modern processor, and across the memory bus to DRAM, and then across to an SSD or spinning disk drive. 

These numbers are the key to in-memory database performance. L1 cache is expensive and small. All data and all instructions are fed to the core through the L1 cache and typically the instruction and the data it will operate on are each fetched independently. If the data (or instruction… but I will just talk about the data from here on out) is not found in the small L1 cache the L2 cache is searched. Likewise, if the data is not in the L2 cache the L3 cache is searched and then DRAM. As you can see there is a 200X performance difference between a fetch from L1 cache and a fetch from DRAM. If the data must be fetched from SSD or disk, there is a 1000X penalty. 

Databases work to mitigate the 1000X penalty by pre-fetching data into DRAM. Modern processors mitigate the penalty between cache and DRAM by trying to anticipate the data required and pre-fetching it into L3 cache. This is a gamble and the likelihood of winning the pre-fetch bet varies with your DBMS and with the workload. More simultaneous queries against many tables puts pressure on the memory and lessens the effectiveness of data pre-fetch. 

Databases also mitigate the 1000X penalty by compressing data so that each fetch of a block from disk captures more data. For tabular (not columnar) data each block must be decompressed before it can be used. Decompression steals CPU cycles from query processing but mitigates some of the cost of I/O to disk. Columnar databases fetch compressed data and, if the databases use modern vector processing techniques, they operate directly on the compressed vector data without decompression. This is especially powerful as these vectorized columnar databases can push compressed data into cache and use that memory more effectively. 

Vectorized data can use the supercomputing instruction set that includes SIMD instructions. SIMD stands for Single Instruction; Multiple Data and these instructions fetch a single instruction from L1 cache and reuse it on the vectorized data over and over with pausing to fetch new instructions. Note that in a modern processor waiting for instructions or data to be fetched from cache or DRAM appears as CPU busy time. When the processor is stalled waiting for data to be fetched from disk the query/job gives up the CPU and dispatches a new query/job. 

Finally, for the longer-running queries associated with analytics and business intelligence (where long-running means a few seconds or more… “long” in CPU instruction time), it is highly likely that the L1 and L2 cache in each core will be flushed and new data will be fetched. It is even likely that all the data in L3 cache will be flushed. In this case most of the goodness associated with compression and vectorization is flushed with it. The vectors must be fetched again from L3 and DRAM. Further, note that when you run in a virtual machine or in the cloud you may think that you are the only user but other virtual machines other containers, or other serverless processes may be flushing you out of memory all the time. 

This is important. Until we swapped out a query for a new one it looked like in-memory vector processing might be 200X-1000X faster than a tabular query; but in an environment where tens or hundreds of queries are running concurrently the pressure on memory keeps flushing the cache and the advantage of vectorized query processing is reduced. Reduced, not eliminated. We would still expect the advantages of compressed data in cache and SIMD supercomputing instructions to provide much more than a 10X speedup. 

As mentioned, this post is a detailed review of material I covered years ago when HANA was introduced. In the next post I will add some new thinking. 

Here is more information on processor architecture and cache usage.

Database Super-computing

Today I am going to focus on a topic that I’ve suggested previously without the right emphasis: the new database architecture that uses vector processing on compressed columns to significantly accelerate performance.

The term “super-computing” was coined to describe the extreme hardware and software optimization developed to crunch numbers in scientific applications. As these technologies developed super-computer hardware evolved to leverage parallel microcomputers, software evolved to better leverage parallelism. Recently, microcomputers have started to incorporate the specialized instructions that support advanced mathematical applications. These super-computer instructions directly support vector algebra by manipulating strings of bits, vectors, in a single instruction. Finally, application developers recognized that these bit strings, these vectors, could be loaded into the microprocessors in a more effective manner to optimize their applications to the bare metal.

The effect of these optimizations accumulate for these applications as vectors compress and use memory more effectively, vectors load into processor cache more effectively, and vector instructions dramatically outperform integer instructions. The cumulative effect is that super-computer programs may be 10X-100X faster than commercial applications that provide the same result.

As this evolution progressed there was a similar evolution changing the architecture of database technology. Databases actually leveraged microcomputers before the high performance space made the move. But databases focused on the benefits of massively parallel I/O more than on the benefits of parallel compute. The drive to minimize the cost of I/O eventually led database developers to implement column store and then a very interesting discovery was made. Engineers recognized that a highly compressed column, a string of bits, could be processed as a vector.

Let’s see if we can make this 10X-100X number more than marketing foam. We can do this by roughly comparing the low-level processing of a chunk of data in integer and then in vector formats.

Let’s skip I/O processing and just focus on internals. This simplification greatly favors our integer DBMS. Keep in mind that the vector DBMS will process compressed vector data directly while the integer DBMS will expend resources to uncompress data and then take up 4X or more memory. This less efficient memory utilization will increase the chance that an I/O may be required and I/O is very expensive in the scenario we will discuss. Even an I/O on 1% of the time by the integer DBMS will provide a 1000X-100,000X advantage to the vector DBMS (see Figure 8 to gauge the latency to SSD or to disk).

Figure 8. Some Latency Metrics
Figure 8. Some Latency Metrics

So we’ll start with uncompressed integer data versus compressed vector data. We can assume that both databases are effective at populating cache. But the 4X compression advantage means that the vector processor is more likely to find data in the fast Level 1 cache and in the mid-range L2 cache. Given the characteristics outlined in Figure 8 we might suggest that the vector database is 4X more likely of finding data in cache than the integer database and that if we assume the latency of L2 cache as an estimate this results in a 15X-200X performance advantage.

Since data is in a vector form we can perform relational algebra and basic mathematics using vector algebra and vector addition. This provides another 8X-50X boost to the vector side

When we combine these advantages we see that a 10X-100X advantage is conservative. The bottom line is clear. A columnar database that effectively manages vectors into cache and further utilizes super-computing instructions will significantly out-perform an integer-based product.

The era of database super-computing has begun.

How DBMS Vendors Admit to an Architectural Limitation: Part 2 – Teradata Intelligent Memory

This is the second post (see Part 1 here) on how vendors adjust their architecture without admitting that the previous architecture was flawed. This time we’ll consider Teradata and in-memory….

When SAP HANA appeared Teradata went on the warpath with a series of posts and statements that were pointed but oddly miscued (see the references below). According to the posts in-memory was unnecessary and SAP was on a misguided journey.

Then Teradata announced Intelligent Memory and in-memory was cool. This is pretty close to an admission that SAP was right and Teradata was wrong. The numbers which drove Teradata here are compelling… 100K-200K ns to access an SSD device or 100 ns to access DRAM… a 1000X reduction… and the latency to disk is 100X worse than SSD.

Intelligent Memory was announced shortly after the release of Teradata’s columnar table type. Column-orientation is important because you need a powerful approach to compression to effectively use an expensive memory resource… and columnar provides this. But Teradata, like Greenplum, extended a row-based engine to support columns in order to get to market quick… they hoped to get 80% of the effectiveness of in-memory with only 20% of the engineering effort. The other 20% comes when you develop a new engine that fully exploits the advantages of a columnar architecture. These advanced exploits allow HANA, DB2 BLU, and Oracle 12c to execute directly on columnar data thereby avoiding decompression, fully utilizing the processor caches, and allowing sets to be operated on by super-computing vector-processing instructions. In fact, Teradata really applied the 50/20 rule… they gained 50%, maybe only 40%, of the benefits with their columnar and Intelligent Memory features… but it was easy to deploy what is in-effect an in-memory cache over their existing relational engine.

Please don’t jump to the wrong conclusion here… Intelligent Memory is a strong product. If you were to put hot data in memory, cool data in Teradata-on-SSD-or-Disk, and cold data in Hadoop and manage them as one EDW you could deploy a very cost-effective platform (see here).

Still, Teradata with Intelligent Memory is not likely to compete effectively against HANA, BLU, or 12c for raw performance… so there will be some marketing foam attached and an appeal for Teradata shops to avoid database apostasy and stick with them. You can see some of the foam in the articles below.

A quick aside here… generally a DBMS should win or lose based on price/performance. The ANSI standard makes a products features nearly, not completely but nearly, irrelevant. If you cannot win on price/performance then you blow foam. When any vendor starts talking about things like TCO you should grab your wallets… it is an appeal to foaminess to hide a weakness. I’m not calling out Teradata here… this is general warning that applies to every software vendor.

Intelligent Memory is a smart move. While it may not win in a head-to-head POC… it will be close-ish… close enough to keep the congregation in their pews. As readers know, I am not a big fan of technical religiosity… being a “Teradata-shop” is lazy… engineers we should pick the best solution and learn it. The tiered approach mentioned three paragraphs up is a good solution and non-Teradata shops should be considering it… but Teradata shops should be open to new technology as well. Still, we should pick new technology with a sensitivity to the cost of a migration… and in many cases Intelligent Memory will save business for Teradata by getting just close enough to make migration a bad trade-off. This is why it was so smart.

Back to the theme of these posts… Teradata back-tracked on the value of in-memory… and in the process admitted-without-admitting a shortcoming in their architecture. So it goes…

Next we will consider whether you should be building data warehouses on z/OS using DB2 or the DB2 Analytics Accelerator aka Netezza.

References

Database Fog Blog

Other

Key Values and Key-Value Stores and In-memory Databases

Back to more geeky topics… although my Mom loved the videos…

When very high performance is required to return key performance metrics derived from large volumes of data to a very large number of clients… in other words when volume and velocity are factors and the results are to be delivered to thousands of users (I suppose that I could conjure up a clever V here… but the cleverness would come from the silliness of the semantic stretch… so I’ll leave it to you all to have some pun); the conventional approach has been to pre-compute the results into a set of values that can be fetched by key. In other words, we build a pre-aggregated and pre-joined result table that provides answers to a single query template. Conventionally building this query-specific result table has been the only way to solve these big problems.

Further, we conventionally store these key-value results in a relational DBMS and fetch a row at a time providing pretty darn good performance. Sometimes pretty darn good is not good enough. So there are new options. Key-value data stores may well offer a solution that provides performance and scale at a price well below what has been conventional to date. This is well and good.

But I would like to challenge the conventional thinking a little. The process of joining and aggregating volumes of data into results is a BI process. For the last twenty years BI practitioners have been building pre-aggregated tables and data marts to solve these same problems… maybe not at scale… and this practice has proven to be very expensive, the opposite of agile, and unsustainable. The people costs to develop and support multiple pre-computed replicas is exorbitant. The lack of flexibility that comes from imposing a longish development project over what is essentially an aggregate BI query is constraining our enterprise and our customers.

A better approach is to use the new high performance database products: HANA, BLU, Oracle 12c, or maybe Spark; to aggregate on-demand. Included in this approach is a requirement to use the new high performance database computing platforms available to house the databases.

Consider this… an in-memory DBMS can aggregate 12M rows/sec/core. It can scan 3MB/msec/core. Companies like SGI and HP are ganging processors together so that you can buy a single node that contains 32, 64, or 128 cores… and this number will go up. A 64-core server will aggregate 768M rows/sec and scan 19.2TB/sec… and you can gang a small number of nodes together and scale out.

Providing an extensible BI platform for big data is so much easier than building single-query key-value clusters… there is much less risk… and the agility and TCO make it close to a no-brainer. We just have to re-think the approach we’ve used for 20 years and let the new software and hardware do the work.

Some Database Performance Concepts

I’m working on a new idea… it may or may not pan out… but here are some concepts for your consideration… with some thoughts on their performance implications.

First a reminder… a reality check. In my experience if you POC two databases at about the same price point…and one is 30% faster than the other, 1.3X, then 50% of the time the faster DBMS will win the business. If one DBMS is 2X faster… then it will win the business 90% of the time. The 10% where the faster product loses will be because of internal politics or, for an existing application, due to the migration costs. Note that IMO it is silly to do a POC if you know up front that you will not pick the winner.

Now to the concepts… Note that these are ballpark numbers to help you think about trade-offs…

102312_2108_HopFestP1Ex1.png

The latency to start fetching data from DRAM is 100 ns… from disk it is 10M ns. If we assume that a smart RDBMS pre-fetches 80% of the data into DRAM then we can assume that an in-memory DBMS has a 200,000X performance advantage over a disk-based system.

The latency to a Flash/SSD device is 100K-200K ns. With the same 80% pre-fetch assumption an in-memory DBMS will be 20,000X faster.

Note that neither of these models include data transfer times which will favor in-memory databases even more.

If we have a hybrid system with both disk and SSD and we assume that 90% of the reads hit the SSD and that both layers in the storage hierarchy achieve 80% pre-fetch then then the in-memory system will be 38,000X faster. If fewer than 90% of the reads hit the SSD, then the latency goes up quickly.

These numbers form the basis for selecting in-memory caches like Teradata’s Intelligent Memory option as well as in-memory offerings from IBM, Microsoft, Oracle and SAP.

For typical data warehouse workloads column compression will provide around a 2.5X performance boost over row compression. This has two implications: you will get 2.5X better performance using column storage and you will get 2.5X more data into the faster levels of your storage hierarchy… more in SSD and more in-memory.

If we assume that a typical query only touches 10% of the columns in the tables addressed… then column projection provides a 9X performance boost over a row store. Exadata does not support column projection in the storage layer… and other hybrid row-or-column systems provide it only for columnar tables.

If we assume that the average latency from the processor caches is 10ns (.5ns L1, 7ns, L2, 15ns L3) and the latency to DRAM is 100ns then an in-memory system which pre-fetches data effectively into the processor caches will be 10X faster than one which goes to DRAM. If we assume that a standard RDBMS which processes uncompressed standard data types (no vector processing) gets a 20% cache hit ratio then the advantage to a cache aware RDBMS which loads full cache lines is around 8X. HANA, BLU, and the Oracle in-memory products are cache aware and get this boost with some caveats.

BLU and the Oracle in-memory option are hybrid systems that often convert data to a row form for processing (see here for some data on Oracle). If we assume that they use the full columnar in-memory vector-based structures 50% of the time then these products will see a 4X performance boost. HANA recommends that all data be stored in a columnar form so it would often see the full 8x boost.

These vector-based processes also avoid the cost of decompression… and since they process compressed vector data they can fit more information into each cache line. There is another 20%-200% (1.2X-2X) boost here but I cannot estimate it closer than that.

Finally, the vector based processes use the high performance computing instruction sets (AVX2) offered on modern CPUs… and this provides another 10X+ boost. Again, BLU and Oracle will utilize the vector form less often than HANA so they will see a boost over products like Teradata… but not see as large a boost as HANA.

There are other features at play here… some products, like HANA, shard data in-memory to get all of the cores busy on each query. I have not been able to determine if BLU or Oracle in-memory are there yet? Note that this powerful feature will allow a single query to run as fast as possible… but the benefit is mitigated when there is a workload of multiple concurrent queries (if I have 4 cores and 4 queries running concurrently, one query per core, then the 4 queries will take only a little more time than if I run the 4 queries serially with each query using all 4 cores).

It is important to note that the Oracle In-memory option does not run in the storage component of an Exadata cluster… only on the RAC layer. It is unclear how this works with the in-memory option.

The bottom line is that in-memory systems are not all alike. You can sort of add up the multipliers to get in the ballpark on how much faster Teradata will be with the Intelligent Memory option… how much faster than that a hybrid row and vector-column system like BLU or Oracle In-Memory… and how much faster a pure in-memory system might be. One thing is for sure… these in-memory options will always make the difference in a POC… in every case including them or not will blow away the 2X rule I started with… and in every case the performance benefit will outweigh the extra cost… the price/performance is very likely to be there.

I know that is skipped my usual referencing so that you can see where I pulled these numbers from… but most of this information is buried in posts here and there on my blog… and as I stated up front… these are ballpark numbers. Hopefully you can see the sense behind them… but if you think I’m off please comment back and I’ll try to adjust…

HANA, BLU, Hekaton, and Oracle 12c vs. Teradata and Greenplum – November 2013

Catch Me If You Can (musical)
(Photo credit: Wikipedia)

I would like to point out a very important section in the paper on Hekaton on the Microsoft Research site here. I will quote the section in total:

2. DESIGN CONSIDERATIONS 

An analysis done early on in the project drove home the fact that a 10-100X throughput improvement cannot be achieved by optimizing existing SQL Server mechanisms. Throughput can be increased in three ways: improving scalability, improving CPI (cycles per instruction), and reducing the number of instructions executed per request. The analysis showed that, even under highly optimistic assumptions, improving scalability and CPI can produce only a 3-4X improvement. The detailed analysis is included as an appendix. 

The only real hope is to reduce the number of instructions executed but the reduction needs to be dramatic. To go 10X faster, the engine must execute 90% fewer instructions and yet still get the work done. To go 100X faster, it must execute 99% fewer instructions. This level of improvement is not feasible by optimizing existing storage and execution mechanisms. Reaching the 10-100X goal requires a much more efficient way to store and process data. 

This is important because it confirms the difference in a Level 3 and a Level 2 columnar implementation as described here. It is just not possible for a Level 2 implementation with a row-based join engine to achieve the performance of a Level 3 implementation. This will allow the Level 3 implementations: HANA, BLU, Hekaton, and Oracle 12c to distance themselves from the Level 2 products: Teradata and Greenplum; by more than 10X… and this is a very significant advantage.

Related articles

CPUs and HW for HANA, BLU, Hekaton, and Oracle 12c

CPUs from retired computers waiting for recycl...
(Photo credit: Wikipedia)

This short post is intended to provide a quick warning regarding in-memory columnar and cpu requirements… with a longer post to follow.

When a row is inserted or bulk-loaded into a DBMS, if there are no indexes, the amount of cpu required is very small. The majority of the time is spent committing a transaction is the time to write a log record to persist the data.

When the same record is reformatted into a column the amount of processing required is significantly higher. The data must be parsed into columns, the values must be compressed, dictionaries may be updated, and the breadcrumbs that let the columnar data be regenerated into rows must be laid. Further, if the columnar structure are to be optimized then the data must be ordered… with a sort or some kind of index structure. I have seen academic papers that suggest that for an insert columnar processing may be 100X more than row processing… and you can see why this could be true (I apologize for not finding the reference… I’ll dig it up… as I recall I read it in a post some time back by Daniel Abadi).

Now let’s think about this… several vendors are suggesting that you can deploy their columnar features with no changes required… no new hardware… in-place. But this does not ring true if the new columnar feature requires 100X extra CPU cycles per row… or 50X… or 10X… unless you are running your database on an empty server.

This claim is a shot at SAP who, more honestly, suggests new hardware with high-end processors for their in-memory columnar product… but methinks it is marketing, not architecture, from these other folks.

BLU Meanies: Data In-memory

Cover of "Sgt. Pepper's Lonely Hearts Clu...
Cover via Amazon

IBM is presenting a DB2 Tech Talk that compares the BLU Accelerator to HANA. There are several mistakes and some odd thinking in the pitch so let me address the issues as a way to explain some things about HANA and about BLU. This blog will consider what data needs to be in-memory.

IBM like several others, continues to repeat a talking point along the lines of: “We believe that you should not have to fit all of you active data in memory…”. Let’s think about this…

Note that in the current release HANA has a constraint that all of the data in a single column, the entire vector that represents the data in that column, must be in-memory before it can be operated on. If the table is partitioned and partition-elimination is applied then the data in the partition for the column must be loaded in-memory. This is a real constraint that will be removed in a subsequent release… but it is not a very severe constraint if you think about it.

But let’s be clear… HANA does not require all data to be in-memory… it will read data from peripheral devices in and out as required just as BLU does.

Now what does this mean? Let’s walk through some scenarios.

First, let’s imagine a customer with 10TB of user data, per the scenario IBM discusses. Let’s not get into a whose product compresses better discussion and assume that both BLU and HANA will get 4X compression… so there is 2.5TB of user data to be processed.

Now let’s imagine a system with only a very little memory available for data. In other words, let’s configure both BLU and HANA so that they are full columnar databases, but not in-memory databases. In this case BLU would operate by doing constant I/O without constraint and HANA would fail whenever it could not fit a required column in memory. Note that HANA might not fail at all… it would depend on whether there was a large single un-partitioned column that was required.

This scenario is really silly though… HANA is an in-memory database, designed to keep data in-memory from the start… so SAP would not support this imaginary configuration. The fact that you could make BLU work out of memory is not really relevant as nowhere does IBM position, or reference, BLU as a disk-based column store add-on… you would just use DB2.

Now let’s configure a system to IBM’s specification with 400GB of memory. IBM does not really say how much of this memory is available to BLU for data… but for the sake of argument let’s ignore the system requirements and assume that BLU uses one-half, 200GB, as work space to process queries so that 200GB is available to store data in-memory. As you will see it does not really matter in this argument whether I am spot on here or not. So using IBM’s recommendation there is now a 200GB cache that can be used as data is paged in and out. Anyone who has ever used a data warehouse knows that caching does not work well for BI queries as each query touches large enough volumes of the data to flush the cache… so BLU will effectively be performing I/O for most queries and is back to being an out-of-memory columnar database. Note that this flushing issue is why the in-memory capabilities from Oracle and Teradata pin certain tables into memory. In this scenario HANA will operate exactly as BLU does with the constraint that any single column that in a compressed form exceeds 200GB will not be able to be processed.

Finally let’s configure a system with 5TB of memory per SAP’s recommendation for HANA. In this case BLU and HANA both fit all of the data in-memory… with 2.5TB of compressed user data in and 2.5TB of work space… and there is no I/O. This is an in-memory DBMS.

But according to the IBM Power 770 spec (here) there is no way to get 5TB of memory on a single p770 node… so to match HANA and eliminate all I/O they would require two nodes… but BLU cannot be deployed on a cluster… so on they would have to deploy on a single node and perform I/O on 20% of the data. The latency for SSD I/O is 200Kns and for disk it is 10Mns… for DRAM it is 100ns and HANA loads full cache lines so that the average latency is under 20ns… so the penalty paid by BLU is severe and it will never keep up with HANA.

There is more bunk around recommendations for the number of cores but I can make no sense of it at all so I do not know where to begin to debunk it. SAP recommends high-end Intel servers to run HANA. In the scenario above we would recommend multiple servers… soon enough there will be Haswell servers with 6TB of DRAM and this case will run on one node.

I have stated repeatedly that anytime a vendor presents a slide comparing their product to their competitors you should immediately throw them out… it will always be twisted. Don’t trust them. And don’t trust me as I work for SAP. But hopefully you can see some logic in my case. If you need an IMDB then you need memory. If you are short of memory then the IMDB operates like a columnar RDBMS with a memory cache. If you are running a BI query workload then you need to pin data in the cache or the system will thrash. Because of this SAP recommends that you get enough memory to get all of the data in… we recommend that you operate our in-memory database product in-memory…

This really the point of the post. The Five Minute Rule informs us about what data should be in-memory (see here). An in-memory database is designed from the bottom up to manage hot data in-memory. The in-memory add-ons being offered over legacy systems are very capable and should not be ignored… and as the price of memory drops the Five Minute Rule will suggest that data in-memory will account for and ever larger percentage of your EDW. But to offer an in-memory capability and recommend that you should keep the bulk of the data on disk is silly… and to state that your product has a competitive advantage because you do not recommend that all of the data managed by your in-memory feature be kept in-memory is silliness squared.

More on the Oracle 12c In-Memory Option

English: a human brain in a jar
In-memory DBMS in a jar (Photo credit: Wikipedia)

I recently listened to a session by Juan Loalza of Oracle on the 12c In-memory option. Here are my notes and comments in order…

There is only one copy of the data on disk and that is in row format (“the column format does not exist on disk”). This has huge implications. Many of the implications come up later in the presentation… but consider: on start-up or recovery the data has to be loaded from the row format and converted to a columnar format. This is a very expensive undertaking.

The in-memory columnar representation is a fully redundant copy of the row format OLTP representation. Note that this should not impact performance as the transaction is gated by the time to write to the log and we assume that the columnar tables are managed by MVCC just like the row tables. It would be nice to confirm this.

Data is not as compressed in-memory as in the hybrid-columnar-compression case. This is explained in my discussion of columnar compression here.

Oracle claims that an in-memory columnar implementation (level 3 maturity by my measure see here) is up to 100X faster than the same implementation in a row-based form. Funny, they did not say that the week before OOW? This means, of course, that HANA, xVelocity, BLU, etc. are 100X faster today.

They had a funny slide talking about “reporting” but explained that this is another word for aggregation. Of course in-memory vector aggregation is faster in-memory.

There was a very interesting discussion of Oracle ERP applications. The speaker suggested that there is no reporting schema for these apps and that users therefore place indexes on the OLTP database to provide performance for reporting and analytics. It was suggested that a typical Oracle E-Business table would have 10-20 indexes on it and it was not unusual to see 30-40 indexes. It was even mentioned that the Siebel application main table could require 90 indexes. It was suggested that by removing these indexes you could significantly speed up the OLTP performance, speed up reporting and analytics, cure cancer and end all wars (OK… they did not suggest that you could cure cancer and end war… this post was just getting a little dry).

The In-memory option is clusterable. Further, because a RAC cluster uses shared disk and the im-memory data is not written to disk there is a new shared-nothing implementation included. This is a very nice and significant architectural advance for Oracle. It uses the direct-to-wire infiniband protocol developed for Exadata to exchange data. Remember when Larry dissed Teradata and shared-nothingness… remember when Larry dissed in-memory and HANA… remember when Teradata dissed HANA as nonsense and said SAP was over-reacting to Oracle. Gotta smile :).

Admins need to reserve memory from the SGA for the in-memory option. This is problematic for Exadata as the maximum memory on a RAC node is 256GB…  My bad… Exadata X3-8 supports 2TB of memory per node… this is only problematic for Exadata X3-2 and below… – Rob

It is possible to configure a table partition into memory while leaving other partitions in row format on disk.

It is suggested again that analytic indexes may be dropped to speed up OLTP.

The presenter talked about how the architecture, which does not change the row-based tables in any way, allows all of the high-availability options available with Oracle to continue to exist and operate unchanged.

But the beautiful picture starts to dull a little here. On start-up or first access… i.e. during recovery… the in-memory columnar data is unavailable and loaded asynchronously from the row format on disk. Note that it is possible to prioritize the order tables are loaded to get high-priority data in-memory first… a nice feature. During this time… which may be significant… all analytic queries are run against the un-indexed row store. Yikes! This kills OLTP performance and destroys analytic performance. In fact, the presenter suggested that maybe you might keep some indexes for reports on your OLTP tables to mitigate this.

Now the story really starts to unravel. Indexes are required to provide performance during recovery… so if your recovery SLA’s cannot be met without indexes then you are back to square one with indexes that are only used during recovery but must be maintained always, slower OLTP, and the extra requirement for a redundant in-memory columnar data image. I imagine that you could throttle some reports after an outage until the in-memory image is rebuilt… but the seamless operations during recovery using standard Oracle HA products is a bit of a stretch.

Let me raise again the question I asked in my post last week on this subject (here)… how are joins processed between the row format and the columnar format. The presenter says that joins are no problem… but there are really only two ways… maybe three ways to solve for this:

  1. When a row-to-columnar join is identified the optimizer converts the columnar table to a row form and processes the join using the row engine. This would be very very slow as there would be no indexes on the newly converted columnar data to facilitate the join.
  2. When a row-to-columnar join is identified the optimizer pushes down aggregation and projection to the columnar processing engine and converts the columnar result to a row form and processes the join using the row engine. This would be moderately slow as there would be no indexes on the newly converted columnar data to facilitate the join (i.e. the columnar fact would have no indexes to row dimensions or visa versa).
  3. When a row-to-columnar join is identified the optimizer converts the row table to a columnar form and processes the join using the columnar engine. This could be fast.

Numbers two and three are the coolest options… but number three is very unlikely due to the fact that columnar data is sharded in a shared-nothing manner and the row data is not… and number two is unlikely because if it were the case Oracle would surely be talking about it. Number one is easy and the likely state in release 1… but these joins will be very slow.

Finally, the presenter said that this columnar processing would be implemented in Times Ten and then in the Exalytics machine. I do not really get the logic here? If a user can aggregate in their OLTP system in a flash why would they pre-aggregate data and pass it to another data stovepipe? If you had to offload workload from your OLTP system why wouldn’t you deploy a small, standard, Oracle server with the in-memory option and move data there where, as the presenter suggested, you can solve any query fast… not just the pre-aggregated queries served by Exalytics. Frankly, I’ve wondered why SAP has not marketed a small HANA server as an Exalytics replacement for just this reason… more speed… more agility… same cost?

There you have it… my half a cents… some may say cents-less evaluation.

I will end this with a question for my audience (Ofir… I’ll provide a link to your site if you post on this…)… how do we suppose the in-memory option supports bulk data load? This has implications for data warehousing…

Here, of course, is the picture I should have used above… labeled as the in-memory database of your favorite vendor:

abbyn