Today I am going to focus on a topic that I’ve suggested previously without the right emphasis: the new database architecture that uses vector processing on compressed columns to significantly accelerate performance.
The term “super-computing” was coined to describe the extreme hardware and software optimization developed to crunch numbers in scientific applications. As these technologies developed super-computer hardware evolved to leverage parallel microcomputers, software evolved to better leverage parallelism. Recently, microcomputers have started to incorporate the specialized instructions that support advanced mathematical applications. These super-computer instructions directly support vector algebra by manipulating strings of bits, vectors, in a single instruction. Finally, application developers recognized that these bit strings, these vectors, could be loaded into the microprocessors in a more effective manner to optimize their applications to the bare metal.
The effect of these optimizations accumulate for these applications as vectors compress and use memory more effectively, vectors load into processor cache more effectively, and vector instructions dramatically outperform integer instructions. The cumulative effect is that super-computer programs may be 10X-100X faster than commercial applications that provide the same result.
As this evolution progressed there was a similar evolution changing the architecture of database technology. Databases actually leveraged microcomputers before the high performance space made the move. But databases focused on the benefits of massively parallel I/O more than on the benefits of parallel compute. The drive to minimize the cost of I/O eventually led database developers to implement column store and then a very interesting discovery was made. Engineers recognized that a highly compressed column, a string of bits, could be processed as a vector.
Let’s see if we can make this 10X-100X number more than marketing foam. We can do this by roughly comparing the low-level processing of a chunk of data in integer and then in vector formats.
Let’s skip I/O processing and just focus on internals. This simplification greatly favors our integer DBMS. Keep in mind that the vector DBMS will process compressed vector data directly while the integer DBMS will expend resources to uncompress data and then take up 4X or more memory. This less efficient memory utilization will increase the chance that an I/O may be required and I/O is very expensive in the scenario we will discuss. Even an I/O on 1% of the time by the integer DBMS will provide a 1000X-100,000X advantage to the vector DBMS (see Figure 8 to gauge the latency to SSD or to disk).
So we’ll start with uncompressed integer data versus compressed vector data. We can assume that both databases are effective at populating cache. But the 4X compression advantage means that the vector processor is more likely to find data in the fast Level 1 cache and in the mid-range L2 cache. Given the characteristics outlined in Figure 8 we might suggest that the vector database is 4X more likely of finding data in cache than the integer database and that if we assume the latency of L2 cache as an estimate this results in a 15X-200X performance advantage.
Since data is in a vector form we can perform relational algebra and basic mathematics using vector algebra and vector addition. This provides another 8X-50X boost to the vector side
When we combine these advantages we see that a 10X-100X advantage is conservative. The bottom line is clear. A columnar database that effectively manages vectors into cache and further utilizes super-computing instructions will significantly out-perform an integer-based product.
The era of database super-computing has begun.
4 thoughts on “Database Super-computing”
Thank you Rob. The complete whitepaper is available at http://www.bityota.com/cloud-changes-every-thing-wp/
Hi Rob, very enlightening as always.
I recall the now defunct Sand Technology ‘vectorising’ and tokenising (both probably spelled with ‘Z’s) the columns on their Nucleus database back in the 80s.
Guess they were ahead of their time.
Well written. Multiple research groups and companies looking at this. A bit more on this topic and some details about the IBM approach for cloud and software over here: http://softwaretradecraft.com/when-ram-is-too-slow-how-dynamic-in-memory-processing-changes-the-game-for-data-analytics/
Comments are closed.