Database Super-computing

Today I am going to focus on a topic that I’ve suggested previously without the right emphasis: the new database architecture that uses vector processing on compressed columns to significantly accelerate performance.

The term “super-computing” was coined to describe the extreme hardware and software optimization developed to crunch numbers in scientific applications. As these technologies developed super-computer hardware evolved to leverage parallel microcomputers, software evolved to better leverage parallelism. Recently, microcomputers have started to incorporate the specialized instructions that support advanced mathematical applications. These super-computer instructions directly support vector algebra by manipulating strings of bits, vectors, in a single instruction. Finally, application developers recognized that these bit strings, these vectors, could be loaded into the microprocessors in a more effective manner to optimize their applications to the bare metal.

The effect of these optimizations accumulate for these applications as vectors compress and use memory more effectively, vectors load into processor cache more effectively, and vector instructions dramatically outperform integer instructions. The cumulative effect is that super-computer programs may be 10X-100X faster than commercial applications that provide the same result.

As this evolution progressed there was a similar evolution changing the architecture of database technology. Databases actually leveraged microcomputers before the high performance space made the move. But databases focused on the benefits of massively parallel I/O more than on the benefits of parallel compute. The drive to minimize the cost of I/O eventually led database developers to implement column store and then a very interesting discovery was made. Engineers recognized that a highly compressed column, a string of bits, could be processed as a vector.

Let’s see if we can make this 10X-100X number more than marketing foam. We can do this by roughly comparing the low-level processing of a chunk of data in integer and then in vector formats.

Let’s skip I/O processing and just focus on internals. This simplification greatly favors our integer DBMS. Keep in mind that the vector DBMS will process compressed vector data directly while the integer DBMS will expend resources to uncompress data and then take up 4X or more memory. This less efficient memory utilization will increase the chance that an I/O may be required and I/O is very expensive in the scenario we will discuss. Even an I/O on 1% of the time by the integer DBMS will provide a 1000X-100,000X advantage to the vector DBMS (see Figure 8 to gauge the latency to SSD or to disk).

Figure 8. Some Latency Metrics
Figure 8. Some Latency Metrics

So we’ll start with uncompressed integer data versus compressed vector data. We can assume that both databases are effective at populating cache. But the 4X compression advantage means that the vector processor is more likely to find data in the fast Level 1 cache and in the mid-range L2 cache. Given the characteristics outlined in Figure 8 we might suggest that the vector database is 4X more likely of finding data in cache than the integer database and that if we assume the latency of L2 cache as an estimate this results in a 15X-200X performance advantage.

Since data is in a vector form we can perform relational algebra and basic mathematics using vector algebra and vector addition. This provides another 8X-50X boost to the vector side

When we combine these advantages we see that a 10X-100X advantage is conservative. The bottom line is clear. A columnar database that effectively manages vectors into cache and further utilizes super-computing instructions will significantly out-perform an integer-based product.

The era of database super-computing has begun.

Data Science or Data Alchemy

English: Karl Popper in the 1980's.

 

I love the tweetness of the @howarddresner posts restated here regarding data science… and the dialog it has started… and I would like to add a twist to the conversation.

 

First a story…

 

I once consulted for a company who was building a marketing service for their clients. They targeted customers with products provided by those clients using data they had in-house. The company had a team of data scientists who built targeting models. The same team built models/reports that evaluated the effectiveness of the targeting. Somehow their evaluations demonstrated that they were brilliant… producing results that were unprecedented and completely justifying the service.

 

But a close look at the targeting and the evaluation showed that the targeting was weak and that the results were grossly inflated.

 

Karl Popper tells us that science must be falsifiable. But science requires enough rigor that somebody must attempt to falsify any claims.

 

Data Science” is not science in this respect. It is alchemy… and the shortage of data scientists is twice as bad as you think… because there must be two data scientists for every claim of data discovery… one to discover it and one to test the discovery.

 

My Father used to say” “Figures never lie… but liars figure”. I am not accusing all data scientists of being as unethical as those in my story… but I worry that under the algorithmic mumbo-jumbo there emerges new versions of the truth that are utterly untested and will often prove inaccurate.

 

 

Decision Support Redux

In the late 1980’s and the early 1990’s the term for software that business users executed to run reports, fire off canned queries, and/or to explore data ad hoc was called “decision support” software. Later, and still today, the term “business intelligence” came into use.

I never understood the sense of the switch. The term “business intelligence” is vague… sort of fluffy and pretentious. “Decision support” implies a purpose. In the years when the switch from one term to the other was in progress, if you asked the question: what do you mean by “business intelligence” the answer was… it is “decision support”.

Today the analytics that underlie both terms are becoming more sophisticated, and they execute in near-real-time. It could be said that there is business intelligence in the process that acquires data, analyzes it, discovers a pattern, and applies a rule automatically as a result. But the software programmer who built the system was focused on automating the decision process… not on creating intelligence.

A clear focus on supporting complex decisions will increase the chances of delivering a return on your investment in analytics. “Intelligence” is not useful unless it is applied to make a better decision. I vote for a return to the phrase “decision support”.

OLAP is not advanced analytics

OLAP searches a set of pre-aggregated data… a cube. If the cube is large enough that you don’t bump into the edges you might think that your search is ad hoc… but that is an illusion. The set is prescribed not ad hoc.

In the 1980’s we sent paper reports out… they were moved on a pallet with a fork-lift. The reports aggregated key metrics to many levels in a hierarchy sliced and diced across many dimensions. Today we take the lines off the reports and store them digitally in a cube and provide tools to let users navigate the cube to build their reports. What they build looks, to a large extent, like the reports from the 80’s.

Data warehousing provides more data and better data… so there are more cubes, more dimensions, more reports… and hopefully more business intelligence. But these reports provide 1980’s quality business intelligence on a screen instead of on paper… bounded by the OLAP cube.

When you hear folks talk about data science and data mining and advanced analytics and optimization… they are talking about advanced mathematical treatment of the data… know that this is going to require technology that is beyond the capabilities of a OLAP engine.

Exalytics is a OLAP engine. Here are some Exalytics use cases from a proponent. They are about OLAP dashboards… good stuff… but hardly advanced analytics. Oracle says that Exalytics is engineered for Extreme Analytics. If we agree that “extreme” analytics is not in any way advanced… then I agree.

A Big Data Sound Bite…

Here is a sound bite on Big Data I composed for another source…

Big Data is relative. For some firms Big Data will be measured in petabytes and for other in hundreds of gigabytes. The point is that very detailed data provides the vital statistics that quantify the health of your business.

To store and access Big Data you need to build on a scalable platform that can grow. To process Big Data you need a fully scalable parallel computing environment.

With the necessary infrastructure in place the challenge becomes: how do you gauge your business and how do you change the decision-making processes to use the gauges?