Key Values and Key-Value Stores and In-memory Databases

Back to more geeky topics… although my Mom loved the videos…

When very high performance is required to return key performance metrics, derived from large volumes of data, to a very large number of clients… in other words, when volume and velocity are factors and the results are to be delivered to thousands of users (I suppose that I could conjure up a clever V here… but the cleverness would come from the silliness of the semantic stretch… so I’ll leave it to you all to have some pun)… the conventional approach has been to pre-compute the results into a set of values that can be fetched by key. In other words, we build a pre-aggregated and pre-joined result table that provides answers to a single query template. Conventionally, building this query-specific result table has been the only way to solve these big problems.
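The conventional pattern can be sketched with a toy example… SQLite stands in for the relational DBMS here, and the table and column names are hypothetical:

```python
# Sketch of the conventional approach: pre-compute an aggregate result
# table keyed for exactly one query template, then serve lookups by key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('EAST', '2013-12-01', 100.0),
        ('EAST', '2013-12-01', 250.0),
        ('WEST', '2013-12-01', 75.0);

    -- Pre-aggregated, pre-joined result table: one row per key,
    -- answering a single query template (revenue by region and day).
    CREATE TABLE sales_by_region_day AS
        SELECT region, day, SUM(amount) AS revenue
        FROM sales
        GROUP BY region, day;
""")

# Clients fetch a single pre-computed value by key.
row = conn.execute(
    "SELECT revenue FROM sales_by_region_day WHERE region = ? AND day = ?",
    ("EAST", "2013-12-01"),
).fetchone()
print(row[0])  # 350.0
```

Any question that does not match the template (say, revenue by month) requires building and maintaining yet another result table.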

Further, we conventionally store these key-value results in a relational DBMS and fetch a row at a time, providing pretty darn good performance. But sometimes pretty darn good is not good enough, so there are new options. Key-value data stores may well offer a solution that provides performance and scale at a price well below what has been conventional to date. This is well and good.

But I would like to challenge the conventional thinking a little. The process of joining and aggregating volumes of data into results is a BI process. For the last twenty years BI practitioners have been building pre-aggregated tables and data marts to solve these same problems… maybe not at scale… and this practice has proven to be very expensive, the opposite of agile, and unsustainable. The people costs to develop and support multiple pre-computed replicas are exorbitant. The lack of flexibility that comes from imposing a longish development project over what is essentially an aggregate BI query is constraining our enterprises and our customers.

A better approach is to use the new high-performance database products… HANA, BLU, Oracle 12c, or maybe Spark… to aggregate on-demand. Included in this approach is a requirement to use the new high-performance computing platforms available to house these databases.
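On-demand aggregation drops the pre-computed result table entirely and runs the aggregate against the base data at query time… a minimal sketch, again with hypothetical names and SQLite standing in for an in-memory engine:

```python
# Sketch of the on-demand alternative: no per-template result table to
# build and maintain; the GROUP BY runs against the base table at query
# time, and an in-memory columnar engine makes that fast enough.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('EAST', '2013-12-01', 100.0),
        ('EAST', '2013-12-01', 250.0),
        ('WEST', '2013-12-01', 75.0);
""")

# Any grouping the user asks for, computed at query time.
row = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ? AND day = ?",
    ("EAST", "2013-12-01"),
).fetchone()
print(row[0])  # 350.0
```

Same answer as the pre-computed approach… but a new question is just a new query, not a new development project.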

Consider this… an in-memory DBMS can aggregate 12M rows/sec/core. It can scan 3MB/msec/core… that is, 3GB/sec/core. Companies like SGI and HP are ganging processors together so that you can buy a single node that contains 32, 64, or 128 cores… and this number will go up. A 64-core server will aggregate 768M rows/sec and scan 192GB/sec… and you can gang a small number of nodes together and scale out.
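The back-of-the-envelope math works out like this:

```python
# Scaling the per-core rates above to a 64-core node.
rows_per_sec_per_core = 12_000_000    # 12M rows/sec/core aggregated
scan_mb_per_msec_per_core = 3         # 3MB/msec/core = 3GB/sec/core

cores = 64
agg_rows_per_sec = cores * rows_per_sec_per_core            # aggregation rate
scan_mb_per_sec = cores * scan_mb_per_msec_per_core * 1000  # 1000 msec per sec

print(agg_rows_per_sec)        # 768000000 -> 768M rows/sec
print(scan_mb_per_sec / 1000)  # 192.0     -> 192GB/sec per node
```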

Providing an extensible BI platform for big data is so much easier than building single-query key-value clusters… there is much less risk… and the agility and TCO make it close to a no-brainer. We just have to re-think the approach we’ve used for 20 years and let the new software and hardware do the work.

4 thoughts on “Key Values and Key-Value Stores and In-memory Databases”

  1. Which IMDB approach is better, i.e. more economical: SMP (Oracle 12c, IBM BLU) or MPP (HANA)? Processors are more powerful and can address more memory, which points to SMP as the winning architecture. MPP has a network performance hit problem – distributed join performance is certainly not good, even on HANA. That problem does not exist in SMP architectures.


    • HANA is fully parallel on a single node, Ranko… Just like BLU and Oracle 12c (I suspect that they are not all equal here… But I can’t sort one from the other based on public info). HANA, though, can also run shared-nothing across nodes while solving within each node in parallel. BLU, for sure, and Oracle, I assume, are working to get there.

      The real question is: if you have a choice to scale-up on a single node or scale-out across nodes… How should you choose? I think that you answered that properly… Scale-up.

      Note that I did not use the SMP/MPP terminology… It is not really accurate. All three products “shard” the data in-memory to run parallel across cores… And multi-core nodes are not really SMP nodes which, by definition, share a single local memory. They are actually NUMA nodes where each core has local memory that is shared over a bus.

      Finally, SAP has some silly and arbitrary rules about configuring scale-up systems. These limit the amount of data you can put on a certified single node. Hopefully they will become less prescriptive soon, and let customers choose the performance characteristics they want.

      Rob


      • Thank you Rob.
        Yes, Oracle 12c (M6-32 and on) / IBM BLU architectures are NUMA, but even NUMA is faster than accessing another processor’s memory over a network, like HANA does.
        Another confusing point, about HANA positioning: is HANA a general-purpose/OLTP or an analytical RDBMS?
        It looks like it is two RDBMSs in one: a) an analytical RDBMS (which can’t cost-effectively handle really big data because it is memory-based, and is probably generally expensive); b) an OLTP database (which is not really meant to be OLTP, because yes, it does have a row store, but columnar seems to be the main theme). Product positioning seems to be a bit fuzzy, don’t you think? Hope I am not getting you into trouble 🙂


      • Let me say again… When HANA runs on a single node it runs just like BLU and Oracle 12c. You have the option to scale out with HANA to solve bigger problems… An option not yet available with the others. Let me compose a separate post to answer your second question so more readers see the answer.

        As far as trouble goes… I’ve left SAP for an amazing new opportunity… I’ll post the details early in January. You will like it.


Comments are closed.