IBM BLU and SAP HANA


As I noted here, I think that the IBM BLU Accelerator is a very nice piece of work. Readers of this blog know that in the software business any feature developed by one vendor can be developed in a relatively short period of time by any other vendor… and BLU certainly moves DB2 forward in the in-memory database space led by HANA… it has narrowed the gap. But let’s look at the gap that remains.

First, IBM is touting the fact that BLU requires no proprietary hardware and suggests that HANA does. I do not really understand this positioning. HANA runs on servers from a long list of vendors, and each vendor spins the HANA reference architecture a little differently. I suppose the fact that there is a HANA reference architecture could be considered limiting… and I guess there is no reference architecture for BLU… maybe it runs anywhere… but let’s think about that.

If you decide to run BLU and put some data in-memory, then certainly you need some free memory to store it. Assuming that you are not running on a server with excess memory, this means that you need to buy more. If you are running on a blade that supports only 128GB of DRAM or less, then this is problematic. If you upgrade to a 256GB server, then you might get a bit of free memory for a little data. If you upgrade to a fat server that supports 512GB of DRAM or more, then you would likely be within the HANA reference architecture set. There is no magic here.
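To put rough numbers on this, here is a back-of-envelope sizing sketch in Python; the overhead and compression figures are illustrative assumptions, not vendor-published numbers for BLU or HANA.

```python
# Back-of-envelope sizing: how much raw data fits in memory on a
# given server? The overhead and compression figures below are
# assumptions for illustration, not vendor-published numbers.

def in_memory_capacity_gb(dram_gb, overhead_gb=32, compression_ratio=8.0):
    """Raw user data (GB) that fits after OS/DBMS overhead,
    assuming columnar compression."""
    free_gb = max(dram_gb - overhead_gb, 0)
    return free_gb * compression_ratio

for dram_gb in (128, 256, 512):
    print(f"{dram_gb:4d} GB server -> ~{in_memory_capacity_gb(dram_gb):,.0f} GB raw data in memory")
```

Whatever the exact ratios, the conclusion is the same: meaningful in-memory capacity requires a fat-memory server, which is just what the HANA reference architecture prescribes.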

One of the gaps is related: you cannot cluster BLU, so per the paragraphs above the amount of data you can support in-memory is limited to what fits on a single node. HANA supports shared-nothing clustering and will scale out to support petabytes of data in-memory.
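For readers who want the shared-nothing idea made concrete, here is a minimal sketch: rows are hash-partitioned across nodes, each node scans only its share, and only small partial results cross the network. The node count and hash scheme are invented; this is not HANA’s actual partitioning code.

```python
from collections import defaultdict

NODES = 4  # illustrative cluster size

def partition(rows, key):
    """Hash-partition rows across NODES shared-nothing nodes."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key]) % NODES].append(row)
    return shards

rows = [{"cust_id": i, "amount": i * 1.5} for i in range(1_000)]
shards = partition(rows, "cust_id")

# Each node computes its partial aggregate in parallel; only the
# tiny partials travel over the network to be combined.
partials = [sum(r["amount"] for r in shard) for shard in shards.values()]
print("total amount =", sum(partials))
```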

This limit is not so terribly bad if you store some of your data in the conventional DB2 row store… or in a columnar format on-disk. This is why BLU is an accelerator, not a full-fledged in-memory DBMS. But if the limit means that you can get only a small amount of data resident in-memory, it may preclude you from putting into BLU the medium-to-large fact tables that would benefit most from the acceleration.

You might consider putting smaller dimension tables in BLU… but when you join to the conventional DB2 row store, the column-store tables are materialized as rows and the row database engine executes the join. You can store the facts in BLU in columnar format… but they may not reside in-memory if memory is limited… and only those joins that do not touch the row store will use the BLU level 3 columnar features (see here for a description of the levels of columnar maturity). So many queries will require I/O to fetch data.
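To make the materialization cost concrete, here is a toy sketch of that mixed-engine join, with invented table shapes: once the columnar side is converted to rows, the columnar operators no longer apply.

```python
# Toy mixed-engine join: the columnar fact table is materialized as
# rows before a row-engine hash join. Table shapes are invented.

# Column store: one array per column.
fact_cols = {
    "cust_id": [1, 2, 1, 3],
    "amount":  [10.0, 20.0, 5.0, 7.5],
}

# Row store: a list of row records.
dim_rows = [
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 2, "name": "Beta"},
    {"cust_id": 3, "name": "Core"},
]

# Step 1: materialize columns into rows; from here on the columnar
# operators (vector scans, compressed predicates) no longer apply.
fact_rows = [dict(zip(fact_cols, vals)) for vals in zip(*fact_cols.values())]

# Step 2: the row engine runs an ordinary hash join.
by_key = {d["cust_id"]: d for d in dim_rows}
joined = [{**f, "name": by_key[f["cust_id"]]["name"]} for f in fact_rows]
print(joined)
```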

When you pull this all together: limited available memory on a single node, large fact tables paging in and out of disk storage, and joins pushed to the row store… you can imagine the severe constraint for a real-world data warehouse workload. BLU will accelerate some stuff… but the application has to be limited to the DRAM dedicated to BLU.

It is only software… IBM will surely add BLU clustering (see here)… and customers will figure out that they need to buy the same big-memory servers that make up the HANA reference architecture to realize the benefits… For analytics, BLU features will converge over the next 2-3 years to make it ever more competitive with HANA. But in this first BLU release, the use of in-memory marketing slogans and of tests that might not reflect a real-world workload is a little misleading.

Right now it seems that HANA might retain two architectural advantages:

  1. HANA’s real-time support for OLTP and analytics against a single table instance; and
  2. the performance of the HANA platform, where more application logic runs next to the DBMS, in the same address space, across a lightweight thread boundary.

It is only software… so even these advantages will not remain… and the changing landscape will provide fodder for bloggers for years to come.

References

  • Here is a great series of blogs on BLU that shows how joins with the row store materialize columns as rows…

8 thoughts on “IBM BLU and SAP HANA”

  1. Hi Rob, nice description of what you think BLU is. However, it seems you don’t know about, or ignore, the fact that a big advantage of DB2 with BLU Acceleration is that not all data needs to be held in memory. DB2 BLU is not an in-memory database; it is more. Some “brainiacs” worked on new algorithms for bufferpool management, smart prefetching, and workload management. Now add the fact that data can be skipped and, if needed, compared while still highly compressed.
    With that, DB2 with BLU Acceleration is able to run queries over data where the total does not fit in memory. I think this is a really neat feature because, as most people know, databases grow, and you can’t change hardware like T-shirts or want to buy XXL when you need S right now…
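    To illustrate those two techniques, here is a minimal sketch of data skipping via per-block min/max synopses plus predicate evaluation directly on dictionary codes; the block size and encoding are invented for the demo, not BLU’s actual formats.

    ```python
    # Skip whole blocks using per-block min/max synopses, and evaluate
    # the predicate on dictionary codes without decompressing strings.
    # The block size and encoding are invented, not BLU's formats.

    dictionary = {"DE": 0, "FR": 1, "US": 2}     # value -> code
    codes = [0, 0, 1, 2, 2, 1, 0, 2]             # compressed column

    BLOCK = 4
    synopses = [(min(codes[i:i + BLOCK]), max(codes[i:i + BLOCK]))
                for i in range(0, len(codes), BLOCK)]

    target = dictionary["US"]                    # predicate: country = 'US'
    hits = []
    for b, (lo, hi) in enumerate(synopses):
        if lo <= target <= hi:                   # block might match: scan it
            for i in range(b * BLOCK, min((b + 1) * BLOCK, len(codes))):
                if codes[i] == target:           # compare codes, not strings
                    hits.append(i)
        # else: the whole block is skipped without touching its data
    print("matching row positions:", hits)
    ```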

    Henrik (I work for customers, IBM pays my bill)

    1. HANA does this as well, Henrik.

      My main point is that if you run on any hardware, including hardware with only a little DRAM, then being constrained to a single node limits what can be done.

      Now if you were arguing that BLU is a highly competitive column-store DBMS that used DRAM very effectively… That in-memory is a performance feature and not the main differentiator… And that it would compete well against Vertica and the like when most of the data managed resides on disk… That would catch my attention.

      Rob

      1. Rob, I think you are missing the point Henrik is making. With BLU all the analytics are done in memory. In fact we treat memory like the new disk… avoid main memory at all cost, because even memory is too slow, and work in the registers and L1/L2 cache as much as possible. But the world doesn’t all fit in memory all the time. When it does, great: BLU will work in memory just like HANA works in memory. But the world doesn’t fit in memory (even if you scale out, it becomes cost prohibitive to try to fit it all in memory). BLU is optimized to not have to have everything in memory. So if it’s in memory, great; if not, no problem: queries don’t fail because we run out of memory and kill queries, we just page in the required data (and we don’t need the full column partition in memory, just one page at a time). That is what I think is inaccurate, or at a minimum misleading, in your blog posting above.
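        To make the paging point concrete, here is a toy sketch with an invented page size (and no claim to BLU’s real bufferpool logic): the scan touches one page at a time, so it completes in a fixed memory footprint however large the column is.

        ```python
        # Toy paging scan: process one column page at a time, so memory
        # use is one page regardless of column size. The page size is
        # invented; this is not BLU's actual bufferpool logic.

        PAGE_ROWS = 1024

        def scan_sum(column_on_disk):
            """Sum a column while holding only one page in memory."""
            total = 0
            n_pages = (len(column_on_disk) + PAGE_ROWS - 1) // PAGE_ROWS
            for p in range(n_pages):
                page = column_on_disk[p * PAGE_ROWS:(p + 1) * PAGE_ROWS]  # simulated I/O
                total += sum(page)        # process the page, then let it go
            return total

        column = list(range(10_000))      # ten pages of data, one page in memory
        print(scan_sum(column))           # 49995000: the query never fails
        ```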

        BTW I don’t think there has been an issue with “proprietary” hardware that I’ve seen. As you point out, lots of vendors have reference architectures for HANA. I think what some people are talking about is the requirement for “prescriptive” hardware. As others like John Appleby have stated in other blog postings (and I quote):
        – “If you see this in your application – big increases and decreases of memory usage in the HANA appliance – then you have got your design wrong and you need some help.”
        – “And yes, it has to be installed by a trained professional. It’s important to get the installation of HANA right.”
        – “Third, you can load and unload data manually, but only down to a specific column or table.”

        The issue I have seen with customers is that even with the prescriptive hardware, if you don’t have all the active data in memory there are issues. With BLU you can run on machines that have less memory than you have data (and yes, there are sizing guidelines based on the SLAs you want to meet), but it’s not a hard limit on the hardware because the software will compensate if you have more data than memory.

      2. I get it Chris… Let me summarize…

        Both BLU and HANA support data on disk. BLU does a better job than HANA when there is not enough memory available and data must be fetched in and out from disk… and is comparable when both BLU and HANA have the same amount of memory. To effectively get lots of data in-memory you need a server that has lots of memory… so there is no meaningful hardware advantage on either side. If you want to use BLU in-memory, you probably need a bigger server with more memory and more cores. HANA scales where BLU does not… and this constraint limits the size of the applications that can be in-memory on BLU. IBM will likely fix this constraint. SAP will likely fix the constraints on using data from disk.

        Rob

  2. Not sure you understood my point. You write:
    “You might consider putting smaller dimension tables in BLU… but when you join to the conventional DB2 row store, the column-store tables are materialized as rows and the row database engine executes the join. You can store the facts in BLU in columnar format… but they may not reside in-memory if memory is limited… and only those joins that do not touch the row store will use the BLU level 3 columnar features (see here for a description of the levels of columnar maturity). So many queries will require I/O to fetch data.”

    DB2 BLU does not mean columnar data is in-memory and row data is on disk. You can have the big fact table AND the dimensional data stored as columnar data and have only a small amount of memory. Only a tiny portion of the columnar data, the active data, may be in memory. That is why I mentioned that BLU Acceleration is more than a single technology. It is the efficient (“smart”) use of multiple new technologies, or improvements to classic database technologies, to reduce data access (I/O) and to exploit various forms of parallelism. Add simplicity and the existing DB2 interfaces, so that no application changes are required.

    I personally haven’t used SAP HANA or Vertica, so I don’t write about them.

    Henrik

    1. If you implemented BLU on a small server with little free memory then at any point in time most of the managed data would reside on disk and only the “active” data would be in-memory. I get this.

      In this case there is little difference between BLU and a non-IMDB product like Vertica which also has some amount of active data in-memory at all times. In other words, with a small memory footprint BLU is a column-store database… Not really an in-memory database.

      This is a cool thing… Not a criticism. But I wonder how it would compete with Vertica or Paraccel?

      In the current release HANA would not do well with a small memory footprint… Even though it supports data on disk it is built on the assumption that the most active data is almost always in-memory. But in the same way that BLU made strides towards in-memory I imagine that HANA will support data on disk better in upcoming releases. 😉

      As I tried to say… If you want to claim that you are an in-memory database then you had better run on hardware with lots of memory… Either lots on a single node (i.e. the HANA reference architecture) or lots across a cluster (which BLU does not currently support).

      1. Now it is funny: database customers (row storage, 10 years ago) already tried to run the database with lots of memory, so that all the data fit in (an in-memory database). The big difference today is that we have very efficient compression, skipping of unused data, comparison of compressed data, better bufferpool algorithms, lots of parallelism, data optimized for registers, and much more. The same database from 10 years back would fit into a fraction of the memory.
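        Some rough arithmetic behind that claim, using an invented column: dictionary encoding plus bit-packing shrinks a 32-byte-per-value column to a small fraction of its raw footprint.

        ```python
        # Rough arithmetic: dictionary-encode a 100M-row string column
        # and bit-pack the codes. The column, its width, and its
        # cardinality are invented for illustration.
        import math

        n_rows = 100_000_000
        distinct = 50_000                     # distinct values in the column
        raw_bytes = n_rows * 32               # ~32 bytes/value, uncompressed rows

        bits_per_code = math.ceil(math.log2(distinct))          # 16 bits
        encoded_bytes = n_rows * bits_per_code // 8 + distinct * 32

        print(f"raw: {raw_bytes / 2**30:.1f} GiB, "
              f"encoded: {encoded_bytes / 2**30:.2f} GiB, "
              f"ratio: {raw_bytes / encoded_bytes:.0f}x")
        ```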

        In-memory is not in-memory.

        Henrik

      2. Column store is what changed the game, Henrik. Column orientation enables better compression, effective use of processor cache, vector processing, and so on. Columnar in-memory is not row in-memory… That is for sure…
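        A small sketch of the vector-processing point, with NumPy standing in for SIMD and an invented dataset: the columnar predicate is one vectorized pass over a contiguous array, while the row-store equivalent walks every row object to test a single field.

        ```python
        # Columnar scan vs row scan, with NumPy standing in for SIMD.
        # The data is random and the row "width" is token; both are
        # assumptions for illustration.
        import numpy as np

        amount = np.random.default_rng(0).uniform(0.0, 100.0, 1_000_000)

        # Column store: one vectorized pass over a contiguous array.
        print("column scan hits:", int((amount > 90.0).sum()))

        # Row store: walk every row object just to test one field.
        rows = [{"amount": a, "other_cols": None} for a in amount[:100_000]]
        print("row scan hits:", sum(1 for r in rows if r["amount"] > 90.0))
        ```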

