Netezza Zone Maps and I/O Avoidance

A reader recently wrote to me and asked about Netezza: “why does everyone insist that these (zone maps) tell you where ‘not to look’ when hunting for data?”. I’ll provide a direct answer… and a more meaningful answer.

Imagine that you have a list of data blocks with some metadata for each block that tells you the range of values each block holds for a given column, FOO, as follows:

  • Block 1: FOO 0-173
  • Block 2: FOO 174-323
  • Block 3: FOO 324-500

and a query that selects WHERE FOO=42.

If Netezza scanned the metadata and sent its read routine the list of blocks not to read… it would send blocks 2 and 3. This is clearly not the case… if there are a million blocks it would not send a list of 999,999 block numbers to skip… and force the read routine to figure out what was left to read. So clearly Netezza does not really tell you where ‘not to look’. It is a clever turn of phrase.
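To make the point concrete, here is a minimal sketch, in Python, of how a scanner could use zone-map min/max values to build the list of blocks to read rather than a list of blocks to skip. The block metadata and function names are purely illustrative, not Netezza internals:

```python
# Hypothetical zone map: per-block min/max values for column FOO.
zone_map = [
    {"block": 1, "foo_min": 0,   "foo_max": 173},
    {"block": 2, "foo_min": 174, "foo_max": 323},
    {"block": 3, "foo_min": 324, "foo_max": 500},
]

def blocks_to_read(predicate_value, zone_map):
    """Return only the blocks whose min/max range could contain the value."""
    return [z["block"] for z in zone_map
            if z["foo_min"] <= predicate_value <= z["foo_max"]]

# WHERE FOO = 42: only block 1 is handed to the read routine;
# blocks 2 and 3 are simply never mentioned.
print(blocks_to_read(42, zone_map))  # [1]
```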

But I like this particular cleverness. Every DBMS is built with features designed to avoid I/O:

  • indexes are metadata that point to blocks to reduce reading unnecessary blocks;
  • partitions contain metadata to reduce reading unnecessary partitions;
  • column stores are designed to reduce reads of unnecessary columns; and
  • caches hold hot data blocks to avoid re-reading those blocks.

In fact, the highest-performance DBMS will almost always be the one that most effectively minimizes I/O. This is why in-memory databases so often deliver the highest performance.

So while zone maps do not really tell the system directly what not to read… the effect is ‘not to look’ at unnecessary data.

11 thoughts on “Netezza Zone Maps and I/O Avoidance”

  1. Gee Whiz, an answer that actually makes 100% sense to me. Not only do you appear to know your ‘stuff’, you can actually communicate your ‘stuffness’.

    Of course, now you’ve messed up because I’m going to have to ask you another question.

    I suppose that this is more of a theoretical, what’s-your-gut-feel sort of question, since I don’t expect you could have actual data with which to answer.

    I’ve been doing DB2 for 11 years (Informix for 3 before that). So, I know a little about databases. This Netezza thing is pretty cool. I’ve done some work with it and it is darn fast. However, I can’t help but wonder if it isn’t possible to approach the same speed with standard DB2 if we just apply all the tricks?

    Range-partition our tables on the equivalent of the Organize By key.

    Create tablespaces such that each one is built on top of 8 containers with one container in each of 8 disks in an attempt to emulate the Distribute By key.

    Perhaps an MDC would be useful here too. I haven’t worked with them a lot but this might give us the effect of a Clustered Base Table without having to actually cluster the tables on a key.

    Tune the BLOCK IO, EXTENT SIZE and PREFETCH SIZE parameters of the tablespaces.

    Turn on parallelism (which I generally haven’t been able to do since I’m always in OLTP, but it seems like it might make sense in this case).

    Obviously turn on compression.

    Those with DPF systems could apply this to emulate the SPU concept.

    Maybe there are some other things that could be done. I wonder if it wouldn’t be possible to load data into non-logged tables to avoid any logging overhead. Not sure if that’s possible to do in conjunction with the other things.

    But, if you did everything, I’m thinking DB2 might not be too far behind.

    I’m about to start a new job doing Netezza and DB2 administration. I’d love to try out a comparison if it seemed worthwhile.

    I’d like your thoughts if you’d care to share them. Is it worthwhile to try, or is this a fruitless task?

    1. Hi Cameron, yes, there are some similarities here. When you look, though, at how IBM positions Netezza and DB2, there are also some differences. Netezza (actually it’s now called PureData for Analytics) is focused on speed for complex analytics, simplicity, and rapid time to value: http://www-01.ibm.com/software/data/puredata/analytics/. PureData for Operational Analytics, which leverages DB2 with DPF under the covers, is, as the name suggests, focused on “operational” analytics, where speed and simplicity are of course also the focus, but it’s more about workloads that have operational characteristics in addition to the analytics: http://www-01.ibm.com/software/data/puredata/operationalanalytics/

      Of course now there is also BLU Acceleration, which I commented on below, that you should take a look at too if you are already using DB2. I’ve connected with you via LinkedIn (at least I hope it’s the same Cameron Walker that is working with DB2 and Informix in their profile 🙂). So feel free to reach out to me directly if you have more questions on these topics… I’m happy to help.

    2. @Cameron

      You still wouldn’t get the same performance. I’d be surprised if you could get within the same order of magnitude (given equivalent hardware).

      The Netezza [or whatever it’s called at the minute 🙁 ] query engine and cost-based optimizer were built *knowing* that optimizations like the ones you mentioned would *always* be in place. The data has to be partitioned, the zone map statistics have to be populated, etc. On top of that, the data from disk goes through their FPGA to (AFAIK) prevent a large % of unneeded column data and easily discarded row data from making it into RAM.

      A small anecdote before I go. We had a Gen1 Netezza box (1.3TB) and I’m currently using AWS Redshift to hold a similar amount of data. Redshift is still not as fast as that Netezza box 10 years later. A lot cheaper and easier to scale up though…

  2. Fundamentally, all large data systems optimize for I/O in some form or another, because while CPU cost/performance is guided by Moore’s law, I/O cost/performance is flat, with predictable single-digit performance gains. The disparity between the two over the last 30 years is many orders of magnitude… Hence, every major data platform architecture is founded on I/O cost optimization, from column stores to Hadoop.

  3. Good posting, Rob. Some folks do get confused because we do talk about I/O avoidance in the case of data skipping techniques as “knowing where the data isn’t”. That’s because a zone map, or any other data skipping technique, differs from index access in that we don’t really know “where the data is”. We only know that “it’s possible that the data is here”, but we know for sure that “the data I’m looking for isn’t over there”. That’s why it is sometimes referred to as a negative index or “where the data isn’t” (a small sketch of this asymmetry follows below).

    Funny, I just wrote a series of blog postings on this very topic this week, so perhaps your readers would be interested in more details. The series (three postings) starts here and goes into more depth on data skipping techniques (and in fact how they are used for more than just avoiding I/O):
    http://it.toolbox.com/blogs/db2luw/how-data-skipping-works-in-blu-acceleration-part-3-61029
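    To illustrate the asymmetry described above, here is a small hypothetical sketch (Python; the block metadata and stored values are invented). A miss against the min/max range proves a value is absent from the block, but a hit does not prove it is present:

    ```python
    # A zone map as a "negative index": the min/max range can prove a value is
    # absent from a block, but a range hit does not prove it is present.
    block_meta = {"foo_min": 0, "foo_max": 173}   # what the zone map records
    block_rows = [3, 17, 90, 150, 173]            # what the block actually stores

    def can_skip(value, meta):
        """True only when the value is provably outside the block's range."""
        return value < meta["foo_min"] or value > meta["foo_max"]

    print(can_skip(500, block_meta))  # True:  "the data isn't over there"
    print(can_skip(42, block_meta))   # False: "it's possible the data is here"
    print(42 in block_rows)           # False: ...but it wasn't there, so that block was read for nothing
    ```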

  4. It might be a little unfair to compare Netezza on dedicated HW to Redshift on virtual HW, Joe. Are there the same number of Redshift virtual nodes as the Netezza box had?

    Also… the point of the post was to think a little about I/O avoidance. Netezza uses zone maps, a specialized index structure (sorry all… if it walks like a duck and quacks like a duck… it is a duck… the notion that a zone map is not a sort of index is silly, and the fact that Cameron can emulate it with DB2 indexes just proves the point). Cameron asked if you could create an equivalent to a zone map in DB2, and Joe rightfully points out that zone maps are baked into Netezza and supported by specialized HW… so the DB2 equivalent would not likely perform as well. But DB2 has several I/O avoidance techniques that Netezza does not: partition elimination, column elimination, and in-memory options. DB2 should outperform Netezza on similar hardware.

    Rob

    1. If they give me the chance to try a comparison between their Netezza and some (good) db2 hardware I’ll try and post results for you experts to argue about. That’s going to be a while though.

      My belief? Empirical results trump theoretical musings every time.

      1. LOL… If we eliminated everything on the internet that was not empirical… what would be left?

        The musings are meant to create hypotheses that can be tested… so please… test away and let us know? In fact… ping me on LinkedIn and I’ll help edit the results and post it here as a guest blog under your name…

        Thanks, Cameron…

        Rob

  5. Rob,

    Actually, there are probabilistic data structures that can tell with certainty that an element does *not* belong to a set, while they cannot tell whether the element *does* belong; they are called Bloom filters and are used by database engines to optimize I/O. I’m not sure Netezza is using exactly this, but considering their formulation I’d speculate that the answer is yes.
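    For readers who haven’t met them, here is a minimal Bloom filter sketch in Python; the bit-array size and hashing scheme are illustrative only, not what any particular engine uses. A lookup that returns False is definitive, while a lookup that returns True may be a false positive:

    ```python
    import hashlib

    class BloomFilter:
        """Tiny illustrative Bloom filter: certain about absence, not presence."""

        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = [False] * num_bits

        def _positions(self, item):
            # Derive k bit positions by hashing the item with k different salts.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item):
            # False is definitive; True may be a false positive.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    for value in (7, 42, 173):
        bf.add(value)

    print(bf.might_contain(42))    # True  (possibly present)
    print(bf.might_contain(9999))  # False (definitely not present)
    ```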

    1. Nice, Alex… Bloom filters do tell you where not to look. But as far as Netezza goes I have never seen them mentioned as a way to sort through zone maps…

      As I understand it, Exadata uses Bloom filters to minimize I/O. This is goodness.

      Rob

