MPP on HANA, Exadata, Teradata, and Netezza

6 May… There is a good summary of this post and on the comments here.  – Rob

17 April… A single unit of parallelism is a core plus a thread/process to feed it instructions plus a feed of data. The only exception is when the core uses hyper-threading… in which case 2 instructions can execute more-or-less at the same time… then a core provides 2 units of parallelism. All of the other stuff: many threads per core and many data shards/slices per thread are just techniques to keep the core fed. – Rob

16 April… I edited this to correct my loose use of the word “shard”. A shard is a physical slice of data and I was using it to represent a unit of parallelism. – Rob

I made the observation in this post that there is some inefficiency in an architecture that builds parallel streams that communicate on a single node across operating system boundaries… and these inefficiencies can limit the number of parallel streams that can be deployed. Greenplum, for example, no longer recommends deploying a segment instance per core on a single node and as a result not all of the available CPU can be applied to each query.

This blog will outline some other interesting limits on the level of parallelism in several products and on the definition of Massively Parallel Processing (MPP). Note that the level of parallelism is directly associated with performance.

On HANA a thread is built for each core… including a thread for each hyper-thread. As a result HANA will split and process data with 80 units of parallelism on a high-end 40-core Intel server.

Exadata deploys 12 cores per cell/node in the storage subsystem. They deploy 12 disk drives per node. I cannot see it clearly documented how many threads they deploy per disk… but it could not be more than 24 units of parallelism if they use hyper-threading of some sort. It may well be that there are only 12 units of parallelism per node (see here).

Updated April 16: Netezza deploys 8 “slices” per S-Blade… 8 units of parallelism… one for each FPGA core in the Twin times four (2X4) Twinfin architecture (see here). The next generation Netezza Striper will have 16-way parallelism per node with 16 Intel cores and 16 FPGA cores…

Updated April 17: Teradata uses hyper-threading (see here)… so that they will deploy 24 units of parallelism per node on an EDW 6700C (2X6X2) and  32 units of parallelism per node on an EDW 6700H (2X8X2).

You can see the different definitions of the word “massive” in these various parallel processing systems.

Note that the next generation of Xeon processors coming out later this year will have 8X15 processors or 120 cores on a fat node:

  • This will provide HANA with the ability to deploy 240 units of parallelism per node.
  • Netezza will have to find a way to scale up the FPGA cores per S-Blade to keep up. TwinFin will have to become QuadFin or DozenFin. It became HexadecaFin… see above. – Rob
  • Exadata will have to put 120 SSD/disk drive combos in each node instead of 12 if they want to maintain the same parallelism-to-disk ratio with 120 units of parallelism.
  • Teradata will have to find a way to get more I/O bandwidth on the problem if they want to deploy nodes with 120+ units of parallelism per node.

Most likely all but HANA will deploy more nodes with a smaller number of cores and pay the price of more servers, more power, more floor space, and inefficient inter-node network communications.

So stay tuned…

20 thoughts on “MPP on HANA, Exadata, Teradata, and Netezza

    • Hi Kevin… I used the word “shard” to loosely convey a unit of parallelism rather than a physical slice of data. Maybe this is not perfectly accurate?

      I assume that you cannot simultaneously execute more parallel threads than the number of cores (plus optional hyper-threads). The current Exadata Cell has 12 cores and 12 disks… I assume this provides 12-way parallelism through to the disk I/O (bottlenecked by the disk controller). If they were to move to Ivy-Bridge with 120 cores they would need 120 drives to maintain the same degree of parallelism… wouldn’t they?

      Like

      • No, it doesn’t really work that way. IVB-EX would only appear in the hosts anyway (e.g., an X4-8). At the cell (Storage Server) level they wisely use lower wattage EP parts. All cores hammer I/O against all disks. There is no snippet-like affinity.

        None of what I just said goes directly against the premise of your post. I just wish I could teach everyone everything about Exadata in one brain drop so that not so many people are deceived into thinking it is something it isn’t (thinking of end users in this regard).

        SAP should hold a Clinic and invite poeple to attend who want to actually learn what Exadata *is* as opposed to what they are being told it is. I’d be a guest speaker as EMC and SAP are good partners.

        Like

      • I understand… and I knew that there was no core-to-disk affinity… but that the level of parallelism cannot, of course, be higher than the number of cores + hyper-threads.

        I will take you up on your offer to educate us (it is hard to try and sort through the Exadata noise every time… I’m happy if I get it close to right)…

        Like

  1. Hey Rob, I’m sure you’ll be relieved and delighted to hear that in the current Netezza Striper platform (that superseded the old TwinFin architecture) they have gone beyond your suggested Quad or DozenFin to 16 cores per S-Blade – which are coupled to up to 2.5 data slices each (so there are between 32-40 data slices per S-Blade). This translates to 240 separate parallel I/O data slices in a single full rack 7 S-Blade system. That massive enough for ya?

    Like

    • Cool! I updated the post to reflect that Striper will provide 16-way parallelism per node. Thanks, Huw.

      It is also worth noting that Striper eliminates the affinity Kevin mentions…

      I’ll leave it to the reader to decide if 16-ways per node is massive relative to the competition (it is not about data slices it is about parallel processing).

      Like

  2. Re HexadecaFin: Heh – you’re clearly wasted in engineering…. marketing beckons! I’ll leave it up to someone else to rise to the bait of the ‘parallel disk I/O doesn’t count as parallel processing’ comment.

    Like

  3. Incidentally, if anyone from IBM Marketing happens to read Rob’s blog, HexadecaFin is still a better name than the official ‘IBM PureData System For Analytics Powered by Netezza (N2001)’ branding.

    Like

  4. Gee Rob. People are watching you carefully these days. Guess the blog is a success.

    Teradata actually runs 25-40 AMPs per node — there is no pairing of AMP processes to cores. The machine SAS ran on had 12 cores, not 24. They may have been referencing the fact that the new Intel cores can spawn 2 or more virtual cores. With hyperthreading, each core can have several work streams running in parallel.

    Furthermore, AMPs are processes, not threads. They are virtual machines without the typical VM overhead. So they can have many threads per AMP. A given AMP can easily have 10-20 threads * 40 AMPs so one node could realistically have 400 units of parallelism or more. An AMP can have a lot of things running at once.

    The AMP counts are designed in each generation by total node bandwidth including CPU, memory, and IO. Teradata won’t need 120 AMPs/node anytime soon.

    Like

    • So, just to cover the Teradata vAMPs/Threads stuff… a vAMP is implemented as a process. Amp Worker Tasks are implemented as a pool of threads for each vAMP. If a Node has 40 vAMPs assigned to it, and 80 AWTs per vAMP configured – then there are 3200+ data processes & threads per node.

      One executing session uses 40 concurrent threads in this example, so even a single executing session will oversubscribe the CPUs on a Teradata node provided it is fed enough IO or functions to do. Your mileage *will* vary…

      The issue with these counts becomes how many process objects can the Linux scheduler really keep up with for a reasonable amount of overhead. I think 3K is about the limit with SUSE – but that can be debated by OS kernel guys forever.

      Keep in mind that to the OS, a thread is treated just like a process accept that it’s memory space is shared with it’s parent making the allocation & teardown of a thread cheaper – not the management of the thread.

      Like

      • Guys… I will generate another post to try and clear this up. This is partly my fault for not being very clear in how I used the words “thread” and so forth.

        But a unit of parallelism is a CPU/core plus a thread to feed it plus a way to feed the thread data. If there is one core and 5 threads or worker tasks or slices or… then there is one unit of parallelism. If there is one core and ten threads/processes and twenty shards of data there is one unit of parallelism. The other stuff are just techniques to keep the core fed. The only exception to this is that hyper-threading allows two instructions to execute at the same time… so a hyper-threaded core can provide two units of parallelism.

        Think back to the days when Teradata ran on a single core CPU node… if we deployed 10 nodes we said that there were 10 units of parallelsim regardless of overlapped threads with overlapped I/O.

        Sigh. I thought that I was making a simple point about who was how parallel. I think that the point stands without correction… but I see that I need to explain this a little more clearly to my readers.

        Like

  5. You might also ask: if a single Teradata core can run 2 threads… why would you oversubscribe? There are only two reasonable answers:
    1) You want to start lots of concurrent queries… even though there is considerable overhead in swapping them in and out; or
    2) You want to start lots of concurrent queries… because they are always giving up the core in order to perform I/O… both read I/O from disk and write I/O to spool.

    Both are to be avoided if possible. Avoiding spool I/O is possible… Greenplum and HANA do it via data flow. Avoiding other I/O is possible if your data is in-memory.

    Every vendor would use 120 cores/node if they could. The economics are overwhelming.

    The point of my blog is that some vendors can use more cores/node than others.

    Teradata would happily suggest that there is a shortcoming in the Netezza architecture that limits them to 16 units of parallelism/node… while Teradata can provide 32 units/node (and twice the parallel performance).

    I would suggest that there is a shortcoming in the both the Netezza and the Teradata architectures that limits them from using the 240 units/node offered by the upcoming Intel chips…

    To suggest that Teradata does not “need” 120 cores is wrong, Dan. The issue is that they cannot use 120 cores (ok… if they cannot use them they do not need them).

    Like

    • Paul, as I understand it the current EMC DCA recommends 1 segment instance for every two cores. There is no architectural limitation I know of… So there is likely some other bottleneck when too many segments start quacking to each other. Each DCA node has 16 cores. As I recall, some work in a Segment instance is multi-threaded for each query… So even though they deploy 1 Segment for two cores they should get more than 8 units of parallelism per query per node… But no more than 16 units…

      Rob

      Like

  6. If units of parallelism are defined by core count as you say, then HANA and Greenplum cannot provide any more UoPs than the Intel CPU provides — just like everyone else. To argue that data flow somehow injects new invisible cores into a hardware device makes little sense to me. Furthermore, if you want to count data flows, Netezza’s SPUs have a decent FPGA implementation so you then have to count the threads on the SPUs. I’m not impressed with the overall NZ architecture but with these definitions the facts are NZ qualifies for a better score. And Teradata has built in data flows as well, more on the way.

    Furthermore, you should look into Intel hyperthreads. There are actually 5 ALUs plus a half dozen more parallel processes in the Sandy Bridge/Romley CPUs, more coming in Ivy Bridge. So there are at least 5 parallel threads running every CPU second.

    Teradata cannot use 120 cores because such machines are not cost effective for the workloads our customers run. They are exotic, 1-of-a-kind beasts that pop up every few years, make no dent in the market, then fade away. An industry secret is that machines beyond 16 cores do not scale up well on analytic workloads. You can either have a non-linear scale up with Superdome, UE15000, and SGI, or you can have a better CPU utilization and scaling with shared nothing scale out. The large core counts typically yield 50-60 CPUs of boost for 128 cores — a 50% yield. That’s OK for data mining algorithms, not OK for queries. Software has little effect on the memory buss problem unless you run scientific work with carefully honed algorithms. Scale up is improving — not long ago 8 cores was lunatic fringe. But today anything beyond 32 cores is simply throwing money away for high concurrent DSS workloads. So Teradata uses standard high volume nodes — same as HANA, same as GP, same as Oracle and IBM — to get the best balance of CPU to memory speed. Also, the latest Intel nodes are ccNUMA, a clear indication that memory bandwidth needs some changes to survive the speed of the cores. Only the Sun UE1000/15000 solved that — with a CRAY bus — and we aren’t seeing those sold for DSS/analytics. Nice work by Pervasive on SGI 256-cores for some algorithms but its not a concurrent query machine.

    This is not the best forum for further debate on what is a unit of parallelism, how many tasks can be in flight, etc. Teradata’s best-in-class parallelism is the heavyweight champion. The challengers must prove themselves, not the other way round. Nuff said.

    Like

    • Data flows allow HANA and Greenplum to use CPU without unnecessary spool I/O. This is a distinct architectural advantage and the Five Minute Rule points out the flaw in Teradata’s design (look it up on Wikipedia or see the posts here). I did not suggest that data flows create invisible cores, Dan. I fully agree that there cannot be more units of parallelism than there are cores + hyper threads.

      There cannot possibly be more parallelism, software that executes in parallel not software that waits in parallel, than there are active and executing threads. On an Intel Xeon processor this means the number of cores times two.

      There are, as you point out, multiple things going on (which are not threads at all) inside an Intel core. But there are only two hyper-threads. A hyper-thread is exposed to the OS and is available for use by Teradata and by others… the internal parallel processes in the CPU are not exposed and are therefore irrelevant to this discussion.

      Processing technology is continuing to advance at the pace described by Moore’s Law in large part due to the continued expansion in the number of cores on a CPU. I would argue, as would Intel, that the top end Ivy Bridge CPU is the next generation… Not an esoteric one-off. Intel is not some little niche player to be dismissed as a flash-in-the-pan.

      If you are right Dan, then you have declared the end of Moore’s Law at 32 cores/CPU. If Intel and AMD and SUN/Oracle and IBM are right then Teradata will fall behind for the inability to use the newest hardware… Or, more likely, you will adapt your architecture to feed nodes with more cores in the near future.

      When you have to finish your response by resting on Teradata’s laurels… I can finish confident that you are marketing, not architecting.

      Like

    • That is pretty non-specific, Dan… Pointed at me instead of at my argument. My knowledge of lots is out of date. But I do not see where I made a single claim with regards to Teradata internals… Other than to suggest that you write intermediate results to spool… If this has changed please help me and my readers to understand the new architecture?

      Like

  7. Also… Netezza FPGAs “front-end” the Intel cores… that is, they act in tandem to process a query. If you doubled both the number of cores and the number of FPGAs you would double the throughput. In other words you would double the units of parallelism. If you doubled the number of cores but did not double the number of FPGAs you would not double the throughput… so they do not provide independent units of parallelism.

    Like

Comments are closed.