MPP on HANA, Exadata, Teradata, and Netezza

6 May… There is a good summary of this post and of the comments here.  – Rob

17 April… A single unit of parallelism is a core, plus a thread/process to feed it instructions, plus a feed of data. The only exception is when the core uses hyper-threading… in which case two instruction streams can execute more-or-less at the same time… then a core provides 2 units of parallelism. All of the other stuff: many threads per core and many data shards/slices per thread are just techniques to keep the core fed. – Rob
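Since this definition drives all of the numbers below, here is a tiny C sketch of the arithmetic as I read it: sockets times cores, doubled when hyper-threading is on. The configurations in main() are the ones cited in this post; the 4-socket by 10-core split for the 40-core HANA server is my assumption.

```c
#include <stdio.h>

/* Units of parallelism as defined above: one per core, or two per core
 * when hyper-threading lets two instruction streams run at once. */
static int units_of_parallelism(int sockets, int cores_per_socket,
                                int hyper_threaded) {
    return sockets * cores_per_socket * (hyper_threaded ? 2 : 1);
}

int main(void) {
    /* Configurations discussed in this post. */
    printf("HANA 40-core node: %d\n", units_of_parallelism(4, 10, 1)); /* 80 */
    printf("Teradata 6700C:    %d\n", units_of_parallelism(2, 6, 1));  /* 24 */
    printf("Teradata 6700H:    %d\n", units_of_parallelism(2, 8, 1));  /* 32 */
    return 0;
}
```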

16 April… I edited this to correct my loose use of the word “shard”. A shard is a physical slice of data and I was using it to represent a unit of parallelism. – Rob

I made the observation in this post that there is some inefficiency in an architecture that builds parallel streams that communicate across operating system process boundaries on a single node… and these inefficiencies can limit the number of parallel streams that can be deployed. Greenplum, for example, no longer recommends deploying a segment instance per core on a single node… and as a result not all of the available CPU can be applied to each query.

This post will outline some other interesting limits on the level of parallelism in several products… and what those limits imply about the definition of Massively Parallel Processing (MPP). Note that the level of parallelism is directly associated with performance.

On HANA a thread is built for each core… including a thread for each hyper-thread. As a result HANA will split and process data with 80 units of parallelism on a high-end 40-core Intel server.
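As an illustration of the pattern (a generic sketch, not HANA code), the following C program spawns one worker per logical CPU reported by the operating system… on a hyper-threaded 40-core box it starts 80 workers. Compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One worker per logical CPU: each would be fed its own stream of
 * instructions and its own slice of the data. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("unit of parallelism %ld ready\n", id);
    return NULL;
}

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN); /* logical CPUs, incl. hyper-threads */
    pthread_t tid[n];
    for (long i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    printf("%ld units of parallelism on this node\n", n);
    return 0;
}
```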

Exadata deploys 12 cores per cell/node in the storage subsystem. They deploy 12 disk drives per node. I cannot find it clearly documented how many threads they deploy per disk… but it could not be more than 24 units of parallelism if they use hyper-threading of some sort. It may well be that there are only 12 units of parallelism per node (see here).

Updated April 16: Netezza deploys 8 “slices” per S-Blade… 8 units of parallelism… one for each FPGA core in the Twin times four (2X4) TwinFin architecture (see here). The next-generation Netezza Striper will have 16-way parallelism per node with 16 Intel cores and 16 FPGA cores…

Updated April 17: Teradata uses hyper-threading (see here)… so they will deploy 24 units of parallelism per node on an EDW 6700C (2X6X2) and 32 units of parallelism per node on an EDW 6700H (2X8X2).

You can see the different definitions of the word “massive” in these various parallel processing systems.

Note that the next generation of Xeon processors coming out later this year will support 8-socket by 15-core (8X15) configurations… 120 cores on a fat node:

  • This will provide HANA with the ability to deploy 240 units of parallelism per node (120 cores with 2 hyper-threads each).
  • Netezza will have to find a way to scale up the FPGA cores per S-Blade to keep up. TwinFin will have to become QuadFin or DozenFin. It became HexadecaFin… see above. – Rob
  • Exadata will have to put 120 SSD/disk drive combos in each node instead of 12 if they want to maintain the same parallelism-to-disk ratio with 120 units of parallelism.
  • Teradata will have to find a way to get more I/O bandwidth on the problem if they want to deploy nodes with 120+ units of parallelism per node.

Most likely all but HANA will deploy more nodes with a smaller number of cores and pay the price of more servers, more power, more floor space, and inefficient inter-node network communications.

So stay tuned…

Teradata, HANA and NUMA

Teradata is circulating a document to customers that claims that the numbers SAP has published in its 100TB PoC white paper (here) demonstrate that HANA suffers from scaling issues associated with the NUMA-effect. The document is so annoyingly inaccurate that I have to respond.

NUMA stands for non-uniform memory access. This describes an architecture whereby each core in a multi-core system has some very fast local memory accessed directly through a memory bus… but has access to every other core’s local memory through a “remote” access hop over another fast bus. In the case of Intel Xeon servers the other fast bus is known as the QPI bus. “Non-uniform” means that not all memory accesses are equal… a remote access over the QPI bus is slower than access over the memory bus.
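You can see the effect directly by timing the same scan against local and remote memory. This is a minimal sketch assuming Linux with libnuma installed (compile with -lnuma) and at least two NUMA nodes; the buffer size and stride are arbitrary choices of mine.

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF (256 * 1024 * 1024) /* 256 MB… large enough to defeat the caches */

/* Walk the buffer touching one byte per cache line and return elapsed seconds. */
static double walk(volatile char *p) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF; i += 64)
        sum += p[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least 2 nodes\n");
        return 1;
    }
    numa_run_on_node(0);                      /* pin ourselves to node 0 */
    char *local  = numa_alloc_onnode(BUF, 0); /* memory-bus access */
    char *remote = numa_alloc_onnode(BUF, 1); /* access over the QPI hop */
    memset(local, 1, BUF);
    memset(remote, 1, BUF);
    printf("local:  %.3fs\n", walk(local));
    printf("remote: %.3fs\n", walk(remote));
    numa_free(local, BUF);
    numa_free(remote, BUF);
    return 0;
}
```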

The first mistake in the Teradata document is where they refer to the problem as the “SMP Knee Curve”. SMP stands for symmetric multi-processing… an architecture where multiple cores share the same memory bus. The SMP Knee Curve describes the problem that occurs when too many cores contend for that shared bus. HANA is not certified to run on an SMP system. The 100TB PoC described above was not run on an SMP system. When describing issues you might expect Teradata to at least associate the issue with the correct hardware architecture.

The NUMA-effect describes problems scaling processors across the sockets within a single server node. Those issues can impact the ability to continuously add cores as memory-locking issues across the QPI bus slow the system. There are ways to mitigate this problem, though (see here for some examples of how to code around the problem).
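One common mitigation, sketched here with plain pthreads on Linux (a generic illustration of the technique, not HANA’s code): pin each worker to a core before it allocates anything, so Linux’s first-touch policy places each partition’s pages on that worker’s own node and nearly every access stays on the local memory bus… shared-nothing inside the node. Compile with -pthread; it assumes at least 4 CPUs.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define PART (4 * 1024 * 1024) /* longs per partition… arbitrary */

static void *partition_worker(void *arg) {
    long cpu = (long)arg;

    /* Pin this thread to its CPU before touching any memory. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* First-touch: because this thread faults the pages in, Linux places
     * them on the NUMA node of the CPU we just pinned to. */
    long *part = malloc(PART * sizeof(long));
    long sum = 0;
    for (long i = 0; i < PART; i++) part[i] = i;
    for (long i = 0; i < PART; i++) sum += part[i]; /* all local accesses */
    free(part);
    return (void *)sum;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, partition_worker, (void *)i);
    for (long i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    puts("done: each partition lived on its owning thread's node");
    return 0;
}
```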

Of course HANA, which was built as an in-memory system with NUMA as a target from the start… has these NUMA mitigations built in. In fact, HANA is designed deeper still… using special techniques to keep the processor caches filled and to invoke special-purpose SIMD instructions. HANA is built so close to the hardware that it avoids the processor cycles that are wasted on cache misses yet still show up as processor busy (in other words, HANA will get more work done on a 100% CPU-busy system than other software that shows 100% CPU busy). But Teradata chose to ignore this deep integration… or they were unaware of these techniques.
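For flavor, here is what a SIMD column scan looks like in C with SSE2 intrinsics… a generic illustration of the kind of instruction HANA invokes, not HANA’s actual code. Four 32-bit column values are compared per instruction as the column streams through the cache; the alignment and length restrictions are mine, to keep the sketch short.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdio.h>

/* Count rows where column[i] == key, four values at a time.
 * Assumes n % 4 == 0 and 16-byte-aligned input. */
static long simd_count_eq(const int *column, long n, int key) {
    __m128i k = _mm_set1_epi32(key);
    long hits = 0;
    for (long i = 0; i < n; i += 4) {
        __m128i v  = _mm_load_si128((const __m128i *)&column[i]);
        __m128i eq = _mm_cmpeq_epi32(v, k); /* 0xFFFFFFFF where equal */
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
        hits += __builtin_popcount(mask);
    }
    return hits;
}

int main(void) {
    static int column[1024] __attribute__((aligned(16)));
    for (int i = 0; i < 1024; i++) column[i] = i % 7;
    printf("matches: %ld\n", simd_count_eq(column, 1024, 3)); /* 146 */
    return 0;
}
```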

Worse still, the problem Teradata calls out… shouts out… is about scaling over 100 nodes in a shared-nothing configuration. The NUMA-effect has nothing at all to do with scale-out across nodes. It is an issue within a single node. For Teradata to conflate the two is silliness at best. It is especially silly since the shared-nothing architecture upon which HANA is built is the same architecture Teradata uses.

The twists Teradata applies to the numbers are equally absurd… but I’ll stop here and hope that the lack of understanding they exhibit in throwing around terms like “SMP Knee Curve” and “NUMA-effect” will cast enough doubt that the rest of their marketing FUD will be suspect. Their document is surely not about architecture… it is weak marketing… you can see more here.