Teradata CPU Planning

I suggested here that Teradata shipped the EDW 6700 series without waiting for Ivy Bridge because they could not use the cores effectively… but it could be that Haswell (see here) simply fit their release schedule better. It will be interesting to see whether they can then use all of the cores.

13 thoughts on “Teradata CPU Planning”

  1. It’s a parallel system; they will put more parallel units (AMPs) on the box, and all of the Intel cores will be used.

    • Hi Babak…

      It is not that easy. Teradata maintains a careful balance of CPU, memory, IO bandwidth, and storage. The new Intel CPUs dramatically increase the number of cores and the amount of processing available. But it is not possible to maintain the balance mentioned above, as there is no comparable increase in available IO bandwidth. This is why Teradata has to move more in-memory.

      Rob

      • Hi Rob, I agree about the dramatic increase in CPU power, but not about the balance of IOs. IO bandwidth is usually handled by the Intel bridge chips, and it usually grows along with Intel’s “tick-tock” silicon schedule. Ivy Bridge will allow more IO bandwidth on the board, so by adding more FC disk arrays to each node, the IO bandwidth will grow. Teradata will then put more storage on the node and manage it with more AMPs in that node.

        At the beginning of Teradata’s history, there was one AMP per 80386 chip. This ratio was respected until Intel released multicore Xeons; from that point, Teradata has put one AMP per core. The current Sandy Bridge-based nodes have 32 logical CPUs, and the usual Teradata configuration is about 32 AMPs per node. Intel plans 12 cores per Ivy Bridge chip, so future Teradata configurations will be about 48 AMPs per node. I have seen this behavior since I touched my first DBC/1012 at the beginning of the ’90s.
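Babak’s one-AMP-per-logical-CPU extrapolation can be sketched in a few lines. The socket and hyper-thread counts below are assumptions chosen to reproduce the 32- and 48-AMP figures in the comment, not published Teradata configurations.

```python
# One AMP per logical CPU: sockets x cores x hyper-threads per core.
# The two-socket, two-threads-per-core figures are assumptions that
# reproduce the 32- and 48-AMP numbers cited in the thread.

def amps_per_node(sockets, cores_per_socket, threads_per_core=2):
    """Logical CPUs per node, which the comment equates to AMPs per node."""
    return sockets * cores_per_socket * threads_per_core

sandy_bridge = amps_per_node(2, 8)    # -> 32, matching the current configs
ivy_bridge = amps_per_node(2, 12)     # -> 48, Babak's projection
print(sandy_bridge, ivy_bridge)       # prints: 32 48
```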

        Cheers, Babak

      • IO bandwidth is tied to the number of disk controllers and the speed of the disks, Babak… A single node with 2 disk controllers (there are 4-controller configurations available… But they are rare) can support around 2.4 GB/sec of read IO… Regardless of the disk speed or the placement of data on the disk (i.e. the controller is the bottleneck). It does not increase if SSD devices are connected behind the controller.

        This number will not increase with Ivy Bridge or Haswell CPUs.

        This is why everyone is trying to get more data in silicon behind the controller. There are two ways to do this… You can attach to the PCI bus or use DRAM (on the memory bus).

        Adding virtual AMPs will not help.
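Rob’s bottleneck argument amounts to taking the minimum over the two stages of the IO path. A minimal sketch, assuming illustrative per-controller and per-device rates (only the 2.4 GB/sec two-controller node figure comes from this thread):

```python
# Effective node read bandwidth is capped by the slower stage of the IO
# path: the controllers or the devices behind them. The 1.2 GB/sec
# per-controller figure is inferred from the thread's 2.4 GB/sec
# two-controller node; device rates are illustrative assumptions.

def node_read_bandwidth_gb(controllers, per_controller_gb, devices, per_device_gb):
    """Min of aggregate controller bandwidth and aggregate device bandwidth."""
    controller_cap = controllers * per_controller_gb
    device_cap = devices * per_device_gb
    return min(controller_cap, device_cap)

# Spinning disks: 24 drives at 0.15 GB/sec could deliver 3.6 GB/sec...
hdd = node_read_bandwidth_gb(2, 1.2, 24, 0.15)  # -> 2.4 (controller-bound)
# Swapping in faster SSDs raises the device cap but not the node's throughput:
ssd = node_read_bandwidth_gb(2, 1.2, 24, 0.5)   # -> 2.4 (still controller-bound)
print(hdd, ssd)
```

The min() makes the point concrete: once the controllers are the binding cap, faster devices (or more virtual AMPs) change nothing.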

        Rob

        (FYI… I was employee 400 at Teradata in 1987… In other words I am old as dirt…)

      • Again I agree, but the situation can’t stay static. The current trend in disk array technology is 8/16Gb bandwidth; however, there are indications that storage providers are going to use InfiniBand attachment for their controllers (SGI has recently released one). So in the near future the bandwidth will grow to 40Gb, which is the current IB bandwidth. Personally, I think this will be the way for everybody; I don’t imagine that Teradata will stay out of it. There is a constant armor-versus-knight battle between storage and CPU, but from a long-term perspective (looking at the past), both components move in parallel.  Babak

      • Disk arrays have multiple controllers… That is why they carry more bandwidth. But a shared-data disk array is a woeful piece of hardware for a shared-nothing architecture… That is why few Teradata customers used them.

        It will be interesting to see how Infiniband and other network fabrics play out, Babak… But no one is going to dedicate a 40Gb switch to a single node… So the problem is not solved… And 40Gb = 5GB… So there is no real magic there anyway.

        Rob

      • Let me make a risky comparison: Teradata made an order-of-magnitude jump by shifting from the YNet to the BYNET. The main difference between those interconnect fabrics was the point-to-point topology, which allowed them to build huge clusters of nodes. We have the same thing here with SCSI/SAS attachment versus IB attachment.

        Yes, 40Gb is only 5GB, but with only two PCI slots we may have 8 IB ports. The IB protocol will probably end up replacing PCI itself, and then the race will continue!

      • Also… Ivy Bridge will support 120 cores now… not 12… and 240 hyper-threads. The next gen will support twice that… This is why it is critical to get data in-memory…

        Teradata will fall way behind supporting only 48 cores… This is an issue…

      • There are two different kinds of bandwidth to consider. One is the bandwidth between CPUs. In the last several Intel generations, this has increased far beyond the 6Gbit bridge on the AMD chipsets, which became a problem for even the most sophisticated OS kernels (SunOS, arguably, at the time). I believe this is in the 40 to 80Gbit range now.

        When Intel moved the memory controller onto the socket, they also completely redesigned the IO architecture to support greater independence between where a process runs and its cache and memory. This is required to prevent applications from having to manage the affinity of which process runs on which CPU (this is where AMD fell apart).

        The on-socket bandwidth limitation is what limits the practical concurrency of an application on a socket, particularly for applications with large unit-of-work issues (all DW apps), because the context of the work is highly likely to persist across timeslice intervals, and hence across movement of threads to different CPUs. Apps with small or very small UoW do not have these issues (all transactional DB systems) because the UoW executes within a timeslice, on the same CPU.

        The other type of bandwidth, controller bandwidth, is limited by the type and number of disk controllers and motherboard controllers. MB controllers now easily exceed 5GB/sec (big B) in overall bandwidth, supporting both multi-GB disk bandwidth and concurrent networking speeds of 1GB/sec and greater (InfiniBand). Just increase the number of array controllers.

        At the end of the day, I think the 5 minute rule oversimplifies an increasingly complex environment which is now dependent on the type of workload put on an application – and how that application is architected. I don’t think the 5 minute rule applies in the majority of the cases we’re talking about, and as much as I would like there to be a high correlation between CPU count, memory size, and performance – I don’t think there is any longer.

        I believe that we’re increasingly in a situation in which applications and workload need to be matched at a very detailed level – which includes the selection of hardware. An easy example is trying to convert a large body of transformation code (100K SQL statements say) from a Row Store to a Column Store – or the reverse. Both DBMS systems support methodologies of the other – very, very poorly.

        The most important factor in determining the optimal system will be the average size and variance of the working set and the variance of the size of Unit of Work in that working set. None of which is simple.

  2. @Babak: I just do not see it. If I put 10 nodes together today I get 10 × 2.4GB/sec of IO bandwidth… 24GB/sec. If I put 10 nodes on an IB fabric to support that IO, I get 5GB/sec of bandwidth. It is way too expensive to dedicate an IB fabric to each node.
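The arithmetic above can be sketched directly; the figures come from this thread (2.4 GB/sec of controller bandwidth per node, one shared 40 Gbit/sec InfiniBand link):

```python
# Aggregate IO bandwidth: shared-nothing node controllers vs. a shared
# InfiniBand fabric. Figures are the ones cited in the thread; 8 bits
# per byte converts Gbit/sec to GByte/sec.

def gbit_to_gbyte(gbit):
    return gbit / 8

nodes = 10
per_node_controller_gb = 2.4                          # GB/sec, per node
shared_nothing_gb = nodes * per_node_controller_gb    # aggregate across nodes

ib_fabric_gb = gbit_to_gbyte(40)                      # a 40Gb link is only 5 GB/sec

print(shared_nothing_gb, ib_fabric_gb)                # prints: 24.0 5.0
```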

    It is unclear that the bottleneck for IB is between the PCI bus and memory… so even if that were the next step it is unclear whether this improves the overall IO throughput (although it would reduce the latency to start an IO over Infiniband).

    You can hope for more but I do not see it on the horizon.

    Rob

  3. @Michael: I could be wrong but I do not believe that the issue you raise: IO bandwidth between the CPUs is relevant to the discussion at hand… which is about whether Teradata can effectively utilize 120 cores per node… although it is a fascinating story for sure… and it may become the relevant bottleneck if data moves in-memory and the IO bottleneck goes away.

    Regarding the Five Minute Rule: I would ask you to provide a case where the Five Minute Rule leads to the wrong result based on workload. To dismiss the rule as overly simple without a case is hard to get my head around.

    I will write up a separate post to try and explain why I think it works… but here is a thought:

    You might agree that having a 4K block of data in-memory and writing it out only to read it back a fraction of a second later would be wasteful regardless of the workload. In other words if the cost of an extra 4K of memory was not too high then it would be worth the cost… regardless of the workload. On the other side, you might agree that putting a 4K block of data that was accessed once a year in memory would be wasteful regardless of the workload… unless the cost of the memory was equal to or less than disk (and there were no constraints on the amount of memory). The point of the Five Minute Rule is to find the trade-off between these two extremes regardless of the workload.

    Workload management manages workload by starting and dispatching work according to business priorities. Memory and data management should be designed to optimize the performance of those subsystems, trading off price and performance. If anything, workload management should optimize the utilization of memory within the business constraints… Your argument implies that there is some basic disconnect between these two optimizations… but I do not see it.

    I can see where some data should be retained in-memory longer than the Five Minute Rule suggests based on business priorities. Further, if memory is in scarce supply, I can see where the high-value data should be retained in-memory and over-ride the Five Minute Rule. But I do not see a case where, if you can deploy more memory in support of the Five Minute Rule, you should not.
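The trade-off described here is usually quantified with Gray and Putzolu’s break-even formula: a page is worth keeping in memory if it is re-read more often than once per break-even interval. A sketch, with illustrative prices (the disk and RAM costs below are assumptions, not current market figures):

```python
# Gray & Putzolu's Five Minute Rule break-even interval: how recently a
# page must be re-referenced for caching it in RAM to be cheaper than
# re-reading it from disk. All prices are illustrative assumptions.

def break_even_seconds(pages_per_mb_ram, ios_per_sec_per_disk,
                       price_per_disk, price_per_mb_ram):
    """Break-even re-reference interval (seconds) for keeping a page in RAM."""
    return (pages_per_mb_ram / ios_per_sec_per_disk) * \
           (price_per_disk / price_per_mb_ram)

# Example: 4KB pages (256 per MB), a disk doing 200 random IOs/sec,
# a $400 disk, and RAM at $0.01 per MB.
interval = break_even_seconds(256, 200, 400, 0.01)
print(interval / 3600)  # roughly 51,200 seconds, i.e. about 14 hours
```

Note that as RAM gets cheaper relative to disk IOs, the interval stretches far past five minutes, which is exactly the pressure toward in-memory data that Rob describes.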

    • @Rob – I like the thought of taking the discussion about the 5 Minute Rule to another post; there is a lot of good stuff to discuss.

      Relevant to this post, the reason inter-CPU and inter-socket memory bandwidth is an issue is that the argument asserts a need to use 120 cores on a node; ergo, there will need to be quite a fair number of sockets.

      In order for any application to use a large number of sockets, it will need to know a lot about thread affinity, and it will need an extraordinary amount of inter-socket and inter-CPU bandwidth to make up for any lack thereof. NUMA and Sequent come to mind…

      On another note, take the thinking on the 4K data block to 100K per block; it creates a different landscape. Are there still analytic vendors that use 4K blocks? Oracle, maybe?

  4. I picked 4K because I am old. It is not important to my argument, and I do not think it changes the landscape at all. There is a place where memory is cost-justified regardless of the workload… and a place where it is not cost-effective regardless of the workload… and a place where, if there is a shortage of memory, business priorities trump the Five Minute Rule.

    Let’s take the affinity/NUMA discussion to another post. Affinity is a big issue when there are lots of queries, each with its own cores… each with primed cache lines, which are subsequently invalidated if the thread is interrupted and then dispatched on another core. It is much less of an issue if there is no IO interruption (because all of the data is in-memory) and/or if all of the cores are working in parallel on the same query. I’ll ping you privately about how we might work together on this…

Comments are closed.