This post is a follow on to the video on Systems Architecture here (video). I’ll describe how HANA, the Platform not the Database, is cutting fat from the landscape. If you have not seen the video recently it will be worth a quick look as I will only briefly recap it here.
The video, kindly sponsored by Intel, suggests that the distributed architecture we developed to collect lots of very small microprocessors into a single platform in order to solve big enterprise sized problems is evolving as multi-core microprocessors become ever more capable. In my opinion, we are starting to collapse the distributed architecture into something different.
Figure 1 shows how we might deploy a series of servers to implement an SAP software landscape. Note that if we removed a couple of pieces the picture represents a platform for just about any enterprise solution. In other words, it is not just about SAP ERP solutions. I have left this in so that companies with SAP in place can see where they may be headed.
Figure 2 shows the same landscape deployed in the HANA Platform. It is a very lean deployment with all of the fat of inter-operating systems combination replaced by processes communicating over lightweight thread boundaries. Squeezing the fat out will provide significant performance benefits over the distributed deployment… it will be 2-5 times faster easily… and 2-3 times faster even if you deployed the landscape on a single server in a series of virtual machines as you have to add in the overhead of a VM plus the cost of communication over virtual networks.
The HANA Platform is very powerful stuff… and if you had the choice between SAP and a competitor, the ability to match the performance on 1/2 the hardware might provide an overwhelming advantage to HANA.Note that in this evaluation I have not included any advantages from the performance kick due to the HANA Database… so the SAP story is better still.
Those of you who heard me describe HANA while I was at SAP know that this is not a new story. It was to be the opening of the shelved book on HANA. So this post satisfies my longstanding promise to get this story documented even though the book never generated much executive enthusiasm.
The trend in systems architecture outlined in the video is being capitalized by SAP as the HANA Platform… and the advantage provided is significant. You should expect SAP customers to rapidly move to HANA to take advantage of this tight, high-performance landscape.
But the software world does not stand still for long… and there are already options emerging that approximate this approach. Stay tuned…
This short post is intended to provide a quick warning regarding in-memory columnar and cpu requirements… with a longer post to follow.
When a row is inserted or bulk-loaded into a DBMS, if there are no indexes, the amount of cpu required is very small. The majority of the time is spent committing a transaction is the time to write a log record to persist the data.
When the same record is reformatted into a column the amount of processing required is significantly higher. The data must be parsed into columns, the values must be compressed, dictionaries may be updated, and the breadcrumbs that let the columnar data be regenerated into rows must be laid. Further, if the columnar structure are to be optimized then the data must be ordered… with a sort or some kind of index structure. I have seen academic papers that suggest that for an insert columnar processing may be 100X more than row processing… and you can see why this could be true (I apologize for not finding the reference… I’ll dig it up… as I recall I read it in a post some time back by Daniel Abadi).
Now let’s think about this… several vendors are suggesting that you can deploy their columnar features with no changes required… no new hardware… in-place. But this does not ring true if the new columnar feature requires 100X extra CPU cycles per row… or 50X… or 10X… unless you are running your database on an empty server.
This claim is a shot at SAP who, more honestly, suggests new hardware with high-end processors for their in-memory columnar product… but methinks it is marketing, not architecture, from these other folks.
As I noted here, I think that the IBM BLU Accelerator is a very nice piece of work. Readers of this blog are in the software business where any feature developed by any vendor can be developed in a relatively short period of time by any other vendor… and BLU certainly moves DB2 forward in the in-memory database space led by HANA… it narrowed the gap. But let’s look at the gap that remains.
First, IBM is touting the fact that BLU requires no proprietary hardware and suggests that HANA does. I do not really understand this positioning? HANA runs on servers from a long list of vendors and each vendor spins the HANA reference architecture a little differently. I suppose that the fact that there is a HANA reference architecture could be considered limiting… and I guess that there is no reference for BLU… maybe it runs anywhere… but let’s think about that.
If you decide to run BLU and put some data in-memory then certainly you need some free memory to store it. Assuming that you are not running on a server with excess memory this means that you need to buy more. If you are running on a blade that only supports 128GB of DRAM or less, then this is problematic. If you upgrade to a 256GB server then you might get a bit of free memory for a little data. If you upgrade to a fat server that supports 512GB of DRAM or more, then you would likely be within the HANA reference architecture set. There is no magic here.
One of the gaps is related: you cannot cluster BLU so the amount of data you can support in-memory is limited to a single node per the paragraphs above. HANA supports shared-nothing clustering and will scale out to support petabytes of data in-memory.
This limit is not so terribly bad if you store some of your data in the conventional DB2 row store… or in a columnar format on-disk. This is why BLU is an accelerator, not a full-fledged in-memory DBMS. But if the limit means that you can get only a small amount of data resident in-memory it may preclude you from putting the sort of medium-to-large fact tables in BLU that would benefit most from the acceleration.
You might consider putting smaller dimension tables in BLU…. but when you join to the conventional DB2 row store the column store tables are materialized as rows and the row database engine executes the join. You can store the facts in BLU in columnar format… but they may not reside in-memory if there is limited availability… and only those joins that do not use row store will use the BLU level 3 columnar features (see here for a description of the levels of columnar maturity). So many queries will require I/O to fetch data.
When you pull this all together: limited available memory on a single node, with large fact tables projecting in and out of disk storage, and joins pushed to the row store you can imagine the severe constraint for a real-world data warehouse workload. BLU will accelerate some stuff… but the application has to be limited to the DRAM dedicated to BLU.
It is only software… IBM will surely add BLU clustering (see here)… and customers will figure out that they need to buy the same big-memory servers that make up the HANA reference architecture to realize the benefits… For analytics, BLU features will converge over the next 2-3 years to make it ever more competitive with HANA. But in this first BLU release the use of in-memory marketing slogans and of tests that might not reflect a real-world workload are a little misleading.
Right now it seems that HANA might retain two architectural advantages:
HANA real-time support for OLTP and analytics against a single table instance; and
the performance of the HANA platform: where more application logic runs next to the DBMS, in the same address space, across a lightweight thread boundary.
It is only software… so even these advantages will not remain… and the changing landscape will provide fodder for bloggers for years to come.
Here is a great series of blogs on BLU that shows how joins with the row store materializes columns as rows…
I would like to recommend to you John Appleby’s post here on the HANA blog site. While the title suggests the article is about HANA, in fact it is about trends in computing and processors… and very relevant to posts here past, present, and upcoming…
I would also recommend Curt Monash’s site. His notes on Teradata here mirror my observation that a 30%-50% performance boost per release cycle is the target for most commercial databases… and what wins in the general market. This is why the in-memory capabilities offered by HANA and maybe DB2 BLU are so disruptive. These products should offer way more than that… not 1.5X but 100X in some instances.
Finally I recommend “What Every Programmer Should Know About Memory” by Ulrich Drepper here. This paper provides a great foundation for the deep hardware topics to come.
Database computing is becoming a special case, a commercial case, of supercomputing… high-performance computing (HPC) to those less inclined to superlatives. Over the next few years the differentiation between products will increasingly be due to the use of high-performance computing techniques: in-memory techniques, vector processing, massive parallelism, and use of HPC instruction sets.
This post has been thrown at me a couple of times now… so I’ll now take the time to go through it… and try to address the junk.
It starts by suggesting that “the Germans” have started a war… but the next sentence points out that the author tossed grenades at HANA two months before the start he suggests. It also ignores the fact that the HANA post in question was a response to incorrect public statements by a Microsoft product manager about HANA (here).
The author suggests some issue with understanding clustered indexes… Note that “There are 2 implementations of xVelocity columnstore technology: 1. Non clustered index which is read only – this is the version available in SMP (single node) SQL Server 2012. 2. Columnstore as a clustered index that is updateable – This is the version available in MPP or PDW version of SQL 2012.”. The Microsoft documentation I read did not distinguish between the two and so I mistakenly attributed features of one to the other. Hopefully this clears up the confusion.
He suggests that the concept of keeping redundant versions of the data… one for OLTP and one for BI is “untrue”… I believe that the conventional way to deal with OLTP and BI is to build separate OLTP and BI databases… data warehouses and data marts. So I stand by the original comment.
The author rightfully suggests that I did not provide a reference for my claim that there are odd limitations to the SQL that require hand-coding… here they are (see the do’s and dont’s).
He criticizes my statement that shared-nothing gave us the basis for solving “big data”. I do not understand the criticism? Nearly very large database in the world is based on a shared-nothing architecture… and the SQL Server PDW is based on the same architecture in order to allow SQL Server to scale.
He is critical of the fact that HANA is optimized for the hardware and suggests that HANA does not support Intel’s Ivy Bridge. HANA is optimized for Ivy Bridge… and HANA is designed to fully utilize the hardware… If we keep it simple and suggest that using hardware-specific instruction sets and hardware-specific techniques to keep data in cache together provide a 50X performance boost [This ignores the advantages of in-memory and focusses only on hw-specific optimizations… where data in cache is either 15X (L3) or 20X (L2) or 200X (L1) faster than data fetched from DRAM… plus 10X or more using super-computer SIMD instructions], I would ask… would you spend 50X more for under-utilized hardware if you had a choice? SAP is pursuing a distinct strategy that deserves a more thoughtful response than the author provided.
He accuses me of lying… lying… about SQL being architected for single-core x286 processors. Sigh. I am unaware of a rewrite of the SQL Server product since the 286… and tacking on support for modern processors is not re-architecting. If SQL Server was re-architected from scratch since then I would be happy to know that I was mistaken… but until I hear about a re-write I will assume the SQL Server architecture, the architecture, is unchanged from when Sybase originally developed it and licensed it to Microsoft.
He says that HANA is cobbled together from older piece parts… and points to a Wikipedia page. But he does not use the words in the article… that HANA was synthesized from other products and , as stated in the next sentence, built on: “a new application architecture“. So he leaves the reader to believe that there is nothing new… he is mistaken. HANA is more than a synthesis of in-memory, column-store, and shared-nothing. It includes a new execution engine built on algorithms from the search space… columns in the column store are processed as vectors rather than the rote tuple-by-tuple approach from the 1980’s. It includes powerful in-database support for procedural languages with facilities that convert loops to fully parallel set-based processes. It provides, as noted above, a unique approach to supporting OLTP and BI queries in the same instance (see here)… and more. I’m not trying to hype HANA here… time and the market will determine if these new features are important… but there is no doubt that they are new.
I did not find the Business Intelligist post to be very informative or helpful. With the exception of the Wikipedia article mentioned above there is only unsubstantiated opinion in the piece… … and a degree of rudeness that is wholly uncalled for.
I posted a blog on the SAP site here that discussed the implications of mobile clients. I want to re-emphasize the issue as it is crucial.
While at Greenplum we routinely replaced older EDW platforms and provided stunning performance. I recall one customer in particular where we were given a query that ran in 7 hours and Greenplum executed the query in seven seconds. This was exceptional… more typical were cases where we reduced run-times from several hours to under 30 minutes… to 10 minutes… to 5 minutes. I’m sure that every major competitor: Teradata, Greenplum, Netezza, and Exadata has similar stories to tell.
But 5 minutes will not cut it if you are servicing a mobile client where sub-second response to the device is a requirement… and 10 minutes is out of the question. It does not matter if it ran in 10 hours before… 10 minute response is not acceptable to a mobile device.
Today we see sub-second response delivered to our phones by custom applications built on special high-performance platforms designed specifically to service a mobile client: iPhones, iPads, and Android devices.
But what will we do about the BI applications built on commercial platforms which have just used every trick in the book to become one of the 5 minute stories mentioned above?
I think that there are only a couple of architectural choices.
We can rewrite the high-value queries as custom applications using specialized infrastructure… at great expense… and leaving the vast majority of queries un-serviced.
We can apply the 80/20 rule to get the easiest queries serviced with only 20% of the effort. But according to Murphy the 20% left will be the highest value queries.
We can tack on expensive, specialized, accelerators to some queries… to those that can be accelerated… but again we leave too much behind.
Or we can move to a general purpose high performance computing platform that can service the existing BI workload with sub-second response.
In-memory computing will play a role… Exalytics provides option #3… HANA option #4.
SSD devices may play a role… but the performance improvements being quoted by vendors who use SSD as a block I/O device is 10X or less. A 10X improvement applied to a query that was just improved to 10 minutes yields a 1 minute query… still not the expected level of service.
IT departments will have to evaluate the price/performance, not just the price, as they consider their next platform purchases. The definition of adequate response is changing… and the old adequate, at the least cost, may not cut it. Mobile clients are here to stay. The productivity gains expected from these devices is significant. High performance BI computing is going to be a requirement.
When I was at Greenplum… and now again at SAP… I ran into a strange logic from Teradata about query concurrency. They claimed that query concurrency was a good thing and an indicator of excellent workload management. Let’s look at a simple picture of how that works.
In Figure 1 we depict a single query on a Teradata cluster. Since each node is working in parallel the picture is representative no matter how many nodes are attached. In the picture each line represents the time it takes to read a block from disk. To make the picture simple we will show I/O taking only 1/10th of the clock time… in the real world it is slower.
Given this simplification we can see that a single query can only consume 10% of the CPU… and the rest of the time the CPU is idle… waiting for work. We also represented some I/O to spool files… as Teradata writes all intermediate results to disk and then reads them in the next step. But this picture is a little unfair to Greenplum and HANA as I do not represent spool I/O completely. For each qualifying row the data is read from the table on disk, written to spool, and then read from spool in the subsequent step. But this note is about concurrency… so I simplified the picture.
Figure 2 shows the same query running on Greenplum. Note that Greenplum uses a data flow architecture that pushes tuples from step to step in the execution plan without writing them to disk. As a result the query completes very quickly after the last tuple is scanned from the table.
Let me say again… this story is about CPU utilization, concurrency, and workload management… I’m not trying to say that there are not optimizations that might make Teradata outperform Greenplum… or optimizations that might make Greenplum even faster still… I just want you to see the impact on concurrency of the spool architecture versus the data flow architecture.
Note that on Greenplum the processors are 20% busy in the interval that the query runs. For complex queries with lots of steps the data flow architecture provides an even more significant advantage to Greenplum. If there are 20 steps in the execution plan then Teradata will do spool I/O, first writing then reading the intermediate results while Greenplum manages all of the results in-memory after the initial reads.
In Figure 3 we see the impact of having the data in-memory as with HANA or TimeTen. Again, I am ignoring the implications of HANA’s columnar orientation and so forth… but you can clearly see the implications by removing block I/O.
Now let’s look at the same pictures with 2 concurrent queries. Let’s assume no workload management… just first in, first out.
In Figure 4 we see Teradata with two concurrent queries. Teradata has both queries executing at the same time. The second query is using up the wasted space made available while the CPUs wait for Query 1’s I/O to complete. Teradata spools the intermediate results to disk; which reduces the impact on memory while they wait. This is very wasteful as described here and here (in short, the Five Minute Rule suggests that data that will be reused right away is more economically stored in memory)… but Teradata carries a legacy from the days when memory was dear.
But to be sure… Teradata has two queries running concurrently. And the CPU is now 20% busy.
Figure 5 shows the two-query picture for Greenplum. Like Teradata, they use the gaps to do work and get both queries running concurrently. Greenplum uses the CPU much more efficiently and does not write and read to spool in between every step.
In Figure 6 we see HANA with two queries. Since one query consumed all of the CPU the second query waits… then blasts through. There is no concurrency… but the work is completed in a fraction of the time required by Teradata.
If we continue to add queries using these simple models we would get to the point where there is no CPU available on any architecture. At this point workload management comes into play. If there is no CPU then all that can be done is to either manage queries in a queue… letting them wait for resources to start… or start them and let them wastefully thrash in and out… there is really no other architectural option.
So using this very simple depiction eventually all three systems find themselves in the same spot… no CPU to spare. But there is much more to the topic and I’ve hinted about these in previous posts.
Starting more queries than you can service is wasteful. Queries have to swap in and out of memory and/or in and out of spool (more I/O!) and/or in and out of the processor caches. It is best to control concurrency… not embrace it.
Running virtual instances of the database instead of lightweight threads adds significant communications overhead. Instances often become unbalanced as the data returned makes the shards uneven. Since queries end when the slowest instance finishes it’s work this can reduce query performance. Each time you preempt a running query you have to restore state and repopulate the processor’s cache… which slows the query by 12X-20X. … Columnar storage helps… but if the data is decompressed too soon then the help is sub-optimal… and so on… all of the tricks used by databases and described in these blogs count.
But what does not count is query concurrency. When Teradata plays this card against Greenplum or HANA they are not talking architecture… it is silliness. Query throughput is what matters. Anyone would take a system that processes 100,000 queries per hour over a system that processes 50,000 queries per hour but lets them all run concurrently.
I’ve been picking on Teradata lately as they have been marketing hard… a little too hard. Teradata is a fine system and they should be proud of their architecture and their place in the market. I am proud to have worked for them. I’ll lay off for a while.
Teradata makes several claims from these numbers. First they claim that the numbers demonstrate a bottleneck that is tied to either the NUMA effect or to the SMP Knee Curve. This nonsense is the subject of a previous blog here.
For any database system as you increase the number of queries to the point where there is contention the throughput decreases. This is just common sense. If you have 10 cores and 10 threads and there is no contention then all threads run at the same speed as fast as possible. If you add an 11th thread then throughput falls off, as one thread has to wait for a core. As you add more threads the throughput falls further until the system is saturated and throughput flattens. Figure 1 is an example of the saturation curve you would expect from any system as the throughput flattens.
There are some funny twists to this, though. If you are an IMDB then each query can use 100% of a core. If you are multi-threaded IMDB then each query can use 100% of all cores. If you are a disk-based system then you give up the CPU to another query while you wait for I/O… so throughput falls. I’ll address these twists in a separate blog… but you will see a hint at the issue here.
Teradata claims that these numbers reflect a scaling issue. This is a very strange claim. Teradata tests scaling by adding hardware, data, and queries in equal amounts to see if the query performance holds constant… or they add hardware and data to look for a correlation between the number of nodes and query performance… hoping that as the nodes increase the response time decreases. In fact Teradata scales well… as does HANA… But the hardware is constant in the HANA benchmark so there is no view into scaling at all. Let me emphasis this… you cannot say anything about scaling from the numbers above.
Teradata claims that they can extrapolate the saturation point for the system… this represents very bad mathematics. They take the four data points in the table and create an S curve like the one in Figure 1… except they invert it to show how throughput decreases as you move towards the saturation point… Figure 2 shows the problem.
If you draw a straight line through the curve using any sort of math you miss the long tail at the end. This is an approximation of the picture Teradata drew… but even in their picture you can see a tail forming… which they ignore. It is also questionable math to extrapolate from only four observations. The bottom line is that you cannot extrapolate the saturation point from these four numbers… you just don’t know how far out the tail will run unless you measure it.
To prove this is nonsense you just have to look here. It turns out that SAP publicly published these benchmark results in two separate papers and this second one has numbers out to 60 streams. Unsurprisingly at 60 streams HANA processed 112,602 queries per hour while Teradata told their customers that it would saturate well short of that… at 49,601 queries (they predicted that HANA would thrash and the number of queries/hour would fall back… more FUD).
Teradata is sending propaganda to their prospects with scary extrapolations and pronouncements of architectural bottlenecks in HANA. The mathematics behind their numbers is weak and their incorrect use of deep architectural terms demonstrates ignorance of the concepts. They are trying to create Fear, Uncertainty, and Doubt. Bad marketing… not architecture, methinks.
In the past SAP applications have, in general, avoided using database features. Even a SELECT with a projection was out-of-bounds. They did not want to depend on any database, so they tended to pull all data from the data layer to the application layer and loop through the data using procedural languages like ABAP. You might say that they were religiously database agnostic. My mistake… you might say that we were religiously database agnostic. I have to get used to these new surroundings.
Besides the obvious attributes of HANA: in-memory, shared-nothing, MPP, and column-oriented… the aim is to move the application logic next to the data and into HANA.
Any of you who have labored to convert procedural code into set-based SQL will understand the issue here. There are hundreds of thousands or millions of lines of procedural code… often very simple loops… that have to be converted to SQL to make the HANA architecture support the SAP application portfolio.
The surprise is not that there is this outstanding issue.. nor is it the ambitious architecture designed to push the application deep into the database (we are not talking about SQL-based stored procedures… we are talking about the application). The surprise is that the HANA development team has built a state-of-the-art facility that programmatically converts procedural logic into its set-based equivalent (not necessarily into SQL but sometimes into a language that can execute in-parallel). This is not a tool requiring manual intervention… it is an automatic, mathematically provable, transformation.
Right now the technique is used to covert logic in stored-procedures and in ABAP. But I hope to see it applied in the optimizer to convert those ugly Oracle cursor loops on-the-fly.
By the way… SAP will continue to support ABAP using the database as a file server… moving all of the data from the database server to the application server for processing. But you can imagine that… when running applications that use this powerful capability… over time HANA will emerge with a huge performance advantage over other databases…
I found myself wondering where did the rule-of-thumb for Exalytics that suggests that TimesTen can use 800GB of a 1TB memory space… and requires 400GB of that space for work tables leaving room for 400GB of user data… come from (it is quoted everywhere… here is an example… see question #13).
Sure enough, this rule has been around for a while in the TimesTen literature… in fact it predates Exalytics (see here).
Why is this important? The workspace per query for a TPC-A transaction is very small and the amount of time the memory is held by a TPC-A transaction is very short. But the workspace required by a TPC-H query is at least 10X the space required by a TPC-A query and the duration of a TPC-H query is at least 10X the duration of a TPC-A query. The result is at least 100X more pressure on memory utilization.
So… I suspect that the 600GB of user data I calculated here may be off by more than a little. Maybe Exalytics can support 300GB of user data or 100GB of user data or maybe 60GB?
Note that this is not bad… all of this pressure on memory is still moved to Exalytics from the ExadataRAC subsystem… where memory is dear.
As a side note… it is always important to remember that the pressure on memory is the amount of memory utilized times the duration of the utilization. This is why the data flow architecture used in modern databases like Greenplum are effective. Greenplum uses more memory per transaction but it holds the memory for less time by never (almost) writing it to disk. This is different from older database architectures like Teradata and Oracle which use disk to store intermediate results… lowering the overall amount of memory required but increasing the duration of the query. More on this here…