Co-processing and Exadata

In my first blog (here) I discussed the implications of using co-processors to offload CPU. The point was that with multi-core processors it made more sense to add generalized processing hardware that could be applied to all parts of the query process than to add specialized processors that dealt with only part of the problem.

Kevin Closson has produced two videos that critically evaluate the architecture of Exadata and I strongly suggest that you view them here before you go on with this post… They are enlightening, irreverent, and make the long post I’ve been drafting on Exadata lightweight and unnecessary.

If you have watched Kevin’s videos you understand that Exadata is asymmetric and unbalanced. But his argument extends and generalizes my discussion of co-processing in a nice way. Co-processing is asymmetric by definition… the co-processor sits idle after it has executed its part of the problem.

In fact, Oracle has approximately mirrored the Netezza architecture with Exadata, but used commercial processors instead of FPGAs to offload I/O and predicate processing. The result is the same in both cases… underutilized processing capability. The difference is that Netezza wastes some power on relatively inexpensive FPGA processors while Exadata wastes general-purpose, expensive CPU resources that might actually be applied usefully elsewhere. And Netezza splits the processing within a shared-nothing architecture, while Exadata mixes architectures, adding to the inefficiency.

More on Exalytics: How much user data fits?

Sorry… this is a little geeky…

The news and blogs on Exalytics tend to say that Exalytics is an in-memory implementation with 1TB of memory. They then mention, often in the same breath, that the TimesTen product, which is the foundation for Exalytics, now supports Hybrid Columnar Compression, which might compress your data 5X or more. This leads the reader to conclude that an Exalytics server can support 5TB of user data. This is not the case.

If you read the documentation (here is a summary…) a 1TB Exalytics server can allocate 800GB to TimesTen, of which half may be used to store user data. The remainder is work space… so 400GB uncompressed is available for user data. You might now conclude that with 5X compression there is 2TB of compressed user data supported. But I am not so sure…

In Exadata, Hybrid Columnar Compression is a feature of the Storage Servers. It is unknown to the RAC layer. The compression allows the Storage Servers to retrieve 5X the data with each read, significantly improving the I/O performance of the subsystem. But the data has to be decompressed when it is shipped to the RAC layer.

I expect that the same architecture is implemented in TimesTen… The data is stored in-memory compressed… but decompressed when it moves to the work storage. What does this mean?

If, in a TimesTen implementation without Hybrid Columnar Compression, 400GB of work space in memory is required to support a “normal” query workload against 400GB of user data then we can extrapolate the benefits of 5X compressed data as follows:

  • x of user data compressed uses x/5 of memory plus x of work space in memory… all of which must fit into 800GB

This resolves to x = 667GB… a nice boost for sure… with some CPU penalty for decompressing.
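To make the arithmetic explicit, here is a minimal sketch of the model above; the 800GB ceiling, the 5X compression ratio, and the assumption that work space equals the uncompressed size of the data are all taken from the discussion, not from any Oracle documentation.

```python
# Sketch of the Exalytics memory model discussed above (assumptions:
# 800GB usable by TimesTen, 5X compression, work space equal to the
# uncompressed user data touched by the workload).
TIMESTEN_GB = 800      # memory TimesTen can allocate on a 1TB Exalytics box
COMPRESSION = 5.0      # assumed Hybrid Columnar Compression ratio

# x/5 (compressed store) + x (uncompressed work space) <= 800GB
# => x * (1/5 + 1) = 800  => x = 800 / 1.2
user_data_gb = TIMESTEN_GB / (1.0 / COMPRESSION + 1.0)
print(f"user data supported: {user_data_gb:.0f}GB")   # ~667GB
```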

So do not jump to the conclusion that Hybrid Columnar Compression in TimesTen of 5X allows you to put 5TB of user data on a 1TB Exalytics box… or even that it allows you to load 2TB into the 400GB user memory… the real number may be under 1TB.

Exalytics vs. HANA: What are they thinking?

I’ve been trying to sort through the noise around Exalytics and see if there are any conclusions to be drawn from the architecture. But this post is more about the noise. The vast majority of the articles I’ve read posted by industry analysts suggest that Exalytics is Oracle’s answer to SAP’s HANA.

But I do not see it.

Exalytics is a smart cache that holds a redundant copy of aggregated data in memory to offload aggregate queries from your data warehouse or mart. The system is a shared-memory implementation that does not scale out as the size of the aggregates increases. It does scale up by daisy-chaining Exalytics boxes to store more aggregates. It is a read-only system that requires another DBMS as the source of the aggregated data. Exalytics provides a performance boost for Oracle, including for Exadata (remember, Exadata performs aggregation in the RAC layer… when RAC is swamped Exalytics can offload some processing).

HANA is a fully functional in-memory shared-nothing columnar DBMS. It does not store a copy of the data… it stores the data. It can be updated. HANA replaces Oracle… it does not speed it up.

I’ll post more on Exalytics… and on HANA… but there is no Exalytics vs. HANA competition ahead. There will be no Exalytics vs. HANA POCs. They are completely different technologies solving different problems with the only similarity being that they both leverage the decreasing costs of RAM to eliminate the expense of I/O to disk or SSD devices. Don’t let the common phrase “in-memory” confuse you.

The Best Data Warehouse Spin of 2011

At this time of the year bloggers everywhere look back and reflect. Some use the timing to highlight significant achievements… and it is in that spirit that I would like to announce my choice for the best marketing in the data warehouse vendor space for 2011.

Marketing is a difficult task. Marketeers need to walk a line between reality and bull-pucky. They need to appeal to real and apparent needs yet differentiate. Often they need to generate spin to fuzz a good story told by a competitor’s marketing, or to de-emphasize some shortcoming in their own product line.

Below is a picture taken on the floor of a prospect where we engaged in a competitive proof-of-concept. The customer requested that vendors ship a single rack configuration… and so we did.

But the marketing coup is that the vendor on the right, Teradata, told the customer that this is a single-rack configuration and that they are in compliance. The customer has asked us if this is reasonable.

This creative marketing spin wins the 2011 award going away… against very tough competition.

I expect this marketing approach to start a trend in the space. Soon we will see warehouse appliance vendors claiming that 1TB = 50TB due to compression… or was that already done this year?

Sorry to be cynical… but I hope that the picture and story provide you with a giggle… and that the giggle helps you to start a happy holiday season.

– Rob Klopp

Predictive Analytics, Event Processing, and Rules… Oh, and in-memory databases…

There is a lot of talk these days about predictive analytics, big data, real-time analytics, dashboards, and active data warehousing. These topics are related in a fairly straightforward way. Further, there are new claims about in-memory database processing that blend these issues into a promise of real-time predictive analytics. Let’s tease the topic apart…

Predictive analytics is really composed of two parts: modeling and scoring.

Modeling requires big data to discover which information from the enterprise predicts some interesting event. Big data is both broad and deep: broad because no one knows in advance which data elements will be predictive… “which elements” has to be discovered in the modeling exercise… and deep because it takes history to detect a trend.

The model that results represents a rule: if the rule’s conditions are true then we can predict the outcome within some statistical boundaries. For example: “if a payment has not been received within 120 days and the customer’s account balance has dropped over 40% in the last year then there is an 83.469271346% chance that the customer will default on the payment.” Note that you can create a rule without predictive modeling and without a statistical boundary… for example: “if payment has not been received within 120 days then the customer will likely default”. We have been creating these heuristic rules since time began… and the point of predictive analytics is to discover rules more accurately than by heuristics. You might say then that a model defines, or predicts with some certainty, an event of interest. The definition may be described as a set of rules.
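As a toy illustration of a rule expressed in code, here is a hypothetical scoring function; the field names, thresholds, and the 83.469271346% figure come from the made-up example above, not from any real model.

```python
# Toy illustration only: a predictive rule expressed as a scoring function.
# The thresholds and probability are the hypothetical example from the text.
def score_default_risk(days_since_payment: int, balance_change_pct: float) -> float:
    """Return the predicted probability that the customer defaults."""
    if days_since_payment > 120 and balance_change_pct < -40.0:
        return 0.83469271346   # statistical boundary discovered by the model
    return 0.0                 # the rule did not fire; no prediction

print(score_default_risk(150, -45.0))   # 0.83469271346
```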

Scoring requires only the elements discovered by the modeling exercise… and may or may not require big, deep data. Big data is required if any of the discovered elements represents a trend. For example, if to predict a stock price there are elements that represent the average price over the last 30 days, 90 days, 180 days, etc., and an element that calculates the difference between the 90-day number and the 30-day number to show the trend, then either the data has to be aggregated on-the-fly from the detail… or it must be pre-aggregated. This distinction is important… the result of a modeling exercise may require the creation of some new aggregated data. Note that we are suggesting that a score depicts an interesting event.
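Here is a minimal sketch of that on-the-fly aggregation, using a made-up series of daily prices; a real scoring pipeline would pull this detail from the warehouse or read a pre-aggregated store instead.

```python
# Sketch: derive trend elements for scoring from detail data on the fly.
# The daily closing prices below are made up purely for illustration.
daily_close = [100.0 + 0.1 * day for day in range(180)]   # 180 days of detail

def trailing_average(prices, days):
    """Average over the most recent `days` observations."""
    window = prices[-days:]
    return sum(window) / len(window)

avg_30 = trailing_average(daily_close, 30)
avg_90 = trailing_average(daily_close, 90)
trend = avg_90 - avg_30   # the difference the model uses to show the trend
print(avg_30, avg_90, trend)
```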

Real-time analytics, or more fairly, near-real-time analytics, requires these rules to be checked ASAP after new information is available.

A dashboard can provide one of two features: either the dashboard applies the rules and presents alerts when some rule triggers… or the dashboard may present raw data for evaluation by a human. For example, the speedometer on your car dashboard presents raw data and it is up to you to apply rules based on the input. Note that the speedometer on your car provides a real-time display. Sometimes BI dashboards use real-time displays like a speedometer to display static data. For example, I have seen daily metrics displayed using a speedometer widget… but since the widget updates just once a day the display is metaphorical rather than real-time.

Active data warehousing implies some sort of rule-based activity. The activity may be triggered in near-real-time or as a batch process.

But in any consideration of real-time processing there is an issue if the rules cross data input boundaries. By this I mean… it is simpler to build a speedometer that reads from one input, the rotation of the axle, than to build a meter that incorporates multiple inputs… for example the meter that displays how many miles (or kilometers) you have left before you run out of petrol. But to provide this in real-time your car needs real-time access to two inputs… and an embedded processor is required to integrate the data and derive the display. Near-real-time displays of data warehouse data have the same constraints… if the display requires data from more than one source, that data has to be acquired and integrated and calculated/scored in near-real-time. This is a daunting problem if the sources cross application system boundaries.

In the book “In-memory Data Management” Plattner and Zeier promise near-real-time analytics from an in-memory DBMS, HANA. But there is no discussion of how this really works for a data warehouse that integrates data across source systems. Near-real-time predictive modeling requires broad data that will cross these system boundaries. It may be possible to develop a system where near-real-time data acquisition and integration occur… and rules may be applied immediately to identify interesting events. But this sort of data acquisition is very advanced… and an in-memory database does not inherently solve the problem… every application in the enterprise will not be in the same memory space.

I do not see it. SAP may live in a single memory space… and maybe every possible application of that data can live there as well. But as long as there is relevant data outside of that space, data integration is required and the argument for real-time weakens.

Teradata’s Excellent Memory Management… and Poor Memory Utilization

I was recently surprised to hear from a prospect that Teradata’s memory management was considered a differentiator over Greenplum. This is not because of bad information… Teradata probably does have better memory management… but they have better memory management because they use memory less efficiently than Greenplum. Let me explain…

First let’s be clear… the utilization of memory is not measured by the amount of memory required… but by the amount required times the amount of time it is required. Think about it… if you have a query that requires 16MB of memory and holds it for 1 second… and another query that requires 4MB of memory and holds it for 4 seconds… the effective memory utilization is the same.
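Expressed in MB-seconds, the example above works out like this:

```python
# Memory utilization as (MB held) x (seconds held), per the example above.
query_a = 16 * 1   # 16MB held for 1 second  -> 16 MB-seconds
query_b = 4 * 4    # 4MB held for 4 seconds  -> 16 MB-seconds
print(query_a == query_b)   # True: the effective utilization is the same
```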

Greenplum uses pipelining to flow data from step to step in the query plan. Teradata writes the results from each step in the query plan to disk, to a spool file. This architectural difference allows Greenplum to complete any single query in a small fraction of the time that Teradata requires… the costs Teradata pays to write results after each step and to read those results back in for the subsequent steps add up. The result is that Teradata uses a smaller memory footprint for each query… but holds the memory significantly longer… resulting in relatively poor memory utilization.
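As a loose analogy only, and not either vendor’s actual executor, the difference looks something like this: a pipelined plan streams rows through each step and releases them quickly, while a spooled plan materializes the whole intermediate result before the next step reads it, holding space far longer.

```python
# Loose analogy only -- not Greenplum's or Teradata's actual executor.
rows = range(1_000_000)

# Pipelined: each row flows through filter and projection as it is produced;
# very little is held, and only briefly.
pipelined_total = sum(x * 2 for x in rows if x % 3 == 0)

# Spooled: the full intermediate result is materialized ("spooled") before
# the next step reads it, so the space is held for the whole step.
spool = [x * 2 for x in rows if x % 3 == 0]
spooled_total = sum(spool)

assert pipelined_total == spooled_total   # same answer, different footprint
```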

Note that this is not really bad design by Teradata… it is just old design. Once upon a time the servers Teradata ran on had only a little memory… as little as 32MB… so they had to spool data to disk to make it all fit. Greenplum is designed for modern processors with 100X more memory… and we use that memory effectively to get queries in and out as fast as possible.

By the way, as a side effect Greenplum does not require management of spool space… so this sysadmin task is eliminated.

So… Teradata does tightly manage memory… but this is not an advantage… they manage it tightly because they have to.

The Worst Data Warehouse in the World

So far this blog has focused on issues related to database architecture… so this title might not seem on message. But architecture has implications.

The aim of any BI system is to support the decision-making process of the business. BI infrastructure is clearly a success when your company learns to make fact-based decisions as part of the day-to-day operation of the business. The best data warehouse in the world would be one that provides such effective decision support that the business gains an advantage over the competition.

But I often run into companies where sweet success has turned sour. Why? Because in these sour situations the BI eco-system cannot keep up. In these bad cases the best data warehouse in the world becomes the worst.

Usually the problem comes in one of two flavors: either the required decision support is unavailable in time to make a decision, or the eco-system cannot extend to support new business opportunities.

The first case usually shows up during periods when decision-making increases: during seasonal peaks in business. The second appears when the business grows: after a merger or when a new product is introduced. In both cases the cost of the failure is significant.

But these worst cases do not happen out of the blue. They creep up on you. There are symptoms. Often the first symptom is when the nightly reporting process starts missing its service level targets. That is, the nightly load of the warehouse, the refresh of the indexes, the materialized views, the summary tables, the cubes, and the marts, and then the running of reports cannot complete in the batch window. This is followed by slow response in your online query processing as the nightly process creeps into the day. Then the business asks for more users and/or for more data to be added and the problem grows… until decision-making is delayed or unsupported altogether.

Sadly, this problem is avoidable and the solution is well understood. All that is required is a scalable foundation that can extend through the addition of relatively inexpensive hardware. If you can easily add storage and compute, then as the constraints hit you can scale.

A shared-nothing architecture scales. We have examples at Greenplum of production systems that scale from hundreds of gigabytes to thousands of terabytes… and other shared-nothing vendors, Teradata and Netezza at least, can boast the same. When our customers run out of gas we add hardware. And the architecture scales bigger still… shared-nothing is the foundation for all web-scale database technology… scaling to hundreds of petabytes.

So why do companies build, and continue to build, on shared memory systems with built-in limits? Because… they continually underestimate the growth in data… the failure is a failure of vision (consider the name “Teradata”… selected when a terabyte was considered nearly unreachable). Data does not just grow, it explodes in leaps and bounds as technology advances.

But let’s be real… Why do companies really select limiting infrastructure? Because they mistakenly believe that they can build BI infrastructure on technology designed for OLTP… and they already have DBAs trained on this technology who heavily influence the decision. Or, they have an enterprise license for the OLTP database and they want to save some money.

I imagine that I’ve made my point. The worst data warehouse in the world is a warehouse that constrains your business… one that cannot scale as the demand for data and decision support grows… one that costs you hundreds of thousands of dollars in staff time with every change… one that is tuned to the breaking point, rather than robust.

Why would anyone ever put their business at risk like this?

Greenplum and Teradata: Similar Architecture, Different Strategies

Hardware systems, servers and network fabric, provide the foundation upon which all shared-nothing database management systems rest. Hardware systems are a major contributor to the overall price/performance and total cost of ownership for a data warehouse platform. This blog considers the hardware strategies of EMC/Greenplum: applying the idea of using common, off-the-shelf (COTS) components to build a competitive foundation; and Teradata: developing a proprietary hardware system by tightly integrating components.

Strategies

The Teradata hardware strategy is simple to describe. They expend R&D dollars to couple low-level technologies into a tightly integrated system. Their servers are custom-designed within a set of guidelines that allows both the Linux and Microsoft Windows operating systems to run on them. Their network fabric is highly proprietary, using cycles within the fabric to offload sort/merge data processing from the server CPU.

In other words, Teradata believes that the time and effort required to engineer an integrated proprietary offering will improve the performance of their offering enough to offset the cost.

EMC and Greenplum have taken a different approach. They have elected a strategy that leverages off-the-shelf servers offered by hardware vendors like Dell or HP, and network switches from vendors like Brocade, Arista, and Cisco. They have elected to expend few dollars on hardware design and development and to leverage the R&D investments made by these other vendors. In other words, Greenplum believes that the advantages in price and performance provided by off-the-shelf hardware provide a sustainable advantage.

Price

The lower costs associated with Greenplum’s strategy clearly provide an advantage. Greenplum does not have to pay to design and manufacture custom hardware. The manufacturing costs may not be significant, but the staff costs required by the Teradata strategy must affect the price. Clearly the Greenplum strategy provides an advantage on the price side.

Performance

The Teradata strategy has to be about performance… so let’s speculate:

  • How much of a performance increase might their integration provide on the server-side?
  • How much of a performance increase might their integration provide on the network side?

In the days before there was a microprocessor-based enterprise server market, Teradata could gain substantially here. Microprocessors were built for personal computing and not designed for the high-availability and high-performance requirements of an enterprise. Teradata had much to gain from building rather than buying servers.

But today there is little to gain from a highly customized design. The requirement to run standard Linux and Windows operating systems limits their ability to innovate, and the resulting servers have to be very similar to off-the-shelf enterprise servers. There is little or no performance advantage here.

On the network side, there once was a distinct advantage to Teradata’s BYNET. It was both faster than available off-the-shelf switches and it offloaded cycles from the under-powered CPU. Today, however, there are plenty of cheap, fast switches… so the speed advantage has disappeared. Worse still, the introduction of multi-core CPUs has eliminated the advantage of the in-the-switch sort/merge that makes BYNET unique. CPU is inexpensive these days.

The bottom line: it is unclear if the Teradata hardware strategy affords them a performance advantage.

Cost of Ownership

An argument could be made that supporting COTS hardware is inherently less expensive than supporting a Teradata cluster. But there is a more substantial savings that is clear.

Every 2-3 years, as newer Teradata technology obsoletes your currently installed cluster, the value of the current hardware goes to zero and the cost of ownership goes up significantly. The costs are especially high when you are required to add several nodes to accommodate growth just as Teradata refreshes their technology. You may have to buy servers that are already obsolete.

With Greenplum, your current cluster is built from general-purpose servers that are re-purposed with ease. In fact, since the nodes in a Greenplum cluster are usually high-end servers, customers often cycle new technology into their data warehouse and cycle the old servers out into their server farm. The result is a higher performance warehouse and full use of all of the server technology.

A Final Word

The words “proprietary hardware” are sometimes thrown around as an insult. But Teradata’s proprietary approach is based on the belief that a tightly integrated configuration adds enough benefit to offset the costs. Greenplum believes that today the enterprise server and network switch vendors have matured their products to the point where off-the-shelf technology can match or exceed the performance of custom hardware… at a significantly reduced cost. You may have an opinion or you may wait to see how the benchmarks, the proofs-of-concept, and the market decide… but it’s interesting to understand the differing approaches.

Stop Tuning and Scan…

After years of tuning data warehouses, queries, data loads, and BI applications, I give up. In the long run it is not really possible anyway… and better still… no longer necessary. A better approach is to build your database and your hardware infrastructure to scan fast and smart. So here’s a blog on why it’s impossible to tune a warehouse… and on why it’s no longer necessary.

My argument against tuning is easy to grasp. By definition a data warehouse serves many constituencies: Marketing and Finance and Customer Support and Distribution; and these business units will each access the data from their unique perspective, following a unique path through the warehouse. A designer cannot lay out the data effectively to support each access path… cannot index every column, cannot map more than one zone, cannot replicate the data again and again with aggregates and materialized views, cannot cache the entire warehouse. Even if you get it right, changing business requirements will fracture your approach; or worse, the design will not support new queries and will constrain your business.

Many readers will be skeptical at this point… suggesting that the software and hardware to eliminate tuning does not exist. So let’s build a model and test the state of the art.

Let us imagine and model a 25TB data warehouse with a 20TB fact table that holds 25 months of daily facts partitioned by day. The fact table is 100 columns wide and we will model two queries that reference 20 of the columns… one that touches every row and one that is date-constrained and touches only 14 days of data.

Here are some hardware specs. A server with a single I/O controller can read about 1.5GB/second into the database. With two controllers it can read around 2.7GB/sec. Note that these are not the theoretical limits of the hardware but real measurements taken from the current hardware on the market: Dell, HP, and SUN/Oracle.

Now let’s deploy our imaginary warehouse on a strong, state-of-the-market multi-core server with, to be conservative, a single controller. This server would scan our fact table in around 222 minutes. Partition elimination would allow the date-constrained query to complete in just over 4 minutes. Note that these imaginary queries ignore the effort to join and/or aggregate data. Later I’ll have more to say on this…

If we deploy our warehouse on a shared-nothing cluster with 20 nodes the aggregate I/O bandwidth increases to 30GB/sec and the execution times for our two queries improve to 11 minutes and 12 seconds, respectively. This is the power of parallel I/O.

Now we have to factor in compression. Typical row-based compression yields approximately a 2.5X result… columnar compression varies wildly… but let’s assume 25X in our model. There is a cost to be paid to decompress the data… but since it is paid by everyone and CPU is a relatively inexpensive commodity, we’ll ignore it in our model.

For 2.5X row-based compression our big query now completes in 4.4 minutes and the smaller query completes in 4.8 seconds.

The model is a little more complicated when we throw in columnar compression, so let’s consider two columnar models. For an implementation such as Exadata we get the benefit of columnar compression but not the benefit of columnar projection. 25X hybrid columnar compression will execute our two scans in 26 seconds and 0.5 seconds. Now we are talking! A more complete columnar implementation will only touch the columns required by our query, 20% of the data, providing another 5X improvement. This drops our scan queries to 5.2 seconds and 0.1 seconds, respectively. Smoking fast. Note that the simpler columnar compression approach will provide the same fast response when every column is touched, while the more complex approach will slow down in that case… so you can make the trade-off in your shop as required.
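Here is the whole model in one sketch; the bandwidths, node count, compression ratios, and partition counts are the assumptions stated above (decimal terabytes), and the outputs line up with the figures in the text to within rounding.

```python
# The scan model above in one place. All inputs are assumptions from the
# text: a 20TB fact table, 1.5GB/sec per I/O controller, a 20-node cluster,
# and a date-constrained query touching 14 of roughly 760 daily partitions.
FACT_GB = 20 * 1000                  # 20TB fact table, decimal terabytes
SINGLE_NODE_GB_SEC = 1.5             # one I/O controller
CLUSTER_GB_SEC = 1.5 * 20            # 20 nodes -> 30GB/sec aggregate
FRACTION_TOUCHED = 14 / (25 * 30.4)  # partition elimination: 14 of ~760 days

def scan_seconds(gb, bandwidth, compression=1.0, projection=1.0):
    """Seconds to scan `gb` gigabytes once compression and (for a columnar
    store) projection reduce the bytes actually read."""
    return gb * projection / compression / bandwidth

print(scan_seconds(FACT_GB, SINGLE_NODE_GB_SEC) / 60)   # ~222 minutes, 1 node

for label, comp, proj in [("uncompressed", 1.0, 1.0),
                          ("2.5X row compression", 2.5, 1.0),
                          ("25X columnar compression", 25.0, 1.0),
                          ("25X columnar + 20% projection", 25.0, 0.2)]:
    full = scan_seconds(FACT_GB, CLUSTER_GB_SEC, comp, proj)
    print(f"{label}: full scan {full:.1f}s, "
          f"14-day scan {full * FRACTION_TOUCHED:.1f}s")
```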

Let me remind you again… This is a full scan of 20TB with no tricks: no indices, no pre-aggregation, no materialized views, no cache and no flash, no pre-sorted zone maps. All that is required is a parallel implementation with partitioning, compression, and a columnar table type… and this implementation works. It is robust.

A note on joins… It is more difficult to model joins… and I’ll attempt a simple model in another post. But you can see that this fast scan approach has solved the costly part of the problem using parallel processing… and you can imagine that a shared-nothing massively parallel approach to joins may hold the key.

Co-processors and database machines: Intel, Netezza, and Teradata

Intel recently announced a change in direction regarding the use of special hardware co-processing to support RAID. This change is relevant to vendors of data warehouse appliances like Netezza, and to a lesser extent Teradata, who both use co-processors to offload processing. A report on the Intel announcement is here: http://www.theregister.co.uk/2010/09/15/sas_in_patsburg/. In short, Intel says that because multi-cores provide so much compute for so little cost it just does not make sense to build a special co-processor to support the RAID function.

What, you may ask, does this have to do with Netezza and Teradata? The answer is that both companies have built database machines that use co-processors to one degree or another. And Intel’s conclusion that multi-cores obsolete co-processors significantly affects their value proposition to the market.

In the case of Netezza the significance is considerable. Netezza conceived the FPGA-based architecture when compute was expensive relative to I/O. An FPGA provided a way to add compute power at a low cost point. But according to Intel this changed as multi-core technology matured. Further, as the Intel processor line adds even more cores, the value of an FPGA co-processor becomes smaller still.

In the case of Teradata the issue is more manageable. The original Teradata Y-Net and their current BYNET interconnect offload the merge process from the CPU to the interconnect. Back when Teradata ran on 8086 cores from Intel this was significant and it formed a cornerstone of Teradata’s value proposition. Today the value of BYNET is diminished and one could imagine Teradata dropping it altogether sometime in the future in favor of a less proprietary network fabric.

Please do not think that I am suggesting that multi-cores make either Teradata or Netezza obsolete. They are both formidable technologies. But multi-cores will reduce the price/performance advantage that co-processing once afforded in favor of database servers that do not use co-processing like Exadata, Greenplum, and DB2.
