More on Big Data… and on Big Data Analytics… and on a definition of a Big Data Store…

After a little more thinking I’m not sure that Big Data is a new thing… rather it is a trend that has “crossed the chasm” and moved into the mainstream. Call Detail records are Big Data and they are hardly new. In the note below I will suggest that, contrary to the long-standing Teradata creed, Big Data is not Enterprise Data Warehouse (EDW) data. It belongs in a new class of warehouse to be defined…

The phrase “Big Data” refers to a class of data that comes in large volumes and is not usually joined directly with your Enterprise Data Warehouse data… even if it is stored on the same platform. It is very detailed data that must be aggregated and summarized and analyzed to meaningfully fit into an EDW. It may sit adjacent to the EDW in a specialized platform tailored to large-scale data processing problems.

Big Data may be data structured in fields or columns, semi-structured data that is de-normalized and un-parsed, or unstructured data such as text, sound, photographs, or video.

The machinery that drives your enterprise, either software or hardware, is the source of big Data. It is operational data at the lowest level.

Your operations staff may require access to the detail, but at this granular level the data has a short shelf life… so it is often a requirement to provide near-real-time access to Big Data.

Because of the volume and low granularity of the data the business usually needs to use it in a summarized form. These summaries can be aggregates or they can be the result statistical summarization. These statistical summaries are the result of Big Data analytics. This is a key concept.

Before this data can be summarized it has to be collected… which requires the ability to load large volumes of data within business service levels. The Big Data requires data quality control at scale.

You may recognize these characteristics as EDW requirements; but where an EDW requires support for a heterogeneous environment with thousands of data subject areas and thousands and thousands of different queries that cut across the data in an ever-increasing number of paths, a Big Data store supports billions of homogeneous records in a single subject area with a finite number of specialized operations. This is the nature of an operational system.

In fact, a Big Data store is really an Operational Data Store (ODS)… with a twist. In order to evaluate changes over time the ODS must store a deep history of the details. The result is a Big Data Warehouse… or an Operational Big Data Store.

Exalytics vs. HANA: What are they thinking?

I’ve been trying to sort through the noise around Exalytics and see if there are any conclusions to be drawn from the architecture. But this post is more about the noise. The vast majority of the articles I’ve read posted by industry analysts suggest that Exalytics is Oracle‘s answer to SAP‘s HANA. See:

But I do not see it?

Exalytics is a smart cache that holds a redundant copy of aggregated data in memory to offload aggregate queries from your data warehouse or mart. The system is a shared-memory implementation that does not scale out as the size of the aggregates increase. It does scale up by daisy-chaining Exalytics boxes to store more aggregates. It is a read-only system that requires another DBMS as the source of the aggregated data. Exalytics provides a performance boost for Oracle including for Exadata (remember, Exadata performs aggregation in the RAC layer… when RAC is swamped Exalytics can offload some processing).

HANA is a fully functional in-memory shared-nothing columnar DBMS. It does not store a copy of the data.. it stores the data. It can be updated. HANA replaces Oracle… it does not speed it up.

I’ll post more on Exalytics… and on HANA… but there is no Exalytics vs. HANA competition ahead. There will be no Exalytics vs. HANA POCs. They are completely different technologies solving different problems with the only similarity being that they both leverage the decreasing costs of RAM to eliminate the expense of I/O to disk or SSD devices. Don’t let the common phrase “in-memory” confuse you.

The Worst Data Warehouse in the World

So far this blog has focused on issues related to database architecture… so this title might not seem on message. But architecture has implications.

The aim of any BI system is to support the decision-making process of the business. BI infrastructure is clearly a success when your company learns to make fact-based decisions as part of the day-to-day operation of the business. The best data warehouse in the world would be one that provides such effective decision support that the business gains a competitive advantage over the competition.

But I often run into companies where sweet success has turned sour. Why, because in these sour situations the BI eco-system cannot keep up. In these bad cases the best data warehouse in the world becomes the worst.

Usually the problem comes in one of two flavors: either the required decision support is unavailable in time to make a decision, or the eco-system cannot extend to support new business opportunities.

The first case usually shows up during periods when decision-making increases: during seasonal peaks in business. The second appears when the business grows: after a merger or when a new product is introduced. In both cases the cost of the failure is significant.

But these worst cases do not happen out of the blue. They creep up on you. There are symptoms. Often the first symptom is when the nightly reporting process starts missing its service level targets. That is, the nightly load of the warehouse and the refresh of the indexes, materialized views, the summary tables, the cubes, and the marts; and then the running of reports cannot complete in the batch window. This is followed by slow response in your online query processing as the nightly process creeps into the day. Then, the business asks for more users and/or for more data to be added and the problem grows… until decision-making is delayed or unsupported altogether.

Sadly, this problem is avoidable and the solution is well understood. All that is required is a scalable foundation that can extend through the addition of relatively inexpensive hardware. If you could easily add storage and compute then as the constraints hit you can scale up.

A shared nothing architecture scales. We have examples at Greenplum of production systems that scale from hundreds of gigabytes to thousands of terabytes… and other shared nothing vendors: Teradata and Netezza at least, can boast the same. When our customers run out of gas we add hardware. And the architecture scales bigger still… shared nothing is the foundation for all web scale data base technology… scaling to hundreds of petabytes.

So why do companies build, and continue to build, on shared memory systems with built-in limits? Because… they continually underestimate the growth in data… the failure is a failure of vision (consider the name “Teradata”… selected when a terabyte was considered nearly unreachable). Data does not just grow, it explodes in leaps and bounds as technology advances.

But let’s be real… Why do companies really select limiting infrastructure? Because they mistakenly believe that they can build BI infrastructure on technology designed for OLTP… and they already have DBAs trained on this technology who heavily influence the decision. Or, they have an enterprise license for the OLTP database and they want to save some money.

I imagine that I’ve made my point. The worst data warehouse in the world is a warehouse that constrains your business… one that cannot scale as the demand for data and decision support grows… one that costs you hundreds of thousands of dollars in staff time with every change… one that is tuned to the breaking point, rather than robust.

Why would anyone ever put their business at risk like this?

Stop Tuning and Scan…

After years of tuning data warehouses, queries, data loads, and BI applications, I give up. In the long run it is not really possible anyway… and better still… no longer necessary. A better approach is to build your database and your hardware infrastructure to scan fast and smart. So here’s a blog on why it’s impossible to tune a warehouse… and on why it’s no longer necessary.

My argument against tuning is easy to grasp. By definition a data warehouse serves many constituencies: Marketing and Finance and Customer Support and Distribution; and these business units will each access the data from their unique perspective following a unique path through the warehouse. A designer cannot lay out the data effectively to support each access path… cannot index every column, cannot map more than one zone, cannot replicate the data again and again with aggregates and materialized views, cannot cache the entire warehouse. Even if you get it right changing business requirements will fracture your approach; or worse, the design will not support new queries and constrain your business.

Many readers will be skeptical at this point… suggesting that the software and hardware to eliminate tuning does not exist. So let’s build a model and test the state of the art.

Let us imagine and model a 25TB data warehouse with a 20TB fact table that holds 25 months of daily facts partitioned by day. The fact table is 100 columns wide and we will model two queries that reference 20 of the columns… One that touches every row and one that is date constrained and touches only 14 days of data.

Here are some hardware specs. A server with a single I/O controller can read about 1.5GB/second into the database. With two controllers can read around 2.7GB/sec. Note that these are not the theoretical limits of the hardware but real measurements taken from the current hardware on the market: Dell, HP, and SUN/Oracle.

Now let’s deploy our imaginary warehouse on a strong state of the market multi-core server with, to be conservative, a single controller. This server would scan our fact table in around 222 minutes. Partition elimination would allow the date constrained query to complete in just over 4 minutes. Note that these imaginary queries ignore the effort to join and/or aggregate data. Later I’ll have more to say on this…

If we deploy our warehouse on a shared-nothing cluster with 20 nodes the aggregate I/O bandwidth increases to 30GB/sec and the execution times for our two queries improves to 11 minutes and 12 seconds, respectively. This is the power of parallel I/O.

Now we have to factor in compression. Typical row-based compression yields approximately a 2.5x result… columnar compression varies wildly… But let’s assume 25X in our model. There is a cost to be paid to decompress the data… But since it is paid by everyone and CPU is a relatively inexpensive commodity, we’ll ignore it in our model.

For 2.5X row-based compression our big query now completes in 4.4 minutes and the smaller query completes in 4.8 seconds.

The model is a little more complicated when we throw in columnar compression so let’s consider two columnar models. For an implementation such as Exadata we get the benefit of columnar compression but not the benefit of columnar projection. 25X hybrid columnar compression will execute our two scans in 26 seconds and .5 seconds. Now we are talking! A more complete columnar implementation will only touch the columns required by our query, 20% of the data, providing another 5X improvement. This drops our scan queries to 5.2 seconds and .1 second, respectively. Smoking fast. Note that the more simple columnar compression approach will provide the same fast response when every column is touched and the more complex approach will slow down in that case… so you can make the trade off in your shop as required.

Let me remind you again… This is a full scan of 20TB with no tricks: no indices, no pre-aggregation, no materialized views, no cache and no flash, no pre-sorted zone maps. All that is required is a parallel implementation with partitioning, compression, and a columnar table type… and this implementation works. It is robust.

A note on joins… It is more difficult to model joins… and I’ll attempt a simple model in another post. But you can see that this fast scan approach has solved the costly part of the problem using parallel processing… and you can imagine that a shared-nothing massively parallel approach to joins may hold the key.