A Big Data Sound Bite…

Here is a sound bite on Big Data I composed for another source…

Big Data is relative. For some firms Big Data will be measured in petabytes and for others in hundreds of gigabytes. The point is that very detailed data provides the vital statistics that quantify the health of your business.

To store and access Big Data you need to build on a scalable platform that can grow. To process Big Data you need a fully scalable parallel computing environment.

With the necessary infrastructure in place the challenge becomes: how do you gauge your business and how do you change the decision-making processes to use the gauges?

More on Big Data… and on Big Data Analytics… and on a definition of a Big Data Store…

After a little more thinking I’m not sure that Big Data is a new thing… rather it is a trend that has “crossed the chasm” and moved into the mainstream. Call detail records are Big Data and they are hardly new. In the note below I will suggest that, contrary to the long-standing Teradata creed, Big Data is not Enterprise Data Warehouse (EDW) data. It belongs in a new class of warehouse to be defined…

The phrase “Big Data” refers to a class of data that comes in large volumes and is not usually joined directly with your Enterprise Data Warehouse data… even if it is stored on the same platform. It is very detailed data that must be aggregated and summarized and analyzed to meaningfully fit into an EDW. It may sit adjacent to the EDW in a specialized platform tailored to large-scale data processing problems.

Big Data may be data structured in fields or columns, semi-structured data that is de-normalized and un-parsed, or unstructured data such as text, sound, photographs, or video.

The machinery that drives your enterprise, whether software or hardware, is the source of Big Data. It is operational data at the lowest level.

Your operations staff may require access to the detail, but at this granular level the data has a short shelf life… so it is often a requirement to provide near-real-time access to Big Data.

Because of the volume and fine granularity of the data the business usually needs to use it in a summarized form. These summaries can be aggregates or they can be the result of statistical summarization. These statistical summaries are the result of Big Data analytics. This is a key concept.
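
As a minimal sketch of the idea, assuming a hypothetical call-detail-style record of (day, call duration) that I’ve invented for illustration, the roll-up from detail to a daily statistical summary might look something like this:

    from collections import defaultdict
    from statistics import mean, pstdev

    # Hypothetical call-detail-style records: (day, duration_seconds).
    detail = [
        ("2011-12-01", 42), ("2011-12-01", 310), ("2011-12-01", 95),
        ("2011-12-02", 18), ("2011-12-02", 240),
    ]

    # Roll the detail up into one row per day: simple aggregates
    # (count, total) plus a statistical summary (mean, standard deviation).
    by_day = defaultdict(list)
    for day, duration in detail:
        by_day[day].append(duration)

    for day in sorted(by_day):
        durations = by_day[day]
        print(day, len(durations), sum(durations),
              round(mean(durations), 1), round(pstdev(durations), 1))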

Before this data can be summarized it has to be collected… which requires the ability to load large volumes of data within business service levels. Big Data also requires data quality control at scale.

You may recognize these characteristics as EDW requirements; but where an EDW requires support for a heterogeneous environment with thousands of data subject areas and thousands and thousands of different queries that cut across the data in an ever-increasing number of paths, a Big Data store supports billions of homogeneous records in a single subject area with a finite number of specialized operations. This is the nature of an operational system.

In fact, a Big Data store is really an Operational Data Store (ODS)… with a twist. In order to evaluate changes over time the ODS must store a deep history of the details. The result is a Big Data Warehouse… or an Operational Big Data Store.

What is Big Data? No kidding this time…

I posted a little joke on this topic here… this time I’ll try to say something a little more substantive…

Big Data is the new, new thing. The phrase is everywhere. A Google search on the exact words “Big Data”, updated in the last year, yields 39,300,000 results. The Wikipedia entry for Big Data suggests that Big Data is related to data volumes that are difficult to process, with specific mention of data volumes that are beyond the ability to process easily with relational technology. Weblog data and sensor data are the examples typically listed.

I am not a fan of the if-it’s-so-big-it’s-difficult-to-handle line of thinking. This definition lets anyone and everyone claim to process Big Data. Even the Wikipedia article suggests that for small enterprises “Big Data” could be under a terabyte.

Nor am I a fan of the anti-relational approach. I have seen Greenplum relational technology solve 7000TB weblog queries on a fraction of the hardware required by Big Data alternatives like Hadoop, in a fraction of the processing time. If relational can handle 7PB+, then by this definition Big Data would have to mean web-scale size… thousands of petabytes that only Google-sized companies can contain. Big Data seems smaller than that.

Maybe the answer lies in focusing on the “new” part? An Enterprise Data Warehouse (EDW) can be smallish or large… but there are new data subject areas in the Big Data examples that may not be appropriate for an EDW. Sensor data might not be usefully joined to more than a few dimensions from the EDW… so maybe it does not make sense to store it in the same infrastructure? The same goes for click-stream and syslog data… and maybe for call detail records and smart meter reads in telcos and utilities?

So Big Data is associated with new subject areas not conventionally stored in an EDW… big enough… and made up of atomic data such that there is little business value in placing it in the EDW. Big Data can stand alone… value derived from it may be added to the EDW. Deriving that value comes from another new buzzword: Big Data Analytics… surely the topic of another note…

Co-processing and Exadata

In my first blog (here) I discussed the implications of using co-processors to offload CPU. The point was that with multi-core processors it made more sense to add generalized processing hardware that could be applied to all parts of the query process than to add specialized processors that dealt with only part of the problem.

Kevin Closson has produced two videos that critically evaluate the architecture of Exadata and I strongly suggest that you view them here before you go on with this post… They are enlightening, irreverent, and make the long post I’ve been drafting on Exadata lightweight and unnecessary.

If you have watched Kevin’s videos you understand that Exadata is asymmetric and unbalanced. His critique also extends and generalizes my discussion of co-processing in a nice way. Co-processing is asymmetric by definition: the co-processor sits idle once it has executed its part of the problem.

In fact, Oracle has approximately mirrored the Netezza architecture with Exadata, but used commercial processors instead of FPGAs to offload I/O and predicate processing. The result is the same in both cases: underutilized processing capability. The difference is that Netezza wastes some power on relatively inexpensive FPGA processors, while Exadata wastes general-purpose and expensive CPU resources that might actually be applied usefully elsewhere. And Netezza splits the processing within a shared-nothing architecture while Exadata mixes architectures, adding to the inefficiency.

More on Exalytics: How much user data fits?

Sorry… this is a little geeky…

The news and blogs on Exalytics tend to say that Exalytics is an in-memory implementation with 1TB of memory. They then mention, often in the same breath, that the TimesTen product, which is the foundation for Exalytics, now supports Hybrid Columnar Compression, which might compress your data 5X or more. This leaves the reader to conclude that an Exalytics server can support 5TB of user data. This is not the case.

If you read the documentation (here is a summary…) a 1TB Exalytics server can allocate 800GB to TimesTen, of which half may be allocated to store user data. The remainder is work space… so 400GB uncompressed is available for user data. You might now conclude that with 5X compression 2TB of compressed user data is supported. But I am not so sure…

In Exadata, Hybrid Columnar Compression is a feature of the Storage Servers. It is unknown to the RAC layer. The compression allows the Storage Servers to retrieve 5X the data with each read, significantly improving the I/O performance of the subsystem. But the data has to be decompressed when it is shipped to the RAC layer.

I expect that the same architecture is implemented in TimesTen… The data is stored in-memory compressed… but decompressed when it moves to the work storage. What does this mean?

If, in a TimesTen implementation without Hybrid Columnar Compression, 400GB of work space in memory is required to support a “normal” query workload against 400GB of user data, then we can extrapolate the benefits of 5X compressed data as follows:

  • x GB of user data, compressed 5X, uses x/5 GB of memory plus x GB of work space in memory… all of which must fit into 800GB
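
Spelling that arithmetic out (a back-of-the-envelope sketch, assuming the 800GB TimesTen allocation and the 5X compression ratio described above):

    # x/5 (compressed user data) plus x (uncompressed work space)
    # must fit into the 800GB TimesTen allocation.
    memory_gb = 800
    compression = 5
    x = memory_gb / (1 + 1 / compression)  # x/5 + x = 800  =>  x = 800 * 5/6
    print(round(x))                        # ~667GB of user data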

This resolves to x = 667GB… a nice boost for sure… with some CPU penalty for decompressing.

So do not jump to the conclusion that 5X Hybrid Columnar Compression in TimesTen allows you to put 5TB of user data on a 1TB Exalytics box… or even that it allows you to load 2TB into the 400GB user memory… the real number may be under 1TB.

Exalytics vs. HANA: What are they thinking?

I’ve been trying to sort through the noise around Exalytics and see if there are any conclusions to be drawn from the architecture. But this post is more about the noise. The vast majority of the articles I’ve read posted by industry analysts suggest that Exalytics is Oracle’s answer to SAP’s HANA.

But I do not see it.

Exalytics is a smart cache that holds a redundant copy of aggregated data in memory to offload aggregate queries from your data warehouse or mart. The system is a shared-memory implementation that does not scale out as the size of the aggregates increases. It can grow by daisy-chaining Exalytics boxes to store more aggregates. It is a read-only system that requires another DBMS as the source of the aggregated data. Exalytics provides a performance boost for Oracle, including for Exadata (remember, Exadata performs aggregation in the RAC layer… when RAC is swamped Exalytics can offload some of that processing).

HANA is a fully functional in-memory shared-nothing columnar DBMS. It does not store a copy of the data… it stores the data. It can be updated. HANA replaces Oracle… it does not speed it up.

I’ll post more on Exalytics… and on HANA… but there is no Exalytics vs. HANA competition ahead. There will be no Exalytics vs. HANA POCs. They are completely different technologies solving different problems with the only similarity being that they both leverage the decreasing costs of RAM to eliminate the expense of I/O to disk or SSD devices. Don’t let the common phrase “in-memory” confuse you.

The Best Data Warehouse Spin of 2011

At this time of the year bloggers everywhere look back and reflect. Some use the timing to highlight significant achievements… and it is in that spirit that I would like to announce my choice for the best marketing in the data warehouse vendor space for 2011.

Marketing is a difficult task. Marketers need to walk a line between reality and bull-pucky. They need to appeal to real and apparent needs yet differentiate. Often they need to generate spin to fuzz a good story told by a competitor’s marketing or to de-emphasize some shortcoming in their own product line.

Below is a picture taken on the floor of a prospect where we were engaged in a competitive proof-of-concept. The customer requested that vendors ship a single-rack configuration… and so we did.

But the marketing coup is that the vendor on the right, Teradata, told the customer that this is a single-rack configuration and that they are in compliance. The customer has asked us if this is reasonable.

This creative marketing spin wins the 2011 award going away… against very tough competition.

I expect this marketing approach to start a trend in the space. Soon we will see warehouse appliance vendors claiming that 1TB = 50TB due to compression… or was that already done this year?

Sorry to be cynical… but I hope that the picture and story provide you with a giggle… and that the giggle helps you to start a happy holiday season.

– Rob Klopp

Is there a new DW database architecture lurking in the wings?

The link here describes a new 2TB SSD technology which should be on the market in 2013.

This prompts me to ask the question: How would we redesign relational databases for data warehousing if random reads become as fast as sequential reads?

I’ll look forward to your comments and post more on this over time…

Predictive Analytics, Event Processing, and Rules… Oh, and in-memory databases…

There is a lot of talk these days about predictive analytics, big data, real-time analytics, dashboards, and active data warehousing. These topics are related in a fairly straightforward way. Further, there are new claims about in-memory database processing that blend these issues into a promise of real-time predictive analytics. Let’s tease the topic apart…

Predictive analytics is really composed of two parts: modeling and scoring.

Modeling requires big data to discover which information from the enterprise predicts some interesting event. Big data is both broad and deep: broad because no one knows which data elements will be predictive… “which elements” has to be discovered in the modeling exercise… and deep because it takes history to detect a trend.

The model that results represents a rule: if the rule’s conditions are true then we can predict the outcome within some statistical boundaries. For example: “if a payment has not been received within 120 days and the customer’s account balance has dropped over 40% in the last year then there is an 83.469271346% chance that the customer will default on the payment”. Note that you can create a rule without predictive modeling and without a statistical boundary… for example: “if payment has not been received within 120 days then the customer will likely default”. We have been creating these heuristic rules since time began… and the point of predictive analytics is to discover rules more accurately than by heuristics. You might say then that a model defines, or predicts with some certainty, an event of interest. The definition may be described as a set of rules.
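
To make the distinction concrete, here is a minimal, hypothetical sketch in Python; the field names and thresholds simply restate the examples above and are not taken from any real model:

    # A heuristic rule and a modeled rule, side by side (illustrative only).
    def heuristic_rule(days_since_payment):
        # No statistical boundary... just the judgment that default is likely.
        return days_since_payment > 120

    def modeled_rule(days_since_payment, balance_change_pct):
        # Conditions discovered by modeling, with a probability attached.
        if days_since_payment > 120 and balance_change_pct <= -0.40:
            return 0.83469271346  # predicted chance of default
        return None

    print(heuristic_rule(130))        # True
    print(modeled_rule(130, -0.45))   # 0.83469271346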

Scoring requires only the elements discovered by the modeling exercise… and may or may not require big, deep data. Big data is required if any of the discovered elements represents a trend. For example, suppose that to predict a stock price there are elements that represent the average price over the last 30 days, 90 days, 180 days, etc., and an element that calculates the difference between the 90-day number and the 30-day number to show the trend; then either the data has to be aggregated on-the-fly from the detail… or it must be pre-aggregated. This distinction is important… the result of a modeling exercise may require the creation of some new aggregated data. Note that we are suggesting that a score depicts an interesting event.
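
A hypothetical sketch of that stock price example (the price series and the scoring threshold are invented for illustration): the 30- and 90-day averages are the aggregated elements, built here on-the-fly from the detail:

    from statistics import mean

    # Daily closing prices, most recent last (illustrative data).
    prices = [10.0 + 0.05 * i for i in range(180)]

    # Aggregated elements discovered by the modeling exercise.
    avg_30 = mean(prices[-30:])
    avg_90 = mean(prices[-90:])
    trend = avg_30 - avg_90           # positive means the price is trending up

    # Scoring needs only these derived elements, not the full detail.
    score = 1.0 if trend > 0 else 0.0
    print(round(avg_30, 2), round(avg_90, 2), round(trend, 2), score)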

Real-time analytics, or more fairly, near-real-time analytics, requires these rules to be checked ASAP after new information is available.

A dashboard can provide one of two features: either the dashboard applies the rules and presents alerts when some rule triggers… or the dashboard presents raw data for evaluation by a human. For example, the speedometer on your car dashboard presents raw data and it is up to you to apply rules based on the input. Note that the speedometer on your car provides a real-time display. Sometimes BI dashboards use real-time displays like a speedometer to display static data. For example, I have seen daily metrics displayed using a speedometer widget… but since the widget updates just once a day the display is clearly metaphorical.

Active data warehousing implies some sort of rule-based activity. The activity may be triggered in near-real-time or as a batch process.

But in any consideration of real-time processing there is an issue if the rules cross data input boundaries. By this I mean… it is simpler to build a speedometer that reads from one input, the rotation of the axle, than to build a meter that incorporates multiple inputs… for example the meter that displays how many miles or kilometers you have left before you run out of petrol. To provide this in real time your car needs real-time access to two inputs… and an embedded processor is required to integrate the data and derive the display. Near-real-time displays of data warehouse data have the same constraints… if the display requires data from more than one source, that data has to be acquired and integrated and calculated/scored in near-real-time. This is a daunting problem if the sources cross application system boundaries.
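
As a toy illustration of the point (the sensor values are invented), the range meter cannot display anything until two inputs are integrated and a derivation is applied:

    # The speedometer reads one input; the range meter needs two,
    # integrated and calculated in real time (values are illustrative).
    fuel_litres = 32.0        # input 1: fuel level sensor
    litres_per_100km = 6.4    # input 2: consumption rate, derived elsewhere
    range_km = fuel_litres / litres_per_100km * 100
    print(round(range_km))    # ~500 km before you run out of petrol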

In the book “In-Memory Data Management” Plattner and Zeier promise near-real-time analytics from an in-memory DBMS, HANA. But there is no discussion of how this really works for a data warehouse built on integrated data across source systems. Near-real-time predictive modeling requires broad data that will cross these system boundaries. It may be possible to develop a system where near-real-time data acquisition and integration can occur… and rules may be applied immediately to identify interesting events. But this sort of data acquisition is very advanced… and an in-memory database does not inherently solve the problem… not every application in the enterprise will be in the same memory space.

I do not see it. SAP may live in a single memory space… and maybe every possible application of that data can live there as well. But as long as there is relevant data outside of that space, data integration is required and the argument for real-time weakens.