Getting started with Hadoop… Enhance Your Data Warehouse Eco-system

Gartner thinks that the Big Data hype is going to die down a little for lack of progress… (see here). Companies without web-scale big data are finding it hard to do anything commercially interesting… still, CIOs sense that Hadoop is going to become important. This post offers a suggestion that might help you get started.

Hadoop goes here

In most data warehouse eco-systems there is an area, a staging place, where data lands after it is extracted from the source and before it is transformed. Sometimes the staging area and the ETL process are continuous and data flows through the ETL hardware system without seeming to land… but it usually is written somewhere.

The fact is that enterprises often move to their data warehouse only the data that will be consumed by a user query. Often users want to see only lightly aggregated data, in which case aggregation is part of the ETL process… and the raw detail is lost. A great example of this comes from the telecommunications space. Call details may be aggregated into a call record… and often call records are sufficient to support a telco’s business processes.
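To make the loss of detail concrete, here is a minimal sketch of that aggregation step. The event names, field layout, and the `to_call_records` helper are all hypothetical, invented for illustration… a real telco feed will look different.

```python
from collections import defaultdict

# Hypothetical raw call detail events: (call_id, event, seconds_since_setup)
call_details = [
    ("c1", "setup", 0),
    ("c1", "connect", 2),
    ("c1", "handoff", 45),
    ("c1", "release", 120),
    ("c2", "setup", 0),
    ("c2", "connect", 3),
    ("c2", "release", 33),
]

def to_call_records(details):
    """Collapse detail events into one summary record per call."""
    times = defaultdict(dict)      # call_id -> {event: time}
    handoffs = defaultdict(int)    # call_id -> number of handoff events
    for call_id, event, t in details:
        if event == "handoff":
            handoffs[call_id] += 1
        else:
            times[call_id][event] = t
    return {
        call_id: {
            "duration_s": ev["release"] - ev["connect"],
            "handoffs": handoffs[call_id],
        }
        for call_id, ev in times.items()
    }

call_records = to_call_records(call_details)
```

Each call ends up as a duration and a handoff count. When the handoff happened, or how long setup took, is no longer recoverable from the call record alone.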

But sometimes the detail is important. In this case the staging area needs to become a raw data warehouse… a place where piles of data may be stored inexpensively for a time… possibly for a long time.

This is where Hadoop comes in. Hadoop uses inexpensive hardware and very inexpensive software. It can become your staging area and your raw data warehouse with little effort. In subsequent phases, you can build up a library of the jobs that need to look at raw data. You might even start to build up a series of transformations and aggregations that might eventually replace your ETL system.
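As a rough idea of what one entry in such a library of jobs might look like, here is a map/reduce-style sketch in plain Python. It only imitates the map, sort, and reduce phases in memory… it does not call any Hadoop API, and the comma-separated record layout is an assumption made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (caller, 1) for each raw record; the field layout is hypothetical."""
    for line in lines:
        caller, _callee, _seconds = line.strip().split(",")
        yield caller, 1

def reducer(pairs):
    """Sum the counts per key; pairs must arrive sorted by key, as after a shuffle."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)

# Hypothetical raw records as they might sit in the staging area.
raw = ["alice,bob,120", "bob,carol,30", "alice,carol,45"]

# sorted() stands in for the shuffle/sort between the map and reduce phases.
calls_per_caller = dict(reducer(sorted(mapper(raw))))
```

The same mapper and reducer shape could be reused for other questions against the raw data by changing what the mapper emits.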

This is what Sears Holdings is up to (see here).

As I suggested in an earlier post, the economics of Hadoop make it the likely repository for big data. Using Hadoop as the staging area for your data warehouse data might provide a low-risk way to get started with Hadoop… with an ROI… while preparing your staff for other Hadoop things to come…

 

More on Big Data… and on Big Data Analytics… and on a definition of a Big Data Store…

After a little more thinking I’m not sure that Big Data is a new thing… rather it is a trend that has “crossed the chasm” and moved into the mainstream. Call detail records are Big Data and they are hardly new. In the note below I will suggest that, contrary to the long-standing Teradata creed, Big Data is not Enterprise Data Warehouse (EDW) data. It belongs in a new class of warehouse to be defined…

The phrase “Big Data” refers to a class of data that comes in large volumes and is not usually joined directly with your Enterprise Data Warehouse data… even if it is stored on the same platform. It is very detailed data that must be aggregated and summarized and analyzed to meaningfully fit into an EDW. It may sit adjacent to the EDW in a specialized platform tailored to large-scale data processing problems.

Big Data may be data structured in fields or columns, semi-structured data that is de-normalized and un-parsed, or unstructured data such as text, sound, photographs, or video.

The machinery that drives your enterprise, either software or hardware, is the source of Big Data. It is operational data at the lowest level.

Your operations staff may require access to the detail, but at this granular level the data has a short shelf life… so it is often a requirement to provide near-real-time access to Big Data.

Because of the volume and low granularity of the data, the business usually needs to use it in a summarized form. These summaries can be aggregates or they can be the result of statistical summarization. The statistical summaries are the product of Big Data analytics. This is a key concept.
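The difference between the two kinds of summary can be sketched in a few lines. The call durations below are invented sample values, and the choice of mean, median, and standard deviation is just one plausible set of statistics, not a prescription.

```python
import statistics

# Hypothetical call durations, in seconds, for one subject area.
durations = [30, 45, 60, 75, 90, 120, 600]

# A plain aggregate: counts and totals.
aggregate = {"calls": len(durations), "total_s": sum(durations)}

# A statistical summary: describes the shape of the detail, not just its size.
statistical = {
    "mean_s": statistics.mean(durations),
    "median_s": statistics.median(durations),
    "stdev_s": statistics.pstdev(durations),
}
```

Here the mean is pulled well above the median by a single 600-second call… the kind of thing a count and a total will never show you.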

Before this data can be summarized it has to be collected… which requires the ability to load large volumes of data within business service levels. Big Data also requires data quality control at scale.

You may recognize these characteristics as EDW requirements; but where an EDW requires support for a heterogeneous environment with thousands of data subject areas and thousands and thousands of different queries that cut across the data in an ever-increasing number of paths, a Big Data store supports billions of homogeneous records in a single subject area with a finite number of specialized operations. This is the nature of an operational system.

In fact, a Big Data store is really an Operational Data Store (ODS)… with a twist. In order to evaluate changes over time the ODS must store a deep history of the details. The result is a Big Data Warehouse… or an Operational Big Data Store.
