My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.
If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect in the ETL space.
I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop eco-system.
About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes and I suggested they spin the product both into analytics and into ETL… I’ll have to look and see where they are these days…
Gartner thinks that the Big Data hype is going to die down a little for the lack of progress… (see here) Companies without web-scale, big, data are finding it hard to do anything commercially interesting… still CIO’s sense that Hadoop is going to become important. This post provides a suggestion that might help you to get started.
In most data warehouse eco-systems there is an area, a staging place, where data lands after it is extracted from the source and before it is transformed. Sometimes the staging area and the ETL process are continuous and data flows through the ETL hardware system without seeming to land… but it usually is written somewhere.
The fact is that often enterprises only move data to their data warehouse that will be consumed by a user query. Often users want to see only lightly aggregated data in which case aggregation is part of the ETL process… the raw detail is lost. A great example of this comes from the telecommunications space. Call details may be aggregated into a call record… and often call records are sufficient to support a telco’s business processes.
But sometimes the detail is important. In this case the staging area needs to become a raw data warehouse… a place where piles of data may be stored inexpensively for a time… possibly for a long time.
This is where Hadoop comes in. Hadoop uses inexpensive hardware and very inexpensive software. It can become your staging area and your raw data warehouse with little effort. In subsequent phases, you can build up a library of the jobs that need to look at raw data. You might even start to build up a series of transformations and aggregations that might eventually replace your ETL system.
As I suggested in an earlier post, the economics of Hadoop make it the likely repository for big data. Using Hadoop as the staging area for your data warehouse data might provide a low risk way to get started with Hadoop… with an ROI… preparing your staff for other Hadoop things to come…