Hadoop and ETL

My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.

If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought away from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect on the ETL space.

I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop ecosystem.
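To make the idea concrete, here is a minimal sketch of the kind of transformation an ETL tool would traditionally own, expressed instead as a Hadoop Streaming mapper in Python. The record layout and cleansing rules are hypothetical, for illustration only:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: cleanses raw, ungoverned extracted
# records in the Hadoop staging area before they are loaded into the EDW.
# Would be launched with something like:
#   hadoop jar hadoop-streaming.jar -mapper cleanse.py -input raw/ -output staged/
import sys

EXPECTED_FIELDS = 4  # assumed layout: order_id, customer_id, order_date, amount


def cleanse(line):
    """Return a cleansed tab-delimited record, or None to drop the row."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != EXPECTED_FIELDS:
        return None  # drop malformed rows
    order_id, customer_id, order_date, amount = fields
    if not order_id or not customer_id:
        return None  # drop rows missing keys
    try:
        amount = "%.2f" % float(amount)  # normalize the amount
    except ValueError:
        return None
    return "\t".join([order_id, customer_id, order_date, amount])


if __name__ == "__main__":
    # Streaming mappers read records on stdin and emit results on stdout.
    for line in sys.stdin:
        out = cleanse(line)
        if out is not None:
            print(out)
```

The point is not this particular script but the division of labor: the transformation logic could still be *designed* in an ETL tool's GUI, while Hadoop supplies the parallel execution.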

Here is the idea from a previous post.

About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes, and I suggested they spin the product out into both analytics and ETL… I’ll have to look and see where they are these days…

One thought on “Hadoop and ETL”

  1. I agree. I still believe ETL tools have a role to play in this era. Beyond development support, administration is still a major concern: we have a tough time finding admins who know, or can be trained in, Oozie. It would be easier to cross-train INFA (or any ETL) admins, since for them it is just an add-on feature and everything else remains the same, rather than train them in a whole new product that needs a lot of scripting knowledge. I’m hearing very positive feedback about INFA BDE, except for the cost. We did some Hadoop-as-staging-platform implementations, but I have been very selective in moving components into the staging area. I wasn’t brave enough to test the SCDs in Hadoop :(. I left all the small dimensions in the DB and moved fact processing and some of the data cleansing activities to Hadoop. To keep it simple I chose only two components: Pig for data cleansing and Hive for analysis. Has anyone tried soft deletes with Hadoop?
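On the soft-delete question above: because HDFS files were effectively immutable in pre-ACID Hive, a common workaround is to carry a delete flag on each record and periodically rewrite the affected table or partition (in Hive, an `INSERT OVERWRITE` joining the base data to the incoming delete keys). A Python sketch of that merge logic, with a hypothetical record layout:

```python
# Hypothetical sketch of the soft-delete merge: rewrite the base records,
# flagging (rather than removing) any row whose key arrives in the delete
# feed. In Hive this would be an INSERT OVERWRITE over the partition.

def apply_soft_deletes(base_rows, deleted_keys):
    """base_rows: list of (key, payload, is_deleted) tuples.
    Returns the rewritten rows, with is_deleted set for matching keys."""
    deleted = set(deleted_keys)
    return [(key, payload, is_deleted or key in deleted)
            for key, payload, is_deleted in base_rows]


rows = [(1, "a", False), (2, "b", False), (3, "c", True)]
merged = apply_soft_deletes(rows, deleted_keys=[2])
# Every row survives the rewrite; row 2 is now flagged deleted,
# and row 3 keeps its earlier flag.
```

The full-rewrite cost is why the commenter’s caution about SCDs and deletes in Hadoop was well placed at the time.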

