About four years ago Michael McIntire and I were pondering the rise of Hadoop. This blog will share bits of that conversation, provide an update based on the state of Hadoop today, and suggest a future state…
Briefly… we believed that the Hadoop eco-system was building all of the piece-parts of a very large database management system. You could see the basics: a distributed file system in HDFS, a low-level query engine in Map/Reduce with an abstraction in Pig, and the beginnings of optimization, SQL, availability, backup & recovery, etc.
We wondered why this process was underway… why would enterprises go to Hadoop when there were perfectly good relational VLDBs that could solve most of the problems… and where they could not… extending a mature RDBMS would be easier than the giant start-from-scratch of Hadoop.
We saw two reasons for the Hadoop project, process, and progress:
- Michael pointed out that the RDBMS vendors just did not understand how to price their products on “Big Data” (to be fair that term was not in use then)… if you have 7PB of data, as Michael did… then at the current $35K/TB list price the bill would be $245M. Even if you discounted to $1K/TB the tab would be $7M. The DBMS vendors were giving the big guys a financial incentive to Build instead of Buy… and so Google Built and Yahoo Built and Hadoop emerged.
- I pointed out that the academic community would support this… the ability to write a thesis based on new work in the DBMS space was becoming harder… but it was possible to sponsor papers that applied DBMS concepts to Hadoop and keep the PhD pipeline filled.
So there was funding, research, and development.
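Michael's pricing point is easy to check with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name `license_cost` is made up for illustration, and it assumes decimal units (1 PB = 1,000 TB), which is what the $245M figure implies.

```python
# Back-of-the-envelope licence bill, using only the figures quoted above.
TB_PER_PB = 1_000  # assumption: decimal terabytes per petabyte


def license_cost(data_pb: float, price_per_tb: float) -> float:
    """Return the total licence bill in dollars for data_pb petabytes."""
    return data_pb * TB_PER_PB * price_per_tb


list_bill = license_cost(7, 35_000)       # 7PB at the $35K/TB list price
discounted_bill = license_cost(7, 1_000)  # 7PB at a deeply discounted $1K/TB

print(f"list: ${list_bill / 1e6:,.0f}M, discounted: ${discounted_bill / 1e6:,.0f}M")
```

Even the discounted figure is a seven-digit bill before any hardware is bought… which is the incentive to Build.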
The narrative from here on is my own… Michael is off the hook.
Today Hadoop has a first release available as open source. Dozens of companies are working to extend the core… some as contributors, some with a commercial interest, many with both incentives. The stack is maturing… and we now easily imagine a day when Hadoop will rival Teradata, Exadata, Netezza, and Greenplum in VLDB performance… with some product maturity and a rich set of features. And if Hadoop gets close and the price is free (or nearly so…) then the price/performance of Hadoop will make it unbeatable for “Big Data”.
In fact, the trigger for writing this piece now was the news a few weeks back from one of our Hadoop partners that HIVE was POC’d against one of the databases mentioned above on a big data problem and came close. The main query ran in 35 minutes on the DBMS and in 45 minutes with HIVE. The end is in sight… and sooner than expected.
What might this mean for the future?
Imagine a market where Hadoop can solve for big data problems… let’s say problems over 500TB just to draw a line… with the same performance as the best RDBMS, in a write-once/read-many use case like a data warehouse… for free. For FREE… plus the cost of the hardware. Hadoop wins… no contest.
Let’s suggest a market from 50TB to 500TB where a conventional RDBMS can out-perform Hadoop by 2X more-or-less… but Hadoop is free… so only applications where the performance matters can pay the price premium.
And let’s suggest a high performance in-memory database (IMDB) market that beats disk-based and SSD-based RDBMS by 50X for a 50% premium (based on new technologies like phase-change memory) and can beat Hadoop by 1000X but at a higher cost.
You can see the squeeze:
- IMDB will own the high performance market… most-likely in the 100TB and under space…
- Hadoop will own the big data 500TB+ low-cost market…
- and the conventional DBMS vendors will fight it out for adequate-performance/medium-priced applications from 100TB to 500TB… with continued pressure from the top and the bottom.
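The squeeze can be written down as a simple lookup by data volume. This is a sketch of the three bullets above and nothing more: the function name `likely_segment` is hypothetical, and how the exact 100TB and 500TB boundaries are assigned is an assumption, since the lines are only drawn roughly.

```python
def likely_segment(size_tb: float) -> str:
    """Map a data volume in TB to the product class expected to win it."""
    if size_tb >= 500:
        return "Hadoop"             # big data, low cost
    if size_tb > 100:
        return "conventional DBMS"  # adequate performance, medium price
    return "IMDB"                   # high performance, 100TB and under


for size in (50, 250, 7000):
    print(f"{size}TB -> {likely_segment(size)}")
```

The middle branch is the only one with a neighbor on both sides… which is the pressure from the top and the bottom.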
Economics will drive this. The conventional DBMS vendors are moving to SSDs… which increases their price in the direction of an IMDB… and increases their price/performance in the same good direction. But the same memory in SSDs will soon be generally available as primary memory. So the IMDB prices and the conventional DBMS prices will converge… but the IMDB products will retain a 50X-100X performance advantage by managing the new memory as memory instead of as a peripheral device. Hadoop may or may not leverage SSDs… but it will be free.
3 thoughts on “The Future of Hadoop and of Big Data DBMSs”
Your thinking goes along many of the same lines as my own, but one thing you seem to be leaving out of the discussion is the potential for “FREE” RDBMS (most likely Postgres at this point, but perhaps something like Big Query as well) to also claim some of the 500TB working space, by eliminating the primary competitive advantage you’ve labeled here for Hadoop (pricing). I’m curious if you have accounted, and perhaps discounted, that segment, or if you see that as a potential way for RDBMS to push back on the squeezing.
Postgres has to get a shared-nothing implementation into the market to get big. If that happens, and some market momentum forms behind it, then it could change the landscape. But right now it only seems possible… not likely. I’d still bet on Hadoop.
PostgreSQL is unlikely to make it into the MPP market. There are three basic problems:
1) the stack is not block oriented with heavy weight functions, trading code minimalism for stack performance (this makes it nearly impossible to optimize the cache line without going columnar),
2) there is no open source foundation for the required interconnect support (not the hardware – the distribution protocols) which is why Greenplum and to a lesser extent Aster were major rewrites and have split their code lines from the open source versions.
3) the architecture fundamentally requires a single control node. Hadoop continues to suffer from the same problem. Anyone seen any work on a parallel PostMaster yet? I’ve asked every single MPP PG vendor to do it, none has even put it on the roadmap. Several have said it simply wasn’t needed.
Sure you can “shard” a bunch of PG instances, but then you’re doing the work yourself in a limited set of functions which can likely only support small OLTP-like units of work, essentially a hand-built and self-maintained software stack. See the very talented team at Skype and some of the great work they have done with sharded and big PG implementation – and remember that it is not analytics.
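To make “doing the work yourself” concrete, here is a minimal sketch of hand-rolled shard routing. Everything in it is an assumption for illustration: the shard names are placeholders, and hash-modulo routing is just one simple scheme, not a description of what Skype or any MPP PG vendor actually built.

```python
import hashlib

# Hypothetical shard names; a real deployment would list its own PG instances.
SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2", "pg-shard-3"]


def shard_for(key: str) -> str:
    """Hash the key so that a given key always routes to the same instance."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


def group_by_shard(keys):
    """Group keys by target shard; the caller must run and merge each group itself."""
    plan = {}
    for key in keys:
        plan.setdefault(shard_for(key), []).append(key)
    return plan


print(group_by_shard(["user-1", "user-2", "user-3", "user-4"]))
```

A single-key lookup touches one instance, which suits small OLTP-like units of work. Anything that spans shards (a join, a global aggregate) has to be fanned out and stitched back together in application code, and that is the self-maintained stack.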
My view may be biased because of my years building multi-PB databases out on the edge, but there are fundamental limitations to the architecture which need to be solved in order to go any larger than a big SMP.