I recently listened to a pitch by a vendor that was well spun. Let’s use this post to consider some spin around what is not metadata and what is a proprietary interface.
The vendor described how they stored photos… I pointed out that photos were clearly not metadata… so they went on to talk about the metadata attached to the photo: who took the photo, where it was taken, and so on. But none of this is metadata either. It is data.
Consider this… if instead of a photo the “fact” was a transaction… we would never consider the customer’s name or customer ID to be metadata. It is data. Yet this is exactly what the vendor was proposing. Not only were attributes of the transaction called metadata, but the dimensional data around the fact was categorized as metadata. It was just wrong. Metadata about a photo might include the format of the photo: jpg or tiff. It might include information about the resolution. But it would not include the location where the photo was taken.
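To make the distinction concrete, here is a minimal sketch in Python. The class and field names are my own invention for illustration, not anything the vendor showed: the photographer and location sit with the fact as ordinary data, while only the format and resolution are treated as metadata.

```python
from dataclasses import dataclass


@dataclass
class PhotoMetadata:
    """Describes the stored representation of the photo (hypothetical fields)."""
    file_format: str  # e.g. "jpg" or "tiff"
    width_px: int
    height_px: int


@dataclass
class PhotoRecord:
    """The fact (the image) plus its attributes and dimensions: all of it data."""
    image: bytes
    photographer: str  # data, like a customer name on a transaction
    location: str      # data, a dimension of the fact
    metadata: PhotoMetadata


# Illustrative values only.
record = PhotoRecord(
    image=b"\xff\xd8",
    photographer="A. Example",
    location="Lisbon",
    metadata=PhotoMetadata(file_format="jpg", width_px=4000, height_px=3000),
)
```

Swap the photo for a transaction and the same split holds: customer and store are data, while the column types and encoding of the transaction record are metadata.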
I’ve seen this metadata mistake several times before… so I thought that I would call it out. Sometimes it is honest… and sometimes vendors spin their products as required to get an audience. As engineers we should call them on their spinning now and again.
If you are interested there is a great definition of metadata on Wikipedia (here)…
The same vendor was told that we are focussed on products that include no vendor lock-in… or that we can get around the locks by relying on standards. Their response was no worries… we use JSON formats.
Ugh. There is no portability afforded by this response. If I write code that expects to receive data wrapped in JSON but in a vendor-specific form underneath then I have no ability to plug in another product and have it work. In database-land this would be like saying that any ODBC call is non-proprietary even if the call includes vendor-specific SQL syntax. It is like saying that it is OK… we use XML. If you hear these responses be careful… the vendor is leading you down a path that will lock you to their product.
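A small sketch shows why “we use JSON” buys no portability. Both payloads below are made up for illustration (the vendors, key names, and structure are hypothetical): each is perfectly valid JSON, yet client code written against the first shape fails against the second.

```python
import json

# Two hypothetical vendors, both "using JSON", with different shapes underneath.
VENDOR_A = '{"rows": [{"cust_id": 7, "amt": 12.5}]}'
VENDOR_B = '{"result": {"items": [{"customerId": 7, "amount": 12.5}]}}'


def total_amount_vendor_a(payload: str) -> float:
    """Client code written against vendor A's key names and nesting."""
    doc = json.loads(payload)
    return sum(row["amt"] for row in doc["rows"])


def works_with(payload: str) -> bool:
    """Report whether the vendor-A client can consume the payload."""
    try:
        total_amount_vendor_a(payload)
        return True
    except KeyError:
        # Valid JSON, but not the structure the client was written for.
        return False
```

Here `works_with(VENDOR_A)` is true and `works_with(VENDOR_B)` is false, even though `json.loads` parses both without complaint. The lock is in the schema, not in the wrapper.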
Spinning too much makes me dizzy… and it insults the intelligence of the audience. I wish that vendors would just stop.
I want to soften my criticism of Greenplum’s announcement of HAWQ a little. This post by Merv Adrian convinced me that part of my blog here looked at the issue of whether HAWQ is Hadoop too simply. I could outline a long chain of logic that shows the difficulty in making a rule for what is Hadoop and what is not (simply: MapR is Hadoop and commercial… Hadapt is Hadoop and uses a non-standard file format… so what is the rule?). But it is not really important… and I did not help my readers by getting sucked into the debate. It is not important whether Greenplum is Hadoop or not… whether they have committers or not. They are surely in the game and when other companies start treating them as competitors by calling them out (here) it proves that this is so.
It is not important, really, whether they have 5 developers or 300 on “Hadoop”. They may have been over-zealous in marketing this… but they were trying to impress us all with their commitment to Hadoop… and they succeeded… we should not doubt that they are “all-in”.
This leaves my concern discussed here over the technical sense in deploying Greenplum on HDFS as HAWQ… or deploying Greenplum in native mode with the UAP Hadoop integration features which include all of the same functionality as HAWQ… and 2X-3X better performance.
It leaves my concern that their open source competition more-or-less matches them in performance when queries are run against non-proprietary, native Hadoop data structures… and my concern that the community will match their performance very soon in every respect.
It is worth highlighting the value of HAWQ’s very nearly complete support for the SQL standard against native Hadoop data structures. This differentiates them. Building out the SQL dialect is not a hard technical problem these days. I predict that there will be very nearly complete support for SQL in an open source offering in the next 18-24 months.
These technical issues leave me concerned with the viability of Greenplum in the market. But there are two ways to look at the EMC Pivotal Initiative: it could be a cloud play… in which case Greenplum will be an uncomfortable fit; or it could be an open source play… in which case, here comes the wacky idea, Greenplum could be open-sourced alongside Cloud Foundry and then this whole issue on committers and Hadoopiness becomes moot. Greenplum is, after all, Postgres under the covers.
First, you should look at Google’s Spanner paper here… this is the next-gen from Google and once it is embraced by the open source community it will put even more pressure on the big data DBMSs. Also have a look at YARN, the next Map/Reduce… more pressure still…
Next… you can imagine that the conventional database folks will quibble a little with my analysis. Let’s try to anticipate the push-back:
Hadoop will never be as fast as a commercial DBMS
Maybe not… but if it is close then a little more hardware will make up the difference… and “free” is hard to beat in price/performance.
SSD devices will make a conventional DBMS as fast as in-memory
I do not think so… disk controllers, the overhead of non-memory I/O, and an inability to fully optimize processing for in-memory will make a big difference. I said 50X to be conservative… but it could be 200X… and a 200X performance improvement reduces the memory required to process a query by 200X… so it adds up.
The price of IMDB will always be prohibitive
Nope. The same memory that is in SSDs will become available as primary memory soon and the price points for SSD-based and IMDB will converge.
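The reply to the first objection, that a little more hardware makes up the difference, comes down to simple arithmetic. The sketch below uses numbers I made up purely for illustration (license cost, node cost, and per-node throughput are all hypothetical) and assumes throughput scales linearly with node count, which real clusters only approximate.

```python
def cost_per_unit_throughput(license_cost: float, nodes: int,
                             node_cost: float, queries_per_hour_per_node: float) -> float:
    """Total cost divided by total queries per hour, assuming linear scaling."""
    total_cost = license_cost + nodes * node_cost
    throughput = nodes * queries_per_hour_per_node
    return total_cost / throughput


# Hypothetical commercial DBMS: faster per node, but licensed.
commercial = cost_per_unit_throughput(
    license_cost=500_000, nodes=10, node_cost=10_000, queries_per_hour_per_node=100)

# Hypothetical open source stack: 30% slower per node, 50% more nodes, no license.
open_source = cost_per_unit_throughput(
    license_cost=0, nodes=15, node_cost=10_000, queries_per_hour_per_node=70)
```

With these made-up inputs the open source cluster delivers slightly more total throughput (1,050 versus 1,000 queries per hour) at roughly a quarter of the cost per unit of throughput. Change the inputs and the gap moves, but the license term is what “free” removes.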
Here is a sound bite on Big Data I composed for another source…
Big Data is relative. For some firms Big Data will be measured in petabytes and for others in hundreds of gigabytes. The point is that very detailed data provides the vital statistics that quantify the health of your business.
To store and access Big Data you need to build on a scalable platform that can grow. To process Big Data you need a fully scalable parallel computing environment.
With the necessary infrastructure in place the challenge becomes: how do you gauge your business and how do you change the decision-making processes to use the gauges?