What is Big Data? No kidding this time…

I posted a little joke on this topic here… this time I’ll try to say something a little more substantive…

Big Data is the new, new thing. The phrase is everywhere. A Google search on the exact words “Big Data”, limited to pages updated in the last year, yields 39,300,000 results. The Wikipedia entry for Big Data suggests that big data is related to data volumes that are difficult to process. There is specific mention of data volumes that are beyond the ability to process easily with relational technology. The examples typically listed are weblog data and sensor data.

I am not a fan of the if-it’s-so-big-it’s-difficult-to-handle line of thinking. This definition lets anyone and everyone claim to process Big Data. Even the Wikipedia article suggests that for small enterprises “Big Data” could be under a terabyte.

Nor am I a fan of the anti-relational approach. I have seen Greenplum relational technology solve 7000TB weblog queries on a fraction of the hardware required by Big Data alternatives like Hadoop, in a fraction of the processing time. If relational can handle 7PB+, then under the too-big-for-relational definition Big Data would have to mean web-scale size… 1000s of petabytes… and only Google-sized companies could have it. Big Data seems smaller than that.

Maybe the answer lies in focusing on the “new” part? An Enterprise Data Warehouse (EDW) can be smallish or large… but there are new data subject areas in the Big Data examples that may not be appropriate for an EDW. Sensor data might not be usefully joined to more than a few dimensions from the EDW… so maybe it does not make sense to store it in the same infrastructure? The same goes for click-stream and syslog data… and maybe for call detail records and smart meter reads in telcos and utilities?

So Big Data is associated with new subject areas not conventionally stored in an EDW… big enough… and made up of atomic data such that there is little business value in placing it in the EDW. Big Data can stand alone… value derived from it may be added to the EDW. Deriving that value comes from another new buzzword: Big Data Analytics… surely the topic of another note…
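
To make the stand-alone idea a little more concrete, here is a minimal Python sketch. The meter reads, field layout, and the `daily_usage` helper are all hypothetical, made up purely for illustration; the point is only that the atomic reads stay in their own store while a much smaller derived summary is what gets handed to the EDW.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical atomic smart-meter reads: (meter_id, ISO timestamp, kWh).
# In practice these would live in the stand-alone Big Data store.
reads = [
    ("m1", "2012-03-01T00:15:00", 0.4),
    ("m1", "2012-03-01T00:30:00", 0.6),
    ("m2", "2012-03-01T00:15:00", 1.0),
    ("m1", "2012-03-02T00:15:00", 0.5),
]

def daily_usage(atomic_reads):
    """Roll atomic reads up to one row per meter per day."""
    totals = defaultdict(float)
    for meter_id, ts, kwh in atomic_reads:
        day = datetime.fromisoformat(ts).date().isoformat()
        totals[(meter_id, day)] += kwh
    # Sorted so the output order is deterministic.
    return [(m, d, round(k, 3)) for (m, d), k in sorted(totals.items())]

# Only these derived rows would be added to the EDW.
edw_rows = daily_usage(reads)
```

Four atomic reads become three summary rows here; with real sensor or click-stream volumes the reduction would be far larger, which is the argument for keeping the atomic data out of the EDW.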

