The Big Data Bang

There is still an open question over whether there is enough mass in the Universe to slow the expansion that followed the Big Bang and cause the Universe to contract. Meanwhile, the Big Data Bang continues to expand the universe of bits and bytes… and I would like to ask whether some of the numbers are overstated. I know that the sum of the bits and bytes is expanding, but I wonder whether the universe of information is expanding as much as we claim.

Note that by “information” I mean a unique combination of bits and bytes representing something new. In other words, if the same information is copied redundantly over and over, does that count?

There is a significant growth industry in deduplication software that can back up data without copying redundant information. The savings from these products are astounding. NetApp claims 70% of the unstructured data may be redundant (see here). Data Domain says that eliminating (and compressing) redundant data reduces storage requirements by 10X-30X (see here). What’s up with that?
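To make the idea concrete, here is a minimal sketch of how a deduplication ratio could be estimated. It is only an illustration, not how NetApp or Data Domain actually work: it assumes fixed-size chunks and SHA-256 fingerprints, and the function name dedup_ratio and the 4096-byte chunk size are my own hypothetical choices.

```python
import hashlib


def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Return logical bytes divided by unique bytes for fixed-size chunks.

    Hypothetical helper for illustration; real products typically use
    more sophisticated (often variable-size) chunking.
    """
    unique_chunks = {}  # fingerprint -> length of the chunk it represents
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        # Store each distinct chunk only once, however often it repeats.
        unique_chunks.setdefault(fingerprint, len(chunk))
    unique_bytes = sum(unique_chunks.values())
    return len(data) / unique_bytes if unique_bytes else 1.0


if __name__ == "__main__":
    # Ten identical 4 KB blocks: only one unique block needs to be stored.
    repeated = (b"A" * 4096) * 10
    print(dedup_ratio(repeated))
```

Ten copies of the same block give a ratio of 10, which is the same kind of number as the 10X claim above: the bits grew tenfold, the information did not.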

In the data warehouse space it is just as bad. The same data lives in OLTP systems, ETL staging areas, Operational Data Stores, Enterprise Data Warehouses, Data Marts, and now Hadoop clusters. The same information is replicated in aggregate tables, indexes, materialized views, and cubes. If you go into many shops you can find 50TB of EDW data exploded into 500TB of sandboxes for the data scientists to play with. Data is stored in hourly snapshots even though less than 10% of the data changes from hour to hour. There is redundancy everywhere. There is redundancy everywhere. 🙂
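The snapshot case is easy to put numbers on. The sketch below compares keeping a full copy every hour with keeping one base copy plus only the changed data. The 1 TB dataset and 24 snapshots are hypothetical figures, the function name snapshot_storage is mine, and the 10% change rate is the upper bound mentioned above.

```python
def snapshot_storage(base_tb: float, snapshots: int, change_fraction: float):
    """Return (full_copy_tb, base_plus_deltas_tb) for a series of snapshots.

    Assumes every snapshot after the first changes the same fraction of the
    data; a simplification for illustration only.
    """
    full_copies = base_tb * snapshots
    base_plus_deltas = base_tb + base_tb * change_fraction * (snapshots - 1)
    return full_copies, base_plus_deltas


if __name__ == "__main__":
    # Hypothetical: a 1 TB dataset, 24 hourly snapshots, 10% change per hour.
    full, incremental = snapshot_storage(1.0, 24, 0.10)
    print(f"full copies: {full:.1f} TB, base plus deltas: {incremental:.1f} TB")
```

Under those assumptions a day of full hourly snapshots occupies 24 TB, while the base plus the changes comes to about 3.3 TB. The rest is copies.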

I believe that there is a data explosion… and I believe that it is significant… but there is also a sort of laziness about copying data.

Soon we will see in production the first systems where a single copy of OLTP and EDW and analytic data can reside in the same platform and be shared. It will be sort of shocking to see the Big Data Bang slow a little…