More thinking on Specialized Databases

Recently I posted (here) some thinking that suggested that the cost of replicating data into specialized databases might outweigh the benefits of specialization. This post will present a counter view and try to sort out when a specialized database might make sense.

In the ZDNet post here: “Look at What Google and Amazon are doing with Databases: That’s your future” Toby Wolpe and Neo Technology CEO Emil Eifrem suggest that:

“The era of the one-size-fits-all database is over. It used to be when I grew up as a developer that for the architect in the project, when it came to choosing the bottom layer of the stack — the persistence layer — the choice was Microsoft, or IBM, or Oracle, or Sybase. It was a vendor choice.

They were all the same type of database. But that era has gone forever and it will never come back because data is just so big and so irregularly shaped now that you’re always going to be able to get a hundred times improvement, a thousand times improvement, a million times improvement if you get a data technology that is shaped like the shape of your data.”

While I have suggested that a swiss army knife DBMS that solves many problems from a single data source… thereby eliminating the cost and complexity of data replication and data synchronization… might provide a sensible choice for most commercial applications.

Actually I agree with Eifrem and Wolpe in many respects… but there is a difference in our starting assumptions. Let me be clear first about where I strongly agree.

When data volumes grow to web-scale… to Google-scale or Amazon-scale… then the inefficiencies of one-size-fits-all amplify and become intractable… so with a specialized DBMS you might indeed see 100X, 1000X, or more performance advantage and gain a competitive edge from replication and specialization.

But a lot of core data is not Big Data. This is where we do not seem to agree. While our company’s all aspire to have a customer database that is in the petabyte range… it is just not usually the case. Likewise we aspire to have a transaction database requiring petabyte scale… but it just is not the case in most businesses even if you keep years and years of history.

Let’s consider graph databases… Maybe customer data should be in a graph database to specialize it for processing relationships. But this is likely to make it sub-optimal for many other processes… in fact it is the thesis of the ZDNet article that it will make it sub-optimal for many other processes… and so replication to more specialized databases is the only alternative.

How might we handle this relationship problem in a generalized DBMS? HANA, for example, can form graphs in-memory from data shaped into columns… unfolding the graphDB blade from the swiss army knife when required but storing the data in a generalized shape otherwise.

It may be true that there could be orders of magnitude advantage for big data shaped into a specialized graphDB form… But if your customer database is in the terabyte range or less, then the advantage may be negligible… or at least the advantage may not justify the cost of replication into two forms.

And think about the implications of specializing big data. Google replicates tens of petabytes of data into multiple shapes to gain competitive advantage… and ten petabytes specialized and replicated ten times is really really big data.

So I agree with parts of the ZDNet post… big data companies are likely to be pushed by the competition to store the data multiple times in specialized replicated big databases… and for this you will look to Google, Amazon, Netflix, and the like for database technology. But most enterprises will be able to store core data in generalized databases… and will extend into big data realms only as machine-to-machine transactions and/or the Internet of Things drive them there… and then they will extend their data architectures rather than replicate again and again.