Michael Stonebraker has suggested several times… and again in this interview… that databases will become more specialized and that “one size will fit none”. I’m sure that his argument is more nuanced than the sound bites in the interview, but in this post I’ll suggest a line of thinking that may lead to a different conclusion.
First, let’s agree that the word “fit” means the best price for the performance required to meet your company’s service level requirements.
Then let’s agree with the basic premise behind Dr. Stonebraker’s argument… we agree that in any single-purpose application a specialized single-purpose DBMS can be developed that will outperform a generalized DBMS. This means that one-half of the fit, performance, is likely.
We would also agree that, between the growth of open source databases and the general growth in the database space, it is likely that someone can and will develop specialized databases and bring them to market in cases where there is enough market to make it worthwhile. But it is important to note that specialization will not become infinitely narrow… there has to be enough market for the special case to generate an attractive product.
So where is the disagreement? I do not believe that data is ever used in a single specialized business context. Not ever.
Let’s imagine that we have a business requirement for an extremely high volume OLTP application… and let us assume that the performance and/or scalability requirements are beyond what any general DBMS can provide and that the business ROI is significant… in other words, let us imagine that we are Google or Facebook. In this case we have no choice but to select or to develop an extreme, specialized, DBMS to solve the problem and extract the return.
But in this case, once the OLTP transactions are recorded… what do we do with the data? We need to use the data elsewhere in the business as the basis for deep analytics or for basic business intelligence… so we have to replicate the data to a second database. Since the second DBMS is also sort of specialized… it does not have to support OLTP… we select a second specialized, data warehousish product.
And then come new requirements for doing light queries in near real time to support operational analytics… so we build some sort of operational data store. Again we can select a product with a narrow technical sweet spot… but we have to replicate the data a third time.
In other words… given my premise… that data is never used in a single specialized context… specialized databases force replication… and replication allows for further specialization.
But what if the requirements are not so extreme? Then we might use a single conventional RDBMS for the EDW and for the ODS problems. If a more generalized DBMS product exists that could handle both the operational reporting and the analytic reporting requirements in a single image of the data then we could eliminate one replica and one DBMS. In other words, if the problem is not so extreme then a generalized solution might provide a solution in a single instance of the data avoiding replication.
Now the issue becomes: is the cost of a specialized system plus replication plus a second specialized system plus the cost of operating these systems less than the cost of a single generalized system? I believe that the answer will often be in favor of a single system even when the specialized systems are low-cost open source. Since “fit” is about cost maybe one size does fit now and again.
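To make that comparison concrete, here is a minimal sketch of the arithmetic. Every name and number in it is a hypothetical placeholder, not a measurement or a vendor price… the point is only the shape of the inequality: specialized system + replication + second specialized system + operations on one side, a single generalized system on the other.

```python
from dataclasses import dataclass


@dataclass
class System:
    """Annual cost of one DBMS deployment (all figures are hypothetical)."""
    name: str
    license_cost: float     # licences or support contracts
    operating_cost: float   # staff, hardware, administration


def total_cost(systems, replication_links=0, cost_per_link=0.0):
    """Sum system costs plus the cost of every replication feed between them."""
    base = sum(s.license_cost + s.operating_cost for s in systems)
    return base + replication_links * cost_per_link


# Placeholder numbers chosen only to show the shape of the comparison.
specialized = [
    System("extreme OLTP store", 0.0, 300_000.0),   # open source: no licence fee assumed
    System("analytic warehouse", 0.0, 300_000.0),
]
generalized = [System("general-purpose DBMS", 400_000.0, 250_000.0)]

specialized_total = total_cost(specialized, replication_links=1, cost_per_link=150_000.0)
generalized_total = total_cost(generalized)

print(f"specialize + replicate: {specialized_total:,.0f}")
print(f"single generalized:     {generalized_total:,.0f}")
print("generalized wins" if generalized_total < specialized_total else "specialized wins")
```

With different placeholder figures the result flips, of course… which is exactly why the answer depends on how extreme the requirement really is.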
This suggests the strategy of the OldSQL vendors. They are offering a Swiss Army Knife product that serves multiple requirements. Their feature sets have grown over 30 years and they are pretty capable across a wide array of business problems… and with the columnar and in-memory features being added they continue to cover ever more extreme use cases… not the most extreme use cases… but they cover more ground each year.
The strategy of the NewSQL vendors is to focus tight and hard. They might develop an extreme OLTP DBMS with no ability to do a join… a product with extreme scalability and no performance… or a graph database to solve for an important, narrow, set of queries… or a columnar product that performs analytics but supports no OLTP. This trend feeds the specialize and replicate meme advocated by Dr. Stonebraker.
HANA is a horse of a different color… neither NewSQL nor OldSQL. It is a new code base designed to solve for a very wide set of use cases in a single instance of the data. We certainly agree in this blog with Dr. Stonebraker’s contention that the 30-year-old legacy code base has to be retired. But SAP contends that you can build a new, generalized, DBMS that solves for all but the most extreme cases.
This is a great spot to end the year… having laid out the battle we will cover in this blog ongoing… with the legacy OldSQL vendors trying to tack on to their legacy code base… and doing pretty well at it… with the NewSQL vendors trying to specialize and replicate… and with HANA offering a new code base designed to solve for the whole picture. 2014 will be great fun to watch. This also sets the stage to ask next year whether “big data” applications are so extreme as to force users to specialize-replicate-specialize.
Have a great holiday season… and my best wishes. Thank you all for reading the Database Fog Blog in 2013… I hope for your continued attention in the New Year…
12 thoughts on “Specialized Databases vs. Swiss Army Knives”
Rob, great thought-provoking post as ever. By an eerie coincidence I’d just re-read the paper “Efficient Transaction Processing in SAP HANA Database – The End of a Column Store Myth” by Vishal Sikka et al., for SIGMOD 2012.
(Available at: http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p731-sikka.pdf )
This quotes Stonebraker’s original paper, but goes on to discuss how HANA tackles the design issue of having a variety of specialist operators (and their separate engines) which can collaborate and each of which is well suited to different tasks (OLAP, RDBMS, Spatial, text, etc.) and yet not give up the advantage of having a common and efficient foundation for efficient processing: a transactionally updatable in-memory column store. So by adopting an extensible layered model for the system it is possible to do both. I just looked it up again and the paragraph in that paper that your blog reminded me of was:
“We would like to point out that dierent workload characteristics do not fully justify going for the zoo of specialized systems. Our past experience of handling business applications leads us to support the hypothesis for a need of specialized collections of operators. We are biased against individual systems with separate life cycles and administration set-ups. However, we do not aim at providing a single closed system but a extensible data management platform with common service primitives. The currently available SAP HANA database functionality, which is a core part the SAP HANA appliance (), may be considered as one specific incarnation of such a tool box”
So with this architecture it is possible to gain the benefits of specialised engines, without the penalty of having to integrate totally separate hardware and software stacks. It seems to me that the recent addition of Spatial operators, and the planned expansion of Graph operators for HANA are good examples of this in practice. What are your thoughts on this?
First line of quotes should of course read:
“We would like to point out that different workload characteristics do not fully justify going for the zoo of …
No idea where the missing characters went 🙂
I believe, Henry, that it will always be possible to develop a narrow, specialized, DBMS with a shorter instruction path length for narrow, specialized, operations than could be supported by any generalized DBMS. So, I would disagree with the idea that “it is possible to gain the benefits of specialised engines, without the penalty of having to integrate totally separate hardware and software stacks”.
My point is that a single modern DBMS can match the net benefits of specialized engines… once you account for the cost of integrating separate stacks. This cost greatly reduces the benefits and leaves the specialized databases sensible only in niches where the extreme performance provided is an absolute necessity… warranting the cost of specialization + replication + specialization.
It is a very interesting question to ask: Is HANA a tightly-coupled set of specialized databases with very efficient memory-based, internally managed, replication (this would seem to be Vishal’s point)… or is it a single data image used to address a wide range of functions? I see it as the latter.
The operators you mention are also very interesting… do the graph operators operate on tabular data? If so then it would lend credence to a tightly-coupled product with a single set of operators working against a single instance of data. If not then it leads to a tightly-coupled product with data in multiple forms, replicated, with specialized operators working against specialized data formats.
Frankly, I suspect that the Vishal quote just does not tell the whole HANA story… HANA will provide a broad array of highly-efficient operators against a single data image… and provide a single data management system for specialized data operators requiring specialized data formats. You can see both forms in the product today.
A very thought-provoking question, Henry… likely to get me into some hot water… So it goes…
Happy New Year…
Rob, I agree with what you say above, and it sparks a further thought. I don’t think that there necessarily has to be a black and white / binary split between general purpose systems and completely separated specialised stacks. In the extreme, specialist systems will be optimised for a particular job, with other jobs ignored, or deliberately suboptimised so as to focus solely on the main design target.
If you regard HANA as having two layers, an engine layer and a storage layer, then clearly there are multiple engines in the top layer – SQL, column, text, planning, etc and these are programmed to efficiently process particular styles of processing.
At the lower level there are multiple storage models, columnar and row, and you could count federation as allowing access to other storage mechanisms (e.g., disk-based column store, Hadoop, etc.). It just so happens that the majority of the engines use the columnar store.
So maybe we should regard the lower level not as a tabular (relational) data store, but rather as a columnar store. This is used as the storage mechanism for the SQL (tabular) engine but also for text, planning, etc., so the familiar tabular representation is actually built on top of columnar primitives.
The reason why column storage is used the most often is that in-memory columnar stores are the most efficient way of using modern CPUs with their L1, L2 and L3 caches, SIMD instructions, etc. It so happens that this benefits multiple engine types, in particular relational and text (and shortly graph too). Therefore the multiple engines built on the top layer can all benefit from a common physical representation, and in the process make data sharing (across engines) easier too by allowing sharing via memory.
Therefore we can allow a common storage mechanism at the lower storage level (and linkages to other storage where necessary) and gain the benefits of having separate specialist engines while avoiding or reducing the cost of having to integrate completely separate stacks with wildly different internal mechanisms. It’s not a pure single-engine approach, but neither is it an extreme separation of engines and all their storage; rather, it is a balance of the two… just a thought.
I’ve been trying to get my head around this, Henry… but I cannot see the point. Either a database is specialized or not. If not then it will solve a wide variety of problems in a single instance of the data. If it is specialized then it will solve for a single use case and replication will be required to solve for another case. Of course it is possible that a DBMS product can be specialized and solve for two use cases… but at some point the word “specialized” is inappropriate and we call the product a generalized DBMS.
There is an important distinction here that I have not made clear enough… So I’m going to open a new blog entry. Stay tuned.
But let me say this… The magic of HANA is in its ability to solve the vast majority of commercial OLTP use cases and support an EDW analytic workload against a single image of the data. This lets HANA cut operational costs, effectively deploy data in-memory, and enable a new range of real-time applications that have yet to be conceived and developed. The OldSQL vendors are addressing this by employing two versions of the data, one for OLTP and one for BI & Analytics, in a single stack. This is a weak incremental advance… tacking the new stuff on the old… The result will be a weak, frail, architecture that will let you down over time.
Insofar as HANA allows you to store data in multiple formats and use internal replication to manage them we are going down the same path as others. There are extreme specialize-replicate-specialize use cases where it will make sense and in those cases HANA will be compared and compete with other NewSQL players.
As for your idea that there is a single columnar format that supports text processing, graph processing, and other data forms… this is TBD. The big HANA advance is to support OLTP and analytics in one form. Time will tell if we can adapt the other processes… and this is the tension that we will continue to discuss here as specialized databases deny progress in the generalized space and the generalized databases strive to solve more and more problems Swiss Army Knife style… again HANA stands out here with a unique approach… Your background or paycheck may push you to work on other products… but everyone should be cheering for HANA at least a little… it seems the best bet for a single instance of the truth in your enterprise… the best bet to get off of the specialize-replicate-specialize nightmare.
Rob, I wrote you a sort of followup:
Anyway, Hana might aim to do it all, especially in the enterprise software context, but I don’t think it makes sense even with Hana to run web-scale OLTP with operational reporting and ad hoc data science on the same environment… Sometimes you’ll have to break it down.
Of course, Ofir… I’m not suggesting that you can solve extreme problems in a single instance.
But conventional business problems are not web scale problems… and there is a considerable attempt, led by HANA but followed with add-ons to legacy code, to extend the reach of generalized database technology…
If you can avoid replication at a reasonable cost… you should.
I agree with you: in an enterprise context, Hana is a great solution as it makes life much easier, when you can afford it… Still, the rest of the vendors are working on catching up, bolting in-memory technologies and engines onto their existing products. Of course, the burden of proving their implementation quality is on them…
I think however that Stonebraker is less interested in the classical enterprise application space… his observations make more sense in the web-scale world.
I’d like to see the Stonebraker quote that supports this opinion, Ofir. I think that Stonebraker’s multi-database stance conveys his view of the direction of the entire database space… And his specific comments about legacy databases in the interview I cited would seem to support this.
Well, I won’t put words in his mouth… But I personally don’t see performance as a main challenge in most enterprise databases – it is about cost, manageability, HA, cost, security and cost (of course, when you aim to cut costs of enterprise licenses, you’ll hurt performance). So, there is really no need to go with half a dozen specialized databases if SQL Server is good enough – no need for a solution if there is no problem.
I don’t think even SAP positions Hana as the only database option… it is a premium offering for the high end of the database spectrum. For the less challenging, cost-driven systems, I think you still promote Sybase.
I think that we agree then, that specialized databases are not useful if a generalized database can solve the problem. And I suspect that we agree that HANA is a generalized database that solves some harder problems than those normally addressed by something like SQL Server (I know, there are in-memory OLTP extensions, and columnar extensions, and scalable extensions… not picking on SQL Server, just using it because Ofir did)… so HANA, and the legacy+tack-on products (again… I do not mean this to sound too harsh… but it is the architecture clearly emerging… and it is working to a fair degree) must impinge on the space that would otherwise be claimed by multiple specialized DBMSs.
SAP’s positioning of HANA vs. Sybase is clearly not driven solely by architectural considerations and product capabilities… you cannot draw solid architectural conclusions from this line of thinking… despite my attempts to ignore them in this blog… business considerations count. 😉