I very much like Curt Monash’s posts on dynamic schemas and schema-on-need… here and here are two examples. They make me think… But I am missing something… I mean that sincerely, not just as a setup for a critical review. Let’s consider how dynamism is implemented here and there… so that I can ask a question of the audience.
First imagine a simple unschema’d row:
Rob KloppDatabase Fog Bloghttp://robklopp.wordpress.com42
A human with some context could see that there is a name string, a title string, a URL string, and an integer. If you have the right context you would recognize that the integer holds the answer to the question: “What is the meaning of Life, the Universe, and Everything?”… see here… otherwise you are lost as to the meaning.
If you load this row into a relational database you could leave the schema out and load a string of 57 characters… or load the data parsed into a schema with Name, Title, URL, Answer. If you load this row into a key-value store you can load it as a single unschema’d pair with the key identifying the row and the value equal to the string… or parse the data into four key-value pairs.
In any case you have to parse the data… If you store the data in an unschema’d format you have to parse the data and bind values to keys or columns late… if you store the data parsed then this step is unnecessary. To bind the data late in SQL you might create a view from your program… or more likely you would name the values parsed with SQL string functions. To parse the data into key-value pairs you must do the equivalent. The same logic holds true for more complex parsing. A graph database can store keys, values, and relationships… but these facets have to be known and teased out of the data either early or late. An RDBMS can do the same.
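To make the symmetry concrete, here is a minimal sketch in Python with SQLite. It assumes a delimited variant of the example row (the undelimited original above cannot be parsed without outside context, which is the point); the table, view, and column names are invented for illustration. Early binding parses at load time; late binding stores the raw string and binds names at query time with SQL string functions wrapped in a view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A delimited variant of the example row (the delimiter is an assumption).
raw = "Rob Klopp|Database Fog Blog|http://robklopp.wordpress.com|42"

# Early binding: parse the row at load time into a fixed schema.
cur.execute("CREATE TABLE parsed (name TEXT, title TEXT, url TEXT, answer INTEGER)")
cur.execute("INSERT INTO parsed VALUES (?, ?, ?, ?)", raw.split("|"))

# Late binding: store the row unparsed...
cur.execute("CREATE TABLE unparsed (line TEXT)")
cur.execute("INSERT INTO unparsed VALUES (?)", (raw,))

# ...and bind values to names at query time with SQL string functions,
# wrapped in a view so queries look the same as against the parsed table.
cur.execute("""
    CREATE VIEW late_bound AS
    WITH s1 AS (SELECT substr(line, 1, instr(line, '|') - 1) AS name,
                       substr(line, instr(line, '|') + 1) AS rest
                FROM unparsed),
         s2 AS (SELECT name,
                       substr(rest, 1, instr(rest, '|') - 1) AS title,
                       substr(rest, instr(rest, '|') + 1) AS rest
                FROM s1)
    SELECT name, title,
           substr(rest, 1, instr(rest, '|') - 1) AS url,
           CAST(substr(rest, instr(rest, '|') + 1) AS INTEGER) AS answer
    FROM s2
""")

print(cur.execute("SELECT name, answer FROM parsed").fetchone())      # → ('Rob Klopp', 42)
print(cur.execute("SELECT name, answer FROM late_bound").fetchone())  # → ('Rob Klopp', 42)
```

Note the trade-off in question: the late-bound view pays the parsing cost on every query, while the parsed table paid it once at load.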
So what is the benefit of a database product that proclaims late binding as an advantage? Is it that late binding is easier to do than in an RDBMS? What am I missing?
Please do not respond with a list of other features provided by NewSQL and NoSQL databases… I understand many of the trade-offs… what I want to know is:
- What can they do connected to binding values to names that an RDBMS cannot? And if there is no new functionality…
- Is there some way they allow for binding that is significantly easier?
By the way, the Hitchhiker’s Guide is silent on the question of whether 42 is a constant or ever-changing. I think that I’ll ask Watson.
7 thoughts on “Dynamic Late Binding Schemas on Need”
The advantage to late binding is related to when the cost of structure is paid. In any and all systems, structure is required to do anything, even on unstructured data (RAW digital photograph data has a structure…). The underlying issues are related to the organizational costs of implementation, ergo a) in what time frame do I get access to the data, and b) at what cost does that access come.
Issue A is about whether I have to pay the elapsed time and organizational costs inherent in having a centralized team to implement/manage logical/physical models, implement standard ETL – blah, blah, blah – BEFORE I can use a single part of the data. Issue B is about the ongoing costs of maintenance and whether they are implicit or explicit (and when they occur) – do I have to change the entire structure for a front-end developer to add a new data attribute to the feed?
Issue B is highly relevant, and correlated with how large the organization becomes, since transport, movement, and access issues are limited to the knowledge required by the front-end and back-end knowledge workers only – and even then only to the knowledge required for each individual data element and its key.
I get it, Michael… this is why there are advantages to late binding… But if I can bind late or early in an RDBMS then why would I go to a new DBMS that touts late binding as a feature?
Are we just so used to early binding in an RDBMS that we cannot see the options to bind late?
Relational systems – or more generally set-based logic, which includes all relational row and column stores – are optimized with a structure in mind. A developer can certainly abstract some of these assumptions away – but there is a fundamental problem with applying set logic against data without a schema, because boundary conditions are implemented on a per-datum (singular) basis. No construct exists in relational set theory to do this – or we wouldn’t need ETL tools, Hadoop, etc. Runtime binding requires procedural code to structure the data under set operators.
I do not see the link between set-based logic and the issue at hand. In fact I do not believe this to be the case. After all, deep in every old-school database engine is a row-by-row loop to process an execution step. Set-based processing is a programming-language construct not tied to the data.
Further, if you go back to my example, an unschema’d row is the same in either case… a late-binding system has to store the row as it is… as it has no way to parse the row early. So data storage of raw data cannot be where the advantage lies.
But you may be on to something around the impact of applying structure. If we bind early in an RDBMS we gain the ability to structure the data into columns, to apply indexes, to partition data, etc. There is a cost to be paid at load time for this… but some benefit as well. Are you suggesting that when you take unschema’d data and bind it late, that some structuring takes place dynamically (still at a cost)? If we bind late do we restructure the data into columns more effectively and/or build partitions?
I guess that I still do not see it. You apply structure sooner or later. Once you parse and structure the data, you have the equivalent of an RDBMS. If you try to restructure you have either the equivalent of a view or of a materialized view (i.e. redundant and restructured).
Sorry… could you try again? I am missing something.
Rob, the next “row” you have described might have two of the values you listed, or it might have 20. Write set logic for 10K variations of that… and you end up with a procedural language inside a set loop to project the things you need for whatever your function needs to do.
And yes – sooner or later you apply structure. It doesn’t matter if you conform the data and load it to a relational store (early binding) and access it with SQL, or if you write a Hadoop mapper function (late binding) in a procedural language – everything must have a structure.
The only question to answer is – will you reuse enough of the data elements to overcome the front-loaded costs of normalizing the data, or is it cheaper/more effective to pay the higher runtime costs? It’s not a question of only doing one or the other – it’s a question of what data belongs in which infrastructure, and how to decide when to migrate data from one to the other.
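The variable-shape rows and procedural mapper described above can be sketched as a toy mapper in the Hadoop-streaming style (plain Python here; the `key=value` record layout and field names are invented for illustration). The binding is procedural and happens per record at read time, so a row with 2 fields and a row with 20 go through the same code:

```python
def map_record(line):
    """Late-bind one semi-structured record of the form k=v;k=v;...
    and project only the fields this job cares about."""
    fields = dict(pair.split("=", 1) for pair in line.strip().split(";") if pair)
    # Keys missing from this particular row simply bind to None.
    return fields.get("name"), fields.get("answer")

records = [
    "name=Rob Klopp;title=Database Fog Blog;url=http://robklopp.wordpress.com;answer=42",
    "name=Arthur Dent;planet=Earth",   # a row with a different shape
]
for rec in records:
    print(map_record(rec))
# → ('Rob Klopp', '42')
# → ('Arthur Dent', None)
```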
Ah… but normalizing and conforming data… ETL… is not binding. I still do not see why I need a special DBMS to bind late.
This gives me another idea for a post though… cool.
We do not need a special DBMS to bind late. We just need to extend the DBMS with a structure that supports late binding. The JSON datatype in Postgres is an example; so is the Flex table in Vertica.
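As a rough analogue of what those extensions do (a sketch only – not the actual Postgres JSON or Vertica Flex machinery), you can graft late binding onto a plain SQL engine by storing documents as text and extracting keys at query time. Here SQLite is extended with a user-defined function, since the bundled SQLite build may lack native JSON support:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")

# Register a small UDF playing the role Postgres's JSON operators play:
# pull one key out of a stored document at query time.
def json_get(doc, key):
    return json.loads(doc).get(key)

con.create_function("json_get", 2, json_get)

con.execute("CREATE TABLE docs (body TEXT)")
con.execute("INSERT INTO docs VALUES (?)",
            (json.dumps({"name": "Rob Klopp", "answer": 42}),))
con.execute("INSERT INTO docs VALUES (?)",
            (json.dumps({"name": "Deep Thought"}),))  # no 'answer' key

# 'name' and 'answer' are bound at query time, not load time; the row
# without an 'answer' key returns NULL rather than failing to load.
rows = con.execute(
    "SELECT json_get(body, 'name'), json_get(body, 'answer') FROM docs"
).fetchall()
print(rows)  # → [('Rob Klopp', 42), ('Deep Thought', None)]
```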