The link here describes a new 2TB SSD technology which should be on the market in 2013.
This prompts me to ask the question: How would we redesign relational databases for data warehousing if random reads become as fast as sequential reads?
I’ll look forward to your comments and post more on this over time…
8 thoughts on “Is there a new DW database architecture lurking in the wings?”
Data Warehouses are about sequential reads.
OLTP databases are about random reads.
I think, the question should be: “How would we redesign relational databases in general, if we don’t have to care about random reads”.
The fall-out from the OLTP market will be most interesting.
OLTP, DW, … are just actual (past) taxonomy to optimize data management tools design. The future (present) way to organize data management is “cloudy” that means web-oriented, data-intensive, and deployed in virtualized and cloud environments.
Demand and technologies will converge to “in memory” and “dispersed” DBs. If i well remeber that is in the agenda of all mayor IT providers: SAP with HANA, EMC (VMware) with Gemfire, Oracle with his “ten times in memory DB”, ….
I agree that the database world may split along the lines you suggest… In-memory databases will handle more and more OLTP workload… As well as “small data” BI workload. This will make up the bulk of the database market. But “big data” will not fit in-memory for some time… So there will be room for two architectures.
Cloud computing is something different which has to do with the underlying hardware platform more than the database application. And as I noted in other blogs… There is an architectural mismatch between the shared data paradigm favored by cloud computing and the shared-nothing architecture favored by “big data” proponents.
But I am trying to ask something different here… Will the introduction of cheap big solid state drives force a rethink of the “scan fast” approach in vogue for “big data”? Is there another architecture that could be implemented that uses indexes for all access? After all, there will be enough memory to keep “big data” indexes in RAM.
Depending on the usage and purpose, DataWarehouses too may be for random reads, as you can have an OLTP application running against a Datawarehouse (if the DB is capable of performing for different kind of applications, and so for sequential or random reads).
With the SSD technology, a big problem is the life of the dye, with as low as 3000 erase cycles: I’m not confident that it fits well with OLTP applications.
Designing for relational databases will probably be impacted on undefining secondary indexes and materialized views, and also probably a good approach in Data Modeling would be to use fixed datatypes to have fixed record lenghts avoiding row relocation.
Another consideration is in implementing a columnar design moving frequently accessed columns to the SSD and frequently updated columns to classic spinning hardware.
So… How would you redesign a DBMS from scratch if you could use indexed access for every read with no performance penalty?
There is no such thing like “no performance penalty” 🙁
Updating indexes is the most expensive part, plus you need to have the storage available. In addition, OLTP typically updates rows again and again, adding even more overhead.
For BI, things like index-only columns (where all the data is in the index instead in the table) could be a huge performance boost – I assume.
Again PostgreSQL is on the way to implement such features, they already got index-only scans (they can pick, under some circumstances, the value from the index instead of loading the table page), but there’s still a long way to index-only columns.
It’s not that easy to implement, especially in BI environments: think about a typical btree index, you get a new value and have to redistribute huge parts of your column. But there are better ways available to handle this problem.
The DBMS should be aware of how fast is the technology used to memorize data: be it memory, SSD, classic disks or somewhere in the cloud. The DBMS should be aware of all the different hardware components, distributing data based on a deterministic algorithm and also on frequency of access and hardware performance: the more frequently accessed data should go to the best performing hardware. But the frequency of access is something you can’t tell at design time, so the DBMS should be adaptive by continuosly collecting and analizing usage statistics.
If all the storage hardware were on SSD, then minimizing the erase cycles could be performed by adding Temporal features to each and every table: readings for a BI tool would perform fast enough without much impact on SSDs by an OLTP application that frequently changes bits of data.
PostgreSQL already uses fixed row length (and offload variable record lengths to TOAST tables).
I see a problem with the approach to move single columns to different storage types: frequenty you need to read different columns and then you have to access all the storages – the slowest one will define your answer times. Plus you either need to have deep knowledge about your database design (knowing the most used columns) or you need to have heavy access statistics, adding another burden to the database for questionable results. Also you have to keep in mind that BI access patterns usually name the columns, but many OLTP applications do something like “SELECT *” (because they don’t know or don’t care about columnar storage) and therefore cross out any attempt to optimize columnar storage.
Comments are closed.