In many of my posts I refer to the issues associated with building “extra” data structures to meet performance goals (see one of my first posts ever here). These extra structures are always a trade-off… slowing the performance of one function in order to speed up another. I thought that it might be helpful to be very clear about where I stand on this.
Indexes improve the performance of queries that address a small set of data. They also can improve join performance if your favorite optimizer can apply an index intersection to the execution plan for your queries. Indexes dramatically slow the performance of inserts, updates, and bulk data loads as they have to be maintained when data changes. You can mitigate the cost and update indexes in the background… the trade-off does not go away. Indexes are probably required for OLTP applications that pick out single rows.
Wouldn’t it be great if your favorite DBMS could resolve every query very fast without the overhead and operational effort associated with maintaining indexes? Certainly we should aspire to a read-optimized database, a data warehouse DBMS, that does not require indexes.
Vertica projections provide an optimized, materialized, view that improves the performance for a set of queries. The Vertica optimizer automatically selects the optimal projection. Vertica provides a very slick tool that builds projections based on the query set provided. I worded my post on Vertica a little vague… so let me be sure here to point out that every Vertica query runs against a projection… so it is possible to have only one. In this case there is no additional overhead. Adding projections slows the data load process and increases the storage requirements. This is the trade-off.
Other databases offer materialized views. They make the same trade-off as above.
An OLAP cube is a physical structure that pre-aggregates data so that your query workload can avoid the aggregation. The best implementations of this express the cube as a materialized view so that queries can use the pre-aggregated data without explicitly pointing at a cube structure… the optimizer picks it for you. In addition the best implementations let you drill out of the cube to the detail records. These products have the update/delete/load issues of an index plus add an extra data latency issue as the data has to be aggregated on some interval… usually hours or days. Many products do not allow joins from a cube. You can see the trade-off. The Oracle Exalytics product materializes the aggregated cube on a separate server in-memory. This provides even more performance but adds the system and operational overhead of moving data across system boundaries.
Wouldn’t it be nice if you could query raw data and perform aggregation so fast that even against terabytes of data you could run any query with 3 second or less response without the overhead of building cubes?
You may build specialized table structures and pre-join, pre-aggregate, or pre-compute data to make a set of queries run fast. The cost of building and maintaining this sort of implementation versus just querying the base tables is the trade-off. Further, this approach is sort of a trap. You cannot build these structures for every query… if you did the business would conceive another critical query the next day that required work.
You can add indexes to the structures built using the technique above and provide very fast application-specific performance to a small set of queries. This is currently the favored approach when companies build iOS or Android apps as it provides the best possible performance… at a significant price.
Wouldn’t it be great if this was unnecessary… you could just scan so fast that mobile response service levels could be met from the base data regardless of the query.
You can deploy redundant data in operational data stores, data marts, cube servers, analytic data stores, and so on… with each specialized store providing performance for some limited set of queries at the cost of development and support ongoing. Each of these copies could deploy specialized database products that speed up that set of queries a little more. Again, this surround-the-EDW approach is a trap that leads to the proliferation of data marts and of database technologies.
Please do not take that last paragraph the wrong way… I believe that the worst possible approach is to blindly standardize on one or two database products. This trade-off makes life convenient for the IT department at the expense of performance and agility in the business. It is OK to have one or two favored products but IT must always serve the business to the best of their ability as a first priority… and sometime the new start-up has just the thing (remember that once Teradata was a start-up and DB2 on the mainframe was the IT standard…).
What I wish was that one or two products could solve all of the performance and functionality problems without the cost of building “extra” stuff… one product would be better that two. I like products that make the extra stuff “free”. Netezza does a nice job of making zone maps “free”, for example. Teradata and Greenplum provide the option of row store or column store for “free”. Vertica automatically build extra projections for “cheap”… and while there is a cost to the projection it at least does not require staff to tune it up. Oracle materialized views are “cheap”.
What I dislike are products that require DBAs to work harder and harder to apply all of the techniques above to meet performance SLAs. Each of these techniques trades off performance for development and operational expense.
As I have noted before… the performance SLAs for BI are about to become severe as companies try to support BI on mobile devices. The development and operational costs of tuning up; that is the TCO; will be significant unless better, faster, software infrastructure becomes available.
The TCO for a database that could eliminate these extra constructs and could eliminate the cost of developing and maintaining them; and could eliminate the architectural fragility these approaches imply… and replace this with a DBMS that holds base data which could satisfy all queries in seconds; delivering the business agility this implies… the TCO would be compelling.
I actually believe that the answer is available in the market today… this is no longer a pipe dream… more later…