Indexes are not a good thing… A blog on TCO

In many of my posts I refer to the issues associated with building “extra” data structures to meet performance goals (see one of my first posts ever here). These extra structures are always a trade-off… slowing the performance of one function in order to speed up another. I thought that it might be helpful to be very clear about where I stand on this.

Indexes improve the performance of queries that address a small set of data. They also can improve join performance if your favorite optimizer can apply an index intersection to the execution plan for your queries. Indexes dramatically slow the performance of inserts, updates, and bulk data loads as they have to be maintained when data changes. You can mitigate the cost and update indexes in the background… the trade-off does not go away. Indexes are probably required for OLTP applications that pick out single rows.

Wouldn’t it be great if your favorite DBMS could resolve every query very fast without the overhead and operational effort associated with maintaining indexes? Certainly we should aspire to a read-optimized database, a data warehouse DBMS, that does not require indexes.

Vertica projections provide an optimized, materialized, view that improves the performance for a set of queries. The Vertica optimizer automatically selects the optimal projection. Vertica provides a very slick tool that builds projections based on the query set provided. I worded my post on Vertica a little vague… so let me be sure here to point out that  every Vertica query runs against a projection… so it is possible to have only one. In this case there is no additional overhead. Adding projections slows the data load process and increases the storage requirements. This is the trade-off.

Other databases offer materialized views. They make the same trade-off as above.

An OLAP cube is a physical structure that pre-aggregates data so that your query workload can avoid the aggregation. The best implementations of this express the cube as a materialized view so that queries can use the pre-aggregated data without explicitly pointing at a cube structure… the optimizer picks it for you. In addition the best implementations let you drill out of the cube to the detail records. These products have the update/delete/load issues of an index plus add an extra data latency issue as the data has to be aggregated on some interval… usually hours or days. Many products do not allow joins from a cube. You can see the trade-off. The Oracle Exalytics product materializes the aggregated cube on a separate server in-memory. This provides even more performance but adds the system and operational overhead of moving data across system boundaries.

Wouldn’t it be nice if you could query raw data and perform aggregation so fast that even against terabytes of data you could run any query with 3 second or less response without the overhead of building cubes?

You may build specialized table structures and pre-join, pre-aggregate, or pre-compute data to make a set of queries run fast. The cost of building and maintaining this sort of implementation versus just querying the base tables is the trade-off. Further, this approach is sort of a trap. You cannot build these structures for every query… if you did the business would conceive another critical query the next day that required work.

You can add indexes to the structures built using the technique above and provide very fast application-specific performance to a small set of queries. This is currently the favored approach when companies build iOS or Android apps as it provides the best possible performance… at a significant price.

Wouldn’t it be great if this was unnecessary… you could just scan so fast that mobile response service levels could be met from the base data regardless of the query.

You can deploy redundant data in operational data stores, data marts, cube servers, analytic data stores, and so on… with each specialized store providing performance for some limited set of queries at the cost of development and support ongoing. Each of these copies could deploy specialized database products that speed up that set of queries a little more. Again, this surround-the-EDW approach is a trap that leads to the proliferation of data marts and of database technologies.

Please do not take that last paragraph the wrong way… I believe that the worst possible approach is to blindly standardize on one or two database products. This trade-off makes life convenient for the IT department at the expense of performance and agility in the business. It is OK to have one or two favored products but IT must always serve the business to the best of their ability as a first priority… and sometime the new start-up has just the thing (remember that once Teradata was a start-up and DB2 on the mainframe was the IT standard…).

What I wish was that one or two products could solve all of the performance and functionality problems without the cost of building “extra” stuff… one product would be better that two. I like products that make the extra stuff “free”. Netezza does a nice job of making zone maps “free”, for example. Teradata and Greenplum provide the option of row store or column store for “free”. Vertica automatically build extra projections for “cheap”… and while there is a cost to the projection it at least does not require staff to tune it up. Oracle materialized views are “cheap”.

What I dislike are products that require DBAs to work harder and harder to apply all of the techniques above to meet performance SLAs. Each of these techniques trades off performance for development and operational expense.

As I have noted before… the performance SLAs for BI are about to become severe as companies try to support BI on mobile devices. The development and operational costs of tuning up; that is the TCO; will be significant unless better, faster, software infrastructure becomes available.

The TCO for a database that could eliminate these extra constructs and could eliminate the cost of developing and maintaining them; and could eliminate the architectural fragility these approaches imply… and replace this with a DBMS that holds base data which could satisfy all queries in seconds; delivering the business agility this implies… the TCO would be compelling.

I actually believe that the answer is available in the market today… this is no longer a pipe dream… more later…

5 thoughts on “Indexes are not a good thing… A blog on TCO”

  1. This is what Codd had in mind 4 decades ago. The separation of logical concerns (uniform representation of data (in relations) and integrity) separated from the implementation concerns like direct data access and performance to such an extent that they can not directly interfere with each other was on of the biggest drivers for the creation of the relational model. As long as SQL with it’s mixing of concerns is the basis for most database products I’m skeptical on fundamental solutions, We will continue to see new technologies trying to fix what should not have been broken in the first place and desperate customers forced to buy a slew of different data platforms for every set of slightly different information requirements.

    As soon as it is possible to negate a direct vendor lock in on *both* logical access and performance/storage, data platforms we will be finally free to structure our data for our customer without excluding advanced and performant technological data platform solutions.

  2. This made me chuckle: “remember that once Teradata was a start-up and DB2 on the mainframe was the IT standard”.

    I should be so lucky! For the first few years of my Teradata life IMS DB/DC was still the standard in the finance sector…I suspect it still is in many places. The cutover to DB2 was still happening – at least in the UK – whilst Teradata made inroads into the ‘shift reporting workloads off the mainframe’ market back in the late 80’s early 90’s.

    The “surround-the-EDW approach” is a trap, as you say. The default number of indexes in an MPP system should be zero. That’s a very good place to start.

  3. Zero Indexes means a full table scan even though you only want one record which has a frequently used identifier – and that means CPU, and therefore high cost, as well as elapsed time.

    I too dislike products that require DBAs to work harder and harder to apply all of the techniques above to meet performance SLAs. That is why Teradata has automated most of the work except the judgement as to which technique should be traded for which benefit i.e. where is the cost of an index warranted. The extra costs of an RDBMS that does not offer you the option of using indexes should be recognised as an impediment to efficiency.

    1. I noted that an index is justified for single-row operations… So we agree. I also believe that Teradata was the original provider of systems that just scan fast.

      To be honest, Kenneth, indexes are the least expensive add-on and, used judiciously, are ok. The other stuff… marts, cubes, aggregates, materialized views, are harder to justify. And as long as there are databases that can solve analytic queries without these constructs I recommend them in this blog.

      I do not know why anyone would use an OLTP RDBMS for a BI implementation.

  4. If the objective is to retrieve a single record and an equality condition is provided on the primary index (PI = X) then Teradata does not need to do a full table scan (FTS). Teradata has a unique ability to always retrieve all of the rows for a given PI value without scanning the table, no matter how large it might be.

    This feature is enabled via the use of in-memory maps so that the AMP and block location of each PI hash value is always known. This allows direct retrieval of the required block(s) that match the required PI value.

    Also, if a table is partitioned there is also a chance a FTS will not be required for queries not retrieving rows via the PI. This depends on the precise nature of the query, of course.

    I would hope that single row operations, especially against a large table, would be relatively infrequent in a BI system. If not, then an index may be worthwhile.

    My stance is to start with zero indexes and expect to deploy as few as possible, but where they are justified use them by all means – ‘judiciously’ as Rob says.

    Performance should ultimately rest on parallelism, a sensible schema and well-written queries. Rob would add in-memory capability as well, no doubt!

Comments are closed.

%d bloggers like this: