Thoughts on AWS Redshift…

In visible light, 4C 71.07 is less than impressive, just a distant speck of light. It’s in radio and in X-rays – and now, gamma rays – that this object really shines. (Photo credit: Wikipedia)

The shared-nothing architecture has, from the beginning, offered the promise of using hardware to solve performance problems rather than applying staff and tuning. By this I mean… if you can add nodes and scale out to improve query response, then why not throw hardware at performance problems rather than build a fragile infrastructure of aggregate tables, cubes, pre-joined/de-normalized marts, materialized views, indexes, etc.? Each of these performance workarounds is both expensive to build and expensive to operate.

There are several reasons, I think, why tuning has been more popular than scaling. In no particular order:

First, hardware vendors made it too hard to order and provision new nodes. You could not just press a button and buy capacity. Vendors wanted to charge you for terabytes when all you needed might be CPU and memory to fix the problem (see here, sigh). You had to negotiate a deal with a rep, work through your procurement group, and wait weeks for delivery. Then, the hardware you already had might not match the hardware for sale. New models could not be mixed with old nodes… so you had to consider a whole new cluster. The process was anything but agile. There have been attempts to fix this… and some of them are credible… but none are popular.

Next, the process to install the new nodes was moderately difficult… not rocket science, but not seamless to be sure. Data had to move. Backups had to be reconfigured, and sometimes old backups could not easily be restored to the new configuration. There was no easy way to burn in the new hardware, and if it failed early there were issues reversing the process. It just was not an everyday operational process… it was the exception, and that made it tough. This process too has improved over time, but it never became a no-brainer.

Finally, buying hardware is a capital expense (CAPEX). Even if you had to pay more in people costs to do the hard work of tuning, those were operational expenses (OPEX)… and OPEX funding was easier to get.

Redshift changes the game here. Even if the ParAccel database is just OK (see here)… and even if the overhead of running in the virtualized AWS environment makes it worse… it is still OK. You can provision new hardware in a couple of minutes. If Teradata is 25% faster than ParAccel for your query set… so what? You can add 25% more Redshift for a fraction of the extra cost of Teradata. Need more performance? Dial it in. Need permission? No problem, because it is all OPEX dollars.
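
To make "dial it in" concrete, here is a minimal sketch using boto3, the AWS SDK for Python. The cluster identifier and the 25% bump are placeholders of mine, and a real resize still takes some time while Redshift redistributes data behind the scenes… but the point stands: it is an API call, not a procurement cycle.

```python
# Minimal sketch: grow a Redshift cluster by roughly 25% via the AWS API.
# "my-dw" and the sizing math are placeholders, not a recommendation.
import math
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Look up the current node count.
cluster = redshift.describe_clusters(ClusterIdentifier="my-dw")["Clusters"][0]
current_nodes = cluster["NumberOfNodes"]

# Ask for ~25% more capacity; Redshift handles the resize and redistribution.
redshift.modify_cluster(
    ClusterIdentifier="my-dw",
    NumberOfNodes=math.ceil(current_nodes * 1.25),
)
```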

Redshift will deliver the flexibility to make scaling out less expensive than tuning it out. The TCO reductions from running a simple system, where hardware solves performance problems instead of ETL and staff, will be significant. This is how it always should have been.

The issue for Redshift will be… given the trend to reduce the data latency from operations to BI… can you move significant amounts of data from on-premises systems into the cloud fast enough to meet service-level agreements?
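
For reference, the usual load path is to stage extract files in S3 and then COPY them into the cluster… the question above is whether that round trip fits inside the SLA. A minimal sketch, in which the bucket, table, IAM role and connection details are all placeholders:

```python
# Sketch of the S3-then-COPY load path into Redshift. Every identifier below
# (file, bucket, table, IAM role, endpoint) is a placeholder.
import os
import boto3
import psycopg2

# 1. Stage the extract file in S3.
s3 = boto3.client("s3")
s3.upload_file("orders_20130301.csv.gz", "my-staging-bucket",
               "loads/orders_20130301.csv.gz")

# 2. Tell the cluster to load it with COPY (splitting the extract into
#    multiple files lets the slices load in parallel).
conn = psycopg2.connect(
    host="my-dw.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="loader",
    password=os.environ["DW_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY staging.orders
        FROM 's3://my-staging-bucket/loads/orders_20130301.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        GZIP CSV;
    """)
```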

Do not overlook Redshift… Amazon could be a player in the EDW space… But look for other databases to make inroads here as well. In-memory databases could work well in the cloud as they avoid some of the hardware abstraction required to access disks.

The Cost of Dollars per Terabyte

Dollars (Photo credit: Images_of_Money)

Let me be blunt: using price per terabyte as the measure of a data warehouse platform is holding back the entire business intelligence industry.

Consider this… The Five Minute Rule (see here and here) clearly describes the economics of hardware technology… suggesting exactly when data should be retained in memory versus when it may be moved to a peripheral device. But vendors who add enough memory to abide by the Rule significantly improve the price/performance of their products while weakening their price/TB… and therefore their competitive position.
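
For anyone who has not seen it, the Rule comes down to a single ratio. A minimal sketch, using illustrative round numbers of my own rather than quoted prices, shows how the break-even interval falls out:

```python
# The Five Minute Rule, back-of-the-envelope. All figures below are
# illustrative round numbers, not real prices; plug in current hardware
# costs to re-derive the interval.

pages_per_mb_ram = 128          # 8 KB pages per MB of RAM
accesses_per_sec_per_disk = 64  # random I/Os one drive can sustain
price_per_disk = 2000.0         # $ per drive (illustrative)
price_per_mb_ram = 15.0         # $ per MB of RAM (illustrative)

# Break-even reference interval: keep a page in memory if it is re-read
# more often than this; otherwise it is cheaper to fetch it from disk.
break_even_s = (pages_per_mb_ram / accesses_per_sec_per_disk) * \
               (price_per_disk / price_per_mb_ram)
print(f"break-even interval ~ {break_even_s:.0f} seconds "
      f"(~{break_even_s / 60:.1f} minutes)")
```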

We see this all of the time. Almost every database system could benefit from a little more memory. The more modern systems that use a data-flow paradigm, Greenplum for example, try to minimize I/O by using memory effectively. But the incentive is to keep the configured memory low to keep the price/TB down. Others, like Teradata, use memory carefully (see here) and write intermediate results to disk or SSD to keep their price/TB down… but they violate the Five Minute Rule with each spool I/O. Note that this is not a criticism of Teradata… they could use more memory to good effect… but the use of price/TB as the guiding metric dissuades them.

Now comes Amazon Redshift… with the lowest imaginable price/TB… and little mention of price/performance at all. Again, do not misunderstand… I think that Redshift is a good thing. Customers should have options that trade off performance for price… and there are other things I like about Redshift that I’ll save for another post. But if price/TB is the only measure, then performance becomes far too unimportant… just a requirement to be met. The result is that adequate performance is acceptable as long as the price/TB is low. Today IT departments are judged harshly for spending too much per terabyte… and judged less harshly, or excused, if performance becomes barely adequate or worse.

I believe that in the next year or two every BI/DW ecosystem will be confronted with the reality of providing sub-three-second response to every query as users move to mobile devices: phones, tablets, watches, etc. IT departments will then be faced with two options:

  1. They can procure more expensive systems with a high price/TB but a strong price/performance ratio… and change the metric that drives the purchase… or
  2. They can continue to buy inexpensive systems based on a low price/TB and then spend staff dollars to build query-specific data structures (aggregates, materialized views, data marts, etc.) to achieve the required performance. A rough cost comparison of the two options is sketched below.
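
To see why the metric matters, here is a deliberately rough sketch comparing the two options over three years. Every figure is hypothetical… it shows where the costs sit, not what they are, so plug in your own numbers.

```python
# Deliberately rough three-year comparison of the two options above.
# Every figure is hypothetical and exists only to show the shape of the math.

TB = 100                      # warehouse size (hypothetical)
YEARS = 3
ENGINEER_YEAR = 150_000       # fully loaded cost of one engineer (hypothetical)

# Option 1: higher price/TB, the platform itself meets the performance SLA.
option1 = TB * 20_000 + 1 * ENGINEER_YEAR * YEARS

# Option 2: low price/TB, plus the people who build and operate the aggregates,
# marts and materialized views that make it fast enough.
option2 = TB * 10_000 + 4 * ENGINEER_YEAR * YEARS

print(f"Option 1 (price/performance driven): ${option1:,}")   # $2,450,000
print(f"Option 2 (price/TB driven):          ${option2:,}")   # $2,800,000
```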

It is time for price/performance to become the driver, with support for some number of terabytes just a requirement to be met. This will delight users, who will appreciate better, not merely adequate, performance. It will lower TCO by reducing the cost of developing and operating query-specific systems and structures. It will restore the agility so missed in the DW space by letting companies use hardware performance, instead of people, to solve problems. It is time.