The shared-nothing architecture has, from the beginning, offered the promise of using hardware to solve performance problems rather than applying staff and tuning. By this I mean… if you can add nodes and scale out to improve query response then why not throw hardware at performance problems rather than build a fragile infrastructure of aggregate tables, cubes, pre-joined/de-normalized marts, materialized views, indexes, etc. Each of these performance workarounds are both expensive to build and expensive to operate.
There are several reasons, I think tuning has been more popular than scaling. Not in any particular order:
First, hardware vendors made it too hard to order/provision new nodes. You could not just press a button and buy capacity. Vendors wanted to charge you for terabytes when all you wanted might be CPU and Memory to fix the problem (see here, sigh). You had to negotiate a deal with a rep, work through your procurement group, wait weeks for delivery. Then, the hardware you have might not match the hardware for sale. New models could not be mixed with old nodes… so you had to consider a whole new cluster. The process was so not-agile. There have been attempts to fix this… and some of them are credible… but none are popular.
Next, the process to install the new nodes was moderately difficult… not rocket science but not seamless to be sure. Data had to move. Backups had to be reconfigured and sometimes old backups could not be easily restored to the new configuration. There was no easy way to burn in the new hardware and if it failed early there were issues reversing the process. It just was not considered an everyday operational process… it was the exception and that made it tough. This process too has improved over time but it never became a no-brainer.
Finally, buying hardware is a capital expense (CAPEX). Even if you had to pay more in people costs to do the hard work of tuning those were operational expenses… and funding was easier to get.
Redshift changes the game here. Even if the Paraccel database is just OK (see here)… and if the overhead of running in the virtualized AWS environment makes it worse… it is still OK. You can provision new hardware in a couple of minutes. If Teradata is 25% faster than Paraccel for your query set… so what? You can add 25% more Redshift for a fraction of the extra cost of Teradata. Need more performance? Dial it in. Need permission? No problem because it is all OPEX dollars.
Redshift will deliver the flexibility to make scale out less expensive than tune it out. The TCO reductions from running a simple system where hardware solves performance problems instead of ETL and staff will be significant. This is how it always should have been.
The issue for Redshift will be… given the trend to reduce the data latency from operations to BI… can you move significant amounts of data from on-premise into the cloud fast enough to meet service level agreements?
Do not overlook Redshift… Amazon could be a player in the EDW space… But look for other databases to make inroads here as well. In-memory databases could work well in the cloud as they avoid some of the hardware abstraction required to access disks.
- Amazon preparing ‘disruptive’ big data AWS service? (go.theregister.com)
- Amazon RedShift Open for Everyone (ozeanmedia.com)