In the previous posts in this series (Part 1, Part 2, Part 3) I suggested that:
- Shared-nothing is required for an EDW,
- An EDW is not usually under-utilized,
- There are difficulties in re-distributing sharded, shared-nothing data to provide elasticity, and
- A SAN cannot provide the same I/O bandwidth per server as JBOD… nor hit the same price/performance targets.
Note that these issues are tied together. We might be able to spread the EDW workload over so many shards and so many SANs that the amount of I/O bandwidth per GB of EDW data is equal to or greater than that provided by a DW appliance. This introduces other problems, as running a great many nodes typically adds overhead. But it could work.
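To make the bandwidth-per-GB argument concrete, here is a rough back-of-the-envelope sketch. The function name and every figure in it are placeholders I made up for illustration, not measurements of any real SAN or appliance; substitute your own numbers.

```python
import math


def shards_needed(edw_gb: float,
                  target_mb_s_per_gb: float,
                  san_mb_s_per_server: float) -> int:
    """Smallest number of servers whose combined SAN bandwidth
    matches a target I/O bandwidth per GB of EDW data."""
    required_mb_s = edw_gb * target_mb_s_per_gb
    return math.ceil(required_mb_s / san_mb_s_per_server)


# Hypothetical inputs, chosen only to show the arithmetic.
edw_gb = 10_000                # a 10TB warehouse
appliance_mb_s_per_gb = 1.0    # assumed appliance bandwidth per GB of data
san_mb_s_per_server = 200.0    # assumed SAN bandwidth available to each server

print(shards_needed(edw_gb, appliance_mb_s_per_gb, san_mb_s_per_server))
```

With these made-up inputs the answer is 50 servers. The point is only that the shard count grows linearly with the bandwidth gap, which is where the many-node overhead comes from.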
But what if we changed the architecture so that I/O was not the bottleneck? What if we built a cloud-based, shared-nothing in-memory database (IMDB)? Now the data could live on a SAN, as it would only be read at start-up and written at shut-down… so the issues with the disk subsystem disappear… and the issues around sharing the SAN disappear. Further, elasticity becomes feasible. With an IMDB we can add and remove nodes and re-distribute data without disk I/O… in fact it is likely that a column-store IMDB could move column-compressed data without re-building rows. An IMDB changes the game by removing the expense associated with disk I/O.
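The elasticity idea can be sketched in a few lines. This is a toy model under my own assumptions, not how any particular IMDB works: each node is a plain dictionary of compressed column segments, placement is a simple modulo on a segment id, and `zlib` stands in for a real columnar encoding. All function names are hypothetical. What it shows is that growing the cluster only hands compressed blobs from one node to another; nothing is decompressed and no rows are rebuilt.

```python
import zlib


def compress_column(values):
    """Stand-in for a columnar encoding: compress one column segment."""
    return zlib.compress(",".join(map(str, values)).encode("utf-8"))


def decompress_column(blob):
    """Inverse of compress_column, used only to check the data survived."""
    return [int(v) for v in zlib.decompress(blob).decode("utf-8").split(",")]


def owner(segment_id, node_count):
    """Assumed placement rule: segment id modulo the number of nodes."""
    return segment_id % node_count


def build_cluster(segment_count, node_count):
    """Create node_count in-memory nodes holding compressed segments."""
    nodes = [{} for _ in range(node_count)]
    for segment_id in range(segment_count):
        values = list(range(segment_id * 100, segment_id * 100 + 100))
        nodes[owner(segment_id, node_count)][segment_id] = compress_column(values)
    return nodes


def redistribute(nodes, new_node_count):
    """Re-place every segment for a new node count.

    Blobs are handed over as-is, still compressed. Returns the new
    nodes and the number of segments that changed owner.
    """
    new_nodes = [{} for _ in range(new_node_count)]
    moved = 0
    for old_index, segments in enumerate(nodes):
        for segment_id, blob in segments.items():
            new_index = owner(segment_id, new_node_count)
            new_nodes[new_index][segment_id] = blob
            if new_index != old_index:
                moved += 1
    return new_nodes, moved


cluster = build_cluster(segment_count=12, node_count=3)
cluster, moved = redistribute(cluster, new_node_count=4)
print(f"{moved} of 12 segments changed nodes")
```

Modulo placement is the simplest rule to show, but it moves most segments whenever the node count changes; a scheme such as consistent hashing would move fewer. Either way the transfer is memory-to-memory over the network, with no disk I/O involved.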
There is evidence emerging that IMDB technology is going to change the playing field (see here).
Right now there are only a few IMDB products ready in the market:
- TimesTen: which is neither shared-nothing scalable nor columnar, but could be the platform for a very small, 400GB or less (see here), cloud-based EDW;
- SQLFire: which is semi-shared-nothing scalable (no joins across shards), not columnar, but could be the platform for a larger, maybe 5TB, specialized EDW;
- ParAccel: which is shared-nothing scalable and columnar, but not fully an IMDB… though it could be (see C. Monash here); or
- SAP HANA: which is shared-nothing, IMDB, columnar and scalable to 100TB (see here).
So it is early… but soon enough we should see real EDWs in the cloud, likely on Amazon EC2, based on in-memory database technologies.