Cloud Computing and Data Warehousing: Part 4 – IMDB Data Warehouse in a Cloud

In the previous blogs on this topic (Part 1, Part 2, Part 3) I suggested that:

  1. Shared-nothing is required for an EDW,
  2. An EDW is not usually under-utilized,
  3. There are difficulties in re-distributing sharded, shared-nothing data to provide elasticity, and
  4. A SAN cannot provide the same IO bandwidth per server as JBOD… nor hit the same price/performance targets.

Note that these issues are tied together. We might be able to spread the EDW workload over so many shards and so many SANs that the amount of I/O bandwidth per GB of EDW data is equal to or greater than that provided on a DW Appliance. This introduces other problems as there are typically overhead issues with a great many nodes. But it could work.

But what if we changed the architecture so that I/O was not the bottleneck? What if we built a cloud-based shared-nothing in-memory database (IMDB)? Now the data could live on SAN as it would only be read at start-up and written at shut-down… so the issues with the disk subsystem disappear… and issues around sharing the SAN disappear. Further, elasticity becomes feasible. With an IMDB we can add and delete nodes and re-distribute data without disk I/O… in fact it is likely that a column store IMDB could move column-compressed data without re-building rows. IMDB changes the game by removing the expense associated with disk I/O.

There is evidence emerging  that IMDB technology is going to change the playing field (see here).

Right now there are only a few IMDB products ready in the market:

  • TimeTen: which is not shared-nothing scalable, nor columnar, but could be the platform for a very small, 400GB or less (see here), cloud-based EDW;
  • SQLFire: which is semi-shared-nothing scalable (no joins across shards), not columnar, but could be the platform for a larger, maybe 5TB, specialized EDW;
  • ParAccel: which is shared-nothing scalable, columnar, but not fully an IMDB… but could be (see C. Monash here); or
  • SAP HANA: which is shared-nothing, IMDB, columnar and scalable to 100TB (see here).

So it is early… but soon enough we should see real EDWs in the cloud and likely on Amazon EC2, based on in-memory database technologies.

Greenplum and Teradata: Simliar Architecture, Different Strategies

Hardware systems, servers and network fabric, provide the foundation upon which all shared-nothing database management systems rest. Hardware systems are a major contributor to the overall price/performance and total cost of ownership for a data warehouse platform. This blog considers the hardware strategies of EMC/Greenplum: applying the idea of using common, off-the-shelf (COTS) components to build a competitive foundation; and Teradata: developing a proprietary hardware system by tightly integrating components.

Strategies

The Teradata hardware strategy is simple to describe. They expend R&D dollars to couple low-level technologies into a tightly integrated system. Their servers are custom-designed within a set of guidelines that allows both the LINUX and Microsoft Windows operating systems to execute there. Their network fabric is highly proprietary, using cycles within the fabric to offload sort/merge data processing from the server CPU.

In other words, Teradata believes that the time and effort required to engineer an integrated proprietary offering will improve the performance of their offering enough to offset the cost.

EMC and Greenplum have taken a different approach. They have elected a strategy that leverages off-the-shelf servers offered by hardware vendors like Dell or HP, and network switches from vendors like Brocade, Arista, and Cisco. They have elected to expend few dollars on hardware design and development and to leverage the R&D investments made by these other vendors. In other words, Greenplum believes that the advantages in price and performance provided by using off-the-shelf hardware provides a sustainable advantage.

Price

The lower costs associated with Greenplum’s strategy clearly provide an advantage. Greenplum does not have to expend to design and manufacture custom hardware. The manufacturing costs may not be significant, but the staff costs required by the Teradata strategy must affect the price. Clearly the Greenplum strategy provides an advantage on the price side.

Performance

The Teradata strategy has to be about performance… so lets speculate:

  • How much of a performance increase might their integration provide on the server-side?
  • How much of a performance increase might their integration provide on the network side?

In the days before there was a microprocessor based enterprise server market, Teradata could gain substantially here. Microprocessors were built for personal computing and not designed for the high-availability and high-performance requirements in an enterprise. Teradata had much to gain from building rather than buying server.

But today, there is little to gain from a highly customized design. The requirement to run standard LINUX and Windows operating systems limit their ability to innovate and the resulting servers have to be very similar to those built for off-the-shelf enterprise servers. There is little or no performance advantage here.

On the network side, there once was a distinct advantage to Teradata’s ByNet. It was both faster than available off-the shelf switches and it offloaded cycles from the under-powered CPU. Today, however, there are plenty of cheap, fast switches… so the speed advantage has disappeared. Worse still, the introduction of multi-core CPUs have eliminated the advantage of the in-the-switch sort/merge that makes ByNet unique. CPU is inexpensive these days.

The bottom line: it is unclear if the Teradata hardware strategy affords them a performance advantage.

Cost of Ownership

An argument could be made that supporting COTS hardware is inherently less expensive than supporting a Teradata cluster. But there is a more substantial savings that is clear.

Every 2-3 years, as newer Teradata technology obsoletes your currently installed cluster the value of the current hardware goes to zero and the cost of ownership goes up significantly. The costs of this are especially high when you are required to add several nodes to accommodate growth as Teradata refreshes their technology. You may have to buy servers that are already obsolete.

With Greenplum, your current cluster is built from general-purpose servers that are re-purposed with ease. In fact, since the nodes in a Greenplum cluster are usually high-end servers, customers often cycle new technology into their data warehouse and cycle the old servers out into their server farm. The result is a higher performance warehouse and full use of all of the server technology.

A Final Word

The words “proprietary hardware” are sometimes thrown around as an insult. But Teradata’s proprietary approach is based on the belief that a tightly integrated configuration adds benefit to offset the costs. Greenplum believes that today the enterprise server and the network switch vendors have matured their products to the point where off-the-shelf technology can match or exceed the performance of custom hardware… at a significantly reduced cost. You may have an opinion or you may wait to see how the benchmarks, the proof-of-concepts, and the market decide… but its interesting to understand the differing approaches.