Teradata has recently announced a very complete Teradata database-to-Hadoop integration. Is this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system.
The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles.
There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).
I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider this…
Before the possibility of a Hadoop annex to Teradata, Teradata customers had no choice but to store cold, old, data in the Teradata database. If, on occasion, you wanted to perform year by year comparisons over ten years of data then you needed to keep ten years of data in the database at a rough cost of $50K/TB (see here) … even if these queries were rarely executed and were not expected to run against a high performance service level requirement. If you wanted to perform some sophisticated predictive analysis against this data it had to be online. If fact, the Teradata mantra… one which I wholeheartedly agree with… suggests that you really should keep the details online forever as the business will almost always find a way to glean value from this history.
This mantra is the basis of what the Hadoop vendors call a data lake. A data warehouse expert would quickly recognize a data lake as a staging area for un-scrubbed detailed data… with the added benefit that a Hadoop-based data lake can store and process data at a $1K/TB price point… and this makes it cost-effective to persist the staged data online forever.
So what does this mean to a Teradata EDW owner? Teradata has published numbers (here) suggesting that 92% of the queries in an EDW only touch 20% of the data. I would suggest that there is some sort of similar ratio that holds for 90% of the remaining queries… they may touch only another 40% of the data. This suggests that the 40% of the data remaining is online to service less than 1% of the queries… and I suggest that these queries can be effectively serviced from the $1K/TB Hadoop annex.
In other words, almost every Teradata shop can immediately benefit from Teradata’s new product announcements by moving 40% of their Teradata database data to Hadoop. Such a move would free Teradata disk space and likely take pressure off to upgrade the cluster. Further, when an upgrade is required, users can reduce the disk footprint of the Teradata database side of the system; add a Hadoop annex, and significantly reduce the TCO of the overall configuration.
Some time back I suggested that Teradata would be squeezed by Hadoop (here and here). To their credit Teradata is going to try and mitigate the squeeze. But the economics remain… and Teradata customers should seriously consider how to leverage the low $/TB of Teradata’s Hadoop offering to reduce costs. Data needs to reside in the lowest cost infrastructure that still provides the required level of service… and the Teradata Hadoop integration provides an opportunity to leverage a new, low-cost, infrastructure.