Teradata has recently announced a very complete Teradata database-to-Hadoop integration. In this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system.
The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles.
There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).
I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider this…
Before the possibility of a Hadoop annex to Teradata, Teradata customers had no choice but to store cold, old data in the Teradata database. If, on occasion, you wanted to perform year-by-year comparisons over ten years of data then you needed to keep ten years of data in the database at a rough cost of $50K/TB (see here) … even if these queries were rarely executed and were not expected to run against a high performance service level requirement. If you wanted to perform some sophisticated predictive analysis against this data it had to be online. In fact, the Teradata mantra… one which I wholeheartedly agree with… suggests that you really should keep the details online forever as the business will almost always find a way to glean value from this history.
This mantra is the basis of what the Hadoop vendors call a data lake. A data warehouse expert would quickly recognize a data lake as a staging area for un-scrubbed detailed data… with the added benefit that a Hadoop-based data lake can store and process data at a $1K/TB price point… and this makes it cost-effective to persist the staged data online forever.
So what does this mean to a Teradata EDW owner? Teradata has published numbers (here) suggesting that 92% of the queries in an EDW only touch 20% of the data. I would suggest that a similar ratio holds for the remaining 8% of queries… 90% of them may touch only another 40% of the data. This suggests that the final 40% of the data is online to service less than 1% of the queries (10% of 8% is 0.8%)… and I suggest that these queries can be effectively serviced from the $1K/TB Hadoop annex.
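The query arithmetic in the paragraph above can be checked with a minimal sketch. The 92% figure is the published Teradata number cited in the post; the 90% figure is only the author's suggestion, and the function name `query_shares` is hypothetical.

```python
def query_shares(hot_share, warm_share_of_rest):
    """Split queries into hot, warm, and cold tiers.

    hot_share: fraction of all queries served by the hottest data.
    warm_share_of_rest: fraction of the REMAINING queries served by the
        warm tier (an assumed value, not a measurement).
    """
    rest = 1.0 - hot_share             # queries not served by hot data
    warm = rest * warm_share_of_rest   # served by the next slice of data
    cold = rest - warm                 # left over for the coldest data
    return hot_share, warm, cold


hot, warm, cold = query_shares(hot_share=0.92, warm_share_of_rest=0.90)
print(f"hot: {hot:.1%}, warm: {warm:.1%}, cold: {cold:.1%}")
# prints: hot: 92.0%, warm: 7.2%, cold: 0.8%
```

Changing `warm_share_of_rest` shows how sensitive the "less than 1%" conclusion is to that guess: at 0.80, for example, the cold tier would serve 1.6% of queries.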
In other words, almost every Teradata shop can immediately benefit from Teradata’s new product announcements by moving 40% of their Teradata database data to Hadoop. Such a move would free Teradata disk space and likely relieve the pressure to upgrade the cluster. Further, when an upgrade is required, users can reduce the disk footprint of the Teradata database side of the system, add a Hadoop annex, and significantly reduce the TCO of the overall configuration.
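To put a rough number on that saving, here is a back-of-the-envelope sketch. It uses the $50K/TB and $1K/TB price points quoted in this post and assumes a hypothetical 100 TB system; the function name `storage_cost` is made up for illustration. It ignores software, support, and migration costs, so treat the result as an illustration rather than a quote.

```python
def storage_cost(total_tb, cold_fraction,
                 edw_cost_per_tb=50_000, hadoop_cost_per_tb=1_000):
    """Return (cost_before, cost_after) in dollars.

    cost_before: all data kept in the Teradata database.
    cost_after: the cold fraction moved to the Hadoop annex.
    The per-TB prices are the rough figures cited in the post.
    """
    cold_tb = total_tb * cold_fraction
    before = total_tb * edw_cost_per_tb
    after = (total_tb - cold_tb) * edw_cost_per_tb + cold_tb * hadoop_cost_per_tb
    return before, after


# Hypothetical 100 TB system with 40% of the data moved to Hadoop.
before, after = storage_cost(total_tb=100, cold_fraction=0.40)
saving = before - after
print(f"before: ${before:,.0f}  after: ${after:,.0f}  "
      f"saving: ${saving:,.0f} ({saving / before:.1%})")
# prints: before: $5,000,000  after: $3,040,000  saving: $1,960,000 (39.2%)
```

Because the Hadoop price point is so much lower, the saving is nearly proportional to the fraction of data moved; the per-TB arguments can be changed to test other price assumptions.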
Some time back I suggested that Teradata would be squeezed by Hadoop (here and here). To their credit Teradata is going to try and mitigate the squeeze. But the economics remain… and Teradata customers should seriously consider how to leverage the low $/TB of Teradata’s Hadoop offering to reduce costs. Data needs to reside in the lowest cost infrastructure that still provides the required level of service… and the Teradata Hadoop integration provides an opportunity to leverage a new, low-cost, infrastructure.
13 thoughts on “Using Teradata’s Appliance for Hadoop to Reduce TCO”
Good assessment Rob. Let’s expand on your observation about cold data.
Teradata Virtual Storage software shows that 92% of the queries only touch 20% of the data. Those measurements are taken over a 1 week period on customer systems as we seek to identify hot versus warm versus cold data. 92% is an average but not all customers are average (some have 30-40% hot data). The TVS software is based on data popularity. As users request data, it heats up. If they don’t it cools off. The hottest data then moves into SSD or memory for performance. But users need different data each day, each week, and especially month end and quarter end. Arctic cold data heats up quickly as the Finance organization compiles quarterly revenue numbers. For a few weeks, they are accessing year over year data plus a couple years prior, so cold data heats up. Marketing is the wild card doing complex predictive analytics –even simple segmentation– that can heat up ice-cold 4-5 year old data for a couple weeks. And then there is audit…
Yes, some cold data will move to Hadoop. Or it will move to the Teradata Integrated Big Data appliance which uses the same high density 4TB disk drives as Hadoop. Hadoop style storage plus the Teradata Database costs substantially less than $10K/TB (so much less I’d rather not say).
Teradata is making progress in balancing data across platforms including Hadoop. Will 20% of the cold data move off the big data warehouse machines? Let’s change the discussion to data age: the central data warehouse will evolve to hold data that’s up to 5 years old on average. Cold data machines will hold data that’s older than 5. In my experience so far, customers tend to offload data from the high end data warehouse to a lower cost machine and then –SWOOSH– the users fill up the vacant capacity on the high end system with all the projects they had backlogged. They save money on cold data, and expand applications on the central data warehouse too. There are exceptions and unique use cases, but that’s the way it’s playing out so far.
The “five year old data” on the EDW suggestion feels arbitrary to me, Dan…. your analysis that suggests 20% of the total data is “hot” is more measured.
I think that it is also more likely that the growth in data volumes on an existing and mature EDW platform is more likely to come from the retention of historical data than from the addition of new data… the SWOOSH you hear. This historical data should swoosh to Hadoop. Further, new data from the Internet of Things should land in Hadoop.
Maybe if you free up space on the Teradata system there will be a swoosh and it will fill… but as the Hadoop side becomes more capable… and it is improving at a very rapid rate… economics will allow more data to move to Hadoop without sacrificing service and the swoosh sound could be the squeeze I’ve suggested.
Yes, the five year boundary is somewhat arbitrary. It works as an average of customers we surveyed but it’s not for everyone. Some customers, particularly those with audit and SOX controls, are much more likely to keep cold data 7 to 8 years. Telcos, on the other hand, have way too many CDR records to want to keep them beyond five years. It depends. Teradata knows exactly how many customers are offloading what kind of data to Hadoop and to our Teradata Integrated Big Data Appliance (database).
I think you malign the basics of data warehousing when you suggest the only EDW data growth comes from historical data. Perhaps that was not your intent. By definition, the data warehouse is a forever project, adding new applications and new data model components year after year. We heard this from Gartner loud and clear. Growth in an EDW comes from new data types and new applications. One data model, many applications. We have one site with 186 applications on a single data warehouse — all of it tightly integrated. But growth is not measured only in storage capacity.
Most of the Internet of Things data is from sensors and embedded computers. Teradata manufacturing, utility, retail, and healthcare customers have long been accumulating sensor data and IoT data in their data warehouse for analysis. The perception that Hadoop is cheap and therefore the only logical place to put IoT data is false. As it turns out, the vast majority of IoT/sensor data can be compressed 20-to-1, 50-to-1, even 100-to-1 in size. This is the signal to noise ratio. Sensor data is highly repetitive so dedup, compression, columnar, and even temporal techniques shrink the data so small that storage cost is not the main issue: analytic ROI is the issue. That said, there are times and good reasons to put IoT data in Hadoop. Again, it depends.
Costs are indeed a critical decision point but only one of many. Hadoop today is not so easy to query or apply analytics to. The ease of use, interactive BI tools accessing a database, security strength, workload management, and high performance of an EDW must be weighed against ‘economic’ purchase price. http://bit.ly/1p9cZe8 I agree that Hadoop costs are attractive to some, even mandatory at times. That’s why Teradata sells a Hadoop Appliance. Other buyers want performance at the business user’s desktop without a Java programmer go-between. Furthermore, the data warehouse is much more than the simple star schema data mart Hadoop aspires to be within the next 3-5 years. I’ve spoken to 4 major analyst houses in the last 4 weeks about SQL-on-Hadoop and the consensus is “not improving at a rapid rate”. Teradata has survived and thrived for many years versus data mart products. Hadoop is becoming a data mart, and a highly interesting one, which I support. It’s marvelous, but it’s not a data warehouse.
Good chatting with you Rob. It’s always a pleasure. You always challenge me.
And thanks for commenting, Dan… I appreciate it every time…
Here’s a more thoughtful response, Dan.
The 5 year rule does not sync with your 20% of the data is hot measurement. The hot, current, CDRs might need to be in Teradata only. They may keep the data online for 5 years or longer… but my point is that the old cold records can move to Hadoop and shrink the Teradata database footprint. The fact that they now tend to reside in Teradata is because there was no alternative… and you have now given them a great alternative with your Hadoop appliance.
I did not mean to imply that all growth in an EDW comes from the longer retention of historical data… only that a significant percentage of the growth comes from this. IoT notwithstanding, there are only so many large fact tables in any enterprise.
Of course there is current IoT data in Teradata… Hadoop is new. But the vast majority of this data will move to or start in Hadoop going forward, methinks.
Finally, I am not posting on Hadoop as a stand-alone system… nor on Hadoop as an EDW… I am suggesting Hadoop as an annex to an EDW with all queries coming through a smaller Teradata EDW front-end. To the end user, application program, BI, or analytic tool it all looks like Teradata… Hadoop is a transparent, inexpensive, massively parallel, fault tolerant, inexpensive (yep… it’s there twice on purpose… my argument is based on $/TB) cold storage behind the Teradata database. It’s a good thing, methinks.
Offloading old/cold data makes perfect sense – the old concept of data temperature. Getting folks to actually do it has been the challenge 🙁
I was advised by Teradata a few months ago that the data in HDFS is brought to the Teradata nodes for processing when referenced in a query via SQLH. This places extra CPU demand on the existing Teradata nodes.
Like Dan says, any freed up space is quickly used for other purposes once old/cold data is offloaded.
The net effect is a Teradata EDW that still lacks spare CPU cycles and storage.
Paul, “Teradata EDW still lacks spare CPU cycles and storage” sounds like it’s a problem. Most customers would say they are utilizing the capacity to the fullest. Teradata systems run between 90-95% utilization so you get your money’s worth. We have one customer who insists the machine run at 100% all the time! As for storage, it is the least expensive component in the hardware. Yes, SQL-H is a federated query that costs some CPU time to join Hadoop data to EDW data. You should always leverage Hive to do as much filtering and SQL work as possible.
If you have a current Teradata workload and you add Hadoop data then, even though there is push down to Hive, you are adding workload to the Teradata system.
What I am suggesting is that if you have a current Teradata workload and you move Teradata data to Hadoop then you offload workload to Hive.
This offloaded workload will run at a price point that may be 1/50th the price of the same run on the Teradata system.
I think it’s fair to use price/performance as a metric instead of price alone. People may assume that the offloaded workload will run as fast on Hadoop as on Teradata.
Your point is well taken, Wei Tang… That is why I would not recommend completely replacing the Teradata database. But if, as Teradata suggests, Hadoop is 1/50th the price it might beat Teradata DB even in price/performance.
This is why I suggest that it is about the lowest price that meets your service level requirements. Even if Hadoop is better in price/performance the performance will not be acceptable for many workloads and the Teradata DB might be required. But cold data that supports infrequent reporting against a long history of data may be serviced by Hadoop at a lower price.
Thanks for the comment…
Thanks Rob for responding. I’ve been enjoying the insightful posts on your blog. Customers will be the ultimate judge in buying the products and services that suit them best. And the needs and requirements will vary quite a lot. What they want to see is more competition in the marketplace to drive prices down (to get the same job done) and innovation & technology advancement to take them to new heights.