Cloud Computing and Data Warehousing: Part 1 – The Architectural Issues

My apologies… I was playing with the iPad version of WordPress and accidentally published a very rough outline/first draft of this post. I immediately un-published it… but not before subscribers were notified that there was a new post.

I wonder whether data warehousing is suited to operate in the cloud. The question was prompted by Paraccel’s venture to deploy on the Amazon EC2 cloud infrastructure. Let’s work through the architectural implications…

Here are the assumptions I’ll take into this exploration:

  1. A shared-nothing architecture is required to scale.
  2. Cloud infrastructure is cost-effective when the hardware it replaces is under-utilized and workloads can be consolidated to achieve full utilization… and not so cost-effective when that hardware is already highly utilized. This is because applications can easily share under-utilized resources in the Cloud.
  3. Cloud infrastructure is justified when the workload is inconsistent and either CPU or storage requirements fluctuate widely over the business cycle. This is because a Cloud is elastic and can easily flex as the requirements fluctuate. Cloud computing may not be well suited to static workload requirements.

You can probably see where I’m going with this from the assumptions.

In the end I’ll suggest that there is a database architecture that is suited to warehousing and cloud computing… but let me build to that.

Before I start let me also be clear that I am talking about the database infrastructure… not the application/BI infrastructure required for data warehousing. The BI and ETL components are perfectly suited to cloud computing… they reflect a workload that, in general, runs on under-utilized hardware with BI running during the day and ETL running at night. I have suggested this to my current employer… but alas, I am neither King nor a member of Court.

So in Part 1 let me discuss my first two assumptions and the implications… In Part 2 I’ll discuss data warehousing and elasticity… In Part 3 I’ll consider the Paraccel/Amazon collaboration and in Part 4 I’ll wrap up and consider several new things coming that may change the equations.
—————-
I’ll not work too hard to justify my first assumption… I think that it is well-understood that a shared-nothing architecture provides the best possible approach to scale out. Google and others use this approach to scale to hundreds of petabytes of data and Teradata, Greenplum, Netezza, Paraccel, SAP HANA, and others use it in the data warehouse space. Exadata uses a hybrid approach that scales I/O in a shared-nothing-like storage subsystem… but fails to scale as it passes data to the RAC layer (see Kevin Closson here on the subject).

But the implications are significant for our cloud discussion. First, cloud infrastructure is designed to support general client-server or web-server based commercial computing requirements. A shared-nothing database cluster is a specialized infrastructure optimized for database processing. Implementing the specialized workload on the generalized infrastructure is possible, but sub-optimal. Next, cloud computing requires, more or less, a shared storage subsystem. A shared-nothing architecture shares nothing. Implementing a shared-nothing database on a shared storage subsystem is possible, but sub-optimal.

I believe that the second assumption is also pretty straightforward. The primary rationale for cloud computing comes from the recognition that many data centers deployed applications on servers that were not fully utilized. By virtualizing the hardware on a cloud platform the data center could better service the applications with fewer hardware resources and therefore less cost.

So… in order for cloud computing to be a perfect fit we need to observe a data warehouse database workload running on under-utilized hardware infrastructure… You might ask yourself… are there under-utilized hardware resources upon which my EDW is built? In most cases I believe that the answer to this question will be “no”. Almost every EDW I’ve seen is over-burdened… stretched… with users demanding more and more resources… more data, more users, more queries, and deeper queries drive the resource requirements up exponentially. The database is swamped all day with queries and swamped all night by ETL and reporting tasks.

So let’s end this post by concluding that there is a problematic architectural mismatch between a shared cloud and a shared-nothing implementation… and that if your warehouse database platform is highly utilized, there may be little benefit from implementing a warehouse in the cloud.

See Part 2 here

More on Exalytics Capacity…

I found myself wondering where the rule-of-thumb for Exalytics that suggests TimesTen can use 800GB of a 1TB memory space… and requires 400GB of that space for work tables, leaving room for 400GB of user data… came from (it is quoted everywhere… here is an example… see question #13).

Sure enough, this rule has been around for a while in the TimesTen literature… in fact it predates Exalytics (see here).

Why is this important? The rule of thumb presumably reflects TimesTen’s traditional OLTP workload. The workspace per query for a TPC-A transaction is very small, and the memory is held by a TPC-A transaction for a very short time. But the workspace required by a TPC-H query is at least 10X the space required by a TPC-A query, and the duration of a TPC-H query is at least 10X the duration of a TPC-A query. The result is at least 100X more pressure on memory utilization.

So… I suspect that the 600GB of user data I calculated here may be off by more than a little. Maybe Exalytics can support 300GB of user data or 100GB of user data or maybe 60GB?
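To make that guess a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes the 800GB TimesTen allocation and the 5X compression figure from the “How much user data fits?” post below, and it treats the work space as a simple multiple of the uncompressed user data… the multiples are purely illustrative, not measured.

```python
# Back-of-the-envelope: how much user data fits in Exalytics as the
# work-space requirement grows?  Assumptions (illustrative only):
#   - 800GB of the 1TB is available to TimesTen (the rule-of-thumb)
#   - user data compresses ~5X (Hybrid Columnar Compression)
#   - the workload needs r GB of uncompressed work space per GB of user data
# Capacity x must satisfy:  x/COMPRESSION + r*x <= TIMESTEN_GB

TIMESTEN_GB = 800
COMPRESSION = 5

def user_data_capacity_gb(work_ratio: float) -> float:
    """Max user data (GB) such that compressed data plus work space fits."""
    return TIMESTEN_GB / (1.0 / COMPRESSION + work_ratio)

for r in (1, 2, 5, 10):
    print(f"work space = {r}x user data -> ~{user_data_capacity_gb(r):.0f}GB of user data")

# work space = 1x  -> ~667GB  (the optimistic figure)
# work space = 2x  -> ~364GB
# work space = 5x  -> ~154GB
# work space = 10x -> ~78GB   (in the neighborhood of the 60GB guess above)
```

The only point is that capacity falls off quickly as the analytic work space grows… which is the pressure I described above.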

Note that this is not bad… all of this pressure on memory is still moved to Exalytics from the Exadata RAC subsystem… where memory is dear.

As a side note… it is always important to remember that the pressure on memory is the amount of memory utilized times the duration of the utilization. This is why the data flow architecture used in modern databases like Greenplum is effective. Greenplum uses more memory per transaction but it holds the memory for less time by (almost) never writing intermediate results to disk. This is different from older database architectures like Teradata and Oracle, which use disk to store intermediate results… lowering the overall amount of memory required but increasing the duration of the query. More on this here.
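As a crude illustration of that side note… with numbers that are entirely made up… the pressure a query exerts is the memory it holds times the time it holds it, so a pipelined plan that uses more memory for a shorter time can still exert less total pressure than a plan that spills intermediate results to disk:

```python
# Memory "pressure" ~ GB held * seconds held.
# The figures below are invented purely to illustrate the trade-off;
# they are not measurements of Greenplum, Teradata, or Oracle.

def pressure_gb_seconds(memory_gb: float, duration_s: float) -> float:
    """Pressure on memory: how much is held, for how long."""
    return memory_gb * duration_s

# A data-flow (pipelined) plan: more memory, shorter duration, because
# intermediate results stream between operators instead of hitting disk.
pipelined = pressure_gb_seconds(memory_gb=40, duration_s=30)

# A spill-to-disk plan: less memory per step, but the query runs longer
# because intermediate results make a round trip to disk.
spilled = pressure_gb_seconds(memory_gb=10, duration_s=180)

print(f"pipelined: {pipelined:,.0f} GB-seconds")  # 1,200
print(f"spilled:   {spilled:,.0f} GB-seconds")    # 1,800
```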

Co-processing and Exadata

In my first blog (here) I discussed the implications of using co-processors to offload CPU. The point was that with multi-core processors it made more sense to add generalized processing hardware that could be applied to all parts of the query process than to add specialized processors that dealt with only part of the problem.

Kevin Closson has produced two videos that critically evaluate the architecture of Exadata and I strongly suggest that you view them here before you go on with this post… They are enlightening, irreverent, and make the long post I’ve been drafting on Exadata lightweight and unnecessary.

If you have seen Kevin’s post you understand that Exadata is asymmetric and unbalanced. But his post extends and generalizes my discussion of co-processing in a nice way. Co-processing is asymmetric by definition: the co-processor sits idle once it has finished its part of the problem.

In fact, Oracle has approximately mirrored the Netezza architecture with Exadata, but used commercial processors instead of FPGAs to offload I/O and predicate processing. The result is the same in both cases… underutilized processing capability. The difference is that Netezza wastes some power on relatively inexpensive FPGA processors, while Exadata wastes general-purpose, expensive CPU resources that might actually be applied usefully elsewhere. And Netezza splits the processing within a shared-nothing architecture, while Exadata mixes architectures, adding to the inefficiency.

More on Exalytics: How much user data fits?

Sorry… this is a little geeky…

The news and blogs on Exalytics tend to say that Exalytics is an in-memory implementation with 1TB of memory. They then mention, often in the same breath, that the TimesTen product, which is the foundation for Exalytics, now supports Hybrid Columnar Compression, which might compress your data 5X or more. This leaves the reader to conclude that an Exalytics Server can support 5TB of user data. This is not the case.

If you read the documentation (here is a summary…), a 1TB Exalytics server can allocate 800GB to TimesTen, of which half may be allocated to store user data. The remainder is work space… so 400GB, uncompressed, is available for user data. You might now conclude that with 5X compression 2TB of compressed user data is supported. But I am not so sure…

In Exadata, Hybrid Columnar Compression is a feature of the Storage Servers. It is unknown to the RAC layer. The compression allows the Storage Servers to retrieve 5X the data with each read, significantly improving the I/O performance of the subsystem. But the data has to be decompressed when it is shipped to the RAC layer.

I expect that the same architecture is implemented in TimesTen… The data is stored in-memory compressed… but decompressed when it moves to the work storage. What does this mean?

If, in a TimesTen implementation without Hybrid Columnar Compression, 400GB of work space in memory is required to support a “normal” query workload against 400GB of user data then we can extrapolate the benefits of 5X compressed data as follows:

  • x GB of user data, compressed, uses x/5 GB of memory plus x GB of work space in memory… all of which must fit into 800GB

This resolves to x = 667GB… a nice boost for sure… with some CPU penalty for decompressing.
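For the record, the arithmetic behind that figure is just the constraint in the bullet above, solved for x:

```latex
\frac{x}{5} + x \le 800\,\text{GB}
\;\Longrightarrow\;
\frac{6x}{5} \le 800\,\text{GB}
\;\Longrightarrow\;
x \le \frac{5 \times 800}{6}\,\text{GB} \approx 667\,\text{GB}
```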

So do not jump to the conclusion that Hybrid Columnar Compression in TimesTen of 5X allows you to put 5TB of user data on a 1TB Exalytics box… or even that it allows you to load 2TB into the 400GB user memory… the real number may be under 1TB.

Exalytics vs. HANA: What are they thinking?

I’ve been trying to sort through the noise around Exalytics and see if there are any conclusions to be drawn from the architecture. But this post is more about the noise. The vast majority of the articles I’ve read, posted by industry analysts, suggest that Exalytics is Oracle’s answer to SAP’s HANA. See:

But I do not see it.

Exalytics is a smart cache that holds a redundant copy of aggregated data in memory to offload aggregate queries from your data warehouse or mart. The system is a shared-memory implementation that does not scale out as the size of the aggregates increases. It does scale up by daisy-chaining Exalytics boxes to store more aggregates. It is a read-only system that requires another DBMS as the source of the aggregated data. Exalytics provides a performance boost for Oracle, including for Exadata (remember, Exadata performs aggregation in the RAC layer… when RAC is swamped, Exalytics can offload some processing).

HANA is a fully functional in-memory, shared-nothing, columnar DBMS. It does not store a copy of the data… it stores the data. It can be updated. HANA replaces Oracle… it does not speed it up.

I’ll post more on Exalytics… and on HANA… but there is no Exalytics vs. HANA competition ahead. There will be no Exalytics vs. HANA POCs. They are completely different technologies solving different problems with the only similarity being that they both leverage the decreasing costs of RAM to eliminate the expense of I/O to disk or SSD devices. Don’t let the common phrase “in-memory” confuse you.