In my blog yesterday (Part 1) I suggested that we could evaluate RDBMS-Hadoop integration architecture using three criteria:
- How parallel are the pipes to move data between the RDBMS and the parallel file system;
- Is there intelligence to push down predicates; and
- Is there more intelligence to push down joins and other relational operators?
But Exadata is a split RDBMS with a parallel file system backing it… how does it measure up by these criteria?
There are effective parallel pipes between the Oracle RAC RDBMS and the Exadata Storage Subsystem… so Exadata passes the first test. Further, Exadata is smart about pushing scan and projection both down to the Storage layer.
Unfortunately there is a fairly severe imbalance between the number of nodes on the RAC side and the number of nodes on the Storage side and this creates a bottleneck. We cannot give Exadata full marks here… but as far as parallel pipes goes it stacks up pretty well.
The ability to push down predicates goes a long way towards solving this as the predicate push-down reduces the amount of data that has to move over the bottleneck. But in every data warehouse there will be queries that return lots of rows from the early execution steps… and Exadata cannot join data in the Storage Subsystem so it tries to pull data up sparingly and push down semi-joins whenever possible… it just cannot be done in every case (Note: in Exadata POCs Oracle will try to ensure that no queries are included that pull lots of data up to the RAC layer… and competitors will try to include queries that expose this weakness…).
So… Oracle also includes some intelligence to push some data down to reduce data movement. There is no way to choose to move data from the RAC layer to the Storage Subsystem and execute the query there… the Storage Subsystem can only scan and project… so again we cannot give Exadata full marks… but it is pretty smart as you will see when we start looking at alternative implementations.
Finally, Exadata cannot effectively split a single query plan across both layers… so no marks at all here.
So Exadata is pretty good… but it has weak spots that will be severe for an important set of DW queries in any implementation.
Now, on to the Hadoop implementations… to Part 3