Open Source is Not a Market…

This post is more about the technology business than about technology… but it may be relevant as you try to sort out winners and losers… and this sort of sorting is important when you consider whether new companies will, or will not, succeed in the long run.

To make my point let us do a little thought experiment. Imagine a company doing $100M in revenue with a commercial, not open source, database product. They win the $100M in revenue by competing with Oracle, IBM, Microsoft, Teradata, et cetera… and maybe competing a little here and there with some open source products.

Let’s assume that they make 50% of their revenue from services and support, and that their average sale is $2M… so they close 25 deals a year competing in this market. Finally, let’s assume that they break even each year and spend 20% of their revenues on R&D. The industry average for support services is 20% of the sale price… so with each $2M sale they add $400K in recurring revenue.

They are considering making their product open source. Let’s assume that they make the base product free… and provide some value-added offering that costs $200K for the average buyer. Further, they offer a support package for the same $400K/year that customers currently pay. How does the math work out?

Let’s baseline against the 25 deals/year…

If they make 25 sales and every buyer buys both the support package and the value-added offer the average sale drops from $2M to $200K, sales revenue drops from $50M to $5M, the annual revenue drops from $100M to $55M… and the company loses $45M. So… starting off they need to make 225 more sales just to break even. But now it gets complicated… if they sell 5 extra deals then in the next year they earn $2M extra in support fees… so if they sell 113 extra deals in year one then in year two they have made up the entire $45M difference and they are back to break-even going forward. If it takes them 2 years to get the extra recurring revenue then they lose money in year two… but are back to break-even in year three.
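
To make the arithmetic easy to check, here is a small back-of-the-envelope script. Every number in it is an assumption stated in this thought experiment, not a real company’s figure.

```python
# Back-of-the-envelope model of the thought experiment above. Every number
# here is an assumption from this post, not a real company's figure.
import math

LICENSE_DEAL = 2_000_000      # average commercial sale
VALUE_ADD_DEAL = 200_000      # value-added offer after going open source
SUPPORT_RATE = 0.20           # industry-average support: 20% of the sale price
SUPPORT_BASE = 50_000_000     # existing services/support revenue (50% of $100M)
BASELINE_DEALS = 25           # deals closed per year today

commercial = BASELINE_DEALS * LICENSE_DEAL + SUPPORT_BASE        # $100M
open_source = BASELINE_DEALS * VALUE_ADD_DEAL + SUPPORT_BASE     # $55M
shortfall = commercial - open_source                             # $45M
print(f"Year-one shortfall: ${shortfall/1e6:.0f}M")

# Extra deals needed to close the gap on value-add revenue alone in year one...
print("Extra deals to break even in year one:", shortfall // VALUE_ADD_DEAL)  # 225

# ...or on the recurring support those extra deals add in year two.
extra_support = LICENSE_DEAL * SUPPORT_RATE                      # $400K per deal
print("Extra deals whose support closes the gap in year two:",
      math.ceil(shortfall / extra_support))                      # 113
```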

From here it gets even more complicated. The mythical company above sells the baseline of 25 new copies a year with an enterprise sales force that is expensive. There is no way that the same sales force that services 25 sales/year could service 100+ extra deals. So either costs go up or the 100+ extra customers become unattainable. We might hope that the cost of sales will drop way off as the sales price moves to $200K. This is not unreasonable… but certainly not guaranteed. Further, if you are one of the existing sales staff then you have to sell 10X just to make the same commission. Finally, these numbers assume that every customer buys the value-add and gets enterprise-level support. Reality will be something less than this.

We might ask: is it even possible to sell 100+ more with the same product in the same market? Let us be clear that the market the database product plays in has not changed. Open Source is not a market. All we have done is reduced the sales price for the product with some hope that price is a significant driver in the market.

This is not meant as an academic exercise. Tomorrow we will consider how this thought experiment applies to Pivotal’s announcements last week… and to the future of Pivotal’s database assets (here).

A Modern Data Warehouse Architecture: Part 3 – Build an EDW Annex

In the first two posts of this series (here and here) I first suggested that Hadoop could be effectively used as the platform for staging and then suggested that a modern warehouse would have a federation layer that turned it into a logical data warehouse. Figure 3 depicts this extended architecture.

Figure 3. A Logical EDW

But if we have both Hadoop and a federation layer implemented… and we recognize the economics associated with moving data to Hadoop… Hadoop provides a 5X-50X price advantage over a commercial very large DBMS product… and we can move data from the expensive environment to the low-cost environment without impacting any applications… then we have the opportunity to move governed EDW data to Hadoop and place it into a Hadoop EDW Annex. Figure 4 shows this.

Figure 4. An EDW Annex

Now you might suggest that there is an impact… Hadoop will be significantly slower than a commercial EDW platform (for now…). But experienced EDW architects understand that in the classic architecture we had to co-locate data in a single database to join the data. So, we put all of the data, hot and cold, in our EDW even though the service levels required for queries that touch cold historical data did not justify the power and price of the EDW infrastructure. We had to co-locate the cold data… but it did not need the EDW’s performance. We knew, if only implicitly, that most EDW queries touch a small subset of the data. Following the ratio suggested by Teradata (see the reference here) that 90% of the queries touch only 20% of the data, we can imagine a system where 80% of the data resides in Hadoop to service 10% of the queries… and only that 10% experiences Hadoop performance.
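
The arithmetic behind that trade-off is easy to sketch. The 90/20 ratio comes from the Teradata reference above; the warehouse size and $/TB figures below are illustrative assumptions only.

```python
# Illustrative model of the EDW Annex economics. The 90/20 ratio is the
# Teradata figure cited above; the warehouse size and $/TB numbers are
# assumptions for illustration, not quotes.

warehouse_tb = 500              # hypothetical warehouse size
hot_fraction = 0.20             # data touched by 90% of the queries
edw_per_tb = 35_000             # assumed commercial EDW price point ($/TB)
hadoop_per_tb = 1_000           # assumed Hadoop price point ($/TB)

hot_tb = warehouse_tb * hot_fraction       # stays in the EDW
cold_tb = warehouse_tb - hot_tb            # moves to the Hadoop EDW Annex

all_edw_cost = warehouse_tb * edw_per_tb
annex_cost = hot_tb * edw_per_tb + cold_tb * hadoop_per_tb

print(f"All data in the EDW:         ${all_edw_cost/1e6:.1f}M")
print(f"Cold 80% in a Hadoop Annex:  ${annex_cost/1e6:.1f}M")
print("...and only ~10% of queries see Hadoop-level performance.")
```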

I suggested this approach for Teradata here… but an architecture with an EDW Annex to store cleansed governed historical data works for any expensive RDBMS that can federate with Hadoop: Exadata, Netezza, Teradata, or HANA.

This concludes this series… sort of. I’ll post soon to say more about how this architecture provides long-term strategic value. I think that these three concepts… Hadoop as an EDW staging area, federation and logical data warehousing, and Hadoop as an EDW Annex… provide the foundation for a modern EDW… and I imagine that over the next several years this will become the reference architecture most of us will build to.

A Modern Data Warehouse Architecture: Part 2 – Make it a Logical Data Warehouse

In the first post of this series (here) I suggested that Hadoop could be effectively used as the platform for staging. Figure 2 describes the result. In this post I will extend the architecture by adding a data federation feature and turning the entire picture into a logical data warehouse… and then we’ll consider the consequences.

Figure 2. Add a Data Lake

Figure 3 shows this extension… but let’s quickly review what a federation fabric provides (for more detail on this please see the series that starts here).

Figure 3. Add a Federation Layer

First, the fabric allows any of the tables and files in the picture to be registered as “virtual” tables. This includes tables in the EDWs and in any marts, in one or more Hadoop systems, in the sandbox, and even in the source systems. From here the fabric is viewed by any program in the BI, Analytics, Apps layer as a single relational database composed of all of the registered tables. This fabric would consume ODBC and JDBC queries at a minimum and provide some level of data translation, function translation, and query translation to allow all of the virtual tables to be queried through the single SQL dialect offered by the fabric. Finally, the fabric would provide some measure of optimization to reduce the overhead of accessing these distributed systems. It is this optimization that is the main topic of the series I referenced above.
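
To make this concrete, here is a minimal sketch of what a BI or analytics program sees once a fabric is in place: one ODBC endpoint and one SQL dialect, with the virtual tables hiding where the data actually lives. The DSN and table names are hypothetical, and no particular fabric product is assumed.

```python
# Minimal sketch of the consumer's view of a federation fabric. The DSN and
# the virtual table names are hypothetical; any fabric that registers virtual
# tables and speaks ODBC would look roughly like this to the application.
import pyodbc

# A single connection to the fabric stands in for the whole logical EDW.
conn = pyodbc.connect("DSN=FEDERATION_FABRIC")
cursor = conn.cursor()

# One SQL dialect joins an EDW-resident fact table to a Hadoop-resident
# sandbox table; the fabric translates the query and pushes work to each side.
cursor.execute("""
    SELECT f.customer_id, SUM(f.sales_amt) AS sales, s.web_visits
    FROM   edw.sales_fact      AS f   -- virtual table backed by the EDW
    JOIN   sandbox.clickstream AS s   -- virtual table backed by Hadoop
           ON s.customer_id = f.customer_id
    GROUP  BY f.customer_id, s.web_visits
""")

for row in cursor.fetchall():
    print(row.customer_id, row.sales, row.web_visits)
```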

One of the important implications of this… one that is often overlooked before implementation… is that the queries emitted from the fabric add workload to the underlying databases. If, in Figure 3, the top EDW is 100% busy servicing un-federated queries then adding a workload that joins that data to the 2nd EDW’s data will overburden the system. This is why it is not usually sound design to integrate source OLTP systems into a fabric. The OLTP systems are not likely to be optimized in any way for the resulting workload.

But the upside of the fabric is significant. Consider:

  • If we just say “yes” and load ungoverned data into a sandbox, business users can immediately access that data and join it to dimensions and facts deployed elsewhere in the enterprise.
  • Rogue data marts can now be integrated back into the fold.
  • Redundant data deployed to allow joins within a single database instance can be eliminated and the joins can be federated. Note that federation is no silver bullet… there may be performance reasons for co-locating data… but you now can consider the trade-offs (I’ll post later on a way to federate and improve performance).
  • Database products can be retired without affecting the programs that access them. If you have an old data mart built on a product that you would rather not support or license… you can move the data and re-point the virtual tables without impacting the tools and applications at all.
  • Data can be relocated based on economics and/or for performance reasons… this will be the subject of the next blog in this series… but as a teaser, remember the economics… Hadoop costs $1K/TB (ok… $1K-$4K) hardware included and commercial databases cost much more.

There is one final advantage to this that is strategic and important enough to deserve to not be buried in the previous bullet list… All of us have seen the database product market move in the last 6+ years from a place where the choices were DB2, Oracle, SQL Server, or Teradata… to a market with those products plus Exadata, plus two flavors of SQL Server, plus Netezza, plus Greenplum, plus Vertica… and then plus Impala, plus Hive, plus Tez, plus Spark, plus MongoDB, plus Cassandra, plus plus plus. I think that it is impossible to place database bets today with the confidence that these bets will pay off five years from now… and certainly there are no bets good for ten years. If you are betting on infrastructure to support the Internet of Things the bets are more risky still. In my opinion a federation layer provides critical insulation from this chaos. With federation you can build applications knowing that you can acquire and retire database products and not affect the queries. IMO this insulation is a strategic imperative…

Sorry… I got a little excited… I don’t usually spew foam. Let me try again. IMO you should seriously consider the benefits of federation in both your EDW architecture and in your enterprise data architecture.

Part 3 of this series considers a new extension to the architecture by adding an EDW Annex… you can see it here.

A Modern Data Warehouse Architecture: Part 1 – Add a Data Lake

In this series I will present a perspective on how the classic EDW architecture is changing under the influence of new technologies, new requirements, and new economics.

Part 1 will show the first of three architectural changes by introducing a data lake into the picture… in Part 2 we’ll extend the picture by adding a logical EDW layer… and in Part 3 we’ll consider the implications of Hadoop as an EDW Annex.

Figure 1. A Classic EDW

Figure 1 describes a classic EDW architecture with sources feeding a staging area, transformations feeding cleaned and certified data to the EDW, and data consumed by analytic applications. In this picture we assume that data marts are part of the EDW fabric and that it is the responsibility of the applications to know which database to query. In most enterprises there are rogue data marts with ungoverned data and these are an issue… but more on that later.

Note that Figure 1 also represents an optional sandbox area where uncertified and un-governed data can be landed. This helps avoid paying the cost of data governance upfront… before data has proven its value. A sandbox provides support for rogue data marts and cubes. Obviously the use of uncertified data introduces issues… but many believe, as I do, that the trade-off of agility over governance can be productive under the right circumstances.

Figure 2. An EDW with a Data Lake

Figure 2 represents a step towards a modern EDW architecture with a Hadoop-based data lake replacing the staging area as well as providing support for a sandbox.

A data lake provides all of the same capabilities as the staging area it replaced… it may be a landing zone for untransformed data… but it has several other important characteristics:

  • A data lake can hold raw data forever rather than store it temporarily.
  • A data lake has compute included so it can execute transformations and become a single platform for staging and ETL.
  • A data lake has compute and tools included so it can be used to analyze raw data for trends and anomalies.
  • A data lake can easily store semi-structured and unstructured data.
  • A data lake can store big data.

Of course a staging area based on an RDBMS could do the same… but the economics are completely different. A Hadoop system provides storage and processing for as little as $1K/TB; the cost of EDW hardware and software ranges from $15K/TB to $50K/TB… 15X to 50X more.

In addition, the data lake provides a cost-effective, extensible platform for building more sandboxes. If, by using Hadoop, you lower the upfront cost of a sandbox to $1K/TB then, when combined with the cost reduction from postponing the development of data quality and governance processes, you may find yourself in a position where you can just say “yes” to requests to add new data. At $1K/TB it may be unnecessary to force the business to build a business case and perform the difficult intellectual challenge of developing an ROI argument for data that has never been available and therefore has an unproven value proposition. Later, once the value is proven, you might move it to a governed state and to a governed platform.

Hadoop as an EDW staging area is not a new concept (see here)… the ETL vendors are supporting Hadoop as an ETL engine already. But a Hadoop staging approach starts to solve one of the nagging problems with the classic architecture: the lack of agility. Just saying “yes” to new data and corralling rogue marts provides a foundation to experiment and evolve while also providing the means to leverage successful experiments across the Enterprise.

There is more required to satisfy sandbox users… on to that in Part 2 – Logical Data Warehousing here.

References:

  • Wiktionary definition of a Data Lake here.
  • Informatica Big Data Edition here.
  • Curt Monash on IBM ETL here.

Using Teradata’s Appliance for Hadoop to Reduce TCO

Teradata has recently announced a very complete Teradata database-to-Hadoop integration. In this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system.

The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles.
There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).

I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider this…

Before the possibility of a Hadoop annex to Teradata, Teradata customers had no choice but to store cold, old data in the Teradata database. If, on occasion, you wanted to perform year-by-year comparisons over ten years of data then you needed to keep ten years of data in the database at a rough cost of $50K/TB (see here)… even if these queries were rarely executed and were not expected to run against a high-performance service level requirement. If you wanted to perform some sophisticated predictive analysis against this data it had to be online. In fact, the Teradata mantra… one which I wholeheartedly agree with… suggests that you really should keep the details online forever as the business will almost always find a way to glean value from this history.

This mantra is the basis of what the Hadoop vendors call a data lake. A data warehouse expert would quickly recognize a data lake as a staging area for un-scrubbed detailed data… with the added benefit that a Hadoop-based data lake can store and process data at a $1K/TB price point… and this makes it cost-effective to persist the staged data online forever.

So what does this mean to a Teradata EDW owner? Teradata has published numbers (here) suggesting that 92% of the queries in an EDW touch only 20% of the data. I would suggest that there is some sort of similar ratio that holds for 90% of the remaining queries… they may touch only another 40% of the data. This suggests that the remaining 40% of the data is online to service less than 1% of the queries… and I suggest that these queries can be effectively serviced from the $1K/TB Hadoop annex.

In other words, almost every Teradata shop can immediately benefit from Teradata’s new product announcements by moving 40% of their Teradata database data to Hadoop. Such a move would free Teradata disk space and likely take pressure off any need to upgrade the cluster. Further, when an upgrade is required, users can reduce the disk footprint of the Teradata database side of the system, add a Hadoop annex, and significantly reduce the TCO of the overall configuration.

Some time back I suggested that Teradata would be squeezed by Hadoop (here and here). To their credit, Teradata is going to try to mitigate the squeeze. But the economics remain… and Teradata customers should seriously consider how to leverage the low $/TB of Teradata’s Hadoop offering to reduce costs. Data needs to reside in the lowest-cost infrastructure that still provides the required level of service… and the Teradata Hadoop integration provides an opportunity to leverage a new, low-cost infrastructure.
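
Working the ratios above through a short script shows just how little of the query workload the coldest data serves. The 92/20 figure is Teradata’s published number; the second-tier ratio is my assumption from above; the warehouse size and $/TB figures are illustrative only.

```python
# The 92/20 figure is Teradata's published number; the second-tier ratio is
# the assumption made above; the warehouse size and $/TB are illustrative.

q_hot = 0.92                    # queries touching the hottest 20% of data
q_rest = 1.0 - q_hot            # 8% of queries remain
q_warm = q_rest * 0.90          # assume 90% of those touch the next 40% of data
q_cold = q_rest - q_warm        # queries left for the coldest 40% of data

print(f"Hot  20% of data serves {q_hot:.1%} of queries")    # 92.0%
print(f"Warm 40% of data serves {q_warm:.1%} of queries")   # 7.2%
print(f"Cold 40% of data serves {q_cold:.1%} of queries")   # 0.8%

# Moving that cold 40% from ~$50K/TB Teradata storage to a ~$1K/TB Hadoop
# annex slows less than 1% of the workload while freeing a lot of budget.
warehouse_tb = 200
savings = warehouse_tb * 0.40 * (50_000 - 1_000)
print(f"Savings on a {warehouse_tb}TB warehouse: ${savings/1e6:.2f}M")
```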

Thoughts on Oracle 12c…

Here are some quick thoughts on Oracle 12c…

First, I appreciate the tone of the announcements. They were sober and smart.

I love the pluggable database stuff. It fits into the trends I have discussed here and here. Instead of consolidating virtual machines on multi-core processors and incurring the overhead of virtual operating systems Oracle has consolidated databases into a single address space. Nice.

But let’s be real about the concept. The presentations make it sound like you just unplug from server A and plug into server B… no fuss or muss. But the reality is that the data has to be moved… and that is significant. Further, there are I/O bandwidth considerations. If database X runs adequately on A using 5GB/sec of read bandwidth then there had better be 5GB/sec of free bandwidth on server B. I know that this is obvious… but the presentations made it sound like magic. In addition, 12c added heat maps and storage tiering… but when you plug in, the whole profile of what is hot for that server changes. This too is manageable but not magic. Still, I think that this is a significant step in the right direction.

I also like the inclusion of adaptive execution plans. This capability provides the ability to change the plan on-the-fly if the execution engine determines that the number of rows it is seeing from a step differs significantly from the estimate that informed the optimizer. For big queries this can improve query performance significantly… and this is especially the case because prior to 12c Oracle’s statistics collection capability was weak. This too has been improved. Interestingly the two improvements sort of offset. With better statistics it is less likely that the execution plan will have to adapt… but less likely does not mean unlikely. So this feature is a keeper.

I do not see any of the 12c major features significantly changing Oracle’s competitive position in the data warehouse market. If you run a data warehouse flat-out you will not likely plug it elsewhere… the amount of data to move will be daunting. The adaptive execution plan feature will improve performance for a small set of big queries… but not enough to matter in a competitive benchmark. But for Oracle shops adaptive execution is all positive.

Hadoop and the EDW


Cloudera and Teradata have jointly published a nice paper here that presents an interesting perspective on how Hadoop and an EDW play together. Simply put, Hadoop becomes the staging area for “raw data streams” while the EDW stores data from “operational systems”. Hadoop then analyzes the raw data and shares the results with the EDW. Two early examples provided in the paper suggest:

  • Click stream data is analyzed to identify customer preferences that are then shared with the EDW. Note that the amount of data sent from Hadoop to the EDW would be fairly small in this case.
  • Detailed data is stored on Hadoop to build analytic models. The models are then transferred to the EDW to score sales activity data. Note that in this scenario the scored activity detail has to live in Hadoop to perform modeling… but it is unclear why it has to live in the EDW as well. I presume that scoring takes place on the EDW instead of in Hadoop for performance reasons… but maybe the data, the modeling, and the scoring should just take place in Hadoop?

The paper then positions Hadoop as an active archive. I like this idea very much. Hadoop can store archived data that is only accessed once a month or once a quarter or less often… and that data can be processed directly by Hadoop programs or shared with the EDW data using facilities such as Teradata’s SQL-H, or Greenplum’s External Hadoop tables (not by HAWQ, though… see here), or by other federation engines connected to HANA, SQL Server, Oracle, etc.

But think about the implications on how much data has to stay in your EDW if you archive everything older than 90, or even 180, days to Hadoop. The EDW shrinks significantly and the TCO advantage to your Enterprise will be significant. This is very cool.

There is one item in the paper I disagree with, though… and another statement that I think has a very short shelf-life.

The paper suggests that indexes, materialized views, aggregate join indexes, and other tweaks are what differentiate an EDW. I believe that reliance on these structures makes for a fragile EDW where only some queries can run fast. I like Teradata better when it just robustly scans fast and none of these redundant-data tuning artifacts are required (more here and here). Teradata was the original scan-fast DBMS… it is more than capable.

The paper also suggests that an EDW maintains value by including a sophisticated cost-based optimizer that uses data demographic statistics to identify an optimal query execution plan. I agree that Hadoop lacks this now… but there are several projects like Cloudera Impala that will eliminate this gap in the near term.

I believe that if you read between the lines you will see more evidence to support my belief (here) that Hadoop will squeeze the EDW vendors hard… and that the size of a squeezed EDW will then fit in an in-memory database.

Indexes are not a good thing… A blog on TCO

In many of my posts I refer to the issues associated with building “extra” data structures to meet performance goals (see one of my first posts ever here). These extra structures are always a trade-off… slowing the performance of one function in order to speed up another. I thought that it might be helpful to be very clear about where I stand on this.

Indexes improve the performance of queries that address a small set of data. They can also improve join performance if your favorite optimizer can apply an index intersection to the execution plan for your queries. Indexes dramatically slow the performance of inserts, updates, and bulk data loads as they have to be maintained when data changes. You can mitigate the cost by updating indexes in the background… but the trade-off does not go away. Indexes are probably required for OLTP applications that pick out single rows.
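
A tiny, self-contained illustration of the load-time side of that trade-off. SQLite is used here only because it ships with Python; the effect is the same in kind, and much larger in degree, on a parallel warehouse load.

```python
# Toy illustration of the insert/load cost of index maintenance.
# SQLite is used only because it ships with Python; the principle is the
# same (and much larger in magnitude) on a data warehouse platform.
import sqlite3, time, random

def load(rows, with_index):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (cust_id INTEGER, amount REAL)")
    if with_index:
        conn.execute("CREATE INDEX idx_cust ON sales (cust_id)")
    data = [(random.randint(1, 1_000_000), random.random()) for _ in range(rows)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO sales VALUES (?, ?)", data)
    conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

n = 500_000
print(f"load {n} rows, no index:   {load(n, False):.2f}s")
print(f"load {n} rows, with index: {load(n, True):.2f}s")   # noticeably slower
```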

Wouldn’t it be great if your favorite DBMS could resolve every query very fast without the overhead and operational effort associated with maintaining indexes? Certainly we should aspire to a read-optimized database, a data warehouse DBMS, that does not require indexes.

Vertica projections provide an optimized, materialized view that improves the performance of a set of queries. The Vertica optimizer automatically selects the optimal projection. Vertica provides a very slick tool that builds projections based on the query set provided. I worded my post on Vertica a little vaguely… so let me be sure here to point out that every Vertica query runs against a projection… so it is possible to have only one. In this case there is no additional overhead. Adding projections slows the data load process and increases the storage requirements. This is the trade-off.

Other databases offer materialized views. They make the same trade-off as above.

An OLAP cube is a physical structure that pre-aggregates data so that your query workload can avoid the aggregation. The best implementations of this express the cube as a materialized view so that queries can use the pre-aggregated data without explicitly pointing at a cube structure… the optimizer picks it for you. In addition the best implementations let you drill out of the cube to the detail records. These products have the update/delete/load issues of an index plus add an extra data latency issue as the data has to be aggregated on some interval… usually hours or days. Many products do not allow joins from a cube. You can see the trade-off. The Oracle Exalytics product materializes the aggregated cube on a separate server in-memory. This provides even more performance but adds the system and operational overhead of moving data across system boundaries.
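
The same trade-off can be shown in miniature: a pre-aggregated summary table (standing in for a cube or materialized view) answers the aggregate query without touching the detail, but new detail is invisible until the summary is rebuilt. This is a toy sketch of the pattern, not any vendor’s implementation.

```python
# Toy sketch of the pre-aggregation trade-off: a summary table (standing in
# for a cube or materialized view) avoids scanning the detail at query time,
# but must be refreshed when the detail changes -- that refresh is the data
# latency described above. Not any particular vendor's implementation.
import sqlite3, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(random.choice("NESW"), random.random()) for _ in range(100_000)])

# "Build the cube": pre-aggregate into a summary table on some interval.
conn.execute("CREATE TABLE sales_by_region AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# Queries hit the small summary instead of the 100,000-row detail...
print(conn.execute("SELECT * FROM sales_by_region").fetchall())

# ...but new detail is invisible until the summary is rebuilt.
conn.execute("INSERT INTO sales VALUES ('N', 1000.0)")
conn.execute("DROP TABLE sales_by_region")
conn.execute("CREATE TABLE sales_by_region AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
```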

Wouldn’t it be nice if you could query raw data and perform aggregation so fast that even against terabytes of data you could run any query with a 3-second or better response time… without the overhead of building cubes?

You may build specialized table structures and pre-join, pre-aggregate, or pre-compute data to make a set of queries run fast. The cost of building and maintaining this sort of implementation, versus just querying the base tables, is the trade-off. Further, this approach is something of a trap. You cannot build these structures for every query… and if you did, the business would conceive of another critical query the next day that required more work.

You can add indexes to the structures built using the technique above and provide very fast application-specific performance to a small set of queries. This is currently the favored approach when companies build iOS or Android apps as it provides the best possible performance… at a significant price.

Wouldn’t it be great if this was unnecessary… you could just scan so fast that mobile response service levels could be met from the base data regardless of the query.

You can deploy redundant data in operational data stores, data marts, cube servers, analytic data stores, and so on… with each specialized store providing performance for some limited set of queries at the cost of development and support ongoing. Each of these copies could deploy specialized database products that speed up that set of queries a little more. Again, this surround-the-EDW approach is a trap that leads to the proliferation of data marts and of database technologies.

Please do not take that last paragraph the wrong way… I believe that the worst possible approach is to blindly standardize on one or two database products. This trade-off makes life convenient for the IT department at the expense of performance and agility in the business. It is OK to have one or two favored products but IT must always serve the business to the best of its ability as a first priority… and sometimes the new start-up has just the thing (remember that once Teradata was a start-up and DB2 on the mainframe was the IT standard…).

What I wish is that one or two products could solve all of the performance and functionality problems without the cost of building “extra” stuff… one product would be better than two. I like products that make the extra stuff “free”. Netezza does a nice job of making zone maps “free”, for example. Teradata and Greenplum provide the option of row store or column store for “free”. Vertica automatically builds extra projections for “cheap”… and while there is a cost to the projection it at least does not require staff to tune it up. Oracle materialized views are “cheap”.

What I dislike are products that require DBAs to work harder and harder to apply all of the techniques above to meet performance SLAs. Each of these techniques trades off performance for development and operational expense.

As I have noted before… the performance SLAs for BI are about to become severe as companies try to support BI on mobile devices. The development and operational costs of tuning up… that is, the TCO… will be significant unless better, faster software infrastructure becomes available.

Consider a DBMS that could eliminate these extra constructs, eliminate the cost of developing and maintaining them, and eliminate the architectural fragility these approaches imply… a DBMS that holds only base data and still satisfies every query in seconds, delivering the business agility this implies… the TCO of that DBMS would be compelling.

I actually believe that the answer is available in the market today… this is no longer a pipe dream… more later…

My 2 Cents: Netezza 1Q2013


Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…

Please consider this as you read on…

Summary

Netezza put a new spin on data warehousing… they made it easy. The Netezza software includes a unique clustered index feature called a zone map that is powerful and easy to use. They also use an FPGA co-processor to augment the CPUs, offloading data compression and projection. When both of these innovations combine, Netezza is hard to beat.

Zone maps are powerful when they can be used in a query plan… but the hardware is only good, not great, when zone maps are not in the plan. FPGAs provided a huge boost when Netezza first came on the scene… but as discussed here they do not provide the same boost today. In addition, FPGAs may limit the ability of a Netezza cluster to handle concurrent queries (see here and especially the comments).
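
For readers who have not worked with them, here is a conceptual sketch of how a zone map prunes work. It is a simplification of the idea for illustration, not Netezza’s implementation.

```python
# Conceptual sketch of zone-map pruning (a simplification of the idea, not
# Netezza's implementation). Each zone keeps the min/max of a column; a query
# with a predicate on that column can skip zones whose range cannot match.
# This only works well when the data is naturally in sequence on that column.
from datetime import date, timedelta

# Pretend each zone holds one week of naturally date-sequenced rows.
zones = [{"min": date(2013, 1, 1) + timedelta(weeks=i),
          "max": date(2013, 1, 1) + timedelta(weeks=i + 1)}
         for i in range(52)]

def zones_to_scan(pred_from, pred_to):
    """Return the zones whose [min, max] range overlaps the predicate."""
    return [i for i, z in enumerate(zones)
            if not (z["max"] < pred_from or z["min"] > pred_to)]

# A one-month date-constrained query touches ~5 of 52 zones...
print(len(zones_to_scan(date(2013, 6, 1), date(2013, 6, 30))), "of", len(zones))

# ...but a predicate on a column the data is NOT sequenced by (say customer
# number) overlaps every zone's range, so nothing can be skipped.
```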

The IBM acquisition has opened up a market of Blue shops to Netezza… so they are selling… and as a result Netezza is here to stay.

Where They Win

Of course, Netezza will win in all-Blue shops.

Netezza wins when there is a naturally sequenced field in each big table that is also used in the predicate for most queries. For example, if data is naturally in date/time sequence and every query has a date/time constraint then Netezza is hard to beat. This is most often the case for focused data marts or single-application databases… so look to Netezza for these sorts of problems.

Netezza wins when there are a relatively small number of concurrent queries… and they can win when the queries are complex… as long as the zone map is in the plan.

Netezza can win when the POC is designed such that zone maps may be used in the POC… for example when the POC models only a single data load and the data is pre-sorted… even when the real application would fragment the data (for example… data will not naturally enter the warehouse sequentially by customer number… the same customer will be represented time and again… but if you load once only for a POC then you can sort by customer number and use it in the query predicates).

Note that I am not saying that Netezza is a poor performer when zone maps are not used… it is good… but they would never win a POC if no queries used the zone map.

Where They Lose

Guess what? Netezza loses when the zone maps cannot be used or can be used for only a small fraction of the query workload. Note again that the use of a zone map depends on two factors: the data has to be in sequence over all time, and the queries must use the columns mapped in the predicate. If data enters the system out of sequence then the zone map fragments and eventually loses the ability to speed up queries (a few random out of sequence rows are OK).

This constraint makes it hard for Netezza to service data warehouses where, by definition, lots of different user constituencies come at the data from lots of different directions… rather than always using the path grooved with a zone map.

Netezza was designed when only Sybase IQ had column-oriented tables… today columnar is in nearly every DW database and this has allowed the competition to cut deeply into Netezza’s competitive, zone-map-enabled edge. Teradata columns, Greenplum columns, or the natural column stores can win even when zone maps are on target.

Bottom line: do a POC…

In the Market

I spend most of my time in the general market for data warehousing. You won’t see me offer much of an opinion on HANA for BW, for example… even though there are ten thousand plus BW warehouses I just do not see them in the places I work.

Before Netezza was acquired by IBM they were everywhere… in nearly every POC. Now… not so much. To a very large extent they seem to have been directed into the Blue-only customer base (now that I think about it, the same thing happened to the Ascential DataStage suite of ETL products).

My Guess at the Future

As I noted in the reference above… I think that Netezza will eventually go away from the co-processor strategy.

There have been rumors for several years of a design that allows multiple zone maps. This would be very important… but loading out-of-sequence data, which is the necessary result, could be very slow.

Netezza has lost some of its edge as other products added columnar capabilities… and Netezza is surely looking at this… but their architecture, which includes an execution engine on the server and on the FPGA, makes this more complex than you might suspect. Zone maps and two-stage optimization (once in the server and once in the FPGA) are cool… but a tight coupling of the tricks makes for a difficult time extending and adding new features.

If I were the King of Netezza and I could not find a reasonable way to extend beyond the two tricks that got me here I would go with the flow… I would position Netezza as an extremely easy-to-deploy data mart appliance and hook it tightly (i.e. build in some integration) alongside DB2 and Hadoop… and I would cede the EDW space to DB2 and the Big Data space to Hadoop.

Next up… my 2 Cents on Greenplum

May 1, 2013: Here is an update, or maybe a summary, of my view on Netezza… – Rob

Will Hadoop Eat Greenplum and Netezza?

If I were the Register I would have titled this: Raging Stuffed Elephant To Devour Two Warehouse Vendors… I love the Register… if you do not read it, have a look.

This post is about the market implications of architecture…

Let us assume that Hadoop matures and finds a permanent place in the market. This is not certain, with some folks expressing concern (here) and others expressing boundless enthusiasm (here). So let’s assume… and consider where it might fit.

Figure 1. The Squeeze

One place is in the data warehouse market… This view says Hadoop replaces the DBMS for data warehouses. But the very mature BI/DW market requires a high level of operational integrity and Hadoop is not there yet… it is advancing rapidly as an enterprise platform and I believe it will get there… but it will be 3-4 years. This is the thinking I provided here that leads me to draw the picture in Figure 1.

It is not that I believe that Hadoop will consume the entire data warehouse market, but I believe that very large EDWs… those over 1PB… and maybe those over 500TB… will be compelled by the economics of “free” to move big warehouses to Hadoop. So Hadoop will likely move down into the EDW space from the top.

Another option suggests that Big Data will be a platform unto itself. In this view Hadoop will sit beside the existing BI/DW platform and feed that platform the results of queries that derive structure from unstructured data… and/or that aggregate Big Data into consumable chunks. This is where Hadoop sits today.

In data warehouse terms this positions Hadoop as a very large independent analytic data mart. Figure 2 depicts this. Note that an analytics data mart, and a Hadoop cluster, require far less in the way of operational infrastructure… they share very similar technical requirements.

Figure 2. Hadoop Along Side

This leads me to the point of this post… if Hadoop becomes a very large analytic data mart then where will Greenplum and Netezza fit in 2-3 years? Both vendors are positioning themselves in the analytic space… Greenplum almost exclusively so. Both vendors offer integrated Hadoop products… Greenplum offers the Greenplum database and Hadoop in the same hardware cluster (see here for their latest announcement)… Netezza provides a Hadoop connector (here). But if you believe in Hadoop… as both vendors ardently do… where do their databases fit in the analytics space once Hadoop matures and fully supports SQL? In the next 3-4 years what will these RDBMSs offer in the big data analytics space that will be compelling enough to make the configuration in Figure 3 attractive?

Figure 3. Unified Hadoop

I know that today Hadoop cannot do all that either Netezza or Greenplum can do. I understand that Netezza has two positions in the market… as an analytic appliance and as a data mart appliance… so it may survive in the mart space. But the overlap of technical requirements between Hadoop and an analytic data mart… combined with the enormous human investment in Hadoop R&D, both in the core and in the eco-system… makes me wonder about where “Big Data” analytic relational databases will fit.

Note that this is not a criticism of the Greenplum RDBMS. Greenplum is a very fine product, one of the best EDW platforms around. I’ll have more to say about it when I provide my 2 Cents… But if Figure 2 describes the end state for analytics in 2-3 years then where is the place for the Figure 3 architecture? If Figure 3 is the end state then I do not see where the line will be drawn between the analytic workload that requires Greenplum and the workload that will run on Hadoop. I can barely see it now… and I cannot see it at all in the near future.

Both EMC Greenplum and IBM seem to strongly believe in Hadoop… they must see the overlap in functionality and feel the market momentum of Hadoop. They must see, better than most, that Hadoop wins this battle.
