Database Supercomputing in the Cloud

This post will combine the information in the last two posts to show how the cloud might be used as a database supercomputer at no extra cost. 

To quickly recap: in the post here, you saw a simple model that shows how a cluster dedicated to a single query uses the on-demand resources of cloud computing to provide a 10X-100X performance boost over a multitasking cluster running many queries at once. Note that this model did not account for the start-up time of spinning up a new cluster instance for each query, but we assume that there are still significant savings for queries that run for several minutes. Further, we believe that queries can be queued and run one at a time on clusters, so the start-up cost is amortized whenever there is work queued.

The next post here reminded us of the cost of populating cache from DRAM whenever there is a multitask context switch from one query to the next. You also saw how supercomputing SIMD instructions further accelerate processing. Finally, we discussed how modern columnar databases store data as vectors in a way that allows compressed columnar data to be loaded directly into cache to optimize the use of that memory. 

One last note: all the cores on a chip share the L3 cache. On a 4-core processor, this means that you could run 4 parallel processes, all on compressed, vectorized data.

So how might the cloud help here? First, suppose we dedicate a cluster to query processing, where “dedicate” means that the hardware serves a single queue of queries. Once the query data is loaded in memory and the cache is primed, performance can be 1000X that of an alternative configuration where the hardware is shared, both by a multitasking DBMS and by other cloud workloads dispatched by the cloud operating system. Better still, since the cloud with fine-grained billing charges only for what you use, this extra performance is nearly free.

When SAP HANA was introduced, SAP published amazing 1000X benchmarks using in-memory supercomputing instructions. Unfortunately, the technology did not quite exist to deploy queries on-demand on tens or hundreds or thousands of dedicated cloud computers. I imagine that we are not too far from this today. 

The bottom line is that multitasking has never been efficient. When you swap context in and out of cache, you are just wasting resources. It made sense when computing was expensive, and you swapped context to take advantage of the time it took to perform I/O to disk or tape. For analytic workloads where the data is columnar and vectorized, you can fit multiple terabytes of compressed data in memory and stream it to supercomputing instructions quickly.  

I imagine a time when simple, very high performance, single-threaded database processing units, similar in architecture to GPUs, will handle queries in a dedicated manner, fetching vectorized data from products like Apache Arrow, and 1000X speed-up at a very low cost will be the standard. 

Numbers Every Database Professional Should Know

Latency Estimates in a Processor

In this post I will set up the next post by reminding you of the numbers every programmer should know. The picture shows the latency to access data across the three levels of cache in a modern processor, across the memory bus to DRAM, and then out to an SSD or spinning disk drive.

These numbers are the key to in-memory database performance. L1 cache is expensive and small. All data and all instructions are fed to the core through the L1 cache, and typically the instruction and the data it will operate on are each fetched independently. If the data (or instruction… but I will just talk about the data from here on out) is not found in the small L1 cache, the L2 cache is searched. Likewise, if the data is not in the L2 cache, the L3 cache is searched, and then DRAM. As you can see, there is a 200X performance difference between a fetch from L1 cache and a fetch from DRAM. If the data must be fetched from SSD or disk, there is a 1000X penalty.
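Since the original picture is not reproduced here, below is a rough sketch of the numbers using commonly cited ballpark figures. These are my approximations, order-of-magnitude only, not the exact values from the picture.

```python
# Rough, commonly cited ballpark latencies; exact values vary by processor
# generation and storage device, so treat these as order-of-magnitude only.
LATENCY_NS = {
    "L1 cache":        0.5,
    "L2 cache":        7,
    "L3 cache":        30,          # shared by all cores on the chip
    "DRAM":            100,
    "SSD random read": 150_000,
    "Disk seek":       10_000_000,
}

for tier, ns in LATENCY_NS.items():
    print(f"{tier:16s} {ns:>14,.1f} ns")

print(f"DRAM vs L1:  {LATENCY_NS['DRAM'] / LATENCY_NS['L1 cache']:.0f}X")          # ~200X
print(f"SSD vs DRAM: {LATENCY_NS['SSD random read'] / LATENCY_NS['DRAM']:.0f}X")   # ~1000X or more
```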

Databases work to mitigate the 1000X penalty by pre-fetching data into DRAM. Modern processors mitigate the penalty between cache and DRAM by trying to anticipate the data required and pre-fetching it into L3 cache. This is a gamble, and the likelihood of winning the pre-fetch bet varies with your DBMS and with the workload. More simultaneous queries against many tables put pressure on memory and lessen the effectiveness of data pre-fetch.

Databases also mitigate the 1000X penalty by compressing data so that each fetch of a block from disk captures more data. For tabular (not columnar) data each block must be decompressed before it can be used. Decompression steals CPU cycles from query processing but mitigates some of the cost of I/O to disk. Columnar databases fetch compressed data and, if the databases use modern vector processing techniques, they operate directly on the compressed vector data without decompression. This is especially powerful as these vectorized columnar databases can push compressed data into cache and use that memory more effectively. 
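To make the "operate directly on compressed data" idea concrete, here is a toy sketch using run-length encoding. The encoding and the query are made up for illustration; real engines use more elaborate columnar formats.

```python
# Toy run-length-encoded (RLE) column stored as (value, run_length) pairs.
rle_column = [(10, 500_000), (20, 250_000), (10, 250_000)]

# Aggregate directly on the compressed representation: no decompression.
total = sum(value * run for value, run in rle_column)
rows = sum(run for _value, run in rle_column)

# The decompress-then-scan alternative materializes every row first.
decompressed = [value for value, run in rle_column for _ in range(run)]
assert total == sum(decompressed) and rows == len(decompressed)

print(total, rows)  # 12500000 1000000
```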

Vectorized data can use the supercomputing instruction set that includes SIMD instructions. SIMD stands for Single Instruction, Multiple Data; these instructions fetch a single instruction from L1 cache and reuse it on the vectorized data over and over without pausing to fetch new instructions. Note that in a modern processor, waiting for instructions or data to be fetched from cache or DRAM appears as CPU busy time. When the processor is stalled waiting for data to be fetched from disk, the query/job gives up the CPU and a new query/job is dispatched.
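Here is a small illustration of the pattern using NumPy, whose array kernels are typically compiled with SIMD support; it shows the programming model (one operation applied across a whole vector), not the instruction-level details.

```python
import numpy as np

# One column of data: a million float32 values.
prices = np.random.rand(1_000_000).astype(np.float32)

def scalar_sum(values):
    """Scalar style: touch one element per step, with per-element overhead."""
    total = 0.0
    for v in values:
        total += v
    return total

# Vectorized style: a single reduction applied across the whole vector;
# NumPy's kernels typically use SIMD instructions under the hood.
vector_total = prices.sum()

print(vector_total, scalar_sum(prices))  # same result (up to float rounding)
```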

Finally, for the longer-running queries associated with analytics and business intelligence (where long-running means a few seconds or more… “long” in CPU instruction time), it is highly likely that the L1 and L2 cache in each core will be flushed and new data will be fetched. It is even likely that all the data in L3 cache will be flushed. In this case most of the goodness associated with compression and vectorization is flushed with it. The vectors must be fetched again from L3 and DRAM. Further, note that when you run in a virtual machine or in the cloud you may think that you are the only user, but other virtual machines, other containers, or other serverless processes may be flushing you out of memory all the time.

This is important. Until we swap out a query for a new one, it looks like in-memory vector processing might be 200X-1000X faster than a tabular query; but in an environment where tens or hundreds of queries are running concurrently, the pressure on memory keeps flushing the cache, and the advantage of vectorized query processing is reduced. Reduced, not eliminated. We would still expect the advantages of compressed data in cache and SIMD supercomputing instructions to provide much more than a 10X speedup.

As mentioned, this post is a detailed review of material I covered years ago when HANA was introduced. In the next post I will add some new thinking. 

Here is more information on processor architecture and cache usage.

DW Cloud Economics – The Model

This post describes the model used to generate the content in the DW Cloud Economics – A Do Over post, in case you want to check the numbers…

First, keep in mind that the numbers in this model are relative and indicative, not absolute. Here are the fixed parameters I used to generate these indicative numbers:

Cost per server per hour: $4.00. This could be any number in any currency and it would not matter.

Runtime per job with no contention: 3 hours. This could be any duration and it would not change the relative results. If the number were well below an hour, it would increase the hour-based billing costs.

Jobs per Workload: 6. Again, any number works.

Baseline Configuration: 24 Servers.

Contention between 6 jobs on 24 Servers: 50%. This sets the average amount of time a job will wait for resources. So, a job that runs for 3 hours with no contention will run in 4.5 hours with 50% contention. As the number of servers goes up, the contention comes down for a fixed workload.

Baseline Performance: 2000. A completely arbitrary number that allows the model to calculate a relative price/performance. Relative “Performance” is calculated by dividing the Baseline Performance by the actual runtime. This means that a workload that takes 5 hours will yield a performance of 400 and a workload that takes 2 hours will yield a performance of 1000.

There are also a series of calculated values:

Total Servers / Workload is equal to Servers/Job for the Multi-tasking and Serial Execution configurations. It is equal to Servers/Job * Number of Jobs for the dedicated server per job configuration.

Scaleup is the ratio between the number of Total Servers / Workload and the Baseline Configuration. It is used to scale down the runtime and the contention.

Runtime / Job is the number of hours per job with no contention, scaled down based on the Scaleup. Contention scales down as the number of servers scales up.

Actual Runtime / Workload is Runtime / Job plus the time spent waiting due to contention. In the Serial Execution configuration the jobs run one after another, so the workload runtime is the sum of the per-job runtimes; note that in the Serial Execution and the Dedicated Server configurations there is no contention.

Paid Runtime / Workload is the Actual Runtime / Workload rounded up to the next full hour for the hourly billing table, and equal to the Actual Runtime / Workload for the per-minute billing table.

Cost / Workload is the Cost per Server per Hour parameter times the Paid Runtime / Workload value times the Total Servers / Workload value.

Price / Performance is the Cost / Workload divided by the Actual Runtime / Workload, times the Baseline Performance parameter, divided by the Cost / Workload value as noted above. Since the Cost / Workload terms cancel, this works out to the Baseline Performance divided by the Actual Runtime / Workload, which is why the same values appear in both billing tables.
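For anyone who really does want to check the numbers, here is a minimal sketch of the model in code. The contention and rounding rules below are my reading of the description above, chosen so that the output reproduces the tables in the A Do Over post; this is a sketch, not the original spreadsheet.

```python
import math

# Fixed parameters from the description above.
COST_PER_SERVER_HOUR = 4.00   # dollars
RUNTIME_NO_CONTENTION = 3.0   # hours per job
JOBS_PER_WORKLOAD = 6
BASELINE_SERVERS = 24
BASELINE_CONTENTION = 0.5     # 50% for 6 jobs on 24 servers
BASELINE_PERFORMANCE = 2000


def workload(config, total_servers, per_minute_billing):
    """Return (actual hours, paid hours, cost, relative performance)."""
    scaleup = total_servers / BASELINE_SERVERS

    if config == "multitasking":
        # All six jobs share the cluster; only the contention shrinks.
        contention = BASELINE_CONTENTION / scaleup
        actual = RUNTIME_NO_CONTENTION * (1 + contention)
    elif config == "serial":
        # Jobs run one after another; no contention, but runtimes add up.
        actual = JOBS_PER_WORKLOAD * RUNTIME_NO_CONTENTION / scaleup
    elif config == "dedicated":
        # One cluster per job, all running at once; no contention.
        # total_servers here is servers-per-job times the six jobs.
        actual = RUNTIME_NO_CONTENTION / scaleup
    else:
        raise ValueError(config)

    paid = actual if per_minute_billing else math.ceil(actual)
    cost = COST_PER_SERVER_HOUR * total_servers * paid
    performance = BASELINE_PERFORMANCE / actual   # the cost terms cancel
    return actual, paid, cost, performance


if __name__ == "__main__":
    cases = [("multitasking", 24), ("multitasking", 48), ("multitasking", 96),
             ("serial", 24), ("serial", 48), ("serial", 96),
             ("dedicated", 144), ("dedicated", 288), ("dedicated", 576)]
    for per_minute in (False, True):
        print("per-minute billing" if per_minute else "hourly billing")
        for config, servers in cases:
            actual, paid, cost, perf = workload(config, servers, per_minute)
            print(f"  {config:12s} {servers:3d} servers: {actual:6.2f}h actual,"
                  f" ${cost:,.0f}, performance {perf:,.0f}")
```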

DB Cloud Economics – A Do Over

Here is a better model that demonstrates how cloudy databases can provide dramatic performance increases without increasing costs.

Mea culpa… I was going to extend the year-old series of posts on the economics of cloud computing and recognized that I poorly described the simple model I used, and worse, made some mistakes in the model. It was very lame. I’m so sorry… That said, I hope that you will like the extension to the model and the new conclusions.

I’ll get to those conclusions in a few short paragraphs and add a new post to explain the model.

If we imagine a workload consisting of 6 “jobs” that would execute on a dedicated cluster of 24 servers in 3 hours each with no contention, we can build a simple model to determine the price performance of the workload for three different configurations:

  1. M1: The usual configuration where all six jobs run together and share the configuration, multitasking and contending for the resources. As the configuration scales up, the cost goes up, the runtime goes down, and the contention drops.
  2. M2: A silly model where each job runs serially on the configuration with no contention. As the configuration scales up, the costs go up and the runtime goes down.
  3. M3: A model where each job runs independently, simultaneously, on its own cluster with no contention.

If we pay for servers by the hour the model shows that there is an opportunity to deploy many more servers, as many as one per “job”, to reduce the runtime and the cost. Here are the results:

| Config | Servers Deployed | Actual Runtime / Workload (Hours) | Paid Runtime / Workload (Hours) | Cost / Workload | Price / Performance |
|---|---|---|---|---|---|
| Multitasking | 24 | 4.5 | 5 | $480 | 444 |
| | 48 | 3.8 | 4 | $768 | 533 |
| | 96 | 3.4 | 4 | $1,536 | 593 |
| Serial Execution | 24 | 18.0 | 18 | $1,728 | 111 |
| | 48 | 9.0 | 9 | $1,728 | 222 |
| | 96 | 4.5 | 5 | $1,920 | 444 |
| Dedicated Server per Job | 144 | 0.5 | 1 | $576 | 4000 |
| | 288 | 0.3 | 1 | $1,152 | 8000 |
| | 576 | 0.1 | 1 | $2,304 | 16000 |

Table 1. Costs with Hourly Billing

Note that the price/performance metric is relative across the models. You can see the dramatic performance and price/performance increases using a configuration where each unit of work gets a dedicated set of resources. But the more dramatic picture is exposed in Table 2.

| Configuration | Servers Deployed | Actual Runtime / Workload (Hours) | Paid Runtime / Workload (Hours) | Cost / Workload | Price / Performance |
|---|---|---|---|---|---|
| Multitasking | 24 | 4.5 | 4.5 | $432 | 444 |
| | 48 | 3.8 | 3.8 | $720 | 533 |
| | 96 | 3.4 | 3.4 | $1,296 | 593 |
| Serial Execution | 24 | 18.0 | 18.0 | $1,728 | 111 |
| | 48 | 9.0 | 9.0 | $1,728 | 222 |
| | 96 | 4.5 | 4.5 | $1,728 | 444 |
| Dedicated Server per Job | 144 | 0.5 | 0.5 | $288 | 4000 |
| | 288 | 0.3 | 0.3 | $288 | 8000 |
| | 576 | 0.1 | 0.1 | $288 | 16000 |

Table 2. Costs with per Minute Billing

Here the granular billing reduces the total cost while delivering the same dramatic performance and price/performance increases. This was the point I was trying to make in my earlier posts. Note that this point, that fine-grained billing can be used to significantly reduce costs, is why deploying work in containers, or better still as serverless transactions, is so cost-effective. It is why deploying virtual machines in the cloud misses the real cost savings. In other words, it is why building cloud-native implementations is so important, and why cloud-native databases will quickly overcome databases that cannot get there.

I also was trying to show that the work deployed across these configurations, what I call a “job”, can be an ETL job, a single query, a set of queries, or a Spark job. If you have lots of smaller work, it may be best to run it in a multitasking configuration to avoid the cost of tearing down and starting up new configurations. But even here there is a point where the tear-down cost can be amortized across multiple semi-dedicated configurations, as in the rough sketch below.
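To make the break-even intuition concrete, here is a hedged back-of-the-envelope sketch. The five-minute start-up/tear-down overhead and the 50% contention figure are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope: when does a dedicated cluster per job beat sharing a
# multitasking cluster, once you pay a start-up/tear-down overhead?
# All of the numbers below are assumptions for illustration.
COST_PER_SERVER_HOUR = 4.00
SERVERS = 24
STARTUP_HOURS = 5 / 60          # assume ~5 minutes to spin up and tear down

def dedicated_cost(job_hours):
    """Dedicated cluster: pay for the job plus the start-up overhead."""
    return COST_PER_SERVER_HOUR * SERVERS * (job_hours + STARTUP_HOURS)

def shared_cost(job_hours, contention=0.5):
    """Shared cluster: no start-up cost, but contention stretches the runtime."""
    return COST_PER_SERVER_HOUR * SERVERS * job_hours * (1 + contention)

for minutes in (2, 10, 60, 180):
    hours = minutes / 60
    print(f"{minutes:3d} min job: dedicated ${dedicated_cost(hours):8.2f}"
          f"  shared ${shared_cost(hours):8.2f}")
```

With these assumptions, very small jobs are cheaper on the shared cluster, and anything beyond roughly ten minutes favors a dedicated one; the crossover obviously moves with the real start-up time and contention.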

What is a Cloud-native Database?

Before this series is complete, I plan on defining in some detail what the various levels of Cloud-nativeness might be to allow readers to classify products based on architecture, not marketing. In this post, I’ll lay out some general concepts. First, let’s be real about cloud-things that are not cloud-native.

Any database that runs on physical hardware can run on virtual machines and, therefore, can run on virtual machines in the cloud. Databases with no cloud capabilities other than the ability to run on a VM at a cloud provider are not cloud-native. Worse, there are lots of anecdotal stories that suggest that there are no meaningful savings to be had from moving a database from an on-premises server or VM to a cloud VM with no other change to take advantage of cloud elasticity.

So here are two general definitions for your consideration:

1) A cloud-native database will have one or more features that utilize capabilities found only on a cloud-computing platform, and

2) A cloud-native database will demonstrate economic benefits derived from those cloud-specific features.

Note that the way you pay for services via capital expenses (CapEx) or as operating expenses (OpEx) does not provide economic benefit. If the monetary costs of a subscription are more-or-less equal to the financial costs of a license, then savings are tied to tax law, not to economics. Beware of cloudy subscriptions that change how you pay without clearly adding beneficial cloudy features. It is these subscriptions that often are the source of the no-savings anecdotes mentioned above.

This next point is about the separation of storage from compute. Companies long ago disconnected their databases from just-a-bunch-of-disks (JBOD) and moved to shared storage such as a SAN or S3. Any database today can use shared storage, so it is not useful to say that any database that can use shared storage has separated storage from compute. If instead we require features, not marketing, that allow compute to scale separately and more-or-less dynamically from storage, we will be able to move forward in this area.

So:

3) For storage to be appropriately separated from Compute, it must be possible to scale Compute up and down dynamically.

Next, when compute scales, it scales at different granularity. An application or database that automatically adds and subtracts virtual machines provides different economics than a database that scales using containers. Apps that add and subtract containers have different economics than applications that use so-called serverless containers to scale. In this dimension, we will try to characterize granularity to account for the associated cloud economics. This topic will be covered in more detail later, so I’ll save the rule for that post.

Note that it is possible for database vendors to develop a granular architecture and to use the associated economics to their advantage. They may charge you for the time when any part of your database is running but be billed by their provider for smaller chunks. This is not an issue unless their overall costs become uncompetitive.

Last, and in some ways least, different products may charge for time in smaller or larger chunks. You might be charged by the hour, by the minute, by the second, or in smaller increments. Think about the scaling economics I suggested in the first posts of this thread. If you are charged by the hour, there is no financial incentive to scale up to finish a job in minutes: ten minutes and fifty minutes cost the same when you are billed by the hour. The rule:

4) Cloud databases that charge in smaller time increments are more economical than those that charge in larger increments.

The rationale here is probably obvious, but I’ll cover it in-depth in a later post.

With these concepts in place, we can discuss how architectural changes affect each aspect of the economics of databases in the cloud.

Note that this last sentence was written assuming that a British computer scientist with an erudite accent would speak it when they create the PBS series from these posts. Not.

By the way, a few posts from now, I am going to go back to some ideas I shared five years ago around the relationship between database processing and the underlying hardware platform. I’ll update this thinking with cloud computing in mind. You can find this thinking here which originally came from Jeff Dean and Peter Norvig (displayed in lots of places but here is one).

More on Cloud Data Elasticity

The last post (here) demonstrated how scalability in the cloud provides the ability to reduce runtimes from days or hours to minutes without raising the cost. We used a batch ETL service running three ETL scripts as an example. We then showed how the same use of scalability could allow us to break the batch ETL service into discrete jobs and remove contention between the three scripts to further improve throughput and reduce costs.

There is a great deal more to say about how the cloud changes our thinking about applying resources to our big data workloads. I promised to get to a discussion of database work, but that will have to wait another week. Sorry. Let’s carry on.

In the scenario where we ran the three jobs together, we assumed that all three ran in the same amount of time, and so we shut down the ETL service when all three scripts completed. We shut down the servers when they were all complete to stop billing at the $1152 price point. It is more likely that each script would take more or less time than the others.

When we run each script in its own set of servers, we can stop billing as each job ends. If one of the jobs takes only 2.5 hours to complete, and your cloud provider will bill in minute, not hour, increments, then the cost of that single job drops from $288 to $240. Over a year, these savings add up, and you can ask for a raise.
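The arithmetic, spelled out (the effective hourly rate is implied by the $288 figure above):

```python
# The $288 figure above implies an effective rate of $96 per hour for that
# job's cluster; with minute-level billing you stop paying when the job ends.
rate_per_hour = 288 / 3.0        # dollars per hour, implied by the 3-hour job
print(rate_per_hour * 2.5)       # -> 240.0
```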

So, we have added a third point: by using scalability, and by scheduling discrete workloads on dedicated cloud servers, you can scale performance and significantly reduce costs.

You have no doubt noticed that the scenarios describe scalability without any impact on the data. We assume that compute can scale independently of storage, and that re-sharding of the data is not required. Every database has special sauce for managing data across storage. All we assume here is that the file scan of the data, be it rows or columns, occurs in the compute nodes after the data is read from storage. In a future post, we will see how modern storage systems affect this assumption without changing the economic benefits described here. Figure 1 is a classic, really too simple, depiction of this separation.

Figure 1. Compute Separate from Storage

The database logic here is straightforward. When compute and storage are tied, the query planner knows precisely and in advance how many parallel nodes are in play, and the system can spread data across those nodes in every step that requires data distribution. The number of nodes is fixed for both storage (processing IO) and compute. Figure 2 shows a system with storage and logic connected.

Figure 2. Classic Shared-nothing Connected Storage and Compute

In a system with storage and compute separated, the query planner has to ask how many nodes are available for data distribution. I am using the term “database” here, but any parallel data processing system that shards data with a processing plan has some form of this logic.

In the ETL scenarios, data flows from the storage layer into a compute layer dynamically allocated at the start of the workload. The planner learns the configuration when the system starts up.
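A toy sketch of the planner's problem: with coupled storage and compute the shard count is baked in when the data is loaded, while a separated, elastic system must ask how many compute nodes exist when the workload starts. The hash-distribution scheme here is an illustration only, not any particular product's method.

```python
# Toy illustration: rows are assigned to compute nodes by hashing a key.
# The planner must know the node count before it can distribute data.
from collections import Counter

def distribute(keys, num_nodes):
    """Map each key to a node id in [0, num_nodes)."""
    return {key: hash(key) % num_nodes for key in keys}

keys = [f"order-{i}" for i in range(12)]

fixed_plan = distribute(keys, num_nodes=6)     # classic shared-nothing: fixed
elastic_plan = distribute(keys, num_nodes=9)   # learned when compute spins up

print(Counter(fixed_plan.values()))
print(Counter(elastic_plan.values()))          # same data, different placement
```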

Figure 3. Multi-processing Compute Nodes

Figure 3 depicts the case where all three ETL scripts run together on a six-node cluster. Figure 4 shows each ETL script running on a dedicated cluster. Note that, just for fun, I have adjusted the configurations in Figure 4 to show that in the dedicated-system case, it is possible to size systems differently if there is an advantage to doing so. In a later post, I’ll discuss why this could be important (as right now, it may seem that in every case you would want 10,000 servers to complete every job in 10 seconds).

Figure 4. Dedicated Compute Nodes with Separated Storage

A couple of closing remarks. First, I cannot imagine why anyone would not run ETL scripts on a scalable cloud platform. The ability to scale up to reduce runtimes at no extra cost is remarkable (and I’m not sure that I have ever used that word in one of my blogs). Next, I cannot see why anyone who could run ETL in the cloud would not run each script in a dedicated cloudy configuration. If the issue is that your ETL product is not cloud-native and does not separate storage from compute, then get a product that does (or use a cloud database and ELT).

Finally, here is a post from five years ago that anticipates the separation of storage from compute.

Next time: I’ll take this down a notch to talk about workloads smaller than an ETL batch job and consider how to run big data queries in the cloud.

A Short DW DBMS Market History: HANA, Oracle, DB2, Netezza, Teradata, & Greenplum

Here is a quick review of ten years of data warehouse database competition… and a peek ahead…

Maybe ten years ago Netezza shook up the DW DBMS market with a parallel database machine that could compete with Teradata.

About six years ago Greenplum entered the market with a commodity-based product that was competitive… and then added column store to make it a price/performance winner.

A couple of years later Oracle entered with Exadata… a product competitive enough to keep the Oracle faithful on an Oracle product… but nothing really special otherwise.

Teradata eventually added a columnar feature that matched Greenplum… and Greenplum focussed away from the data warehouse space. Netezza could not match the power of columnar and could not get there so they fell away.

At this point Teradata was more-or-less back on top… although Greenplum and the others chipped away based on price. In addition, Hadoop entered the market and ate away at Teradata’s dominance in the Big Data space. The impact of Hadoop is well documented in this blog.

Three-to-four years ago SAP introduced HANA and the whole market gasped. HANA was delivering 1000X performance using columnar formats, memory to eliminate I/O, and bare-metal techniques that effectively loaded data into the processor in full cache lines.

Unfortunately, SAP did not take advantage of their significant lead in the general database markets. They focussed on their large installed base of customers… pricing HANA in a way that generated revenue but did not allow for much growth in market share. Maybe this was smart… maybe not… I was not privy to the debate.

Now Oracle has responded with in-memory columnar capability and IBM has introduced BLU. We might argue over which implementation is best… but clearly whatever lead SAP HANA held is greatly diminished. Further, HANA pricing makes it a very tough sell outside of its implementation inside the SAP Business Suite.

Teradata has provided a memory-based cache under its columnar capabilities… but this is not at the same level of sophistication as the HANA, 12c, BLU technologies which compute directly against compressed columnar data.

Hadoop is catching up slowly, and we should expect that, barring some giant advance from the commercial space, they will reach parity in the next 5 years or so (they will claim parity sooner… but if we require all of the capabilities offered to be present, there is just no way to produce mature software any faster than 5 years).

Interestingly there is one player who seems to be advancing the state of the art. Greenplum has rolled out a best-in-class optimizer with Orca… and now has acquired Quickstep which may provide the state-of-the-art in bare metal columnar computing. When these come together Greenplum could once again bounce to the top of the performance, and the price/performance, stack. In addition, Greenplum has skinnied down and is running on an open source business model. They are very Hadoop-friendly.

It will be interesting to see if this open-source business model provides the revenue to drive advanced development… there is not really a “community” behind Greenplum development. It will also be interesting to see if the skinny business model will allow for the deployment of an enterprise-level sales force… but it just might. If Pivotal combines this new technology with a focus on the large EDW market… they may become a bigger player.

Note that it was sort of dumb luck that I posted about how Hadoop might impact revenues of big database players like Teradata right before Teradata posted a loss… but do not overthink this and jump to the conclusion that Teradata is dying. They are the leader in their large space. They have great technology and they more-or-less keep up with the competition. But skinnier companies can afford to charge less, and Teradata, which grew up in the days of big enterprise software, will have to skinny down like Greenplum. It will be much harder for Teradata than it was for Greenplum… and both companies will struggle with profitability for a while. But it is these technology and market dynamics that give us all something to think about, blog about, and talk about over beers…

Hadoop and Company Financial Performance

I have posted several times about the impact of the Hadoop eco-system on several companies (here, here, here, for example). The topic came up in a tweet thread a few weeks back… which prompts this quick note.

Four years ago the street price for a scalable, parallel, enterprise data warehouse platform was $US25K-$US35K per terabyte. This price point provided vendors like Teradata, Netezza, and Greenplum reasonable, lucrative margins. Hadoop entered the scene and captured the Big Data space from these vendors by offering 20X slower performance at 1/20th the price: $US1K-$US5K per terabyte. The capture was immediate and real… customers who were selecting these products for specialized, very large, 1PB-and-up deployments switched to Hadoop as fast as possible.

Now, two trends continue to eat at the market share of parallel database products.

First, relational implementations on HDFS continue to improve in performance and they are now 4X-10X slower than the best parallel databases at 1/10th-1/20th the street price. This puts pressure on prices and on margins for these relational vendors and this pressure is felt.

In order to keep their installed base of customers in the fold these vendors have built ever more sophisticated integration between their relational products and Hadoop. This integration, however, allows customers to significantly reduce expense by moving large parts of their EDW to an Annex (see here)… and this trend has started. We might argue whether an EDW Annex should store the coldest 80% or the coldest 20% of the data in your EDW… but there is little doubt that some older data could satisfy SLAs by delivering 4X-10X slower performance.

In addition, these trends converge. If you can only put 20% of your old, cold data in an Annex that is 10X slower than your EDW platform then you might put 50% of your data into an Annex that is only 4X slower. As the Hadoop relational implementations continue to add columnar, in-memory, and other accelerators… ever more data could move to a Hadoop-based EDW Annex.

I’ll leave it to the gamblers who read this to guess the timing and magnitude of the impact of Hadoop on the relational database markets and on company financial performance. I cannot see how it cannot have an impact.

Well, actually I can see one way out. If the requirement for hot data that requires high performance accelerates faster than the high performance advances of Hadoop then the parallel RDBMS folks will hold their own or advance. Maybe the Internet of Things helps here…. but I doubt it.

A New Way of Thinking About EDW Federation

There is a new way to think about data warehouse architecture. The Gartner Group calls it a logical data warehouse and it uses database federation to dynamically integrate a universe of data warehouses, operational data stores, and data marts into a single, united, structure. This blog has suggested that there is a special case of the logical data warehouse that uses Hadoop to provide a modern data warehouse architecture with significant economic advantages (see here, here, and here).


This is the second post inspired by my chats with Bityota… and sort of, but not altogether, commercial in nature (the first post is here). That is, Bityota will use these posts in their collateral… but you won’t see foam about their products in the narrative below.

– Rob


The economics are driven by the ability, through database federation, to place tables on a less expensive database platform. In short, the aim is to place data on the least expensive platform that still provides enough performance to satisfy service level agreements (SLAs). In the case of the modern data warehouse architecture this means placing older, colder, data on Hadoop where the costs may be $1000/TB instead of having all of the data in a single data warehouse platform at a parallel RDBMS price point of $35,000/TB. Federation allows these two layers to seem as one to any program or end-user.
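A quick illustration of the economics using the price points above; the 80/20 cold/hot split and the 1 PB warehouse size are assumptions for illustration.

```python
# Blended platform cost for a 1 PB warehouse using the price points above.
# The 80% cold / 20% hot split is an assumption for illustration.
TOTAL_TB = 1000
RDBMS_PER_TB = 35_000     # parallel RDBMS price point, dollars per TB
HADOOP_PER_TB = 1_000     # Hadoop price point, dollars per TB

all_rdbms = TOTAL_TB * RDBMS_PER_TB
federated = 0.2 * TOTAL_TB * RDBMS_PER_TB + 0.8 * TOTAL_TB * HADOOP_PER_TB

print(f"everything on the RDBMS:      ${all_rdbms:,}")      # $35,000,000
print(f"federated 20% hot / 80% cold: ${federated:,.0f}")   # $7,800,000
```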

This approach may be thought of as a data life cycle infrastructure that has significantly more economic power than the hardware-based life cycles suggested by database vendors to date. Let’s consider some of the trade-offs that define the hardware approach and the Hadoop-based approach.

The power of Hadoop federation comes from its ability to manage data placement at a macro level. Here, data is placed appropriately into separate database management systems running on differentiated hardware so that the economics of the entire infrastructure (hardware, software, and network) can be optimized. It is even possible to add a third or fourth level to provide more fine-grained economic optimization. But this approach comes at a cost. Each separate database optimizes queries at the database level. Despite advances in federation software technology, this split optimization cannot fully optimize many queries. The optimization is not poor, but it is not optimal. The global optimization provided by a single DBMS will almost always out-perform federated optimization.

The temperature-based optimization touted by some warehouse vendors provides good global optimization. To date a single DBMS must run on a homogeneous hardware platform with a single price point. Queries run optimally but the optimization can only twiddle around hardware details placing data in memory or on an SSD device or on the faster portion of a disk platter.

Figure 7. Federated Elastic Shared-nothing IaaS

To eliminate this unfortunate trade-off (good optimization over minor hardware capabilities versus fair optimization over the complete hardware eco-system), we need a single DBMS that can run queries over a heterogeneous mix of hardware. We need a single database management system, with global query optimization, that can execute queries over multiple layers of hardware deployed in the cloud. We can easily imagine a multi-layered data warehouse with queries federated over several AWS offerings: hotter data on fast nodes that are always available, warm data on less expensive nodes that are always available, and cold historical data on inexpensive nodes that come online in processing windows so that you pay for those nodes only when you need them. Figure 7 shows a modern data warehouse deployed across an Amazon cloud.

This different way of thinking about a logical data warehouse is exciting… and a great example of how cloud computing may change everything in the database and data warehouse space.