More on Cloud Data Elasticity

The last post (here) demonstrated how scalability in the cloud provides the ability to reduce runtimes from days or hours to minutes without raising the cost. We used a batch ETL service running three ETL scripts as an example. We then showed how the same use of scalability could allow us to break the batch ETL service into discrete jobs and remove contention between the three scripts to further improve throughput and reduce costs.

There is a great deal more to say about how the cloud changes our thinking about applying resources to our big data workloads. I promised to get to a discussion of database work, but that will have to wait another week. Sorry. Let’s carry on.

In the scenario where we ran the three jobs together, we assumed that all three scripts ran for the same amount of time, so we shut down the ETL servers when the last script completed and stopped billing at the $1152 price point. In reality, it is more likely that each script would take more or less time than the others.

When we run each script on its own set of servers, we can stop billing as each job ends. If one of the jobs takes only 2.5 hours to complete, and your cloud provider bills in minute rather than hour increments, then the cost of that single job drops from $288 to $240. Over a year, these savings add up, and you can ask for a raise.
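
To make the arithmetic concrete, here is a minimal sketch of the billing math. The effective rate of $96 per cluster-hour is my assumption, backed out from the $288 and $240 figures above (a three-hour job at $96/hour costs $288; 2.5 hours costs $240):

```python
import math

HOURLY_RATE = 96.0           # dollars per cluster-hour (assumed, from $288 / 3 hours)
runtime_hours = 2.5          # the fastest of the three ETL scripts

# Billed in whole-hour increments: 2.5 hours rounds up to 3 hours.
cost_by_hour = math.ceil(runtime_hours) * HOURLY_RATE

# Billed in whole-minute increments: 150 minutes at $96/60 per minute.
cost_by_minute = math.ceil(runtime_hours * 60) * (HOURLY_RATE / 60)

print(f"billed by the hour:   ${cost_by_hour:,.0f}")    # $288
print(f"billed by the minute: ${cost_by_minute:,.0f}")  # $240
```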

So, we have added a third point: by using scalability, and by scheduling discrete workloads on dedicated cloud servers, you can scale performance and significantly reduce costs.

You have no doubt noticed that the scenarios describe scalability without any impact on the data. We assume that compute can scale independently of storage and that re-sharding of the data is not required. Every database has its own special sauce for managing data across storage. All we assume here is that the file scan of the data, be it rows or columns, occurs in the compute nodes after the data is read from storage. In a future post, we will see how modern storage systems affect this assumption without changing the economic benefits described here. Figure 1 is a classic, if overly simple, depiction of this separation.

Figure 1. Compute Separate from Storage
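
To illustrate the assumption, here is a schematic sketch (the storage-read helper and file name are hypothetical stand-ins, not any particular product's API): the compute node pulls raw bytes from the storage layer and performs the scan and filter locally, so nothing about how storage lays out the data has to change when compute scales.

```python
# Schematic sketch of "the scan happens in compute, after the read from storage".
import csv, io

def read_from_storage(path: str) -> bytes:
    """Hypothetical storage-layer read; a stand-in for an object-store GET."""
    with open(path, "rb") as f:
        return f.read()

def scan(path: str, predicate):
    """The file scan runs here, on the compute node, over whatever was read."""
    raw = read_from_storage(path)
    rows = csv.reader(io.StringIO(raw.decode("utf-8")))
    return [row for row in rows if predicate(row)]

# e.g. keep rows whose third column is non-empty:
# matching = scan("orders_2020_01.csv", lambda row: len(row) > 2 and row[2] != "")
```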

The database logic here is straightforward. When compute and storage are tied together, the query planner knows precisely, and in advance, how many parallel nodes are in play, and the system can spread data across those nodes in every step that requires data distribution. The number of nodes is fixed for both storage (processing IO) and compute. Figure 2 shows a system with storage and compute connected.

Figure 2. Classic Shared-nothing Connected Storage and Compute
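
As a rough illustration of what "spreading data across those nodes" means in the coupled case: the node count is baked in when the cluster is built, so a stable hash of the distribution key decides where every row lives and where every redistribution step sends it. The key value and node count below are made up for the sketch:

```python
# Sketch: shared-nothing distribution with a fixed, known node count.
import zlib

NODE_COUNT = 6   # fixed when the cluster is built; the planner can rely on it

def node_for(distribution_key: str) -> int:
    """Every row with this key lives on (and is redistributed to) the same node."""
    return zlib.crc32(distribution_key.encode("utf-8")) % NODE_COUNT

# e.g. rows for customer "C-1042" always land on the same one of the six nodes:
# print(node_for("C-1042"))
```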

In a system with storage and compute separated, the query planner has to ask how many nodes are available for data distribution. I am using the term “database” here, but any parallel data processing system that shards data with a processing plan has some form of this logic.

In the ETL scenarios, data flows from the storage layer into a compute layer dynamically allocated at the start of the workload. The planner learns the configuration when the system starts up.
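
In the separated case, the only change to the earlier sketch is where the node count comes from: the planner asks the provisioning layer when the workload starts and then distributes exactly as before. The discovery call below is hypothetical, a stand-in for whatever your cloud platform actually exposes:

```python
# Sketch: the same distribution logic, but the node count is discovered at startup.
import zlib

def discover_compute_nodes() -> list[str]:
    """Hypothetical call to the provisioning layer; returns the nodes
    allocated for this workload (hard-coded here for illustration)."""
    return ["node-0", "node-1", "node-2", "node-3"]

nodes = discover_compute_nodes()   # learned when the workload starts, not at install time

def node_for(distribution_key: str) -> str:
    return nodes[zlib.crc32(distribution_key.encode("utf-8")) % len(nodes)]

# print(node_for("C-1042"))   # the same planner logic, sized to today's cluster
```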

Figure 3. Multi-processing Compute Nodes

Figure 3 depicts the case where all three ETL scripts run together on a six-node cluster. Figure 4 shows each ETL script running on a dedicated cluster. Note that, just for fun, I have adjusted the configurations in Figure 4 to show that in the dedicated case it is possible to size the systems differently if there is an advantage to doing so. In a later post, I’ll discuss why this could be important (right now, it may seem that in every case you would want 10,000 servers to complete every job in 10 seconds).

Figure 4. Dedicated Compute Nodes with Separated Storage

A couple of closing remarks. First, I cannot imagine why anyone would not run ETL scripts on a scalable cloud platform. The ability to scale up to reduce runtimes at no extra cost is remarkable (and I’m not sure that I have ever used that word in one of my blogs). Next, I cannot see why anyone who could run ETL in the cloud would not run each script in a dedicated cloudy configuration. If the issue is that your ETL product is not cloud-native and does not separate storage from compute, then get a product that does (or use a cloud database and ELT).

Finally, here is a post from five years ago that anticipates the separation of storage from compute.

Next time: I’ll take this down a notch to talk about workloads smaller than an ETL batch job and consider how to run big data queries in the cloud.

A Story about Teradata, Advanced Architectures, and Distributed Applications

Here is a short story about distributed versus monolithic application design…

Around 1999, Dan Holle and I started thinking about how to better deploy applications that used distributed computing. To be fair… Dan schooled me on his ideas for a better approach. We started with a thing called XML, which had not yet released a V1 standard. Application subroutines stored state in XML, and we persisted that XML as best we could with the tools of the time.

Around 2002 he and I approached Teradata with some very advanced functionality, and some code, that was built on this distributed foundation… and Dan eventually took a job at Teradata to develop this into one or more products.

Unfortunately, the Teradata product management powers could not see the sense in it. They were building apps the old way and would continue to do so. Dan was caught up in several years’ worth of battling over the proper architecture, and our real intellectual property, the advanced functionality, was lost in the fray. The product management folks insisted that a monolithic architecture was the right way to go, and so it went…

This was an odd battle as you might assume that Teradata was a thought-leader in distributed computing…

As you might guess… Dan eventually left Teradata a little bruised by the battle.

Today we realize that a monolithic application architecture is a poor substitute for a distributed architecture. The concepts we brought to Teradata in 2002 underpin the 12-factor app formulation that is best practice today (see here).

So I’d like to point out that Dan was right… the Teradata Product Management folks were wrong… and their old-school thinking robbed Teradata of the chance to lead in distributed applications in the same manner that they once led in parallel database computing (today they lead in the market… but not in tech). It was a lost opportunity. Note that just recently Teradata acquired some new engineering leadership who understand the distributed approach… so they will likely be OK (although the old dweebs are still there).

I hope that this story gives pause to all companies that depend on software for differentiation. When the silly dweebs, the product managers and the marketeers, are allowed to exercise authority over the very smart folks… and when the executive management team becomes so divorced from technology that it cannot see this… your company is in trouble. When your technology decisions become more about marketing than about architecture, your tech company is headed for trouble.

PS

This blog is not really about Teradata. I’m just using this story to make the point. Many of the folks who made Greenplum great were diminished and left. The story repeats over and over.

This is a story about how technology has changed from being the result of organic growth in a company to a process whereby the tech leaders become secondary to the dweebs… then they leave… and so companies have to buy new tech from the start-up community. This process forces fundamental technology through a filter devised by venture capitalists… and often fundamental technology is difficult to monetize… or the monetization process fails despite sound underlying tech.

It is a story that points out one of the lessons from Apple, Google, and other serially successful tech companies. Executive leadership has to stay in touch with the smart folks… with the geeks… with the technological possibilities rather than with the technological “as is”.

Finally, note that Teradata might have asked me to join back up in 2002… and they did not. So not all of their decisions then were mistaken.