A Segue from ETL to DB

This is a short post to segue to the point where I’ve been headed all along. Figure 1 recasts the picture from the last post, showing storage separated from compute all the way from ETL/ELT through to the data warehouse. It should be a familiar picture to Snowflake architects who may have implemented multiple DW instances against a single storage layer.

Figure 1. Multiple DW Compute Instances Decoupled from Shared Storage

I’ll not give away the next article, other than to say that it derives from the same concepts just discussed.

Since this is so short, I will add a tangent just for fun.

Here is a post from seven years ago that anticipates how the cloud impacts DW performance. When you combine this with the economics presented in the last two posts (here and here), suggesting that performance is free, you can begin to see why database tuning is no longer an urgent requirement for a data warehouse.

When you tune, you specialize for a particular workload, and if your workload changes, the tuning wears thin. In other words, I now believe that you should build a robust data warehouse with minimal tuning and use cloud compute to get performance. Skipping tuning lets you add a new workload without readjusting; tuning makes your database fragile in the face of change.

Cloud-native Computing, Workloads, and Elasticity

Over the next several weeks, I’ll share my perspective on current best practices for big data, which is the term I’ll use to blend thinking about analytic data systems: data lakes, data warehouses, data marts, and operational data stores. On this journey, I’ll consider how analytic workloads are changing with AI and machine learning, discuss data architecture and virtual database technology, preview new hardware technologies (memory and processor), and, importantly, review the implications of cloud computing on the whole kit and caboodle.

In this post, I need to start laying a foundation for discussing the cloud. We will see how scalable cloud computing makes performance “free,” and then we will see how dedicated resources increase efficiency and further reduce costs. The next post builds on these concepts to describe where cloud database products will evolve.

To start, let’s describe a workload that executes three ETL scripts and consider the result when the three scripts run as separate workloads. Imagine an ETL batch job. Batch jobs are insulated. They read data from one or more source systems, perform a series of integration steps as programs, and then load the results using a data load utility into a lake, warehouse, or mart. By “insulated,” I mean that the compute resources required for the integration steps do not need to interact with other systems.

If you had a dedicated server to run a single ETL script, as long as it could read the raw data from the source and the reference data required for integration, there would be no need for connectivity to other systems. With all of the data and the ETL scripts in hand, the process could execute stand-alone. If you needed to run two ETL scripts at the same time, you could deploy the software on two distinct sets of servers; and three scripts could execute on three individual servers or server clusters. As long as you replicate the ETL software and the required data each time, there would be no issue with any number of distinct ETL systems.
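To make the insulated, self-contained property concrete, here is a minimal sketch of such an ETL step in Python. The file names, columns, and join logic are hypothetical placeholders; the point is that the job reads its source and reference data, transforms it, and writes a load-ready file without ever talking to another running system.

```python
# A minimal sketch of an insulated ETL step (hypothetical files and columns):
# read the raw source and the reference data, enrich the rows, and write a
# load-ready file. No connection to any other system is required.
import csv

def run_etl(source_path, reference_path, output_path):
    # Load the reference data required for integration into memory.
    with open(reference_path, newline="") as f:
        region_by_customer = {row["cust_id"]: row["region"] for row in csv.DictReader(f)}

    # Stream the raw source rows, enrich them, and write the load file.
    with open(source_path, newline="") as src, open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["cust_id", "amount", "region"])
        writer.writeheader()
        for row in csv.DictReader(src):
            writer.writerow({
                "cust_id": row["cust_id"],
                "amount": row["amount"],
                "region": region_by_customer.get(row["cust_id"], "UNKNOWN"),
            })

if __name__ == "__main__":
    run_etl("sales_raw.csv", "customer_reference.csv", "sales_load.csv")
```

Because the job is self-contained, you can run one copy or a hundred copies on as many servers or containers as you care to provision.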

In a cloud environment, you could easily spin up three distinct clusters, run the scripts, and spin them back down, paying for only what you use. The ability to dynamically acquire and release resources in the cloud is called “elasticity.” It is a characteristic of cloud-native applications, not of just any application running in the cloud. That is, if you design your ETL software to be self-contained and deploy it using a cloud operating system that manages resources, you can take advantage of cloud elasticity. Tools like Docker containers and Kubernetes make this possible.

To continue, imagine that the three ETL jobs run against large datasets overnight, and each takes three hours to complete on a dedicated cluster of twenty-four servers. If all three jobs run simultaneously on a single shared cluster, they complete in twelve hours. This estimate assumes that 25% of the time the three tasks are competing for the compute resources of the cluster. If the jobs are CPU-bound, this would be an optimistic assumption, and the runtime might be longer.

The scripts run overnight to gain access to dedicated resources. During the day, the cluster runs queries, and contention between the batch scripts and the queries for CPU is hard to manage. Finally, let’s imagine that the cost of these twenty-four servers in the cloud is $4 per server per hour with software or $1152 per day to run the three ETL scripts, not counting storage server costs ($4/server per hour times 24 servers times 12 hours equals $1152).

If our ETL programs are scalable, we could spin up twice as many servers and complete the jobs in 6 hours. Note that the cost is still $1152 ($4/server * 48 * 6 = $1152) and we could double it again to complete the job in 3 hours at the same price ($4 * 96 * 3 = $1152). This math continues as far as you would like to go as long as your cloud provider will let you pay in ever-smaller increments.

This example makes the first important point: if you have self-contained and scalable workloads, you can scale up in the cloud to reduce runtimes at no extra cost.

Now let’s consider what happens if we run each script on a separate cluster. With dedicated servers, each job takes three hours to complete, and the cost per job is $4/server times 24 servers * 3 hours or $288. If we spin up 72 servers and run each script as a separate process, all three complete in three hours for $864. The savings are the result of removing the contention between the three jobs and giving each job dedicated resources.
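The arithmetic above is easy to check. Here is a small sketch, using only the made-up prices and runtimes from this scenario, that computes the cost of the shared cluster, the scaled-up shared cluster, and the three dedicated clusters.

```python
# Worked cost arithmetic for the made-up scenario: $4 per server per hour,
# and three ETL jobs that each need 3 hours on a dedicated 24-server cluster.
RATE = 4.0       # dollars per server per hour
SERVERS = 24     # servers in one dedicated cluster
JOB_HOURS = 3    # runtime of one job with no contention

def cost(servers, hours, rate=RATE):
    return rate * servers * hours

# Three jobs sharing one 24-server cluster, with ~25% of the time lost to contention.
shared = cost(SERVERS, 12)                    # 12 hours -> $1,152

# Scale out the shared cluster: double the servers, halve the runtime, same cost.
scaled_2x = cost(SERVERS * 2, 6)              # $1,152
scaled_4x = cost(SERVERS * 4, 3)              # $1,152

# Run each job on its own dedicated 24-server cluster: no contention at all.
dedicated_per_job = cost(SERVERS, JOB_HOURS)  # $288
dedicated_all = 3 * dedicated_per_job         # 72 servers for 3 hours -> $864

print(shared, scaled_2x, scaled_4x, dedicated_per_job, dedicated_all)
# 1152.0 1152.0 1152.0 288.0 864.0
```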

Even though it may seem obvious, we are so used to sharing computers that we forget that contention is wasteful. Whether we contend for a disk drive to read or write, for memory, for CPU cache (L3, L2, and L1), for instruction fetch, or for instruction execution, the cost of managing contention adds inefficiencies. More on that in a couple of posts, where I want to talk about how databases can reduce contention, how processor technology helps, and especially how technology like Intel Optane may play a role in the future.

Let me wrap up with a couple of caveats regarding this made-up scenario.

First, if the scripts are IO-bound rather than CPU-bound, they may execute together with less contention. But ETL programs that stream data between steps tend to be CPU-bound, since they do not perform IO to spool intermediate results; the contention will still be there, and the cost will reflect it. The more completely CPU-bound the scripts are when they run in-memory, the more significant the contention and the larger the cost difference.

Second, there are startup costs associated with cloud clusters. Spinning up a machine will take several minutes, and if there are more servers, then there will be more cost associated with the startup. We will consider this more in the next post.

So far, we have made two essential points:
If we have a scalable system and a self-contained workload, then we can deploy cloud compute at scale to reduce runtimes at no extra cost. There is no reason to ever suffer through long-running batch jobs.
If we have multiple units-of-work that, in the past, we would have run concurrently and allowed to compete for a finite number of computers, then with cloud computing we can give each workload discrete resources and scale at a reduced cost.

In the next post, we will discuss smaller units-of-work in a database. With this foundation, we will then be able to talk about the power provided by products like Snowflake, and we will be able to show a path for cloud databases to become even more efficient.

Refactoring Databases P2.75: REST and ORM Thoughts

I’m riffing on database design and modeling in an agile methodology. In the previous post here I suggested that we might cheat a little and design database schemas 2-4 weeks in advance of their use… further, I suggested that to maximize agility we should limit the design to a conceptual schema. In this post I’m going to limit the scope a little more by considering the use of a database in a RESTful application.

I am not going to fully define the REST architecture here… but if you are a systems architect you should know this inside and out… what I will say is that sometimes, in order to build a RESTful application, you will find yourself using the database to store the application state. What I want to say here is that when the application stores state in a database that must persist across boundaries in your business process… but that is not required to persist across business processes… then you do not need to model this. Let the programmers do their worst. In this case database tables play the role of working storage (a very old term that dates me… at least I did not say “Data Division”)… and programmers need to be completely free to add and subtract data elements in their programs as required.
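To illustrate what I mean by working storage, here is a hedged sketch (the table, columns, and tokens are all hypothetical): a single un-modeled table, keyed by a state token the client carries between REST calls, that the programmers can reshape at will as their process changes.

```python
# A sketch of un-modeled "working storage" for a RESTful app: state that must
# survive across steps of one business process is parked in a simple table,
# keyed by a token the client carries between calls. Names are hypothetical.
import json
import sqlite3
import uuid

db = sqlite3.connect("working_storage.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS process_state (
        state_token TEXT PRIMARY KEY,
        payload     TEXT,              -- free-form JSON; no schema governance
        updated_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_state(payload, token=None):
    token = token or str(uuid.uuid4())
    db.execute(
        "INSERT OR REPLACE INTO process_state (state_token, payload) VALUES (?, ?)",
        (token, json.dumps(payload)),
    )
    db.commit()
    return token

def load_state(token):
    row = db.execute(
        "SELECT payload FROM process_state WHERE state_token = ?", (token,)
    ).fetchone()
    return json.loads(row[0]) if row else None

# One step of the process stores state; a later step, in another request, reads it.
token = save_state({"step": "address_entered", "cart": [101, 202]})
print(load_state(token))
```

When the business process completes, the rows can simply be purged… no modeling, no governance required.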

I also have a question for the architects out there…

When programmers touch relational data they typically go through one or more abstractions: maybe they call an XML-based RESTful web service… or maybe they write the service themselves and in the service call an ORM-thingie like Hibernate. I’ve seen terrible schema designs that result from programmers building to an object model without looking at the resulting relational model out of the other end. So… when we assign data architects to build a relational schema to underlie an object-oriented programming language… should we architect up the stack and deliver the relational schema, the ORM layer, and/or the RESTful CRUD services for the objects? We are starting down that path… but I thought that I would ask…

Modernizing IT P1

Hiya,

My apologies for neglecting this blog… The SSA CIO job takes time. I have several database-specific blogs drafted that I hope to get to over the holidays…

In the meantime here is a CIOish post on the problem of modernizing old IT infrastructure:

https://cio.gov/modernizing-federal-it-part-1-catching-up-and-jumping-ahead/

While it is not specifically about database infrastructure… I think that folks with database systems designed and developed over 20 years ago… maybe with schemas that are not subject-oriented or stuck on old or very proprietary database architecture… might find the thinking here relevant.

Finally, I want to wish you all a happy end to 2015… and to wish for you a happy 2016…

A Trend in Systems Architecture

I composed the video below on a contract for Intel… but they were kind enough to let me tell the story with only a light promotional touch. I think that you will find the story interesting as it describes 20+ years of systems architecture and suggests where we may well be headed in the next 5 years…

The bottom line here is that we developed a fully distributed systems architecture over the course of 15 years in order to use the economics of microprocessors. The distributed architecture was required because no micro-based server, and no small cluster of micro-based servers, could manage an enterprise-sized workload. We had to gang micro-processors together to solve the problem. Today we can very nearly solve for an enterprise workload on a small cluster of 32-core or 64-core processors… so distribution may no longer be a driving requirement.

I’ll post a couple more notes on this video over the next few weeks. There are two possible endings to the video and we’ll explore these future states.

Afterword

About three years ago I started with SAP and early in my second week I was asked to appear before Hasso Plattner and Vishal Sikka. In the five minutes before I walked in I was informed that the topic was a book they wanted me to ghost-write for them. I was flabbergasted… I had never written a book… but so it goes. In the meeting I was told that the topic was “HANA for CIOs” and I was handed a list of forty or fifty keywords… topics to be included in the narrative. We agreed that we would meet again to consider content more fully. Despite several requests… that was the last meeting I had on this subject and the project dissolved.

In the month or so before it became clear that there was no real interest in the project I struggled to figure out how to tell a story about HANA that would be compelling… rather than make the book a list of technical features. The story in the video, with the HANA ending that I will post next, was to be the story that opened the book.

More thinking on Specialized Databases

Recently I posted (here) some thinking that suggested that the cost of replicating data into specialized databases might outweigh the benefits of specialization. This post will present a counter view and try to sort out when a specialized database might make sense.

In the ZDNet post here, “Look at What Google and Amazon are doing with Databases: That’s your future,” Toby Wolpe quotes Neo Technology CEO Emil Eifrem, who suggests that:

“The era of the one-size-fits-all database is over. It used to be when I grew up as a developer that for the architect in the project, when it came to choosing the bottom layer of the stack — the persistence layer — the choice was Microsoft, or IBM, or Oracle, or Sybase. It was a vendor choice.

They were all the same type of database. But that era has gone forever and it will never come back because data is just so big and so irregularly shaped now that you’re always going to be able to get a hundred times improvement, a thousand times improvement, a million times improvement if you get a data technology that is shaped like the shape of your data.”

I, on the other hand, have suggested that a Swiss Army knife DBMS that solves many problems from a single data source… thereby eliminating the cost and complexity of data replication and data synchronization… might provide a sensible choice for most commercial applications.

Actually I agree with Eifrem and Wolpe in many respects… but there is a difference in our starting assumptions. Let me be clear first about where I strongly agree.

When data volumes grow to web-scale… to Google-scale or Amazon-scale… then the inefficiencies of one-size-fits-all amplify and become intractable… so with a specialized DBMS you might indeed see 100X, 1000X, or more performance advantage and gain a competitive edge from replication and specialization.

But a lot of core data is not Big Data. This is where we do not seem to agree. While our companies all aspire to have a customer database that is in the petabyte range… it is just not usually the case. Likewise we aspire to have a transaction database requiring petabyte scale… but it just is not the case in most businesses even if you keep years and years of history.

Let’s consider graph databases… Maybe customer data should be in a graph database to specialize it for processing relationships. But this is likely to make it sub-optimal for many other processes… in fact it is the thesis of the ZDNet article that it will make it sub-optimal for many other processes… and so replication to more specialized databases is the only alternative.

How might we handle this relationship problem in a generalized DBMS? HANA, for example, can form graphs in-memory from data shaped into columns… unfolding the graphDB blade from the Swiss Army knife when required but storing the data in a generalized shape otherwise.
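As a toy illustration of the idea (this is not HANA’s actual graph API, just a sketch with hypothetical data), the snippet below derives an in-memory graph on demand from relationship data stored in an ordinary relational shape and walks it to answer a relationship question.

```python
# A toy illustration (not HANA's actual API): forming a graph on demand from
# data stored in a generalized, relational shape. The relationship rows are
# hypothetical; the point is that the graph is derived, not separately stored.
from collections import defaultdict, deque

# Customer relationships kept as ordinary rows: (customer, related_customer).
relationship_rows = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin"),
]

# Unfold the "graph blade": build an adjacency list in memory when needed.
graph = defaultdict(set)
for a, b in relationship_rows:
    graph[a].add(b)
    graph[b].add(a)

def related_within(start, hops):
    """Find all customers reachable from `start` in at most `hops` steps."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(related_within("alice", 2))   # {'bob', 'carol', 'erin'}
```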

It may be true that there could be orders of magnitude advantage for big data shaped into a specialized graphDB form… But if your customer database is in the terabyte range or less, then the advantage may be negligible… or at least the advantage may not justify the cost of replication into two forms.

And think about the implications of specializing big data. Google replicates tens of petabytes of data into multiple shapes to gain competitive advantage… and ten petabytes specialized and replicated ten times is really really big data.

So I agree with parts of the ZDNet post… big data companies are likely to be pushed by the competition to store the data multiple times in specialized replicated big databases… and for this you will look to Google, Amazon, Netflix, and the like for database technology. But most enterprises will be able to store core data in generalized databases… and will extend into big data realms only as machine-to-machine transactions and/or the Internet of Things drive them there… and then they will extend their data architectures rather than replicate again and again.

A Modern Data Warehouse Architecture: Part 3 – Build an EDW Annex

In the first two posts of this series (here and here) I first suggested that Hadoop could be effectively used as the platform for staging and then suggested that a modern warehouse would have a federation layer that turned it into a logical data warehouse. Figure 3 depicts this extended architecture.

Figure 3. A Logical EDW

But if we have both Hadoop and a federation layer implemented… and we recognize the economics associated with moving data to Hadoop… Hadoop provides a 5X-50X price advantage over a commercial very large DBMS product… and we can move data from the expensive environment to the low-cost environment without impacting any applications… then we have the opportunity to move governed EDW data to Hadoop and place it into a Hadoop EDW Annex. Figure 4 shows this.

Figure 4. An EDW Annex

Now you might suggest that there is an impact… Hadoop will be significantly slower than a commercial EDW platform (for now…). But experienced EDW architects understand that in the classic architecture we had to co-locate data in a single database to join it. So we put all of the data, hot and cold, in our EDW even though the service levels required for queries that touch old, cold historical data did not justify the power and price of the EDW infrastructure. We had to co-locate the data, but we did not need EDW service levels for all of it. We knew, if only implicitly, that most EDW queries touch a small subset of the data. Following the ratio suggested by Teradata (see the reference here) that 90% of the queries touch only 20% of the data, we can imagine a system where 80% of the data resides in Hadoop to service 10% of the queries… and only that 10% experiences Hadoop performance.
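Here is the arithmetic behind the Annex, as a hedged sketch. The 100 TB warehouse size is a hypothetical example, and the per-terabyte prices are the rough figures quoted elsewhere in this series.

```python
# Worked arithmetic for the EDW Annex, using the ratios and prices quoted in
# this series. The 100 TB warehouse size is a hypothetical example.
TOTAL_TB = 100              # hypothetical warehouse size
HOT_FRACTION = 0.20         # 20% of the data serves 90% of the queries
EDW_PER_TB = 30_000         # commercial EDW, mid-range of $15K-$50K/TB
HADOOP_PER_TB = 1_000       # Hadoop, roughly $1K/TB

hot_tb = TOTAL_TB * HOT_FRACTION
cold_tb = TOTAL_TB - hot_tb

all_in_edw = TOTAL_TB * EDW_PER_TB
with_annex = hot_tb * EDW_PER_TB + cold_tb * HADOOP_PER_TB

print(f"All data in the EDW:       ${all_in_edw:,.0f}")
print(f"Hot in EDW, cold in annex: ${with_annex:,.0f}")
print(f"Savings: {100 * (1 - with_annex / all_in_edw):.0f}%")
# All data in the EDW:       $3,000,000
# Hot in EDW, cold in annex: $680,000
# Savings: 77%
```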

I suggested this approach for Teradata here… but an architecture with an EDW Annex to store cleansed governed historical data works for any expensive RDBMS that can federate with Hadoop: Exadata, Netezza, Teradata, or HANA.

This concludes this series… sort of. I’ll post soon to express more about how this architecture provides long-term strategic value. I think that these three concepts (Hadoop as an EDW staging area, federation and logical data warehousing, and Hadoop as an EDW Annex) provide the foundation for a modern EDW… and I imagine that over the next several years this will become the reference architecture most of us will build to.

A Modern Data Warehouse Architecture: Part 2 – Make it a Logical Data Warehouse

In the first post of this series (here) I suggested that Hadoop could be effectively used as the platform for staging. Figure 2 describes the result. In this post I will extend the architecture by adding a data federation feature and turning the entire picture into a logical data warehouse… and then we’ll consider the consequences.

Figure 2. Add a Data Lake

Figure 3 shows this extension… but let’s quickly review what a federation fabric provides (for more detail on this please see the series that starts here).

Figure 3. Add a Federation Layer

First, the fabric allows any of the tables and files in the picture to be registered as “virtual” tables. This includes tables in the EDWs, in any marts, in one or more Hadoop systems, in the sandbox, and even in the source systems. From there, the fabric is viewed by any program in the BI, analytics, and apps layer as a single relational database composed of all of the registered tables. The fabric would consume ODBC and JDBC queries at a minimum and provide some level of data translation, function translation, and query translation to allow all of the virtual tables to be queried through the single SQL dialect offered by the fabric. Finally, the fabric would provide some measure of optimization to reduce the overhead of accessing these distributed systems. It is this optimization that is the main topic of the series I mentioned above.
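To make the fabric idea concrete, here is a toy sketch of federation in Python (not any particular vendor’s product): two independent databases are registered as virtual tables, and a join across them is executed in the fabric layer. Real fabrics add query translation, pushdown, and optimization; the table names and data here are hypothetical.

```python
# A toy federation layer: virtual tables registered over two separate
# databases, queried as if they were one system.
import sqlite3
import pandas as pd

# Two independent "systems": an EDW and a data mart (in-memory for the sketch).
edw = sqlite3.connect(":memory:")
mart = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE customer (cust_id INTEGER, name TEXT)")
edw.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
edw.commit()
mart.execute("CREATE TABLE sales (cust_id INTEGER, amount REAL)")
mart.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100.0), (1, 250.0), (2, 75.0)])
mart.commit()

# The "fabric": a registry mapping virtual table names to (connection, source query).
registry = {
    "v_customer": (edw, "SELECT * FROM customer"),
    "v_sales": (mart, "SELECT * FROM sales"),
}

def fetch(virtual_table):
    """Resolve a virtual table to its source system and pull the rows."""
    conn, query = registry[virtual_table]
    return pd.read_sql_query(query, conn)

# A federated join across the two systems, executed in the fabric layer.
result = (fetch("v_customer")
          .merge(fetch("v_sales"), on="cust_id")
          .groupby("name")["amount"].sum())
print(result)   # Acme 350.0, Globex 75.0
```

The program issuing the join never needs to know that the two tables live in different systems… which is exactly the insulation discussed below.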

One of the important implications of this… one that is often overlooked before implementation… is that the queries emitted from the fabric add workload to the underlying databases. If, in Figure 3, the top EDW is 100% busy servicing un-federated queries then adding a workload that joins that data to the 2nd EDW’s data will overburden the system. This is why it is not usually sound design to integrate source, OLTP systems into a fabric. The OLTP systems are not likely to be optimized in any way for the resulting workload.

But the upside of the fabric is significant. Consider:

  • If we just say “yes” and load ungoverned data into a sandbox, business users can immediately access that data and join it to dimensions and facts deployed elsewhere in the enterprise.
  • Rogue data marts can now be integrated back into the fold.
  • Redundant data deployed to allow joins within a single database instance can be eliminated and the joins can be federated. Note that federation is no silver bullet… there may be performance reasons for co-locating data… but you now can consider the trade-offs (I’ll post later on a way to federate and improve performance).
  • Database products can be retired without affecting the programs that access them. If you have an old data mart built on a product that you would rather not support or license… you can move the data and re-point the virtual tables without impacting the tools and applications at all.
  • Data can be relocated based on economics and/or for performance reasons… this will be the subject of the next blog in this series… but as a teaser, remember the economics… Hadoop costs $1K/TB (ok… $1K-$4K) hardware included and commercial databases cost much more.

There is one final advantage to this that is strategic and important enough to deserve to not be buried in the previous bullet list… All of us have seen the database product market move in the last 6+ years from a place where the choices were DB2, Oracle, SQL Server, or Teradata… to a market with those products plus Exadata, plus two flavors of SQL Server, plus Netezza, plus Greenplum, plus Vertica… and then plus Impala, plus Hive, plus Tez, plus Spark, plus MongoDB, plus Cassandra, plus plus plus. I think that it is impossible to place database bets today with the confidence that these bets will pay off five years from now… and certainly there are no bets good for ten years. If you are betting on infrastructure to support the Internet of Things the bets are more risky still. In my opinion a federation layer provides critical insulation from this chaos. With federation you can build applications knowing that you can acquire and retire database products and not affect the queries. IMO this insulation is a strategic imperative…

Sorry… I got a little excited… I don’t usually spew foam. Let me try again. IMO you should seriously consider the benefits of federation in both your EDW architecture and in your enterprise data architecture.

Part 3 of this series considers a new extension to the architecture by adding an EDW Annex… you can see it here.


A Modern Data Warehouse Architecture: Part 1 – Add a Data Lake

In this series I will present a perspective on how the classic EDW architecture is changing under the influence of new technologies, new requirements, and new economics.

Part 1 will show the first of three architectural changes by introducing a data lake into the picture… in Part 2 we’ll extend the picture by adding a logical EDW layer… and in Part 3 we’ll consider the implications of Hadoop as an EDW Annex.

Figure 1. A Classic EDW

Figure 1 describes a classic EDW architecture with sources feeding a staging area, transformations feeding cleaned and certified data to the EDW, and data consumed by analytic applications. In this picture we assume that data marts are part of the EDW fabric and that it is the responsibility of the applications to know which database to query. In most enterprises there are rogue data marts with ungoverned data and these are an issue… but more on that later.

Note that Figure 1 also represents an optional sandbox area where uncertified and un-governed data can be landed. This helps avoid paying the cost of data governance upfront… before the data has proven its value. A sandbox provides support for rogue data marts and cubes. Obviously the use of uncertified data introduces issues… but many believe, as I do, that the trade-off of agility over governance can be productive under the right circumstances.

Figure 2. An EDW with a Data Lake

Figure 2 represents a step towards a modern EDW architecture with a Hadoop-based data lake replacing the staging area as well as providing support for a sandbox.

A data lake provides all of the same capabilities as the staging area it replaced… it may be a landing zone for untransformed data… but it has several other important characteristics:

  • A data lake can hold raw data forever rather than store it temporarily.
  • A data lake has compute included, so it can execute transformations and become a single platform for staging and ETL.
  • A data lake has compute and tools included so it can be used to analyze raw data for trends and anomalies.
  • A data lake can easily store semi-structured and unstructured data.
  • A data lake can store big data.

Of course a staging area based on an RDBMS could do the same… but the economics are completely different. A Hadoop system provides storage and processing for as little as $1K/TB; the cost of EDW hardware and software ranges from $15K/TB to $50K/TB… 15X to 50X more.

In addition, the data lake provides a cost-effective, extensible platform for building more sandboxes. If, by using Hadoop, you lower the upfront cost of a sandbox to $1K/TB, then, combined with the cost reduction from postponing the development of data quality and governance processes, you may find yourself in a position where you can just say “yes” to requests to add new data. At $1K/TB it may be unnecessary to force the business to build a business case and take on the difficult intellectual challenge of developing an ROI argument for data that has never been available and therefore has an unproven value proposition. Later, once the value is proven, you might move it to a governed state and to a governed platform.

Hadoop as an EDW staging area is not a new concept (see here)… the ETL vendors are supporting Hadoop as an ETL engine already. But a Hadoop staging approach starts to solve one of the nagging problems with the classic architecture: the lack of agility. Just saying “yes” to new data and corralling rogue marts provides a foundation to experiment and evolve while also providing the means to leverage successful experiments across the Enterprise.

There is more required to satisfy sandbox users… on to that in Part 2 – Logical Data Warehousing here.

References:

  • Wiktionary definition of a Data Lake here.
  • Informatica Big Data Edition here.
  • Curt Monash on IBM ETL here.

An Architecture for the IoT – Part 1

There are so many things in the Internet of Things (IoT) that might record data into your data fabric that a new approach may be required. Let’s think about this… let’s define some terms and see how they fit into current data fabric thinking, consider how they fit into a more modern logical data warehouse architecture, and think about whether the IoT might push us to a different approach.

I’m not going to go overboard on terms here… But we do need to distinguish between a sensor and a processor.

To my way of thinking a sensor is a thing. It creates, but does not necessarily process, data. A sensor has some means to communicate with a processor… but if there is no significant processing on the sensor other than communications then we will suggest that there is no “processor” in a meaningful sense. Let me give you four examples:

  • The first is courtesy of Ray Carnes, a chief architect at Boeing. Imagine a brake-pad in your car with 100,000 dust-sized RFID sensors randomly scattered as part of the pad. These sensors do nothing but signal on an interval that they are present. This allows a processor elsewhere to record the signals and determine how much of the brake pad has worn. If only 80,000 sensors report we can assume that 20% of the pad has worn away.
  • A Nest thermostat senses movement and temperature. It uses a network connection to send the results of this sensing to the Nest mother ship and performs little or no processing on site.
  • Sensors in my Audi detect rotation of the wheels. There is a network that sends the results to a small embedded anti-lock braking processor that monitors all four wheels as well as the pressure on the brake-pedal and sends signals to all five components to allow the car to brake evenly.
  • There is a sensor in the screen on the ATM I used yesterday that detects that I want to request service. This user interface communicates with a powerful general processor which then communicates with the Bank mother-ship to create and process banking transactions.

This last bullet is important… any device that takes user input is a sensor with an embedded processor. It is a “thing” just like the Nest thing. Today we tend to blur the line between sensor and processor as every thing has a powerful processor onboard. The IoT will change this assumption.

A processor, then, is a computer that performs some analysis on the data generated by one or more sensors. A processor may also store data… a sensor will not.

Now let’s think about how we might combine sensors and processors in an architecture. To start, let’s consider the context of the data the processor can use for analysis (a small sketch follows the list below):

  • If the processor has only the last data sensed we would say that the context is immediate and local to one sensor. The processor can see streamed data but can only operate on the last event. We would say that this sensor-processor configuration can provide a simple reflexive response. When you press the lock button in your car a sensor detects this event and signals to all four doors and the boot to lock it up.
  • Another configuration might allow the local processor to store more context from a single sensor over a longer period of time… so the context is historical and local. In the case of the anti-lock brakes… the processor receives signals from a group of sensors and stores a very short historical context. This grouped historical context is very powerful…
  • Another configuration might store the group context and then forward the event details to a bigger server that stores and analyzes a universal context of all things to look for patterns. Further, there could be a hierarchy of groups leading to a universal context.
  • Finally, a server with some group context could summarize the details for that group and pass on only a summary over time up to another group server or to a universal server.
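
Here is a small sketch of the grouped-context idea under some obvious assumptions (the sensor names, readings, and summary shape are hypothetical): a local processor keeps a short history for its group of sensors and forwards only a periodic summary up to a universal server.

```python
# A sketch of the grouped-context idea: a local processor keeps a short history
# for its group of sensors and forwards only a periodic summary upward to a
# universal server. Sensor IDs and the summary shape are hypothetical.
from collections import deque
from statistics import mean

class GroupProcessor:
    """Local processor: short historical context for one group of sensors."""
    def __init__(self, group_id, history_len=100):
        self.group_id = group_id
        self.history = deque(maxlen=history_len)   # bounded local context

    def on_reading(self, sensor_id, value):
        self.history.append((sensor_id, value))

    def summarize(self):
        values = [v for _, v in self.history]
        return {
            "group": self.group_id,
            "count": len(values),
            "mean": mean(values) if values else None,
            "max": max(values, default=None),
        }

class UniversalServer:
    """Mother ship: stores only the summaries passed up from each group."""
    def __init__(self):
        self.group_summaries = {}

    def on_summary(self, summary):
        self.group_summaries[summary["group"]] = summary

# Wire up one group of wheel-speed sensors feeding a local processor.
local = GroupProcessor("front-axle")
for sensor_id, rpm in [("wheel-left", 820), ("wheel-right", 640), ("wheel-left", 810)]:
    local.on_reading(sensor_id, rpm)

central = UniversalServer()
central.on_summary(local.summarize())   # only the summary moves upstream
print(central.group_summaries)
```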

I suspect that you can see where I’m going. There is a trade-off in this picture between the advantages of pushing analytic processing close to the sensor with the associated requirement for more analytic processors, the advantages of intermediate analysis requiring more data movement but fewer analytic processors, and the advantage of a central analytic mother ship where all data is stored and analyzed. In the next post in this thread I’ll try to tease apart these trade-offs.