Refactoring Databases P2.5: Scoping the Cheats

One of the side-effects of the little cheat posted here is that, if we are going to design early we have to decide what we will design early… and this question has two complications. First, we have to ask ourselves how detailed do we design before the result becomes un-agile? Next, we have to ask ourselves if we should design up the stack a little? My opinions will be doggedly blogged in this post.

I will offer two ends of a spectrum to suggest a way to manage the scope of your advance design cheat. Let me remind you that the cheat suggests that you look at the user stories that will be sprinted on next and devise the schema required from those stories… and maybe refactor the existing schema a little (more in the next post) no more than that.

On one side we may develop a complete design with every detail specified: subject areas, tables, columns, data types, and domains. The advantage here is that the code developers have a spec to code to and these could increase velocity. But the down side is that developers will be working with users to adjust the code in real-time. If the schema does not fit the adjustments then you may be refactoring the new stuff and velocity may decrease.

The other side of the spectrum would have database designers just build a skeleton; a conceptual schema with subject areas, tables and primary keys for each table. This provides a framework that corrals the developers without fencing them in so tight that they cannot express agility.

Remember that the object here is to reduce refactoring without reducing agility…

IMO the conceptual model approach is best. Let’s raise free range software engineers who can eat bugs in the wild rather than be penned in. The conceptual model delimits the range… a detailed schema defines a pen.

There is one more closely related topic for the next post… how do we manage transient objects in a restful application?

Refactoring Databases P2: Cheating a Little

In this post I am going to suggest doing a little design a little upfront and violating the purity of agility in the process. I think that you will see the sense.

To be fair, I do not think that what I am going to say in this post is particularly original… But in my admittedly weak survey of agile methods I did not find these ideas clearly stated… So apologies up front to those who figured this out before me… And to those readers who know this already. In other words, I am assuming that enough of you database geeks are like me, just becoming agile literate, and may find this useful.

In my prior post here I suggested that refactoring, re-working previous stuff due to the lack of design, was the price we pay to avoid over-engineering in an agile project. Now I am going to suggest that some design could eliminate some refactoring to the overall benefit of the project. In particular, I am going to suggest a little database design in advance as a little cheat.

Generally an agile project progresses by picking a set of user stories from a prioritized backlog of stories and tackling development of code for those stories in a series of short, two-week, sprints.

Since design happens in real-time during the sprints it can be uncoordinated… And code or schemas designed this way are refactored in subsequent sprints. Depending on how uncoordinated the schemas become the refactoring can require a significant effort. If the coders are not data folks… And worse, they are abstracted away from the schema via an ORM layer, the schema can become very silly.

Here is what we are trying to do at the Social Security Administration.

In the best case we would build a conceptual data model before the first sprint. This conceptual model would only define the 5-10 major entities in the system… Something highly likely to stand up over time. Sprint teams would then have the ability to agile define new objects within this conceptual framework… And require permission only when new concepts are required.

Then, and this is a better case, we have data modelers working 1-2 sprints ahead of the coders so that there is a fairly detailed model to pin data to. This better case requires the prioritized backlog to be set 1-2 sprints in advance… A reasonable, but not certain, assumption.

Finally, we are hoping to provide developers with more than just a data model… What we really want is to provide an object model with basic CRUD methods in advance. This provides developers with a very strong starting point for their sprints and let’s the data/object architecture to evolve in a more methodological manner.

Let me be clear that this is a slippery slope. I am suggesting that data folks can work 2-4 weeks ahead of the coders and still be very agile. Old school enterprise data modelers will argue… Why not be way ahead and prescribe adherence to an enterprise model. You all will have to manage this slope as you see fit.

We are seeing improvement in the velocity and quality of the code coming from our agile projects… And in agile… code is king. In the new world an enterprise data model evolves evolve based on multiple application models and enterprise data modelers need to find a way to influence, not dictate, data architecture in an agile manner.

Refactoring Databases P1: Defining Some Terms

We are in the middle of several agile projects at the SSA… so I’ll start the year by sharing some related issues and solutions we are considering…

I am going to try to suggest some ideas about refactoring databases that are a little different from some of the concepts in the book and blogs on the subject here and here. In this first post let me try to define refactoring in a general enough way that the same definition and use of the term works for both code and database design.

To start, refactoring could be a general term for incrementally tweaking any software. We might suggest that we have always refactored software more-or-less. But IMO the term has taken meaning as part of agile software development methods and so I will assume this is the proper use of the term.

Agile is a melting pot for several methodologies that emerged as a reaction to inefficiencies using stepwise waterfall methods. As a result agile has many features that make it useful… I’m going to focus on just one that I consider most relevant to refactoring: an agile method develops software incrementally with only a short-term end state as a target. Each increment adds new functionality and the system evolves. As a result, it is not possible, or not correctly agile, to establish a detailed design in advance and the system design evolves with the system.

This is a very hard concept to grasp. You cannot design up front for a system that has an undetermined end state. The waterfall concepts of design first must be modified to be agile. I’ll suggest how to do this in a later post… rest assured that some design is required… think about how we might build software with no design other than that inherited from the last set of sprints and with only the current sprint user stories to guide us.

If you have grokked this then you are on your way to understanding agile and refactoring.

Imagine that you have built a function that is seldom called… But the current sprint user story will call the function thousands of times a second. Imagine further that you built the original function simply in a stateful manner… But now, in order to meet the new scalability requirements, you realize that the function will need to be stateless. What you are imagining is the need to refactor the function.

Now you might ask: should you have known, in advance, that performance was going to be an issue? Maybe… But maybe not. The point is that when you design just-in-time in an agile manner you cannot, and should not, get too far ahead of yourself and over-engineer. Over-engineering is one of the side-effects of waterfall methods that agile aims to avoid…. And refactoring is the result… It is a trade-off not a perfect solution (again, bear with me and I’ll suggest another trade-off later that you might like).

So refactoring is the process that adjusts design incrementally in an agile project:

Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

Its heart is a series of small behavior preserving transformations. Each transformation (called a “refactoring”) does little, but a sequence of transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each small refactoring, reducing the chances that a system can get seriously broken during the restructuring.

– Martin Fowler

Note that in the example I suggested where we refactored for performance we followed this definition closely… the refactored function was changed but its behavior was unchanged… no program that called the function was changed as a result.

Let me make two points here and close…

First: refactoring is about making changes to code and to databases that preserves the behavior of the code or the databases. Refactoring is not about new functionality with new behaviors that might be added incrementally as the project agilely progresses.

Next, refactoring is not about any incremental change… It is about incremental change in an agile project where the end state is uncertain enough to preclude a complete design. If we change a column in a database with some certainty that the column will satisfy a long-term vision then that change is not refactoring. Refactoring is not a process to guide a database migration or database modernization process.

When the end state is well understood it is silly to code stuff that you know will break later… And incrementally changing separate parts of a database that you are pretty certain will not change in the future is not refactoring.

This may seem obvious… but as you will see in the next post… the definitions matter.

Modernizing IT P1

Hiya,

My apologies for neglecting this blog… The SSA CIO job takes time. I have several database-specific blogs drafted that I hope to get to over the holidays…

In the meantime here is a CIOish post on the problem of modernizing old IT infrastructure:

https://cio.gov/modernizing-federal-it-part-1-catching-up-and-jumping-ahead/

While it is not specifically about database infrastructure… I think that folks with database systems designed and developed over 20 years ago… maybe with schemas that are not subject-oriented or stuck on old or very proprietary database architecture… might see the thinking here to be relevant.

Finally, I want to wish you all a happy end to 2015… and to wish for you a happy 2016…

A Story about Teradata, Advanced Architectures, and Distributed Applications

Here is a short story about distributed versus monolithic application design…

Around 1999 Dan Holle and I started thinking about how to better deploy applications that used distributed computing. To be fair… Dan schooled me on his ideas for a better approach. We started with a thing called XML that had not yet released a V1 standard. Application subroutines stored state in XML and persisted the XML the best that we could with the tools at the time.

Around 2002 he and I approached Teradata with some very advanced functionality, and some code, that was built on this distributed foundation… and Dan eventually took a job at Teradata to develop this into one or more products.

Unfortunately, the Teradata product management powers could not see the sense. They were building apps the old way and would continue to do so. Dan was caught up in several years worth of battling over the proper architecture and our real intellectual property, the advanced functionality, was lost in the fray. The product management folks insisted that a monolithic architecture was the right way to go and so it went…

This was an odd battle as you might assume that Teradata was a thought-leader in distributed computing…

As you might guess… Dan eventually left Teradata a little bruised by the battle.

Today we realize that a monolithic application architecture is a poor substitute for a distributed architecture. The concepts we brought to Teradata in 2002 underpin the 12 factor app formulation that is the best practice today (see here).

So I’d like to point out that Dan was right… the Teradata Product Management folks were wrong… and that their old school thinking robbed Teradata of the chance to lead in the same manner that they once led in parallel database computing (today they lead in the market… but not in tech). I’d like to point out the lost opportunity. Note that just recently Teradata acquired some new engineering leadership who understand the distributed approach… so they will likely be ok (although the old dweebs are still there).

I hope that this story provides a reason to pause for all companies that depend on software for differentiation. When the silly dweebs, the product managers and the marketeers, are allowed to exercise authority over the very smart folks… and when the executive management team becomes so divorced from technology that they cannot see this… your company is in trouble. When your technology decisions become more about marketing and not about architecture, your tech company is headed for trouble.

PS

This blog is not really about Teradata. I’m just using this story to make the point. Many of the folks who made Greenplum great were diminished and left. The story repeats over and over.

This is a story about how technology has changed from being the result of organic growth in a company to a process whereby the tech leaders become secondary to the dweebs… then they leave… and so companies have to buy new tech from the start-up community. This process forces fundamental technology through a filter devised by venture capitalists… and often fundamental technology is difficult to monetize… or the monetization process fails despite sound underlying tech.

It is a story that points out one of the lessons from Apple, Google, and other serially successful tech companies. Executive leadership has to stay in touch with the smart folks… with the geeks… with the technological possibilities rather than with the technological  “as is”.

Finally, note that Teradata might have asked me to join back up in 2002… and they did not. So not all of their decisions then were mistaken.

A Short DW DBMS Market History: HANA, Oracle, DB2, Netezza, Teradata, & Greenplum

Here is a quick review of tens years of data warehouse database competition… and a peek ahead…

Maybe ten years ago Netezza shook up the DW DBMS market with a parallel database machine that could compete with Teradata.

About six years ago Greenplum entered the market with a commodity-based product that was competitive… and then added column store to make it a price/performance winner.

A couple of years later Oracle entered with Exadata… a product competitive enough to keep the Oracle faithful on an Oracle product… but nothing really special otherwise.

Teradata eventually added a columnar feature that matched Greenplum… and Greenplum focussed away from the data warehouse space. Netezza could not match the power of columnar and could not get there so they fell away.

At this point Teradata was more-or-less back on top… although Greenplum and the other chipped away based on price. In addition, Hadoop entered the market and ate away at Teradata’s dominance in the Big Data space. The impact of Hadoop is well documented in this blog.

Three-to-four years ago SAP introduced HANA and the whole market gasped. HANA was delivering 1000X performance using columnar formats, memory to eliminate I/O, and bare-metal techniques that effectively loaded data into the processor in full cache lines.

Unfortunately, SAP did not take advantage of their significant lead in the general database markets. They focussed on their large installed base of customers… pricing HANA in a way that generated revenue but did not allow for much growth in market share. Maybe this was smart… maybe not… I was not privy to the debate.

Now Oracle has responded with in-memory columnar capability and IBM has introduced BLU. We might argue over which implementation is best… but clearly whatever lead SAP HANA held is greatly diminished. Further, HANA pricing makes it a very tough sell outside of its implementation inside the SAP Business Suite.

Teradata has provided a memory-based cache under its columnar capabilities… but this is not at the same level of sophistication as the HANA, 12c, BLU technologies which compute directly against compressed columnar data.

Hadoop is catching up slowly and we should expect that barring some giant advance from the commercial space that they will reach parity in the next 5 years or so (the will claim parity sooner… but if we require all of the capabilities offered to be present there is just no way to produce mature software any faster than 5 years).

Interestingly there is one player who seems to be advancing the state of the art. Greenplum has rolled out a best-in-class optimizer with Orca… and now has acquired Quickstep which may provide the state-of-the-art in bare metal columnar computing. When these come together Greenplum could once again bounce to the top of the performance, and the price/performance, stack. In addition, Greenplum has skinnied down and is running on an open source business model. They are very Hadoop-friendly.

It will be interesting to see if this open-source business model provides the revenue to drive advanced development… there is not really a “community” behind Greenplum development. It will also be interesting to see if the skinny business model will allow for the deployment of an enterprise-level sales force… but it just might. If Pivotal combines this new technology with a focus on the large EDW market… they may become a bigger player.

Note that was sort of dumb-luck that I posted about how Hadoop might impact revenues of big database players like Teradata right before Teradata posted a loss… but do not over think this and jump to the conclusion that Teradata is dying. They are the leader in their large space. They have great technology and they more-or-less keep up with the competition. But skinnier companies can afford to charge less and Teradata, who grew up in the days of big enterprise software, will have to skinny down like Greenplum. It will be much harder for Teradata than it was for Greenplum… and both companies will struggle with profitability for a while. But it is these technology and market dynamics that give us all something to think about, blog about, and talk about over beers…

Hadoop and ETL

My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.

If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect in the ETL space.

I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop eco-system.

Here is the idea from a previous post.

About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes and I suggested they spin the product both into analytics and into ETL… I’ll have to look and see where they are these days…

Hadoop and Company Financial Performance

I have posted several times about the impact of the Hadoop eco-system on a several companies (here, here, here, for example). The topic cam up in a tweet thread a few weeks back… which prompts this quick note.

Fours years ago the street price for a scalable, parallel, enterprise data warehouse platform was $US25K-$US35K per terabyte. This price point provided vendors like Teradata, Netezza, and Greenplum reasonable, lucrative, margins. Hadoop entered the scene and captured the Big Data space from these vendors by offering 20X slower performance at 1/20th the price: $US1K-$US5K per terabyte. The capture was immediate and real… customers who were selecting these products for specialized, very large, 1PB and up deployments switched to Hadoop as fast as possible.

Now, two trends continue to eat at the market share of parallel database products.

First, relational implementations on HDFS continue to improve in performance and they are now 4X-10X slower than the best parallel databases at 1/10th-1/20th the street price. This puts pressure on prices and on margins for these relational vendors and this pressure is felt.

In order to keep their installed base of customers in the fold these vendors have built ever more sophisticated integration between their relational products and Hadoop. This integration, however, allows customers to significantly reduce expense by moving large parts of their EDW to an Annex (see here)… and this trend has started. We might argue whether an EDW Annex should store the coldest 80% or the coldest 20% of the data in your EDW… but there is little doubt that some older data could satisfy SLAs by delivering 4X-10X slower performance.

In addition, these trends converge. If you can only put 20% of your old, cold data in an Annex that is 10X slower than your EDW platform then you might put 50% of your data into an Annex that is only 4X slower. As the Hadoop relational implementations continue to add columnar, in-memory, and other accelerators… ever more data could move to a Hadoop-based EDW Annex.

I’ll leave it to the gamblers who read this to guess the timing and magnitude of the impact of Hadoop on the relational database markets and on company financial performance. I cannot see how it cannot have an impact.

Well, actually I can see one way out. If the requirement for hot data that requires high performance accelerates faster than the high performance advances of Hadoop then the parallel RDBMS folks will hold their own or advance. Maybe the Internet of Things helps here…. but I doubt it.

Trends in Systems Architecture and the HANA Platform

This post is a follow on to the video on Systems Architecture here (video). I’ll describe how HANA, the Platform not the Database, is cutting fat from the landscape. If you have not seen the video recently it will be worth a quick look as I will only briefly recap it here.

The video, kindly sponsored by Intel, suggests that the distributed architecture we developed to collect lots of very small microprocessors into a single platform in order to solve big enterprise sized problems is evolving as multi-core microprocessors become ever more capable. In my opinion, we are starting to collapse the distributed architecture into something different.

DBFB SAP Distributed Landscape v1
Figure 1. A Distributed SAP Landscape

Figure 1 shows how we might deploy a series of servers to implement an SAP software landscape. Note that if we removed a couple of pieces the picture represents a platform for just about any enterprise solution. In other words, it is not just about SAP ERP solutions. I have left this in so that companies with SAP in place can see where they may be headed.

Figure 2 shows the same landscape deployed in the HANA Platform. It is a very lean deployment with all of the fat of inter-operating systems combination replaced by processes communicating over lightweight thread boundaries. Squeezing the fat out will provide significant performance benefits over the distributed deployment… it will be 2-5 times faster easily… and 2-3 times faster even if you deployed the landscape on a single server in a series of virtual machines as you have to add in the overhead of a VM plus the cost of communication over virtual networks.

DBFB SAP Collapsed Landscape v1
Figure 2. The Landscape Collapsed into threads on the HANA Platform

The HANA Platform is very powerful stuff… and if you had the choice between SAP and a competitor, the ability to match the performance on 1/2 the hardware might provide an overwhelming advantage to HANA.Note that in this evaluation I have not included any advantages from the performance kick due to the HANA Database… so the SAP story is better still.

Those of you who heard me describe HANA while I was at SAP know that this is not a new story. It was to be the opening of the shelved book on HANA. So this post satisfies my longstanding promise to get this story documented even though the book never generated much executive enthusiasm.

The trend in systems architecture outlined in the video is being capitalized by SAP as the HANA Platform… and the advantage provided is significant. You should expect SAP customers to rapidly move to HANA to take advantage of this tight, high-performance landscape.

But the software world does not stand still for long… and there are already options emerging that approximate this approach. Stay tuned…

More Database Supercomputing Technology

Last year two associates from Greenplum suggested that I read a very smart academic paper titled “Efficiently Compiling Efficient Query Plans for Modern Hardware” by Thomas Neumann. Having reiterated the idea of database supercomputing in my last blog (here)… I can now suggest this paper to you.

In short this paper suggests that the classic approach to building query plans using an iterator, an approach that assumed that I/O was the bottleneck, misses a significant opportunity for optimization at the hardware level. He suggests that an approach designed to keep data in the hardware registers as long as possible and push instructions to the data provides a significant performance boost. Further, he suggests that this approach extends the advantages of vector processing. The paper is here and a set of slides on the topic are here.

My friends subsequently started-up a little company named Vitesse Data (vitessedata.com) and implemented the technique over Postgres. Check out their site… the benchmark numbers are pretty cool and they certainly prove the paper. My guess is that this might be a next step in the database architecture race.

FYI… here is a link to more information on the LLVM compiler framework… another very cool bit of software.

One last note… some folks see optimization to the bare metal as an odd approach in a cloudy world where even the database is abstracted away from the bare metal by virtualization. But this thinking misses the point. At some point the database program executes in real hardware and these optimizations matter. What really happens is that bare metal optimization exposes more of the inefficiency of virtualization.

We are already starting to see the emergence of clouds that deploy on bare metal. I expect that we will shortly get to the point where things like databases are deployed on bare-metal cloud IaaS to squeeze every drop of performance out… while other programs are deployed in virtualized IaaS…