Tuesday, December 17, 2013

Wrangling Wild Databases

I've received some interesting questions lately.  One question at my most recent talk for the DAMA group in Portland rang very familiar.  It got me thinking.

There are a lot of database instances out there that were developed before my book was published.  Even if everyone adopted the techniques therein, that would still leave trillions of rows of data in databases that were not built in accordance with the book.  Among other things, these databases were generally created with insufficient automated test coverage, so, in the spirit of Working Effectively with Legacy Code, I call them "legacy databases."

I cover legacy databases a little bit in the book, but it only addresses a limited number of scenarios.  This blog seems like a natural place to start addressing the others.

The Scenario

Imagine the following:

You work for a company that sells an enterprise product; we'll call that product "Calm Cheddar."  As with most enterprise software products, Calm Cheddar has a database back end.  Calm Cheddar has been successful in several applicable markets and has been sold to numerous customers over the course of several years.  In that same time, it has grown as a product.  Along with the overall design of Calm Cheddar, the design of its database has grown and morphed over time.
[Image caption: manual deployment and a broad customer base lead to deviation in database builds]

The final result is an array of customers with varying versions of the software, each with a different path to their current deployment.  The saving grace is that Calm Cheddar's customers tend to upgrade to the latest version.  They may or may not upgrade often, but they never upgrade to something that is already outdated.

Now let's say you want to start emerging a class of databases in this scenario.  What I've shown in the past and what I teach in the first two thirds of my book do not cover scenarios like this.  This specific scenario isn't covered anywhere, and I imagine there are plenty of people in a similar situation.

The Ideal and the Real

The ideal class of databases is expressed as a linear sequence of versions created using a linear series of revisions.  The class of databases is simple, knows how to perform any reasonable upgrade, and is very robust.
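
To make the ideal concrete, here is a minimal sketch of a class of databases expressed as a linear sequence of revisions.  I'm using Python and SQLite purely for illustration; the revision scripts, the DatabaseVersion registry table, and the function names are all invented for this example rather than taken from any particular product:

    import sqlite3

    # Each revision is one step in the single, linear upgrade path that every
    # instance of this class of databases is expected to follow.
    REVISIONS = [
        "CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL)",
        "ALTER TABLE Customer ADD COLUMN Email TEXT",
        "CREATE TABLE Orders (Id INTEGER PRIMARY KEY, CustomerId INTEGER REFERENCES Customer(Id))",
    ]

    def current_version(conn):
        # The registry table records how far along the linear path an instance is.
        conn.execute("CREATE TABLE IF NOT EXISTS DatabaseVersion (Version INTEGER NOT NULL)")
        row = conn.execute("SELECT MAX(Version) FROM DatabaseVersion").fetchone()
        return row[0] or 0

    def upgrade(conn):
        # Apply, in order, every revision the instance has not yet seen.
        for version, script in enumerate(REVISIONS, start=1):
            if version > current_version(conn):
                conn.execute(script)
                conn.execute("INSERT INTO DatabaseVersion (Version) VALUES (?)", (version,))
        conn.commit()

In the ideal case, any reasonable upgrade, whether from an empty database or from any earlier version, is just a matter of calling upgrade(); there is exactly one path from each version to the next.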

Circumstances, however, are rarely ideal.
[Image caption: just because you want it, doesn't make it so]

There are at least two lessons from Test-Driven Database Development: Unlocking Agility that apply to this scenario.  First: drive variation out of your database build process as much as you possibly can.  Second: above all else, make your database build process reflect the transitions that have actually been applied to real databases.

These two forces might appear to be contradictory but, in fact, they align perfectly.  You should drive all variation out of your database development process, yes, but that does not mean you will succeed, or that you will have started automating your database build mechanism at the very beginning of your very first database's life.  That you should drive all variation out of a system does not mean that none will ever be there.

The Reckoning

The key to reconciling these two forces - the impulse to minimize variation in database build paths and the need to recognize the upgrade steps that have actually occurred - with the reality of a large, diverse population of databases within a given class is understanding that you must merge these disparate paths together.  What was once a vast array of trickling creeks should, over time, be coalesced into a single coursing river of features.
[Image caption: there can be only one]

Two techniques must be applied to resolve any differences between the various deployment paths.  One is taming a legacy database.  The other is remediating deviations.

Without getting into the details, the former amounts to creating a new class of databases that has conditional build logic in its very first version to address the possibility that a database was built prior to the class's creation.  The latter consists of documenting variations in a database's historical construction patterns, then using transition tests to drive conditional logic that reconciles the various "flavors" of a database design.
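
As a rough illustration of the "taming" half, here is what conditional build logic in a class's very first version might look like, again in Python and SQLite with invented table and column names.  It is a sketch of the idea, not a recipe:

    import sqlite3

    def table_exists(conn, name):
        return conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (name,)).fetchone() is not None

    def column_exists(conn, table, column):
        return any(row[1] == column
                   for row in conn.execute("PRAGMA table_info(%s)" % table))

    def version_1(conn):
        # The first version either builds the baseline design from scratch or
        # reconciles a database that was built by hand before the class existed.
        if not table_exists(conn, "Customer"):
            conn.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT, Email TEXT)")
        else:
            # Remediate a known deviation so that every instance leaves
            # version 1 with exactly the same design.
            if not column_exists(conn, "Customer", "Email"):
                conn.execute("ALTER TABLE Customer ADD COLUMN Email TEXT")
        conn.commit()

The conditional branches are the "flavors" mentioned above, and each branch should be driven into existence by a transition test that starts from the corresponding historical design.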

Sometimes this will take a great deal of time.  Imagine that eighty percent of your database instances are almost exactly alike, fifteen percent fall into a few other distinct categories, and the remainder are "lone wolves" with highly deviant paths.

In such a case, you might want to phase your database wrangling activities: start with the large body of highly similar databases, move on to the smaller groups second, and pick off the lone wolf types on an "as-needed" basis.

This technique works great in a relatively controlled environment where, among other things, all of the databases are roughly the same version.  They don't need to all have exactly the same design to start but it does make things a lot easier if they have approximately the same design.

In the Calm Cheddar scenario, however, we don't have that luxury.  Remember: we have variation in both the version and the manner of construction.

The Conditioner

A friend of mine, Seth McCarthy, has also come up with an interesting twist on this way of doing things; one that, I think, addresses the extra kind of variation.

He suggested adding a proxy over the simple, linear-step-oriented database builder at the heart of a class of databases.  This proxy's job is twofold.  First and foremost, it detects the version of a database's design then conditions it to look as though the infrastructure for a class of databases has been used to build it.  Typically, that means creating and populating some kind of version registry table.

Naturally, if acting on an empty database, the proxy does nothing but delegate to the core database builder.

[Image caption: rectify...]
Second, the conditioner code is in a position to perform conditional transformations before delegating to the core class of databases.  It can even inject custom transitions between steps if necessary.
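
Here is a hedged sketch of how such a conditioner proxy might hang together, building on the little linear builder sketched earlier (it assumes the REVISIONS list, current_version, and upgrade from that example).  The fingerprinting rules and the injected transition are, again, purely illustrative:

    import sqlite3

    def detect_version(conn):
        # Fingerprint a hand-built database and map it onto a known version.
        tables = {row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")}
        if not tables:
            return 0  # an empty database needs no conditioning
        if "Orders" in tables:
            return 3  # design matches what revision 3 produces
        columns = {row[1] for row in conn.execute("PRAGMA table_info(Customer)")}
        return 2 if "Email" in columns else 1

    def condition_and_build(conn):
        detected = detect_version(conn)
        if detected and current_version(conn) == 0:
            # Condition the instance so it looks as though the class's
            # infrastructure built it: back-fill the version registry.
            conn.executemany("INSERT INTO DatabaseVersion (Version) VALUES (?)",
                             [(v,) for v in range(1, detected + 1)])
            # An injected custom transition: some hand-deployed instances carry
            # a leftover scratch table the class never creates (invented example).
            conn.execute("DROP TABLE IF EXISTS Customer_Backup")
        # From here on, the ordinary linear builder takes over.
        upgrade(conn)

The registry back-fill is the "conditioning" step: after it runs, the database is indistinguishable from one the class built itself, and the single linear path carries it the rest of the way.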

The conditioner represents almost the exact opposite of the linear sequence of database upgrade steps I ordinarily recommend, but special circumstances demand special responses.  Ordinarily, one would be adding a new version to a class of databases on a very regular basis, and one would have to manage an ever-growing number of possible historical versions as starting points.  There should be very little variation between one instance and another of any given version.  In that case, having special transformations that go from each version to every other version would add complexity and work.

In this scenario, however, the conditioner proxy exists in the exact opposite context.  There's only one target version with which the proxy is principally concerned.  There are many source versions and there is the potential for some amount of variation between two instances of any given historical version.  So having a special place to handle one or more of the special cases makes perfect sense.

So There You Have It

At a very high level, this should serve as a strategy one could apply to the problem of handling a large, diverse population of databases in various historical states, previously managed with an at least somewhat unreliable process such as being built by hand.

The strategy can be stated simply, though it is not always easy to do.  Capture the historical versions of your class of databases.  Codify those versions, either as conditional logic in the initial version of a simple linear sequence of modification scripts or as a proxy to a similarly sustainable database class format.  Drive each conditional behavior from transition tests that model the variant starting points.  Force all of the variation out of existence in a controlled way.  If you have too much variation to handle all at once, ingest smaller segments of the source database space into your class of databases.
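
To close the loop, here is what one of those transition tests might look like against the sketches above; it models a single variant starting point (a hand-built instance that predates the Email column, an invented flavor) and asserts that the conditioner brings it in line with the canonical design:

    import sqlite3
    import unittest

    class HandBuiltVersion1Transition(unittest.TestCase):
        def test_conditioner_reconciles_hand_built_instance(self):
            conn = sqlite3.connect(":memory:")
            # Arrange: reproduce the variant starting point as it exists in the wild.
            conn.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT)")

            # Act: let the conditioner detect, condition, and upgrade the instance.
            condition_and_build(conn)

            # Assert: the instance now matches the current canonical design.
            columns = {row[1] for row in conn.execute("PRAGMA table_info(Customer)")}
            self.assertIn("Email", columns)
            tables = {row[0] for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'")}
            self.assertIn("Orders", tables)
            self.assertIn("DatabaseVersion", tables)

    if __name__ == "__main__":
        unittest.main()

Each flavor you discover in the wild gets its own test like this one, and each test drives a corresponding piece of conditional logic; when a flavor dies out, its test and its branch can be retired together.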