Monday, March 5, 2012

Database Seed Data

Almost two years ago, we started a major new development effort at Pointe Blank.  We had the extreme fortune to be starting some new projects on a completely new platform (namely ASP.NET MVC instead of WinForms), which gave us the opportunity to take all the things we'd learned and start from scratch.

One of the main things we had learned was the importance of designing for automation.  Probably the single most valuable automation related thing we did was automate the scripting of our database.  We are using FluentMigrator to define migrations and psake with sqlcmd to create and drop database instances.  I'm not all that crazy about FluentMigrator (its syntax is overly verbose, it's code seems overly complex) but it has worked very well for us so far.

We use this tool in a few different circumstances:

  1. To apply other people's changes to our local dev databases
  2. To migrate our QA environment
  3. To spin up from scratch (and tear down) automated test environments
  4. To deploy database changes to production
Database migrations have made working with our code base so dramatically easy, it's really amazing.  You can spin up a new environment with one command in seconds.  Incorporate other devs changes without a second thought.

Migration tools pay most of their attention to schema changes, but data gets largely ignored.  So much so that we had to rough in our own approach to dealing with data, which works, but also sucks.  And there doesn't seem to be any clear recommendations for data.  There are some categories of data I'm concerned with:
  1. List Data: these are values that are usually displayed in drop downs and are either not user generated, or there is a default set we want to seed (ex: states, name suffixes, record types, etc)
  2. Configuration Data: this is data that is required for the system to work, (ex: server uris, an email address to send from, labels to display in the app, etc)
There are only two approaches to dealing with seed that I'm aware of:
  1. Run the schema migrations, then insert all the seed data
  2. Insert the seed data inside the migrations
At first blush #1 seemed easier, so it's the approach we originally took, but it has some drawbacks. The first challenge is to avoid duplicates.  You can do that with IF NOT EXISTS(...) statements, or by inserting Primary Keys and ignoring PK violation errors.  The second is dealing with schema changes to tables with seed data.

For example, suppose you write a migration to create a table, and you write insert statements to add seed data to it.  You have already run your migrations and seeded the database.  Now you write a new migration which adds a new not null column to the table.  This migration will fail because the table already has data from the seeds and the new column requires a value.  In this situation, you're hosed.  You have to write the migration to add data into that column before making the column not allow nulls (or use a default value).  Effectively you have to duplicate the seed data for that column in both the migration and the seed data file.  You need it in both places in order to support creating the database from scratch and migrating an existing database.

#2 seems to have it's own downsides too.  There is no one place that shows you all the data that will be seeded in a given table, it's spread throughout the migrations.  It precludes you from having a schema.rb type of system (unless that system could somehow be smart enough to include data).  That point is somewhat academic at this point though, because FluentMigrator doesn't have anything like this.

However, #2 is much more flexible and can accommodate anything you could ever dream of doing with data (that SQL would support of course).  And I feel that pure #2 would be better than a #1/#2 hybrid, because it's just more straight forward.

And now I'd like to ask you, do you know of a good way of dealing with seed data?  Are there any existing tools or strategies out there that you could link me to?  


  1. WE use db ghost. It is designed for the ground up for db scheme and data migrations. It can create a new db and update and existing db. You can specify tables as seed data (insert on create db only) or static data (sync with shipped data).

    The tool comes with a sync engine which does the schema and data compare on the fly.

    This also allows all your scheme and data scripts to be source controlled. We also have our build do a test build of a database if and scripts change. Each dev has a copy of db ghost and can just update their working copy and run a Sync task to bring their local test db up to sync after doing an update from source control.

    OF course this is a SQL Server tool only... so there is that.


  2. On projects I've been on, I've had success using a factory mechanism. I would create a couple layers of scripts that let me do things like this (in pseudocode):

    var library = createLibrary();
    addBook({title: "The Farce", library: library});
    addBook({title: "My Life", author: "Ronald McDonald", library: library});
    addMovie({title: "Free Willey", library: library});

    The point being that I built a set of methods that had sensible defaults that used the business logic to populate the database. In these examples, I would have addBook() know how to generate the other fields randomly if need be, and they would call the objects that create the records.