As developers, we have all worked with databases in some form or another. And there are many suggested guidelines for maintaining change control over your database schema. I am not going to dispute any of those since any procedural control over change is a good thing. I am, however, going to present my view from an operational support perspective because often the Ops view and the Dev view over control simply do not align.
I have been doing both operational support and application development for many years and have therefore developed my own opinions on how Schema Control should be done. Schema Control, just like Source Control, should be stored centrally, revised, branched, reviewed, tested and secured. I have some rules for each of these and how they apply to the different types of content in a database schema. I will try to address each of these, but first let me lay down some simple guidelines. Simple in theory, but how simple or difficult in execution depends on your platform and resistance from your team. We are going to force people out of their comfort zone, just a little, but enough that they will initially feel constrained, when in essence what we are trying to do is empower them.
Ground Rules / Guidelines
First: Never use a shared database for development. This is probably the hardest habit for developers to break. If Tom’s database works, why not just let Frank use it instead of making him stand up his own? It’s a single authoritative source for the schema and all the data is valid. Let’s just use that. But what if Tom needs to change the “Name” field from 50 characters to 30? He fails to tell Frank, and now Frank’s code breaks. Frank spends hours debugging code that was perfect just days ago. Imagine similar scenarios across a team of 20 developers. Sure, we can say Tom should have warned Frank, but we all know that even the best of us sometimes fail to communicate every detail.
Second: Have a single authoritative source for the schema. This will allow Tom and Frank, and every other developer, to have an exact copy of the database. (Schema, triggers, stored procedures…everything. More on this later.) If you try to maintain multiple databases, and you will when cloning for another user or moving to QA, you will ultimately end up not knowing which database has the latest correct schema and data. Having a single authoritative source that everyone can pull from keeps all users and all databases in sync at any given time. Everyone should be able to pull the latest version of the schema and deploy it.
Third: Always version your database. You must be able to propagate schema changes from development to QA to production in a controlled manner. And just as importantly, you must be able to deploy any version of the database at any time. Let’s say customer “A” has found a bug in the database, a bug you may or may not have already fixed. There is no way, short of going on-site and futzing with their production environment, to test and fix such an issue. You’ve kept previous versions of the application, but how do you deploy and test them without matching revisions of the database schema?
Fourth: Have a small set of users who can merge changes to the schema back into the master for deployment. People who will ask (even if it’s just reviewing a change control ticket) why this change is being made and is this the best practice for implementing it.
Baseline
Now, in order to version your schema, you will need to establish a baseline. Even the best developers and DBAs can’t do this in a single sitting. You should let the application development mature for a little while. Work with a small set of stakeholders and initial developers to establish a starting point. Build out your schema and let people begin developing against it. Many developers can mock up or stub out code that simulates actually using the database, and these can be used to help define the initial build. Once your team feels comfortable with the current state, usually just a day or two of work depending on the size of the team, it’s time to capture your baseline. All major databases have tools that allow you to dump or script the schema. This script should act as your baseline. I’ll get into naming conventions later. After the baseline has been created, all other changes to the schema should require a change script, not a baseline. You can re-baseline at any time, but I strongly suggest that this only coincide with new major version releases.
Now for my personal “strong” opinions on what types of data goes in which files.
The baseline schema file should contain only the scripts necessary to create the database schema itself. This includes tables, indexes and relationships. And that is it. I don’t even include the ability to drop the tables and database, as that is a destructive process, and creating, as the name implies, is a constructive one. Separate scripts should be used for dropping existing databases, thus preventing the accidental destruction of a production database. The baseline script should be created for the base of every release branch. Not, I repeat, NOT for every feature branch. Any changes required for features are just that, changes. All schema modifications and referential requirements for a feature branch should be part of a change script. And a change script, as it is part of a feature branch, by definition should not make any breaking changes.
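As a concrete sketch, a baseline file for a hypothetical CRM database might look like the following. The table and column names are purely illustrative:

```sql
-- CRM_BaseLine_1.0.0.0.sql (illustrative)
-- Constructive statements only: tables, indexes and relationships.
-- No DROP statements belong in a baseline.
CREATE TABLE State (
    StateCode CHAR(2)     NOT NULL PRIMARY KEY,
    StateName VARCHAR(50) NOT NULL
);

CREATE TABLE Customer (
    CustomerID INT         NOT NULL PRIMARY KEY,
    Name       VARCHAR(50) NOT NULL,
    StateCode  CHAR(2)     NOT NULL,
    CONSTRAINT FK_Customer_State
        FOREIGN KEY (StateCode) REFERENCES State (StateCode)
);

CREATE INDEX IX_Customer_Name ON Customer (Name);
```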
Additional required elements such as triggers, stored procedures, functions and other “programmatic” items should be stored in separate files. Preferably one file per item. I often include an IF EXISTS … DROP step in these, as the intent is to be able to recreate (destruct, then construct) these items when needed. These files are your first “change scripts” for the database.
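For example, a stored procedure file (one item per file) might follow this destruct-then-construct pattern. The syntax below is SQL Server flavored, and the procedure itself is a made-up example:

```sql
-- CRM_StoredProcedure_1.0.0.0.sql (illustrative, SQL Server syntax)
-- Destruct, then construct: safe to re-run whenever the item changes.
IF EXISTS (SELECT 1 FROM sys.objects
           WHERE name = 'GetCustomersByState' AND type = 'P')
    DROP PROCEDURE GetCustomersByState;
GO

CREATE PROCEDURE GetCustomersByState
    @StateCode CHAR(2)
AS
BEGIN
    SELECT CustomerID, Name
    FROM   Customer
    WHERE  StateCode = @StateCode;
END
GO
```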
Now there are two types of data in a database required for it to function as desired. These are “referential” data and “dynamic” data. Referential data is used by lookup processes or drop-down options to enforce consistent data quality. These could be the names of states, cities, zip codes, or essentially any known content that must remain consistent (as defined by the application architecture). Dynamic data is any content that cannot be known, or may be so large and varied that pre-populating or constraining input requirements is prohibitively difficult. You will need a script to populate Referential data. I’ll address Dynamic data later.
A separate script should be created for the population of referential data. This script, while making a change to the content of the database, does not make a change to the schema or any programmatic items. Therefore it should only include INSERT, UPDATE or SELECT statements. If you noticed, I did not include DELETE. That’s because I believe under most circumstances, say 99% of the time, you should never delete referential data. If it’s no longer valid, add a column, mark it invalid and handle it in the application code. The integrity of the data is the highest priority. You should, however, plan for future releases in which this data never existed; there is no reason to carry non-referenced data forward, and that cleanup belongs in the management of your source control branches. But what about that other 1%? Well, we all make mistakes. Sometimes you have to correct a bug that requires removing the invalid data. As long as the script corrects the foreign keys, and if you designed your database right it will have to, then DELETE should work.
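A referential data script, then, is limited to INSERT, UPDATE and SELECT. A sketch against the same hypothetical State table (the IsActive column here is an assumed flag for retiring values):

```sql
-- Populate referential data: INSERT and UPDATE only, no DELETE.
INSERT INTO State (StateCode, StateName) VALUES ('CA', 'California');
INSERT INTO State (StateCode, StateName) VALUES ('NY', 'New York');

-- Retire a value rather than deleting it; the application filters on IsActive.
UPDATE State SET IsActive = 0 WHERE StateCode = 'XX';
```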
Dynamic data is a different beast. This data is required to test the application. Under no circumstances should dynamic data be scripted into a Production deployment. At least not as part of Schema Control. Dynamic data, also referred to as seed data, should be part of the application’s Source Control, even if it still requires a DBA to execute it. But most programmers can work around that requirement.
One last thing to note on creating your baseline, and any of the schema/data scripts, you should never hard code the database name in your scripts. The name of the database should always be passed as a parameter. This will ensure that the wrong database is not affected and will allow for multiple databases to co-exist on the same server. The same rule applies for the creation of users. Each user should be restricted to its own associated database ensuring your code is hitting exactly what you think it’s hitting.
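How you pass the name in depends on your tooling. With SQL Server’s sqlcmd, for instance, a scripting variable does the job (the variable name DatabaseName is my choice); psql offers similar substitution with :name variables.

```sql
-- Invoke with: sqlcmd -i script.sql -v DatabaseName="MyCompanyCRM"
USE [$(DatabaseName)];
GO
-- Every statement that follows is scoped to the database passed in,
-- so the script can never silently target the wrong one.
```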
Tracking Changes
If you’re going to go through all the trouble to enforce Schema Control then you will want to reap the benefits. Nothing is more frustrating than when asked “Is Development the same version as Production?” and having to answer “I don’t know”. This knowledge is now easily tracked without the need to run database comparison tools, or the dreaded “stare-and-compare”.
To prevent that, I add a table to the database called “SchemaChanges” with the columns ID, Major, Minor, Release, Build, ScriptName and DateApplied. Every script that changes the schema, or is run due to requirements of changes to the schema, must write to this table as its last step. Since we need to know the state of our schema at any given time, we can query the SchemaChanges table for the last script run against the database.
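A sketch of the table and of the final step of a change script follows; the column types are my assumptions, so adjust them to your engine:

```sql
-- Note: Release is a reserved word in some engines; quote it if yours complains.
CREATE TABLE SchemaChanges (
    ID          INT          NOT NULL PRIMARY KEY,
    Major       INT          NOT NULL,
    Minor       INT          NOT NULL,
    Release     INT          NOT NULL,
    Build       INT          NOT NULL,
    ScriptName  VARCHAR(255) NOT NULL,
    DateApplied TIMESTAMP    NOT NULL
);

-- The last statement of every schema-changing script records itself:
INSERT INTO SchemaChanges (ID, Major, Minor, Release, Build, ScriptName, DateApplied)
VALUES (1, 1, 2, 0, 5, 'CRM_SchemaChange_1.2.0.5.sql', CURRENT_TIMESTAMP);
```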
We are also going to track any scripts that seed data or migrate data. But we will do this in another table called “ContentChanges” with the columns ID, Major, Minor, Release, Build, ScriptName and DateApplied. Now, even though the structure of these tables is exactly the same, the content is not. You can easily run multiple migration and data modification scripts without ever needing a schema change. Pre-populating or Seeding tables so that valid test data exists is not critical to the schema. But needing to know if or when it was done could be helpful. Also, these types of processes can be done by any developer and need not be limited to the DBA function.
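Answering “what version is this database?” then becomes a one-line query against either table. For example (row-limiting syntax varies by engine):

```sql
-- Most recently applied schema script = current schema version.
SELECT Major, Minor, Release, Build, ScriptName, DateApplied
FROM   SchemaChanges
ORDER  BY DateApplied DESC
LIMIT  1;  -- SQL Server: SELECT TOP 1 ... instead of LIMIT
```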
File Naming Conventions
Now let’s get to the fun stuff (I really have to find a new hobby). Setting a standard, and following it, for file naming is essential. Why? Because we are human. We make mistakes. And interpreting the meaning of a file name is as difficult as guessing the summary of a book from the title on its cover. So I like to set the standard for filenames early in a project and make it stick.
I use a hyphenated style of file naming, though only in style: I actually use the underscore “_” character, since many scripting languages, which you might use to automate the process, interpret a hyphen as a minus sign. Other than that, the pattern I use is simple enough.
The first portion of the name is the “functional” name of the database. By functional I mean the purpose of the database and not the actual name used to reference the database. As an example your database’s function may be to hold CRM data, but the name of the database may be “MyCompanyCRM”. In this case, we want to use “CRM” and not “MyCompanyCRM”.
The second portion is the purpose of the script. For Schema Control there are five purposes: BaseLine, Trigger, StoredProcedure, Function and SchemaChange. These should be self-evident and follow the guidelines above.
The third portion is the version. The versioning scheme used should follow what is tracked by the SchemaChanges table: Major, Minor, Release and Build numbers. Of special note, the Major, Minor and Release numbers should align with the application’s Source Control versioning scheme.
Below is an example of the five types of names I would use and their purposes presented in the order in which they should be applied to the schema.
| Purpose | Number of files | File Name Example |
| --- | --- | --- |
| BaseLine | 1 file per Release Branch | DBname_BaseLine_version.sql |
| Function | 1 file per Function | DBname_Function_version.sql |
| StoredProcedure | 1 file per Stored Procedure | DBname_StoredProcedure_version.sql |
| Trigger | 1 file per Trigger | DBname_Trigger_version.sql |
| SchemaChange | 1 file per Feature Branch | DBname_SchemaChange_version.sql |
The way in which you execute these scripts will vary based on your OS and DB Engine choice. But the naming conventions and script contents should be applicable across the board.
Here are some other suggestions on Common Sense Guidelines:
- A new Baseline should not include deprecated schema structure. If it’s not required by the application, it should not be included in the schema.
- No change at the Release level should ever be a breaking change. If it breaks the version, it’s time to increment at the Minor level.
- Provide a Migration path from version to version. No client will ever want to manually re-enter their data in order to get the new version. The process for these changes should be:
  1. Create New Database
  2. Migrate Content
  3. Test Application
  4. Cleanup
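The Migrate Content step usually reduces to cross-database INSERT … SELECT statements. A sketch in SQL Server’s cross-database syntax, with made-up database names:

```sql
-- Copy content from the previous version's database into the new one.
INSERT INTO NewCRM.dbo.Customer (CustomerID, Name, StateCode)
SELECT CustomerID, Name, StateCode
FROM   OldCRM.dbo.Customer;
-- Repeat per table; the Cleanup step then archives or drops OldCRM.
```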