To get the most out of a change management (CM) process, you need good change and rollback plans. Fortunately, preparing them is simpler with a good CM notice form.
Here is the form we use for all of our changes:
----
SUBJECT: Change Notice for <title>
(Start SUBJECT with "EMERGENCY" if appropriate.)
TITLE or NAME OF CHANGE:
IS THIS AN EMERGENCY CHANGE: (No/Yes)
IF YES, NAME OF APPROVING EXECUTIVE:
DESCRIPTION OF CHANGE:
EXPECTED START TIME:
EXPECTED DURATION:
RISK LEVEL: High/Medium/Low
CONTACT INFO FOR PERSON EXECUTING THE CHANGE:
Name:
Email:
Cell:
SYSTEMS IMPACTED BY CHANGE:
HOW HAS THIS CHANGE BEEN TESTED: (NOT "why this change doesn't require testing")
IF UNTESTED, NAME OF EXECUTIVE OR ENGINEER APPROVING AN UNTESTED CHANGE:
WHAT ARE THE REQUIRED CHANGES TO THE MONITORING SYSTEM:
IS A USER-FACING MAINTENANCE WINDOW (SITE OFFLINE) REQUIRED? Yes/No
WILL INDIVIDUAL SYSTEMS BE OFFLINE OR REBOOTED/POWER-CYCLED (even if they are load-balanced)? Yes/No
WHAT ARE THE STEPS TO IMPLEMENT THE CHANGE:
WHAT WILL BE THE TESTING/VALIDATION THAT THE CHANGE IS SUCCESSFUL:
WHAT ARE THE STEPS TO ROLLBACK THE CHANGE:
IF CHANGE IS ROLLED BACK, NAME OF ENGINEER WHO WILL REVIEW THE ROLLBACK:
----
In an upcoming article I will describe each element of the form above, but today I will focus on preparing the CHANGE and ROLLBACK plans.
A. PREPARING THE CHANGE PLAN:
This is often done in an iterative fashion. That is, first the step-by-step plan is drafted, then as the change is tested (always a good plan itself!) the step-by-step is refined. We like to do this on a wiki so we can easily make and undo changes.
The plan itself should cover each step that will be taken. This is important for two reasons. First, it forces the author to think through each step in advance, which prevents many problems that would have occurred had the change been done in an ad hoc manner. Second, and equally important, it allows other people to review the change, spot problems, and alert the author in advance. (A real-world example: a client did an unplanned service restart on a NAS device, which triggered an FSCK of a few TB of storage. THAT resulted in a multi-hour downtime event. Was the service restart or FSCK on the CM? Nope!)
A snippet of an actual CM plan is below:
----
1. Copy 1.15.0 clone master, but do NOT overwrite LocalSettings.php or images/.
cp -rv --no-dereference \
  /var/www/html/mediawiki/mwiki-1.15.0-clone_master/* .
chmod a+w images/
chmod a+w config/
2. Create the maintenance directory and run the update script
cp -r ../mediawiki-1.15.0/maintenance .
cd maintenance/
php update.php --aconf ../AdminSettings.php
3. Wait for a long time
4. Validate the new wiki, especially the DPL scripts
5. Remove the maintenance directory
cd ..
rm -rf maintenance/
----
Any Linux-aware reader can see what will be done during that cutover. It is NOT necessary to include the shell commands; I just do that to prevent typos when I execute the change.
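If the commands are in the plan anyway, one option is to run each step through a small wrapper that logs it and stops at the first failure. The following is only a minimal sketch of that idea, not what we actually ran: run_step and CM_LOG are hypothetical names invented for illustration.

```shell
#!/bin/sh
# Sketch only. run_step and CM_LOG are hypothetical names, not part of
# the original CM plan.

CM_LOG="${CM_LOG:-cm_change.log}"

# run_step LABEL COMMAND [ARGS...]
# Logs the start of the step, runs the command, logs the outcome, and
# returns the command's exit status so the caller can halt on failure.
run_step() {
    step_label="$1"
    shift
    printf '%s START %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$step_label" >> "$CM_LOG"
    step_rc=0
    "$@" || step_rc=$?
    if [ "$step_rc" -eq 0 ]; then
        printf '%s OK    %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$step_label" >> "$CM_LOG"
    else
        printf '%s FAIL  %s (exit %d)\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$step_label" "$step_rc" >> "$CM_LOG"
        echo "Step failed: $step_label. Stop and consult the rollback plan." >&2
    fi
    return "$step_rc"
}

# Example use, chaining steps with && so a failure halts the change:
#   run_step "1a. make images writable" chmod a+w images/ &&
#   run_step "1b. make config writable" chmod a+w config/
```

The log then doubles as a record of what was actually executed, which is handy when someone reviews a rollback afterwards.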
B. PREPARING THE ROLLBACK PLAN:
This step is routinely overlooked by engineers who believe that most of the time it won't be needed. However, by thinking through how to recover from a failed change, you often discover the need to take snapshots and backups during the change itself. AND IT IS MUCH EASIER TO TAKE A RECOVERY SNAPSHOT *BEFORE* YOU CHANGE THE SYSTEM.
Here is an example:
----
1. Copy 1.11.0 clone master, but do NOT overwrite LocalSettings.php or images/.
cp -rv --no-dereference \
  /var/www/html/mediawiki/mwiki-1.11.0-clone_master/* .
chmod a+w images/
chmod a+w config/
2. Modify the LocalSettings.php:
vi LocalSettings.php
$wgDBtype = "mysql";
$wgDBserver = "localhost";
$wgDBname = "mwiki_1_11_prod";
3. Log onto wiki to verify nothing has changed
4. Change a page normally, then verify change is visible in
mwiki_1_11_prod database
----
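The rollback plan above only works if there is a known-good copy to return to. As a minimal sketch of taking that snapshot BEFORE the change starts, something like the following could serve as a "step zero". Note the assumptions: snapshot_dir is a hypothetical helper, the paths in the usage comments are placeholders, and the mysqldump line assumes credentials are already configured. None of this is copied from our actual CM.

```shell
#!/bin/sh
# Sketch only. snapshot_dir is a hypothetical helper; all paths are
# placeholders chosen for illustration.

# snapshot_dir SRC DEST
# Archives directory SRC into DEST with a timestamped name, verifies the
# archive can be listed, and prints the archive path on success.
snapshot_dir() {
    snap_src="$1"     # directory about to be changed
    snap_dest="$2"    # directory where snapshots are kept
    snap_stamp="$(date '+%Y%m%d-%H%M%S')"
    snap_name="$(basename "$snap_src")"
    snap_archive="$snap_dest/$snap_name-$snap_stamp.tar.gz"

    mkdir -p "$snap_dest" || return 1
    tar -czf "$snap_archive" -C "$(dirname "$snap_src")" "$snap_name" || return 1

    # Do not trust a snapshot that cannot be read back.
    tar -tzf "$snap_archive" > /dev/null || return 1

    echo "$snap_archive"
}

# Example use (placeholder paths; database dump options are an assumption):
#   snapshot_dir /var/www/html/mediawiki/wiki /var/backups/cm
#   mysqldump mwiki_1_11_prod > /var/backups/cm/mwiki_1_11_prod.sql
```

Recording the printed archive path in the CM notice means the person executing a rollback does not have to hunt for the snapshot.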
During testing, one of the change commands was discovered to be wrong, so the rollback plan had to be executed. Because it had been thought through in advance, the rollback was a trivial event with no head-scratching or time wasted.
One of the side benefits of good change and rollback plans is that if a change does not go quite right in PROD, we can recover without exceeding the planned downtime window. Marketing and our customers ALWAYS appreciate that.
Next up: Pulling it all together in a CM notice.