Whether you're still using Subversion (SVN), or have moved to a distributed system like Git, revision control has found its place in modern operations infrastructures. If you listen to talks at conferences and see what new companies are doing, it can be easy to assume that everyone is now using revision control, and using it effectively. Unfortunately, that's not the case. I routinely interact with organizations that either don't track changes in their infrastructure at all, or are not doing so in an effective manner.
If you're looking for a way to convince your boss to spend the time to set it up, or are simply looking for some tips to improve how you use it, the following are five tips for using revision control in operations.
1. Use revision control
A long time ago in a galaxy far, far away, systems administrators would log into servers and make changes to configuration files and restart services manually. Over time we've worked to automate much of this, to the point that members of your operations team may not even have login access to your servers. Everything may be centrally managed through your configuration management system, or other tooling, to automatically manage services. Whatever you're using to handle your configurations and expectations in your infrastructure, you should have a history of changes made to it.
Having a history of changes allows you to:
- Debug problems by matching changes that were committed to the repository and deployed against outages, infrastructure behavior changes, and other problems. It's tempting for us to always blame developers for problems, but let's be honest, there are rare times when it is our fault.
- Understand why changes were made. A good commit message, which you should insist upon, will not only explain what the change is doing, but why the change is being made. This will help your future colleagues, and your future self, understand why certain architecture changes were made. The decisions may have been sound ones at the time, and continue to make sense, or they were based on criteria that are no longer applicable to your organization. By tracking these reasons, you can use decisions from the past to make better decisions today.
- Revert to a prior state. Whether a change to your infrastructure caused problems and you need to do a major roll-back, or a file was simply deleted before a backup was run, having production changes stored in revision control allows you to go back in time to a known state to recover.
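As a rough illustration of the first point, matching changes to an outage can start out as simply as filtering your history by time. This is a minimal sketch, not a real tool: the `Commit` record, the `commits_before_outage` helper, the six-hour window, and the sample data are all made up for the example, and in practice the history would come from your revision control system's log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Commit:
    """Hypothetical record of one change pulled from the repository log."""
    sha: str
    deployed_at: datetime
    summary: str


def commits_before_outage(commits, outage_start, window=timedelta(hours=6)):
    """Return commits deployed within `window` before the outage, newest first."""
    earliest = outage_start - window
    suspects = [c for c in commits if earliest <= c.deployed_at <= outage_start]
    return sorted(suspects, key=lambda c: c.deployed_at, reverse=True)


if __name__ == "__main__":
    # Invented sample history, purely for illustration.
    history = [
        Commit("a1b2c3", datetime(2016, 3, 1, 9, 0), "Raise nginx worker count"),
        Commit("d4e5f6", datetime(2016, 3, 1, 13, 30), "Rotate logs hourly"),
        Commit("0789ab", datetime(2016, 2, 27, 16, 0), "Add new DNS resolver"),
    ]
    outage = datetime(2016, 3, 1, 14, 0)
    for commit in commits_before_outage(history, outage):
        print(commit.sha, commit.summary)
```

The most recent change before the outage is listed first, which is usually where you want to start looking.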
This first move can often be the hardest thing for an organization to do. You're moving from static configurations, or configuration management files on a filesystem, into a revision control system, which changes the process and often the speed at which changes can be made. Your engineers need to know how to use revision control and get used to the idea that all changes they put into production will be tracked by other people in the organization.
2. Have a plan for what should be put in revision control
This has a few major components: make sure you have multiple repositories in your infrastructure that are targeted at specific services; don't put auto-generated content or binaries into revision control; and make sure you're working securely.
First, you typically want to split up different parts of your services into different repositories. This allows fine-tuned control of who has access to commit to specific service repositories. It also prevents a repository from getting too large, which can complicate the life of your systems administrators who are trying to copy it onto their systems.
You may not believe that a repository can get very big, since it's just text files, but you'll have a different perspective when you have been using a repository for five years, and every copy includes every change ever made. Let me show you the system-config repository for the OpenStack Infrastructure project. The first commit to this project was made in July 2011:
    elizabeth@r2d2$:~$ time git clone https://git.openstack.org/openstack-infra/system-config
    Cloning into 'system-config'...
    remote: Counting objects: 79237, done.
    remote: Compressing objects: 100% (37446/37446), done.
    remote: Total 79237 (delta 50215), reused 64257 (delta 35955)
    Receiving objects: 100% (79237/79237), 10.52 MiB | 2.78 MiB/s, done.
    Resolving deltas: 100% (50215/50215), done.
    Checking connectivity... done.

    real 0m7.600s
    user 0m3.344s
    sys 0m0.680s
That's over 10 MiB of data for a text-only repository, accumulated over five years.
Again, yes, text-only. You typically want to avoid stuffing binaries into revision control. You often don't get diffs on these and they just bloat your repository. Find a better way to distribute your binaries. You also don't want your auto-generated files in revision control. Put the configuration file that creates those auto-generated files into revision control, and let your configuration management tooling do its work to auto-generate the files.
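If you want a machine to enforce this, a small check can flag files that look binary or are suspiciously large before they land in the repository. The sketch below is one possible heuristic under stated assumptions: the function names, the NUL-byte test, and the 1 MB size limit are my own choices for illustration, not a feature of any particular revision control system.

```python
def looks_binary(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic: treat content as binary if a NUL byte appears near the start."""
    return b"\x00" in data[:sample_size]


def files_to_reject(staged: dict, max_bytes: int = 1_000_000) -> list:
    """Return sorted paths that look binary or exceed `max_bytes`.

    `staged` maps a file path to its raw content as bytes.
    """
    return sorted(
        path
        for path, data in staged.items()
        if looks_binary(data) or len(data) > max_bytes
    )
```

A check like this will not recognize auto-generated text files, so you would still need a list of generated paths to exclude.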
Finally, split out all secret data into separate repositories. Organizations can get a considerable amount of benefit from allowing all of their technical staff to see their repositories, but you don't necessarily want to expose every private SSH or SSL key to everyone. You may also consider open sourcing some of your tooling some day, so making sure you have no private data in your repository from the beginning will prevent a lot of headaches later on.
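One way to catch private keys before they reach a shared repository is to scan for the PEM header that key files typically begin with. This is a minimal sketch: the `find_private_keys` helper and its input format are hypothetical, and it only catches keys in PEM form, not passwords or API tokens.

```python
import re

# Matches PEM headers such as "-----BEGIN RSA PRIVATE KEY-----",
# "-----BEGIN OPENSSH PRIVATE KEY-----", and "-----BEGIN PRIVATE KEY-----".
PRIVATE_KEY_HEADER = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def find_private_keys(files: dict) -> list:
    """Return sorted paths whose text content contains a private key header.

    `files` maps a file path to its content as a string.
    """
    return sorted(
        path for path, text in files.items() if PRIVATE_KEY_HEADER.search(text)
    )
```

Certificates and public keys are deliberately not matched, since those are usually safe to share.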
3. Make it your canonical place for changes, and deploy from it
Your revision control system must be a central part of your infrastructure. You can't just update it when you remember to, or add it to a checklist of things you should be doing when you make a change in production. Any time someone makes a change, it must be tracked in revision control. If it's not, file a bug and make sure that configuration file is added to revision control in the future.
This also means you should deploy from the configuration tracked in your revision control system. No one should be able to log into a server and make changes without it being put through revision control, except in the rare case of an actual emergency where you have a strict process for getting back on track as soon as the emergency has concluded.
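To notice when someone has changed a server by hand, you can compare what is deployed against what is tracked. The sketch below assumes you have already read both sets of files into memory; `find_drift` and its status labels are hypothetical names, and most configuration management tools offer their own way to report this.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of `content` as a hex string."""
    return hashlib.sha256(content).hexdigest()


def find_drift(tracked: dict, deployed: dict) -> dict:
    """Compare tracked and deployed file contents.

    Both arguments map a path to bytes. Returns {path: status}, where
    status is "modified", "missing", or "untracked".
    """
    drift = {}
    for path, content in tracked.items():
        if path not in deployed:
            drift[path] = "missing"
        elif sha256_hex(deployed[path]) != sha256_hex(content):
            drift[path] = "modified"
    for path in deployed:
        if path not in tracked:
            drift[path] = "untracked"
    return drift
```

An empty result means the server matches the repository; anything else is a candidate for the bug report described above.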
4. Use pre-commit scripts
Humans are pretty forgetful, and we sysadmins are routinely distracted and work in very interrupt-driven environments. Help yourself out and provide your systems administrators with some scripts to remind them what the change message should include.
These reminders may include:
- Did you explain the reason for your change in the commit message?
- Did you include a reference to the bug, ticket, issue, etc related to this change?
- Did you update the documentation to reflect this change?
- Did you write/update a test for this change? (see the bonus tip at the end of this article)
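The first two reminders can be checked mechanically. Here is a minimal sketch of such a script, assuming Git, where a check on the message itself belongs in the `commit-msg` hook (Git passes that hook the path of the file holding the message). The ticket format in `TICKET_PATTERN` and the wording of the reminders are assumptions to adapt to your own tracker.

```python
import re
import sys

# Assumed ticket formats, e.g. "OPS-123" or "#456"; adjust to your tracker.
TICKET_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b|#\d+")


def check_commit_message(message: str) -> list:
    """Return a list of reminders for anything the commit message is missing."""
    lines = message.strip().splitlines()
    reminders = []
    if not lines:
        reminders.append("Add a one-line summary of the change.")
    # Anything after the summary line counts as the explanatory body.
    body = [line for line in lines[1:] if line.strip()]
    if not body:
        reminders.append("Explain the reason for the change in the message body.")
    if not TICKET_PATTERN.search(message):
        reminders.append("Reference the related bug, ticket, or issue.")
    return reminders


if __name__ == "__main__" and len(sys.argv) == 2:
    # When installed as a commit-msg hook, argv[1] is the message file path.
    with open(sys.argv[1], encoding="utf-8") as handle:
        problems = check_commit_message(handle.read())
    for problem in problems:
        print("Reminder:", problem, file=sys.stderr)
    # A non-zero exit status makes Git abort the commit.
    sys.exit(1 if problems else 0)
```

The documentation and test reminders can't be verified from the message alone, so those are better left as prompts for the author and the reviewer.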
As a bonus, this also documents what reviewers of your change should look for.
Wait, reviewers? That's #5.
5. Hook it all into a code review system
Now that you have a healthy revision control system set up, you have an excellent platform for adding another great tool for systems administrators: code review. This should typically be a peer-review system where your fellow systems administrators of all levels can review changes, and your changes will meet certain agreed-upon criteria for merging. This allows a whole team to take responsibility for a change, and not just the person who came up with it. I like to joke that since changes on our team require two people to approve, it becomes the fault of three people when something goes wrong, not just one!
Starting out, you don't need to do anything fancy. Maybe just have team members submit a merge proposal or pull request and socially make sure someone other than the proposer is the one to merge it. Eventually you can look into more sophisticated code review systems. Most code review tools offer features like inline commenting and discussion-style commenting, which provide a non-confrontational way to suggest changes to your colleagues. It's also great for remotely distributed teams who may not be talking face to face about changes, or even awake at the same time.
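The rule that someone other than the proposer must approve is simple enough to write down. This is only a sketch of the policy itself, with a hypothetical `can_merge` helper; real code review systems enforce this through their own settings rather than through a script like this.

```python
def can_merge(author: str, approvers: list, required: int = 1) -> bool:
    """Return True when enough people other than the author have approved."""
    # A set ignores duplicate approvals from the same person.
    independent = {name for name in approvers if name != author}
    return len(independent) >= required
```

With `required=2`, this matches the two-approver arrangement I joked about above.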
Bonus: Do tests on every commit!
You have everything in revision control, you have your fellow humans review it, why not add some robots? Start with simple things: Computers are very good at making sure files are alphabetized; humans are not. Computers are really good at making sure syntax is correct (the right number of spaces/tabs?); checking for these things is a waste of time for your brilliant systems engineers. Once you have simple tests in place, start adding unit tests, functional tests, and integration testing so you are confident that your changes won't break in production. This will also help you find where you haven't automated your production infrastructure and solve those problems in your development environment first.
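The simple checks mentioned above might look something like the following. Treat it as a sketch under assumptions: the function names are hypothetical, the alphabetization check assumes one entry per line with `#` comments, and whether tabs count as a problem depends on your team's conventions.

```python
def unsorted_entries(lines: list) -> list:
    """Return (previous, current) pairs where `current` sorts before `previous`.

    Comparison is case-insensitive; blank lines and "#" comments are ignored.
    """
    entries = [
        line.strip()
        for line in lines
        if line.strip() and not line.strip().startswith("#")
    ]
    return [(a, b) for a, b in zip(entries, entries[1:]) if b.lower() < a.lower()]


def whitespace_problems(text: str) -> list:
    """Return 1-based line numbers with tab characters or trailing whitespace."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if "\t" in line or line != line.rstrip()
    ]
```

Run checks like these on every proposed change, and fail the change when either function returns a non-empty list.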