
Version Control Systems

Sunday, November 18th, 2007 | Personal

Just as the topic of editors can spawn a religious war among developers, so can the topic of version control systems (or VCS for short). The old paragon of the development community, CVS, has felt less loved as of late, and CVS does, indeed, suffer from many issues: commits are not atomic, so the build can break if there is an inadvertent conflict partway through a commit; branching, or rather merging, is a lot more difficult than it need be; and the system is generally slow. Its main advantage is that it is old and understood by many. But, seeing as this is more of a bad excuse, like continuing to use Microsoft products instead of trying other things, I would like to go on a small literary journey through a few of the interesting (and not quite so interesting) alternatives to CVS that exist.

As one of the older contenders to CVS, we have Subversion, or svn for short. Its cutesy motto was ‘CVS done right’. There are still issues with it, though. Renaming isn’t implemented adequately, merging can still be troublesome, and like CVS it is centralised, requiring round trips to the server for many commands that have to do with the history of the repository. Many projects have moved on to use svn, but in the words of Linus Torvalds, ‘There is no way to do CVS right’.

Also in the centrally controlled group, although proprietary and costing you a fortune, lies Microsoft’s newest wonder technology, Team Foundation Version Control (TFVC). Although it is nicer than the ridiculous piece of software, Visual SourceSafe, that they pushed at you before, it is still incredibly annoying to use. Granted, I prefer to run my source control at the command line rather than having all sorts of strange things happen inside my IDE, but what happens when you run ‘tf get /recursive’ to get everything recursively from the VCS? It pops up a GUI when there are conflicts! Running ‘tf help’ starts a graphical help browser… after a minute or two! Another interesting issue is that it sometimes thinks the local files are the most recent, even if there are no files locally at all. And, probably due to Visual Studio, when you get the latest version of a solution, it occasionally checks out some project files to make changes to their configuration. All in all, this makes for one very annoying user experience. Now, on the bright side, merges between branches are fairly easy to do, and branching is an easy operation as well, so kudos to Microsoft for at least getting one thing right. I cannot, though, recommend it to anyone, given that so many better solutions exist, at least for source control management. (The case for Team Foundation Server really lies in the nice integration for project managers as well, but that is fairly irrelevant when it comes to the merits of the VCS itself.)

But, enough of the centralised version control systems. They are a thing of yesteryear. Come to replace them are the distributed version control systems, or DVCS’s for short. These systems largely work in the same fashion, but with some differences in implementation. Their main benefit, though, is that they allow much more diverse workflows and can be used in just the way you like to work. So, if you prefer to work with a central repository, you can do that without any issues, just like with CVS, svn or TFVC, but if you prefer to work in a more disconnected model, or utilise some of DVCS’s staging workflows, they do not keep you from doing just that.

Some of the more popular contenders in the DVCS world are git, mercurial (or hg for short), and Bazaar (bzr). Before we look at each of these, though, it would be nice to look at some of the workflows that can be useful with a distributed model. Throughout this, it is important to note that in a distributed model, when you do a ‘checkout’ from a source tree, what you get is also a full repository in its own right. We’ll get back to this in a bit.

Centralised with staging

In a centralised model, everything works (more or less) as you know it from CVS. The main difference, however, is that since each branch is a repository in its own right, we can do some interesting release staging. If we designate the main development repository as ‘mainline’, this will be where the developers share their finished features. Once a development team signs off on a feature being completely implemented and ready for testing, they tell the test team to pull it into their ‘test’ repository. Once the test team thinks a feature is ready, they will tell the QA team to pull it into their ‘QA’ repository, and finally, when QA signs off on everything being perfectly swell, you can pull the feature into the ‘production’ repository that is used to create the official build of whatever it is that you’re developing. This is all illustrated in the drawing below.

Central staging workflow

What is interesting to remember here is that history is complete at each step. You know exactly who did what, when, and where for each step in the software release flow. The test team can easily look in their own history to see which developer introduced a fault, as each tester will have the entire history right there on their own machine. There is no easy way to do this kind of staging with centralised VCS’s. Also note that this staged release can be accomplished without centralised repositories, but a central model more accurately reflects how most businesses work, and how most managers prefer things to work.
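To make the staging idea a bit more concrete, here is a minimal sketch in Python. It is a toy model, not how any of the real tools are implemented: the `Commit` and `Repo` classes, the author names and the commit messages are all made up for illustration. The only point is that a pull copies commits, with their authorship, from one repository to the next.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One recorded change: who made it and what it was (hypothetical model)."""
    author: str
    message: str


@dataclass
class Repo:
    """A toy repository: a name plus its full, ordered history."""
    name: str
    history: list = field(default_factory=list)

    def commit(self, author, message):
        self.history.append(Commit(author, message))

    def pull(self, other):
        """Copy every commit from `other` that this repository lacks.

        A real DVCS identifies commits by a revision id or hash; this toy
        compares author and message instead, which is enough for the sketch.
        """
        for c in other.history:
            if c not in self.history:
                self.history.append(c)


mainline = Repo("mainline")
test = Repo("test")
qa = Repo("QA")
production = Repo("production")

# Developers share finished features in mainline.
mainline.commit("alice", "add login form")
mainline.commit("bob", "fix password check")

# Each team pulls from the stage before it once that stage has signed off.
test.pull(mainline)
qa.pull(test)
production.pull(qa)

# Every stage holds its own copy of the history, authorship included.
authors = [c.author for c in production.history]
print(authors)  # ['alice', 'bob']
```

Each stage keeps its own list, so a tester can inspect `test.history` without asking mainline for anything.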

Small, completely distributed project

Let us imagine that Alice and Bob are working on a system for visualising branches’ changes in a version control system. Mostly this system has a unified functionality, but since each of the developers uses it in a different context as well, they each require some custom features to ease their own workflows. Since they both really like DVCS’s, they decide to just pull the changes they each find interesting from each other, so that each ends up with a custom visualiser that fits their own workflow. To help understand why they might want different setups, let us consider each of them as a persona:

Alice is a maintainer on a large open source project, and her role is to merge the patches that everyone submits. Most of the patches are sent to her by mail, so for each mail she needs to review the patch and, if it seems sensible, merge it into the official repository. It therefore makes sense that her main view of the visualisation can take a selected mail, perform the merge listed in it and let Alice review all the code quickly and easily, speeding up the general process of improving the software.

Bob is a code reviewer at a medium-sized software company where he spends about half of his workday reviewing other people’s code against the product mainline. To cut down on the time he has to spend figuring out what is new since the last time he reviewed some code, he wants to merge the recent changes into his own branch and then quickly be able to look at the code diff for each change since his last merge. Thus it makes sense for Bob to get a good view of the different branches in the history and to quickly be able to get a diff view for each of these changes, so he can quickly address his concerns to the correct developers on the project.

With this in mind, it is now possible for each of them to develop any number of features, while they cherry-pick (selectively choose) the changes they are interested in from each other. So rather than over-engineer the application to do all sorts of things, they just adapt the source code to their own needs, without either of them having the ‘wrong’ version. This workflow, while simple, is illustrated below for the sake of completeness.

Alice and Bob's workflow
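Cherry-picking between Alice and Bob can be sketched in a few lines of Python as well. Again, this is only a toy under my own assumptions: changes are represented by plain strings, and the `cherry_pick` function and the feature names are invented for the example rather than taken from any real tool.

```python
def cherry_pick(own, theirs, wanted):
    """Return `own` plus the changes in `theirs` that are named in `wanted`.

    Changes that `own` already contains are skipped, so picking twice is harmless.
    """
    picked = [c for c in theirs if c in wanted and c not in own]
    return own + picked


# Both start from the same shared functionality (hypothetical change names).
shared = ["draw branch graph"]
alice = shared + ["apply patch from mail", "review merged code"]
bob = shared + ["diff since last merge", "colour branches"]

# Each takes only what is useful to them from the other.
alice = cherry_pick(alice, bob, {"colour branches"})
bob = cherry_pick(bob, alice, {"review merged code"})

print(alice)
print(bob)
```

After this, Alice has Bob’s branch colouring but not his diff view, and Bob has Alice’s review view but not her mail handling. Neither copy is the ‘wrong’ one.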

Tracking an upstream CVS

As part of my thesis work, I am using a CVS repository with a large framework for doing code transformations that my advisors and some other people have written. Since they aren’t necessarily interested in all the code I need for my thesis, it is easier for me to keep the history of their repository while adding my changes locally. This might’ve been solved by using a CVS branch, but CVS branches are notoriously sad to work with. Instead, I have used a DVCS to import the entire CVS history to a local branch, and then I branch from this (to keep a pristine copy of the CVS repository) for my different features. As an added benefit, I can just rerun the import whenever there are changes in the CVS repository, and these changes are reflected in my DVCS branch as if they were made natively in the DVCS. This would never have worked with a centralised system. In essence, my advisors can keep on working on their system without my code causing them any worries, and I can keep introducing their changes to my code as they commit them. A win-win situation, really. The flow, also fairly simple, is illustrated below.

Thesis workflow
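The same flow can be written down as a small Python sketch. It assumes, for simplicity, that the upstream history is a plain list that only ever grows at the end; the function names `import_upstream` and `merge` and the revision labels are hypothetical and do not correspond to commands in any particular DVCS.

```python
def import_upstream(pristine, upstream):
    """Mirror upstream revisions that are not yet in `pristine`.

    Assumes upstream only appends revisions. Safe to rerun: a second call
    with nothing new upstream adds nothing. Returns the new revisions.
    """
    new = upstream[len(pristine):]
    pristine.extend(new)
    return new


def merge(feature, pristine):
    """Bring revisions from the pristine branch into a feature branch."""
    for rev in pristine:
        if rev not in feature:
            feature.append(rev)


upstream = ["cvs-1", "cvs-2"]      # the advisors' CVS history
pristine = []                      # local mirror, never edited by hand
import_upstream(pristine, upstream)

# Branch from the pristine copy and add local work on top.
feature = list(pristine) + ["thesis: add transformation"]

# The advisors commit something new; rerun the import, then merge.
upstream.append("cvs-3")
first = import_upstream(pristine, upstream)
second = import_upstream(pristine, upstream)
merge(feature, pristine)

print(first, second)
print(feature)
```

The pristine branch always equals the upstream history, and the thesis work lives only in the feature branch.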

DVCS – a redux

Looking at the workflows above (and there are countless more), we see that DVCS is able to easily support not only large corporations, but also small groups of developers who want to share work.

We still need to look a bit at the three contenders: git, hg and bzr. They pretty much all have the same features, so it is perhaps most interesting to look at the differences.

git was created by Linus Torvalds for use with the Linux kernel after a larger controversy with the original DVCS provider, BitKeeper. Being an operating systems guy, Linus has ensured that git is lightning fast, but being an operating systems guy, Linus has also caused the general usability of the system to be abysmal, unless you are ready to devote a non-trivial amount of time to learning how it all works (it is improving with each new version, though, so not all is abysmal any longer). Another unique feature of git is that it tracks your content, not your files, but I will not go into what implications this has. Since git is used with the Linux kernel, it should be proven beyond doubt that it works on large projects. git owes a lot of its speed to using some very specific features of the Linux file system handling, so git works rather abysmally on Windows, making it a non-contender if you do a lot of cross-platform work.

hg was also created as a replacement for BitKeeper for the Linux kernel, but once Linus got underway with his own project he did, of course, not see much incentive for using hg. However, development on hg has continued and it has proliferated. It is implemented in Python, with a few modules in C, and is generally slower than git, but not abysmally so. One of the larger users of mercurial is the Mozilla Corporation, with Firefox and friends. Thanks to Bryan O’Sullivan, there is also a very thorough book on using hg, and on using DVCS’s in general. It is very good reading if you are interested in the subject.

bzr is backed by Canonical, the people behind Ubuntu, and it is written in Python under the motto ‘Correct first. Fast later’. This is one of the reasons why there has never been a working tree corruption in bzr, unlike in most other systems. There are few differences between bzr and hg, in reality, but there are in general more plugins available for bzr doing different things, and in my very subjective opinion, bzr feels more polished and has very nice usability. It is slower, due to the general development philosophy, but the speed issues are being addressed and it doesn’t feel terribly slow to work with any longer.

I have, in case anyone was wondering, chosen to use bzr for my personal projects, but no matter your taste, I would encourage you to take a look at distributed version control systems and see how they might help with your workflow. And stay away from CVS and TFVC.
