Software Engineering: Version Control Systems (VCS)
Published by Weisser Zwerg Blog on
Using a VCS in software development is standard by now but may add more complexity than you asked for.
This blog post is part of the Odysseys in Software Engineering series.
Context
Only recently, in the project I am currently involved in, the team had a 1.5 hour presentation and discussion on how to “properly” use git in their project. This reminded me of a long-standing grudge I have with git: it is too complex and makes things more difficult than they need to be.
Git
While this blog post is not about git, you have to understand a bit about its internals to understand the difficulties it creates. The best explanation of the git internals I’ve seen so far is InfoQ: The Git Parable by Johan Herland. I will not repeat what he explains much better than I could, but if you don’t know (enough about) the git internals, be sure to watch the video.
As git is a distributed version control system (DVCS), it comes as no surprise that many of its internals closely resemble CRDTs (Conflict-free Replicated Data Types). The best introductory paper on CRDTs I know is Conflict-Free Replicated Data Types by Marc Shapiro et al.
The situation before git
There was a time before git became popular and dominated the VCS space. When SourceForge was founded in 1999, it started out using CVS and later added support for SVN. In the late 1990s many software development teams did not even know that VCS systems existed and used a simple directory structure to manage their code. Luckily we have moved far beyond this, and using a VCS in a software development project is standard by now.
In the early days, quality assurance (QA) in software development projects was severely underdeveloped. This changed only slowly with the advent of Test-Driven Development (TDD) and the realization that QA is primarily the responsibility of the development team and not of an independent QA team relying on manual testing.
The observation that integrating code developed by different contributors caused major headaches if these contributors did not regularly integrate their efforts led to the practice of continuous integration (CI): merging all developers’ working copies to a shared mainline several times a day. This process was then supported by a so-called continuous integration server (CI server) that monitored commits to the version control system and, after every commit, built the software and ran all automated tests against it. If a test case failed, an integration problem had been detected and an e-mail was sent to the developer responsible for the commit that caused the failure.
There was a simple protocol that every developer had to follow at least once per day (a rough shell sketch follows below the list):
- fetch all changes from the central code repository
- take care of all merge conflicts detected by the VCS (this is what continuous integration means)
- run all the (fast) automated tests locally on your machine and fix any problems detected
- only if all of the above was successful, commit to trunk/master/mainline (or whatever you called it in your project).
- the CI system then verified independently, as a secondary safety net, that a fresh check-out plus build plus running all automated tests worked fine.
- once per night all the slow tests were run: load, stress, capacity, … tests.
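To make the protocol concrete, here is a minimal sketch of that daily routine in an SVN-style centralized workflow; the test script name is a hypothetical placeholder.

```bash
# Daily routine sketch (SVN-style centralized workflow); the test script
# name is a hypothetical placeholder.
svn update                                # fetch all changes from the central repository
# ... resolve any merge conflicts that SVN reports ...
./run-fast-tests.sh                       # run the fast automated tests locally
svn commit -m "Short description of the change"   # commit to trunk only if the tests passed
```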
Everybody worked mostly on the mainline and you had a simple linear history. With the “blame” functionality of your VCS you could find out who was responsible for which line of code if you needed to.
You only created branches for releases every couple of weeks, and on the release branches only bug fixes were allowed. The bug fixes from the release branch were merged back into the mainline to make sure that all bug fixes were also present there (see the sketch below). Some people took continuous integration even one step further to continuous delivery, meaning that the CI system pushed all changes automatically into production if the tests passed.
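As a concrete illustration, this is roughly what the release-branch workflow looks like with SVN; the branch name and revision number are hypothetical.

```bash
# Create a release branch; only bug fixes are committed to it.
svn copy ^/trunk ^/branches/release-2.3 -m "Create release branch 2.3"
# From a trunk working copy, merge a bug-fix revision back into the mainline
# so the fix is present there as well:
svn merge -c 1234 ^/branches/release-2.3
svn commit -m "Merge bug fix r1234 from release-2.3 back into trunk"
```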
If you worked in a bigger organization, all of the related code was in a single repository and every commit triggered all of the CI activities for all parts, so that you always had a consistent code base across the organization. Only external libraries were versioned and integrated into the code base via a dependency manager (often an integral part of the build system) as a kind of bill of materials. There is a good paper from Google that enumerates the benefits of this approach, Why Google Stores Billions of Lines of Code in a Single Repository. Google has developed its own VCS, but for smaller organizations the same principles can easily be implemented via a system like SVN.
The situation recently (now)
I admire git as a technology. And I understand why very decentralized projects like the Linux kernel depend on the advanced features of a distributed version control system like git. I also very much like the ease and simplicity with which you can put a local project under version control with a simple git init, rather than needing to set up a server component somewhere. I really do see all of these points. But, in my opinion, the majority of projects developed in-house in an organization by a dedicated in-house software engineering team would be better off following the guiding principles in Why Google Stores Billions of Lines of Code in a Single Repository and using something like SVN rather than git.
git is complex and forces developers to think about technical details that are, in principle, irrelevant, such as whether to use merge vs. rebase. And once you go down the route of rebase in order to keep the version history “nicely linear and clean”, you have to live with rewriting history in cases where you collaborate with other developers on public feature branches. See for example Git Rebase: Don’t be Afraid of the Force (Push). It reads:
Recap
While a git rebase might sound scary at first, it’s not so bad when you have done it a couple of times.
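For illustration, this is roughly the rebase-and-force-push sequence the article refers to; the branch name is a hypothetical placeholder.

```bash
# Rebase a public feature branch onto the current mainline; the branch
# name "feature/checkout-flow" is a hypothetical placeholder.
git fetch origin
git rebase origin/main                    # replay the feature branch commits on top of mainline
# ... resolve any conflicts, then: git rebase --continue ...
# The rebase rewrote history, so a normal push is rejected and you must force it:
git push --force-with-lease origin feature/checkout-flow
```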
Speaking of which: feature branches. In the past, feature branches were frowned upon, because feature branches are intrinsically against the spirit of continuous integration. I still don’t see how we ever ended up in a world where people believe that feature branches are a step forward. They are a step backward. I understand why, in an open-source project where you don’t know the people who will send patches to you, you want a QA step in between, like a pull request. But how on earth, in an in-house model, can you let feature branches diverge from mainline only to amplify the integration issues down the road? In addition, regularly committing your work to mainline makes other developers aware of it in case you touch something they are also working on. It forces the team to discuss and make sure that different aspects of the code base work together rather than against each other. In two words: continuous integration!
Just as a side remark: continuous integration is tightly coupled to having a good test suite. There are a lot of pitfalls lurking on the path to a good test suite. I recommend the video TDD: Where Did It All Go Wrong? by Ian Cooper. Despite the sound of its title, it is pro TDD. While I don’t care too much about test-first or test-later, I care very much about having a good test suite that helps productivity rather than hampering it.
In the past couple of months I have seen more and more (in-house) projects switching to a monorepo model, because developers are starting to understand that working in many small repositories leads to dependency hell. But why did people start to work in many small repositories in the first place? Because git is not ideal for a single-repository approach. One of my main complaints about git is that it lacks a “deep checkout” feature, i.e. the ability to check out sub-directories of the top-level directory. Such a feature is useful when a developer wants to work for the moment on a more focused aspect of the code base, while the CI/CD pipeline still checks the overall consistency of everything and the version control system makes sure that you have one consistent view of everything. The word monorepo did not even exist in the past (not in my vocabulary anyway), because working in a monorepo was the natural thing to do. Nobody needed a word for the concept.
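For comparison, this is what such a focused “deep checkout” looks like with SVN, where any sub-directory of the repository can be checked out directly; the repository URL and directory layout are hypothetical.

```bash
# Check out only one part of a large monorepo; URL and layout are hypothetical.
svn checkout https://svn.example.org/repos/company/trunk/services/billing billing
cd billing
# work, run the local tests, and commit from this focused working copy:
svn commit -m "Fix rounding in invoice totals"
```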
Regularly, I feel that people try to find solutions to problems that they would not have had if they had not used git in the first place. That is a typical sign of accidental vs. essential complexity (see Out of the Tar Pit by Ben Moseley and Peter Marks). In my observation, in in-house development projects, git causes more accidental complexity than it brings benefits to the project. It is a net negative rather than a net positive.
The fact that junior developers are starting to ask in interview situations which version control system you are using and refuse to work for the project if you use anything other than git takes the irrationality to the extreme. Which version control system you use in your project should be the result of hard-headed engineering reasoning rather than the need to follow fashion and trends. On the positive side, this interviewing attitude at least makes it easy for the hiring manager to see whether a developer cares about rational reasoning or prefers to follow fashion or focus on perceived career value. If you insist on gaining experience with git, then just join one of the many open-source projects out there that are using git.
Summary
While I am talking above about git, it is not the tool but the version control model that is at the heart of my issue with the current state of affairs. git just happened to conquer the whole distributed version control market, and tools that had some user base in the past, like Mercurial, have been marginalized. Especially for in-house software engineering projects, I advocate at least reviewing the reasoning in Why Google Stores Billions of Lines of Code in a Single Repository and basing your choice on hard-headed engineering reasons. If you care about productivity then you will care about accidental vs. essential complexity, and you will want to squeeze accidental complexity out of your project.
As Linus Torvalds himself put it: “In many ways you can just see git as a filesystem - it’s content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have absolutely zero interest in creating a traditional SCM system.”
Related
- BFG Repo-Cleaner by rtyley (an alternative to git-filter-branch): In case you really need to get rid of some stuff in git/GitHub, for example credentials committed by mistake or a file that is too big, then BFG is your friend.
- SmartGit – Git Client for Windows, macOS, Linux (syntevo.com): I started to use the syntevo tooling for CVS a LONG time ago (SmartCVS) and have stayed with their tooling since then.
- They also have an SVN client: SmartSVN – SVN Client
- On Git and Cognitive Load
- Oh Shit, Git!?!