John Firebaugh

Code Archaeology With Git

Have you ever dug through the commit history of an open source project, peeling away layers, sifting for clues, trying to answer the question, “why does this code do what it does”? I call this process code archaeology.

Code archaeology is made difficult by historical debris: reformatting, refactoring, code movement, and other incidental changes. This post takes a look at techniques for separating the interesting commits from the uninteresting ones. We’ll look at existing git tools, a tool provided by another SCM system that I wish had a git equivalent, and a promising feature of git that has yet to arrive.

Blame

git blame or GitHub’s blame view is frequently the first step, but also the first source of frustration: the commit it lands on often turns out to be one that merely reformatted or moved the line.

git blame has a few options that can help with this problem.

  • With -w, blame ignores lines where only whitespace changed.
  • With -M, blame detects moved or copied lines within a file, and blames them on the original commit instead of the commit that moved or copied them.
  • With -C, blame extends this move or copy detection to other files that were modified in the same commit. You can specify this option two or three times to make git look even harder (but more slowly) for moves and copies. See the manpage for details.
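To see the effect of -w in isolation, here is a minimal sketch that builds a throwaway repository and blames the same line with and without the option. The file name greet.rb, the author identity, and the commit messages are all invented for the illustration; nothing here comes from the Rails repository.

```shell
#!/bin/sh
# Sketch only: a scratch repository with made-up contents.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# Commit 1 introduces the line we care about.
printf 'def greet\nputs "hi"\nend\n' > greet.rb
git add greet.rb
git commit -q -m "Add greet"
first=$(git rev-parse HEAD)

# Commit 2 changes nothing but the indentation of line 2.
printf 'def greet\n  puts "hi"\nend\n' > greet.rb
git commit -q -am "Reindent"
second=$(git rev-parse HEAD)

# --porcelain prints the full commit hash as the first field of its header line.
plain=$(git blame --porcelain -L2,2 greet.rb | head -n 1 | cut -d' ' -f1)
ignoring_ws=$(git blame --porcelain -w -L2,2 greet.rb | head -n 1 | cut -d' ' -f1)

echo "plain blame: $(git log -1 --format=%s "$plain")"
echo "blame -w:    $(git log -1 --format=%s "$ignoring_ws")"
```

Plain blame should point at the reindenting commit, while blame -w should skip past it to the commit that wrote the line. The move and copy options combine with it in the same way, for example git blame -w -M -C greet.rb.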

For example, I compared the blame for Rails’s sprockets railtie without any options and with -wCCC. The latter recognized that a whitespace-only commit shouldn’t be blamed, and it blamed the multiline comment near the end of the file on the commit that originally introduced it, rather than on a later commit that moved it.

If any githubbers are reading this: how about supporting some power-user query parameters on blame pages? I suggest w=1 for ignoring whitespace (a parameter which is already supported on diff pages); M=1, C=1, C=2, and C=3 for various levels of move and copy detection.

Pickaxe

If you read the git blame manpage, you might have noticed a somewhat cryptic reference to the “pickaxe” interface. Pickaxes are often useful for archaeological purposes, and git’s pickaxe is no exception. It refers to the -S option to git log. The -S option takes a string parameter and searches the commit history for commits that introduce or remove that string. That’s not quite the same thing as searching for commits whose diff contains the string—the change must actually add or delete that string, not simply include a line on which it appears.
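The distinction is easier to see side by side. The sketch below uses a scratch repository with an invented one-line file and compares -S with -G, the git log option that matches a regex against the added and removed lines of each diff. The second commit edits the line containing Rails.env without changing how many times that string occurs.

```shell
#!/bin/sh
# Sketch only: file contents and commit messages are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# Commit 1 introduces the string.
printf 'env.version = Rails.env\n' > railtie.rb
git add railtie.rb
git commit -q -m "Introduce Rails.env"

# Commit 2 edits the same line but keeps exactly one occurrence of the string.
printf 'env.version = Rails.env + "-v1"\n' > railtie.rb
git commit -q -am "Append suffix"

# -S reports only commits where the number of occurrences changed.
pickaxe=$(git log --format=%s -S'Rails.env' -- railtie.rb)

# -G reports every commit whose diff has a changed line matching the regex.
grep_count=$(git log --format=%s -G'Rails\.env' -- railtie.rb | grep -c .)

echo "-S found: $pickaxe"
echo "-G found $grep_count commits"
```

Here -S should report only the commit that introduced the string, while -G should also report the later edit, because the changed line merely contains it.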

For example, I was looking at the same Sprockets railtie I looked at with blame and trying to figure out why Rails.env was included in Sprockets’ environment version on line 24. Blame landed on an uninteresting commit:

$ git blame -L23,25 actionpack/lib/sprockets/railtie.rb
8248052e (Joshua Peek 2011-07-27 15:09:42 -0500 23)       app.assets = Sprockets::Environment.new(app.root.to_s) do |env|
63d3809e (Joshua Peek 2011-08-21 16:42:06 -0500 24)         env.version = ::Rails.env + "-#{config.assets.version}"
8248052e (Joshua Peek 2011-07-27 15:09:42 -0500 25)
$ git log -1 63d3809e
commit 63d3809e31cc9c0ed3b2e30617310407ae614fd4
Author: Joshua Peek <[email protected]>
Date:   Sun Aug 21 16:42:06 2011 -0500

    Fix sprockets warnings

    Fixes #2598

But the pickaxe found the answer right away:

$ git log -S'Rails.env' actionpack/lib/sprockets/railtie.rb
commit ed5c6d254c9ef5d44a11159561fddde7a3033874
Author: Ilya Grigorik <[email protected]>
Date:   Thu Aug 4 23:48:40 2011 -0400

    generate environment dependent asset digests

    If two different environments are configured to use the pipeline, but
    one has an extra step (such as compression) then without taking the
    environment into account you may end up serving wrong assets

git gui blame

git gui blame might be one of the most useful and least known features of the Tcl/Tk-based GUI included with git. You give it the name of a file and it opens an interactive blame viewer with built-in move and copy detection and an easy way to reblame from a parent commit. Check it out:

Screenshot of git gui blame

The first column on the left shows the blame with move and copy detection, and the second shows who moved the line to its current location. In the lines selected in green, we see evidence of the movement of the same comment that we looked at with command-line blame: in the first column, José Valim (JV) originated it in 8f75, and Josh Peek (JP) later moved it in 8428.

The killer feature of git gui blame is found in the context menu: “Blame Parent Commit”. When blame lands on an uninteresting commit, you can use this command to skip over it and reblame from the immediately prior commit. This is so useful that gui blame has become my go-to tool for code archaeology.
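On the command line, a rough equivalent of that skip is to pass the uninteresting commit's parent to git blame as the starting revision. The following is a sketch under invented assumptions (a scratch repository, a file named config.rb, and made-up commit messages), not a description of how git gui does it internally.

```shell
#!/bin/sh
# Sketch only: repository contents are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# The interesting commit: it introduces the line.
printf 'version = "dev"\n' > config.rb
git add config.rb
git commit -q -m "Include environment in version"

# The uninteresting commit: a cosmetic change to the same line.
printf "version = 'dev'\n" > config.rb
git commit -q -am "Switch to single quotes"

# Step 1: ordinary blame lands on the cosmetic commit.
culprit=$(git blame --porcelain -L1,1 config.rb | head -n 1 | cut -d' ' -f1)

# Step 2: reblame the same line, starting from that commit's parent.
earlier=$(git blame --porcelain -L1,1 "$culprit^" -- config.rb | head -n 1 | cut -d' ' -f1)

echo "first blame:   $(git log -1 --format=%s "$culprit")"
echo "parent blame:  $(git log -1 --format=%s "$earlier")"
```

In a real history the line number may differ in the parent, so the -L range usually needs adjusting by hand at each step, which is exactly the bookkeeping the GUI saves you.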

Perforce Time-lapse View

I would never choose to use Perforce over git, but I do miss one feature that it provides: the time-lapse view.

The time-lapse view is great for quickly scrubbing through the history of a file, but it’s difficult to keep a particular line of interest in view as you scrub. And because it shows only a linear history, it suffers from Perforce’s branching model; I would frequently land on a huge “integration changelist” (Perforce’s equivalent of a merge commit) and need to go look at the time-lapse on a different branch.

Still, I was often able to unearth interesting commits more quickly than I can with git blame, and I still hope somebody creates a similar tool for git.

Git Line-level History Browser

The 2010 Google Summer of Code included a project for git called the Line-level History Browser, a set of feature additions for the git log command to make it easy to track the history of a line (or set of lines), even through file renames and code movement.

Thomas Rast, co-mentor for the project, explains the purpose of the feature:

For me it replaces a manual iterative process to find out in what ways a function was patched until it came to have its current shape:

git-blame the area, find the most recent commit C
while 1:
  git show C
  if that explains the code: break
  git-blame the area in C^
  find the most recent commit and call it C again

I do this a lot when a particular section of code puzzles me or seems buggy, to see if any commit message provides a reason for it. I think (but I never got good with it) the “blame parent” feature of git gui blame covers a similar use-case.

All of this can now be replaced by a simple git log -L <range> <filename>

And Bo Yang, the mentee, lists details:

Generally, the goal of this project is to:

  1. git log -L to trace multiple ranges from multiple files;
  2. move/copy detect when we reach the end of some lines(where lines are added from scratch).

And now, we have supports in detail:

  1. git log -L can trace multiple ranges from multiple files;
  2. we support the same syntax with git blame -L options;
  3. we integrate the git log -L with --graph options with parent-rewriting to make the history looks better and clear;
  4. move/copy detect is in its half way. We get a nearly workable version of it, and now it is in a phrase of refactor, so in the scope of GSoC, move/copy detect only partly complete.

Eventually, the feature was to support “fuzzy” matching of moves and copies, so that the history could be traced across even more “refactoring”-type commits. Sounds fantastic, right? Why am I not using this every day? Unfortunately, Bo’s work didn’t get merged to git master. It’s not completely defunct; Thomas Rast maintains a WIP version which has seen some recent activity, so I’m cautiously optimistic this feature may yet be released.
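For readers on a newer Git: to the best of my knowledge, a line-level log has shipped since Git 1.8.4, spelled git log -L <start>,<end>:<file> rather than the form quoted above. The sketch below assumes such a version and uses a scratch repository with an invented file, notes.txt, to check that only commits touching the chosen line are reported.

```shell
#!/bin/sh
# Sketch only: assumes a Git release with `git log -L`; contents are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

printf 'one\ntwo\nthree\nfour\nfive\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

printf 'one\nTWO\nthree\nfour\nfive\n' > notes.txt
git commit -q -am "Edit line 2"

printf 'one\nTWO\nthree\nfour\nFIVE\n' > notes.txt
git commit -q -am "Edit line 5"

# Follow the history of line 2 only. The output also carries diff hunks,
# so keep just the lines produced by our custom format.
history=$(git log -L2,2:notes.txt --format='commit: %s')
subjects=$(printf '%s\n' "$history" | grep '^commit: ')

echo "$subjects"
```

The commit that edited line 5 should be absent from the list, which is the whole point: the manual blame-and-reblame loop collapses into a single command.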
