Have you ever dug through the commit history of an open source project, peeling away layers, sifting for clues, trying to answer the question, “why does this code do what it does”? I call this process code archaeology.
Code archaeology is made difficult by historical debris: reformatting, refactoring, code movement, and other incidental changes. This post takes a look at techniques for separating the interesting commits from the uninteresting ones. We’ll look at existing git tools, a tool provided by another SCM system that I wish had a git equivalent, and a promising feature of git that has yet to arrive.
Blame
git blame or GitHub’s blame view is frequently the first step, but also the first source of frustration: it often lands on a commit that only reformatted or moved the code.
git blame has a few options that can help with this problem.
- With -w, blame ignores lines where only whitespace changed.
- With -M, blame detects moved or copied lines within a file, and blames them on the original commit instead of the commit that moved or copied them.
- With -C, blame extends this move or copy detection to other files that were modified in the same commit. You can specify this option two or three times to make git look even harder (but more slowly) for moves and copies. See the manpage for details.
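To make the -w option concrete, here is a minimal sketch that builds a throwaway repository and compares the two blames. The file name demo.rb, the commit messages, and the helper blamed_commit are all made up for illustration; only the git options themselves come from the manpage.

```shell
#!/bin/sh
# Sketch only: a scratch repo with hypothetical file names and messages.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# Commit 1: the line we care about is introduced (unindented).
printf 'def hello\nputs "hi"\nend\n' > demo.rb
git add demo.rb
git commit -q -m "Add hello"
introduced=$(git rev-parse HEAD)

# Commit 2: the same line is only reindented.
printf 'def hello\n  puts "hi"\nend\n' > demo.rb
git commit -q -am "Reindent"
reindented=$(git rev-parse HEAD)

# --porcelain prints the full id of the blamed commit as the first
# field of its first output line; -L 2,2 restricts blame to line 2.
blamed_commit() {
  git blame --porcelain -L 2,2 "$@" | sed -n 1p | cut -d' ' -f1
}

plain=$(blamed_commit demo.rb)
ignoring_ws=$(blamed_commit -w demo.rb)

echo "introduced:      $introduced"
echo "reindented:      $reindented"
echo "git blame:       $plain"
echo "git blame -w:    $ignoring_ws"
```

Plain blame should point at the reindenting commit, while -w should skip past it to the commit that introduced the line.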
For example, I compared the blame for Rails’s sprockets railtie without any options and with -wCCC. The latter was able to tell that this commit shouldn’t be blamed because it changed only whitespace, and it blamed the multiline comment near the end of the file on the commit where it was originally introduced, rather than a later commit which moved it.
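The move detection can be tried out the same way. This is a rough sketch, assuming a scratch repository with hypothetical files original.rb and moved.rb; the comment is made deliberately long because git only follows moved chunks above a minimum size (see the manpage for the thresholds).

```shell
#!/bin/sh
# Sketch only: file names, contents, and messages are invented for the demo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# Commit 1: a multi-line comment lives in original.rb.
cat > original.rb <<'EOF'
# This comment explains why the asset version includes the environment name.
# It is deliberately long so that it clears the copy detection threshold.
# Moving it to another file should not change who gets the blame for it.
def version
  "1"
end
EOF
git add original.rb
git commit -q -m "Add version with explanatory comment"
introduced=$(git rev-parse HEAD)

# Commit 2: the comment is moved, unchanged, into a new file.
head -n 3 original.rb > moved.rb
tail -n 3 original.rb > original.tmp
mv original.tmp original.rb
git add original.rb moved.rb
git commit -q -m "Move comment to moved.rb"
mover=$(git rev-parse HEAD)

# First field of the first --porcelain line is the blamed commit id.
blamed_commit() {
  git blame --porcelain -L 1,1 "$@" | sed -n 1p | cut -d' ' -f1
}

plain=$(blamed_commit moved.rb)
with_copies=$(blamed_commit -C -C -C moved.rb)

echo "introduced:        $introduced"
echo "mover:             $mover"
echo "git blame:         $plain"
echo "git blame -CCC:    $with_copies"
```

Without options, blame credits the commit that created moved.rb; with -C given three times it should trace the comment back to the commit that first wrote it in original.rb.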
If any githubbers are reading this: how about supporting some power-user query parameters on blame pages? I suggest w=1 for ignoring whitespace (a parameter which is already supported on diff pages), and M=1, C=1, C=2, and C=3 for various levels of move and copy detection.
Pickaxe
If you read the git blame manpage, you might have noticed a somewhat cryptic reference to the “pickaxe” interface. Pickaxes are often useful for archaeological purposes, and git’s pickaxe is no exception. The name refers to the -S option to git log. The -S option takes a string parameter and searches the commit history for commits that introduce or remove that string. That’s not quite the same thing as searching for commits whose diff contains the string: the change must actually add or delete that string, not simply include a line on which it appears.
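A small experiment shows the difference. The sketch below uses a scratch repository with a hypothetical file railtie_demo.txt and invented commit messages. For contrast it also runs git log -G, which is not covered in this post: it matches commits whose diff has added or removed lines matching a regex.

```shell
#!/bin/sh
# Sketch only: a toy history, not the real Rails repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# Commit 1: no mention of Rails.env yet.
printf 'version = "assets-1"\n' > railtie_demo.txt
git add railtie_demo.txt
git commit -q -m "Add version"

# Commit 2: Rails.env is introduced.
printf 'version = "assets-1-#{Rails.env}"\n' > railtie_demo.txt
git commit -q -am "Include environment in version"

# Commit 3: the same line is edited, but Rails.env stays on it.
printf 'version = "assets-2-#{Rails.env}"\n' > railtie_demo.txt
git commit -q -am "Bump version"

# Pickaxe: only commits that add or remove the string.
pickaxe=$(git log -S'Rails.env' --format=%s -- railtie_demo.txt)

# -G: every commit whose diff touches a line matching the regex.
grep_diff=$(git log -G'Rails\.env' --format=%s -- railtie_demo.txt)
grep_count=$(printf '%s\n' "$grep_diff" | wc -l | tr -d ' ')

echo "git log -S matched:"
echo "$pickaxe"
echo "git log -G matched $grep_count commits:"
echo "$grep_diff"
```

The pickaxe reports only the commit that introduced Rails.env, even though the later version bump also has the string in its diff.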
For example, I was looking at the same Sprockets railtie I looked at with blame and trying to figure out why Rails.env was included in Sprockets’ environment version on line 24.
Blame landed on an uninteresting commit, but the pickaxe found the answer right away.
git gui blame
git gui blame might be the most useful and least known feature of the Tcl/Tk-based GUI included with git. You give it the name of a file and it opens an interactive blame viewer with built-in move and copy detection and an easy way to reblame from a parent commit. Check it out:
The first column on the left shows the blame with move and copy detection, and the second shows who moved the line to its current location. In the lines selected in green, we see evidence of the movement of the same comment that we looked at with command-line blame: the first column shows that José Valim (JV) originated it in 8f75, and the second that Josh Peek (JP) later moved it in 8428.
The killer feature of git gui blame is found in the context menu: “Blame Parent Commit”. When blame lands on an uninteresting commit, you can use this command to skip over it and reblame from the immediately prior commit. This is so useful that gui blame has become my go-to tool for code archaeology.
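If you’re stuck on the command line, the same step can be approximated by blaming from the parent of the commit you landed on. This is only a sketch, assuming a scratch repository with a hypothetical file settings.conf; in a real history the line number may differ in the parent, so you would have to find the line again before reblaming.

```shell
#!/bin/sh
# Sketch only: file name, contents, and messages are invented for the demo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
git config commit.gpgsign false

# Commit 1: the interesting change.
printf 'timeout = 30\n' > settings.conf
git add settings.conf
git commit -q -m "Set timeout to 30 seconds"
interesting=$(git rev-parse HEAD)

# Commit 2: an incidental edit to the same line.
printf 'timeout = 30  # seconds\n' > settings.conf
git commit -q -am "Add trailing comments"
uninteresting=$(git rev-parse HEAD)

# First field of the first --porcelain line is the blamed commit id.
blamed_commit() {
  git blame --porcelain -L 1,1 "$@" | sed -n 1p | cut -d' ' -f1
}

# First blame lands on the incidental commit...
landed=$(blamed_commit -- settings.conf)

# ...so blame again, starting from that commit's parent.
reblamed=$(blamed_commit "$landed^" -- settings.conf)

echo "first blame:   $landed"
echo "parent blame:  $reblamed"
```

Repeating the last step until the commit message explains the code is the command-line version of clicking “Blame Parent Commit” over and over.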
Perforce Time-lapse View
I would never choose to use Perforce over git, but I do miss one feature that it provides: the time-lapse view.
The time-lapse view is great for quickly scrubbing through the history of a file, but it’s difficult to keep a particular line of interest in view as you scrub. And because it shows only a linear history, it suffers from Perforce’s branching model; I would frequently land on a huge “integration changelist” (Perforce’s equivalent of a merge commit) and need to go look at the time-lapse on a different branch.
Still, I was often able to unearth interesting commits more quickly than I can with git blame, and I still hope somebody creates a similar tool for git.
Git Line-level History Browser
The 2010 Google Summer of Code included a project for git called the Line-level History Browser, a set of feature additions for the git log command to make it easy to track the history of a line (or set of lines), even through file renames and code movement. Thomas Rast, co-mentor for the project, explains the purpose of the feature:
For me it replaces a manual iterative process to find out in what ways a function was patched until it came to have its current shape:
git-blame the area, find the most recent commit C
while 1:
  git show C
  if that explains the code: break
  git-blame the area in C^
  find the most recent commit and call it C again

I do this a lot when a particular section of code puzzles me or seems buggy, to see if any commit message provides a reason for it. I think (but I never got good with it) the “blame parent” feature of git gui blame covers a similar use-case.
All of this can now be replaced by a simple git log -L <range> <filename>
And Bo Yang, the mentee, lists details:
Generally, the goal of this project is to:
- git log -L to trace multiple ranges from multiple files;
- move/copy detect when we reach the end of some lines (where lines are added from scratch).
And now, we have supports in detail:
- git log -L can trace multiple ranges from multiple files;
- we support the same syntax with git blame -L options;
- we integrate the git log -L with --graph options with parent-rewriting to make the history looks better and clear;
- move/copy detect is in its half way. We get a nearly workable version of it, and now it is in a phrase of refactor, so in the scope of GSoC, move/copy detect only partly complete.
Eventually, the feature was to support “fuzzy” matching of moves and copies, so that the history could be traced across even more “refactoring”-type commits. Sounds fantastic, right? Why am I not using this every day? Unfortunately, Bo’s work didn’t get merged to git master. It’s not completely defunct; Thomas Rast maintains a WIP version which has seen some recent activity, so I’m cautiously optimistic this feature may yet be released.