How Old is Your Code? 🕰️

Finding the average age of source code with git blame and AskGit

How long has a line of code existed in a codebase?

It’s easy to get an approximate1 answer on a per-line basis using git blame (and who last modified it). Answering this question in aggregate, however, is not easy with git blame alone, and is where an AskGit query can help.

What’s the average age of all lines of code?

We can calculate the average age of all lines of code in a git repository with the following query:

-- average time (in days) since a line of code was last modified
SELECT
AVG(julianday(datetime('now')) - julianday(author_when)) AS avg_age
FROM files, blame('', '', path)
JOIN commits ON hash = commit_hash ORDER BY avg_age
view raw avg_age.sql hosted with ❤ by GitHub

Running the above will yield a single value:

+--------------------+
| AVG_AGE            |
+--------------------+
| 162.64079124513728 |
+--------------------+

Which is the average age (in days) of a line of code in a git repository. This alone is interesting, but maybe not that useful. Let’s see if we can break that number out…

Average age by author

-- average time (in days) since a line of code was last modified
-- broken out by author_name
SELECT
author_name,
AVG(julianday(datetime('now')) - julianday(author_when)) AS avg_age
FROM files, blame('', '', path)
JOIN commits ON hash = commit_hash
GROUP BY author_name
ORDER BY avg_age

This query modifies the original to group results by author (the author_name column), showing the average age of a line of code added by a particular contributor.

+------------------+--------------------+
| AUTHOR_NAME      | AVG_AGE            |
+------------------+--------------------+
| Author #1        | 39.76696759229526  |
+------------------+--------------------+
| Author #2        | 44.11296238214041  |
+------------------+--------------------+
.
.
.
| Author #20       | 291.14547453681007 |
+------------------+--------------------+
| Author #21       | 432.53069444419816 |
+------------------+--------------------+

This could be a useful way of looking at how active an author is in a codebase - by looking at how recent their “live”2 contributions are. Conversely, it can be used to identify “legacy” authors, whose contributions are now old.

Average age by file

-- average time (in days) since a line of code was last modified
-- broken out by file path
SELECT
path,
AVG(julianday(datetime('now')) - julianday(author_when)) AS avg_age
FROM files, blame('', '', path)
JOIN commits ON hash = commit_hash
GROUP BY path
ORDER BY avg_age
view raw avg_age_by_file.sql hosted with ❤ by GitHub

This query makes a similar change to break results out by file path, and yields results looking like:

+------------------------------+--------------------+
| PATH                         | AVG_AGE            |
+------------------------------+--------------------+
| path/to/file1.go             | 14.493564205910815 |
+------------------------------+--------------------+
| path/to/file2.go             | 15.012321128171276 |
+------------------------------+--------------------+
| path/to/file3.go             | 15.458618565888727 |
+------------------------------+--------------------+
.
.
.
+------------------------------+--------------------+
| path/to/file100.go           | 21.454814814962447 |
+------------------------------+--------------------+

Which is the average age of a line of code in a particular file. This could be an interesting way to find “stale” files that haven't been updated in a while (and maybe deserve a revisit?). Conversely, it could be used to find the “youngest” files in a codebase, ones that have recently been changed significantly and may be a current “hotspot.”

What comes next?

The queries here can be modified to fit different use cases and codebases. For instance, maybe individual files is too granular a view and file type (by extension) is more useful. Or maybe certain files should be excluded (generated code, vendored dependencies, etc.)

AskGit gives you a powerful way to explore these questions in single or multiple codebases. Try it out!

1

Approximate because git blame will show who lost modified a particular line, not necessarily who authored it. This means trivial changes will be included (adding/removing whitespace, simple formatting, etc) and also changes that may have involved moving that piece of code around.

2

“Live” as in code that’s currently in the codebase, last modified by this author