Over the years, Mozilla has had to deploy numerous solutions for rewriting and synchronizing the history of various version control repositories. Synchronization is an inherently difficult problem and as one can expect, we’ve learned a lot through trial and error. This document serves to capture some of that knowledge.
git filter-branch is Git’s built-in tool for complex repository history rewriting. It accepts as arguments filters (executables or scripts) that perform actions at specific stages, e.g. rewriting the commit message or modifying the files in a commit.
While git filter-branch generally gets the job done for simple, one-time rewrites, we’ve found that it isn’t suitable for a) rewriting tree content (read: files in commits) of large repositories, b) use in incremental conversion scenarios, c) cases where robustness or complete control is needed, or d) cases where performance is important.
The following are some of the deficiencies we’ve encountered with git filter-branch:
- Index update performance
Rewriting files in history using git filter-branch requires either an --index-filter or a --tree-filter (there is also a --subdirectory-filter, but it is internally implemented as an index filter of sorts).
--tree-filter should be avoided at all costs because having to perform a working copy sync on every commit adds a lot of overhead. This is especially true in scenarios where you are removing a large directory from history. e.g. if you are rewriting history to remove 10,000 files from a directory, each commit processed by --tree-filter will need to repopulate the files from that directory on disk. That’s extremely slow.
--index-filter is superior to --tree-filter in that it only needs to populate the Git index on every commit (as opposed to the working copy). This is substantially faster. But our experience is that --index-filter is still a bit slow.
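As a sketch of the difference, here is a directory removal done with an --index-filter; the equivalent --tree-filter 'rm -rf thirdparty' would have to check out and re-sync the working copy for every commit. The repository and the thirdparty/ directory are made up for illustration:

```shell
# Build a throwaway repository containing a directory we want to purge.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir -p src thirdparty
echo "keep me" > src/main.c
echo "huge vendored blob" > thirdparty/lib.c
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "initial import"

# Remove thirdparty/ from all history. The filter runs "git rm --cached"
# against the index only; no per-commit working copy checkout happens.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -r -q --cached --ignore-unmatch thirdparty' HEAD
```

After the rewrite, the directory is gone from every commit’s tree while the rest of the content is untouched.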
Writing the index requires I/O (it is a file). Some index update operations also require I/O (to e.g. stat() paths). For this reason, if using an --index-filter, it is highly recommended to perform operations on a tmpfs volume (by using git filter-branch’s -d argument to point its temporary directory at one). Failure to do so could result in significant slowdown due to waiting on filesystem I/O.
When rewriting large parts of the index, we found the performance of git update-index against the existing index to be a bit slow, even when using --assume-unchanged to prevent verifying changes with the filesystem. In some cases (including one where we deleted 90% of the files in a repository), we found that writing a new index file from scratch (by setting the $GIT_INDEX_FILE environment variable and using git update-index --index-info to produce a new index file) and then replacing the existing index was faster than updating the index in place.
We also found the best way to load entries into an index was via git update-index -z --index-info.
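A sketch of the from-scratch index rewrite: in a real --index-filter, git filter-branch exports $GIT_COMMIT and the -d argument controls where the temporary index lives (ideally tmpfs, e.g. under /dev/shm). Here we fabricate a tiny repository and set the variable by hand; the unwanted/ path is hypothetical:

```shell
# Throwaway repository standing in for the one being rewritten.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir -p keep unwanted
echo "wanted" > keep/a.txt
echo "doomed" > unwanted/b.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "initial"
GIT_COMMIT=$(git rev-parse HEAD)   # filter-branch sets this for real filters

# Write a brand-new index instead of mutating the existing one. ls-tree
# output is a format git update-index --index-info accepts directly.
export GIT_INDEX_FILE="$repo/rewritten-index"
git ls-tree -r -z --full-name "$GIT_COMMIT" \
  | grep -zv 'unwanted/' \
  | git update-index -z --index-info

git ls-files   # the new index contains only keep/a.txt
```

The win is that no existing entries have to be individually removed or stat()ed; the filter simply emits the entries it wants to keep.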
- Overhead of filter invocation
Each *-filter argument passed to git filter-branch invokes a process for every commit. If you have 4 filters and 10,000 commits, that’s 40,000 new processes. If your process startup overhead is 10ms (typical for Python processes), that’s 400s right there - and your processes haven’t even done any real work yet! By the time you factor in the filter processes doing something, you could be spending dozens of minutes in filters for large repositories.
- Complexity around incremental rewriting
We often want to perform incremental, ongoing rewriting of a repository. For example, we want to remove a directory and publish the result to a separate repo.
git filter-branch can be coerced to do this, but it requires a bit of work.
git filter-branch is given a rev-list of commits to operate on. When doing an incremental rewrite, you need to specify the base commits to anchor how far back processing should go. For simple histories, specifying base..head just works. However, things quickly fall apart in more complicated scenarios. Imagine this history:
E
|\
D F
| |
C |
|/
B
|
A
If we initially converted C, the next conversion could naively specify a rev-list of C..E. This would include F since it is an ancestor of E. It would also pull in B and A since those are ancestors of F. This would mean that git filter-branch would redundantly operate on B and A! In the best case, this would lead to overhead and slow down incremental operations. In the worst case it would lead to divergent history. Not good.
This problem can be avoided by using the ^COMMIT syntax in the rev-list to exclude a commit and any of its ancestors. If your repository has very complicated history, you may need to specify ^COMMIT multiple times, once for each known root in the unconverted incoming set of commits.
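The exclusion syntax is easy to see on a toy linear history, where B stands in for the last previously-converted commit (all names are illustrative):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
mkcommit() {
  echo "$1" > f.txt
  git add f.txt
  git -c user.name=demo -c user.email=demo@example.com commit -qm "$1"
}
mkcommit A; mkcommit B; mkcommit C; mkcommit D
B=$(git rev-parse HEAD~2)

# "B..HEAD" is shorthand for "HEAD ^B": walk from HEAD, excluding B and
# all of its ancestors. Additional ^COMMIT arguments add more exclusion
# roots for complicated, multi-root histories.
git rev-list HEAD ^"$B"   # prints the hashes of D and C only
```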
Another problem with incremental operations is grafting incoming commits onto the appropriate commit from the last run. Unless you take action, git filter-branch will parent your new commits on commits from the source DAG, which is not what you want for incremental conversions!
While you can solve this problem with a --parent-filter to rewrite the parents of processed commits, we found this approach too complicated. Instead, before incremental conversion, we walked the DAG of the to-be-processed commits. For each root node in that sub-graph, we created a Git graft (using the info/grafts file) mapping the old parent(s) to the already-converted parents. The info/grafts file was only modified for the duration of git filter-branch. A benefit of this approach over --parent-filter was that you only need to process the graft mapping once before conversion, as opposed to once for every commit. This mattered for performance.
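A sketch of the graft trick: a root commit standing in for the incoming batch is temporarily parented onto an already-converted commit by writing the info/grafts file (modern Git deprecates grafts in favor of git replace, but still honors the file). The repository below is synthetic:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
gitc() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo one > f.txt; git add f.txt; gitc commit -qm "already-converted tip"
CONVERTED=$(git rev-parse HEAD)

# An unrelated root commit, standing in for the root of the incoming batch.
git checkout -q --orphan incoming
git rm -q -f f.txt
echo two > g.txt; git add g.txt; gitc commit -qm "incoming root"

# info/grafts lines are "<commit> <new-parent> ...": while the file exists,
# the incoming root is treated as a child of the converted tip.
echo "$(git rev-parse HEAD) $CONVERTED" > .git/info/grafts
grafted=$(git rev-list --count HEAD)   # 2 while the graft is active
rm .git/info/grafts                    # the graft only lives for the rewrite
```

No objects are rewritten by the graft itself; it only changes how ancestry is traversed while the file is in place.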
- Ignoring commits from outside first parent ancestry
One of our common repository rewriting scenarios is stripping out merge commits from a repository (we like linear history). It is possible to do this with git filter-branch by using a --parent-filter that simply returns only the first parent.
However, there is no easy way to tell git filter-branch to only convert the first parent ancestry. While git log has a --first-parent argument, there is no rev-list syntax to do this. And listing each first parent commit explicitly will exhaust argument length limits for large repositories.
So, you either have to call git filter-branch in batches with single commits or have to live with git filter-branch converting commits not in the first parent ancestry. The latter can have major performance implications (e.g. you process 80% more commits than you need to).
- Control over refs
git filter-branch automatically updates the source ref it is converting. This is slightly annoying when you would rather leave the source ref untouched and write the rewritten history somewhere else.
git filter-branch seems like an appropriate tool for systematic repository rewriting. But our experiences tell us otherwise. If you are a developer and need it for a quick one-off, or if you are performing a one-time rewrite, it’s probably fine. But for ongoing, robust rewriting, it’s far from our first choice.
A case study demonstrating our discontent with git filter-branch is converting the history of the Servo repository to Mercurial so we could vendor it into Firefox. This conversion had a few requirements:
- We wanted to strip a few directories containing 100,000+ files
- We wanted to linearize the history so there were no merges
- We wanted to rewrite the commit message
- We wanted to insert hidden metadata in the commit object so hg convert would treat it properly
This was initially implemented with git filter-branch using 4 filters: parent, msg, index, and commit. The parent filter was implemented in sed. The rest were Python scripts. Rewriting ~23,000 commits with git filter-branch took almost 2 hours. That was after spending considerable time optimizing the index filter to run as fast as possible (including doing nothing if the current commit wasn’t in the first parent ancestry). Without these optimizations and tmpfs, run time was 5+ hours!
After realizing that we were working around git filter-branch more than it was helping us, we rewrote all the functionality in Python, building on top of the Dulwich package - a Python implementation of the Git file formats and protocols. Dulwich:
- Gave us full control over which commits were processed. No more complexity around incremental operations!
- Allowed us to perform all operations against rich data structures (as opposed to parsing state from filter arguments, environment variables, or the output of git commands). This was drastically simpler (assuming you have knowledge of Git’s object types and how they work) and faster to code.
- Allowed us to use a single Python process for rewriting. This eliminated all new process overhead from filter invocation.
- Allowed us to bypass the index completely. Instead, we manipulated Git tree objects in memory. While more complicated, this cut down on significant overhead.
- Drastically reduced I/O. Most of this was from avoiding the index. With Dulwich, the only I/O was object reads and writes, which are pretty fast.
- Guaranteed better consistency. When using git commands, things like environment variables and ~/.gitconfig files matter. With Dulwich, this magic wasn’t in play and execution is much more tolerant of varying environments.
It took ~4 hours to rewrite the git filter-branch based solution to use Dulwich. This was made far easier by the fact that our filters were already implemented in Python. The effort was worth it: Python + Dulwich performed an identical conversion of the Servo repository in ~10s versus ~2 hours - a speedup of roughly 700x.
Converting from Git to Mercurial
Git and Mercurial have remarkably similar concepts for structuring commit data. Essentially, both have commit objects a) with a link to a tree or manifest of all files as they exist in that commit and b) with links to parent commits. Not only is conversion between Git and Mercurial repositories possible, but numerous tools exist for doing it!
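The similarity is visible in Git’s own commit objects - each names a tree and zero or more parents, the same shape as a Mercurial changeset pointing at a manifest. A throwaway demonstration:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hi > f.txt
git add f.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "demo"

# The raw commit object: a "tree" line, parent lines (none for a root
# commit), then author/committer metadata and the message.
git cat-file -p HEAD
```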
While there are several tools that can perform conversions, each has its intended use cases and gotchas.
In many cases, hg convert for performing a unidirectional conversion of Git to Mercurial just works, and it is arguably the tool best suited for the job (on the grounds that Mercurial itself knows the best way for data to be imported into it). That being said, we’ve run into a few scenarios where hg convert on its own isn’t sufficient:
- Removing merges from history
We sometimes want to remove merge commits from Git history as part of converting to Mercurial. hg convert doesn’t handle this case well.
In theory, you can provide hg convert a splice map that instructs the conversion to remove parents from a merge. And hg convert happily parses this and starts converting away. But it will eventually explode in a few places where it assumes all parents of a source commit exist in the converted history. This could likely be fixed upstream.
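For reference, a splice map is a plain text file whose lines have the form <child> <parent1>[,<parent2>], keyed by source revision identifiers (full Git hashes for a Git source). A hypothetical entry that rewrites a merge down to its first parent only (the hashes below are made up and truncated):

```
0f5a9d1deadbeef… e3c1b27cafebabe…
```

The file is passed via hg convert --splicemap FILE; as noted above, the conversion may still fail later when a spliced-out parent is referenced.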
- Copy/rename detection performance
Mercurial stores explicit copy and rename metadata in file history. Git does not. So when converting from Git to Mercurial, hg convert asks Git to resolve copy and rename metadata, which it then stores in Mercurial. This more or less just works.
A problem with resolving copy and rename metadata is that it is very computationally expensive. When Git’s --find-copies-harder flag is used, Git examines every file in the commit/tree to find a copy source. For repositories with, say, 100,000 files, you can imagine how slow this can be.
Sometimes we want to remove files as part of conversion. Doing the removal inside hg convert means Git performs the copy and rename detection before those discarded files are removed. This means that Git does a lot of throwaway work for files that aren’t relevant. When removing tens of thousands of files, the overhead can be staggering.
- Copy/rename metadata and deleted files
As stated above, Mercurial stores explicit copy and rename metadata in file history. When files are being deleted by hg convert, there appear to be some problems where hg convert gets confused if the copy or rename source resides in a deleted file. This is almost certainly a correctable bug in hg convert.
- Behavior for empty changesets
When removing files from history (including ignoring Git submodules), it is possible for the converted Git commit to be empty (no file changes). hg convert has (possibly buggy) behavior where it automatically discards empty changesets, but only if a --filemap is being used. This means that empty Git commits are preserved unless --filemap is used. (A workaround is to specify an empty --filemap.)
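For reference, a filemap passed via --filemap is a small rule file of include, exclude, and rename directives. A hypothetical map that drops a vendored directory and moves everything else under a prefix:

```
exclude "thirdparty"
rename . servo
```

Supplying any filemap, even an empty file, is enough to trigger the empty-changeset pruning behavior described above.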
When these scenarios are in play, we’ve found that it is better to perform the Git to Mercurial conversion in 2 phases:
- Perform a Git->Git rewrite
- Convert the rewritten Git history to Mercurial
In cases where lots of files are being removed from Git history, this approach is highly recommended because of the performance overhead of processing the unwanted files during hg convert.
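The two phases can be sketched as follows; the directory names are hypothetical, and the Mercurial phase is guarded since hg (and its bundled convert extension) may not be installed:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir -p src bigdir
echo "code" > src/app.c
echo "bulk" > bigdir/data.bin
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "import"

# Phase 1: Git -> Git rewrite. The unwanted files disappear here, so the
# later conversion never pays copy/rename-detection costs for them.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -r -q --cached --ignore-unmatch bigdir' \
  --prune-empty HEAD

# Phase 2: convert the already-cleaned Git history to Mercurial.
if command -v hg >/dev/null 2>&1; then
  hg --config extensions.convert= convert "$repo" "$repo-hg"
fi
```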