22 May 2012, 08:03

Completely removing all traces of files and history from Git repositories

By Scott D. Barker under development, git, version control

At my company, we recently made the decision to migrate our source repositories from TFS to Git. While new projects have been going into Git for quite some time, and some projects had already been migrated, the largest still remained. Let’s call this project Website.

The Website repository in TFS contained not only several years of source history but also the full history of the site’s marketing content, largely images. We determined that this content shouldn’t be part of the repository at all (at least not this one; it would continue to live in its own TFS repository until final arrangements could be made), so removing it would be part of our migration process. The process boiled down to this:

1. Branch out the Marketing content into its own TFS repository, using a build step for Website to perform the checkout and get it where it needs to be.
2. Clone the TFS repository into a new Git repository.
3. Clean out all of the old history of the Marketing content from Website.

I had no real involvement with the first step, so I spent some time with steps #2 and #3. I ran through this process on my development machine (an eight-core 3.33 GHz Xeon with 12 GB of RAM, Windows 7 Professional) a few times to make sure that I had the process down and to get an idea of timing so that we could better plan. Each of the two steps took about 4 days to complete. Woof. As it turns out, once we had the process worked out, a co-worker spun up a large Ubuntu EC2 instance to do the heavy lifting and brought the execution time down to about 2.5 hours total.

Here’s what we did to actually make this happen.

Clone the TFS repository into a new Git repository.

This becomes a very straightforward process using git-tfs (http://git-tfs.com/, https://github.com/spraints/git-tfs). I won’t go into the details of getting that pulled down and running. Once you have it, though, you’re going to issue this command from the root of the directory that you want to clone the repository into:

git tfs clone -d --username <username> http://server:8080/tfs/collection <repository> .

Note: You’ll have to do at least this initial clone on a Windows machine as the git-tfs binaries are Windows-only. Anything after that can run anywhere that Git can run.

Make sure to replace the placeholders (<username>, the URL, and <repository>) with values that are relevant for your system.

After that’s done spinning for a while, you’ll have a Git repository that mirrors your TFS repository.
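
Before rewriting anything, it can help to record how many commits the fresh clone has, since the cleanup later on may drop some of them. Here is a minimal sketch, assuming a POSIX shell; commit_count is a hypothetical helper name, not part of Git or git-tfs:

```shell
# commit_count: hypothetical helper, prints the number of commits
# reachable from the given revision (defaults to HEAD).
commit_count() {
    git rev-list --count "${1:-HEAD}"
}

# Example, run from the root of the cloned repository
# (the output file name is only an illustration):
#   commit_count > ../commits-before.txt
```

Running the same helper again after the cleanup gives a rough idea of how many commits were removed as empty.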

Clean out all of the old history of the Marketing content from Website.

Now we want to get rid of all of the enormous Marketing content. This is where the Git voodoo comes in.

From the root directory of your Git repository, you’ll want to issue these commands:

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch Website/Marketing" --prune-empty HEAD
rm -rf ./.git/refs/original
git reflog expire --expire=0 --all
git gc --aggressive --prune=0

The filter-branch command walks every commit reachable from HEAD, removes everything under the Website/Marketing path from each one and, thanks to --prune-empty, drops any commit that would be left empty as a result. The next command deletes the backup references that filter-branch saves under refs/original; these still point at the original, unrewritten history and would otherwise keep it alive. After that, we expire all of the reflog entries, which can also hold on to the old commits. Finally, we garbage collect the repository so that the now-unreachable objects are actually deleted.

The final result is going to be a commit-by-commit clone of your TFS repository, less any of the data you told it to remove and potentially less a few commits if they only contained files that you removed.
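
One way to check the result is to ask Git whether any commit still touches the removed path. This is a rough sketch, assuming a POSIX shell; path_commits is a hypothetical helper name, and Website/Marketing is the path from the example above:

```shell
# path_commits: hypothetical helper, prints how many commits on any ref
# (branches, tags, and leftover backup refs) still touch the given path.
path_commits() {
    git log --all --oneline -- "$1" | wc -l | tr -d ' '
}

# Example, run from the root of the rewritten repository:
#   path_commits Website/Marketing
```

Because the filter-branch command above names only HEAD, only the current branch is rewritten. A non-zero count here may mean that another branch or a tag still references the old history.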

In the end, we had a TFS repository weighing in at about 1.2 GB, which became a Git repository of around 330 MB. That’s a pretty substantial win.

Potential pitfall

You might find that the size of your repository doesn’t decrease at all after you do this. My previously mentioned co-worker had to clone the repository into another Git repository to actually end up with a smaller one.

cd ..
mkdir WebsiteClone
cd WebsiteClone
git clone ../Website
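
To see whether the clone really is smaller, you can compare the size of the two .git directories. The following is only a sketch, assuming a POSIX shell with du and cut available; repo_git_kb is a hypothetical helper name. One thing to keep in mind: when the source is given as a plain local path, git clone may hardlink or copy the existing object files rather than transfer only what is reachable, so if the clone is no smaller, the --no-local option (which uses the regular Git transport) is worth a try.

```shell
# repo_git_kb: hypothetical helper, prints the size of a repository's
# .git directory in kilobytes (du -k reports 1024-byte units).
repo_git_kb() {
    du -sk "$1/.git" | cut -f1
}

# Example, run from inside WebsiteClone as in the commands above:
#   git clone --no-local ../Website
#   repo_git_kb ../Website
#   repo_git_kb Website
```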

Ramblings of a Software Engineer