23 Sep 2012, 01:07

Migrating from TFS to Git


Not so long ago, I had a post about Completely removing all traces of files and history from Git repositories. That title was somewhat misleading, since it was more about migrating from TFS to Git, and in the process pruning out files that you didn’t want to leave in your repository. So, if you’re looking to migrate from TFS to Git, and/or you’re looking to prune out history, have a look at that post.

09 Jul 2012, 14:39

Git credential caching on Windows


Update: This is no longer necessary, check out the update here.

In my last post (PRO-TIP: Recursively updating all git repositories in a directory) I made mention of using Git’s new credential caching to improve your Git experience.

A couple things of note:

  • The API for credential management in Git is fairly new (as of 1.7.9), so you’ll need a fairly recent Git to make use of it.

  • This doesn’t work for Windows systems, as git-credential-cache communicates through a Unix socket. Windows users can find joy at https://github.com/anurse/git-credential-winstore (the downloads specifically live at https://github.com/anurse/git-credential-winstore/downloads). Just make sure that the binary is in your path (likely C:\Program Files (x86)\Git\libexec\git-core if you’ve installed msysgit and didn’t mess with the defaults). The integration is with the Windows Credential Manager, which you can pull up via the Windows Control Panel.

Once you’ve got it installed, from a command prompt:

git config --global credential.helper winstore

Or you can edit your .gitconfig manually:

     [credential]
         helper = winstore

26 Jun 2012, 13:17

PRO-TIP: Recursively updating all git repositories in a directory

I have a directory, let’s call it ./src (because it is). This directory has several other directories of which some subset are git repositories. Updating them all by hand is tedious at best and I am lazy. Here’s a one-liner that will do all of the work for you.

W=`pwd`;for i in $(find . -name .git);do D=$i;D=${D%/*};cd $W/$D;pwd;git pull;done
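If you prefer something easier to read (and safer with paths that contain spaces), the same idea can be written as a short script. This is just a sketch of the one-liner above; it assumes each repository has an upstream tracking branch configured so a bare git pull knows what to do:

```shell
#!/bin/sh
# Find every .git directory below the current directory and pull its work tree.
find . -type d -name .git | while read -r gitdir; do
    repo=${gitdir%/.git}       # strip the trailing /.git to get the work tree
    echo "Updating $repo"
    ( cd "$repo" && git pull ) # subshell, so there's no need to cd back
done
```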

This can still be a bit taxing, though, if you have to enter credentials for every pull. We can solve that one as well.

git config --global credential.helper 'cache --timeout=[S]'

The --timeout option takes a parameter, S, which is the timeout in seconds. It can be omitted, in which case a default timeout of 15 minutes (900 seconds) is used.
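For example, to cache credentials for an hour:

```shell
git config --global credential.helper 'cache --timeout=3600'
```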

Being lazy + saving time + making things easy = Happy Panda!

22 May 2012, 08:03

Completely removing all traces of files and history from Git repositories

At my company, we recently made the decision to migrate our source repositories from TFS to Git. While new projects have been going into Git for quite some time, and some projects had already been migrated, the largest still remained. Let’s call this project Website.

The Website repository in TFS contained not only the last several years of history but also all of the history for our marketing content for the site, largely images. We determined that this shouldn’t be part of the repository at all (at least not this one; it would continue to live in its own TFS repository until final arrangements could be made), and thus removing it would be part of our migration process. The process boiled down to this:

  1. Branch out the Marketing content into its own TFS repository, using a build step for Website to perform the checkout and get it where it needs to be.
  2. Clone the TFS repository into a new Git repository.
  3. Clean out all of the old history of the Marketing content from Website.

I had no real involvement with the first step, so I spent some time with steps #2 and #3. I ran through this process on my development machine (an eight-core 3.33 GHz Xeon with 12 GB of RAM, Windows 7 Professional) a few times to make sure that I had the process down and to get an idea of timing so that we could better plan. Both steps took about 4 days each to complete. Woof. As it turns out, a co-worker spun up a large Ubuntu EC2 instance to do the heavy lifting once the process was nailed down, and brought the execution time down to about 2.5 hours total.

Here’s what we did to actually make this happen.

Clone the TFS repository into a new Git repository.

This becomes a very straightforward process using git-tfs (http://git-tfs.com/, https://github.com/spraints/git-tfs). I won’t go into the details of getting that pulled down and running. Once you have it, though, you’re going to issue this command from the root of the directory that you want to clone the repository into:

git tfs clone -d --username <username> http://server:8080/tfs/collection <repository> .

Note: You’ll have to do at least this initial clone on a Windows machine as the git-tfs binaries are Windows-only. Anything after that can run anywhere that Git can run.

Make sure to replace the placeholders (<username>, the URL, and <repository>) with values that are relevant for your system.

After that’s done spinning for a while, you’ll have a Git repository that mirrors your TFS repository.

Clean out all of the old history of the Marketing content from Website.

Now we want to get rid of all of the enormous Marketing content. This is where the Git voodoo comes in.

From the root directory of your Git repository, you’ll want to issue these commands:

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch Website/Marketing" --prune-empty HEAD
rm -rf ./.git/refs/original
git reflog expire --expire=0 --all
git gc --aggressive --prune=0

The filter-branch command will go through every commit in your repository, removing any files and history under the Website/Marketing path and, if any commits would be left empty after that, removing those commits in their entirety. The next command removes the original references that the rewrite left behind; these would be pointing to the bits of history that we removed earlier. After that, expire all of the reflog entries in the repository. Finally, garbage collect your repository.

The final result is going to be a commit-by-commit clone of your TFS repository, less any of the data you told it to remove and potentially less a few commits if they only contained files that you removed.

In the end, we had a TFS repository weighing in at about 1.2 GB which became a Git repository sitting around 330 MB. That’s a pretty substantial win.

Potential pitfall

You might find that the size of your repository didn’t decrease at all after doing this. My previously mentioned co-worker actually had to clone the repository into another Git repository to see the result and end up with a smaller repository.

cd ..
mkdir WebsiteClone
cd WebsiteClone
git clone ../Website
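Either way, you can check whether the space actually came back with git count-objects, run before and after the cleanup:

```shell
# Report object counts and on-disk sizes; size and size-pack are in KiB
git count-objects -v
```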

23 Apr 2012, 09:00

Branch poisoning in git: Programmatically clean up old (merged) branches

After seeing that the project I’m working on had 48 branches hanging out, 40 of which were completely merged and abandoned, I threw together a quick one-liner (it’s still one line even if it’s more than 80 columns, right?) to delete all of the fully merged branches.

Full disclosure; this works on my machine, which happens to be a Win7 machine with some release of Cygwin on it (which I loathe):

for k in $(git branch -a --merged | grep -v "\->\|master" | sed 's/remotes\/origin\///'); \
do git branch -d $k; git push origin :$k; done

And when you’re done, make sure you alert others to run git remote prune origin.
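git remote prune also takes a --dry-run flag, which is a nice sanity check before letting it delete anything:

```shell
# List the stale remote-tracking branches that would be removed...
git remote prune --dry-run origin
# ...then actually remove them
git remote prune origin
```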

The breakdown

To elaborate on this a little bit: for each completely merged branch $k:

for k in $(git branch -a --merged)

Make sure that it isn’t pointing somewhere (like HEAD would be) and isn’t master:

 | grep -v "\->\|master"

Remove the remotes/origin/ part of the branch name:

| sed 's/remotes\/origin\///')

Delete the local branch (if it’s there) and push the delete to the repository:

do git branch -d $k; git push origin :$k; done

Inspiration taken from Graham King at http://www.darkcoding.net/software/cleaning-up-old-git-branches/

26 Mar 2012, 16:16

Mimic Github's fork without using Github

One of the great features of Github (and there are many) is the ability to fork a project to do your own development and experimentation on. I don’t, however, use Github, as I prefer to keep everything on my own server and not pay for something that I’m capable of hosting myself. I recently started a new project built on the framework from an old project, so this was an obvious place to do some pseudo-refactoring (leaving the old, original project intact) and to give myself a good starting bed for future new projects. I had a few goals:

  • Break the framework out of the Old Project into its own New Framework repository
  • Fork New Framework and start development of New Project
  • Easily fold changes to New Framework into New Project without tainting New Framework
  • Easily fold changes in New Project that relate explicitly to New Framework back into New Framework

I started off by cloning Old Project into New Framework and stripping out everything that had anything to do with Old Project.

git clone OldProject NewFramework
cd NewFramework
rm -rf .git/
git init
# *hack hack hack to remove all OldProject cruft from NewFramework*
git remote add origin [new-framework repository url]
git push origin master

At this point, we have our NewFramework directory that’s full of just NewFramework, and it’s all happy and pushed off to the NewFramework repository. This is the starting point that we want for future projects that will be using this framework which is perfect because that’s exactly the situation that we’re in. So, let’s start NewProject using NewFramework as a base.

git clone NewFramework NewProject
cd NewProject
git remote -v
git remote add upstream [url of the 'origin' remote from the git remote -v command]
git remote remove origin
git remote add origin [url of the repository for NewProject]
git push origin master
git branch upstream

And now we have our NewProject repository created with our bare copy of NewFramework committed and pushed up. When we have changes in NewFramework that we want to fold back into NewProject, we can do this:

git checkout upstream
git pull upstream master
git checkout master
git merge upstream

And if we have some changes to the framework that we want to go back to NewFramework:

git log
# *find the sha of the commit we want*
git checkout upstream
git cherry-pick [sha of the commit we want]
git push upstream upstream:master

24 Feb 2012, 13:29

Backups and data redundancy for the paranoid

Data backup is one of those things that everybody talks about, few people do, fewer people do well, and fewer still have actually tested.

What makes a good backup strategy?

To me, a backup strategy needs to have a few qualities.

  • It has to be easy. If it’s not, you won’t keep up with it.
  • It has to be reliable. Backing up your data won’t do you any good if your backups aren’t good.
  • It has to be redundant. Backups can go bad too.
  • It has to be recoverable from. If you’re encrypting your backups and you forget your key, they’re useless.

Now, what brought me here, and how did I attain those goals?

First, why so paranoid?

I’ve been paranoid about data loss for a long time, and I spent a good deal of time and effort trying to figure out what the best strategy would be for me that met all of the requirements that I outlined above. But why was I so paranoid to begin with?

When I was a freshman in college I experienced my first hard drive failure. My Western Digital hard drive suddenly gave up the ghost, taking with it all of the software that I had written over the years (much of it in x86 assembly). Try as I might, I couldn’t recover anything. I would have paid anything then to get that data back, but as a poor college student, professional data recovery wasn’t an option. With no real backups, my entire digital life to that point was wiped clean.

It was at this point that I learned the importance of backups. I didn’t, however, learn the importance of a good backup strategy. To that end, I would burn CDs and email myself copies of things that were REALLY important. Other important things were zipped up and stored on another hard drive. Sometimes I’d just copy and paste a folder somewhere else. I’d have multiple copies of things floating around, and no real way to tell which was the most recent, or most correct.

At the time, I thought that this worked. Mostly because I just didn’t know any better. Had I experienced a drive failure during that period, I’d have been sent on a wild goose chase through my old email, unlabeled physical media, and folders upon folders of copies of various files and zip archives. I’ve since seen the error of my ways. In part because I’ve gotten smarter, and in part because technology has gotten smarter.

A new strategy is born

My new backup strategy is much, much more robust, easy to manage, and easy to recover from.



How Does This Work Together?

  • Every 4 hours, CrashPlan backs up changes made to the Virtual Machine (on the Secondary Drive) to the External Drive (encrypted with TrueCrypt), to the Drobo, and to CrashPlan+.

  • Once per week, DroboCopy copies the Virtual Machine to the Drobo. This is done to give me an instantly available copy-and-paste snapshot of the server to get back up and running while I recover the most recent version through a CrashPlan restore.

  • In real-time, CrashPlan watches for changes to anything of high value on the Drobo and backs those changes up to the External Drive and CrashPlan+.

  • Those HVFs, in addition to source code, pictures, tax returns, and the like, include scans of important physical documents (product warranties, contracts, receipts, etc.) from the Artisan 835. The original physical documents are kept in a separate fire/water-proof safe. In addition, using the Artisan, I create hard copies of digital documents (receipts and the like) for physical storage.

  • Source code is also stored in git repositories on the Virtual Machine so that I have full revision history for any project that I’m working on (my old CVS repositories have been deprecated and converted to git repositories).

What does all of this do for me? I have several points of recovery available, and aside from the OS and applications, which have no irreplaceable value, I have no more than 4 hours of unrecoverable data. This all took quite a while to set up, but the peace of mind is worth it. The backups are pretty much out of sight, out of mind, and I never have to worry about a manual step to protect my data.

Is it excessive? Perhaps, but I never worry about losing another piece of important data again.

There are a couple of things that I’d like to improve, but they’re not critical. One, I’d like to upgrade to a DroboFS to remove the dependency of having my Drobo physically attached to my primary machine. Second, I really wish CrashPlan would allow me to add machines to my account without buying a family subscription. I only have one more machine I’d like to add (the Virtual Machine), and the cost of a family subscription just isn’t worth it when I can work around that limitation by backing up the entire machine (since it’s just a set of files). It’s just annoying.


I’ve since upgraded from the Gen2 Drobo mentioned above to a DroboFS (2x 3 TB, 2x 2 TB, 1x 1 TB with dual-drive redundancy). In addition to the speed benefits and the obvious benefits of being a NAS, my paranoia during array rebuilds makes dual-drive redundancy a must-have. Unfortunately, the DroboFS is currently having a lot of different issues (though none that seem to be putting my data at risk). I have a support ticket in with Data Robotics, and hopefully they can address the issues.