Git’s great! But it’s difficult to learn (it was for me, anyway) – especially the index, which unlike the power-user features, comes up in day-to-day operation.
Here’s my path to enlightenment, and how I ended up using the index in my particular workflow. There are other workflows, but this one is mine.
What this isn’t: a Git tutorial. It doesn’t tell you how to set up git, or use it. I don’t cover branches, or merging, or tags, or blobs. There are dozens of really great articles about Git on the web; here are some. What’s here are just some pictures that aren’t about branches or blobs, that I wished I’d been able to look at six months ago when I was trying to figure this stuff out; I still haven’t seen them elsewhere, so here they are now.
My brief history with Git
I started using Git about six months ago, in order to productively subcontract for a company that still uses Perforce. Before that I had been a happy Mercurial user; before that, a Darcs devotee; before that, a mildly satisfied Subversion supplicant; and before that, a Perforce proponent. (That last was before the other systems even existed. I introduced Perforce into a couple of companies that had previously been using SourceSafe(!) – including the one I was now contracting for.)
Each of these systems has flaws. Perforce and Subversion require an always-on connection and make branching (and merging) expensive, and Perforce uses pessimistic locking too (you have to check a file out before you can edit it). I got hit by the exponential merge bug in Darcs (since fixed?); a deeper problem was that I found I wanted to be able to go back in time more often than I needed to commute patches, whereas Darcs makes the latter easy at the expense of the former – so Darcs’ theory of patches, although insightful and beautiful, just didn’t match my workflow.
Git’s problem is its complexity. Half of that is because it’s actually more powerful than the other systems: it’s got features that make it look scary but that you can ignore. Another half is that Git uses nonstandard names for about half its most common operations. (The rest of the VCS world has more or less settled on a basic command set, with names such as “checkout” and “revert”. Not Git!) And the third half is the index. The index is a mechanism for preventing what you commit from matching what you tested in your working directory. Huh?
Git without the index
I got through my first four months of Git by pretending it was
Subversion. (A faster implementation of
Subversion, that works offline, with non-awful branches and merging, that can
run as a client to Perforce – but still basically Subversion.) The executive
summary of this mode of operation is that if you use “
git commit -a” instead
git commit”, you can ignore the index altogether. You can alias
commit -a” (and train yourself not to use the longer
commit, which I hadn’t
been doing anyway), and then you don’t have to remember the command-line
$ cat ~/.gitconfig [alias] ci = commit -a co = checkout st = status -a $ git ci -m 'some changes'
Adding Back the Index
Git keeps copies of your source tree in the locations in this diagram1. (I’ll call these locations “data stores”.)
The data store that’s new, relative to every other DVCS that I know about, is the “index”. The one that’s new relative to centralized VCS’s such as Subversion and Perforce is the “local repository”.
The illustration shows that “
git add” is the only (everyday) operation that
can cause the index to diverge from the local repository. The only reason (in
Subversion-emulation mode) to use “
git add” is so that “
git commit” will see
your changes. The
-a option to “
git commit” causes “
git commit” to run
git add -u” first – in which case you never need to
run "git add -u”
explicitly – in which case the index stays in sync with the repository head.
This is how the trick in “git without the index” works: if you always use commit
git commit -a”, you can ignore the index2.
So what’s the point of the index? Is it because Linus likes complicated things? Is to one-up all the other repositories? Is it to increase the complexity of system, so that you have a chance to shoot yourself in the foot if you’re not an alpha enough geek?
Well, probably. But it’s good for something else as well. Several things, actually; I’ll show you one (that I use), and point you to another.
But first, a piece of background that helps in understanding Git. Git isn’t at
its core a VCS. It’s really a distributed versioning file system, down to its
own fsck and
gc. It was
developed as the bottom layer of a VCS, but the VCS layer, which provides the
conventional VCS commands (
branch), is more like an
uneven veneer than like the “porcelain” it’s sometimes called: bits of file
system (git core) internals poke through.
The disadvantage of this (leaky) layering is that Git is complicated. If you look up how to diff against yesterday’s 1pm sources in git diff, it will send you to git rev-parse from the core; if you look up git checkout, you may end up at git-check-ref-format. Most of this you can ignore, but it takes some reading to figure out which.
The advantage of the layering is that you can use Git to build your own workflows. Some of these workflows involve the index. Like the other fancy Git features, building your own workflows is something that you can ignore initially, and add when you get to where you need it. This is, historically, how I’ve used the index: I ignored it until I was comfortable with more of Git, and now I use it for a more productive workflow than I had with other VCS’s. It’s not my main reason for using Git, but it’s turned to a strength from being a liability.
My Git Workflow
Added: By way of illustration, here’s how I use Git. I’m not recommending this particular workflow; instead, I’m hoping that it can further illustrate the relation between the workspace, the index, and the repository; and also the more general idea of using Git to build a workflow.
I use the index as a checkpoint. When I’m about to make a change that might go awry – when I want to explore some direction that I’m not sure if I can follow through on or even whether it’s a good idea, such as a conceptually demanding refactoring or changing a representation type – I checkpoint my work into the index. If this is the first change I’ve made since my last commit, then I can use the local repository as a checkpoint, but often I’ve got one conceptual change that I’m implementing as a set of little steps. I want to checkpoint after each step, but save the commit until I’ve gotten back to working, tested code. (More on this tomorrow.)
Added: This way I can checkpoint every few minutes. It’s a very cheap operation,
and I don’t have to spend time cleaning up the checkpoints later. “
tells me what I’ve changed since the last checkpoint; “
git diff head” shows
what’s changed since the last commit. “
git checkout .” reverts to the last
git checkout head .” reverts to the last commit. And “
stash” and “
git checkout -m -b” operate on the changes since the last commit,
which is what I want.
I’m most efficient when I can fearlessly try out risky changes. Having a test suite is one way to be fearless: the fear of having to step through a set of manual steps to test each changed code path, or worse yet missing some, inhibits creativity. Being able to roll back changes to the last checkpoint eliminates another source of fear.
I used to make copies of files before I edited them; my directory would end up
littered with files like
code.java.2, which I would
periodically sweep away. Having Git handle the checkpoint and diff with them
makes all this go faster. (Having painless branches does the same for
longer-running experiments, but I don’t want to create and then destroy a branch
for every five-minute change.)
Here’s another picture of the same Git commands, this time shown along a second
axis, time, proceeding from top to bottom. [This is the behavior diagram to the
last picture’s dataflow diagram. Kind of.] A number of local edits adds up to
something I checkpoint to the index via “
git add -u”; after a while I’ve
collected something I’m ready to commit; and every so many commits I push
everything so far to a remote repository, for backup (although I’ve got other
backup systems), and for sharing.
I’ve even added another step, releasing a distribution, that goes outside of git. This uses rsync (or scp, or some other build or deployment tool) to upload a tar file (or update a web site, or build a binary to place on a DVD).
Ryan Tomayko has written an excellent essay about a completely different way to use the repository. I recommend it wholeheartedly.
Ryan’s workflow is completely incompatible with mine. Ryan uses the repository
to tease apart the changes in his working directory into a sequence of separate
commits. I prefer to commit only code that I’ve tested in my directory, so
Ryan’s method doesn’t work for me. I set pending work aside via
git stash or
git checkout -m -b when I know I might need to interrupt it with another
change; this sounds like it might not work for Ryan. Neither one of these
workflows is wrong (and I could easily use Ryan’s, I’m just slightly more
efficient with mine); Git supports them both.
There’s another way to do this particular task – of checkpointing after every
few edits, but only persisting some of these checkpoints into the repository.
This is to commit each checkpoint to the repository (and go back to ignoring the
index – at least for checkpointing – so this might work with Ryan’s), and
rebase them later. Git lets you squash a number of commits into a single
commit before you push it to a public repository (and edit, reorder, and drop
un-pushed commits too) – that’s the
rebase -i block in the previous
illustration, and you can read about it
This is a perfectly legitimate mode of operation; it’s just one that I don’t
Both of these alternatives harken back to Git as being a tool for designing VCS workflows, as much as being a VCS system itself. The reasons I don’t use them myself bring us to Commit Policies, which I’ll write about tomorrow.
This picture shows just those commands that copy data between the local repository, the remote repository, the index, and your workspace. There’s lots more going on inside these repositories (branches, tags, and heads; or, blobs, trees, commits, and refs). In fact, during a merge, there’s more going on inside the index, too (“mine”, “ours”, and “theirs”). To a first approximation, all that’s orthogonal to how data gets between data stores; we’ll ignore it. ↩
This isn’t quite true. You still need to use “
git add” a new file to tell git about it, and at that point it’s in your index but not in your repository. You still don’t need to think about the repository in order to use it this way ↩