February 12, 2010
Environments are first-class objects in MIT/GNU Scheme. This is neat, because an integral part of the lexical structure of a Scheme program is also a data structure in its own right, able to encode data and behavior. In fact, the environment data structure is precisely what Yegge calls property lists: maps that can be linked up with inheritance. So not only is it a data structure, it’s a highly general one too.
Even without the ability to pass around an environment as a first-class variable, you can still leverage its syntactic ties to stash away local state in an object-oriented manner. The data is only accessible inside procedures that point to the appropriate environment frame, and traditionally the closure returns a lambda (with its double-bubble pointed at the newly born environment frame) that acts as the only interface into the enclosing state. This requires a fair amount of boilerplate, since this lambda has to support every possible operation you might want to perform on the innards of the function.
With a first-class environment object, however, you can futz around with the closure’s bindings arbitrarily. Unfortunately, there’s no way to directly get-current-environment (except for the top-level REPL environment, which doesn’t count), so we resort to the following trick:
(procedure-environment (lambda () '()))
procedure-environment can grab the environment pointer from the double-bubble for some procedure. So, we force the creation of a double-bubble pointing to the environment we care about with the empty lambda.
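To make the payoff concrete, here is a minimal sketch (MIT/GNU Scheme specific; `environment-lookup` and `environment-assign!` are MIT Scheme environment operations, and the counter itself is an invented example) of reaching into a closure’s frame without any accessor lambdas:

```scheme
;; Sketch, MIT/GNU Scheme only: capture the frame where `count` lives,
;; then read and mutate the binding directly -- no accessor lambdas.
(define (make-counter)
  (let ((count 0))
    ;; force a double-bubble pointing at this frame
    (procedure-environment (lambda () '()))))

(define env (make-counter))
(environment-lookup env 'count)        ; => 0
(environment-assign! env 'count 42)
(environment-lookup env 'count)        ; => 42
```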
I most recently used this technique in a 6.945 project, in which I used a lambda to generate a bunch of procedures with various parameters swapped out (encouraging code reuse), something akin to the time-honored trick of including C files multiple times with different macro definitions. Instead of returning these procedures in a hash table, which people would then have to explicitly look up and call, I just returned the environment, and thus any consumer could enter “a different universe” by using the appropriate environment.
Scheme is pretty unique in its celebration of environments as first-class objects. I could try this trick in Python, but the func_closure attribute on functions is read-only, and also Python has some pretty lame scoping rules. A shame, since this technique allows for some lovely syntactic simplifications.
Comment. Oleg Kiselyov writes in to mention that “*MIT Scheme* specifically is unique in its celebration of environments as first class environments,” and notes that even some developers of MIT Scheme have second thoughts about the feature. It makes code difficult to optimize, and is both theoretically and practically dangerous: theoretically dangerous since environments are really an implementation detail, and practically dangerous because it makes it very difficult to reason about code.
From in-person discussions, I’m not surprised that Sussman’s favored dialect of Scheme allows such a dangerous feature; Sussman has always been in favor of letting people have access to dangerous toys and trusting them to use them correctly.
February 10, 2010
I love listening to music, especially new and interesting pieces that I’ve never heard before. Unfortunately, being a somewhat frugal spender, my own personal collection of music grows very slowly; perhaps a bit too slowly for my own tastes. When I need new music, where do I turn?
- MixApp is a collaborative music listening application. At its worst, it can simply be used as an extension to your current music library: you can search the music of any friend who is online and queue it up for yourself. The serendipitous part of MixApp, however, is when you’ve dropped into a room of people you don’t know and music you don’t know, but man, it sounds good, and suddenly you’re being taken on a sonic adventure across artists you’ve never heard of and a genre you’ve just discovered, and wham: you just got MixApp’ed. Update: MixApp is dead (the founders went on to build Meteor), though replacements like turntable.fm are popping up.
- Pandora and last.fm are both pretty reliable methods to get a stream of genre-appropriate singles, one after another. The serendipity level is not as high as MixApp’s, though, so I don’t find myself turning to these much.
- There’s not really much that can beat a good radio host. People like David Garland and John Schaefer have such a diverse and deep palette of musical knowledge, and they’ve had every evening for who knows how many years to hone the craft of sharing this with the listeners of public radio. I was very pleased when WQXR finally managed to get a high-quality internet stream back online.
- I was room-skipping on MixApp one evening, and was caught by the Kleptones’ latest album Uptime/Downtime. I have nothing against mix artists: the whole tradition of music has been founded upon borrowing, stealing, and building upon earlier work, and in many cases, an adept mix artist can improve upon the “popular music” source material. Or sometimes the source material is just really awesome, and should be listened to in its own right: one of the most interesting musical adventures I’ve had recently was taking the samples list for Uptime/Downtime and listening to each source piece in turn.
- Rehearsal with an orchestra, wind ensemble, small ensemble, or really any type of ensemble affords you several months to get intimately familiar with a particular piece of music. I would never have gotten the chance to fully appreciate contemporary works such as Bells for Stokowski or Persichetti’s Masquerade for Band without this really in-depth exploration of a piece.
I should consider myself extremely lucky to be living in an era where new music is constantly at my fingertips. How do you seek out new and interesting music?
February 8, 2010
Editorial. Today we interrupt our regularly scheduled programming to bring you a guest post by Oleg Kiselyov, reinterpreting our previous post about nested loops and continuations with exceptions.
Hello!
I noticed your recent article about nested loops and continuations. I should have commented on it using the provided form, but I was not sure how formatting would come out. The comment includes a lot of code. Please feel free to post the code in whole or in part, or do anything else with it.
The thesis of my comment is that callCC is not necessary for the implementation of break and continue in single and nested loops. We observe that the continuations of each iteration and of the entire loop are invoked either 0 or 1 time (but never more than once). That is the pattern of exceptions. So, the problem posed by your article can be solved with exceptions. Here are several variations of the solution.
First, a few preliminaries: this message is the complete literate Haskell code.
> import Prelude hiding (break, catch)
> import Control.Monad
> import Control.Monad.Trans
Alas, ErrorT in Control.Monad.Error has the stupid Error constraint. So, we have to write our own Exception monad transformer. The code below is standard.
> newtype ExT e m a = ExT{unExT :: m (Either e a)}
>
> instance Monad m => Monad (ExT e m) where
>     return = ExT . return . Right
>     m >>= f = ExT $ unExT m >>= either (return . Left) (unExT . f)
>
> instance MonadTrans (ExT e) where
>     lift m = ExT $ m >>= return . Right
>
> instance MonadIO m => MonadIO (ExT e m) where
>     liftIO m = ExT $ liftIO m >>= return . Right
>
> raise :: Monad m => e -> ExT e m a
> raise = ExT . return . Left
>
> catch :: Monad m => ExT e m a -> (e -> ExT e' m a) -> ExT e' m a
> catch m h = ExT $ unExT m >>= either (unExT . h) (return . Right)
>
> runExT :: Monad m => ExT e m a -> m a
> runExT m = unExT m >>= either (const $ fail "Unhandled exc") return
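One caveat for readers compiling this today: since GHC 7.10, every Monad must also be a Functor and an Applicative. A sketch of the two extra instances needed (these are an editorial addition, not part of the original message):

```haskell
-- Since GHC 7.10, Monad requires Functor and Applicative superclass
-- instances; these definitions make the ExT transformer above compile
-- on modern GHC. (Editorial sketch; assumes the ExT newtype above.)
instance Monad m => Functor (ExT e m) where
    fmap f (ExT m) = ExT (fmap (fmap f) m)

instance Monad m => Applicative (ExT e m) where
    pure = ExT . return . Right
    ExT mf <*> ExT ma =
        ExT $ mf >>= either (return . Left) (\f -> fmap (fmap f) ma)
```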
We are ready to code the first solution, for simple, non-nested loops. The idea is to treat ‘break’ and ‘continue’ as exceptions. After all, both control operators cause computations to be skipped, which is what exceptions do. We define the datatype of our ‘exceptions’:
> data BC = Break | Cont
>
> break, continue :: Monad m => ExT BC m a
> break = raise Break
> continue = raise Cont
Here is the code for the loop: it catches exceptions at some points:
> for_in :: Monad m => [a] -> (a -> ExT BC m ()) -> m ()
> for_in xs f = runExT $ mapM_ iter xs `catch` hbreak
>   where
>     iter x = catch (f x) hcont
>     hcont Cont = return ()  -- handle Cont, re-raise Break
>     hcont e = raise e
>     hbreak Break = return ()
>     hbreak Cont = return () -- Shouldn't happen actually
Here is your test:
> loopLookForIt1 :: IO ()
> loopLookForIt1 =
>     for_in [0..100] $ \x -> do
>         when (x `mod` 3 == 1) $ continue
>         when (x `div` 17 == 2) $ break
>         lift $ print x
Running it:
> tf1 = loopLookForIt1 :: IO ()
prints 23 numbers starting with 0, 2, 3 and ending with 30, 32, 33.
We have to generalize to nested loops. Two solutions are apparent. I would call the first one ‘dynamic’. We index the exceptions by levels, which are natural numbers. Level 0 pertains to the current loop, level 1 is for the parent loop, etc.
> data BCN = BCN BC Int -- Add the level of breaking
Operators break and continue now take the number: how many loop scopes to break. I think Perl has a similar breaking-with-number operator.
> breakN = raise . BCN Break
> continueN = raise . BCN Cont
The new iterator:
> for_inN :: Monad m => [a] -> (a -> ExT BCN m ()) -> ExT BCN m ()
> for_inN xs f = mapM_ iter xs `catch` hbreak
>   where
>     iter x = catch (f x) hcont
>     hcont (BCN Cont 0) = return () -- handle Cont, re-raise Break
>     hcont e = raise e
>     -- If the exception is for a parent, re-raise it, decrementing its level
>     hbreak (BCN Break 0) = return ()
>     hbreak (BCN Cont 0) = return () -- Shouldn't happen actually
>     hbreak (BCN exc n) = raise (BCN exc (n-1))
The single-loop test now looks as follows.
> loopLookForItN :: ExT BCN IO ()
> loopLookForItN =
>     for_inN [0..100] $ \x -> do
>         when (x `mod` 3 == 1) $ continueN 0
>         when (x `div` 17 == 2) $ breakN 0
>         lift $ print x
>
> tfN = runExT loopLookForItN :: IO ()
We can now write the nested loop test. I took the liberty of enhancing the example in your article, so as to exercise all cases:
> loopBreakOuter1 :: ExT BCN IO ()
> loopBreakOuter1 =
>     for_inN [1,2,3] $ \x -> do
>         lift $ print x
>         for_inN [4,5,6] $ \y -> do
>             lift $ print y
>             when (y == 4) $ continueN 0
>             when (x == 1) $ breakN 0
>             when (x == 3) $ breakN 1
>             when (y == 5) $ continueN 1
>             breakN 1
>         lift $ print x
>
> tbN1 = runExT loopBreakOuter1 :: IO ()
The result is the sequence of numbers: 1 4 5 1 2 4 5 3 4 5
There exists another solution for the nested-loop problem, which I call ‘static’. What if we just iterate the single-loop solution? We can nest ExT BC monad transformers to any given depth. To refer to a particular layer in the transformer stack, we use lift. We can use the for_in iterator and the operators break and continue defined earlier. We write the nested test as follows:
> loopBreakOuterS1 :: IO ()
> loopBreakOuterS1 =
>     for_in [1,2,3] $ \x -> do
>         liftIO $ print x
>         for_in [4,5,6] $ \y -> do
>             liftIO $ print y
>             when (y == 4) $ continue
>             when (x == 1) $ break
>             when (x == 3) $ lift $ break
>             when (y == 5) $ lift $ continue
>             lift $ break
>         liftIO $ print x
>
> tbS1 = loopBreakOuterS1 :: IO ()
I guess the lesson here might be that callCC is often not needed (I would argue that callCC is never needed, but that’s the argument for another time). Here is another example of simple exceptions sufficing where call/cc was thought to be required:
http://okmij.org/ftp/Computation/lem.html
Cheers,
Oleg
February 5, 2010
The bread and butter of an imperative programmer is the loop. Coming from a C/assembly perspective, a loop is simply a structured goto which jumps back to a set location if some condition is not met. Frequently, this loop ranges over the elements of some list data structure. In C, you might be doing pointer arithmetic over the elements of an array or following pointers on a linked list until you get NULL; in Python and other higher-level languages you get the for x in xs construct which neatly abstracts this functionality. Inside of a loop, you also have access to the flow control operators break and continue, which are also highly structured gotos. An even more compact form of loops and nested loops are list comprehensions, which don’t permit those flow operators.
Haskell encourages you to use the higher order forms such as map and fold, which even further restrict what may happen to the data. You’ll certainly not see a for loop anywhere in Haskell… However, as a pernicious little exercise, and also a way to get a little more insight into what callCC might be good for, I decided to implement for...in loops with both the continue and break keywords. The end hope is to be able to write code such as:
import Prelude hiding (break)

loopLookForIt :: ContT () IO ()
loopLookForIt =
    for_in [0..100] $ \loop x -> do
        when (x `mod` 3 == 1) $ continue loop
        when (x `div` 17 == 2) $ break loop
        lift $ print x
as well as:
loopBreakOuter :: ContT () IO ()
loopBreakOuter =
    for_in [1,2,3] $ \outer x -> do
        for_in [4,5,6] $ \inner y -> do
            lift $ print y
            break outer
        lift $ print x
the latter solving the classic “nested loops” problem by explicitly labeling each loop. We might run these pieces of code using:
runContT loopBreakOuter return :: IO ()
Since continuations represent, well, “continuations” to the program flow, we should have some notion of a continuation that functions as break, as well as a continuation that functions as continue. We will store the continuations that correspond to breaking and continuing inside a loop “label”, which is the first argument of our hanging lambda:
data (MonadCont m) => Label m = Label {
    continue :: m (),
    break    :: m ()
}
It’s sufficient then to call continue label or break label inside the monad to extract and follow the continuation.
The next bit is to implement the actual for_in construct. If we didn’t have to supply any of the continuations, this is actually just a flipped mapM_:
for_in' :: (Monad m) => [a] -> (a -> m ()) -> m ()
for_in' xs f = mapM_ f xs
Of course, in the sample code, f has the type Label m -> a -> m (), so this won’t do! Consider this first transformation:
for_in'' :: (MonadCont m) => [a] -> (a -> m ()) -> m ()
for_in'' xs f = callCC $ \c -> mapM_ f xs
This function does the same thing as for_in', but we placed it inside the continuation monad and made explicit a variable c. What does the current continuation c correspond to in this context? Well, it’s in the very outer context, which means the “current continuation” is completely out of the loop. That must mean it’s the break continuation. Cool.
Consider this second alternative transformation:
for_in''' :: (MonadCont m) => [a] -> (a -> m ()) -> m ()
for_in''' xs f = mapM_ (\x -> callCC $ \c -> f x) xs
This time, we’ve replaced f with a wrapper lambda that uses callCC before actually calling f, and the current continuation results in the next step of mapM_ being called. This is the continue continuation.
All that remains is to stick them together, and package them into the Label datatype:
for_in :: (MonadCont m) => [a] -> (Label m -> a -> m ()) -> m ()
for_in xs f = callCC $ \breakCont ->
    mapM_ (\x -> callCC $ \continueCont ->
        f (Label (continueCont ()) (breakCont ())) x) xs
Et voilà! Imperative looping constructs in Haskell. (Not that you’d ever want to use them, nudge nudge wink wink.)
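Putting it all together, here is a self-contained sketch (it assumes the mtl package; the loop bounds in main are my own, and the datatype context from above is dropped since modern GHC no longer supports datatype contexts):

```haskell
import Prelude hiding (break)
import Control.Monad (when)
import Control.Monad.Cont (ContT, runContT, callCC, MonadCont)
import Control.Monad.Trans.Class (lift)

-- A loop label packages the two continuations for that loop.
data Label m = Label { continue :: m (), break :: m () }

for_in :: MonadCont m => [a] -> (Label m -> a -> m ()) -> m ()
for_in xs f = callCC $ \breakCont ->
    mapM_ (\x -> callCC $ \continueCont ->
        f (Label (continueCont ()) (breakCont ())) x) xs

main :: IO ()
main = flip runContT return $
    for_in [1 .. 10 :: Int] $ \loop x -> do
        when (even x) $ continue loop  -- skip even numbers
        when (x > 7)  $ break loop     -- bail out of the loop past 7
        lift $ print x                 -- prints 1, 3, 5, 7
```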
Addendum. Thanks to Nelson Elhage and Anders Kaseorg for pointing out a stylistic mistake: storing the continuations as () -> m () is unnecessary because Haskell is a lazy language (in my defense, the imperative paradigm was leaking in!)
Addendum 2. Added type signatures and code for running the initial two examples.
Addendum 3. Sebastian Fischer points out a mistake introduced by addendum 1. That’s what I get for not testing my edits!
February 3, 2010
And so classes begin this Spring Term of 2010. The classes that I am currently signed up to take are:
- 6.005: Software Construction
- 6.02: Intro to EECS II
- 6.045: Automata, Computing and Complexity
- 6.945: Large-scale Symbolic Systems
- 21M.283: Musicals of Stage and Screen
6.945 is the “fun” class of the semester; I expect to sink a lot of time into it and get a lot out of it in return. 6.005 and 6.02 are strictly being taken because my degree requires it (I’ve scheduled 6.02 as a conflict class on top of 6.945, so I’ll probably have to do a little more legwork to make sure I get all the material for that class.) 6.045 is my mathematical class for the semester; no pure course 18 class for me, unfortunately! And on the advice of Robert Jacobs, 21M.283 is my HASS-ish class (I’m quite pleased that I’ve gotten the HASS-D requirement out of the way).
Among the classes I’m not taking this term: 6.006 (goodness knows I do need the algorithms knowledge), 7.01X (punt punt punt) and 6.824 (sounds like lots of fun, but would result in a triple conflict, which I’m not willing to do.)
February 1, 2010
A classic stylistic tip given to C programmers is that inline functions should be preferred over macros, when possible. This advice stems from the fact that a macro and an inline function can achieve the same effect, but the inline function also gets type checking.
As it turns out, you can achieve static type checking with macros, if you’re willing to resort to the same cute trick that this following snippet from the Linux kernel uses:
#define module_param_named(name, value, type, perm) \
        param_check_##type(name, &(value)); \
        module_param_call(name, param_set_##type, param_get_##type, &value, perm); \
        __MODULE_PARM_TYPE(name, #type)
Hmm… I wonder what that param_check_##type call is all about. Digging through a few more macro definitions, we see:
#define __param_check(name, p, type) \
        static inline type *__check_##name(void) { return(p); }
So there you go. A throw-away inline function named __check_##name enforces that p is a pointer to type. A comment is also given, explaining what’s going on:
/* The macros to do compile-time type checking stolen from Jakub
Jelinek, who IIRC came up with this idea for the 2.4 module init code. */
January 29, 2010
Nelson Elhage wrote a post about Git and usability, in which he discussed one of the reasons why Git seems to be so confusing to users who have come in straight from a Subversion-style workflow. When discussing this issue offline, one of the things that came up was the fact that, while Subversion imposes a fairly rigid workflow upon its users, Git is flexible enough to accommodate almost any sort of workflow. This is terrible for a user placed in a shop that uses Git: when they go Google for how to use Git, they’ll get any of a multitude of tutorials, each of which is for a different workflow.
In this multipart series, I’d like to discuss several different types of workflows that I’ve seen or experienced while using Git. This first post will look at a very simple example of a Git workflow, namely that of a single user, which will establish some basic idioms of Git that you might see in the other workflows.
A single-user workflow is, well, kind of simple. At its simplest, it’s not much more than a glorified backup system: you have lots of versions of your code, and you can go back in time. Since I am assuming a general knowledge of version control systems, I don’t think I need to convince you why this is useful. This article also assumes that you’re comfortable enough to make commits in a repository (though we will not assume you know how to use the index; -a is a wondrous flag).
Backups
The very first thing you may notice when you move from a centralized VCS to a decentralized VCS is that your data never leaves your computer unless you explicitly say so. This is great if you are on an airplane and don’t have Internet access; you don’t have to pile up a stack of changes without being able to check in to the server. However, it means that you have to put in a little thought about where you are going to push your changes to. An easy way to do this is to utilize the multitude of free public hosting options. If you have a server that you have SSH access to, private offsite backups are also easy: create a bare Git repository on another server using git init --bare and then set up a remote that you can push to… but I’m getting ahead of myself!
If you created a Git repository and working copy on your own computer with git init, you’ll now have to wrangle with Git remotes. I personally find this quite annoying, and thus always arrange to have my bare Git repository (i.e. the server) set up before I git clone my working copy (i.e. the client), which sets up the configuration that makes pushing easy. My steps are then:
- On my server, make a directory (I like /srv/git/project.git) and in it run git init --bare
- On my client, run git clone ssh://servername/srv/git/project.git
If you must set up the remotes on an existing repository, the following commands will do the trick:
git remote add origin $REPO_URL
git config branch.master.remote origin
git config branch.master.merge refs/heads/master
For the curious, the first line adds a remote named “origin” (which, by convention, is the remote set up from the repository you may have cloned) associated with $REPO_URL. The second and third lines set up default behavior for when you pull changes from the repository, to simulate the configuration that normally gets set up when you do a clone. (Note: this kind of sucks. Git 1.7.0 introduces the --set-upstream flag which fixes these problems.)
From there, all you need to do is make commits with git commit, and then push them to the remote repository with git push.
Topic branches
As a single user, most of your work in your repository will play nicely together; you don’t have to worry about someone else coming in and trampling on your commits. However, every once in a while you may find yourself in the midst of a large refactoring, and you find yourself having to leave things off for the day, or take an interrupt to work on a more pressing, albeit smaller, bugfix. Here, cheap commits and branching make this very simple on Git.
If you think the changes you are currently working on are big but you’ll be able to get back immediately to them, use git stash to temporarily pop your changes into a stash. You can then perform your minor changes, and once done, use git stash pop to restore your old changes. Stash works best as a temporary scratch place for you to store changes, and should be immediately emptied out when possible; you don’t want to be looking at multiple stashed changes and trying to figure out which one contains the ones you care about.
If your changes are a smidge bigger than that, or you think that you’re not going to be able to work on whatever large change you’re making for a while, you can make what’s called a topic branch. First, change your working copy over to a new branch using git checkout -b new-branch-name (pick a descriptive name). Then, make a commit to save your changes. If you pop open gitk, you’ll now notice that you have a commit hanging off of master. You can checkout master again using git checkout master and work on whatever other changes you need.
When you finally decide that your topic branch is done, you need to stick it back into master. There are two ways to do this:
- You can pretend that your topic branch, as a whole, is just a big patch, and as such, this patch should reasonably apply to the most recent version of master. In that case, running git rebase master while on the topic branch (you can check with git status) will take this “patch” and apply it to master. You can then checkout master and run git merge topic-branch to fast-forward master to the topic branch. Since getting rid of old branches is a good thing, I recommend running git branch -d topic-branch afterwards.
- You can take the stance that history is important, and perform a merge. On the master branch, run git merge topic-branch. Just as in the first case, you can then clean up the topic branch with git branch -d topic-branch.
Cleaning up after old topic branches is a good habit to get into, because it means you can use git branch to remind yourself quickly which topic branches might need your attention.
Additionally, if you care about backing up your topic branches, you should run git push origin topic-branch. You can delete topic branches from your remote using git push origin :topic-branch (note the colon).
Clean history
Many people pay a lot of attention to documentation inside a source file in order to puzzle out what a particular piece of code does. However, another excellent source of code documentation is the history of a piece of code: when did a particular snippet get introduced, and what explanation did the author give for it when making that change? git blame will give you a blow-by-blow description of when every line in a file was last changed, and git log will show you the conglomeration of changes made to a particular file.
Unfortunately, the usefulness of this mechanism highly depends on the quality of the messages you’re making in your commits, and if you’re using Git properly and committing often, you might have skimped a little on some of the messages. No worries; it happens to the best of us. You just have to remember to clean things up (i.e. rewrite history) when you’re done.
In this case, git rebase -i is your friend. Specify as an argument how far back you want to rewrite history (HEAD~N where N is a number is probably a good bet), and then rewrite history to your heart’s content. You have three primary tools:
- edit, and when Git gets to that commit, just run git commit --amend: this one is fairly simple: you have a self-contained commit that you didn’t write a good commit message for, and amend will let you change that commit message into something that is useful.
- squash: if you made a bunch of very small commits, and now you look at them and decide that they really logically go together, you can squash them into one.
- edit with git checkout HEAD~: this will give you a working tree with the changes of that commit, but without any of them actually part of a commit. You can then break a “too big” commit into bite-sized pieces using git add -p (which will selectively add hunks of your changes to the index) and then git commit without the -a flag.
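For reference, the todo list that git rebase -i presents looks something like this (the hashes and messages here are made up); you choose a tool by changing the leading word on each line:

```text
pick   f7f3f6d add basic frobnicator
squash 310154e fix frobnicator typo
edit   a5f4a0d rewrite widget; split me into smaller commits
```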
This strategy interacts particularly well with topic branches, which lend themselves to the following workflow:
- Create the topic branch with git checkout -b topic-name,
- Hack a lot on the branch, making tiny commits with incomprehensible summaries,
- Review your changes with git log -u master..HEAD,
- Edit your changes with git rebase -i master,
- Checkout master and run git merge topic-name.
And that’s it for part one! You may have noticed that all of these strategies seem to feed into each other: this unusual integration between all aspects is one of the benefits of Git’s simple internal model. If people would like to see some examples of these techniques in action, I’d be more than happy to blog about them some more. Thanks for reading.
January 27, 2010
As treasurer for the Assassins’ Guild, I often have to beg and plead with GMs to get posters for their respective games, since the UA Finboard has a requirement that, to get funding for events, you need to supply posters.
So I was quite surprised, amazed and impressed by Lauren Gust’s work on posters for Arcadia. The composition is simple, but effective, and definitely adds to the atmosphere of the game. You can get full-size PDFs of the posters in Lauren Gust’s public directory. For a little more background on what the posters are referencing, check out the Arcadia Rising scenario.
January 25, 2010
In my younger days, the stylistic convention of MM/DD/YYYY confused me; why on earth would people opt for such an illogical system that placed months, days and years in non-hierarchical order? Surely something on the order of YYYY-MM-DD would make far more sense: this format is sortable and, all-in-all, quite logical.
Eventually, though, I grudgingly accepted that MM/DD/YYYY trades machine-friendliness for human-friendliness; after all, the year entry rarely changes, and for humans the month and date are the most important pieces of information. Context is usually more than enough to implicitly specify the year.
But as an auto-complete user, I’ve come to appreciate that this sort of ordering can come in handy even when computers are involved. Consider these hierarchically and non-hierarchically named lists of files:
# hierarchically named
test-algorithm.sh
test-bottles.sh
test-capistrano.sh
utils.sh
# non-hierarchically named
algorithm-test.sh
bottles-test.sh
capistrano-test.sh
utils.sh
In the hierarchical case, to auto-complete test-algorithm.sh, I need to type t<tab>a<tab>: a total of four keystrokes. In the non-hierarchical case, however, I only need to type a<tab>. If I’m frequently accessing these files, the extra keystrokes add up.
So here’s my plea: the next time you’re coming up with a naming convention for files you’re sticking in a directory, consider both moving the “category” component to the end and thinking of autocomplete-friendly names. Your fingers will thank you for it.
(Hat-tip to GameTeX for showing me the light.)
January 22, 2010
An unusual workflow for Git, one that Wizard employs extensively, is when a single developer needs to perform merges inside lots of working copies. Normally, a maintainer would pull from the branches he cared about, and offload a large amount of the work to those who were interested in contributing patches. However, Wizard is using Git to provide a service for people who don’t know and aren’t interested in learning Git, so we need to push updates and merge their software for them.
The problem encountered when you are doing a lot of merging is “repeated resolution of the same conflicts.” The solution, at least for the classical case, is git rerere. This feature saves the resolution of conflicts and then automatically applies those resolutions if the conflict comes up again. You can find out this much if you check man git-rerere.
Unfortunately, this merge resolution data is stored per .git directory, specifically in the rr-cache subdirectory, so some modest cleverness is necessary to make this work across many repositories. Fortunately, the simple solution of symlinking all of the rr-cache directories to a common directory both works and is safe from race conditions during the initial merge (it’s not race-safe when writing out resolutions, but I am willing to consider this low-contention enough to be a non-issue).
Why is this solution race safe? At first glance at the code in rerere.c, this would seem not to be the case: if two merges were to happen to generate the same merge conflict (precisely the use case of git rerere), the following code would get executed with the same value of hex:
ret = handle_file(path, sha1, NULL);
if (ret < 1)
        continue;
hex = xstrdup(sha1_to_hex(sha1));
string_list_insert(path, rr)->util = hex;
if (mkdir(git_path("rr-cache/%s", hex), 0755))
        continue;
handle_file(path, NULL, rerere_path(hex, "preimage"));
The last three lines access the (now shared) rr-cache directory, and handle_file will attempt to write out the file rr-cache/$HEX/preimage with the preimage contents; if both instances run handle_file concurrently, this file will get clobbered.
But, as it turns out, we don’t care; barring a SHA-1 collision, both instances will write out the same file. The signature of handle_file is:
static int handle_file(const char *path,
unsigned char *sha1, const char *output)
The first argument is a path to read conflict markers from, and is mandatory. sha1 and output are optional; if output is not NULL, it names a file to which the contents of path, minus any diff3 conflict sections (the ones delimited by ||||||| and =======), get written; if sha1 is not NULL, it receives a 20-byte binary digest of the contents that output would have received. And thus, balance is restored to the world.
Addendum
Anders brings up the interesting question of whether two processes writing the same contents to the same file is actually race safe. Indeed, there is a very similar situation involving two processes writing the same contents to the same file which is a classic example of a race condition:
((echo "a"; sleep 1; echo "b") & (echo "a"; sleep 1; echo "b")) > test
Under normal circumstances, the contents of test is:
a
a
b
b
But every once in a while, one of the processes loses the race, and you get:
a
a
b
due to a non-atomic combination of writing and updating the file offset.
However, the distinguishing characteristic between this example and Git’s case is that, in this example, there is only one file descriptor shared between the writers. In Git’s case, there are two file descriptors, since each process called open independently. A more analogous shell script would be:
((echo "a"; sleep 1; echo "b") > test & (echo "a"; sleep 1; echo "b") > test)
the contents of which are (as far as I can tell) invariably:
a
b
Now, POSIX actually doesn’t say what happens if two writes to the same offset with the same contents occur. However, casual testing seems to indicate that Linux and ext3 are able to give a stronger guarantee that writing the same values won’t cause random corruption (note that, if the contents of the file were different, either combination could be possible, and this is what you see in practice).