ezyang's blog

the arc of software bends towards understanding

Writing generator friendly code

I’ve come a long way from complaining to the html5lib list that the Python version gratuitously used generators, making it hard to port to PHP. Having now drunk the laziness kool-aid in Haskell, I enjoy trying to make my code fit the generator idiom. While Python generators have notable downsides compared to infinite lazy lists (for example, forking them for multiple uses is nontrivial), they’re pretty nice.

Unfortunately, the majority of code I see that expects to see lists isn’t robust enough to accept generators too, and it breaks my heart when I have to say list(generator). I’ll forgive you if you’re expecting O(1) accesses of arbitrary indexes in your internal code, but all too often I see code that only needs sequential access, only to botch it all up by calling len(). Duck typing won’t save you there.

The trick for making code generator friendly is simple: use the iteration interface. Don’t mutate the list. Don’t ask for arbitrary items. Don’t ask for the length. This is also a hint that for i in range(0, len(l)) is absolutely the wrong way to traverse a list; if you need indices, use enumerate.
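For concreteness, here is a small Python sketch of the two styles (the function names are mine, not anyone's API). The first version calls len() and dies on a generator; the second sticks to the iteration interface and happily accepts either:

```python
def broken_average(xs):
    # Botched: len() raises TypeError on a generator, even though
    # this computation only ever needed sequential access.
    return sum(xs) / len(xs)

def running_average(xs):
    # Generator friendly: only the iteration interface is used,
    # and enumerate supplies the count we would have asked len() for.
    total = 0.0
    count = 0
    for count, x in enumerate(xs, start=1):
        total += x
    return total / count

gen = (x * x for x in range(1, 5))   # yields 1, 4, 9, 16
print(running_average(gen))          # works on generators and lists alike
```

Duck typing saves the second version precisely because it never asks the argument to be anything more than iterable.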

Update (September 1, 2012). Hilariously enough, PHP has finally gotten generators.

History as documentation

It’s real easy to argue about the utility, style and implementation of source code comments, those good ole’ pals of code that try to add supplementary information when the pure code isn’t enough.

However, to focus solely on the latest snapshot of any particular source file is to miss a wealth of information that is not inside the file; namely, the history of the file and the genealogy of every line that graces the file. This is not so relevant when you are rapidly prototyping functionality and versions of the file in source control history represent incomplete, half-baked figments of thought, but once a codebase transitions into a more maintenance-oriented workflow, the history takes on a keen and unusual importance. In particular:

  • A log of the evolution of the file over time can illustrate what the original intent of the module was, and then how it got retrofitted or extended or hacked up over time. If you have inherited code from someone else that you need to rearchitect, what better way to get in the heads of the original designers than to study the revisions they went through?
  • Any particular line may have simply been part of the ambient code present during the initial check-in, or it may have been touched by a highly targeted commit addressing some issue. In this case, the output of git blame is highly relevant for identifying why that particular line might be special, or why a subtly different permutation is incorrect. In the case of delocalized changes, associating a line with a commit can give you the fast pass to understanding how one operation is orchestrated with many others for some global effect.

Developers should be highly encouraged to write impeccably descriptive commit messages (with the diff in hand: never write a commit message without the diff in hand) for the sake of those who may pick through the logs in the future. It’s ok to even be a little wordier than you might be in an inline comment, since:

  • Log messages never grow old: they are always relevant to the revision they are attached to!
  • A good commit message facilitates code review, since it poses an informal specification of what the change does, which an external observer can then take and verify against the code. Otherwise, the reviewer would have to determine both the intended and actual semantics of the code, stylistic issues aside.

Finally, a few words about keeping the history clean and easy to use:

  • Logically organized patch sets mean that any given change is immediately relevant to the log message. If you push a big commit which contains lots of semantic changes, the reader has to disambiguate which particular semantic change is associated with which part of the diff. It is certainly worth your time to git add -p to stage hunks individually.
  • Make high quality diffs, which avoid touching unnecessary code. High traffic mailing lists such as LKML which receive many patches have published patch submission guidelines in order to make diffs as readable as possible to a possible reviewer. Even if you don’t need to convince a temperamental upstream to take your changes, later in time you may care about your diffs.
  • Stylistic changes are highly disruptive to git blame output, since they result in a line being marked as changed even though no semantic difference took place. If you must make them, keep them in strictly separate commits. Infrequent is best.
  • Utilize history rewriting to allow for cheap commits which are polished up later for submission.

The Art of Posing a Problem

Last week, I was talking with Alexey Radul to figure out some interesting research problems that I can cut my teeth on. His PhD thesis discusses “propagation networks”, which he argues are a more general substrate for computation than traditional methods. It’s a long work, and it leaves open many questions, both theoretical and practical. I’m now tackling one very small angle with regards to the implementation of the system, but while we were still figuring a problem out, Alexey commented on how much work he was realizing it takes to do a good job of giving someone a problem.

I wholeheartedly agree, though my experiences come from a different domain: SIPB. Some of the key problems with assigning projects to interested prospectives include:

  • Many projects are extremely large and complex, and in many cases it’s simply not possible to assign someone an interesting, high-level project and expect them to make significant headway. They’re more likely to progress with “wax on, wax off” style training, but that’s not interesting.
  • No one ever tells you what they’re interested in! Even if you ask, you’ll probably get the answer, “Eh, I’d be up for anything.” As someone who has used this phrase before, I also emphatically understand that this is not true; people have different interests and will enjoy the same task dramatically differently.
  • It’s easy to exert too much or too little control over the direction of the project. Too much control and you’ve defined the entire technical specification for the person, taken away their creative input, made them feel bad when they’ve not managed to get work done, and are bound to be dismayed when they failed to understand your exacting standards in the first place. Too little control and the person can easily get lost or waste hours fighting incidental issues and not the core of the problem.
  • Being a proper mentor is a time-consuming process, even if you exert minimal control. Once the person comes back with a set of patches, you still have to read them, make sure they’re properly tested, and send back reviews on how the patches need to be revised (and for all but the most trivial of changes, revisions will be inevitable). You might wonder why you didn’t just do the damn task yourself. Reframing the problem as a purely educational exercise can also be disappointing, if not done properly.
  • As people refine the art of bootstrapping, the number of possible projects they can work on explodes, and what makes you think that they’re going to work on your project? People decide what they want to work on, whether it’s because they made it themselves, or it’s in a field they’re interested in, or it’s a tool they use day to day, and if you don’t get the person to buy in, you can easily lose them.

I imagine similar tensions come up for open-source project maintainers, internship programs and Google Summer of Code organizers. And I still have no feeling for what strategies actually work in this space, even though I’ve certainly been on both sides of the fence. I’d love to hear from people who have tried interesting strategies and had them work!

Comonads and Convolutions

> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
> import Control.Comonad
> import Data.List

That scary Control.Comonad import from category-extras is going to be the subject of today’s post. We’re going to look at one possible implementation of comonads for non-empty lists that model causal time-invariant systems, systems whose outputs depend only on inputs that are in the past. We will see that computation in these systems follows a comonadic structure and that one instance of this structure strongly enforces causality and weakly enforces time-invariance.

Our causal lists are simply a newtype of list with the added restriction that they are non-empty; causal is a “smart constructor” that enforces this restriction. We use GeneralizedNewtypeDeriving to get the Functor instance for free.

> newtype Causal a = Causal [a]
>    deriving (Functor, Show)
>
> causal :: a -> [a] -> Causal a
> causal x xs = Causal (x:xs)
>
> unCausal :: Causal a -> [a]
> unCausal (Causal xs) = xs
>
> type Voltage = Float

Background. (If you’re already familiar with signal processing, feel free to skip this section.) One such system models point-to-point communication of voltage samples across an imperfect wire channel. In an ideal world, we would very much like to be able to pretend that any voltage I put into this channel would instantly perfectly transmit this voltage to the other end of the channel. In practice, we’ll see any number of imperfections, including time to rise and fall, a delay, ringing and noise. Noise is a party pooper, so we’re going to ignore it for the purposes of this post.

To a first approximation, we can impose the following important conditions on our system:

  • Causality. Our wire can’t peek into the future and transmit some voltage before it has even gotten it.
  • Time-invariance. Any signal will get the same response whether or not it gets sent now or later.
  • Linearity. A simple and useful approximation for wires, which states this mathematical property: if an input x1 results in an output y1, and an input x2 results in an output y2, then the input Ax1 + Bx2 results in the output Ay1 + By2. This also means we get superposition, which is an important technique that we’ll use soon.

When you see a linear time-invariant system, it means that we get to use a favorite mathematical tool, the convolution.

Discrete convolutions. The overall structure of the discretized computation that a channel performs is [Voltage] -> [Voltage]; that is, we put in a sequence of input voltage samples, and get out another sequence of output voltage samples. On the other hand, the discrete convolution is the function calculated by (with variables suggestively named):

(u ∗ f)[n] = sum from m = -∞ to ∞ of f[m]u[n-m]

It’s not quite obvious why the convolution is the mathematical abstraction we’re looking for here, so we’ll sketch a brief derivation.

One special case of our computation is when the input corresponds to [1, 0, 0, 0 ...], called the unit sample. In fact, due to linearity and time-invariance, the output that our system gives when posed with the unit sample, the unit sample response, precisely specifies the behavior of a system for all inputs: any possible input sequence could be composed of any number of delayed and scaled unit samples, and linearity says we can sum all of the results together to get a result.

A list is actually a function ℕ → a, and we can extend the domain to be over integers if we propose the convention f[n] = 0 for n < 0. Suppose that f[n] represents our input samples varying over time, δ[n] represents a unit sample (δ[0] = 1, δ[n] = 0 for all other n; you’ll commonly see δ[n-t], which is a unit sample at time t), and u[n] represents our unit sample response. Then, we decompose f[n] into a series of unit samples:

f[n] = f[0]δ[n] + f[1]δ[n-1] + ...

and then use linearity to retrieve our response g[n]:

g[n] = f[0]u[n] + f[1]u[n-1] + ...
     = sum from m = 0 to ∞ of f[m]u[n-m]

which looks just like the discrete convolution, just without the -∞ bound. Remember that we defined f[m] = 0 for m < 0, so the two are actually equivalent.
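The derivation can be checked numerically before we commit to the Haskell. Here is a throwaway Python sketch (the names are mine) that evaluates g[n] = Σ f[m]u[n−m] directly, convolving a unit step input with a finite unit sample response:

```python
def lti_response(u, f, n_samples):
    # g[n] = sum_{m=0..n} f[m] * u[n-m], treating u as zero
    # outside the bounds of the list (our f[n] = 0 for n < 0 convention).
    def u_at(k):
        return u[k] if 0 <= k < len(u) else 0
    return [sum(f[m] * u_at(n - m) for m in range(n + 1))
            for n in range(n_samples)]

usr = [1, 2, 5, 2, 1]              # a finite unit sample response
step = [1] * 8                     # the unit step input, f[n] = 1
print(lti_response(usr, step, 8))  # [1, 3, 8, 10, 11, 11, 11, 11]
```

The output settles at 11, the sum of the unit sample response, exactly as superposition predicts for a step input.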

I’d like to linger on that final mathematical definition for a moment, before writing out the equivalent Haskell. We originally stated that the input-response computation had the type [Voltage] -> [Voltage]; however, in our math, we’ve actually defined something of type [Voltage] -> Voltage: a channel-specific function that takes all of the inputs up to time n, i.e. f[0]..f[n], and returns a single output g[n]. I’ve written the following definition in a suggestive curried form to reflect this:

> ltiChannel :: [Voltage] -> Causal Voltage -> Voltage
> ltiChannel u = \(Causal f) -> sum $ zipWith (*) (reverse f) u

The unit sample response may be a finite or infinite list; for reasons of efficiency, a finite list is recommended:

> usr :: [Voltage]
> usr = [1,2,5,2,1]

Comonads. By now, it should be clear where we’ve been working towards: we have ltiChannel usr :: Causal Voltage -> Voltage and we want: Causal Voltage -> Causal Voltage. This is precisely the form of computation that the comonad induces! For your convenience, here is the definition of the Copointed and Comonad type classes:

class Functor f => Copointed f where
    extract :: f a -> a

class Copointed w => Comonad w where
    duplicate :: w a -> w (w a)
    extend :: (w a -> b) -> w a -> w b

The Copointed instance is straightforward, but demonstrates why Causal must contain a non-empty list:

> instance Copointed Causal where
>    extract (Causal xs) = head xs

The Comonad instance can be defined using either duplicate or extend; both have default implementations defined in terms of each other. Deriving these default implementations is left as an exercise to the reader; we’ll define both here:

> instance Comonad Causal where
>    extend f  = Causal . map (f . Causal) . tail . inits . unCausal
>    duplicate = Causal . map      Causal  . tail . inits . unCausal

The intent of the code is somewhat obscured by the unwrapping and wrapping of Causal; for a pure list the instance would look like this:

instance Comonad [] where
    extend f  = map f . tail . inits
    duplicate = tail . inits

The function duplicate really gets to the heart of what this comonad instance does: we take our input list and transform it into a list of histories, each one extending one step further than the last. The tail tags along to drop the first value of inits, which is an empty list. duplicate builds up w (w a), and then the user-supplied function tears it back down to w b (if you think of monads, the lifted user function builds up m (m b), and then join tears it back down to m b.)

One quick test to make sure it works:

> unitStep :: Causal Voltage
> unitStep = Causal (repeat 1)
>
> result :: Causal Voltage
> result = unitStep =>> ltiChannel usr

and sure enough, the result is:

Causal [1.0, 3.0, 8.0, 10.0, 11.0, 11.0, ...]

=>> is a flipped extend, and the comonadic equivalent of the monadic >>=.

Enforced invariants. Structuring our computation in this form (as opposed to writing the darn convolution out explicitly) gives us some interesting enforced invariants in our code. Our channels need not be linear; I could have squared all of the inputs before convolving them with the unit sample response, and that certainly would not be linear. However, any channel we write must be causal and will usually be time-invariant: it must be causal because we never pass any values from the future to the user function, and it is weakly time-invariant because we don’t explicitly tell the user function how far along the input stream it is. In practice, with our implementation, it could divine this information using length; we could get stronger guarantees by employing a combinator that reverses the list and then appends repeat 0:

> tiChannel :: ([Voltage] -> Voltage) -> Causal Voltage -> Voltage
> tiChannel f (Causal xs) = f (reverse xs ++ repeat 0)
>
> ltiChannel' :: [Voltage] -> Causal Voltage -> Voltage
> ltiChannel' u = tiChannel (\xs -> sum $ zipWith (*) u xs)

u in this case must be finite; an infinite unit sample response can be truncated at some point to specify how precise our computation should be.

Open question. The unit sample response has been expressed in our sample code as [Voltage], but it really is Causal Voltage. Unfortunately, the comonad doesn’t seem to specify mechanisms for combining comonadic values the same way the list monad automatically combines the results of computations for each of the values of a list. I’m kind of curious how something like that might work.

Type manipulation: Tricks of the trade

I present here a few pieces of folklore, well known to those practiced in Haskell, that I’ve found to be really useful techniques when analyzing code whose types don’t seem to make any sense. We’ll build practical techniques for reasoning about types, to be able to derive the type of fmap fmap fmap by ourselves. Note that you could just ask GHCI what the type is, but that would spoil the fun! (More seriously, working out the example by hand, just like a good problem set, helps develop intuition for what might be happening.)

Currying and types. Three type signatures that have superficial similarities are a -> b -> c, (a -> b) -> c and a -> (b -> c). If you don’t have a visceral feel for Haskell’s automatic currying, it can be easy to confuse the three. In this particular case, a -> b -> c, which reads as “takes two arguments a and b and returns c”, is equivalent to a -> (b -> c), read “takes a and returns a function that takes b and returns c”. These are distinct from (a -> b) -> c, read “takes a function of a -> b and returns c”. A visual rule you can apply, in these cases, is that parentheses that are flush with the right side of the type signature can be freely added or removed, whereas any other parentheses cannot be.

Higher-order functions. If I pass an Int to id :: a -> a, it’s reasonably obvious that id takes the shape of Int -> Int. If I pass a function a -> a to id :: a -> a, id then takes the shape (a -> a) -> a -> a. Personally, I find the overloading of type parameters kind of confusing, so if I have a cadre of functions that I’m trying to derive the type of, I’ll give them all unique names. Since id id is a tad trivial, we’ll consider something a little nastier: (.) (.). Recall that (.) :: (b -> c) -> (a -> b) -> a -> c. We’re not actually going to use those letters for our manipulation: since our expression has two instances of (.), we’ll name the first a and the second b, and we’ll number them from one to three. Then:

(.) :: (a2 -> a3) -> (a1 -> a2) -> a1 -> a3
(.) :: (b2 -> b3) -> (b1 -> b2) -> b1 -> b3

Slightly less aesthetically pleasing, but we don’t have any more conflicting types. The next step is to identify what equivalences are present in the type variables, and eliminate redundancy. Since we’re passing the second (.) to the first (.) as an argument:

(a2 -> a3) == (b2 -> b3) -> (b1 -> b2) -> b1 -> b3

to which you might say, “those function signatures don’t look anything alike!” which leads us to our next point:

Currying and type substitution. If your function’s type is n-ary, and the type you’re trying to match it against is m-ary, curry so that your function is m-ary too! So, if you have a -> b -> c, and you want to pass it as d -> e, then you actually have a -> (b -> c), and thus d == a and e == (b -> c). A curious case is when it goes in the other direction, in which case d -> e is actually restricted to be d -> (e1 -> e2), where e == (e1 -> e2) and the obvious equalities hold.

To go back to our original example, the second (.) would be grouped as such:

(.) :: (b2 -> b3) -> ((b1 -> b2) -> b1 -> b3)

and we achieve the type equalities:

a2 == (b2 -> b3)
a3 == (b1 -> b2) -> b1 -> b3

Now, let’s substitute in these values for the first (.):

(.) :: ((b2 -> b3) -> (b1 -> b2) -> b1 -> b3) ->
       (a1 -> b2 -> b3) -> a1 -> (b1 -> b2) -> b1 -> b3

and drop the first argument, since it’s been applied:

(.) (.) :: (a1 -> b2 -> b3) -> a1 -> (b1 -> b2) -> b1 -> b3

You might be wondering what that monstrous type signature does…

Interpreting type signatures. A great thing about polymorphic types is that there’s not very much non-pathological behavior that can be specified: because the type is fully polymorphic, we can’t actually stick our hand in the box and use the fact that it’s actually an integer. This property makes programs like Djinn, which automatically derive a function’s contents given a type signature, possible, and with a little practice, you can figure it out too.

Working backwards: we first take a look at b3. There’s no way for our function to magically generate a value of type b3 (excluding undefined or bottom, which counts as pathological), so there’s got to be something else among our arguments that generates it. And sure enough, it’s the first argument, but we need to pass it a1 and b2 first:

(.) (.) w x y z = w undefined undefined

We repeat the process for each of those types in turn: where is a1 specified? Well, we pass it in as the second argument. Where is b2 specified? Well, we have another function y :: b1 -> b2, but we need a b1 which is z. Excellent, we now have a full implementation:

(.) (.) w x y z = w x (y z)
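If you want to convince yourself of this result without GHC, the same shape can be checked with curried lambdas in, say, Python (a sketch; compose here stands in for (.), and w, y are arbitrary functions of my choosing):

```python
# Curried function composition, a stand-in for Haskell's (.).
compose = lambda f: lambda g: lambda x: f(g(x))

# (.) (.), i.e. compose applied to itself once.
cc = compose(compose)

# Per the derived type, cc w x y z should equal w x (y z).
w = lambda a: lambda b: (a, b)   # some curried 2-ary function, a1 -> b2 -> b3
y = lambda b1: b1 + 1            # b1 -> b2
print(cc(w)("tag")(y)(41))       # ('tag', 42), i.e. w "tag" (y 41)
```

Tracing the application by hand gives cc(w)(x)(y)(z) = w(x)(y(z)), matching the implementation derived above.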

Pointfree style as operator composition. So, we now know what (.) (.) does, but we don’t really have a good motivation for why this might be the case. (By motivation, I mean, look at (.) (.) taking function composition at face value, and then realizing, “oh yes, it should do that.”) So what we’d really like to focus on is the semantics of (.), namely function composition, and the fact that we’re currying it. One line of thought might be:

  1. Function composition is defined to be (f . g) x = f (g x).
  2. We’re partially applying the composition, so actually we have (f.) g x, but g is missing. (if the (f.) looks funny to you, compare it to (2+), which is partially applied addition. Note that addition is commutative, so you’re more likely to see (+2), which becomes (x+2) when applied.)
  3. f is actually another composition operator. Since functional composition is single-argument oriented, we want to focus on the curried version of (.), which takes a function and returns a function (1) that takes another function (2) and a value and returns the first function applied to the result of the second function applied to the value.
  4. Read out the arguments. Since (f.) is on the outside, the first argument completes the curry. The next argument is what will actually get passed through the first argument, and the result of that will get passed through f. The return value of that is another function, but (barring the previous discussion) we haven’t figured out what that would be yet. Still, we’ve figured out what the first two arguments might look like.

If we now cheat and look at the type signature, we see our hypotheses are verified:

(.) (.) :: (a1 -> b2 -> b3) -> a1 -> (b1 -> b2) -> b1 -> b3

The first argument g :: a1 -> b2 -> b3 completes the curry, and then the next argument is fed straight into it, so it would have to be a1. The resulting value b2 -> b3 is fed into the next composition operator (notice that it’s not a single variable, since the next composition forces it to be a 1-ary function) and is now waiting for another function to complete the curry, which is the next argument b1 -> b2 (i.e. b1 -> b2 -> b3). Then it’s a simple matter of supplying the remaining arguments.

I find thinking of functions as partially applied and waiting to be “completed” leads to a deeper intuitive understanding of what a complex chain of higher order functions might do.

Putting it together. It is now time to work out the types for fmap fmap fmap. We first write out the types for each fmap:

fmap :: (Functor f) => (a1 -> a2) -> f a1 -> f a2
fmap :: (Functor g) => (b1 -> b2) -> g b1 -> g b2
fmap :: (Functor h) => (c1 -> c2) -> h c1 -> h c2

Perform application and we see:

(a1 -> a2) == (b1 -> b2) -> g b1 -> g b2
f a1 == (c1 -> c2) -> h c1 -> h c2

Luckily enough, we have enough arguments to fill up the first fmap, so that’s one layer less of complexity. We can also further break these down:

-- from the first argument
a1 == b1 -> b2
a2 == g b1 -> g b2
-- from the second argument
a1 == h c1 -> h c2
f == (->) (c1 -> c2)

The last equality stems from the fact that there’s only one reasonable functor instance for (c1 -> c2) -> h c1 -> h c2; the functor for functions i.e. the reader monad, taking (c1 -> c2) as its “read-in”.

We can do a few more simplifications:

h c1 -> h c2 == b1 -> b2
b1 == h c1
b2 == h c2

Substitute everything in, and now we see:

fmap fmap fmap :: (Functor g, Functor h) =>
   (c1 -> c2) -> g (h c1) -> g (h c2)

Interpret the types and we realize that fmap fmap fmap does a “double” lift of a function c1 -> c2 to two functors. So we can run fmap fmap fmap (+2) [Just 3] and get back [Just 5] (utilizing the functor instance for the outer list and the inner maybe).

We also notice that the f functor dropped out; this is because it was forced to a specific form, so really fmap fmap fmap == fmap . fmap. This makes it even more obvious that we’re doing a double lift: the function is fmap‘ed once, and then the result is fmap‘ed again.

We can even use this result to figure out what (.) (.) (.) (or (.) . (.)) might do; for functions, fmap = (.), so a normal function is lifted into one reader context by the first fmap, and another reader context by the second fmap. So we’d expect (.) . (.) :: (a -> b) -> (r2 -> r1 -> a) -> (r2 -> r1 -> b) (recall that f a if f = (->) r becomes r -> a) and indeed, that is the case. Compose composed with compose is merely a compose that can take a 2-ary function as its second argument and “do the right thing”!
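The “double lift” reading carries over to any language with a map. A quick Python analogue (my own construction, with a list standing in for the outer functor and None standing in for Nothing in a Maybe-like inner functor):

```python
def fmap_list(f, xs):
    # The list functor's fmap.
    return [f(x) for x in xs]

def fmap_maybe(f, x):
    # A Maybe-like fmap: None plays the role of Nothing.
    return None if x is None else f(x)

# The analogue of fmap . fmap: lift f through both functor layers at once.
double_lift = lambda f: lambda xs: fmap_list(lambda x: fmap_maybe(f, x), xs)

print(double_lift(lambda n: n + 2)([3, None]))  # [5, None]
```

Just as with fmap fmap fmap (+2) [Just 3], the function reaches through both layers, and absent values in the inner layer pass through untouched.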

Why we stay up late

I was having a discussion a few nights ago about attention, and someone mentioned the fact that contiguous blocks of time are precious. It’s obvious once it’s been pointed out to you, and with it in mind I’ve noticed my tendency to bucket useful activities into various categories: checking email, reading news feeds and simple tasks fall into the “less than an hour” time bucket, while really actually creating software, fixing hard bugs or reading code fall into the “more than an hour, preferably several hours” time bucket. I’ve recognized that attempting to tackle the “more than an hour” bucket in the snatches of time between classes and extracurriculars is simply wasteful.

But unlike the programmer working a day job that Paul Graham describes in his essay, I am a college student. I mostly don’t do my work during the day; that’s the time to imbibe information from lectures and recitations. In particular, I’m an MIT student, which means that there’s no 5:00 “oh, time to go home and relax” period; it’s work all the time (and you grab precious moments of relief with flights of procrastination when you can.) My mornings and evenings are saturated by meetings: they require me to physically relocate myself and pay attention to something else on someone else’s schedule. And throw in orchestra rehearsal from 7:30-10PM or small ensemble rehearsal from 5-7PM or SIPB meetings from 7:30-8:30PM and you’ve got a recipe for fragmentation.

When can you get that uninterrupted time? Only late at night, because when you are working on that problem set at two in the morning, there is one good thing: no one has scheduled a meeting at 2:30 AM. And while it’s definitely a shame that these benefits may be outweighed by the intoxication of sleep deprivation, you’re not going to find this sort of contiguous block of time anywhere else. And that’s why we stay up late.

Hunting for constraints

> import Data.List
> import Control.Monad

The following question appeared as part of a numbers-based puzzle in the 2010 MIT Mystery Hunt:

His final level on Wizard of Wor was equal to the smallest number that can be written as the sum of 4 non-zero squares in exactly 9 ways

I’d like to explore constraint search in Haskell to solve this problem. The hope is to find a (search) program that directly reflects the problem as posed, and also gives us an answer!

Because we are looking for the smallest number, it makes sense to start testing from a small number and start counting up. We’ll assume that the answer to this question won’t overflow Int.

Now, we need to test if it can be written as the sum of 4 non-zero squares in exactly 9 ways. This problem reduces to “how many ways can n be written as the sum of squares”, which is another search problem.

We’ll assume that 4+1+1+1 and 1+4+1+1 don’t count as distinct for the purposes of our nine ways. This leads to the first piece of cleverness: if we impose a non-decreasing ordering on our roots, we once again get uniqueness.

We also need to bound our search space; while fair search can help to some degree with infinite failure, our implementation will be much simpler if we can do some early termination. A very simple condition to terminate on is if the sum of the squares exceeds the number we’re looking for.

Consider the case where we are matching for x, and we have candidate roots a, b and c. Then, the maximum the remaining square can be is x - a^2 - b^2 - c^2, and the maximum value for d is the floor of the square root of that quantity. Square roots are cheap, and we’re using machine-size integers, so things are good.

> floorSqrt :: Int -> Int
> floorSqrt = floor . sqrt . fromIntegral
>
> sumSquares :: [Int] -> Int
> sumSquares as = sum (map (^2) as)
>
> rootMax :: Int -> [Int] -> Int
> rootMax x as = floorSqrt (x - sumSquares as)

From there, we just write out the search for distinct sums of squares of a number:

> searchSumFourSquares :: Int -> [(Int, Int, Int, Int)]
> searchSumFourSquares x = do
>       a <- [1..(rootMax x [])]
>       b <- [a..(rootMax x [a])]
>       c <- [b..(rootMax x [a,b])]
>       d <- [c..(rootMax x [a,b,c])]
>       guard $ sumSquares [a,b,c,d] == x
>       return (a,b,c,d)

And from there, the solution falls out:

> search :: Maybe Int
> search = findIndex (==9) (map (length . searchSumFourSquares) [0..])

We cleverly use [0..] so that the index is the same as the number itself. Alternative methods might use tuples.
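As a sanity check on the Haskell, the same search transcribes almost mechanically into Python (the function names here are mine), with the same ordering trick and square-root bounds:

```python
from math import isqrt

def four_square_sums(x):
    # All (a, b, c, d) with 1 <= a <= b <= c <= d and
    # a^2 + b^2 + c^2 + d^2 == x, mirroring searchSumFourSquares.
    out = []
    for a in range(1, isqrt(x) + 1):
        for b in range(a, isqrt(x - a*a) + 1):
            for c in range(b, isqrt(x - a*a - b*b) + 1):
                rest = x - a*a - b*b - c*c
                d = isqrt(rest)
                if d >= c and d > 0 and d*d == rest:
                    out.append((a, b, c, d))
    return out

def smallest_with_ways(k):
    # Count up from 0, mirroring findIndex over [0..].
    n = 0
    while len(four_square_sums(n)) != k:
        n += 1
    return n
```

For example, four_square_sums(4) finds only (1, 1, 1, 1), so smallest_with_ways(1) is 4; the puzzle's answer is smallest_with_ways(9).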

Scheming environments

Environments are first-class objects in MIT/GNU Scheme. This is neat, because an integral part of the lexical structure of a Scheme is also a data-structure in its own right, able to encode data and behavior. In fact, the environment data structure is precisely what Yegge calls property lists, maps that can be linked up with inheritance. So not only is it a data structure, it’s a highly general one too.

Even without the ability to pass around an environment as a first-class variable, you can still leverage its syntactic ties to stash away local state in an object-oriented manner. The data is only accessible inside procedures that point to the appropriate environment frame, and traditionally the closure returns a lambda (with its double-bubble pointed to the newly born environment frame) that acts as the only interface into the enclosing state. This requires a fair amount of boilerplate, since this lambda has to support every possible operation you might want to do to the innards of the function.
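The pattern (and its boilerplate) looks much the same in Python, which also closes over enclosing frames; here is a hypothetical counter, not from the post, where the returned dispatch lambda is the sole interface to the captured state:

```python
def make_counter():
    # State lives in the enclosing frame; the returned closure is the
    # only way in, so every supported operation needs explicit dispatch.
    count = [0]  # a mutable cell, since the binding itself is closed over

    def dispatch(op):
        if op == "increment":
            count[0] += 1
            return count[0]
        elif op == "value":
            return count[0]
        raise ValueError("unknown operation: " + op)

    return dispatch

counter = make_counter()
counter("increment")
counter("increment")
print(counter("value"))  # 2
```

Every new operation on the hidden state means another branch in dispatch, which is exactly the boilerplate that first-class environments let you skip.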

With a first-class environment object, however, you can futz around with the closure’s bindings arbitrarily. Unfortunately, there’s no way to directly get-current-environment (except for the top-level REPL environment, which doesn’t count), so we resort to the following trick:

(procedure-environment (lambda () '()))

procedure-environment can grab the environment pointer from the double-bubble for some procedure. So, we force the creation of a double-bubble pointing to the environment we care about with the empty lambda.

I most recently used this technique in a 6.945 project, in which I used a lambda to generate a bunch of procedures with various parameters swapped out (encouraging code-reuse), something akin to the time-honored trick of including C files multiple times with different macro definitions. Instead of returning these procedures as a hash-table which then people would have to explicitly call, I just returned the environment, and thus any consumer could enter “a different universe” by using an appropriate environment.

Scheme is pretty unique in its celebration of environments as first-class objects. I could try this trick in Python, but the func_closure attribute on functions is read-only, and also Python has some pretty lame scoping rules. A shame, since this technique allows for some lovely syntactic simplifications.

Comment. Oleg Kiselyov writes in to mention that “*MIT Scheme* specifically is unique in its celebration of environments as first class environments,” and notes that even some developers of MIT Scheme have second thoughts about the feature. It makes code difficult to optimize, and is both theoretically and practically dangerous: theoretically dangerous because environments are really an implementation detail, and practically dangerous because they make it very difficult to reason about code.

From in-person discussions, I’m not surprised that Sussman’s favored dialect of Scheme allows such a dangerous feature; Sussman has always been in favor of letting people have access to dangerous toys and trusting them to use them correctly.

Sources of music

I love listening to music, especially new and interesting pieces that I’ve never heard before. Unfortunately, since I’m a somewhat frugal spender, my personal collection of music grows very slowly; perhaps a bit too slowly for my own tastes. When I need new music, where do I turn?

  • MixApp is a collaborative music listening application. At its worst, it can be used simply as an extension of your current music library: you can search the libraries of any friends who are online and queue up their music for yourself. But the serendipitous part of MixApp is when you’ve dropped into a room of people you don’t know and music you don’t know, but man, it sounds good, and suddenly you’re being taken on a sonic adventure across artists you’ve never heard of and a genre you’ve just discovered and wham: you just got MixApp’ed. Update: MixApp is dead (the founders went on to build Meteor), though replacements like turntable.fm have been popping up.
  • Pandora and last.fm are both pretty reliable methods to get a stream of genre appropriate singles, one after another. The serendipity level is not as nice as MixApp, though, so I don’t find myself turning to these much.
  • There’s not really much that can beat a good radio host. People like David Garland and John Schaefer have such a diverse and deep palette of musical knowledge, and they’ve had every evening for who knows how many years to hone the craft of sharing this with the listeners of public radio. I was very pleased when WQXR finally managed to get a high-quality internet stream back online.
  • I was room-skipping on MixApp one evening, and was caught by the Kleptones’ latest album Uptime/Downtime. I have nothing against mix artists: the whole tradition of music has been founded upon borrowing, stealing, and building upon earlier work, and in many cases an adept mix artist can improve upon the “popular music” material the mix draws from. Or sometimes the source material is just really awesome and should be listened to in its own right: one of the most interesting musical adventures I’ve had recently was taking the samples list for Uptime/Downtime and listening to each source piece in turn.
  • Rehearsal with an orchestra, wind ensemble, small ensemble, or really any type of ensemble affords several months to get intimately familiar with a particular piece of music. I would never have gotten the chance to fully appreciate contemporary works such as Bells for Stokowski or Persichetti’s Masquerade for Band without this really in-depth exploration of a piece.

I should consider myself extremely lucky to be living in an era where new music is constantly at my fingertips. How do you seek out new and interesting music?

Nested loops and exceptions (Oleg Kiselyov)

Editorial. Today we interrupt our regularly scheduled programming to bring you a guest post by Oleg Kiselyov, reinterpreting our previous post about nested loops and continuations with exceptions.


Hello!

I noticed your recent article about nested loops and continuations. I should have commented on it using the provided form, but I was not sure how formatting would come out. The comment includes a lot of code. Please feel free to post the code in whole or in part, or do anything else with it.

The thesis of my comment is that callCC is not necessary for the implementation of break and continue in single and nested loops. We observe that the continuations of each iteration and of the entire loop are invoked either 0 or 1 time (but never more than once). That is the pattern of exceptions. So, the problem posed by your article can be solved with exceptions. Here are several variations of the solution.

First, a few preliminaries: this message is the complete literate Haskell code.

> import Prelude hiding (break, catch)
> import Control.Monad
> import Control.Monad.Trans

Alas, ErrorT in Control.Monad.Error has the stupid Error constraint. So, we have to write our own Exception monad transformer. The code below is standard.

> newtype ExT e m a = ExT{unExT :: m (Either e a)}
>
> instance Monad m => Monad (ExT e m) where
>     return  = ExT . return . Right
>     m >>= f = ExT $ unExT m >>= either (return . Left) (unExT . f)
>
> instance MonadTrans (ExT e) where
>     lift m = ExT $ m >>= return . Right
>
> instance MonadIO m => MonadIO (ExT e m) where
>     liftIO m = ExT $ liftIO m >>= return . Right
>
> raise :: Monad m => e -> ExT e m a
> raise = ExT . return . Left
>
> catch :: Monad m => ExT e m a -> (e -> ExT e' m a) -> ExT e' m a
> catch m h = ExT $ unExT m >>= either (unExT . h) (return . Right)
>
> runExT :: Monad m => ExT e m a -> m a
> runExT m = unExT m >>= either (const $ fail "Unhandled exc") return

We are ready to code the first solution, for simple, non-nested loops. The idea is to treat ‘break’ and ‘continue’ as exceptions. After all, both control operators cause computations to be skipped—which is what exceptions do. We define the datatype of our ‘exceptions’:

> data BC = Break | Cont
>
> break, continue :: Monad m => ExT BC m a
> break    = raise Break
> continue = raise Cont

Here is the code for the loop: it catches exceptions at some points:

> for_in :: Monad m => [a] -> (a -> ExT BC m ()) -> m ()
> for_in xs f = runExT $ mapM_ iter xs `catch` hbreak
>  where 
>  iter x = catch (f x) hcont
>  hcont  Cont  = return () -- handle Cont, re-raise Break
>  hcont  e     = raise e
>  hbreak Break = return ()
>  hbreak Cont  = return ()     -- Shouldn't happen actually

Here is your test:

> loopLookForIt1 :: IO ()
> loopLookForIt1 =
>     for_in [0..100] $ \x -> do
>         when (x `mod` 3 == 1) $ continue
>         when (x `div` 17 == 2) $ break
>         lift $ print x

Running it:

> tf1 = loopLookForIt1 :: IO ()

prints 23 numbers starting with 0, 2, 3 and ending with 30, 32, 33.

We have to generalize to nested loops. Two solutions are apparent. I would call the first one ‘dynamic’. We index the exceptions by levels, which are natural numbers. Level 0 pertains to the current loop, level 1 is for the parent loop, etc.

> data BCN = BCN BC Int         -- Add the level of breaking

Operators break and continue now take a number: how many loop scopes to break out of. I think Perl has a similar breaking-with-number operator.

> breakN    = raise . BCN Break
> continueN = raise . BCN Cont

The new iterator:

> for_inN :: Monad m => [a] -> (a -> ExT BCN m ()) -> ExT BCN m ()
> for_inN xs f = mapM_ iter xs `catch` hbreak
>  where 
>  iter x = catch (f x) hcont
>  hcont  (BCN Cont 0)  = return () -- handle Cont, re-raise Break
>  hcont  e             = raise e
>  -- If the exception is for a parent, re-raise it, decrementing its level
>  hbreak (BCN Break 0) = return ()
>  hbreak (BCN Cont 0)  = return ()     -- Shouldn't happen actually
>  hbreak (BCN exc n)   = raise (BCN exc (n-1))

The single-loop test now looks as follows.

> loopLookForItN :: ExT BCN IO ()
> loopLookForItN =
>     for_inN [0..100] $ \x -> do
>         when (x `mod` 3 == 1) $ continueN 0
>         when (x `div` 17 == 2) $ breakN 0
>         lift $ print x
>
> tfN = runExT loopLookForItN :: IO ()

We can now write the nested loop test. I took the liberty of enhancing the example in your article so as to exercise all cases:

> loopBreakOuter1 :: ExT BCN IO ()
> loopBreakOuter1 =
>     for_inN [1,2,3] $ \x -> do
>         lift $ print x
>         for_inN [4,5,6] $ \y -> do
>             lift $ print y
>             when (y == 4) $ continueN 0
>             when (x == 1) $ breakN 0
>             when (x == 3) $ breakN 1
>             when (y == 5) $ continueN 1
>             breakN 1
>         lift $ print x
>
> tbN1 = runExT loopBreakOuter1 :: IO ()

The result is the sequence of numbers: 1 4 5 1 2 4 5 3 4 5

There exists another solution to the nested-loop problem, which I call ‘static’. What if we just iterate the single-loop solution? We can nest ExT BC monad transformers to any given depth. To refer to a particular layer in the transformer stack, we use lift. We can reuse the for_in iterator and the operators break and continue defined earlier. We write the nested test as follows:

> loopBreakOuterS1 :: IO ()
> loopBreakOuterS1 =
>     for_in [1,2,3] $ \x -> do
>         liftIO $ print x
>         for_in [4,5,6] $ \y -> do
>             liftIO $ print y
>             when (y == 4) $ continue
>             when (x == 1) $ break
>             when (x == 3) $ lift $ break
>             when (y == 5) $ lift $ continue
>             lift $ break
>         liftIO $ print x
> tbS1 = loopBreakOuterS1 :: IO ()
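Oleg’s ‘dynamic’ solution transcribes almost mechanically into languages with ordinary exceptions. A sketch in Python (the names are mine, not Oleg’s), indexing the signal by how many loop scopes it still has to escape, just like BCN:

```python
class LoopSignal(Exception):
    """Control signal carrying the number of loop scopes still to escape."""
    def __init__(self, level=0):
        self.level = level

class Break(LoopSignal): pass
class Continue(LoopSignal): pass

def for_in(xs, body):
    for x in xs:
        try:
            body(x)
        except Continue as e:
            if e.level > 0:
                raise Continue(e.level - 1)  # meant for an enclosing loop
            # level 0: fall through to the next iteration
        except Break as e:
            if e.level > 0:
                raise Break(e.level - 1)     # meant for an enclosing loop
            return
```

Running the single-loop test from above with this for_in (raising Continue() when x mod 3 == 1 and Break() when x div 17 == 2 over 0..100) yields the same 23 numbers, 0, 2, 3, …, 33, as loopLookForIt1.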

I guess the lesson here might be that callCC is often not needed (I would argue that callCC is never needed, but that’s the argument for another time). Here is another example of simple exceptions sufficing where call/cc was thought to be required:

http://okmij.org/ftp/Computation/lem.html

Cheers,
Oleg