ezyang's blog

the arc of software bends towards understanding

Workflows in Git: Single-user style

Nelson Elhage wrote a post about Git and usability, in which he discussed one of the reasons why Git seems to be so confusing to users who have come in straight from a Subversion-style workflow. When discussing this issue offline, one of the things that came up was the fact that, while Subversion imposes a fairly rigid workflow upon its users, Git is flexible enough to accommodate almost any sort of workflow. This is terrible for a user placed in a shop that uses Git: when they go Google for how to use Git, they'll find a multitude of tutorials, each of which is for a different workflow.

In this multipart series, I’d like to discuss several different types of workflows that I’ve seen or experienced while using Git. This first post will look at a very simple example of a Git workflow, namely that of a single user, which will establish some basic idioms of Git that you might see in the other workflows.

A single-user workflow is, well, kind of simple. At its simplest, it's not much more than a glorified backup system: you have lots of versions of your code, and you can go back in time. Since I am assuming a general knowledge of version control systems, I don't think I need to convince you why this is useful. This article also assumes that you're comfortable enough to make commits in a repository (though we will not assume you know how to use the index; -a is a wondrous flag).

Backups

The very first thing you may notice when you move from a centralized VCS to a decentralized VCS is that your data never leaves your computer unless you explicitly say so. This is great if you are on an airplane without Internet access: you can keep committing instead of piling up a stack of changes you can't check in to the server. However, it means that you have to put in a little thought about where you are going to push your changes to. An easy way to do this is to use one of the multitude of free public hosting services. If you have a server that you have SSH access to, private offsite backups are also easy: create a bare Git repository on the server using git init --bare and then set up a remote that you can push to… but I'm getting ahead of myself!

If you created a Git repository and working copy on your own computer with git init, you'll now have to wrangle with Git remotes. I personally find this quite annoying, and thus always arrange to have my bare Git repository (i.e. the server) set up before I git clone my working copy (i.e. the client), which sets up the configuration that makes pushing easy. My steps are then:

  1. On my server, make a directory (I like /srv/git/project.git) and in it run git init --bare
  2. On my client, run git clone ssh://servername/srv/git/project.git
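
Concretely, with the same placeholder server name, those two steps look something like this:

# on the server
mkdir -p /srv/git/project.git
cd /srv/git/project.git
git init --bare

# on the client
git clone ssh://servername/srv/git/project.git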

If you must set up the remotes on an existing repository, the following commands will do the trick:

git remote add origin $REPO_URL
git config branch.master.remote origin
git config branch.master.merge refs/heads/master

For the curious, the first line adds a remote named “origin” (which, by convention, is the remote set up from the repository you may have cloned) associated with $REPO_URL. The second and third lines set up the default behavior for when you pull changes from the repository, to simulate the configuration that normally gets created when you do a clone. (Note: this kind of sucks. Git 1.7.0 introduces the --set-upstream flag, which fixes these problems.)
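
On Git 1.7.0 or later, the same setup can be collapsed into a single push (a sketch, using the same $REPO_URL placeholder):

git remote add origin $REPO_URL
git push --set-upstream origin master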

From there, all you need to do is make commits with git commit, and then push them to the remote repository with git push.

Topic branches

As a single user, most of the work in your repository will play nicely together; you don't have to worry about someone else coming in and trampling on your commits. However, every once in a while you may find yourself in the midst of a large refactoring and have to leave things off for the day, or take an interrupt to work on a more pressing, albeit smaller, bugfix. Here, Git's cheap commits and branching make handling this very simple.

If you think the changes you are currently working on are big but you'll be able to get back to them immediately, use git stash to temporarily set your changes aside in a stash. You can then perform your minor changes, and once done, use git stash pop to restore your old changes. Stash works best as a temporary scratch place for you to store changes, and should be emptied out again as soon as possible; you don't want to be looking at multiple stashed changes and trying to figure out which one contains the ones you care about.
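
A typical stash round trip looks something like this (the bugfix commit is just for illustration):

git stash                      # set the half-finished work aside
# ... make and test the small fix ...
git commit -a -m "Fix pressing bug"
git stash pop                  # bring the half-finished work back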

If your changes are a smidge bigger than that, or you think that you're not going to be able to work on whatever large change you're making for a while, you can make what's called a topic branch. First, change your working copy over to a new branch using git checkout -b new-branch-name (pick a descriptive name). Then, make a commit to save your changes. If you pop open gitk, you'll now notice that you have a commit hanging off of master. You can check out master again using git checkout master and work on whatever other changes you need.
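
A minimal sketch, with big-refactor as a hypothetical branch name:

git checkout -b big-refactor          # create and switch to the topic branch
git commit -a -m "WIP: refactoring"   # save whatever state you're at
git checkout master                   # back to master for other work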

When you finally decide that your topic branch is done, you need to stick it back into master. There are two ways to do this, both sketched in commands after the list:

  1. You can pretend that your topic branch, as a whole, is just a big patch, and as such, this patch should reasonably apply to the most recent version of master. In that case, running git rebase master while on the topic branch (you can check which branch you're on with git status) will take this “patch” and reapply it on top of master. You can then check out master and run git merge topic-branch to fast-forward master to the topic branch. Since getting rid of old branches is a good thing, I recommend running git branch -d topic-branch afterwards.
  2. You can take a stance that history is important, and perform a merge. On the master branch, run git merge topic-branch. Just as in the first case, you can then cleanup the topic branch with git branch -d topic-branch.
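
In commands, the two options might look like this (topic-branch is a stand-in name):

# Option 1: rebase, then fast-forward master
git checkout topic-branch
git rebase master
git checkout master
git merge topic-branch       # fast-forwards, no merge commit
git branch -d topic-branch

# Option 2: keep the history and record a merge commit
git checkout master
git merge topic-branch
git branch -d topic-branch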

Cleaning up after old topic branches is a good habit to get into, because it means you can use git branch to remind yourself quickly which topic branches might need your attention.

Additionally, if you care about backing up your topic branches, you should run git push origin topic-branch. You can delete topic branches from your remote using git push origin :topic-branch (note the colon).

Clean history

Many people pay a lot of attention to documentation inside a source file in order to puzzle out what a particular piece of code does. However, another excellent source of documentation is the history of a piece of code: when did a particular snippet get introduced, and what explanation did the author give for it when making that change? git blame will tell you, for every line in a file, which commit last changed it, and git log will show you the full series of changes made to a particular file.
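
For example, assuming a file of interest at path/to/file.c (a placeholder):

git blame path/to/file.c       # which commit last touched each line, and by whom
git log -p -- path/to/file.c   # every commit that touched the file, with its diff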

Unfortunately, the usefulness of this mechanism highly depends on the quality of the messages you’re making in your commits, and if you’re using Git properly and committing often, you might have skimped a little on some of the messages. No worries; it happens to the best of us. You just have to remember to clean things up (i.e. rewrite history) when you’re done.

In this case, git rebase -i is your friend. Specify as an argument how far back you want to rewrite history (HEAD~N, where N is a number, is probably a good bet), and then rewrite history to your heart's content. You have three primary tools (a sample todo list is sketched after this list):

  • edit, then run git commit --amend when Git stops at that commit: this is the simple case where you have a self-contained commit but didn't really write a good commit message for it; amend lets you change that commit message into something useful.
  • squash: If you made a bunch of very small commits, and now you look at them and decide, no, they really logically go together, you can squash them together.
  • edit with git reset HEAD~: this gives you a working tree with the changes of that commit, but without any of them actually part of a commit. You can then break a “too big” commit into bite-sized pieces using git add -p (which selectively adds hunks of your changes to the index) and then git commit without the -a flag.
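
For instance, git rebase -i HEAD~3 might present a todo list like the following (the hashes and messages are made up); edited as shown, it would squash the second commit into the first and stop at the third so it can be split up:

pick   a1b2c3d Add frobnicator
squash e4f5a6b Fix typo in frobnicator
edit   9c8d7e6 Assorted parser and lexer changes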

This strategy interacts particularly well with topic branches, which lend themselves to the following workflow:

  1. Create the topic branch with git checkout -b topic-name,
  2. Hack a lot on the branch, making tiny commits with incomprehensible summaries,
  3. Review your changes with git log -u master..HEAD,
  4. Edit your changes with git rebase -i master,
  5. Check out master and git merge topic-name (the whole sequence is sketched in commands below).
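
Putting it all together, with topic-name standing in for a real branch name:

git checkout -b topic-name
# ... hack away, committing early and often ...
git log -u master..HEAD      # review what the branch changes
git rebase -i master         # squash and reword into a clean series
git checkout master
git merge topic-name         # fast-forwards master to the finished branch
git branch -d topic-name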

And that’s it for part one! You may have noticed that all of these strategies seem to feed into each other: this unusual integration between all aspects is one of the benefits of Git’s simple internal model. If people would like to see some examples of these techniques in action, I’d be more than happy to blog about them some more. Thanks for reading.

Arcadia Rising posters

As treasurer for the Assassins' Guild, I often have to beg and plead with GMs to get posters for their respective games, since the UA Finboard has a requirement that, to get funding for events, you need to supply posters.

So I was quite surprised, amazed and impressed by Lauren Gust's work on the posters for Arcadia. The composition is simple but effective, and definitely adds to the atmosphere of the game. Click for larger versions, and you can get full-size PDFs of the posters in Lauren Gust's public directory. For a little more background on what the posters are referencing, check out the Arcadia Rising scenario.

[nine poster images]

To the right! Autocompletable names

In my younger days, the stylistic convention of MM/DD/YYYY confused me; why on earth would people opt for such an illogical system that placed months, days and years in non-hierarchical order? Surely something on the order of YYYY-MM-DD would make far more sense: this format is sortable and, all in all, quite logical.

Eventually, though, I grudgingly accepted that MM/DD/YYYY trades machine-friendliness for human-friendliness; after all, the year entry rarely changes, and for humans the month and date are the most important pieces of information. Context is usually more than enough to implicitly specify what the year is.

But as an autocomplete user, I've come to appreciate that this sort of ordering can come in handy even when computers are involved. Consider these hierarchically and non-hierarchically named lists of files:

# hierarchically named
test-algorithm.sh
test-bottles.sh
test-capistrano.sh
utils.sh

# non-hierarchically named
algorithm-test.sh
bottles-test.sh
capistrano-test.sh
utils.sh

In the hierarchical case, to auto-complete test-algorithm.sh, I need to type t<tab>a<tab>: a total of four keystrokes. In the non-hierarchical case, however, I only need to type a<tab>, two keystrokes. If I'm frequently accessing these files, the extra keystrokes add up.

So here's my plea: the next time you're coming up with a naming convention for files you're sticking in a directory, consider moving the “category” component to the end and thinking of autocomplete-friendly names. Your fingers will thank you for it.

(Hat-tip to GameTeX for showing me the light.)

Hacking git-rerere

An unusual workflow for Git, one that Wizard employs extensively, is when a single developer needs to perform merges inside lots of working copies. Normally, a maintainer would pull from the branches he cared about, and offload a large amount of the work to those who were interested in contributing patches. However, Wizard is using Git to provide a service for people who don’t know and aren’t interested in learning Git, so we need to push updates and merge their software for them.

The problem encountered when you are doing a lot of merging is “repeated resolution of the same conflicts.” The solution, at least for the classical case, is git rerere. This feature saves the resolution of conflicts and then automatically applies those resolutions if the conflict comes up again. You can find out this much if you check man git-rerere.

Unfortunately, this merge resolution data is stored per .git directory, specifically in the rr-cache subdirectory, so some modest cleverness is necessary to make this work across many repositories. Fortunately, the simple solution of symlinking all of the rr-cache directories to a common directory both works and is safe from race conditions when initially merging (it's not race-safe when writing out resolutions, but I am willing to consider contention there low enough to be a non-issue).
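
A sketch of that setup, assuming a shared cache at /srv/rr-cache and working copies living under /srv/repos (both paths are hypothetical):

mkdir -p /srv/rr-cache
for repo in /srv/repos/*; do
    rm -rf "$repo/.git/rr-cache"            # discard any per-repository cache
    ln -s /srv/rr-cache "$repo/.git/rr-cache"
done
git config --global rerere.enabled true     # rerere still needs to be turned on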

Why is this solution race safe? At first glance at the code in rerere.c, this would seem not to be the case: if two merges were to happen to generate the same merge conflict (precisely the use case of git rerere), the following code would get executed with the same value of hex:

ret = handle_file(path, sha1, NULL);
if (ret < 1)
        continue;
hex = xstrdup(sha1_to_hex(sha1));
string_list_insert(path, rr)->util = hex;
if (mkdir(git_path("rr-cache/%s", hex), 0755))
        continue;
handle_file(path, NULL, rerere_path(hex, "preimage"));

The last three lines access the (now shared) rr-cache directory, and handle_file will attempt to write out the file rr-cache/$HEX/preimage with the preimage contents; if both instances run handle_file concurrently, this file will get clobbered.

But, as it turns out, we don’t care; barring a SHA-1 collision, both instances will write out the same file. The signature of handle_file is:

static int handle_file(const char *path,
         unsigned char *sha1, const char *output)

The first argument is a path to read conflict markers from, and is mandatory. sha1 and output are optional: if output is not NULL, it names a file to which the entire contents, minus any diff3 original-text sections (the ones delimited by ||||||| and =======), get written; if sha1 is not NULL, it receives a 20-byte binary digest of the contents that output would have received. And thus, balance is restored to the world.

Addendum

Anders brings up the interesting question of whether two processes writing the same contents to the same file is actually race-safe. Indeed, a very similar situation involving two processes writing the same contents to the same file is a classic example of a race condition:

((echo "a"; sleep 1; echo "b") & (echo "a"; sleep 1; echo "b")) > test

Under normal circumstances, the contents of test are:

a
a
b
b

But every once in a while, one of the processes loses the race, and you get:

a
a
b

due to a non-atomic combination of writing and updating the file offset.

The distinguishing characteristic between this example and Git's case, however, is that in this example there is only one file descriptor, shared by both subshells. In Git's case, there are two file descriptors, since each process called open independently. A more analogous shell script would be:

((echo "a"; sleep 1; echo "b") > test & (echo "a"; sleep 1; echo "b") > test)

the contents of which are (as far as I can tell) invariably:

a
b

Now, POSIX actually doesn’t say what happens if two writes to the same offset with the same contents occur. However, casual testing seems to indicate that Linux and ext3 are able to give a stronger guarantee that writing the same values won’t cause random corruption (note that, if the contents of the file were different, either combination could be possible, and this is what you see in practice).

Too many leftovers!

A bad habit in the domain of cooking that I’ve picked up from being a starving college student is the propensity to cook all of the ingredients I have on hand. Combine this with the fact that vegetables make a lot of food, the fact that you were planning on feeding 15-20 people (but realistically only fed 10), and that Costco has very large portions, and you have a recipe for post-Mystery Hunt leftover disaster.

Specifically, I now have a large pot filled to the brim with stir fried broccoli, chicken, carrots, celery, potatoes and beef chilling out in my refrigerator. It’s tasty food (if I may say so myself), but there’s only so much stirfry one person is willing to eat… and the quantities here would probably feed someone for a month. Perhaps it is time to invoke the power of living in a dorm.

Five advanced Git merge techniques

Have you ever performed a merge in Git and not had it quite turn out the way you wanted it to? For example, you accidentally converted all of your UNIX line endings to DOS line endings, and now the entire file reports a conflict? Maybe you see a conflict that you don't really care about resolving, and want to just resolve it as theirs? Or perhaps the conflicted file is empty and you can't figure out just what happened there?

Here are some advanced techniques you can apply to your conflicted merges to make things go a little easier. Many of them utilize Git plumbing; that is, the internal commands that interface directly with the bare metal Git abstractions: the index, the tree, the commit graph. Others are as simple as flipping a configuration switch.

  1. Turn on diff3-style conflicts using git config --global merge.conflictstyle diff3. The diff3 conflict style adds an extra section between a new ||||||| marker and the ======= marker, which shows the original contents of the section, with your changes above and their changes (from the branch being merged in) below. diff3 is a powerful way of reestablishing the context of a change you made several months ago (to see the changes you made, compare the middle section with the upper section; for the changes they made, compare the middle section with the lower section), and there is really no good reason not to have this on by default.

  2. If you’ve come in from Subversion, you may be familiar with the FILE.mine, FILE.r2 (the original you worked with) and FILE.r3 (the latest version checked in) files, as well as the ability to run svn resolve --accept theirs-full or mine-full, which says “I don’t care about the other changes, just use this version of the file.” Git offers similar facilities utilizing the merge parents, although they’re a little more hidden.

    You may already be familiar with the git show command, which lets you view commits as well as arbitrary blobs inside the tree of any given commit. When you are inside a merge, you can use a special :N: syntax, where N is a number, to automatically select one of the merge parents. 1 selects the common base commit (the lower revision), 2 selects your version (“mine”), and 3 selects their version (the higher revision). So git show :3:foobar.txt shows the upstream version of foobar.txt.

    To actually use one of these versions as the merge resolution, use git checkout {--ours|--theirs} filename.txt.

  3. When you're in a conflict, git diff will give you the deep and dirty details of all the conflicts that occurred; sometimes this is too much information. In that case, you can run git ls-files -u to view all of the unmerged files (this is also a lot faster than git status, and omits all of the files that merged properly).

    You may notice that there are as many as three copies of a file inside the list; these tell you the state of the “common”, “ours” and “theirs” copies mentioned previously. If 1 (common) is missing, it means the file appeared at the same time in our branch and their branch. If 2 (ours) is missing, it means we deleted the file, but it got changed upstream. If 3 (theirs) is missing, it means we made some changes, but upstream deleted the file. This is especially useful if a file is conflicted but you can't figure out why (since there are no conflict markers).

  4. Sometimes life gives you lemons. Many people suggest you make lemon juice. However, if Git gives you a really bad set of conflict markers (for example, you accidentally flipped the newline style for one of the files, so now the entire file conflicts), don't settle for that: redo the merge for that file. You can do this with the handy git merge-file command. This runs a three-way file merge; it takes three arguments (the current file, the common file, and the upstream file) and writes the merge result into the current file (the first argument). Use git show to dump out your file, the common file and the upstream file, make whatever changes to those files you need (for example, run dos2unix), run git merge-file mine common theirs, and then copy mine over the old conflicted file (a sketch of this follows the list). Voila, instant new set of conflict markers.

    If you discover a global conflict relatively early in the merge process, and it was your fault, it might be easier to back out of the merge with git reset --hard, fix the mistake, and try merging again. However, if you've already made substantial progress merging a copy, re-merging just a single file can be a lifesaver.

  5. Don't merge, rebase! Instead of running git pull, run git pull --rebase. Instead of running git merge master, run git rebase master. Your history will be much cleaner as a result, and you won't have to go on a massive rebase marathon later if you want to submit your patches upstream.
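
As a rough sketch of techniques 2 through 4 applied to a single conflicted file (conflicted.txt and the dos2unix step are stand-ins for whatever file and fix-up you actually need):

git ls-files -u                        # list the unmerged paths and their stages
git show :1:conflicted.txt > common    # stage 1: the common ancestor
git show :2:conflicted.txt > mine      # stage 2: our version
git show :3:conflicted.txt > theirs    # stage 3: their version
dos2unix common mine theirs            # undo whatever caused the bogus conflict
git merge-file mine common theirs      # redo the three-way merge into "mine"
cp mine conflicted.txt                 # only the real conflicts remain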

Now, go forth and merge to thy heart's content!

Typeclasses matter

Typeclasses matter. In fact, I'll go as far as to say that they have the capacity to replace traditional object-oriented programming. To understand why, however, we have to review the traditionally recognized benefits of object-oriented programming:

  • Organization. For C-inspired languages that don't have a module system, this is incredibly important; without a discipline for organizing code, finding the location of any given function is difficult unless you are intimately familiar with the problem domain. With object-oriented programming, all of these aspects are obvious: classes map into obvious filenames, methods go in obvious places, and overall organization is a function of how well the object model is designed, not how well thought out the include files were.
  • Encapsulation. Objects were the first widely utilized method of hiding data and code from clients. Declare something private or protected, and you have compile-time guarantees that your clients won’t have their grubby fingers all over your innards. Used properly, modularity follows.
  • Polymorphism. The ability to change behavior based on data is a powerful idea dating back to the days of (void *); it can lead to incomprehensible code flow, but more often it is a more elegant and concise way of writing complicated interactions than a giant switch statement. These benefits compound in situations where multiple dispatch is appropriate, and interfaces can lead to compile-time assurances that a particular class does what it says it does.
  • Inheritance. While a problematic facet of object-oriented programming (especially when manifest as multiple inheritance), inheritance is still an extremely powerful mechanism of code reuse in object-oriented designs. Subclasses get a default implementation for methods, as well as the ability to break through a level of encapsulation and use protected methods.

Typeclasses directly fulfill some of these requirements, while others are achieved due to Haskell’s strict types and module system.

  • Organization. At first blush, this seems strictly worse: we can no longer read off a class name and find the right file. However, a combination of ghci, which lets you run :info to find the location of any declaration in scope, and Hoogle, which lets you find the function you want from just a type signature, fills the gap. These capabilities make it incredibly easy not only to find functions that you know exist, but also to find ones you don't know exist. Static typing to the rescue!
  • Encapsulation. This feature is implemented by Haskell’s module export system: essentially, if you don’t export a constructor for any given data type, the end user cannot create or introspect inside that type; they have to use the functions you define to manipulate them. Additionally, if functions specify that their input types should be instances of a typeclass, there is a statically-checked type guarantee that the function will only use functions defined by the typeclass (i.e. no unsafe downcasts).
  • Polymorphism. This is the most obvious application of typeclasses; when explaining them to those coming in from imperative languages, the most common analogy made is that typeclasses are like interfaces. They are far more expressive than interfaces, however: functions can trivially specify that an incoming data type must satisfy multiple typeclasses, and parametrized types (those with extra type variables in their declaration) can have type classes constraining their type parameters. Furthermore, code can be written to be fully general over a typeclass, to be instantiated later down the line once an explicit type is inferred.
  • Inheritance. Interface inheritance is a straightforward subset of type parametrization; instead of saying class Monad m, we say class Functor m => Monad m, thus stating that any m with a Monad instance must also have a Functor instance (and thus we may freely use any Monad as if it were a Functor). The ability to specify default implementations (often self referential, as x /= y = not (x == y) and x == y = not (x /= y) of the Eq class attest) goes a long way to make writing new instances easy.

Classic object hierarchies are an excellent mechanism for modelling “is a” relationships, but very few things in this world are actually cleanly “is a”, as opposed to “acts like a”; and inheritance has been abused by many developers who have created large object hierarchies (cough GUI toolkits cough) when all that is really being exercised is inheritance's code-reuse mechanism. The emphasis on typeclasses/interfaces gets back to the heart of the problem:

What can I do with this type?

No more, no less.

Sup: Mail for Nerds

Update (September 1, 2012): This post is a bit out of date. I'm planning on writing an update, but the main new points are: if you have an SSD, the startup time of Sup is really fast, so you can easily run it on your laptop; and you should use the maildir-sync branch, which gives you backwards synchronization of your labels (or my patchset, which is pretty sweet but needs to be polished and published).


I use Sup and I love it; never mind the ridiculing from friends who've found their inboxes getting painfully slow as they broke the hundred thousand message mark, or managed to get their indexes obliterated. It hasn't quite been an easy road to email nirvana and a ten-email inbox, so here's a step-by-step guide for setting up Sup for your own geeky emailing needs. We'll be using tip-top everything, which means running off a Git checkout of the next branch, using Xapian indexes and using OfflineIMAP.

  1. Get a server you can SSH into and run screen on. Sup has a nontrivial startup time, so the best way to work around it is to never shut down the process. It also saves you the trouble of needing ISP-sensitive SMTP switching.
  2. Set up OfflineIMAP to slurp down your mail. IMAP is generally slow, and I find I care enough about my mail to want a local backup. The configuration of .offlineimaprc was slightly fiddly (I blew away my results twice before getting the right setup); see the end of this post for the template I ended up using. Since the import process will take a long time, double-check your configuration before running.
  3. Set up a Ruby environment; Sup works on Ruby 1.8 but not 1.9. If you're on Ubuntu Jaunty, you'll want to manually install RubyGems; on Karmic the packaged version works fine.
  4. Grab the dependency gems. This is as simple as installing the Sup gem using gem install sup, and then removing just the Sup gem with gem uninstall sup.
  5. Grab a copy of Sup from Git using git clone git://gitorious.org/sup/mainline.git sup. Inside your shell’s rc file (.bashrc for Bash users), set your PATH to include $SUPDIR/bin and your RUBYLIB to include $SUPDIR/lib. An example set of lines to add can be found at the bottom of this post.
  6. Run sup-config to set up general configuration. When it prompts you to add a new source, add a Maildir source, specifying a folder inside the directory you asked OfflineIMAP to sync to (for example, I asked OfflineIMAP to download my mail to ~/Mail/MIT, so ~/Mail/MIT/INBOX would be a valid folder for my Maildir). When I switched to Sup, I stopped using server-side folders, so this is the only source I have; if you still want to use them you'll need to add each one as an independent source.
  7. Open up .sup/config.yaml in your favorite editor and add :index: xapian on a new line. An alternative method would have been to set an environment variable, but I prefer this method as more resilient.
  8. There are a few hooks that I unilaterally recommend you set up when you start using Sup. Since you are using OfflineIMAP, the before-poll hook that executes OfflineIMAP prior to a poll is a must. There is also no good reason for you not to be running the “Automatic backups of your labels” startup hook.
  9. Load up sup in a screen session and enjoy!

.offlineimaprc template:

[general]
accounts = MIT

[Account MIT]
localrepository = LocalMIT
remoterepository = RemoteMIT

[Repository LocalMIT]
type = Maildir
localfolders = ~/Mail/MIT

[Repository RemoteMIT]
type = IMAP
ssl = yes
remotehost = $HOST
remoteuser = $USERNAME
remotepass = $PASSWORD

.bashrc template (assuming Sup lives in $HOME/Dev/sup):

export PATH=$HOME/Dev/sup/bin:$PATH
export RUBYLIB=$HOME/Dev/sup/lib

First impressions of the VX-8R

The VX-8R is the first ham radio I've ever owned; I have used the VX-7R before, but the extent of my usage of it was someone handing me the radio and saying, “Here is the radio preconfigured to the frequencies you'll need; here's how squelch works; here's how to debug common problems; don't fuck it up.” Here are my impressions of the VX-8R.

  • Despite the sturdy construction, I am sending it back for warranty replacement. The battery indicator is defective; it is stuck at 100% battery while discharging, and 0% battery when charging. According to a HRO representative, this was highly unusual. Having to send the radio back for replacement is kind of obnoxious, but eh, what can you do.
  • The Yaesu tries hard to be non-modal, but when it is modal, it's slightly difficult to tell what keys do what. For example, when scanning, pressing PTT terminates the scan, but so do BAND and the arrow keys. PTT is actually a fairly reliable method for getting out of FOO mode.
  • I love the scanning interface. Hold UP/DOWN to initiate scanning, use the dial to nudge it if it gets stuck on the wrong thing, and whack PTT when you hear something interesting.
  • The stereo headphone jack is by far one of the best things about the VX-8R, and partially offsets the suckage of needing two adapters in order to get a split speaker and PTT microphone set. I've, uh, been listening to a lot of FM radio with my Yaesu (perhaps not the most interesting use, but a use nonetheless!). The stereo plug is recessed inside a fairly deep well, so you may have some difficulty getting shorter jacks to plug in soundly.
  • On the subject of mods, it appears that despite having been released about a year ago, the VX-8R still has no software mod available. The current hardware mod only opens up MARS/CAP transmission frequencies.

Still no word about the microphone dilemma; I might just pony up some cash for a Pryme headset (they’re a smidge more expensive than I’d like them to be).

Why Haskell? The big question

Language selection is a contentious thing, and often a compromise between “pick the right language for the job” and “use as few languages as possible to increase mindshare.” Google, for example, limits the programming languages their employees are allowed to use; and I have come to regard picking whatever language you want for your own projects as irresponsible, having once been told, “Yeah… that project was written in X and no one besides the guy who wrote it knows X… probably not a good use of your time to work on it.” Of course, I've been quite culpable of this myself; I wrote the member dues tracking system for the Assassins' Guild in Haskell, and unless a miracle happens I am kind of doubtful future maintainers will be able to deal with it.

When I am not being irresponsible, Python is my favored language for most of my scripting needs, and as such I am painfully aware of quirks in the language that Haskell would smooth away.

  • Python code is dynamically typed and variables have no scoping. Brain-o typing errors, variable misnamings and plain ole broken code aren't caught unless a code path is exercised. What makes it better: pylint -e catches large classes of errors (but you commonly have to wade through recursion limit errors to find them, and I strongly believe any error checking not built into the compiler is doomed to be ignored by the people who need it most), as does full code coverage from whatever automated tests you have. Why Haskell rocks: the static analysis is complete enough that if it compiles, it runs correctly.
  • Python is slow. If you don't believe it: ask yourself why the runtime can't be loaded quickly enough to make running Python as CGI tenable, or why Google has banned it from living in public-facing code, or why engineers treat rewriting a Python daemon in C++ as the inevitable conclusion when you just can't wring out any more speed. What makes it better: Not everything has to be blazing fast. Why Haskell rocks: Insane people writing insane compilers like GHC which compile into native binaries and have absolutely epic speed.
  • Python has a limit to comprehensible code golfing. While duplication of high-level code structure is nowhere near as bad in Python as it might be in Java or C++, attempts to purify code even further often lead to highly incomprehensible higher-order functions that require copious documentation. As people say, “Don't write Haskell in Python.” Why Haskell rocks: The type system not only becomes essential to the documentation of the code, it also serves as a framework by which a user can “snap” together combinators and data like Lego blocks, leading to a much higher tolerance of complexity.
  • Python has inherited an aging object-oriented paradigm. However, I am increasingly convinced that typeclass based systems (Go is one decidedly imperative language that has picked them up) are the way to go. In combination with type signatures, they provide the two primary benefits of OOP: a logical organization of functionality and polymorphism, without all of the complicated gunk that is multiple inheritance, mix-ins, metaclasses, etc. Why Haskell rocks: First-class support for type-classes.
  • Python has abysmal threading support: it has the global interpreter lock. Why Haskell rocks: It not only has fast, green threads and the notion of purity to make splitting computations feasible, it has also made it extremely simple to experiment with scheduling algorithms for the computation. I can't say much more in this field, because I have very little experience writing parallel Haskell code.

But I would cringe to attempt to write in Haskell one of the large projects that I have done in imperative languages like PHP or Python (I mention these two particular languages, because within them I have built two systems that are actually large), for these very important reasons:

  • Haskell has not grown sufficient library support to become fully multiparadigm. I am highly skeptical that a straightforward port of any given piece of Python code would be possible; despite great advances in shepherding the more dynamic features of Python into Haskell's type system with packages such as Text.Printf, the action at a distance intrinsic to an imperative program would require heavy IO wizardry in Haskell.
  • It’s not obvious which problems in the imperative domain truly are better kept in the imperative domain, as James Hague has mused recently. The Haskell community is fairly unified in its stance that as little code should be in the IO monad as possible, but when we bring in small bits of the imperative world to help us in cases such as the State monad, we acknowledge the imperative paradigm is useful… or at least an easy way out. Perhaps if we tried harder we could find a more elegant, maintainable, purely functional solution; and one of the favorite pastimes of academics is to figure these things out. But this is hard even for those used to thinking functionally, and the answers often need to be discovered, let alone implemented.
  • All of the engineering folklore, wisdom and best practices that have been accumulated from years of hacking on large, imperative codebases may or may not apply to functional codebases anymore. If functional libraries are encouraged to be as decoupled as possible, do we need to allow for further decoupling in the API? Does pure code need to be logged, or does its determinism make it trivial to debug? What testing do we need, and how much trust do we put in the types? How do API documentation and cross-referencing need to evolve for functional codebases? How the heck do you go ahead and debug a logic error in a section of pure Haskell code?

Yet, there are companies that are putting out production-size codebases in Haskell, which makes me optimistic that answers to these questions will soon become public knowledge; if not for Haskell, for some other purely functional language. And the “classic” solutions in the imperative world have often led to insidious errors, especially in a world of multithreaded applications, and we must not settle for “good enough.” Software sucks, but purely functional systems with strong, flexible types have the promise to eliminate large swaths of this suckage. And that is why I Haskell.