ezyang's blog

the arc of software bends towards understanding

Refactoring Haskell code?

I have to admit, refactoring Haskell code (or perhaps even just functional code) is a bit of a mystery to me. A typical refactoring session for me might look like this: sit down in front of code, reread code. Run hlint on the code, fix the problems it gives you. Look at the code some more. Make some local transformations to make a pipeline tighter or give a local subexpression a name. Decide the code is kind of pretty and functional and go do something else.

Part of the problem is that I haven’t developed a nose for common code smells in functional programs. The odors I might detect in code written in other languages, such as overly long functions and methods, duplicate code and overly coupled code, exist to a far smaller degree in my Haskell programs. Most functions I write are only a few (albeit dense) lines, lightweight first-order helper functions make ad hoc code sharing very easy, and default purity encourages loose coupling of state. That’s not to say there aren’t problems with the code: code written in do-blocks can quickly balloon to dozens of lines (this seems inevitable if you’re programming on gtk2hs), higher-level boilerplate requires more advanced tricks to scrap, and it’s very convenient and tempting to simply shove everything into the IO monad. But the level of these problems seems low enough that they can be brushed aside.

I can write code that really bothers me when I come back to it, either to understand it again or to extend it to do other things. On an ad hoc basis, I’ve discovered some things that can make long-term maintenance a little more troublesome:

  • Insufficiently general types. Explicitly writing out your type signatures is a good thing to do when you’re debugging type errors, but if you let the function’s type be inferred, you may find that it is far more general than the obvious signature suggests. Code that has State () as its type can usually be generalized to MonadState m => m (), and in many cases (such as error handling) you will almost certainly want this generalization down the road (see the sketch after this list).
  • Monolithic functions. If you’re writing a piece of functionality top-to-bottom, it’s really easy to say, “Hmm, I need a function of type FilePath -> String -> IO [FilePath]” in several places and forget that the internal code may be useful for some speculative future use of the program. Sometimes this is easy to resolve, since you had a three-liner that should have been three one-liners, or too much code in a monad that didn’t need to be, but even then you still have to choose names for all of the sub-functions, and in some cases, the division isn’t even clear.
  • Insufficiently general data structures or recursion duplication. When you’re reducing a complex recursive structure, it’s quite easy to pick just precisely the data structure that will contain the data you want. But if you then decide you want some other information that can’t be shoehorned into your structure, you have two choices: retrofit all of the existing code you wrote for the recursion to make it contain the extra information you were looking for, or write a whole new set of functions for recursively traversing the data structure. For complex functions, this can be a fairly large set of pattern matches that need to be handled. (Yes, I know you can Scrap Your Boilerplate, but in some cases it feels slightly too heavy a weapon to wield on code.)
  • Orphan instances. Sometimes the library writer just didn’t put the instance you wanted into their code, and you’re faced with a choice: the easy, sinful route of defining an orphan instance, or being a good citizen and newtype’ing, and eating the extra verbosity of wrapping and unwrapping. Then a library update comes along and breaks your code.
  • Ad-hoc parsing. While extremely convenient, read and show were not actually designed for production. I’ve spent time crafting Read instances long after I should have switched to using a parsing library.
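To illustrate the first point concretely, here is a minimal sketch (using the MonadState class from mtl; the function names are hypothetical):

import Control.Monad.State

-- The "obvious" signature nails the function to one concrete monad:
bumpSpecific :: State Int ()
bumpSpecific = modify (+1)

-- The inferred type is more general, and works in any monad stack
-- carrying Int state, e.g. StateT Int IO.
bump :: MonadState Int m => m ()
bump = modify (+1)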

But I’m really curious what you look for in code that you know is going to bite you in the future, and what steps you take to mitigate the risk.

Nested Data Parallelism versus Creative Catamorphisms

I got to watch (unfortunately not in person) Simon Peyton Jones’ excellent talk (no really, if you haven’t seen it, you should carve out the hour necessary to watch it) on Data Parallel Haskell (slides). The talk got me thinking about a previous talk on parallelism given by Guy Steele that I had seen recently.

What’s the relationship between these two talks? At first I thought, “Man, Guy Steele must be advocating a discipline for programmers, while Simon Peyton Jones is advocating a discipline for compilers.” But this didn’t really seem to fit: even if you have a clever catamorphism for the problem, the overhead of fully parallelizing everything may be prohibitive. As Steele notes, we need “hybrid sequential/parallel strategies,” the simplest of which is “parallelize it until it’s manageable and run the fast sequential algorithm on it,” à la flat data parallelism. Nor is nested data parallelism a silver bullet; while it has wider applicability, there are still domains on which it fits poorly.

I believe that Nested Data Parallelism will be a powerful and practical (well, at least once the Data Parallel Haskell team works out the kinks) tool in the quest for efficiently implementing catamorphic programs. In particular, it takes the huge win of chunking that characterized flat data parallel programs, and combines it with the powerful abstraction of nested parallel data. It promises to eliminate the drudgery of splitting a parallel data structure into even chunks to pass off to the separate processors. It does not resolve issues such as what to do when the input data doesn’t come in a parallel structure (you might notice that Data Parallel Haskell is primarily useful on numeric types: doubles, integers and words) and it still relies on the existence of a convenient reductive function for the parallel structure you’ve chosen.

Omnipresent Cabal

A short public service announcement: you might think you don’t need Cabal. Oh, you might be just whipping up a tiny throw-away script, or a small application that you never intend to distribute. Cabal? Isn’t that what you do if you’re planning on sticking your package on Hackage? But the Cabal always knows. The Cabal is always there. And you should embrace the Cabal, even if you think you’re too small to care. Here’s why:

  1. Writing a cabal file forces you to document what modules and what versions your script worked with when you were originally writing it. If you ever decide you want to run or build your script on another environment, the cabal file will make it dramatically easier to get your dependencies and get running faster. If you ever update your modules, the cabal file will partially insulate you against API changes (assuming that the package follows Hackage’s PVP). This is far more palatable than GHC’s package-qualified imports.
  2. You might have cringed about writing up a Makefile or ant file to build your projects in another language; as long as it is just one or two files, the pain associated with these build languages seems to outweigh the cost of just running gcc foo.c -o foo. Cabal files, however, are drop-dead easy to write (see the sketch after this list). There is even a cabal init to do the scaffolding for you. Toss out the dinky shell script that you’ve kept to run ghc --make and use cabal configure && cabal build.
  3. It gives you nice things, for free! Do you want Haddock documentation? A traditional GNU-style Makefile? Colourised code? Cabal can do all of these things for you, with minimal effort after you have your cabal file.
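For reference, a minimal cabal file for a throw-away executable can be just a handful of lines; this is a sketch with hypothetical names and version bounds, not a definitive template:

name:                my-script
version:             0.1
build-type:          Simple
cabal-version:       >= 1.6

executable my-script
  main-is:           Main.hs
  build-depends:     base >= 4 && < 5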

Name conflicts on Hackage

Attention Conservation Notice. Unqualified identifiers that are used the most on Hackage.

Perhaps you dread the error message:

Ambiguous occurrence `lookup'
It could refer to either `Prelude.lookup', imported from Prelude
                      or `Data.Map.lookup', imported from Data.Map

It is the message of the piper who has come to collect his dues for your unhygienic, unqualified, unrestricted module import style.

Or perhaps you’re a library writer trying to think up a new symbol for your funky infix combinator, but you aren’t sure what other libraries have already used.

I took the archive (TAR) of the latest Hackage packages for everything, whipped up a script to extract all unqualified names exported by public modules, and then counted up the most used.

Disclaimer. Data constructors and record fields, unless they were explicitly exported, are not included in this count. I also don’t count modules that export everything from the global namespace because they omitted a list of names to export. Counts are per module, and not per package. CPP and HSC files were not counted, due to limitations of haskell-src-exts.

Top twenty identifiers (as of September 2, 2012).

106 empty
69 insert
69 toList
66 fromList
56 null
54 singleton
44 run
42 encode
41 decode
41 delete
39 size
37 theModule
35 member
32 parse
31 get
30 lookup
30 union
29 Name
29 space
28 Node

Top twenty infix operators (as of September 2, 2012).

25 !
19 <>
17 <+>
14 </>
11 <$>
10 //
10 ><
 9 .:
 9 <$$>
 9 ∅
 8 &
 8 .=
 8 <?>
 8 <||>
 8 \\
 8 |>
 7 #
 7 $$
 7 *.
 7 <->

The exclamation mark has earned a reputation as an “indexing” operator, and unsurprisingly it is at the top. I hear from Edward Kmett that <> is making its way into base as a synonym for mappend, which is welcome, although it might suck for the other six modules that redefined it for their own nefarious purposes.

All infix operators, sorted by usage and then lexicographically (as of September 2, 2012).

! <> <+> </> <$> // >< .: <$$> ∅ & .= <?> <||> \\ |> # $$ *. <-> <. <//>
<| <|> ==> >. ||. ∈ ∉ !! &&. ++ +++ /=. <=. =: ==. >=. ∋ ∌ ∩ ∪ .|. :->
<: ? ∆ ∖ .&. .* .-. <&> <.> << === ?? @@ \/ ^^ |+ |- ||| ~~ !!! !> !? ##
$+$ += +> -<- .*. .:? .<. .==. .>. /=? /\ :- :> :~> <$?> <+< <=> <=? <?
<|?> =. ==? =~ >-> >=? >? @# ^ ~> ¬ ∘ ∧ ∨ ≡ ≢ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ !: $# $>
$~ % %> && &&? &= ** *|* + --> ->- -| . .!= .!=. .&&. .&.? .*> .+ .++.
.+. ... ./. ./\. .:: .<=. .=. .=> .>=. .\/. .| .||. :* :+ :. := :=: <*.
<*> <++ <++> <..> <:> <<|> <== <|*|> =$= >+> >=> >>>= >|< ?> ?>= @@@ ^#
^$ ^: ^^^ |* || ||* ||+ ||? ~: ~? ≠   ≮ ≯ ⊕ ⧺ !$ !$? !. !=. !>>= #! #!!
#~~ $ $! $$$ $$$? $$+ $$++ $$+- $$= $- $. $.// $/ $// $= $=! $? $| $~!
%% %&& %+ %/= %: %< %<= %== %>= %|| &#& &&& &+ &. &.// &/ &// &=# &> &@
&| * *! *& *&&&* *&* ***** ****/* ****/*** ****//* ****//*** ****|*
****|*** ****||* ****||*** ***/* ***/** ***/**** ***//* ***//**
***//**** ***|* ***|** ***|**** ***||* ***||** ***||**** **. **/* **/***
**//* **//*** **> **|* **|*** **||* **||*** */* */** */*** */**** *//*
*//** *//*** *//**** *<<<* *=* *=. *=>* *> *>>>* *? *@ *^ *|** *|***
*|**** *||* *||** *||*** *||**** +% ++. ++> ++>> ++@ +/+ +: +:+ +=. +>>
+@ +^ +| - -!- -$ -->> -/\- -: -< -<< -<=- -=. -=> ->> -?- -?-> -?> -?>>
-@ -\/- -^ -|- -~> .! .# .$. .- .--. .->. .... ./ ./= ./=. .:. .::: .<
.<<. .<= .== .>>. .@ .@$ .@~ .\. .|| .~ .~. / /+/ /- /. /<-. /=: />/ /^
/| /~ /~? :*: :+: :-: :<-> :<: :<=: :<> :<~> :=+ :><: :~ <! <#$> <$| <%
<&&> <* <+ <-$ <-- <-. <-: </=? <<! <</ <<: <<< <<? <<\ <<| <<~ <=! <=:
<==? <=@ <=@@ <>>= <?< <??> <@ <@> <@@ <~ =$ =$$= =*= =/= =< =<< =<<!
=<<< =<= =<>= =<? ==! =>> =~= =≪ >! >$$< >$< >*> >-- >-< >: >:> >=! >=:
>== >===> >=>=> >=@ >=@@ >> >>-> >>. >>= >>=# >>== >>=\/ >>=|\/ >>=||
>>=||| >>> >>@ >?> >@ >@@ >||< ?! ?+ ?/= ?: ?< ?<= ?= ?== @! @= @==? @=?
@? @?= @?== \== ^% ^-^ ^. ^>>= ^@ ^^. |#| |$> |*| |-> |-| |. |/ |// |:
|<- |= |=> |=| |? |@ |\ |\\ |||| ~/= ~== ~=? ~?= ~|||~ ~||~ ~|~ ~~# ~~>
~~? ~~~> · ·× × ×· ÷ ⇒ ⇔ ∀ ∃ ≫ ≫= ⊛ ⊥ ⊨ ⊭ ⊲ ⊳ ⋅ ⋈ ⋘ ⋙ ▷ ◁ ★ 

It’s a veritable zoo! (I’m personally reminded of Nethack.)

Source. The horrifying code that drove this exercise can be found on GitHub. I used the following shell one-liner:

for i in *; do for j in $i/*; do cd $j; tar xf *.tar.gz; cd ../..; done; done

to extract all of the tarballs inside the tar file.

Postscript. It would be neat if someone could fix the discrepancies that I described earlier and do a more comprehensive/correct search over this space.

Design Patterns in Haskell

Attention Conservation Notice. A listing of how Gang of Four design patterns might be equivalently implemented in Haskell. A phrasebook for object-oriented programmers dealing with functional programming concepts.

In the introduction to their seminal work Design Patterns, the Gang of Four say, “The choice of programming language is important because it influences one’s point of view. Our patterns assume Smalltalk/C++-level language features, and that choice determines what can and cannot be implemented easily. If we assumed procedural languages, we might have included design patterns called ‘Inheritance,’ ‘Encapsulation,’ and ‘Polymorphism.’”

What is easy and what is hard to implement in a functional programming language? I decided to revisit all 23 original Gang of Four design patterns under that lens. My hope is that these results will be useful to object-oriented programmers seeking to learn the ways of functional programming.

Strategy. First-class functions and lambdas. Any extra data that might be stored as class members is traditionally captured using closures (which stash the data in a lambda’s environment) or currying (which creates implicit closures for a function’s arguments). Strategies are also powerful because they are polymorphic; type synonyms for function types can play a similar role. Java has recognized anonymous functions as a good idea, and has added facilities for anonymous classes, which are frequently used in this capacity.
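A minimal sketch of the idea (sortBy and comparing are standard library functions; everything else is hypothetical):

import Data.List (sortBy)
import Data.Ord (comparing)

-- The comparison "strategy" is just an argument; no Strategy
-- interface or subclass in sight.
sortByLength :: [[a]] -> [[a]]
sortByLength = sortBy (comparing length)

-- A closure captures the extra data a Strategy object would have
-- carried as a field.
sortByDistanceTo :: Int -> [Int] -> [Int]
sortByDistanceTo origin = sortBy (comparing (\x -> abs (x - origin)))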

Factory Method and Template Method. Higher-order functions. Instead of making a subclass, just pass in the function whose behavior you’d like to vary.
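For instance, a sketch of a “template” whose varying step is a plain argument (the names here are hypothetical):

-- The skeleton is fixed; the step a subclass would override is
-- simply passed in as a function.
processFile :: (String -> String) -> FilePath -> IO ()
processFile transform path = do
    contents <- readFile path
    putStrLn (transform contents)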

Abstract Factory, Builder and Bridge. Type classes and smart constructors. Type classes are capable of defining functions which create instances of themselves; all a function needs to do to take advantage of this is to commit to returning some value of type TypeClass a => a and to use only (constructor et alia) functions that the type class exposes. If you’re not just constructing values but also manipulating them through the general type class interface, you have a Bridge. Smart constructors are functions built on top of the basic data constructor that do “more”; whether that is invariant checking, encapsulation or an easier API, they can correspond to the more advanced methods that a factory provides.
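A sketch of a smart constructor (the names are hypothetical; in a real module you would export mkProbability but not the Probability data constructor):

newtype Probability = Probability Double

-- Enforces an invariant the raw constructor cannot.
mkProbability :: Double -> Maybe Probability
mkProbability p
    | 0 <= p && p <= 1 = Just (Probability p)
    | otherwise        = Nothing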

Adapter, Decorator and Chain of Responsibility. Composition and lifting. Function composition can be used to form a pipeline of data between functions; a foreign function can be sandwiched between two functions that convert to and from the type it expects, or a function can be composed with another to make it do more. If the signature stays the same, one or more of the functions was endomorphic. If the functions have side effects, it may be Kleisli arrow composition (more plainly spoken, monadic function composition). Multiple functions can handle the same input using the Reader monad.
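A sketch of both flavors of composition ((<=<) is Kleisli composition from Control.Monad; the function names are hypothetical):

import Control.Monad ((<=<))

-- Adapter by plain composition: convert, then apply.
showLength :: String -> String
showLength = show . length

-- The same idea for effectful functions, via Kleisli composition.
countLines :: FilePath -> IO Int
countLines = return . length . lines <=< readFile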

Visitor. Equational functions. Frequently foldable. Many functional languages favor grouping the same operation on different data constructors together, in a mathematical, equational style; this means similar behaviors are grouped together. Traditional grouping of behavior by “class” is implemented with type classes. Visitors typically collapse the data structures they operate on into smaller values; this is seen in the fold family of functions. (A sketch follows the next pattern.)

Interpreter. Functions. Frequently circumvented with an embedded domain specific language. Algebraic data types make light-weight abstract syntax trees easy to formulate. Just as Visitor is often used with Interpreter, you’ll probably write your interpreting functions with pattern matching. Even better, don’t come up with another data type; just use functions and infix operators to say what you mean. Closely related to…
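A sketch serving both of the last two patterns: the whole “visitor” for evaluation is a single function, one equation per constructor (Expr and eval are hypothetical names):

data Expr = Lit Int | Add Expr Expr | Mul Expr Expr

-- All the behavior for one operation lives in one place; adding
-- another operation (say, pretty-printing) touches none of this.
eval :: Expr -> Int
eval (Lit n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b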

Command. Monads. See also algebraic data types, frequently generalized (GADTs). A pure language will not run your IO until main touches it, so you can freely pass around values of type IO a without fear of actually causing the side effect, though such values are difficult to serialize (a common motivation behind Command). Parametrization of the action to perform is once again achieved through higher-order functions. GADTs are a little more bulky, but can be seen in places like the Prompt monad (PDF), where a GADT is used to represent actions that another function interprets into the IO monad; the type gives a statically enforced guarantee of what operations in this data type are allowed to do.
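A sketch of commands as plain values (names hypothetical):

-- Nothing here runs until main sequences it; the list can be stored,
-- reordered or filtered like any other value.
commands :: [IO ()]
commands = [putStrLn "Hello", print 42]

runAll :: IO ()
runAll = sequence_ commands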

Composite. Recursive algebraic data types. Especially prominent since there’s no built-in inheritance.
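For example, a sketch of a composite widget tree (names hypothetical):

-- Leaves and composites inhabit the same recursive type.
data Widget = Label String | Panel [Widget]

countLabels :: Widget -> Int
countLabels (Label _)  = 1
countLabels (Panel ws) = sum (map countLabels ws)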

Iterator. Lazy lists. Iterators expose element-by-element access to a data structure without exposing its internal structure; the list is the API for this sort of access, and laziness means we don’t compute the entirety of the stream until it is necessary. When IO is involved, you might use a real iterator.
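A sketch (names hypothetical):

-- An infinite "iterator"; consumers pull only what they demand.
naturals :: [Integer]
naturals = [0..]

squaresBelow :: Integer -> [Integer]
squaresBelow n = takeWhile (< n) (map (^ 2) naturals)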

Prototype. Immutability. Modification copies by default.

Flyweight. Memoising and constant applicative forms (CAFs). Instead of calculating the result of an expression, create a data structure that contains the results for all possible input values (or perhaps just up to the maximum memo). Because it is lazy, a result is not computed until it is needed; because it is a legitimate data structure, the same result is returned on successive accesses. CAFs are expressions that can be lifted to the top level of a program and whose result can be shared by all other code that references them.
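The classic sketch of memoisation via a lazy CAF:

-- fibs is a top-level CAF: each entry is computed at most once and
-- shared by every caller thereafter.
fibs :: [Integer]
fibs = map fib [0..]

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n - 1) + fibs !! (n - 2)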

State and Memento. Unnecessary; state has an explicit representation and thus can always be arbitrarily modified, and it can include functions, which can be changed to change behavior. State as a function (rather than an object or an enumeration), if you will. The encapsulation provided by Memento is achieved by hiding the appropriate constructors or destructors. You can easily automatically manage past and future states in an appropriate monad such as the Undo monad.

Singleton. Unnecessary; there is no global state except in a monad, and the monad’s type can enforce that only one instance of a record is present; functions exist in a global namespace and are always accessible.

Facade. Functions. Generally less prevalent, since functional programming focuses on inputs and outputs, which makes the straightforward use of a function very short. High generality can require more user-friendly interfaces, typically implemented with, well, more functions.

Observer. One of many concurrency mechanisms, such as channels, asynchronous exceptions and mutable variables. See also functional reactive programming.

Proxy. Wrapped data types, laziness and the garbage collector. See also mutable reference types (IORef, STRef), which give more traditional pointer semantics. Laziness means structures are only created on demand; garbage collection means smart references are unnecessary. You can also wrap a data type and only publish accessors that enforce extra restrictions.

Mediator. Monad stacks. While it’s not useful to talk about interactions between objects, due to a preference for stateless code, monad stacks are frequently used to provide a unified interface for code that performs operations in a complex environment.

Comments and suggestions appreciated; I’ll be keeping this post up-to-date.

Art. Code. Math. (And mit-scheme)

I was in rehearsal today, doodling away at second oboe for Saint-Saëns’ Organ Symphony for the nth time, and it occurred to me: I’ve listened to and played this piece of music enough times to know the full overall flow as well as a good chunk of the orchestral parts, not just mine. So when the hymnal calls give way to the triumphant entrance of the organ in the last movement, or when the tempos start shifting, simultaneously speeding up and slowing down, at the end of the piece, it’s not surprising; it’s almost inevitable. Couldn’t have it any other way.

But we could have had it another way; Saint-Saëns could have decided that he wanted to move around the second movement or introduce another theme or any other multitude of changes. But he composed this piece, and this piece alone, and that is what has been enshrined as beauty.

And it got me thinking about the first problem on my computability problem set, which asked me to show a fundamental truth of the universe (well, within the boundaries of the math philosophers): nonnegotiable, unmoving, universal. Or the programs I write: certainly a creative process, but firmly anchored to the tangible realm via requirements and specifications. How creative those mathematicians and programmers needed to be to craft elegant proofs and programs, and yet how far from artists they remain.

Non sequitur. MIT/GNU Scheme loves spewing out lots of extra banner crud when you run it, even when you don’t actually want to use the interactive REPL and just run some mit-scheme code. As it turns out, the maintainer of mit-scheme made the following decision:

In the past my (CPH) policy for a stable release was that the documentation had to be updated for the release before it went out. In practice, this has meant that there have been no stable releases in recent years. As of this release, we will no longer consider updated documentation a prerequisite for a stable release.

Uh, what?

Anyway, there’s this wonderful undocumented option named --batch-mode which suppresses entry messages. However, in 7.7.90 (the default in Ubuntu Karmic, and don’t you dare try compiling it yourself; you need mit-scheme to compile mit-scheme), it doesn’t suppress the “Loading…” messages, so you need to invoke load with the following hack:

# run-scheme LOAD EVAL
#   LOAD - Scheme file to load
#   EVAL - Scheme expression to evaluate
run-scheme() {
    # --batch-mode doesn't work completely on mit-scheme 7.7.90, in
    # particular it fails to suppress Loading... messages.  As a result,
    # we require this elaborate work-around.
    mit-scheme --batch-mode --eval \
        "(begin (set! load/suppress-loading-message? #t) \
                (load \"$1\") $2)" </dev/null
}

It’s, to put it lightly, kind of disappointing.

Inessential guide to fclabels

Last time I did an Inessential guide to data-accessor and everyone told me, “You should use fclabels instead!” So here’s the partner guide, the inessential guide to fclabels. Like data-accessor, the goal is to make record access and editing not suck. However, fclabels gives you some more useful abstractions. It uses Template Haskell on top of your records, so it is not compatible with data-accessor.

Identification. There are three tell-tale signs:

  1. Type signatures that contain :-> in them (“Oh, that kind of looks like a function arrow… but it’s not? Curious!”),
  2. Records that contain fields with a leading underscore (as opposed to data-accessor’s convention of a trailing underscore), and
  3. An import Prelude hiding (id, (.), mod), with an import from Control.Category to replace them.

Interpreting types. A label is signified by r :-> a which contains a getter r -> a and a setter a -> r -> r. Internally, a wrapped label is simply a point, a structure consisting of r -> a and b -> r -> r, with a required to be equal to b. (As we will see later, a point is useful in its own right, but not for basic functionality.)
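Deriving labels looks something like this (mkLabels is fclabels’ Template Haskell helper; Person is a hypothetical record, and the module name reflects the fclabels of this vintage):

{-# LANGUAGE TemplateHaskell #-}
import Data.Record.Label

data Person = Person { _name :: String, _age :: Int }
$(mkLabels [''Person])

-- Generated: name :: Person :-> String, age :: Person :-> Int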

Accessing record fields.

get fieldname record

Setting record fields.

set fieldname newval record

Modifying record fields. For fieldname :: f :-> a, modifier should have type a -> a.

mod fieldname modifier record

Accessing, setting and modifying sub-record fields. Composition is done with the period operator (.), but you can’t use the one from the Prelude, since that only works with functions. The composition is treated as if you were composing the getter.

get (innerField . outerField) record
set (innerField . outerField) newVal record
mod (innerField . outerField) modifier record

Accessor over applicative. You can use fmapL to lift an accessor into an applicative context. This is useful if your record is actually Maybe r (you can turn r :-> a into Maybe r :-> Maybe a).

But wait, there’s more!

More fun with views. Remember that a point is a getter and a setter, but they don’t have to be for the same types. Combined with a clever applicative instance, we can use this to incrementally build up a label composed of multiple labels. The result looks a lot like a view that you’d be able to create on a relational database. The recipe is:

  1. Have the constructor for the resulting type (e.g. (,), the tuple constructor),
  2. Have all of the accessors for the resulting type (e.g. fst and snd), and
  3. Have the labels you would like to compose together (say, label1 and label2).

Combine each accessor for the resulting type (2), using for, with the label (3) to be accessed by that accessor; combine all of the resulting points with the constructor (1) using the applicative instance, i.e. <$> and <*>; and then stick it in a label with Label:

(,) <$> fst `for` label1 <*> snd `for` label2

Amazingly, you won’t be able to mix up which argument an accessor (2) should be placed in; the result won’t typecheck! (See the Postscript for a more detailed argument.)

More fun with lenses. A function implies directionality: a to b. But light can filter through a lens either way, and thus a lens represents a bidirectional function. We can filter a label f :-> a through a lens a :<->: b to get a new label f :-> b (remember that composition with a regular function is insufficient, since we need to put values in as well as take values out). One has to be careful about which direction your lens is pointed. If label :: r :-> a, in :: b -> a and out :: a -> b, then:

(out <-> in) `iso` label :: r :-> b
(in <-> out) `osi` label :: r :-> b

The other directions won’t typecheck if a != b.

You can lift a lens into a functor using lmap (it simply runs fmap on both directions).

Further reading. The Hackage documentation has a ton of excellent examples.

Postscript. With our original example in mind:

label1 :: r :-> a
label2 :: r :-> b
(,) <$> fst `for` label1 <*> snd `for` label2 :: r :-> (a, b)

We consider the types of the points we’ve constructed, before combining them with the applicative instance:

fst `for` label1 :: Point r (a, b) a
snd `for` label2 :: Point r (a, b) b

We have a shared applicative functor Point r (a, b), and if we treat that as f, clearly:

(,) :: a -> b -> (a, b)
fst `for` label1 :: f a
snd `for` label2 :: f b
(,) <$> fst `for` label1 <*> snd `for` label2 :: f (a, b)

which is equivalent to Point r (a, b) (a, b), which is a valid Label.

But what is for doing? The source code documentation says:

Combine a partial destructor with a label into something easily used in the applicative instance for the hidden Point datatype. Internally uses the covariant in getter, contravariant in setter bi-functioral-map function. (Please refer to the example because this function is just not explainable on its own.)

Well, I’m going to ignore this advice, since you’ve seen the example already. Let’s parse this. for is covariant in getter r -> a and contravariant in setter a -> f -> f. These terms are from category theory describing functors. A covariant functor is a “normal” functor, whereas a contravariant functor is one with composition flipped around. So while normally fmap f g == f . g, in the contravariant world fmap f g == g . f:

for :: (i -> o) -> (f :-> o) -> Point f i o
for a b = dimap id a (unLabel b)

Well, we’re not doing much interesting to the getter, but we’re mapping a :: (a, b) -> a (in our example) onto the setter a -> f -> f. Luckily (for the befuddled), the covariant map doesn’t typecheck ((a, b) != (f -> f)), but the contravariant map does: (a, b) -> f -> f, which is a new setter that takes (a, b), precisely what we expected from the type signature.

So, for sets up our setters and partially our getter, and the applicative instance finishes setting up our getter.

The Problem with xUnit

Tagline: Assertions considered not ideal.

I think automated tests are great. I used two particular flavors of test, the unit test and the integration test, extensively in HTML Purifier, and they’re the only reason why I feel comfortable making changes to code that I first wrote in high school. The automated tests let me hack and then figure out whether I broke anything with the single stroke of a button, rather than manually shoving a few inputs in and seeing if they “look alright.” They’re also an informal specification of “what I wanted the code to do” when I originally wrote it, in the fine tradition of the example.

Both unit tests and integration tests were built on top of the SimpleTest “unit testing” library. I place “unit testing” in quotes because, while SimpleTest is great for unit testing (the testing of individual components), it can also be used for integration testing (the testing of multiple components together) and system testing (testing the entire system; for web applications this commonly involves writing scripts that navigate the website); in fact, it has facilities in place to make the latter two easier to do!

Perhaps a more accurate description of SimpleTest as a whole is that it is a descendant of the xUnit testing framework. You know, the “make a test function that sets some stuff up, runs some code, and makes some asserts” style of testing. The idea of an assertion is essential; sans exception handling, that’s your single portal into whether or not the test code failed or succeeded.

I was writing some tests in JUnit the other day, and it reminded me a little bit why, even though automated tests are great, I’m somewhat reluctant to roll them out in the first place. They’re so verbose! In every test method I have to instantiate whatever class I want, do whatever initialization I need, create my input data (if I’m directly building it with new, this can easily take several lines), run the function, and then test whether the output data is what I expected (either by laboriously poking at its various fields and methods or, if I had the foresight to implement equality, constructing the expected output and comparing them). “But wait,” you say, “that’s precisely what setUp and tearDown are for!” and then you move chunks of this code into those methods, but the substantial bits of boilerplate for creating inputs and verifying results remain, and you are terrified of abstracting over them, because adding more code means there’s more chance for your test to be wrong!

But there’s not a good way out of this mess, because the list of function calls to the unit under test is truly the “input” to your test suite, and the list of expressions passed into the assertions is truly the “output” of your test suite. The particular assertion you choose to use is the “expected value” of your test suite. So why does it feel like boilerplate?

Maybe because the model of setUp and tearDown methods, test methods and assertions is the wrong one for many types of code: the correct model is the input value, output value and expected value model! Pure code has a more refined notion of its input and its output than “a code listing” and “the global state of the application after you ran the code listing”; maybe it truly is just “two integers” and “an integer.” And then the test code you write should actually reflect that!

So how do we make this happen? You want a DSL. Some languages are strong enough that you can get away with an embedded DSL of sorts. But many languages make this too cumbersome, so they invent their own test format and write the necessary boilerplate code to parse it and marshal it around. Obviously there need to be enough tests of this form to make writing all of this infrastructure worthwhile, and so when that’s not true people fall back to the quick and dirty xUnit style of testing. But by doing this, you’ve obscured the shape of your test, and since “quick and dirty” never means “ephemeral”, your test suite grows and grows and you never end up cutting over to the right way. Ever.

At this point, it’s about time for a little Haskell advocacy. How can you make your tests less cumbersome from the get-go? Use a language that encourages the construction of mini-DSLs: Haskell has flexible syntax and type facilities that make this doable, check. Use a language that encourages you to think carefully about functions, which have clear inputs and outputs, not classes and methods and mutable state: Haskell is a functional programming language, check. Use a language in which abstraction is cheap and boilerplate is killed with fire: Haskell, check. Use a language that, once you’ve gotten tired of writing input and output values over and over again (and not the boilerplate of an entire xUnit test case), gives you the rope to automate that process too: QuickCheck and Haskell, check.
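To make the last point concrete, a minimal QuickCheck sketch (quickCheck is the library’s standard entry point; the property itself is just illustrative):

import Test.QuickCheck

-- Input generation is automated; the test states only the
-- relationship between input and expected output.
prop_reverseTwice :: [Int] -> Bool
prop_reverseTwice xs = reverse (reverse xs) == xs

main :: IO ()
main = quickCheck prop_reverseTwice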

It’s also time for a little call to action: don’t conflate the unit/acceptance/system testing hierarchy with the xUnit framework and its boilerplate. There’s xUnit testing and then there’s fully randomized input generation à la QuickCheck, but there’s still room in between these two distinct points in abstraction for people and tests to live. And of course, the xUnit-style test can be useful when a code listing truly is the right paradigm for the input representation.

Creative catamorphisms

The bag of programming tricks that has served us so well for the last 50 years is the wrong way to think going forward and must be thrown out.

Last week, Guy Steele came in and gave a guest lecture, “The Future is Parallel: What’s a Programmer to Do?”, for my advanced symbolic programming class (6.945). It’s a really excellent talk; so excellent, in fact, that I had already seen the slides before the lecture. However, hearing Guy Steele present it in person really helped set things in context for me.

One of the central points of the talk is the call for more creative catamorphisms. Well, what is a creative catamorphism? To answer this question, we first have to understand what a catamorphism is. The functional programming crowd is well familiar with a few relatively banal examples of catamorphisms, namely the left fold and the right fold. One way to think about a fold is as a “level of abstraction” above the loop one might write in an imperative language. Another way is as replacing the type constructors of the list (the cons or : operation, and nil) with other functions and values, as seen in Cale Gibbard’s excellent diagrams:

[Diagrams: Cale Gibbard’s illustrations of foldr and foldl as constructor replacement]
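In code, the constructor-replacement reading of a fold looks like this:

-- foldr f z replaces every (:) with f and the final [] with z:
--   foldr f z (a : (b : (c : [])))  ==  a `f` (b `f` (c `f` z))
sum' :: [Int] -> Int
sum' = foldr (+) 0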

The point of the catamorphism is that this doesn’t apply just to lists; in fact, we can run a catamorphism on any recursive data structure! Just supply a function for each constructor in the type, with the appropriate arity (so a ternary tree would require functions that take three arguments, and so forth), and let her rip. This is vitally important because the old left and right folds are the “wrong way to think”; by the very nature of their structure they force sequential evaluation. But set things up in a binary tree, and you can evaluate all the subtrees first, in parallel, before combining them at the end.
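A sketch of such a catamorphism for a binary tree (names hypothetical):

data Tree a = Leaf a | Branch (Tree a) (Tree a)

-- One function per constructor; the two recursive calls are
-- independent, so they can be evaluated in parallel.
foldTree :: (a -> r) -> (r -> r -> r) -> Tree a -> r
foldTree leaf _      (Leaf x)     = leaf x
foldTree leaf branch (Branch l r) =
    branch (foldTree leaf branch l) (foldTree leaf branch r)

sumTree :: Tree Int -> Int
sumTree = foldTree id (+)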

So what is a creative catamorphism? It’s when the original recursive data structure doesn’t map cleanly onto the atoms that your computation wants to deal with. The example Guy Steele discusses in his talk is the time-honored task of breaking a string into its words. A string is merely a list of characters, which only lets us handle it character by character (traditionally sequential), or a naive transformation into a binary tree, which only gives us efficient bisection (parallelizable). The trouble with naive bisection is that it might split in the middle of a word, so our combining function has to account for this case. How to deal with this is left as an exercise for the reader (or you can go read the slides).

In fact, this was the critical moment when I understood the global reasoning behind what Edward Kmett was talking about when he gave his (in my opinion pretty crazy) talk on “A Parallel Parsing Trifecta: Iteratees, Parsec, and Monoids”. The goal of this code is to massively parallelize parsing by splitting up the input document into chunks and then recombining them with the parsing function. He has to deal with the same problems that showed up in the toy example in Steele’s talk, and he pulls out all sorts of tricks to get things pumping.

I will admit, the work is complicated, and at times it feels like overkill. But it’s a brave new parallel world, and it’s time we fully explored its designs and implications. With any luck, we will be able to write parallel programs as naturally as we write sequential programs, but it’s a long way getting there.


Update (2013-05-21). Oleg writes in to tell me that there is actually a name for these types of tricks: an almost homomorphism. It is not surprising to see that the people behind the Skepara project have collaborated with Guy Steele and the Fortress project; their work is well worth checking out for a calculational approach to deriving these catamorphisms.

Association maps in mit-scheme

I recently did some benchmarking of persistent data structures in mit-scheme for my UROP. There were a few questions we were interested in:

  1. For what association sizes does a fancier data structure beat out your plain old association list?
  2. What is the price of persistence? That is, how many times slower are persistent data structures as compared to your plain old hash table?
  3. What is the best persistent data structure?

These are by no means authoritative results; I still need to carefully comb through the harness and code for correctness. But they already have some interesting implications, so I thought I’d share. The implementations tested are association lists (assoc), hash tables, weight-balanced trees, and hash array mapped tries (hamt).

All implementations use eq? for key comparison.

[Plot: insert/update times by association size]

Unsurprisingly, assoc beats out everyone else, since all it has to do is a simple cons. However, there are some strange spikes at regular intervals whose origin I’m not sure of; it might be the garbage collector kicking in.

[Plot: lookup times by association size]

Of course, you pay back the cheap updates in assoc with a linear lookup time; the story also holds true for weight-balanced trees, which have fast inserts but the slowest lookups.

[Plot: lookup times for keys that are not present]

The hamt really flies when the key isn’t present, even beating out hash-tables until 15 elements or so.

Source code for running the benchmarks, our home-grown implementations, and graphing can be found at the scheme-hamt repository.