ezyang's blog

the arc of software bends towards understanding

2011

From data type definitions to code

What do these problems have in common: recursive equality/ordering checks, printing string representations, serializing/unserializing binary protocols, hashing, generating getters/setters? They are repetitive boilerplate pieces of code that have a strong dependence on the structure of the data they operate over. Since programmers love automating things away, various schools of thought have emerged on how to do this:

  1. Let your IDE generate this boilerplate for you. You right-click on the context menu, click “Generate hashCode()”, and your IDE does the necessary program analysis for you;
  2. Create a custom metadata format (usually XML), which you then run another program on which converts this description into code;
  3. Add sufficiently strong macro/higher-order capabilities to your language, so you can write programs which generate implementations in-program;
  4. Add sufficiently strong reflective capabilities to your language, so you can write a fully generic dynamic implementation for this functionality;
  5. Be a compiler and do static analysis over abstract syntax trees in order to figure out how to implement the relevant operations.

It hadn’t struck me how prevalent the fifth option was until I ran into wide scale use of one particular facet of the camlp4 system. While it describes itself as a “macro system,” in sexplib and bin-prot, the macros being used are not those of the C tradition (which would be good for implementing 3), rather, they are in the Lisp tradition, including access to the full syntax tree of OCaml and the ability to modify OCaml’s grammar. Unlike most Lisps, however, camlp4 has access to the abstract syntax tree of data type definitions (in untyped languages these are usually implicit), which it can use to transform into code.

Read more...

Variant types and GADTs

OCaml supports anonymous variant types, of the form type a = [`Foo of int | `Bar of bool], with the appropriate subtyping relations. Subtyping is, in general, kind of tricky, so I have been using these variant types fairly conservatively. (Even if a feature gives you too much rope, it can be manageable and useful if you use discipline.) Indeed, they are remarkably handy for one particular use-case for which I would have normally deployed GADTs. This is the “Combining multiple sum types into a single sum type” use-case.

Read more...

In-program GC stats for GHC

I’ll be at this year’s Hac Phi (coming up in a week and a half), and I am planning on working on in-program garbage collector statistics for GHC. There is nothing really technically difficult about this task (we just need to expose some functions in the RTS), but it’s not been done yet and I know quite a few performance-minded and long-running-server-minded folks who would love to see this functionality.

My question for you is this: what would you like such an API to look like? What things should it offer, how would you like to interact with it?

Read more...

Synthetic Git merges

In theory, Git supports custom, low-level merge drivers with the merge configuration properties. In practice, no one actually wants to write their own merge driver from scratch. Well, for many cases where a custom merge driver would come in handy, you don’t have to write your merge driver from scratch! Consider these cases:

  • You want to merge files which have differing newline styles,
  • You want to merge files where one had lots of trailing whitespace removed,
  • You want to merge files one branch has replaced certain strings with custom strings (for example, a configuration file which instantiated PASSWORD, or a file that needs to be anonymized if there is a merge conflict),
  • You want to merge a binary file that has a stable textual format, or
  • You want to merge with knowledge about specific types of conflicts and how to resolve them (a super-smart rerere).

For all of these cases, you can instead perform a synthetic Git merge by modifying the input files (constructing synthetic merge inputs), calling Git’s git merge-file to do the actual merge, and then possibly editing the result, before handing it back off to the original invoker of your merge driver. It’s really simple. Here’s an example driver that handles files with differing newline styles by canonicalizing them to UNIX:

Read more...

Parallelism to plug space leaks

It is not too difficult (scroll to “Non sequitur”) to create a combinator which combines two folds into a single fold that operates on a single input list in one pass. This is pretty important if your input list is pretty big, since doing the folds separately could result in a space leak, as might be seen in the famous “average” space leak:

import Data.List
big = [1..10000000]
sum' = foldl' (+) 0
average xs = fromIntegral (sum' xs) / fromIntegral (length xs)
main = print (average big)

(I’ve redefined sum so we don’t stack overflow.) I used to think combining functions for folds were pretty modular, since they had a fairly regular interface, could be combined together, and really represented the core notion of when it was possible to eliminate such a space leak: obviously, if you have two functions that require random access to elements of the list, they’ll retain the entirety of it all the way through.

Read more...

Facebook support for BarnOwl

This one’s for the MIT crowd. This morning, I finished my Facebook module for BarnOwl to my satisfaction (my satisfaction being asynchronous support for Facebook API calls, i.e. no more random freezing!) Getting it to run on Linerva was a bit involved, however, so here is the recipe.

  1. Setup a local CPAN installation using the instructions at sipb.mit.edu, using local::lib. Don’t forget to add the setup code to .bashrc.mine, not .bashrc, and then source them. Don’t forget to follow prerequisites: otherwise, CPAN will give a lot of prompts.
  2. Install all of the CPAN dependencies you need. For the Facebook module, this means Facebook::Graph and AnyEvent::HTTP. I suggest using notest, since Any::Moose seems to fail a harmless test on Linerva. Facebook::Graph fails several tests, but don’t worry about it since we’ll be using a pre-packaged version. If you want to use other modules, you will need to install them in CPAN as well.
  3. Clone BarnOwl to a local directory (git clone git://github.com/ezyang/barnowl.git barnowl), ./autogen.sh, configure and make.
  4. Run using ./barnowl, and then type the command :facebook-auth and follow the instructions!

Happy Facebooking!

Read more...

Grad School, Oh My

It still feels a little strange how this happened. Not a year ago, I was pretty sure I was going to do the Masters of Engineering program at MIT, to make up for a “missed year” that was spent abroad (four years at MIT plus one at Cambridge, not a bad deal.)

But a combination of conversations with various researchers whom I greatly respect, nagging from my Dad, and an inability to really figure out who I would actually do my Master’s thesis under while at MIT meant that at some point about a month and a half ago, I decided to go for the graduate school admissions cycle this fall. Oh my. It feels like undergraduate admissions all over again (which was not really a pleasant experience), though this time around, what I need to write is the “Research Statement.”

Read more...

Cambridge retrospective: History and Philosophy of Science

I recently concluded a year long study-abroad program at the University of Cambridge. You can read my original reasons and first impressions here.

image

It is the Sunday before the beginning of exams, and the weather is spectacular. Most students (except perhaps the pure historians) are dourly inside, too busy revising to take advantage of it. But I am thrust out of my den, constructed of piles of notes and past exam questions, in order to go to one final revision with our history supervisor, Mirjam. I cycle down Grange road, park my bicycle outside Ridley Hall, and am greeted to a pleasant surprise: Mirjam has elected to hold the revision outside on a cluster of park benches flanked everywhere by grass and trees, and has brought a wickerbasket containing fresh fruit, small cupcakes and other treats, and also sparkling wine and beer (perhaps not the best drink for what is ostensibly a revision, but we avail ourselves of it anyway.)

Read more...

A new vision for Mendeley

I use Mendeley because it lets me easily search for papers I care about. Unfortunately, that seems to be all Mendeley has been doing for me… and that’s a damn shame. Maybe it’s because I’m an undergraduate, still dipping my toe into an ocean of academic research. Mendeley was aimed at practicing researchers, but not people like me, who are stilll aiming for breadth not depth. I can count on two hands the number of technical papers I’ve really dug into—I’m trying to figure out what it is exactly that I want to specialize in.

Read more...

IVar leaks

image

The first thing to convince yourself of is that there actually is a problem with the code I posted last week. Since this is a memory leak, we need to keep track of creations and accesses of IVars. An IVar allocation occurs in the following cases for our example:

  1. Invocation of return, which returns a full IVar,
  2. Invocation of tick, which returns an empty IVar and schedules a thread to fill this IVar, and
  3. Invocation of >>=, which returns an empty IVar and a reference to this IVar in the callback attached to the left-hand IVar.

An IVar access occurs when we dereference an IORef, add a callback or fill the IVar. This occurs in these cases:

Read more...