Inside 206-105

Existential Pontification and Generalized Abstract Digressions

A Zerocoin puzzle

I very rarely post linkspam, but given that I’ve written on the subject of anonymizing Bitcoins in the past, this link seems relevant: Zerocoin: making Bitcoin anonymous. Their essential innovation is to have a continuously operating mixing pool built into the block chain itself; they pull this off using zero-knowledge proofs. Nifty!

Here is a puzzle for the readers of this blog. Suppose that I am a user who wants to anonymize some Bitcoins, and I am willing to wait expected time N before redeeming my Zerocoins. What is the correct probability distribution for me to pick my wait time from? Furthermore, suppose a population of Zerocoin participants, all of which are using this probability distribution. Furthermore, suppose that each participant has some utility function trading off anonymity and expected wait time (feel free to make assumptions that make the analysis easy). Is this population in Nash equilibrium?

  • April 11, 2013

A classical logic fairy tale

(Selinger) Here is a fairy tale: The evil king calls the poor shepherd and gives him these orders. “You must bring me the philosophers stone, or you have to find a way to turn the philosopher’s stone to gold. If you don’t, your head will be taken off tomorrow!” What can the poor shepherd do to save his life?

Hat tip to Chris for originally telling me a different variant of this story. Unfortunately, this quote from Lectures on the Curry-Howard Isomorphism was the only reference I could find. What should the shepherd do? Is there something a little odd about this story?

  • April 7, 2013

Resource limits for Haskell

Last week, I made my very first submission to ICFP! The topic? An old flame of mine: how to bound space usage of Haskell programs.

We describe the first iteration of a resource limits system for Haskell, taking advantage of the key observation that resource limits share semantics and implementation strategy with profiling. We pay special attention to the problem of limiting resident memory usage: we describe a simple implementation technique for carrying out incremental heap censuses and describe a novel information-flow control solution for handling forcible resource reclamation. This system is implemented as a set of patches to GHC.

You can get a copy of the submission here. I've reproduced below the background section on how profiling Haskell works; if this tickles your fancy, check out the rest of the paper!

Profiling in Haskell is performed by charging the costs of computation to the “current cost center.” A cost center is an abstract, programmer-specified entity to which costs can be charged; only one is active per thread at any given time, and the cost semantics determines how the current cost center changes as the program executes. For example, the scc cc e expression (set-cost-center) modifies the current cost center during evaluation of e to be cc. Cost centers are defined statically at compile time.

A cost semantics for Haskell was defined by Sansom et al. (1995) Previously, there had not been a formal account for how to attribute costs in the presence of lazy evaluation and higher-order functions; this paper resolved these questions. The two insights of their paper were the following: first, they articulated that cost attribution should be independent of evaluation order. For the sake of understandability, whether a thunk is evaluated immediately or later should not affect who is charged for it. Secondly, they observed that there are two ways of attributing costs for functions, in direct parallel to the difference between lexical scoping and dynamic scoping.

The principle of order-independent cost-attribution can be seen by this program:

f x = scc "f" (Just (x * x))
g x = let Just y = f x in scc "g" y

When g 4 is invoked, who is charged the cost of evaluating x * x? With strict evaluation, it is easy to see that f should be charged, since x * x is evaluated immediately inside the scc expression. Order-independence dictates, then, that even if the execution of x * x is deferred to the inside of scc "g" y, the cost should still be attributed to f. In general, scc "f" x on a variable x is a no-op. In order to implement such a scheme, the current cost-center at the time of the allocation of the thunk must be recorded with the thunk and restored when the thunk is forced.

The difference between lexical scoping and dynamic scoping for function cost attribution can be seen in this example:

f = scc "f" (\x -> x * x)
g = \x -> scc "g" (x * x)

What is the difference between these two functions? We are in a situation analogous to the choice for thunks: should the current cost-center be saved along with the closure, and restored upon invocation of the function? If the answer is yes, we are using lexical scoping and the functions are equivalent; if the answer is no, we are using dynamic scoping and the scc in f is a no-op. The choice GHC has currently adopted for scc is dynamic scoping.

  • April 2, 2013

The single export pattern

From the files of the ECMAScript TC39 proceedings

Single export refers to a design pattern where a module identifier is overloaded to also represent a function or type inside the module. As far as I can tell, the term “single export” is not particularly widely used outside the ECMAScript TC39 committee; however, the idea shows up in other contexts, so I’m hoping to popularize this particular name (since names are powerful).

The basic idea is very simple. In JavaScript, a module is frequently represented as an object:

var sayHello = require('./sayhello.js');;

The methods of sayHello are the functions exported by the module. But what about sayHello itself? Because functions are objects too, we could imagine that sayHello was a function as well, and thus:


would be a valid fragment of code, perhaps equivalent to Only one symbol can be exported this way, but in many modules, there is an obvious choice (think of jQuery’s $ object, etc).

This pattern is also commonly employed in Haskell, by taking advantage of the fact that types and modules live in different namespaces:

import qualified Data.Map as Map
import Data.Map (Map)

Map is now overloaded to be both a type and a module.

  • March 31, 2013

The duality of weak maps and private symbols

From the files of the ECMAScript TC39 proceedings

I want to talk about an interesting duality pointed out by Mark Miller between two otherwise different language features: weak maps and private symbols. Modulo implementation differences, they are the same thing!

A weak map is an ordinary associative map, with the twist that if the key for any entry becomes unreachable, then the value becomes unreachable too (though you must remember to ignore references to the key from the value itself!) Weak maps have a variety of use-cases, including memoization, where we’d like to remember results of a computation, but only if it will ever get asked for again! A weak map supports get(key) and set(key, value) operations.

A private symbol is an unforgeable identifier of a field on an object. Symbols are useful because they can be generated “fresh”; that is, they are guaranteed not to conflict with existing fields that live on an object (whereas one might get unlucky with _private_identifier_no_really); a private symbol has the extra stipulation that one cannot discover that it exists on an object without actually having the symbol—e.g. an object will refuse to divulge the existence of a private symbol while enumerating the properties of an object. A private symbol psym might be created, and then used just like an ordinary property name to get (obj[psym]) and set (obj[psym] = value) values.

To see why these are the same, lets implement weak maps in terms of private symbols, and vice versa (warning, pseudocode ahead):

function WeakMap() {
  var psym = PrivateSymbol();
  return {
    get: function(key) { return key[psym]; },
    set: function(key, value) { key[psym] = value; }

function PrivateSymbol() {
  return WeakMap();
// pretend that get/set are magical catch-all getters and setters
Object.prototype.get = function(key) {
  if (key instanceof PrivateSymbol) { return key.get(this); }
  else { return this.old_get(key); }
Object.prototype.set = function(key, value) {
  if (key instanceof PrivateSymbol) { return key.get(this, value); }
  else { return this.old_set(key, value); }

Notice, in particular, that it wouldn’t make sense to enumerate all of the entries of a weak map; such an enumeration would change arbitrarily depending on whether or not a GC had occurred.

If you look at this more closely, there is something rather interesting going on: the implementation strategies of weak maps and private symbols are the opposite of each other. With a weak map, you might imagine a an actual map-like data structure collecting the mappings from keys to values (plus some GC trickery); with a private symbol, you would expect the values to be stored on the object itself. That is to say, if we say “WeakMap = PrivateSymbol” and “key = this”, then the primary difference is whether or not the relationships are stored on the WeakMap/PrivateSymbol, or on the key/this. WeakMap suggests the former; PrivateSymbol suggests the latter.

Is one implementation or the other better? If objects in your system are immutable or not arbitrarily extensible, then the private symbol implementation may be impossible to carry out. But if both implementations are possible, then which one is better depends on the lifetimes of the objects in question. Garbage collecting weak references is expensive business (it is considerably less efficient than ordinary garbage collection), and so if you can arrange for your weak mappings to die through normal garbage collection, that is a win. Therefore, it’s better to store the mapping on the shorter-lived object. In the case of a memotable, the key is going to be a bit more ephemeral than the map, which results in a very strange consequence: the best implementation strategy for a weak map doesn’t involve creating a map at all!

Alas, as with many elegant results, there are some difficulties stemming from complexities in other parts of the ECMAScript specification. In particular, it is not at all clear what it means to have an “read-only weak map”, whereas an read-only private symbol has an obvious meaning. Furthermore, collapsing these two rather distinct concepts into a single language may only serve to confuse web developers; a case of a proposal being too clever for its own good. And finally, there is an ongoing debate about how to reconcile private state with proxies. This proposal was introduced to solve one particular aspect of this problem, but to our understanding, it only addresses a specific sub-problem, and only works when the proxy in question is a membrane.

  • March 19, 2013

What is a membrane?

If you hang out long enough with a certain crowd (in my case, it was the ECMAScript TC39 committee), you will probably hear the term membrane tossed around. And eventually, you will start to wonder, “Well, what is a membrane, anyway?”

As is the case with many clever but simple ideas, membranes were first introduced as a footnote [1] in a PhD thesis. Suppose that you are building distributed system, in which you pass references to objects between two separate nodes. If I want to pass a reference to foo in process A to process B, I can hardly just hand over an address—the memory spaces are not the same! So instead, I need to create a wrapper object wrappedFoo representing foo in B, which knows how to access the original object in A. So far so good.

Now here’s the catch: what if I pass a reference to wrappedFoo back to process A? If I were not very clever, I’d do the same thing as I did originally: create a new wrapper object wrappedWrappedFoo in A which knows how to access wrappedFoo in B. But this is silly; really, when I cross back over to A, I want to get back the original foo object.

This wrap-unwrap behavior is precisely what a membrane is. We consider the original object foo to be “inside” the membrane (a so-called wet object), and as it exits the membrane, it is wrapped with its own little membrane. However, when the object returns to its original membrane, the wrapper goes away. Just like in biology!

There is one last operation, called a “gate”: this occurs when you invoke a method on a wrapped object. Since the wrapper cannot actually perform the method, it has to forward the request to the original object. However, the arguments of the method need to get wrapped (or unwrapped) as they get forwarded; as you might expect.

While I used an RPC-like system to demonstrate the basic principle of membranes, a more conventional use is to enforce access control. Membranes are quite important; Mozilla relies on them extensively in order to enforce access restriction between objects from different websites which may interact with each other, but need security checks. (Actually, did you know that Mozilla is using a capability-based system for their security? Kind of neat!) It’s important to notice that when we unwrap, we are skipping security checks—the only reason this is acceptable is because the only objects that will be able to access the unwrapped object are precisely those objects in the same domain. For a more modern treatment of the subject, check out a more recent paper, Trustworthy Proxies: Virtualizing Objects with Invariants, which includes a lucid explanation of membranes.

[1] Well, actually it was a figure; figure 9.3 on page 71, to be precise!

  • March 15, 2013

Kindle Paperwhite notes

Along with a Nexus 7, I also acquired a Kindle Paperwhite over winter break. (Wi-Fi only) I have been quite pleased by this purchase, though in an unexpected way: while I have not increased the number of books I read, the Kindle has materially changed how I read articles on the Internet. Not via their web browser, which is essentially unusable except for the simplest tasks, but via tools which take articles on the Internet and convert them into ebook form.

For blog posts I use Calibre with the Google Reader source. This has been revolutionary: I no longer read blog posts in my browser; instead, I bundle them up on my Kindle and read them at my leisure. This change has also meant that I read blog posts much more carefully (interfluidity used to only get a skim; now I can actually make it through the posts). This setup is a little nontrivial so I’ll probably describe it in a later blog post. (I used to use, which had easy setup, but (1) it could not handle images (so equations and graphs are right out), and (2) it botched formatting of pre formatted text.)

For longform articles I use Longform, which serves up a interesting mix of in-depth reporting and nonfiction articles. They have a very convenient “Send to Kindle” button which I’m told is served by Readability; I wonder if I should add a button like that to my blog. I am also using Amazon’s Send to Kindle Firefox extension for my paywalled articles (primarily, although its results seem a bit spottier than Readability.

For academic papers, the going has been a bit rough, but I have gotten decent results on two-column papers with cut2col, which manages to make text large enough to be readable in Kindle’s PDF reader. Landscape mode also helps a bit with reading PDFs. Generally, however, managing which PDFs are on my device is a bit of an open problem. I haven’t started using Calibre yet for this purpose, although I have used it to perform some conversions between ebook formats. These conversions don’t work particularly well, although they are usable. It’s well worth getting the latest version of Calibre, since there are some bugs in the current Quantal version; I use ppa:n-muench/calibre.

For textbooks, ebook editions are still extremely expensive. I would love to carry around all of the famous computer science textbooks in my pocket, but to date I don’t have any good way of doing so without breaking the bank. Books suffer similarly: it turns out I did most of my pre-Kindle reading borrowing books from libraries; however, I hear the Palo Alto public library does Kindle lending, and I intend to check them out at some point in time. The typesetting for public domain books put out by places like Project Gutenberg are notoriously spotty; Readability and similar services seem to do a much better job! Alas, the Stanford library system does not appear to carry ebooks. The Amazon free samples of books have also been good fun, and I’ve collected quite a few of them.

Some annoyances with the Kindle itself:

  • Amazon Special Offers, which I cannot seem to get rid of (apparently you can only pay the $20 to remove them if the Kindle is associated with the original Amazon account that purchased the device; which is not me)
  • Re-sorting behavior of the Recent view: if you have gone home after reading a book, the Kindle will briefly flash the old order of items before it reshuffles and moves the item you just finished reading to the front. If you attempt to click on a different item during this period, the click will not register until the new reshuffle has happened; this has frustratingly caused me to accidently click on the wrong item multiple times!
  • With hypertext articles, they tend to contain links, which means that if you are using a "tap" to advance to the next page, and a link appears under your finger and you will accidentally bring up your browser. This is quite annoying and I would just like to turn off links entirely. (Yes, I know you can just swipe, but that is annoying and I do not want to bother retraining myself.)
  • For certain formats, especially PDFs, page refresh speed is rather slow; this makes it difficult to rapidly flip through pages like you might in a real book. This is probably the primary downside of a Kindle as opposed to a traditional book; the other being the inability to see and rapidly flip to bookmarks (it takes more than one tap to move to a bookmark on a Kindle).
  • I have also jailbroken my Kindle, but there does not seem to be any interesting software to run with it.

All-in-all, I am quite pleased with this device, and I would recommend it to anyone interested in reducing the amount of time they spend staring at a computer monitor.

  • January 30, 2013

The GHC scheduler

I’d like to talk about some nitty-gritty details of GHC’s thread scheduling, discovered over the course of working on stride scheduling for GHC. Most of these choices are merely implementation details and are not part of any specification. While these choices shouldn’t be relied upon, they are worth knowing, since many of these details were accreted over the course of many performance bugs, benchmark tests and other battles. In this post, I’ll attempt to give some historical insight into why many choices were made. These insights should generalize to any system that would like to implement green threads, lightweight threads that use less memory than traditional operating system threads. For space reasons, I’m not going to talk about STM or sparks (though they are also quite interesting).

Update: A large portion of this material has been incorporated into the scheduler page in the GHC commentary

Anatomy of a thread

I’d first like to discuss some brief background about the runtime system first and point out some perhaps nonintuitive design choices. A thread is represented by a TSO (thread-state object) by GHC, i.e. the StgTSO struct in includes/rts/storage/TSO.h. [1] In Haskell, TSOs can be passed around as ThreadId objects. The Stg in front of the struct name indicates that TSOs are garbage collected, like other closures in Haskell. The TSO, along with the stack allocated with it (STACK), constitute the primary memory overhead of a thread. Default stack size, in particular, is controlled by the GC flag -ki, and is 1k by default. [2] Threads are run by Capabilities, which can be thought of virtual cores managed by GHC. Capabilities are, in turn, mapped to true operating system threads, or Tasks, though we won’t talk about them much.

Being garbage collected has two major implications for TSOs. First, TSOs are not GC roots, so they will get GC'd if there is nothing holding on to them (e.g. in the case of deadlock), and their space is not automatically reclaimed when they finish executing [3]. Usually, a TSO will be retained by a Capability’s run queue (a GC root), or in the list of waiting threads of some concurrency variable, e.g. an MVar. Second, a TSO must be considered a mutable object, and is thus subject to the conventional GC write barriers necessary for any mutable object in a generational garbage collector. [4] The dirty bit tracks whether or not a TSO has been modified; it is always set when a thread is run and also when any of the pointer fields on a TSO are modified. Two fields, set by setTSOLink and setTSOPrev, are of particular interest to the scheduler.

Run queue

The run queue is at the heart of the scheduler, as any runnable thread will hit the run queue before the scheduler actually pops it off the queue and runs it. There’s one per capability rts/Capability.h (in the bad old days, there was a global run queue, but this performed badly for multithreaded processes), and it is implemented as a doubly-linked list run_queue_hd and run_queue_tl. [6] The head and tail pointers mean that the queue is actually a deque: this is important because the scheduler will often have to handle threads that were interrupted in some way, and should let the threads get back on. The links themselves are on the TSOs and modified with setTSOLink and setTSOPrev, so modifying the queue dirties the TSOs involved. [7] Otherwise, the run queue is exclusively owned by the scheduler. If there are idle capabilities and if we have more than one thread left in our run queue, threads will be pushed to other queues with schedulePushWork.

Threads are put in front (pushOnRunQueue) if:

  • A stack overflow occurs;
  • A heap overflow occurs; [8]
  • A task attempts to run a thread, but it is bound and the current task is the wrong one;
  • A thread is associated with a black hole (a thunk that is being evaluated), and another thread, possibly on another capability, has blocked on its evaluation (see ticket #3838);
  • In the threaded runtime, if a thread was interrupted because another Capability needed to do a stop-the-world GC (see commit 6d18141d8);
  • In the non-threaded runtime, when a thread waiting on IO unblocks.

Threads are put in back (appendToRunQueue) in the case of pre-emption, or if it’s new; particularly, if

  • A thread was pre-empted via the context switch flag (e.g. incoming message from another thread, the timer fired, the thread cooperatively yielded, etc; see also [8] on how this interacts with heap overflows);
  • It is a new thread (so large amounts of thread creation do not starve old threads, see conc004 and commit 05881ecab);
  • A thread becomes unblocked;
  • A thread is migrated to another capability (though, in this case, the queue was empty anyway);
  • A thread finishes, but for some reason we need to keep it around (this is related to in-calls, though I’m not a 100% sure what is going on here; if you know, please tell me!)


Benchmarks like nofib are very important, even if they are synthetic, as they will often be construed as primary evidence whether or not a change to the scheduler speeds or slows things down. One reason is that it is much easier to tell why a short program that torture tests threads has slowed down than it is to tell why a large, complicated multithreaded program no longer seems very snappy. But really, the main motivation is convenience: nofib programs are easy to measure and easy to compare. Fortunately, the tests often measure something quite specific, so I’d like to describe the tests that compose the smp nofib suite here:

  • callback001 (also known as ffi014) performs a large number of incalls to Haskell from C from a large number of threads. This is a rather specific test related to how we place threads in the run queue even if they’ve finished, if they finished in an in-call.
  • callback002 measures how quickly we can perform incalls to Haskell from C.
  • chan measures how scheduling order effects memory usage: if threads are allowed to run for a bit without getting context switched, they build up data in channels. This is related to when we reset the context switch flag (see [8]).
  • sieve implements the Sieve of Eratosthenes, spawning many threads to evaluate thunks of a lazy list in parallel. It performs a bit of allocation, and sensitive to what happens to threads after a HeapOverflow.
  • threads001 tests how quickly we can create a thread and then context switch to it.
  • threads003 tests how quickly many threads can communicate by reading and writing MVars. It is a bit sensitive to what happens to threads after they wake up from sleeping.
  • threads006 tests how quickly threads can be created and destroyed, as well as throwTo blocking performance. It is very sensitive to the number of major GCs that occur (which can be influenced if TSO size changes).
  • threads007 generates a lot of threads waiting on MVars, and then sees how shutdown behavior is affected. It was due to bad behavior in the MVar queue and fixed in f4692220c7.


The GHC scheduler is pretty complicated! Much of the current behavior was created in response to specific problems: the right choices are not obvious a priori! I hope this post will serve as a valuable reference for any future GHC hackers interested in playing around with the scheduler, as well as for anyone else who needs to implement a scheduler for their runtime system. Much of the historical data was gleaned from comments (though I found some out-of-date ones), liberal use of git blame, and cross-referencing with the bug tracker—these are all useful places to figure out, “Well, why does that code do that?” In this post, I hope I’ve answered that question, to some degree.

[1] Initialization of StgTSO is handled in createThread in rts/Threads.c; this function is in turn invoked by createGenThread, createIOThread and createStrictIOThread in rts/RtsAPI.c. These functions setup the initial stack state, which controls what the thread executes when it actually gets run. These functions are the ones invoked by the fork# and other primops (entry-points for primops are located in rts/PrimOps.cmm).

[2] Actually, your usable stack will be a little smaller than that because this size also includes the size of the StgTSO struct. (This is only really for allocating lots of threads into one block, however, as once a GC occurs the TSOs and stacks will no longer be adjacent.)

[3] Here is a sample program which demonstrates how holding onto ThreadId using stable pointers (which force the object their pointing to to never be GC'd) can leak memory:

import Control.Concurrent
import Control.Monad
import Foreign.StablePtr

n = 400000
main = do
    ms <- replicateM n (newEmptyMVar >>= \m -> (forkIO (putMVar m ()) >>= newStablePtr) >> return m)
    mapM_ takeMVar ms

The heap profile of the run shows none of the TSO/STACK objects being deallocated, even when the MVars drain out as threads finish executing.

[4] The write barrier for generational GCs refers not to memory barrier of multithreaded execution, but rather, notification for the garbage collector when a mutable reference in the old generation changes, and may now possibly point to an object in the young generation. Write barriers are necessary because the old generation will not be traversed during a minor collection, and thus if old generations may point to an object in a young generation, we may miss the fact that a young object is still alive even though it has no references from other young objects. In GHC, a write barrier is implemented by adding an object to the mutable list (mut_list) of a Capability if it is not in the youngest generation. (Some objects, like MutArr#, are permanently on the mutable list; in such a case, a write barrier may not be necessary. But see [5] for more details.) Objects will usually track their dirty status, so that they don’t add themselves to the mutable list multiple times. (Accidentally adding an object multiple times is harmless, but means the GC has to do extra work traversing the mutable list.) Additionally, if we can guarantee that the new reference does not point to the young generation (for instance, it is a static closure like END_TSO_QUEUE), then dirtying the object is not necessary. Getting this stuff right is tricky, to say the least!

[5] There is a bit of a sordid story here. Keeping an object permanently on the mutable list is done by scavenge_mutable_list in rts/sm/Scav.c, which will unconditionally re-add such an object to the mutable list if it sees it there. How does the object get on the mutable list in the first place? It’s not placed on the list upon creation; rather, upon the first minor GC on the youngest generation, the scavenging GC notices the object and places it on the mutable list by gct->failed_to_evac = rtsTrue. How do we end up freeing the object? The mutable list is considered a set of root pointers, but it is only scavenged, not evacuated. If an item on the mutable list ends up not being evacuated, it will be blown away regardless. (This does mean, however, that its elements will not be freed until the next GC.) Isn’t it really inefficient to always be scanning these arrays? Yes, and this used to be a problem (ticket #650), nowadays mitigated by card marking. The same story applied to TSOs (ticket #1589), but the fix here was to properly apply a write barrier and not keep the objects permanently on the mutable list; this improved performance quite a bit when there were a lot of threads (even if you don’t scavenge their pointers, traversing a huge mutable list is still a pain.) Creating a lot of small mutable arrays is apt to be painful.

[6] It used to be singly linked, but fixing ticket #3838 required the ability to remove TSOs from the run queue.

[7] Since these fields are always traversed by the GC, it’s important that they do not contain NULL pointers or garbage. Instead, we set them to the static closure END_TSO_QUEUE. Because this is guaranteed not to be in the young generation, this is why you do not need to dirty the TSO after setting this field.

[8] Sometimes, a heap overflow and a context switch occur simultaneously. If the thread requested a large block, we still always push it in front (because we don’t want another thread to steal our large block); however, otherwise, the context switch takes precedence and the thread is booted to the end of the queue—the context switch is checked as late as possible. (See commit 05881ecab)

  • January 28, 2013

NLP: the missing framework

So you want to make a web app. In today’s world, there is a panoply of software to assist you: you can use an all-in-one framework, or you can grab libraries to deal with the common needs of templating, database access, interactivity, etc. These libraries unify common functionality and take care of edge-cases you might otherwise not have the resources to deal with.

But there is one tool which is conspicuously absent: the natural language processing library.

“Now wait!” you may be saying, “of course there are NLP libraries, nltk and lingpipe come to mind.” Sure, but are you actually using these libraries? “Maybe not, but my application doesn’t need NLP, you see.”

The thing is, you are doing language processing in your application, even if you don’t realize it: “string concatenation” is really just a simple form of natural language generation, a subfield of NLP in its own right. [1] If you need to perform a more complicated task, such as pluralize nouns, capitalize sentences or change the grammatical form of verbs, you’ll need linguistic data. [2] This data is an essential part of many traditional NLP tasks. However, if you need to pluralize something today, you’re more likely to copy-paste a list of regexes off the Internet rather than think, “Hm, I should install an NLP library.” Part of this is because, while NLP libraries do contain this data, it is not publicized well.

It’s also worth considering if your application could benefit from any traditional NLP, including keyword generation, canonicalization (When are two things written slightly differently the same?), language identification, full-text search, autocompletion, topic detection and clustering, content summarization, parsing human-written dates and locations, etc. While it’s rare for an application to need all of these features, most would benefit from a few of them. For example, a blog application might want keyword generation to generate tags, full-text search to search posts, content summarization for non-fullpage views of posts, and date parsing for scheduling posts. These features tend to be absent, however, because they are often difficult to implement properly. Modern approaches often require models to be trained on large corpora of data—so-called data-driven models. Most of the time, this setup cost doesn’t seem worth it; if the feature is to be implemented (e.g. as an extension), a bag of heuristics is quicker.

Both of these problems hint at the trouble with current NLP frameworks: they assume that users are interested in building NLP systems, as opposed to using NLP systems. I shouldn’t need a PhD in computational linguistics to get my nouns to pluralize correctly or parse dates robustly. I shouldn’t need a PhD to get passable results on conventional, well-studied NLP applications. The default expectation should not be that users need to train a model: pre-existing models can easily be reused. Although there is an upper limit to how good an NLP algorithm can do without any tuning, the principled approach can still offer improvements over heuristics. But even more importantly, once a model is being used, developers who want to improve their results can train their own model on text from their own application, which is likely to carry domain-specific terminology and patterns. The library should be initially easy to use, and principled enough to be a gateway drug into the wonderful world of computational linguistics. Who knows what other applications could arise when developers recognize NLP as an accessible tool for their toolbox? [3]

Here is my call to arms: I want to see all of the current “baby-NLP” functionality collected together into a single place, where they get benefit from shared linguistic data and serve as easy-to-use features that initially attract developers. I would like to see more complicated but useful NLP technology become more accessible to a non-linguistic audience. And I would like all of this to be based on principled NLP foundations, so that it is possible to improve on the out-of-the-box models and algorithms. NLP practitioners are often very careful not to overstate what their systems are capable of (in contrast to the irrational exuberance of the 1980s). That’s OK: sometimes, the bar really is that low.

Thanks to Gregory Price, Eric Kow and Richard Tibbetts for helping review earlier drafts of this article.

[1] As a field, natural language generation doesn’t really consider string concatenation to be a true method; instead, it is interested in how to generate text from a functional description of intent. One neat example is referring expression generation.

[2] For example, the functionality (e.g. pluralization rules collected in the language/ folder in MediaWiki. MediaWiki is one of the most international open source projects, and I find it a fascinating source of information about linguistic oddities in foreign languages.

[3] As an example, I'd like to sketch how natural language generation can assist internationalization of applications. Suppose that you would like to let a user know that “you have three new messages.” The most obvious way to implement this would be with: printf("You have %d new message(s).", numMessages). Now, there are a number of shortcuts that have been taken here: we always print out a numeric digit, rather than AP style which uses English for numbers between zero and nine, and we’ve sidestepped whether or not “message” should be pluralized by tacking on an (s) on the end.

If we’d like to handle those cases, the next obvious thing to do is to add a few new functions: we’ll need a function apnumber to convert 3 to three, and we’ll need a function pluralize to convert message into messages when numMessages is greater than one. So you would end up with something like printf("You have %s new %s", apnumber(numMessages), pluralize("message", numMessages)). This is the ad hoc approach which will work reasonably well on English but will get you into trouble when you realize other languages have things like noun-adjective agreement (“nouveau message” versus “nouveaux messages”). Internationalization frameworks have long recognized and offered mechanisms for dealing with these cases; however, the average English-based project is unlikely to know about these problems until they internationalize.

However, there exists a representation which is agnostic to these issues. Consider the dependency grammar of this sentence, which we have extracted with a little NLP:

nsubj(have-2, You-1)
root(ROOT-0, have-2)
num(messages-5, three-3)
amod(messages-5, new-4)
dobj(have-2, messages-5)

We might ask, “Given data of this form, can we automatically generate an appropriate sentence in some language, which conveys the information and is grammatically correct?” That is a pretty hard task: it is the fundamental question of NLG. (It's not quite equivalent to machine translation, since we might require a user to add extra information about the functional intent that would otherwise be very hard to extract from text.) While it would be cool if we had a magic black box which could crank out the resulting sentences, even today, the tools developed by NLG may help reduce translator burden and increase flexibility. I think that’s well worth investigating.

  • January 2, 2013