ezyang’s blog

What’s a module system good for anyway?

This summer, I've been working at Microsoft Research implementing Backpack, a module system for Haskell. Interestingly, Backpack is not really a single monolothic feature, but, rather, an agglomeration of small, infrastructural changes which combine together in an interesting way. In this series of blog posts, I want to talk about what these individual features are, as well as how the whole is greater than the sum of the parts.

But first, there's an important question that I need to answer: What's a module system good for anyway? Why should you, an average Haskell programmer, care about such nebulous things as module systems and modularity. At the end of the day, you want your tools to solve specific problems you have, and it is sometimes difficult to understand what problem a module system like Backpack solves. As tomejaguar puts it: "Can someone explain clearly the precise problem that Backpack addresses? I've read the paper and I know the problem is 'modularity' but I fear I am lacking the imagination to really grasp what the issue is."

Look no further. In this blog post, I want to talk concretely about problems Haskellers have today, explain what the underlying causes of these problems are, and say why a module system could help you out.

The String, Text, ByteString problem

As experienced Haskellers are well aware, there are multitude of string types in Haskell: String, ByteString (both lazy and strict), Text (also both lazy and strict). To make matters worse, there is no one "correct" choice of a string type: different types are appropriate in different cases. String is convenient and native to Haskell'98, but very slow; ByteString is fast but are simply arrays of bytes; Text is slower but Unicode aware.

In an ideal world, a programmer might choose the string representation most appropriate for their application, and write all their code accordingly. However, this is little solace for library writers, who don't know what string type their users are using! What's a library writer to do? There are only a few choices:

They "commit" to one particular string representation, leaving users to manually convert from one representation to another when there is a mismatch. Or, more likely, the library writer used the default because it was easy. Examples: base (uses Strings because it completely predates the other representations), diagrams (uses Strings because it doesn't really do heavy string manipulation).
They can provide separate functions for each variant, perhaps identically named but placed in separate modules. This pattern is frequently employed to support both strict/lazy variants Text and ByteStringExamples: aeson (providing decode/decodeStrict for lazy/strict ByteString), attoparsec (providing Data.Attoparsec.ByteString/Data.Attoparsec.ByteString.Lazy), lens (providing Data.ByteString.Lazy.Lens/Data.ByteString.Strict.Lens).
They can use type-classes to overload functions to work with multiple representations. The particular type class used hugely varies: there is ListLike, which is used by a handful of packages, but a large portion of packages simply roll their own. Examples: SqlValue in HDBC, an internal StringLike in tagsoup, and yet another internal StringLike in web-encodings.

The last two methods have different trade offs. Defining separate functions as in (2) is a straightforward and easy to understand approach, but you are still saying no to modularity: the ability to support multiple string representations. Despite providing implementations for each representation, the user still has to commit to particular representation when they do an import. If they want to change their string representation, they have to go through all of their modules and rename their imports; and if they want to support multiple representations, they'll still have to write separate modules for each of them.

Using type classes (3) to regain modularity may seem like an attractive approach. But this approach has both practical and theoretical problems. First and foremost, how do you choose which methods go into the type class? Ideally, you'd pick a minimal set, from which all other operations could be derived. However, many operations are most efficient when directly implemented, which leads to a bloated type class, and a rough time for other people who have their own string types and need to write their own instances. Second, type classes make your type signatures more ugly String -> String to StringLike s => s -> s and can make type inference more difficult (for example, by introducing ambiguity). Finally, the type class StringLike has a very different character from the type class Monad, which has a minimal set of operations and laws governing their operation. It is difficult (or impossible) to characterize what the laws of an interface like this should be. All-in-all, it's much less pleasant to program against type classes than concrete implementations.

Wouldn't it be nice if I could import String, giving me the type String and operations on it, but then later decide which concrete implementation I want to instantiate it with? This is something a module system can do for you! This Reddit thread describes a number of other situations where an ML-style module would come in handy.

(PS: Why can't you just write a pile of preprocessor macros to swap in the implementation you want? The answer is, "Yes, you can; but how are you going to type check the thing, without trying it against every single implementation?")

Destructive package reinstalls

Have you ever gotten this error message when attempting to install a new package?

$ cabal install hakyll
cabal: The following packages are likely to be broken by the reinstalls:
pandoc-1.9.4.5
Graphalyze-0.14.0.0
Use --force-reinstalls if you want to install anyway.

Somehow, Cabal has concluded that the only way to install hakyll is to reinstall some dependency. Here's one situation where a situation like this could come about:

pandoc and Graphalyze are compiled against the latest unordered-containers-0.2.5.0, which itself was compiled against the latest hashable-1.2.2.0.
hakyll also has a dependency on unordered-containers and hashable, but it has an upper bound restriction on hashable which excludes the latest hashable version. Cabal decides we need to install an old version of hashable, say hashable-0.1.4.5.
If hashable-0.1.4.5 is installed, we also need to build unordered-containers against this older version for Hakyll to see consistent types. However, the resulting version is the same as the preexisting version: thus, reinstall!

The root cause of this error an invariant Cabal currently enforces on a package database: there can only be one instance of a package for any given package name and version. In particular, this means that it is not possible to install a package multiple times, compiled against different dependencies. This is a bit troublesome, because sometimes you really do want the same package installed multiple times with different dependencies: as seen above, it may be the only way to fulfill the version bounds of all packages involved. Currently, the only way to work around this problem is to use a Cabal sandbox (or blow away your package database and reinstall everything, which is basically the same thing).

You might be wondering, however, how could a module system possibly help with this? It doesn't... at least, not directly. Rather, nondestructive reinstalls of a package are a critical feature for implementing a module system like Backpack (a package may be installed multiple times with different concrete implementations of modules). Implementing Backpack necessitates fixing this problem, moving Haskell's package management a lot closer to that of Nix's or NPM.

Version bounds and the neglected PVP

While we're on the subject of cabal-install giving errors, have you ever gotten this error attempting to install a new package?

$ cabal install hledger-0.18
Resolving dependencies...
cabal: Could not resolve dependencies:
# pile of output

There are a number of possible reasons why this could occur, but usually it's because some of the packages involved have over-constrained version bounds (especially upper bounds), resulting in an unsatisfiable set of constraints. To add insult to injury, often these bounds have no grounding in reality (the package author simply guessed the range) and removing it would result in a working compilation. This situation is so common that Cabal has a flag --allow-newer which lets you override the upper bounds of packages. The annoyance of managing bounds has lead to the development of tools like cabal-bounds, which try to make it less tedious to keep upper bounds up-to-date.

But as much as we like to rag on them, version bounds have a very important function: they prevent you from attempting to compile packages against dependencies which don't work at all! An under-constrained set of version bounds can easily have compiling against a version of the dependency which doesn't type check.

How can a module system help? At the end of the day, version numbers are trying to capture something about the API exported by a package, described by the package versioning policy. But the current state-of-the-art requires a user to manually translate changes to the API into version numbers: an error prone process, even when assisted by various tools. A module system, on the other hand, turns the API into a first-class entity understood by the compiler itself: a module signature. Wouldn't it be great if packages depended upon signatures rather than versions: then you would never have to worry about version numbers being inaccurate with respect to type checking. (Of course, versions would still be useful for recording changes to semantics not seen in the types, but their role here would be secondary in importance.) Some full disclosure is warranted here: I am not going to have this implemented by the end of my internship, but I'm hoping to make some good infrastructural contributions toward it.

Conclusion

If you skimmed the introduction to the Backpack paper, you might have come away with the impression that Backpack is something about random number generators, recursive linking and applicative semantics. While these are all true "facts" about Backpack, they understate the impact a good module system can have on the day-to-day problems of a working programmer. In this post, I hope I've elucidated some of these problems, even if I haven't convinced you that a module system like Backpack actually goes about solving these problems: that's for the next series of posts. Stay tuned!

New theme!

Hello loyal readers: Inside 206-105 has a new theme! I’m retiring Manifest, which was a pretty nice theme but (1) the text size was too small and (2) I decided I didn’t really like the fonts, I’ve reskinned my blog with a theme based on Brent Jackson’s Ashley, but ported to work on WordPress. I hope you like it, and please report any rendering snafus you might notice on older pages. Thanks!

Type classes: confluence, coherence and global uniqueness

Today, I'd like to talk about some of the core design principles behind type classes, a wildly successful feature in Haskell. The discussion here is closely motivated by the work we are doing at MSRC to support type classes in Backpack. While I was doing background reading, I was flummoxed to discover widespread misuse of the terms "confluence" and "coherence" with respect to type classes. So in this blog post, I want to settle the distinction, and propose a new term, "global uniqueness of instances" for the property which people have been colloquially referred to as confluence and coherence.

Let's start with the definitions of the two terms. Confluence is a property that comes from term-rewriting: a set of instances is confluent if, no matter what order constraint solving is performed, GHC will terminate with a canonical set of constraints that must be satisfied for any given use of a type class. In other words, confluence says that we won't conclude that a program doesn't type check just because we swapped in a different constraint solving algorithm.

Confluence's closely related twin is coherence (defined in the paper "Type classes: exploring the design space"). This property states that every different valid typing derivation of a program leads to a resulting program that has the same dynamic semantics. Why could differing typing derivations result in different dynamic semantics? The answer is that context reduction, which picks out type class instances, elaborates into concrete choices of dictionaries in the generated code. Confluence is a prerequisite for coherence, since one can hardly talk about the dynamic semantics of a program that doesn't type check.

So, what is it that people often refer to when they compare Scala type classes to Haskell type classes? I am going to refer to this as global uniqueness of instances, defining to say: in a fully compiled program, for any type, there is at most one instance resolution for a given type class. Languages with local type class instances such as Scala generally do not have this property, but in Haskell, we find this property is a very convenient one when building abstractions like sets.

So, what properties does GHC enforce, in practice? In the absence of any type system extensions, GHC's employs a set of rules to ensure that type class resolution is confluent and coherent. Intuitively, it achieves this by having a very simple constraint solving algorithm (generate wanted constraints and solve wanted constraints) and then requiring the set of instances to be nonoverlapping, ensuring there is only ever one way to solve a wanted constraint. Overlap is a more stringent restriction than either confluence or coherence, and via the OverlappingInstances and IncoherentInstances, GHC allows a user to relax this restriction "if they know what they're doing."

Surprisingly, however, GHC does not enforce global uniqueness of instances. Imported instances are not checked for overlap until we attempt to use them for instance resolution. Consider the following program:

-- T.hs
data T = T
-- A.hs
import T
instance Eq T where
-- B.hs
import T
instance Eq T where
-- C.hs
import A
import B

When compiled with one-shot compilation, C will not report overlapping instances unless we actually attempt to use the Eq instance in C. This is by design: ensuring that there are no overlapping instances eagerly requires eagerly reading all the interface files a module may depend on.

We might summarize these three properties in the following manner. Culturally, the Haskell community expects global uniqueness of instances to hold: the implicit global database of instances should be confluent and coherent. GHC, however, does not enforce uniqueness of instances: instead, it merely guarantees that the subset of the instance database it uses when it compiles any given module is confluent and coherent. GHC does do some tests when an instance is declared to see if it would result in overlap with visible instances, but the check is by no means perfect; truly, type-class constraint resolution has the final word. One mitigating factor is that in the absence of orphan instances, GHC is guaranteed to eagerly notice when the instance database has overlap (assuming that the instance declaration checks actually worked...)

Clearly, the fact that GHC's lazy behavior is surprising to most Haskellers means that the lazy check is mostly good enough: a user is likely to discover overlapping instances one way or another. However, it is relatively simple to construct example programs which violate global uniqueness of instances in an observable way:

-- A.hs
module A where
data U = X | Y deriving (Eq, Show)

-- B.hs
module B where
import Data.Set
import A

instance Ord U where
compare X X = EQ
compare X Y = LT
compare Y X = GT
compare Y Y = EQ

ins :: U -> Set U -> Set U
ins = insert

-- C.hs
module C where
import Data.Set
import A

instance Ord U where
compare X X = EQ
compare X Y = GT
compare Y X = LT
compare Y Y = EQ

ins' :: U -> Set U -> Set U
ins' = insert

-- D.hs
module Main where
import Data.Set
import A
import B
import C

test :: Set U
test = ins' X $ ins X $ ins Y $ empty

main :: IO ()
main = print test

-- OUTPUT
$ ghc -Wall -XSafe -fforce-recomp --make D.hs
[1 of 4] Compiling A ( A.hs, A.o )
[2 of 4] Compiling B ( B.hs, B.o )

B.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
[3 of 4] Compiling C ( C.hs, C.o )

C.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
[4 of 4] Compiling Main ( D.hs, D.o )
Linking D ...
$ ./D
fromList [X,Y,X]

Locally, all type class resolution was coherent: in the subset of instances each module had visible, type class resolution could be done unambiguously. Furthermore, the types of ins and ins' discharge type class resolution, so that in D when the database is now overlapping, no resolution occurs, so the error is never found.

It is easy to dismiss this example as an implementation wart in GHC, and continue pretending that global uniqueness of instances holds. However, the problem with global uniqueness of instances is that they are inherently nonmodular: you might find yourself unable to compose two components because they accidentally defined the same type class instance, even though these instances are plumbed deep in the implementation details of the components. This is a big problem for Backpack, or really any module system, whose mantra of separate modular development seeks to guarantee that linking will succeed if the library writer and the application writer develop to a common signature.

Parsec: “try a <|> b” considered harmful

tl;dr The scope of backtracking try should be minimized, usually by placing it inside the definition of a parser.

Have you ever written a Parsec parser and gotten a really uninformative error message?

"test.txt" (line 15, column 7):
unexpected 'A'
expecting end of input

The line and the column are randomly somewhere in your document, and you're pretty sure you should be in the middle of some stack of parser combinators. But wait! Parsec has somehow concluded that the document should be ending immediately. You noodle around and furthermore discover that the true error is some ways after the actually reported line and column.

You think, “No wonder Parsec gets such a bad rep about its error handling.”

Assuming that your grammar in question is not too weird, there is usually a simple explanation for an error message like this: the programmer sprinkled their code with too many backtracking try statements, and the backtracking has destroyed useful error state. In effect, at some point the parser failed for the reason we wanted to report to the user, but an enclosing try statement forced the parser to backtrack and try another (futile possibility).

This can be illustrated by way of an example. A Haskeller is playing around with parse combinators and decides to test out their parsing skills by writing a parser for Haskell module imports:

stmt ::= import qualified A as B
       | import A

Piggy-backing off of Parsec’s built in token combinators (and the sample code), their first version might look something like this:

import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellDef)

data Stmt = QualifiedImport String String | Import String
    deriving (Show)

pStmt = pQualifiedImport <|> pImport

pQualifiedImport = do
    reserved "import"
    reserved "qualified"
    i <- identifier
    reserved "as"
    i' <- identifier
    return (QualifiedImport i i')

pImport = do
    reserved "import"
    i <- identifier
    return (Import i)

lexer = P.makeTokenParser (haskellDef
    { P.reservedNames = P.reservedNames haskellDef ++ ["qualified", "as"] })
identifier = P.identifier lexer
reserved = P.reserved lexer

parseStmt input = parse (pStmt >> eof) "(unknown)" input

Unfortunately, the parser doesn't work for regular imports—they get this error message:

*Main> parseStmt "import Foo"
Left "(unknown)" (line 1, column 8):
unexpected "F"
expecting "qualified"

After a little Googling, they discover that Parsec doesn’t backtrack by default. Well, that’s fine; why not just insert a try into the parser.

pStmt = try pQualifiedImport <|> pImport

This fixes both parses and suggests the following rule for writing future parsers:

If I need choice over multiple parsers, but some of these parsers might consume input, I better tack a try onto each of the parsers, so that I can backtrack.

Unbeknownst to the user, they have introduced bad error reporting behavior:

*Main> parseStmt "import qualified Foo s B"
Left "(unknown)" (line 1, column 17):
unexpected reserved word "qualified"
expecting letter or digit or "#"

Wait a second! The error we wanted was that there was an unexpected identifier s, when we were expecting as. But instead of reporting an error when this occurred, Parsec instead backtracked, and attempted to match the pImport rule, only failing once that rule failed. By then, the knowledge that one of our choice branches failed had been forever lost.

How can we fix it? The problem is that our code backtracks when we, the developer, know it will be futile. In particular, once we have parsed import qualified, we know that the statement is, in fact, a qualified import, and we shouldn’t backtrack anymore. How can we get Parsec to understand this? Simple: reduce the scope of the try backtracking operator:

pStmt = pQualifiedImport <|> pImport

pQualifiedImport = do
    try $ do
        reserved "import"
        reserved "qualified"
    i <- identifier
    reserved "as"
    i' <- identifier
    return (QualifiedImport i i')

Here, we have moved the try from pStmt into pQualifiedImport, and we only backtrack if import qualified fails to parse. Once it parses, we consume those tokens and we are now committed to the choice of a qualified import. The error messages get correspondingly better:

*Main> parseStmt "import qualified Foo s F"
Left "(unknown)" (line 1, column 22):
unexpected "s"
expecting "as"

The moral of the story: The scope of backtracking try should be minimized, usually by placing it inside the definition of a parser. Some amount of cleverness is required: you have to be able to identify how much lookahead is necessary to commit to a branch, which generally depends on how the parser is used. Fortunately, many languages are constructed specifically so that the necessary lookahead is not too large, and for the types of projects I might use Parsec for, I’d be happy to sacrifice this modularity.

Another way of looking at this fiasco is that Parsec is at fault: it shouldn’t offer an API that makes it so easy to mess up error messages—why can’t it automatically figure out what the necessary lookahead is? While a traditional parser generator can achieve this (and improve efficiency by avoiding backtracking altogether in our earlier example), there are some fundamental reasons why Parsec (and monadic parser combinator libraries like it) cannot automatically determine what the lookahead needs to be. This is one of the reasons (among many) why many Haskellers prefer faster parsers which simply don’t try to do any error handling at all.

Why, then, did I write this post in the first place? There is still a substantial amount of documentation recommending the use of Parsec, and a beginning Haskeller is more likely than not going to implement their first parser in Parsec. And if someone is going to write a Parsec parser, you might as well spend a little time to limit your backtracking: it can make working with Parsec parsers a lot more pleasant.

GHC and mutable arrays: a DIRTY little secret

Brandon Simmon recently made a post to the glasgow-haskell-users mailing list asking the following question:

I've been looking into an issue in a library in which as more mutable arrays are allocated, GC dominates (I think I verified this?) and all code gets slower in proportion to the number of mutable arrays that are hanging around.

...to which I replied:

In the current GC design, mutable arrays of pointers are always placed on the mutable list. The mutable list of generations which are not being collected are always traversed; thus, the number of pointer arrays corresponds to a linear overhead for minor GCs.

If you’re coming from a traditional, imperative language, you might find this very surprising: if you paid linear overhead per GC in Java for all the mutable arrays in your system... well, you probably wouldn't use Java ever, for anything. But most Haskell users seem to get by fine; mostly because Haskell encourages immutability, making it rare for one to need lots of mutable pointer arrays.

Of course, when you do need it, it can be a bit painful. We have a GHC bug tracking the issue, and there is some low hanging fruit (a variant of mutable pointer arrays which has more expensive write operation, but which only gets put on the mutable list when you write to it) as well as some promising directions for how to implement card-marking for the heap, which is the strategy that GCs like the JVM's use.

On a more meta-level, implementing a perfomant generational garbage collector for an immutable language is far, far easier than implementing one for a mutable language. This is my personal hypothesis why Go doesn’t have a generational collector yet, and why GHC has such terrible behavior on certain classes of mutation.

Postscript. The title is a pun on the fact that “DIRTY” is used to describe mutable objects which have been written to since the last GC. These objects are part of the remembered set and must be traversed during garbage collection even if they are in an older generation.

Elimination with a Motive (in Coq)

Elimination rules play an important role in computations over datatypes in proof assistants like Coq. In his paper "Elimination with a Motive", Conor McBride argued that "we should exploit a hypothesis not in terms of its immediate consequences, but in terms of the leverage it exerts on an arbitrary goal: we should give elimination a motive." In other words, proofs in a refinement setting (backwards reasoning) should use their goals to guide elimination.

I recently had the opportunity to reread this historical paper, and in the process, I thought it would be nice to port the examples to Coq. Here is the result:

http://web.mit.edu/~ezyang/Public/motive/motive.html

It's basically a short tutorial motivating John Major equality (also known as heterogenous equality.) The linked text is essentially an annotated version of the first part of the paper—I reused most of the text, adding comments here and there as necessary. The source is also available at:

http://web.mit.edu/~ezyang/Public/motive/motive.v

The cost of weak pointers and finalizers in GHC

Weak pointers and finalizers are a very convenient feature for many types of programs. Weak pointers are useful for implementing memotables and solving certain classes of memory leaks, while finalizers are useful for fitting "allocate/deallocate" memory models into a garbage-collected language. Of course, these features don’t come for free, and so one might wonder what the cost of utilizing these two (closely related) features are in GHC. In this blog post, I want to explain how weak pointers and finalizers are implemented in the GHC runtime system and characterize what extra overheads you incur by using them. These post assumes some basic knowledge about how the runtime system and copying garbage collection work.

The userland API

The API for weak pointers is in System.Mem.Weak; in its full generality, a weak pointer consists of a key and a value, with the property that if the key is alive, then the value is considered alive. (A "simple" weak reference is simply one where the key and value are the same.) A weak pointer can also optionally be associated with a finalizer, which is run when the object is garbage collected. Haskell finalizers are not guaranteed to run.

Foreign pointers in Foreign.ForeignPtr also have a the capability to attach a C finalizer; i.e. a function pointer that might get run during garbage collection. As it turns out, these finalizers are also implemented using weak pointers, but C finalizers are treated differently from Haskell finalizers.

Representation of weak pointers

A weak pointer is a special type of object with the following layout:

typedef struct _StgWeak {   /* Weak v */
  StgHeader header;
  StgClosure *cfinalizers;
  StgClosure *key;
  StgClosure *value;                /* v */
  StgClosure *finalizer;
  struct _StgWeak *link;
} StgWeak;

As we can see, we have pointers to the key and value, as well as separate pointers for a single Haskell finalizer (just a normal closure) and C finalizers (which have the type StgCFinalizerList). There is also a link field for linking weak pointers together. In fact, when the weak pointer is created, it is added to the nursery's list of weak pointers (aptly named weak_ptr_list). As of GHC 7.8, this list is global, so we do have to take out a global lock when a new weak pointer is allocated; however, the lock has been removed in HEAD.

Garbage collecting weak pointers

Pop quiz! When we do a (minor) garbage collection on weak pointers, which of the fields in StgWeak are considered pointers, and which fields are considered non-pointers? The correct answer is: only the first field is considered a “pointer”; the rest are treated as non-pointers by normal GC. This is actually what you would expect: if we handled the key and value fields as normal pointer fields during GC, then they wouldn’t be weak at all.

Once garbage collection has been completed (modulo all of the weak references), we then go through the weak pointer list and check if the keys are alive. If they are, then the values and finalizers should be considered alive, so we mark them as live, and head back and do more garbage collection. This process will continue as long as we keep discovering new weak pointers to process; however, this will only occur when the key and the value are different (if they are the same, then the key must have already been processed by the GC). Live weak pointers are removed from the "old" list and placed into the new list of live weak pointers, for the next time.

Once there are no more newly discovered live pointers, the list of dead pointers is collected together, and the finalizers are scheduled (scheduleFinalizers). C finalizers are run on the spot during GC, while Haskell finalizers are batched together into a list and then shunted off to a freshly created thread to be run.

That's it! There are some details for how to handle liveness of finalizers (which are heap objects too, so even if an object is dead we have to keep the finalizer alive for one more GC) and threads (a finalizer for a weak pointer can keep a thread alive).

Tallying up the costs

To summarize, here are the extra costs of a weak pointer:

Allocating a weak pointer requires taking a global lock (will be fixed in GHC 7.10) and costs six words (fairly hefty as far as Haskell heap objects tend to go.)
During each minor GC, processing weak pointers takes time linear to the size of the weak pointer lists for all of the generations being collected. Furthermore, this process involves traversing a linked list, so data locality will not be very good. This process may happen more than once, although once it is determined that a weak pointer is live, it is not processed again. The cost of redoing GC when a weak pointer is found to be live is simply the cost of synchronizing all parallel GC threads together.
The number of times you have to switch between GC'ing and processing weak pointers depends on the structure of the heap. Take a heap and add a special "weak link" from a key to its dependent weak value. Then we can classify objects by the minimum number of weak links we must traverse from a root to reach the object: call this the "weak distance". Supposing that a given weak pointer's weak distance is n, then we spend O(n) time processing that weak pointer during minor GC. The maximum weak distance constitutes how many times we need to redo the GC.

In short, weak pointers are reasonably cheap when they are not deeply nested: you only pay the cost of traversing a linked list of all of the pointers you have allocated once per garbage collection. In the pessimal case (a chain of weak links, where the value of each weak pointer was not considered reachable until we discovered its key is live in the previous iteration), we can spend quadratic time processing weak pointers.

Calculating Shanten in Mahjong

Move aside, poker! While the probabilities of various poker hands are well understood and tabulated, the Chinese game of chance Mahjong [1] enjoys a far more intricate structure of expected values and probabilities. [2] This is largely due in part to the much larger variety of tiles available (136 tiles, as opposed to the standard playing card deck size of 52), as well as the turn-by-turn game play, which means there is quite a lot of strategy involved with what is ostensibly a game of chance. In fact, the subject is so intricate, I’ve decided to write my PhD thesis on it. This blog post is a condensed version of one chapter of my thesis, considering the calculation of shanten, which we will define below. I’ll be using Japanese terms, since my favorite variant of mahjong is Riichi Mahjong; you can consult the Wikipedia article on the subject if you need to translate.

Calculating Shanten

The basic gameplay of Mahjong involves drawing a tile into a hand of thirteen tiles, and then discarding another tile. The goal is to form a hand of fourteen tiles (that is, after drawing, but before discarding a tile) which is a winning configuration. There are a number of different winning configurations, but most winning configurations share a similar pattern: the fourteen tiles must be grouped into four triples and a single pair. Triples are either three of the same tile, or three tiles in a sequence (there are three “suits” which can be used to form sequences); the pair is two of the same tiles. Here is an example:

http://upload.wikimedia.org/wikipedia/commons/1/1c/MJw1.png

http://upload.wikimedia.org/wikipedia/commons/c/c3/MJw2.png

http://upload.wikimedia.org/wikipedia/commons/9/9e/MJw3.png

http://upload.wikimedia.org/wikipedia/commons/e/ed/MJw5.png

http://upload.wikimedia.org/wikipedia/commons/6/61/MJt2.png

http://upload.wikimedia.org/wikipedia/commons/a/a4/MJt3.png

http://upload.wikimedia.org/wikipedia/commons/7/72/MJt4.png

http://upload.wikimedia.org/wikipedia/commons/6/6f/MJt7.png

http://upload.wikimedia.org/wikipedia/commons/1/18/MJt8.png

http://upload.wikimedia.org/wikipedia/commons/a/a1/MJt9.png

http://upload.wikimedia.org/wikipedia/commons/d/df/MJs4.png

http://upload.wikimedia.org/wikipedia/commons/b/bf/MJs5.png

http://upload.wikimedia.org/wikipedia/commons/c/cb/MJs6.png

Represented numerically, this hand consists of the triples and pairs 123 55 234 789 456.

One interesting quantity that is useful to calculate given a mahjong hand is the shanten number, that is, the number of tiles away from winning you are. This can be used to give you the most crude heuristic of how to play: discard tiles that get you closer to tenpai. The most widely known shanten calculator is this one on Tenhou’s website [3]; unfortunately, the source code for this calculator is not available. There is another StackOverflow question on the subject, but the “best” answer offers only a heuristic approach with no proof of correctness! Can we do better?

Naïvely, the shanten number is a breadth first search on the permutations of a hand. When a winning hand is found, the algorithm terminates and indicates the depth the search had gotten to. Such an algorithm is obviously correct; unfortunately, with 136 tiles, one would have to traverse $((136-13)\times 14)^n$ hands (choices of new tiles times choices of discard) while searching for a winning hand that is n-shanten away. If you are four tiles away, you will have to traverse over six trillion hands. We can reduce this number by avoiding redundant work if we memoize the shanten associated with hands: however, the total number of possible hands is roughly $136 \choose 13$ , or 59 bits. Though we can fit (via a combinatorial number system) a hand into a 64-bit integer, the resulting table is still far too large to hope to fit in memory.

The trick is to observe that shanten calculation for each of the suits is symmetric; thus, we can dynamic program over a much smaller space of the tiles 1 through 9 for some generic suit, and then reuse these results when assembling the final calculation. $9 \times 4 \choose 13$ is still rather large, so we can take advantage of the fact that because there are four copies of each tile, an equivalent representation is a 9-vector of the numbers zero to four, with the constraint that the sum of these numbers is 13. Even without the constraint, the count $5^9$ is only two million, which is quite tractable. At a byte per entry, that’s 2MB of memory; less than your browser is using to view this webpage. (In fact, we want the constraint to actually be that the sum is less than or equal to 13, since not all hands are single-suited, so the number of tiles in a hand is less.

The breadth-first search for solving a single suit proceeds as follows:

Initialize a table A indexed by tile configuration (a 9-vector of 0..4).
Initialize a todo queue Q of tile configurations.
Initialize all winning configurations in table A with shanten zero (this can be done by enumeration), recording these configurations in Q.
While the todo queue Q is not empty, pop the front element, mark the shanten of all adjacent uninitialized nodes as one greater than that node, and push those nodes onto the todo queue.

With this information in hand, we can assemble the overall shanten of a hand. It suffices to try every distribution of triples and the pairs over the four types of tiles (also including null tiles), consulting the shanten of the requested shape, and return the minimum of all these configurations. There are $4 \times {4 + 4 - 1 \choose 4}$ (by stars and bars) combinations, for a total of 140 configurations. Computing the shanten of each configuration is a constant time operation into the lookup table generated by the per-suit calculation. A true shanten calculator must also accomodate the rare other hands which do not follow this configuration, but these winning configurations are usually highly constrained, and quite easily to (separately) compute the shanten of.

With a shanten calculator, there are a number of other quantities which can be calculated. Uke-ire refers to the number of possible draws which can reduce the shanten of your hand: one strives for high uke-ire because it means that probability that you will draw a tile which moves your hand closer to winning. Given a hand, it's very easy to calculate its uke-ire: just look at all adjacent hands and count the number of hands which have lower shanten.

Further extensions

Suppose that you are trying to design an AI which can play Mahjong. Would the above shanten calculator provide a good evaluation metric for your hand? Not really: it has a major drawback, in that it does not consider the fact that some tiles are simply unavailable (they were discarded). For example, if all four “nine stick” tiles are visible on the table, then no hand configuration containing a nine stick is actually reachable. Adjusting for this situation is actually quite difficult, for two reasons: first, we can no longer precompute a shanten table, since we need to adjust at runtime what the reachability metric is; second, the various suits are no longer symmetric, so we have to do three times as much work. (We can avoid an exponential blowup, however, since there is no inter-suit interaction.)

Another downside of the shanten and uke-ire metrics is that they are not direct measures of “tile efficiency”: that is, they do not directly dictate a strategy for discards which minimizes the expected time before you get a winning hand. Consider, for example, a situation where you have the tiles 233, and only need to make another triple in order to win. You have two possible discards: you can discard a 2 or a 3. In both cases, your shanten is zero, but discarding a 2, you can only win by drawing a 3, whereas discarding a 3, you can win by drawing a 1 or a 4. Maximizing efficiency requires considering the lifetime ure-kire of your hands.

Even then, perfect tile efficiency is not enough to see victory: every winning hand is associated with a point-score, and so in many cases it may make sense to go for a lower-probability hand that has higher expected value. Our decomposition method completely falls apart here, as while the space of winning configurations can be partitioned, scoring has nonlocal effects, so the entire hand has to be considered as a whole. In such cases, one might try for a Monte Carlo approach, since the probability space is too difficult to directly characterize. However, in the Japanese Mahjong scoring system, there is yet another difficulty with this approach: the scoring system is exponential. Thus, we are in a situation where the majority of samples will be low scoring, but an exponentially few number of samples have exponential payoff. In such cases, it’s difficult to say if random sampling will actually give a good result, since it is likely to miscalculate the payoff, unless exponentially many samples are taken. (On the other hand, because these hands are so rare, an AI might do considerably well simply ignoring them.)

To summarize, Mahjong is a fascinating game, whose large state space makes it difficult to accurately characterize the probabilities involved. In my thesis, I attempt to tackle some of these questions; please check it out if you are interested in more.

[1] No, I am not talking about the travesty that is mahjong solitaire.

[2] To be clear, I am not saying that poker strategy is simple—betting strategy is probably one of the most interesting parts of the game—I am simply saying that the basic game is rather simple, from a probability perspective.

[3] Tenhou is a popular Japanese online mahjong client. The input format for the Tenhou calculator is 123m123p123s123z, where numbers before m indicate man tiles, p pin tiles, s sou tiles, and z honors (in order, they are: east, south, west, north, white, green, red). Each entry indicates which tile you can discard to move closer to tenpai; the next list is of ure-kire (and the number of tiles which move the hand further).

Haskell for Coq programmers

So you may have heard about this popular new programming language called Haskell. What's Haskell? Haskell is a non-dependently typed programming language, sporting general recursion, type inference and built-in side-effects. It is true that dependent types are considered an essential component of modern, expressive type systems. However, giving up dependence can result in certain benefits for other aspects of software engineering, and in this article, we'd like to talk about the omissions that Haskell makes to support these changes.

Syntax

There are a number of syntactic differences between Coq and Haskell, which we will point out as we proceed in this article. To start with, we note that in Coq, typing is denoted using a single colon (false : Bool); in Haskell, a double colon is used (False :: Bool). Additionally, Haskell has a syntactic restriction, where constructors must be capitalized, while variables must be lower-case.

Similar to my OCaml for Haskellers post, code snippets will have the form:

(* Coq *)

{- Haskell -}

Universes/kinding

A universe is a type whose elements are types. They were originally introduced to constructive type theory by Per Martin-Löf. Coq sports an infinite hierarchy of universes (e.g. Type (* 0 *) : Type (* 1 *), Type (* 1 *) : Type (* 2 *), and so forth).

Given this, it is tempting to draw an analogy between universes and Haskell’s kind of types * (pronounced “star”), which classifies types in the same way Type (* 0 *) classifies primitive types in Coq. Furthermore, the sort box classifies kinds (* : BOX, although this sort is strictly internal and cannot be written in the source language). However, the resemblance here is only superficial: it is misleading to think of Haskell as a language with only two universes. The differences can be summarized as follows:

In Coq, universes are used purely as a sizing mechanism, to prevent the creation of types which are too big. In Haskell, types and kinds do double duty to enforce the phase distinction: if a has kind *, then x :: a is guaranteed to be a runtime value; likewise, if k has sort box, then a :: k is guaranteed to be a compile-time value. This structuring is a common pattern in traditional programming languages, although knowledgeable folks like Conor McBride think that ultimately this is a design error, since one doesn’t really need a kinding system to have type erasure.
In Coq, universes are cumulative: a term which has type Type (* 0 *) also has type Type (* 1 *). In Haskell, there is no cumulativity between between types and kinds: if Nat is a type (i.e. has the type *), it is not automatically a kind. However, in some cases, partial cumulativity can be achieved using datatype promotion, which constructs a separate kind-level replica of a type, where the data constructors are now type-level constructors. Promotion is also capable of promoting type constructors to kind constructors.
In Coq, a common term language is used at all levels of universes. In Haskell, there are three distinct languages: a language for handling base terms (the runtime values), a language for handling type-level terms (e.g. types and type constructors) and a language for handling kind-level terms. In some cases this syntax is overloaded, but in later sections, we will often need to say how a construct is formulated separately at each level of the kinding system.

One further remark: Type in Coq is predicative; in Haskell, * is impredicative, following the tradition of System F and other languages in the lambda cube, where kinding systems of this style are easy to model.

Function types

In Coq, given two types A and B, we can construct the type A -> B denoting functions from A to B (for A and B of any universe). Like Coq, functions with multiple arguments are natively supported using currying. Haskell supports function types for both types (Int -> Int) and kinds (* -> *, often called type constructors) and application by juxtaposition (e.g. f x). (Function types are subsumed by pi types, however, we defer this discussion for later.) However, Haskell has some restrictions on how one may construct functions, and utilizes different syntax when handling types and kinds:

For expressions (with type a -> b where a, b :: *), both direct definitions and lambdas are supported. A direct definition is written in an equational style:

Definition f x := x + x.

f x = x + x

while a lambda is represented using a backslash:

fun x => x + x

\x -> x + x

For type families (with type k1 -> k2 where k1 and k2 are kinds), the lambda syntax is not supported. In fact, no higher-order behavior is permitted at the type-level; while we can directly define appropriately kinded type functions, at the end of the day, these functions must be fully applied or they will be rejected by the type-checker. From an implementation perspective, the omission of type lambdas makes type inference and checking much easier.

Type synonyms:
```
Definition Endo A := A -> A.
```
```
type Endo a = a -> a
```
Type synonyms are judgmentally equal to their expansions. As mentioned in the introduction, they cannot be partially applied. They were originally intended as a limited syntactic mechanism for making type signatures more readable.
Closed type (synonym) families:
```
Inductive fcode :=
  | intcode : fcode
  | anycode : fcode.
Definition interp (c : fcode) : Type := match c with
  | intcode -> bool
  | anycode -> char
end.
```
```
type family F a where
  F Int = Bool
  F a   = Char
```
While closed type families look like the addition of typecase (and would violate parametricity in that case), this is not the case, as closed type families can only return types. In fact, closed type families correspond to a well-known design pattern in Coq, where one writes inductive data type representing codes of types, and then having an interpretation function which interprets the codes as actual types. As we have stated earlier, Haskell has no direct mechanism for defining functions on types, so this useful pattern had to be supported directly in the type families functionality. Once again, closed type families cannot be partially applied.

In fact, the closed type family functionality is a bit more expressive than an inductive code. In particular, closed type families support non-linear pattern matches (F a a = Int) and can sometimes reduce a term when no iota reductions are available, because some of the inputs are not known. The reason for this is because closed type families are “evaluated” using unification and constraint-solving, rather than ordinary term reduction as would be the case with codes in Coq. Indeed, nearly all of the “type level computation” one may perform in Haskell, is really just constraint solving. Closed type families are not available in a released version of GHC (yet), but there is a Haskell wiki page describing closed type families in more detail.
Open type (synonym) families:
```
(* Not directly supported in Coq *)
```
```
type family F a
type instance F Int = Char
type instance F Char = Int
```
Unlike closed type families, open type families operate under an open universe, and have no analogue in Coq. Open type families do not support nonlinear matching, and must completely unify to reduce. Additionally, there are number of restrictions on the left-hand side and right-hand side of such families in order maintain decidable type inference. The section of the GHC manual Type instance declarations expands on these limitations.

Both closed and type-level families can be used to implement computation at the type-level of data constructors which were lifted to the type-level via promotion. Unfortunately, any such algorithm must be implemented twice: once at the expression level, and once at the type level. Use of metaprogramming can alleviate some of the boilerplate necessary; see, for example, the singletons library.

Dependent function types (Π-types)

A Π-type is a function type whose codomain type can vary depending on the element of the domain to which the function is applied. Haskell does not have Π-types in any meaningful sense. However, if you only want to use a Π-type solely for polymorphism, Haskell does have support. For polymorphism over types (e.g. with type forall a : k, a -> a, where k is a kind), Haskell has a twist:

Definition id : forall (A : Type), A -> A := fun A => fun x => x.

id :: a -> a
id = \x -> x

In particular, the standard notation in Haskell is to omit both the type-lambda (at the expression level) and the quantification (at the type level). The quantification at the type level can be recovered using the explicit universal quantification extension:

id :: forall a. a -> a

However, there is no way to directly explicitly state the type-lambda. When the quantification is not at the top-level, Haskell requires an explicit type signature with the quantification put in the right place. This requires the rank-2 (or rank-n, depending on the nesting) polymorphism extension:

Definition f : (forall A, A -> A) -> bool := fun g => g bool true.

f :: (forall a. a -> a) -> Bool
f g = g True

Polymorphism is also supported at the kind-level using the kind polymorphism extension. However, there is no explicit forall for kind variables; you must simply mention a kind variable in a kind signature.

Proper dependent types cannot be supported directly, but they can be simulated by first promoting data types from the expression level to the type-level. A runtime data-structure called a singleton is then used to refine the result of a runtime pattern-match into type information. This pattern of programming in Haskell is not standard, though there are recent academic papers describing how to employ it. One particularly good one is Hasochism: The Pleasure and Pain of Dependently Typed Haskell Program, by Sam Lindley and Conor McBride.

Product types

Coq supports cartesian product over types, as well as a nullary product type called unit. Very similar constructs are also implemented in the Haskell standard library:

(true, false) : bool * bool
(True, False) :: (Bool, Bool)

tt : unit
() :: ()

Pairs can be destructed using pattern-matching:

match p with
  | (x, y) => ...
end

case p of
  (x, y) -> ...

Red-blooded type theorists may take issue with this identification: in particular, Haskell’s default pair type is what is considered a negative type, as it is lazy in its values. (See more on polarity.) As Coq’s pair is defined inductively, i.e. positively, a more accurate identification would be with a strict pair, defined as data SPair a b = SPair !a !b; i.e. upon construction, both arguments are evaluated. This distinction is difficult to see in Coq, since positive and negative pairs are logically equivalent, and Coq does not distinguish between them. (As a total language, it is indifferent to choice of evaluation strategy.) Furthermore, it's relatively common practice to extract pairs into their lazy variants when doing code extraction.

Dependent pair types (Σ-types)

Dependent pair types are the generalization of product types to be dependent. As before, Σ-types cannot be directly expressed, except in the case where the first component is a type. In this case, there is an encoding trick utilizing data types which can be used to express so-called existential types:

Definition p := exist bool not : { A : Type & A -> bool }

data Ex = forall a. Ex (a -> Bool)
p = Ex not

As was the case with polymorphism, the type argument to the dependent pair is implicit. It can be specified explicitly by way of an appropriately placed type annotation.

Recursion

In Coq, all recursive functions must have a structurally decreasing argument, in order to ensure that all functions terminate. In Haskell, this restriction is lifted for the expression level; as a result, expression level functions may not terminate. At the type-level, by default, Haskell enforces that type level computation is decidable. However, this restriction can be lifted using the UndecidableInstances flag. It is generally believed that undecidable instances cannot be used to cause a violation of type safety, as nonterminating instances would simply cause the compiler to loop infinitely, and due to the fact that in Haskell, types cannot (directly) cause a change in runtime behavior.

Inductive types/Recursive types

In Coq, one has the capacity to define inductive data types. Haskell has a similar-looking mechanism for defining data types, but there are a number of important differences which lead many to avoid using the moniker inductive data types for Haskell data types (although it’s fairly common for Haskellers to use the term anyway.)

Basic types like boolean can be defined with ease in both languages (in all cases, we will use the GADT syntax for Haskell data-types, as it is closer in form to Coq’s syntax and strictly more powerful):

Inductive bool : Type :=
  | true : bool
  | false : bool.

data Bool :: * where
  True :: Bool
  False :: Bool

Both also support recursive occurrences of the type being defined:

Inductive nat : Type :=
  | z : nat
  | s : nat -> nat.

data Nat :: * where
  Z :: Nat
  S :: Nat -> Nat

One has to be careful though: our definition of Nat in Haskell admits one more term: infinity (an infinite chain of successors). This is similar to the situation with products, and stems from the fact that Haskell is lazy.

Haskell’s data types support parameters, but these parameters may only be types, and not values. (Though, recall that data types can be promoted to the type level). Thus, the standard type family of vectors may be defined, assuming an appropriate type-level nat (as usual, explicit forall has been omitted):

Inductive vec (A : Type) : nat -> Type :=
  | vnil  : vec A 0
  | vcons : forall n, A -> vec A n -> vec A (S n)

data Vec :: Nat -> * -> * where
  VNil  :: Vec Z a
  VCons :: a -> Vec n a -> Vec (S n) a

As type-level lambda is not supported but partial application of data types is (in contrast to type families), the order of arguments in the type must be chosen with care. (One could define a type-level flip, but they would not be able to partially apply it.)

Haskell data type definitions do not have the strict positivity requirement, since we are not requiring termination; thus, peculiar data types that would not be allowed in Coq can be written:

data Free f a where
   Free :: f (Free f a) -> Free f a
   Pure :: a -> Free f a

data Mu f where
   Roll :: f (Mu f) -> Mu f

Inference

Coq has support for requesting that a term be inferred by the unification engine, either by placing an underscore in a context or by designating an argument as implicit (how one might implement in Coq the omission of type arguments of polymorphic functions as seen in Haskell). Generally, one cannot expect all inference problems in a dependently typed language to be solvable, and the inner-workings of Coq’s unification engines (plural!) are considered a black art (no worry, as the trusted kernel will verify that the inferred arguments are well-typed).

Haskell as specified in Haskell'98 enjoys principal types and full type inference under Hindley-Milner. However, to recover many of the advanced features enjoyed by Coq, Haskell has added numerous extensions which cannot be easily accomodated by Hindley-Milner, including type-class constraints, multiparameter type classes, GADTs and type families. The current state-of-the-art is an algorithm called OutsideIn(X). With these features, there are no completeness guarantee. However, if the inference algorithm accepts a definition, then that definition has a principal type and that type is the type the algorithm found.

Conclusion

This article started as a joke over in OPLSS'13, where I found myself explaining some of the hairier aspects of Haskell’s type system to Jason Gross, who had internalized Coq before he had learned much Haskell. Its construction was iced for a while, but later I realized that I could pattern the post off of the first chapter of the homotopy type theory book. While I am not sure how useful this document will be for learning Haskell, I think it suggests a very interesting way of mentally organizing many of Haskell’s more intricate type-system features. Are proper dependent types simpler? Hell yes. But it’s also worth thinking about where Haskell goes further than most existing dependently typed languages...

Postscript

Bob Harper complained over Twitter that this post suggested misleading analogies in some situations. I've tried to correct some of his comments, but in some cases I wasn't able to divine the full content of his comments. I invite readers to see if they can answer these questions:

Because of the phase distinction, Haskell’s type families are not actually type families, in the style of Coq, Nuprl or Agda. Why?
This post is confused about the distinction between elaboration (type inference) and semantics (type structure). Where is this confusion?
Quantification over kinds is not the same as quantification over types. Why?

Equality, roughly speaking

In Software Foundations, equality is defined in this way:

Even Coq's equality relation is not built in. It has (roughly) the following inductive definition.
Inductive eq0 {X:Type} : X -> X -> Prop :=
  refl_equal0 : forall x, eq0 x x.

Why the roughly? Well, as it turns out, Coq defines equality a little differently (reformatted to match the Software Foundations presentation):

Inductive eq1 {X:Type} (x:X) : X -> Prop :=
  refl_equal1 : eq1 x x.

What’s the difference? The trick is to look at the induction principles that Coq generates for each of these:

eq0_ind
   : forall (X : Type) (P : X -> X -> Prop),
     (forall x : X, P x x) -> forall y y0 : X, eq0 y y0 -> P y y0

eq1_ind
   : forall (X : Type) (x : X) (P : X -> Prop),
     P x -> forall y : X, eq1 x y -> P y

During our Homotopy Type Theory reading group, Jeremy pointed out that the difference between these two principles is exactly the difference between path induction (eq0) and based path induction (eq1). (This is covered in the Homotopy Type Theory book in section 1.12) So, Coq uses the slightly weirder definition because it happens to be a bit more convenient. (I’m sure this is folklore, but I sure didn’t notice this until now! For more reading, check out this excellent blog post by Dan Licata.)

« Older Newer »