Existential Pontification and Generalized Abstract Digressions

## Tomatoes are a subtype of vegetables

Subtyping is one of those concepts that seems to makes sense when you first learn it (“Sure, convertibles are a subtype of vehicles, because all convertibles are vehicles but not all vehicles are convertibles”) but can quickly become confusing when function types are thrown into the mix. For example, if a is a subtype of b, is (a -> r) -> r a subtype of (b -> r) -> r? (If you know the answer to this question, this blog post is not for you!) When we asked our students this question, invariably some were lead astray. True, you can mechanically work it out using the rules, but what’s the intuition?

Maybe this example will help. Let a be tomatoes, and b be vegetables. a is a subtype of b if we can use an a in any context where we were expecting a b: since tomatoes are (culinary) vegetables, tomatoes are a subtype of vegetables.

What about a -> r? Let r be soup: then we can think of Tomato -> Soup as recipes for tomato soup (taking tomatoes and turning them into soup) and Vegetable -> Soup as recipes for vegetable soup (taking vegetables—any kind of vegetable—and turning them into soup). As a simplifying assumption, let's assume all we care about the result is that it’s soup, and not what type of soup it is.

What is the subtype relationship between these two types of recipes? A vegetable soup recipe is more flexible: you can use it as a recipe to make soup from tomatoes, since tomatoes are just vegetables. But you can’t use a tomato soup recipe on an eggplant. Thus, vegetable soup recipes are a subtype of tomato soup recipes.

This brings us to the final type: (a -> r) -> r. What is (Vegetable -> Soup) -> Soup? Well, imagine the following situation...

One night, Bob calls you up on the phone. He says, “Hey, I’ve got some vegetables left in the fridge, and I know your Dad was a genius when it came to inventing recipes. Do you know if he had a good soup recipe?”

“I don’t know...” you say slowly, “What kind of vegetables?”

“Oh, it’s just vegetables. Look, I’ll pay you back with some soup, just come over with the recipe!” You hear a click on the receiver.

You pore over your Dad’s cookbook and find a tomato soup recipe. Argh! You can’t bring this recipe, because Bob might not actually have tomatoes. As if on cue, the phone rings again. Alice is on the line: “The beef casserole recipe was lovely; I’ve got some tomatoes and was thinking of making some soup with them, do you have a recipe for that too?” Apparently, this happens to you a lot.

“In fact I do!” you turn back to your cookbook, but to your astonishment, you can’t find your tomato soup recipe any more. But you do find a vegetable soup recipe. “Will a vegetable soup recipe work?”

“Sure—I’m not a botanist: to me, tomatoes are vegetables too. Thanks a lot!”

You feel relieved too, because you now have a recipe for Bob as well.

Bob is a person who takes vegetable soup recipes and turns them into soup: he’s (Vegetable -> Soup) -> Soup. Alice, on the other hand, is a person who takes tomato soup recipes and turns them into soup: she’s (Tomato -> Soup) -> Soup. You could give Alice either a tomato soup recipe or a vegetable soup recipe, since you knew she had tomatoes, but Bob’s vague description of the ingredients he had on hand meant you could only bring a recipe that worked on all vegetables. Callers like Alice are easier to accommodate: (Tomato -> Soup) -> Soup is a subtype of (Vegetable -> Soup) -> Soup.

In practice, it is probably faster to formally reason out the subtyping relationship than it is to intuit it out; however, hopefully this scenario has painted a picture of why the rules look the way they do.

• November 14, 2014

This year at ICFP, we had some blockbuster attendance to the Haskell Implementor's Workshop (at times, it was standing room only). I had the pleasure of presenting the work I had done over the summer on Backpack.

You can grab the slides or view the presentation itself (thank you ICFP organizers for being incredibly on-the-ball with videos this year!) The talk intersects a little bit with my blog post A taste of Cabalized Backpack, but there are more pictures, and I also emphasize (perhaps a little too much) the long term direction we are headed in.

There were a lot of really nice talks at HiW. Here are some of my personal highlights:

• September 7, 2014

## Open type families are not modular

One of the major open problems for building a module system in Haskell is the treatment of type classes, which I have discussed previously on this blog. I've noted how the current mode of use in type classes in Haskell assume “global uniqueness”, which is inherently anti-modular; breaking this assumption risks violating the encapsulation of many existing data types.

As if we have a choice.

In fact, our hand is forced by the presence of open type families in Haskell, which are feature many similar properties to type classes, but with the added property that global uniqueness is required for type safety. We don't have a choice (unless we want type classes with associated types to behave differently from type classes): we have to figure out how to reconcile the inherent non-modularity of type families with the Backpack module system.

In this blog post, I want to carefully lay out why open type families are inherently unmodular and propose some solutions for managing this unmodularity. If you know what the problem is, you can skip the first two sections and go straight to the proposed solutions section.

Before we talk about open type family instances, it's first worth emphasizing the (intuitive) fact that a signature of a module is supposed to be able to hide information about its implementation. Here's a simple example:

```module A where
x :: Int

module B where
import A
y = 0
z = x + y
```

Here, A is a signature, while B is a module which imports the signature. One of the points of a module system is that we should be able to type check B with respect to A, without knowing anything about what module we actually use as the implementation. Furthermore, if this type checking succeeds, then for any implementation which provides the interface of A, the combined program should also type check. This should hold even if the implementation of A defines other identifiers not mentioned in the signature:

```module A where
x = 1
y = 2
```

If B had directly imported this implementation, the identifier y would be ambiguous; but the signature filtered out the declarations so that B only sees the identifiers in the signature.

With this in mind, let's now consider the analogous situation with open type families. Assuming that we have some type family F defined in the prelude, we have the same example:

```module A where
type instance F Int
f :: F Bool

module B where
import A
type instance F Bool = Int -> Bool
x = f 2
```

Now, should the following module A be a permissible implementation of the signature?

```module A where
type instance F Int = Int
type instance F Bool = Int
f = 42
```

If we view this example with the glasses off, we might conclude that it is a permissible implementation. After all, the implementation of A provides an extra type instance, yes, but when this happened previously with a (value-level) declaration, it was hidden by the signature.

But if put our glasses on and look at the example as a whole, something bad has happened: we're attempting to use the integer 42 as a function from integers to booleans. The trouble is that F Bool has been given different types in the module A and module B, and this is unsound... like, segfault unsound. And if we think about it some more, this should not be surprising: we already knew it was unsound to have overlapping type families (and eagerly check for this), and signature-style hiding is an easy way to allow overlap to sneak in.

The distressing conclusion: open type families are not modular.

So, what does this mean? Should we throw our hands up and give up giving Haskell a new module system? Obviously, we’re not going to go without a fight. Here are some ways to counter the problem.

### The basic proposal: require all instances in the signature

The simplest and most straightforward way to solve the unsoundness is to require that a signature mention all of the family instances that are transitively exported by the module. So, in our previous example, the implementation of A does not satisfy the signature because it has an instance which is not mentioned in the signature, but would satisfy this signature:

```module A where
type instance F Int
type instance F Bool
```

While at first glance this might not seem too onerous, it's important to note that this requirement is transitive. If A happens to import another module Internal, which itself has its own type family instances, those must be represented in the signature as well. (It's easy to imagine this spinning out of control for type classes, where any of the forty imports at the top of your file may be bringing in any manner of type classes into scope.) There are two major user-visible consequences:

1. Module imports are not an implementation detail—you need to replicate this structure in the signature file, and
2. Adding instances is always a backwards-incompatible change (there is no weakening).

Of course, as Richard pointed out to me, this is already the case for Haskell programs (and you just hoped that adding that one extra instance was "OK").

Despite its unfriendliness, this proposal serves as the basis for the rest of the proposals, which you can conceptualize as trying to characterize, “When can I avoid having to write all of the instances in my signature?”

### Extension 1: The orphan restriction

Suppose that I write the following two modules:

```module A where
data T = T
type instance F T = Bool

module B where
import A
type instance F T = Int -> Int
```

While it is true that these two type instances are overlapping and rightly rejected, they are not equally at fault: in particular, the instance in module B is an orphan. An orphan instance is an instance for type class/family F and data type T (it just needs to occur anywhere on the left-hand side) which lives in a module that defines neither. (A is not an orphan since the instance lives in the same module as the definition of data type T).

What we might wonder is, “If we disallowed all orphan instances, could this rule out the possibility of overlap?” The answer is, “Yes! (...with some technicalities).” Here are the rules:

1. The signature must mention all what we will call ragamuffin instances transitively exported by implementations being considered. An instance of a family F is a ragamuffin if it is not defined with the family definition, or with the type constructor at the head in the first parameter. (Or some specific parameter, decided on a per-family basis.) All orphan instances are ragamuffins, but not all ragamuffins are orphans.
2. A signature exporting a type family must mention all instances which are defined in the same module as the definition of the type family.
3. It is strictly optional to mention non-ragamuffin instances in a signature.

(Aside: I don't think this is the most flexible version of the rule that is safe, but I do believe it is the most straightforward.) The whole point of these rules is to make it impossible to write an overlapping instance, while only requiring local checking when an instance is being written. Why did we need to strengthen the orphan condition into a ragamuffin condition to get this non-overlap? The answer is that absence of orphans does not imply absence of overlap, as this simple example shows:

```module A where
data A = A
type instance F A y = Int

module B where
data B = B
type instance F x B = Bool -> Bool
```

Here, the two instances of F are overlapping, but neither are orphans (since their left-hand sides mention a data type which was defined in the module.) However, the B instance is a ragamuffin instance, because B is not mentioned in the first argument of F. (Of course, it doesn't really matter if you check the first argument or the second argument, as long as you're consistent.)

Another way to think about this rule is that open type family instances are not standalone instances but rather metadata that is associated with a type constructor when it is constructed. In this way, non-ragamuffin type family instances are modular!

A major downside of this technique, however, is that it doesn't really do anything for the legitimate uses of orphan instances in the Haskell ecosystem: when third-parties defined both the type family (or type class) and the data type, and you need the instance for your own purposes.

### Extension 2: Orphan resolution

This proposal is based off of one that Edward Kmett has been floating around, but which I've refined. The motivation is to give a better story for offering the functionality of orphan instances without gunking up the module system. The gist of the proposal is to allow the package manager to selectively enable/disable orphan definitions; however, to properly explain it, I'd like to do first is describe a few situations involving orphan type class instances. (The examples use type classes rather than type families because the use-cases are more clear. If you imagine that the type classes in question have associated types, then the situation is the same as that for open type families.)

The story begins with a third-party library which defined a data type T but did not provide an instance that you needed:

```module Data.Foo where
data Foo = Foo

module MyApp where
import Data.Foo
fooString = show Foo -- XXX no instance for Show
```

If you really need the instance, you might be tempted to just go ahead and define it:

```module MyApp where
import Data.Foo
instance Show Foo where -- orphan
show Foo = "Foo"
fooString = show Foo
```

Later, you upgrade Data.Foo to version 1.0.0, which does define a Show instance, and now your overlapping instance error! Uh oh.

How do we get ourselves out of the mess? A clue is how many package authors currently “get out of jail” by using preprocessor macros:

```{-# LANGUAGE CPP #-}
module MyApp where
import Data.Foo
#if MIN_VERSION_foo(1,0,0)
instance Show Foo where -- orphan
show Foo = "Foo"
#endif
fooString = show Foo
```

Morally, we'd like to hide the orphan instance when the real instance is available: there are two variations of MyApp which we want to transparently switch between: one which defines the orphan instance, and one which does not and uses the non-orphan instance defined in the Data.Foo. The choice depends on which foo was chosen, a decision made by the package manager.

Let's mix things up a little. There is no reason the instance has to be a non-orphan coming from Data.Foo. Another library might have defined its own orphan instance:

```module MyOtherApp where
import Data.Foo
instance Show Foo where ... -- orphan
otherFooString = show Foo

module MyApp where
import Data.Foo
instance Show Foo where ... -- orphan
fooString = show Foo

module Main where
import MyOtherApp
import MyApp
main = print (fooString ++ otherFooString ++ show Foo)
```

It's a bit awful to get this to work with preprocessor macros, but there are two ways we can manually resolve the overlap: we can erase the orphan instance from MyOtherApp, or we can erase the orphan instance from MyApp. A priori, there is no reason to prefer one or the other. However, depending on which one is erased, Main may have to be compiled differently (if the code in the instances is different). Furthermore, we need to setup a new (instance-only) import between the module who defines the instance to the module whose instance was erased.

There are a few takeaways from these examples. First, the most natural way of resolving overlapping orphan instances is to simply “delete” the overlapping instances; however, which instance to delete is a global decision. Second, which overlapping orphan instances are enabled affects compilation: you may need to add module dependencies to be able to compile your modules. Thus, we might imagine that a solution allows us to do both of these, without modifying source code.

Here is the game plan: as before, packages can define orphan instances. However, the list of orphan instances a package defines is part of the metadata of the package, and the instance itself may or may not be used when we actually compile the package (or its dependencies). When we do dependency resolution on a set of packages, we have to consider the set of orphan instances being provided and only enable a set which is non-overlapping, the so called orphan resolution. Furthermore, we need to add an extra dependency from packages whose instances were disabled to the package who is the sole definer of an instance (this might constrain which orphan instance we can actually pick as the canonical instance).

The nice thing about this proposal is that it solves an already existing pain point for type class users, namely defining an orphan type class instance without breaking when upstream adds a proper instance. But you might also think of it as a big hack, and it requires cooperation from the package manager (or some other tool which manages the orphan resolution).

The extensions to the basic proposal are not mutually exclusive, but it's an open question whether or not the complexity they incur are worth the benefits they bring to existing uses of orphan instances. And of course, there may other ways of solving the problem which I have not described here, but this smorgasbord seems to be the most plausible at the moment.

At ICFP, I had an interesting conversation with Derek Dreyer, where he mentioned that when open type families were originally going into GHC, he had warned Simon that they were not going to be modular. With the recent addition of closed type families, many of the major use-cases for open type families stated in the original paper have been superseded. However, even if open type families had never been added to Haskell, we still might have needed to adopt these solutions: the global uniqueness of instances is deeply ingrained in the Haskell community, and even if in some cases we are lax about enforcing this constraint, it doesn't mean we should actively encourage people to break it.

I have a parting remark for the ML community, as type classes make their way in from Haskell: when you do get type classes in your language, don’t make the same mistake as the Haskell community and start using them to enforce invariants in APIs. This way leads to the global uniqueness of instances, and the loss of modularity may be too steep a price to pay.

Postscript. One natural thing to wonder, is if overlapping type family instances are OK if one of the instances “is not externally visible.” Of course, the devil is in the details; what do we mean by external visibility of type family instances of F?

For some definitions of visibility, we can find an equivalent, local transformation which has the same effect. For example, if we never use the instance at all, it certainly OK to have overlap. In that case, it would also have been fine to delete the instance altogether. As another example, we could require that there are no (transitive) mentions of the type family F in the signature of the module. However, eliminating the mention of the type family requires knowing enough parameters and equations to reduce: in which case the type family could have been replaced with a local, closed type family.

One definition that definitely does not work is if F can be mentioned with some unspecified type variables. Here is a function which coerces an Int into a function:

```module A where
type instance F Int = Int
f :: Typeable a => a -> F a
f x = case eqT of
Just Refl -> x :: Int
Nothing -> undefined

module ASig where
f :: Typeable a => a -> F a

module B where
import ASig
type instance F Int = Bool -> Bool
g :: Bool
g = f 0 True -- oops
```

...the point being that, even if a signature doesn't directly mention the overlapping instance F Int, type refinement (usually by some GADT-like structure) can mean that an offending instance can be used internally.

• September 4, 2014

## A taste of Cabalized Backpack

So perhaps you've bought into modules and modularity and want to get to using Backpack straightaway. How can you do it? In this blog post, I want to give a tutorial-style taste of how to program Cabal in the Backpack style. These examples are executable, but you'll have to build custom versions of GHC and Cabal to build them. Comments and suggestions would be much appreciated; while the design here is theoretically well-founded, for obvious reasons, we don't have much on-the-ground programmer feedback yet.

### A simple package in today's Cabal

To start, let's briefly review how Haskell modules and Cabal packages work today. Our running example will be the bytestring package, although I'll inline, simplify and omit definitions to enhance clarity.

Let's suppose that you are writing a library, and you want to use efficient, packed strings for some binary processing you are doing. Fortunately for you, the venerable Don Stewart has already written a bytestring package which implements this functionality for you. This package consists of a few modules: an implementation of strict ByteStrings...

```module Data.ByteString(ByteString, empty, singleton, ...) where
data ByteString = PS !(ForeignPtr Word8) !Int !Int
empty :: ByteString
empty = PS nullForeignPtr 0 0
-- ...
```

...and an implementation of lazy ByteStrings:

```module Data.ByteString.Lazy(ByteString, empty, singleton, ...) where
data ByteString = Empty | Chunk !S.ByteString ByteString
empty :: ByteString
empty = Empty
-- ...
```

These modules are packaged up into a package which is specified using a Cabal file:

```name: bytestring
version: 0.10.4.0
library
build-depends: base >= 4.2 && < 5, ghc-prim, deepseq
exposed-modules: Data.ByteString, Data.ByteString.Lazy, ...
other-modules: ...
```

We can then make a simple module and package which depends on the bytestring package:

```module Utils where
import Data.ByteString.Lazy as B
blank :: IO ()
blank = B.putStr B.empty
```
```name: utilities
version: 0.1
library
build-depends: base, bytestring >= 0.10
exposed-modules: Utils
```

1. It's not possible to switch Utils from using lazy ByteStrings to strict ByteStrings without literally editing the Utils module. And even if you do that, you can't have Utils depending on strict ByteString, and Utils depending on lazy ByteString, in the same program, without copying the entire module text. (This is not too surprising, since the code really is different.)
2. Nevertheless, there is some amount of indirection here: while Utils includes a specific ByteString module, it is unspecified which version of ByteString it will be. If (hypothetically) the bytestring library released a new version where lazy byte-strings were actually strict, the functionality of Utils would change accordingly when the user re-ran dependency resolution.
3. I used a qualified import to refer to identifiers in Data.ByteString.Lazy. This is a pretty common pattern when developing Haskell code: we think of B as an alias to the actual model. Textually, this is also helpful, because it means I only have to edit the import statement to change which ByteString I refer to.

### Generalizing Utils with a signature

To generalize Utils with some Backpack magic, we need to create a signature for ByteString, which specifies what the interface of the module providing ByteStrings is. Here one such signature, which is placed in the file Data/ByteString.hsig inside the utilities package:

```signature Data.ByteString where
import Data.Word
data ByteString
instance Eq ByteString
empty :: ByteString
singleton :: Word8 -> ByteString
putStr :: ByteString -> IO ()
```

The format of a signature is essentially the same of that of an hs-boot file: we have normal Haskell declarations, but omitting the actual implementations of values.

The utilities package now needs a new field to record signatures:

```name: utilities
library
build-depends: base
exposed-modules: Utils
signatures: Data.ByteString
```

Notice that there have been three changes: (1) We've removed the direct dependency on the bytestring package, and (2) we have a new field signatures which simply lists the names of the signature files (also known as holes) that we need filled in.

How do we actually use the utilities package, then? Let's suppose our goal is to produce a new module, Utils.Strict, which is Utils but using strict ByteStrings (which is exported by the bytestring package under the module name Data.ByteString). To do this, we'll need to create a new package:

```name: strict-utilities
library
build-depends: utilities, bytestring
reexported-modules: Utils as Utils.Strict
```

That's it! strict-utilities exports a single module Utils.Strict which is utilities using Data.ByteString from bytestring (which is the strict implementation). This is called a mix-in: in the same dependency list, we simply mix together:

• utilities, which requires a module named Data.ByteString, and
• bytestring, which supplies a module named Data.ByteString.

Cabal automatically figures out that how to instantiate the utilities package by matching together module names. Specifically, the two packages above are connected through the module name Data.ByteString. This makes for a very convenient (and as it turns out, expressive) mode of package instantiation. By the way, reexported-modules is a new (orthogonal) feature which lets us reexport a module from the current package or a dependency to the outside world under a different name. The modules that are exported by the package are the exposed-modules and the reexported-modules. The reason we distinguish them is to make clear which modules have source code in the package (exposed-modules).

Unusually, strict-utilities is a package that contains no code! Its sole purpose is to mix existing packages.

Now, you might be wondering: how do we instantiate utilities with the lazy ByteString implementation? That implementation was put in Data.ByteString.Lazy, so the names don't match up. In this case, we can use another new feature, module thinning and renaming:

```name: lazy-utilities
library
build-depends: utilities, bytestring
backpack-includes:
bytestring (Data.ByteString.Lazy as Data.ByteString)
reexported-modules: Utils as Utils.Lazy
```

The new backpack-includes field says that only the Data.ByteString.Lazy module should brought into scope, under the name Data.ByteString. This is sufficient to mix in link utilities with the lazy implementation of ByteString.

An interesting duality is that you can do the renaming the other way:

```name: lazy-utilities
library
build-depends:
utilities (Utils, Data.ByteString as Data.ByteString.Lazy),
bytestring
```

Instead of renaming the implementation, I renamed the hole! It's equivalent: the thing that matters it that the signature and implementation need to be mixed under the same name in order for linking (the instantiation of the signature with the implementation) to occur.

There are a few things to note about signature usage:

1. If you are using a signature, there's not much point in also specifying an explicit import list when you import it: you are guaranteed to only see types and definitions that are in the signature (modulo type classes... a topic for another day). Signature files act like a type-safe import list which you can share across modules.

2. A signature can, and indeed often must, import other modules. In the type signature for singleton in Data/ByteString.hsig, we needed to refer to a type Word8, so we must bring it into scope by importing Data.Word.

Now, when we compile the signature in the utilities package, we need to know where Data.Word came from. It could have come from another signature, but in this case, it's provided by the definite package base: it's a proper concrete module with an implementation! Signatures can depend on implementations: since we can only refer to types from those modules, we are saying, in effect: any implementation of the singleton function and any representation of the ByteString type is acceptable, but regarding Word8 you must use the specific type from Data.Word in prelude.

3. What happens if, independently of my packages strict-utilities, someone else also instantiatiates utilities with Data.ByteString? Backpack is clever enough to reuse the instantiation of utilities: this property is called applicativity of the module system. The specific rule that we use to decide if the instantiation is the same is to look at how all of the holes needed by a package are instantiated, and if they are instantiated with precisely the same modules, the instantiated packages are considered type equal. So there is no need to actually create strict-utilities or lazy-utilities: you can just instantiate utilities on the fly.

Mini-quiz: What does this package do?

```name: quiz-utilities
library
build-depends:
utilities (Utils, Data.ByteString as B),
bytestring (Data.ByteString.Lazy as B)
```

### Sharing signatures

It's all very nice to be able to explicitly write a signature for Data.ByteString in my package, but this could get old if I have to do this for every single package I depend on. It would be much nicer if I could just put all my signatures in a package and include that when I want to share it. I want all of the Hackage mechanisms to apply to my signatures as well as my normal packages (e.g. versioning). Well, you can!

The author of bytestring can write a bytestring-sig package which contains only signatures:

```name: bytestring-sig
version: 1.0
library
build-depends: base
signatures: Data.ByteString
```

Now, utilities can include this package to indicate its dependence on the signature:

```name: utilities
library
build-depends: base, bytestring-sig-1.0
exposed-modules: Utils
```

Unlike normal dependencies, signature dependencies should be exact: after all, while you might want an upgraded implementation, you don't want the signature to change on you!

We can summarize all of the fields as follows:

1. exposed-modules says that there is a public module defined in this package

System Message: WARNING/2 (<stdin>, line 189)

Enumerated list ends without a blank line; unexpected unindent.

2. other-modules says that there is a private module defined in this package 4. signatures says that there is a public signature defined in this package (there are no private signatures; they are always public, because a signature always must be implemented) 5. reexported-modules says that there is a public module or signature defined in a dependency.

In this list, public means that it is available to clients. Notice the first four fields list all of the source code in this package. Here is a simple example of a client:

```name: utilities-extras
library
build-depends: utilities
exposed-modules: Utils.Extra
```

### Summary

We've covered a lot of ground, but when it comes down to it, Backpack really comes together because of set of orthogonal features which interact in a good way:

1. Module signatures: the heart of a module system, giving us the ability to write indefinite packages and mix together implementations,
2. Module reexports: the ability to take locally available modules and reexport them under a different name, and
3. Module thinning and renaming : the ability to selectively make available modules from a dependency.

To compile a Backpack package, we first run the traditional version dependency solving, getting exact versions for all packages involved, and then we calculate how to link the packages together. That's it! In a future blog post, I plan to more comprehensively describe the semantics of these new features, especially module signatures, which can be subtle at times.

• August 26, 2014

## The fundamental problem of programming language package management

Why are there so many goddamn package managers? They sprawl across both operating systems (apt, yum, pacman, Homebrew) as well as for programming languages (Bundler, Cabal, Composer, CPAN, CRAN, CTAN, EasyInstall, Go Get, Maven, npm, NuGet, OPAM, PEAR, pip, RubyGems, etc etc etc). "It is a truth universally acknowledged that a programming language must be in want of a package manager." What is the fatal attraction of package management that makes programming language after programming language jump off this cliff? Why can't we just, you know, reuse an existing package manager?

You can probably think of a few reasons why trying to use apt to manage your Ruby gems would end in tears. "System and language package managers are completely different! Distributions are vetted, but that's completely unreasonable for most libraries tossed up on GitHub. Distributions move too slowly. Every programming language is different. The different communities don't talk to each other. Distributions install packages globally. I want control over what libraries are used." These reasons are all right, but they are missing the essence of the problem.

The fundamental problem is that programming languages package management is decentralized.

This decentralization starts with the central premise of a package manager: that is, to install software and libraries that would otherwise not be locally available. Even with an idealized, centralized distribution curating the packages, there are still two parties involved: the distribution and the programmer who is building applications locally on top of these libraries. In real life, however, the library ecosystem is further fragmented, composed of packages provided by a huge variety of developers. Sure, the packages may all be uploaded and indexed in one place, but that doesn't mean that any given author knows about any other given package. And then there's what the Perl world calls DarkPAN: the uncountable lines of code which probably exist, but which we have no insight into because they are locked away on proprietary servers and source code repositories. Decentralization can only be avoided when you control absolutely all of the lines of code in your application.. but in that case, you hardly need a package manager, do you? (By the way, my industry friends tell me this is basically mandatory for software projects beyond a certain size, like the Windows operating system or the Google Chrome browser.)

Decentralized systems are hard. Really, really hard. Unless you design your package manager accordingly, your developers will fall into dependency hell. Nor is there a one "right" way to solve this problem: I can identify at least three distinct approaches to the problem among the emerging generation of package managers, each of which has their benefits and downsides.

Pinned versions. Perhaps the most popular school of thought is that developers should aggressively pin package versions; this approach advocated by Ruby's Bundler, PHP's Composer, Python's virtualenv and pip, and generally any package manager which describes itself as inspired by the Ruby/node.js communities (e.g. Java's Gradle, Rust's Cargo). Reproduceability of builds is king: these package managers solve the decentralization problem by simply pretending the ecosystem doesn't exist once you have pinned the versions. The primary benefit of this approach is that you are always in control of the code you are running. Of course, the downside of this approach is that you are always in control of the code you are running. An all-to-common occurrence is for dependencies to be pinned, and then forgotten about, even if there are important security updates to the libraries involved. Keeping bundled dependencies up-to-date requires developer cycles--cycles that more often than not are spent on other things (like new features).

A stable distribution. If bundling requires every individual application developer to spend effort keeping dependencies up-to-date and testing if they keep working with their application, we might wonder if there is a way to centralize this effort. This leads to the second school of thought: to centralize the package repository, creating a blessed distribution of packages which are known to play well together, and which will receive bug fixes and security fixes while maintaining backwards compatibility. In programming languages, this is much less common: the two I am aware of are Anaconda for Python and Stackage for Haskell. But if we look closely, this model is exactly the same as the model of most operating system distributions. As a system administrator, I often recommend my users use libraries that are provided by the operating system as much as possible. They won't take backwards incompatible changes until we do a release upgrade, and at the same time you'll still get bugfixes and security updates for your code. (You won't get the new hotness, but that's essentially contradictory with stability!)

Embracing decentralization. Up until now, both of these approaches have thrown out decentralization, requiring a central authority, either the application developer or the distribution manager, for updates. Is this throwing out the baby with the bathwater? The primary downside of centralization is the huge amount of work it takes to maintain a stable distribution or keep an individual application up-to-date. Furthermore, one might not expect the entirety of the universe to be compatible with one another, but this doesn't stop subsets of packages from being useful together. An ideal decentralized ecosystem distributes the problem of identifying what subsets of packages work across everyone participating in the system. Which brings us to the fundamental, unanswered question of programming languages package management:

How can we create a decentralized package ecosystem that works?

Here are a few things that can help:

1. Stronger encapsulation for dependencies. One of the reasons why dependency hell is so insidious is the dependency of a package is often an inextricable part of its outwards facing API: thus, the choice of a dependency is not a local choice, but rather a global choice which affects the entire application. Of course, if a library uses some library internally, but this choice is entirely an implementation detail, this shouldn't result in any sort of global constraint. Node.js's NPM takes this choice to its logical extreme: by default, it doesn't deduplicate dependencies at all, giving each library its own copy of each of its dependencies. While I'm a little dubious about duplicating everything (it certainly occurs in the Java/Maven ecosystem), I certainly agree that keeping dependency constraints local improves composability.
2. Advancing semantic versioning. In a decentralized system, it's especially important that library writers give accurate information, so that tools and users can make informed decisions. Wishful, invented version ranges and artistic version number bumps simply exacerbate an already hard problem (as I mentioned in my previous post). If you can enforce semantic versioning, or better yet, ditch semantic versions and record the true, type-level dependency on interfaces, our tools can make better choices. The gold standard of information in a decentralized system is, "Is package A compatible with package B", and this information is often difficult (or impossible, for dynamically typed systems) to calculate.
3. Centralization as a special-case. The point of a decentralized system is that every participant can make policy choices which are appropriate for them. This includes maintaining their own central authority, or deferring to someone else's central authority: centralization is a special-case. If we suspect users are going to attempt to create their own, operating system style stable distributions, we need to give them the tools to do so... and make them easy to use!

For a long time, the source control management ecosystem was completely focused on centralized systems. Distributed version control systems such as Git fundamentally changed the landscape: although Git may be more difficult to use than Subversion for a non-technical user, the benefits of decentralization are diverse. The Git of package management doesn't exist yet: if someone tells you that package management is solved, just reimplement Bundler, I entreat you: think about decentralization as well!

• August 21, 2014

## What’s a module system good for anyway?

This summer, I've been working at Microsoft Research implementing Backpack, a module system for Haskell. Interestingly, Backpack is not really a single monolothic feature, but, rather, an agglomeration of small, infrastructural changes which combine together in an interesting way. In this series of blog posts, I want to talk about what these individual features are, as well as how the whole is greater than the sum of the parts.

But first, there's an important question that I need to answer: What's a module system good for anyway? Why should you, an average Haskell programmer, care about such nebulous things as module systems and modularity. At the end of the day, you want your tools to solve specific problems you have, and it is sometimes difficult to understand what problem a module system like Backpack solves. As tomejaguar puts it: "Can someone explain clearly the precise problem that Backpack addresses? I've read the paper and I know the problem is 'modularity' but I fear I am lacking the imagination to really grasp what the issue is."

Look no further. In this blog post, I want to talk concretely about problems Haskellers have today, explain what the underlying causes of these problems are, and say why a module system could help you out.

### The String, Text, ByteString problem

As experienced Haskellers are well aware, there are multitude of string types in Haskell: String, ByteString (both lazy and strict), Text (also both lazy and strict). To make matters worse, there is no one "correct" choice of a string type: different types are appropriate in different cases. String is convenient and native to Haskell'98, but very slow; ByteString is fast but are simply arrays of bytes; Text is slower but Unicode aware.

In an ideal world, a programmer might choose the string representation most appropriate for their application, and write all their code accordingly. However, this is little solace for library writers, who don't know what string type their users are using! What's a library writer to do? There are only a few choices:

1. They "commit" to one particular string representation, leaving users to manually convert from one representation to another when there is a mismatch. Or, more likely, the library writer used the default because it was easy. Examples: base (uses Strings because it completely predates the other representations), diagrams (uses Strings because it doesn't really do heavy string manipulation).
2. They can provide separate functions for each variant, perhaps identically named but placed in separate modules. This pattern is frequently employed to support both strict/lazy variants Text and ByteStringExamples: aeson (providing decode/decodeStrict for lazy/strict ByteString), attoparsec (providing Data.Attoparsec.ByteString/Data.Attoparsec.ByteString.Lazy), lens (providing Data.ByteString.Lazy.Lens/Data.ByteString.Strict.Lens).
3. They can use type-classes to overload functions to work with multiple representations. The particular type class used hugely varies: there is ListLike, which is used by a handful of packages, but a large portion of packages simply roll their own. Examples: SqlValue in HDBC, an internal StringLike in tagsoup, and yet another internal StringLike in web-encodings.

The last two methods have different trade offs. Defining separate functions as in (2) is a straightforward and easy to understand approach, but you are still saying no to modularity: the ability to support multiple string representations. Despite providing implementations for each representation, the user still has to commit to particular representation when they do an import. If they want to change their string representation, they have to go through all of their modules and rename their imports; and if they want to support multiple representations, they'll still have to write separate modules for each of them.

Using type classes (3) to regain modularity may seem like an attractive approach. But this approach has both practical and theoretical problems. First and foremost, how do you choose which methods go into the type class? Ideally, you'd pick a minimal set, from which all other operations could be derived. However, many operations are most efficient when directly implemented, which leads to a bloated type class, and a rough time for other people who have their own string types and need to write their own instances. Second, type classes make your type signatures more ugly String -> String to StringLike s => s -> s and can make type inference more difficult (for example, by introducing ambiguity). Finally, the type class StringLike has a very different character from the type class Monad, which has a minimal set of operations and laws governing their operation. It is difficult (or impossible) to characterize what the laws of an interface like this should be. All-in-all, it's much less pleasant to program against type classes than concrete implementations.

Wouldn't it be nice if I could import String, giving me the type String and operations on it, but then later decide which concrete implementation I want to instantiate it with? This is something a module system can do for you! This Reddit thread describes a number of other situations where an ML-style module would come in handy.

(PS: Why can't you just write a pile of preprocessor macros to swap in the implementation you want? The answer is, "Yes, you can; but how are you going to type check the thing, without trying it against every single implementation?")

### Destructive package reinstalls

Have you ever gotten this error message when attempting to install a new package?

```\$ cabal install hakyll
cabal: The following packages are likely to be broken by the reinstalls:
pandoc-1.9.4.5
Graphalyze-0.14.0.0
Use --force-reinstalls if you want to install anyway.
```

Somehow, Cabal has concluded that the only way to install hakyll is to reinstall some dependency. Here's one situation where a situation like this could come about:

1. pandoc and Graphalyze are compiled against the latest unordered-containers-0.2.5.0, which itself was compiled against the latest hashable-1.2.2.0.
2. hakyll also has a dependency on unordered-containers and hashable, but it has an upper bound restriction on hashable which excludes the latest hashable version. Cabal decides we need to install an old version of hashable, say hashable-0.1.4.5.
3. If hashable-0.1.4.5 is installed, we also need to build unordered-containers against this older version for Hakyll to see consistent types. However, the resulting version is the same as the preexisting version: thus, reinstall!

The root cause of this error an invariant Cabal currently enforces on a package database: there can only be one instance of a package for any given package name and version. In particular, this means that it is not possible to install a package multiple times, compiled against different dependencies. This is a bit troublesome, because sometimes you really do want the same package installed multiple times with different dependencies: as seen above, it may be the only way to fulfill the version bounds of all packages involved. Currently, the only way to work around this problem is to use a Cabal sandbox (or blow away your package database and reinstall everything, which is basically the same thing).

You might be wondering, however, how could a module system possibly help with this? It doesn't... at least, not directly. Rather, nondestructive reinstalls of a package are a critical feature for implementing a module system like Backpack (a package may be installed multiple times with different concrete implementations of modules). Implementing Backpack necessitates fixing this problem, moving Haskell's package management a lot closer to that of Nix's or NPM.

### Version bounds and the neglected PVP

While we're on the subject of cabal-install giving errors, have you ever gotten this error attempting to install a new package?

```\$ cabal install hledger-0.18
Resolving dependencies...
cabal: Could not resolve dependencies:
# pile of output
```

There are a number of possible reasons why this could occur, but usually it's because some of the packages involved have over-constrained version bounds (especially upper bounds), resulting in an unsatisfiable set of constraints. To add insult to injury, often these bounds have no grounding in reality (the package author simply guessed the range) and removing it would result in a working compilation. This situation is so common that Cabal has a flag --allow-newer which lets you override the upper bounds of packages. The annoyance of managing bounds has lead to the development of tools like cabal-bounds, which try to make it less tedious to keep upper bounds up-to-date.

But as much as we like to rag on them, version bounds have a very important function: they prevent you from attempting to compile packages against dependencies which don't work at all! An under-constrained set of version bounds can easily have compiling against a version of the dependency which doesn't type check.

How can a module system help? At the end of the day, version numbers are trying to capture something about the API exported by a package, described by the package versioning policy. But the current state-of-the-art requires a user to manually translate changes to the API into version numbers: an error prone process, even when assisted by various tools. A module system, on the other hand, turns the API into a first-class entity understood by the compiler itself: a module signature. Wouldn't it be great if packages depended upon signatures rather than versions: then you would never have to worry about version numbers being inaccurate with respect to type checking. (Of course, versions would still be useful for recording changes to semantics not seen in the types, but their role here would be secondary in importance.) Some full disclosure is warranted here: I am not going to have this implemented by the end of my internship, but I'm hoping to make some good infrastructural contributions toward it.

### Conclusion

If you skimmed the introduction to the Backpack paper, you might have come away with the impression that Backpack is something about random number generators, recursive linking and applicative semantics. While these are all true "facts" about Backpack, they understate the impact a good module system can have on the day-to-day problems of a working programmer. In this post, I hope I've elucidated some of these problems, even if I haven't convinced you that a module system like Backpack actually goes about solving these problems: that's for the next series of posts. Stay tuned!

• August 9, 2014

## New theme!

Hello loyal readers: Inside 206-105 has a new theme! I’m retiring Manifest, which was a pretty nice theme but (1) the text size was too small and (2) I decided I didn’t really like the fonts, I’ve reskinned my blog with a theme based on Brent Jackson’s Ashley, but ported to work on WordPress. I hope you like it, and please report any rendering snafus you might notice on older pages. Thanks!

• July 26, 2014

## Type classes: confluence, coherence and global uniqueness

Today, I'd like to talk about some of the core design principles behind type classes, a wildly successful feature in Haskell. The discussion here is closely motivated by the work we are doing at MSRC to support type classes in Backpack. While I was doing background reading, I was flummoxed to discover widespread misuse of the terms "confluence" and "coherence" with respect to type classes. So in this blog post, I want to settle the distinction, and propose a new term, "global uniqueness of instances" for the property which people have been colloquially referred to as confluence and coherence.

Let's start with the definitions of the two terms. Confluence is a property that comes from term-rewriting: a set of instances is confluent if, no matter what order constraint solving is performed, GHC will terminate with a canonical set of constraints that must be satisfied for any given use of a type class. In other words, confluence says that we won't conclude that a program doesn't type check just because we swapped in a different constraint solving algorithm.

Confluence's closely related twin is coherence (defined in the paper "Type classes: exploring the design space"). This property states that every different valid typing derivation of a program leads to a resulting program that has the same dynamic semantics. Why could differing typing derivations result in different dynamic semantics? The answer is that context reduction, which picks out type class instances, elaborates into concrete choices of dictionaries in the generated code. Confluence is a prerequisite for coherence, since one can hardly talk about the dynamic semantics of a program that doesn't type check.

So, what is it that people often refer to when they compare Scala type classes to Haskell type classes? I am going to refer to this as global uniqueness of instances, defining to say: in a fully compiled program, for any type, there is at most one instance resolution for a given type class. Languages with local type class instances such as Scala generally do not have this property, but in Haskell, we find this property is a very convenient one when building abstractions like sets.

So, what properties does GHC enforce, in practice? In the absence of any type system extensions, GHC's employs a set of rules to ensure that type class resolution is confluent and coherent. Intuitively, it achieves this by having a very simple constraint solving algorithm (generate wanted constraints and solve wanted constraints) and then requiring the set of instances to be nonoverlapping, ensuring there is only ever one way to solve a wanted constraint. Overlap is a more stringent restriction than either confluence or coherence, and via the OverlappingInstances and IncoherentInstances, GHC allows a user to relax this restriction "if they know what they're doing."

Surprisingly, however, GHC does not enforce global uniqueness of instances. Imported instances are not checked for overlap until we attempt to use them for instance resolution. Consider the following program:

```-- T.hs
data T = T
-- A.hs
import T
instance Eq T where
-- B.hs
import T
instance Eq T where
-- C.hs
import A
import B
```

When compiled with one-shot compilation, C will not report overlapping instances unless we actually attempt to use the Eq instance in C. This is by design: ensuring that there are no overlapping instances eagerly requires eagerly reading all the interface files a module may depend on.

We might summarize these three properties in the following manner. Culturally, the Haskell community expects global uniqueness of instances to hold: the implicit global database of instances should be confluent and coherent. GHC, however, does not enforce uniqueness of instances: instead, it merely guarantees that the subset of the instance database it uses when it compiles any given module is confluent and coherent. GHC does do some tests when an instance is declared to see if it would result in overlap with visible instances, but the check is by no means perfect; truly, type-class constraint resolution has the final word. One mitigating factor is that in the absence of orphan instances, GHC is guaranteed to eagerly notice when the instance database has overlap (assuming that the instance declaration checks actually worked...)

Clearly, the fact that GHC's lazy behavior is surprising to most Haskellers means that the lazy check is mostly good enough: a user is likely to discover overlapping instances one way or another. However, it is relatively simple to construct example programs which violate global uniqueness of instances in an observable way:

```-- A.hs
module A where
data U = X | Y deriving (Eq, Show)

-- B.hs
module B where
import Data.Set
import A

instance Ord U where
compare X X = EQ
compare X Y = LT
compare Y X = GT
compare Y Y = EQ

ins :: U -> Set U -> Set U
ins = insert

-- C.hs
module C where
import Data.Set
import A

instance Ord U where
compare X X = EQ
compare X Y = GT
compare Y X = LT
compare Y Y = EQ

ins' :: U -> Set U -> Set U
ins' = insert

-- D.hs
module Main where
import Data.Set
import A
import B
import C

test :: Set U
test = ins' X \$ ins X \$ ins Y \$ empty

main :: IO ()
main = print test
```
```-- OUTPUT
\$ ghc -Wall -XSafe -fforce-recomp --make D.hs
[1 of 4] Compiling A ( A.hs, A.o )
[2 of 4] Compiling B ( B.hs, B.o )

B.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
[3 of 4] Compiling C ( C.hs, C.o )

C.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
[4 of 4] Compiling Main ( D.hs, D.o )
\$ ./D
fromList [X,Y,X]
```

Locally, all type class resolution was coherent: in the subset of instances each module had visible, type class resolution could be done unambiguously. Furthermore, the types of ins and ins' discharge type class resolution, so that in D when the database is now overlapping, no resolution occurs, so the error is never found.

It is easy to dismiss this example as an implementation wart in GHC, and continue pretending that global uniqueness of instances holds. However, the problem with global uniqueness of instances is that they are inherently nonmodular: you might find yourself unable to compose two components because they accidentally defined the same type class instance, even though these instances are plumbed deep in the implementation details of the components. This is a big problem for Backpack, or really any module system, whose mantra of separate modular development seeks to guarantee that linking will succeed if the library writer and the application writer develop to a common signature.

• July 11, 2014

## Parsec: “try a <|> b” considered harmful

tl;dr The scope of backtracking try should be minimized, usually by placing it inside the definition of a parser.

Have you ever written a Parsec parser and gotten a really uninformative error message?

```"test.txt" (line 15, column 7):
unexpected 'A'
expecting end of input
```

The line and the column are randomly somewhere in your document, and you're pretty sure you should be in the middle of some stack of parser combinators. But wait! Parsec has somehow concluded that the document should be ending immediately. You noodle around and furthermore discover that the true error is some ways after the actually reported line and column.

You think, “No wonder Parsec gets such a bad rep about its error handling.”

Assuming that your grammar in question is not too weird, there is usually a simple explanation for an error message like this: the programmer sprinkled their code with too many backtracking try statements, and the backtracking has destroyed useful error state. In effect, at some point the parser failed for the reason we wanted to report to the user, but an enclosing try statement forced the parser to backtrack and try another (futile possibility).

This can be illustrated by way of an example. A Haskeller is playing around with parse combinators and decides to test out their parsing skills by writing a parser for Haskell module imports:

```stmt ::= import qualified A as B
| import A
```

Piggy-backing off of Parsec’s built in token combinators (and the sample code), their first version might look something like this:

```import Text.Parsec
import qualified Text.Parsec.Token as P

data Stmt = QualifiedImport String String | Import String
deriving (Show)

pStmt = pQualifiedImport <|> pImport

pQualifiedImport = do
reserved "import"
reserved "qualified"
i <- identifier
reserved "as"
i' <- identifier
return (QualifiedImport i i')

pImport = do
reserved "import"
i <- identifier
return (Import i)

{ P.reservedNames = P.reservedNames haskellDef ++ ["qualified", "as"] })
identifier = P.identifier lexer
reserved = P.reserved lexer

parseStmt input = parse (pStmt >> eof) "(unknown)" input
```

Unfortunately, the parser doesn't work for regular imports—they get this error message:

```*Main> parseStmt "import Foo"
Left "(unknown)" (line 1, column 8):
unexpected "F"
expecting "qualified"
```

After a little Googling, they discover that Parsec doesn’t backtrack by default. Well, that’s fine; why not just insert a try into the parser.

```pStmt = try pQualifiedImport <|> pImport
```

This fixes both parses and suggests the following rule for writing future parsers:

If I need choice over multiple parsers, but some of these parsers might consume input, I better tack a try onto each of the parsers, so that I can backtrack.

Unbeknownst to the user, they have introduced bad error reporting behavior:

```*Main> parseStmt "import qualified Foo s B"
Left "(unknown)" (line 1, column 17):
unexpected reserved word "qualified"
expecting letter or digit or "#"
```

Wait a second! The error we wanted was that there was an unexpected identifier s, when we were expecting as. But instead of reporting an error when this occurred, Parsec instead backtracked, and attempted to match the pImport rule, only failing once that rule failed. By then, the knowledge that one of our choice branches failed had been forever lost.

How can we fix it? The problem is that our code backtracks when we, the developer, know it will be futile. In particular, once we have parsed import qualified, we know that the statement is, in fact, a qualified import, and we shouldn’t backtrack anymore. How can we get Parsec to understand this? Simple: reduce the scope of the try backtracking operator:

```pStmt = pQualifiedImport <|> pImport

pQualifiedImport = do
try \$ do
reserved "import"
reserved "qualified"
i <- identifier
reserved "as"
i' <- identifier
return (QualifiedImport i i')
```

Here, we have moved the try from pStmt into pQualifiedImport, and we only backtrack if import qualified fails to parse. Once it parses, we consume those tokens and we are now committed to the choice of a qualified import. The error messages get correspondingly better:

```*Main> parseStmt "import qualified Foo s F"
Left "(unknown)" (line 1, column 22):
unexpected "s"
expecting "as"
```

The moral of the story: The scope of backtracking try should be minimized, usually by placing it inside the definition of a parser. Some amount of cleverness is required: you have to be able to identify how much lookahead is necessary to commit to a branch, which generally depends on how the parser is used. Fortunately, many languages are constructed specifically so that the necessary lookahead is not too large, and for the types of projects I might use Parsec for, I’d be happy to sacrifice this modularity.

Another way of looking at this fiasco is that Parsec is at fault: it shouldn’t offer an API that makes it so easy to mess up error messages—why can’t it automatically figure out what the necessary lookahead is? While a traditional parser generator can achieve this (and improve efficiency by avoiding backtracking altogether in our earlier example), there are some fundamental reasons why Parsec (and monadic parser combinator libraries like it) cannot automatically determine what the lookahead needs to be. This is one of the reasons (among many) why many Haskellers prefer faster parsers which simply don’t try to do any error handling at all.

Why, then, did I write this post in the first place? There is still a substantial amount of documentation recommending the use of Parsec, and a beginning Haskeller is more likely than not going to implement their first parser in Parsec. And if someone is going to write a Parsec parser, you might as well spend a little time to limit your backtracking: it can make working with Parsec parsers a lot more pleasant.

• May 17, 2014

## GHC and mutable arrays: a DIRTY little secret

I've been looking into an issue in a library in which as more mutable arrays are allocated, GC dominates (I think I verified this?) and all code gets slower in proportion to the number of mutable arrays that are hanging around.

...to which I replied:

In the current GC design, mutable arrays of pointers are always placed on the mutable list. The mutable list of generations which are not being collected are always traversed; thus, the number of pointer arrays corresponds to a linear overhead for minor GCs.

If you’re coming from a traditional, imperative language, you might find this very surprising: if you paid linear overhead per GC in Java for all the mutable arrays in your system... well, you probably wouldn't use Java ever, for anything. But most Haskell users seem to get by fine; mostly because Haskell encourages immutability, making it rare for one to need lots of mutable pointer arrays.

Of course, when you do need it, it can be a bit painful. We have a GHC bug tracking the issue, and there is some low hanging fruit (a variant of mutable pointer arrays which has more expensive write operation, but which only gets put on the mutable list when you write to it) as well as some promising directions for how to implement card-marking for the heap, which is the strategy that GCs like the JVM's use.

On a more meta-level, implementing a perfomant generational garbage collector for an immutable language is far, far easier than implementing one for a mutable language. This is my personal hypothesis why Go doesn’t have a generational collector yet, and why GHC has such terrible behavior on certain classes of mutation.

Postscript. The title is a pun on the fact that “DIRTY” is used to describe mutable objects which have been written to since the last GC. These objects are part of the remembered set and must be traversed during garbage collection even if they are in an older generation.

• May 9, 2014