Inside 245-5D

Existential Pontification and Generalized Abstract Digressions

Announcing cabal new-build: Nix-style local builds

cabal new-build, also known as “Nix-style local builds”, is a new command inspired by Nix that comes with cabal-install 1.24. Nix-style local builds combine the best of non-sandboxed and sandboxed Cabal:

  1. Like sandboxed Cabal today, we build sets of independent local packages deterministically and independent of any global state. new-build will never tell you that it can't build your package because it would result in a “dangerous reinstall.” Given a particular state of the Hackage index, your build is completely reproducible. For example, you no longer need to compile packages with profiling ahead of time; just request profiling and new-build will rebuild all its dependencies with profiling automatically.
  2. Like non-sandboxed Cabal today, builds of external packages are cached globally, so that a package can be built once, and then reused anywhere else it is also used. No need to continually rebuild dependencies whenever you make a new sandbox: dependencies which can be shared, are shared.

Nix-style local builds work with all versions of GHC supported by cabal-install 1.24, which currently is GHC 7.0 and later. Additionally, cabal-install is on a different release cycle than GHC, so we plan to be pushing bugfixes and updates on a faster basis than GHC's yearly release cycle.

Although this feature is only in beta (there are bugs, see “Known Issues”, and the documentation is a bit sparse), I’ve been successfully using Nix-style local builds exclusively to do my Haskell development. It's hard to overstate my enthusiasm for this new feature: it “just works”, and you don't need to assume that there is a distribution of blessed, version-pegged packages to build against (e.g., Stackage). Eventually, new-build will simply replace the existing build command.

Quick start

Nix-style local builds “just work”: there is very little configuration that needs to be done to start working with it.

  1. Download and install cabal-install 1.24:

    cabal update
    cabal install cabal-install
    

    Make sure the newly installed cabal is in your path.

  2. To build a single Cabal package, instead of running cabal configure; cabal build, you can use Nix-style builds by prefixing these commands with new-; e.g., cabal new-configure; cabal new-build. cabal new-repl is also supported. (Unfortunately, other commands are not yet supported, e.g. new-clean (#2957) or new-freeze (#2996).)

  3. To build multiple Cabal packages, you need to first create a cabal.project file in some root directory. For example, in the Cabal repository, there is a root directory with a folder per package, e.g., the folders Cabal and cabal-install. Then in cabal.project, specify each folder:

    packages: Cabal/
              cabal-install/
    

    Then, in the directory for a package, you can say cabal new-build to build all of the components in that package; alternately, you can specify a list of targets to build, e.g., package-tests cabal asks to build the package-tests test suite and the cabal executable. A component can be built from any directory; you don't have to be cd'ed into the directory containing the package you want to build. Additionally, you can qualify targets by the package they came from, e.g., Cabal:package-tests asks specifically for the package-tests component from Cabal. There is no need to manually configure a sandbox: add a cabal.project file, and it just works!
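    For example, the target syntax described above looks like this in practice:

    cabal new-build                       # every component of the current package
    cabal new-build package-tests cabal   # a test suite plus an executable
    cabal new-build Cabal:package-tests   # a component qualified by its package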

Unlike sandboxes, there is no need to add-source; just add the package directories to your cabal.project. And unlike traditional cabal install, there is no need to explicitly ask for packages to be installed; new-build will automatically fetch and build dependencies.

There is also a convenient script you can use for hooking up new-build to your Travis builds.

How it works

Nix-style local builds are implemented with these two big ideas:

  1. For external packages (from Hackage), prior to compilation, we take all of the inputs which would influence the compilation of a package (flags, dependency selection, etc.) and hash them into an identifier. Just as in Nix, these hashes uniquely identify the result of a build; if we compute this identifier and we find that we already have this ID built, we can just use the already built version. These packages are stored globally in ~/.cabal/store; you can list all of the Nix packages that are globally available using ghc-pkg list --package-db=$HOME/.cabal/store/ghc-VERSION/package.db.
  2. For local packages, we instead assign an inplace identifier, e.g., foo-0.1-inplace, which is local to a given cabal.project. These packages are stored locally in dist-newstyle/build; you can list all of the per-project packages using ghc-pkg list --package-db=dist-newstyle/packagedb. This treatment applies to any remote packages which depend on local packages (e.g., if you vendored some dependency which your other dependencies depend on.)

Furthermore, Nix local builds use a deterministic dependency solving strategy, by doing dependency solving independently of the locally installed packages. Once we've solved for the versions we want to use and have determined all of the flags that will be used during compilation, we generate identifiers and then check if we can improve packages we would have needed to build into ones that are already in the database.
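To make the first idea concrete, here is a minimal sketch, with hypothetical types and a toy hash (this is not cabal-install's actual code), of how an identifier could be derived from everything that influences a build:

import Data.Bits (xor)
import Data.Char (ord)
import Numeric (showHex)

-- Everything that can influence the compilation of an external package.
data BuildInputs = BuildInputs
  { pkgId      :: String            -- e.g. "cryptohash-0.11.6"
  , flagAssign :: [(String, Bool)]  -- resolved Cabal flags
  , depIds     :: [String]          -- identifiers of the chosen dependencies
  , ghcVersion :: String
  } deriving Show

-- A toy FNV-1a-style hash, standing in for a real cryptographic hash.
hashString :: String -> String
hashString s = showHex ((foldl step fnvOffset s) `mod` (2 ^ (32 :: Int))) ""
  where
    fnvOffset = 2166136261 :: Integer
    step h c  = (h `xor` fromIntegral (ord c)) * 16777619

-- The identifier under which this build result would live in the store.
storeId :: BuildInputs -> String
storeId inputs = pkgId inputs ++ "-" ++ hashString (show inputs)

Because the identifier is a function of the inputs alone, two builds with the same inputs map to the same store entry, which is what makes the global store safe to share between projects.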

Commands

new-configure FLAGS

Overwrites cabal.project.local based on FLAGS.

new-build [FLAGS] [COMPONENTS]

Builds one or more components, automatically building any local and non-local dependencies (where a local dependency is one where we have an inplace source code directory that we may modify during development). Non-local dependencies which do not have a transitive dependency on a local package are installed to ~/.cabal/store, while all other dependencies are installed to dist-newstyle.

The set of local packages is read from cabal.project; if none is present, it assumes a default project consisting of all the Cabal files in the local directory (i.e., packages: *.cabal), and optional packages in every subdirectory (i.e., optional-packages: */*.cabal).
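Written out as a project file, this implicit default is equivalent to:

packages: *.cabal
optional-packages: */*.cabal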

The configuration of the build of local packages is computed by reading flags from the following sources (with later sources taking priority):

  1. ~/.cabal/config
  2. cabal.project
  3. cabal.project.local (usually generated by new-configure)
  4. FLAGS from the command line

The configuration of non-local packages is only affected by package-specific flags in these sources; global options are not applied to the build. (For example, if you --disable-optimization, this will only apply to your local inplace packages, and not their remote dependencies.)

new-build does not read configuration from cabal.config.

Phrasebook

Here is a handy phrasebook for how to do existing Cabal commands using Nix local build:

old-style                              new-style
---------                              ---------
cabal configure                        cabal new-configure
cabal build                            cabal new-build
cabal clean                            rm -rf dist-newstyle cabal.project.local
cabal run EXECUTABLE                   cabal new-build; ./dist-newstyle/build/PACKAGE-VERSION/build/EXECUTABLE/EXECUTABLE
cabal repl                             cabal new-repl
cabal test TEST                        cabal new-build; ./dist-newstyle/build/PACKAGE-VERSION/build/TEST/TEST
cabal benchmark BENCH                  cabal new-build; ./dist-newstyle/build/PACKAGE-VERSION/build/BENCH/BENCH
cabal haddock                          does not exist yet
cabal freeze                           does not exist yet
cabal install --only-dependencies      unnecessary (handled by new-build)
cabal install                          does not exist yet (for libraries, new-build should be sufficient; for executables, they can be found in ~/.cabal/store/ghc-GHCVER/PACKAGE-VERSION-HASH/bin)

cabal.project files

cabal.project files actually support a variety of options beyond packages for configuring the details of your build. Here is a simple example file which displays some of the possibilities:

-- For every subdirectory, build all Cabal files
-- (project files support multiple Cabal files in a directory)
packages: */*.cabal
-- Use this compiler
with-compiler: /opt/ghc/8.0.1/bin/ghc
-- Constrain versions of dependencies in the following way
constraints: cryptohash < 0.11.8
-- Do not build benchmarks for any local packages
benchmarks: False
-- Build with profiling
profiling: true
-- Suppose that you are developing Cabal and cabal-install,
-- and your local copy of Cabal is newer than the
-- distributed hackage-security allows in its bounds: you
-- can selectively relax hackage-security's version bound.
allow-newer: hackage-security:Cabal

-- Settings can be applied per-package
package cryptohash
  -- For the build of cryptohash, instrument all functions
  -- with a cost center (normally, you want this to be
  -- applied on a per-package basis, as otherwise you would
  -- get too much information.)
  profiling-detail: all-functions
  -- Disable optimization for this package
  optimization: False
  -- Pass these flags to GHC when building
  ghc-options: -fno-state-hack

package bytestring
  -- And bytestring will be built with the integer-simple
  -- flag turned off.
  flags: -integer-simple

When you run cabal new-configure, it writes out a cabal.project.local file which saves any extra configuration options from the command line; if you want to know how command line arguments get translated into a cabal.project file, just run new-configure and inspect the output.
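For instance, a run like the following would record the profiling and constraint choices in cabal.project.local (the file contents shown are illustrative, not verbatim output; inspect your own file for the exact format):

cabal new-configure --enable-profiling --constraint="cryptohash < 0.11.8"

-- cabal.project.local (illustrative)
profiling: True
constraints: cryptohash < 0.11.8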

Known issues

As a tech preview, the code is still a little rough around the edges. Here are some of the more major issues you might run into:

  • Although dependency resolution is deterministic, if you update your Hackage index with cabal update, dependency resolution will change too. There is no cabal new-freeze, so you'll have to manually construct the set of desired constraints.
  • A new feature of new-build is that it avoids rebuilding packages when there have been no changes to them, by tracking the hashes of their contents. However, this dependency tracking is not 100% accurate (specifically, it relies on your Cabal file accurately reporting all file dependencies ala sdist, and it doesn't know about search paths). There's currently no UI for forcing a package to be recompiled; however you can induce a recompilation fairly easily by removing an appropriate cache file: specifically, for the package named p-1.0, delete the file dist-newstyle/build/p-1.0/cache/build.
  • On Mac OS X with the Haskell Platform, you may get the message “Warning: The package list for 'hackage.haskell.org' does not exist. Run 'cabal update' to download it.” That is issue #3392; see the linked ticket for workarounds.

If you encounter other bugs, please let us know on Cabal's issue tracker.

  • May 2, 2016

Hindley-Milner with top-level existentials

Content advisory: This is a half-baked research post.

Abstract. Top-level unpacking of existentials is easy to integrate into Hindley-Milner type inference. Haskell should support them. It's possible this idea can work for internal bindings of existentials as well (ala F-ing modules) but I have not worked out how to do it.

Update. And UHC did it first!

Update 2. And rank-2 type inference is decidable (and rank-1 existentials are an even weaker system), although the algorithm for rank-2 inference requires semiunification.

Background

The difference between Hindley-Milner and System F. Although in informal discussion, Hindley-Milner is commonly described as a “type inference algorithm”, it should properly be described as a type system which is more restrictive than System F. Both type systems allow polymorphism via universal quantification of variables, but in System F this polymorphism is explicit and can occur anywhere, whereas in Hindley-Milner the polymorphism is implicit, and can only occur at the “top level” (in a so-called “polytype” or “type scheme.”) This restriction of polymorphism is the key which makes inference (via Algorithm W) for Hindley-Milner decidable (and practical), whereas inference for System F is undecidable.

-- Hindley Milner
id :: a -> a
id = λx. x

-- System F
id :: ∀a. a -> a
id = Λa. λ(x : a). x

Existential types in System F. A common generalization of System F is to equip it with existential types:

Types  τ ::= ... | ∃a. τ
Terms  e ::= ... | pack <τ, e>_τ | unpack <a, x> = e in e

In System F, it is technically not necessary to add existentials as a primitive concept, as they can be encoded using universal quantifiers by saying ∃a. τ = ∀r. (∀a. τ → r) → r.

Existential types in Hindley-Milner? This strategy will not work for Hindley-Milner: the encoding requires a higher-rank type, which is precisely what Hindley-Milner rules out for the sake of inference.

In any case, it is a fool's game to try to infer existential types: there's no best type! HM always infers the most general type for an expression: e.g., we will infer f :: a -> a for the function f = \x -> x, and not Int -> Int. But the whole point of data abstraction is to pick a more abstract type, which is not going to be the most general type and, consequently, is not going to be unique. What should be abstract, what should be concrete? Only the user knows.

Existential types in Haskell. Suppose that we are willing to write down explicit types when existentials are packed; can Hindley-Milner do the rest of the work? That is to say, do we have complete and decidable inference for the rest of the types in our program?

Haskell is an existence (cough cough) proof that this can be made to work. In fact, there are two ways to go about doing it. The first is what you will see if you Google for “Haskell existential type”:

{-# LANGUAGE ExistentialQuantification, RankNTypes #-}
data Ex f = forall a. Ex (f a)
pack :: f a -> Ex f
pack = Ex
unpack :: Ex f -> (forall a. f a -> r) -> r
unpack m k = case m of Ex x -> k x

Ex f is isomorphic to ∃a. f a, and similar to the System F syntax, they can be packed with the Ex constructor and unpacked by pattern-matching on them.
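For example (reusing the definitions above), we can hide the element type of a list; the only way to consume it afterwards is with a function that works at any element type:

-- An existentially quantified list: the element type is hidden.
exList :: Ex []
exList = pack [1, 2, 3 :: Int]

-- length works for any element type, so it can be used on the hidden one.
exLength :: Int
exLength = unpack exList length   -- 3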

The second way is to directly use the System F encoding using Haskell's support for rank-n types:

{-# LANGUAGE RankNTypes #-}
type Ex f = forall r. (forall a. f a -> r) -> r
pack :: f a -> Ex f
pack x = \k -> k x
unpack :: Ex f -> (forall a. f a -> r) -> r
unpack m k = m k

The boxy types paper demonstrated that you can do inference, so long as all of your higher rank types are annotated. Perhaps it was not as simple as hoped, though, since impredicative types are a source of constant bugs in GHC's type checker.

The problem

Explicit unpacks suck. As anyone who has tried programming with existentials in Haskell can attest, the use of existentials can still be quite clumsy due to the necessity of unpacking an existential (casing on it) before it can be used. That is to say, the syntax let Ex x = ... in ... is not allowed, and it is an easy way to get GHC to tell you its brain exploded.

Leijen investigated the problem of handling existentials without explicit unpacks.

Loss of principal types without explicit unpacks, and Leijen's solution. Unfortunately, the naive type system does not have principal types. Leijen gives an example where there is no principal type:

wrap :: forall a. a -> [a]
key  :: exists b. Key b
-- What is the type of 'wrap key'?
-- [exists b. Key b]?
-- exists b. [Key b]?

Neither type is a subtype of the other. In his paper, Leijen suggests that the existential should be unwrapped as late as possible (since you can go from the first type to the second, but not vice versa), and thus, the first type should be preferred.

The solution

A different approach. What if we always lift the existential to the top level? This is really easy to do if you limit unpacks to the top-level of a program, and it turns out this works really well. (The downside is that dynamic use of existentials is not supported.)

There's an existential in every top-level Haskell algebraic data type. First, I want to convince you that this is not all that strange of an idea. To do this, we look at Haskell's support for algebraic data types. Algebraic data types in Haskell are generative: each data type must be given a top-level declaration and is considered a distinct type from any other data type. Indeed, Haskell users use this generativity in conjunction with the ability to hide constructors to achieve data abstraction in Haskell. Although there is not actually an existential lurking about—generativity is not data abstraction—generativity is an essential part of data abstraction, and HM has no problem with this.

Top-level generativity corresponds to existentials that are unpacked at the top-level of a program (ala F-ing modules). We don't need existentials embedded inside our Haskell expressions to support the generativity of algebraic data types: all we need is the ability to pack an existential type at the top level, and then immediately unpack it into the top-level context. In fact, F-ing modules goes even further: existentials can always be lifted until they reach the top level of the program. Modular programming with applicative functors (the ML kind) can be encoded using top-level existentials which are immediately unpacked as they are defined.

The proposal. So let us suggest the following type system, Hindley-Milner with top-level existentials (where a* denotes zero to many type variables):

Term variables ∈ f, x, y, z
Type variables ∈ a, b, c

Programs
prog ::= let f = e in prog
       | seal (b*, f :: σ) = (τ*, e) in prog
       | {- empty program -}

Type schemes (polytypes)
σ ::= ∀a*. τ

Expressions
e ::= x
    | \x -> e
    | e e

Monotypes
τ ::= a
    | τ -> τ

There is one new top-level binding form, seal. We can give it the following typing rule:

Γ ⊢ e :: τ₀[b* → τ*]
a* = free-vars(τ₀[b* → τ*])
Γ, b*, (f :: ∀a*. τ₀) ⊢ prog
---------------------------------------------
Γ ⊢ seal (b*, f :: ∀a*. τ₀) = (τ*, e) in prog

It also elaborates directly to System F with existentials:

seal (b*, f :: σ) = (τ*, e) in prog
  ===>
unpack <b*, f> = pack <τ*, e>_{∃b*. σ} in prog
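To make this concrete, here is one made-up instance of the rule (extending the monotypes with a base type Int and the expressions with integer literals, which the minimal grammar above omits): we seal a function f whose result type b is kept abstract.

seal (b, f :: ∀a. a -> b) = (Int, \x -> 0) in prog
  ===>
unpack <b, f> = pack <Int, \x -> 0>_{∃b. ∀a. a -> b} in prog

Inside prog, f has type ∀a. a -> b for an abstract b; the fact that b was instantiated at Int is not visible.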

A few observations:

  1. In conventional presentations of HM, let-bindings are allowed to be nested inside expressions (and are generalized to polytypes before being added to the context). Can we do something similar with seal? This should be possible, but the bound existential type variables must be propagated up.
  2. This leads to a second problem: naively, the order of quantifiers must be ∃b. ∀a. τ and not ∀a. ∃b. τ, because otherwise we cannot add the existential to the top-level context. However, there is a "skolemization" trick (c.f. Shao and F-ing modules) by which you can just make b a higher-kinded type variable which takes a as an argument, e.g., ∀a. ∃b. b is equivalent to ∃b'. ∀a. b' a. This trick could serve as the way to support inner seal bindings, but the encoding tends to be quite involved (as you must close over the entire environment.)
  3. This rule is not very useful for directly modeling ML modules, as a “module” is usually thought of as a record of polymorphic functions. Maybe you could generalize this rule to bind multiple polymorphic functions?

Conclusion. And that's as far as I've worked it out. I am hoping someone can tell me (1) who came up with this idea already, and (2) why it doesn't work.

  • April 24, 2016

ghc-shake: Reimplementing ghc --make

ghc --make is a useful mode in GHC which automatically determines what modules need to be compiled and compiles them for you. Not only is it a convenient way of building Haskell projects, its single-threaded performance is good too, by reusing the work of reading and deserializing external interface files. However, there are a number of downsides to ghc --make:

  1. Projects with large module graphs have a hefty latency before recompilation begins. This is because ghc --make (re)computes the full module graph, parsing each source file's header, before actually doing any work. If you have a preprocessor, it's even worse.
  2. It's a monolithic build system, which makes it hard to integrate with other build systems if you need something more fancy than what GHC knows how to do. (For example, GHC's painstakingly crafted build system knows how to build in parallel across package boundaries, which Cabal has no idea how to do.)
  3. It doesn't give you any insight into the performance of your build, e.g. what modules take a long time to build or what the big "blocker" modules are.

ghc-shake is a reimplementation of ghc --make using the Shake build system. It is a drop-in replacement for ghc. ghc-shake sports the following features:

  1. Greatly reduced latency to recompile. This is because Shake does not recompute the module graph by parsing the header of every file; it reuses cached information and only re-parses source files which have changed.
  2. If a file is rebuilt (and its timestamp updated) but the build output has not changed, we don't bother recompiling anything that depended on it. This is in contrast to ghc --make, which has to run the recompilation check on every downstream module before concluding there is nothing to do. In fact, ghc-shake never runs the recompilation test, because we reimplemented this dependency structure natively in Shake.
  3. Using -ffrontend-opt=--profile, you can get nice profiling information about your build, including how long it took to build each module, and how expensive it is to change one of the modules.
  4. It's as fast as ghc --make on single-threaded builds. Compare this to ghc-make, another build system which uses Shake to build Haskell. ghc-make does not use the GHC API and must use the (slow) ghc -M to get initial dependency information about your project.
  5. It's accurate. It handles many edge-cases (like -dynamic-too) correctly, and because it is written using the GHC API, it can in principle be feature-for-feature compatible with ghc --make. (It's not currently, but only because I haven't implemented them yet.)

There are some downsides:

  1. Shake build systems require a .shake directory to actually store metadata about the build. This is in contrast to ghc --make, which operates entirely off of the timestamps of build products in your directory.
  2. Because it is directly implemented with the GHC API, it only works with a specific version of GHC (the upcoming GHC 8.0 release).
  3. It needs a patched version of the Shake library, because we have a custom rule for building modules based off of Shake's (not exported) file representation. I've reported it here.
  4. There are still some missing features and bugs. The ones I've run into are that (1) we forget to relink in some cases, and (2) it doesn't work for building profiled code.

If you want to use ghc-shake today (not for the faint of heart), try git clone https://github.com/ezyang/ghc-shake and follow the instructions in the README. But even if you're not interested in using it, I think the code of ghc-shake has some good lessons for anyone who wants to write a build system involving Haskell code. One of the most important architectural decisions was to make the rules in ghc-shake not be organized around output files (e.g. dist/build/Data/Foo.hi, as in make) but around Haskell modules (e.g. Data.Foo). Semantic build systems work a lot better than forcing everything into a "file abstraction" (although Shake doesn't quite support this mode of use as well as I would like.) There were some other interesting lessons... but that should be the subject for another blog post!

Where is this project headed? There are a few things I'm considering doing in the not-so-immediate future:

  1. To support multiple GHC versions, we should factor out the GHC specific code into a separate executable and communicate over IPC (hat tip Duncan Coutts). This would also allow us to support separate-process parallel GHC builds which still get to reuse read interface files. In any case, ghc-shake could serve as the blueprint for what information GHC needs to make more easily accessible to build systems.
  2. We could consider moving this code back to GHC. Unfortunately, Shake is a bit too big of a dependency to actually have GHC depend on, but it may be possible to design some abstract interface (hello Backpack!) which represents a Shake-style build system, and then have GHC ship with a simple implementation for --make (but let users swap it out for Shake if they like.)
  3. We can extend this code beyond ghc --make to understand how to build entire Cabal projects (or bigger), ala ToolCabal, a reimplementation of Cabal using Shake. This would let us capture patterns like GHC's build system, which can build modules from all the boot packages in parallel (without waiting for the package to completely finish building first).

P.S. ghc-shake is not to be confused with shaking-up-ghc, which is a project to replace GHC's Makefile-based build system with a Shake based build system.

  • January 7, 2016

The convergence of compilers, build systems and package managers

Abstract. The traditional abstraction barriers between compiler, build system and package manager are increasingly ill-suited for IDEs, parallel build systems, and modern source code organization. Recent compilers like go and rustc are equipped with fully-fledged build systems; semantic build systems like Bazel and Gradle also expect to manage the packaging of software. Does this mean we should jettison these abstraction barriers? It seems worthwhile to look for new interfaces which can accommodate these use-cases.

Traditionally, one can understand the tooling of a programming language in three parts:

  • The compiler takes a single source file and transforms it into an object file. (Examples: ghc -c, go tool 6g, javac and gcc -c.)
  • The build system takes a collection of source files (and metadata) and transforms them into the final build product. It does this by invoking the compiler multiple times. (Examples: go build, Setup build, make, ant compile.) Often, the build system also knows how to install the build product in question.
  • The package manager takes a package name, and retrieves and builds the package and its dependencies, and installs them into some store. It does this by invoking the build systems of each package. (Examples: cabal install, cargo install, maven package.)

This separation constitutes an abstraction barrier which allows these components to be separately provided. For example, a single build system can work with multiple different compilers (gcc versus clang); conversely, a compiler may be invoked from a user's custom build system. A library may be packaged for both its native language package manager as well as a Linux distribution's packaging system; conversely, a package manager may be indifferent to how a library actually gets built. In today's software ecosystem, these abstraction barriers are used heavily, with good effect!

However, there are an increasing number of use-cases which cannot be adequately handled using these abstraction barriers:

  • A build system needs to know what order to build source files in; however, the canonical source for this information is inside the import/include declarations of the source file. This information must either be duplicated inside the build system, or the build system must call the compiler in order to compute the dependency graph to be used. In any case, a compiler cannot just be a dumb source-to-object-file converter: it must know how to emit dependencies of files (e.g., gcc -M). There is no standardized format for this information, except perhaps a Makefile stub.
  • The dependency problem is further exacerbated when module dependencies can be cyclic. A build system must know how to resolve cycles, either by compiling strongly connected components of modules at a time, or compiling against "interface" files, which permit separate compilation. This was one of the problems which motivated the Rust developers to not expose a one-source-at-a-time compiler.
  • The best parallelization can be achieved with a fine-grained dependency graph over source files. However, the most desirable place to implement parallelization is the package manager, as an invocation of the package manager will cause the most code to be compiled. Thus, a system like Bazel unifies both the build system and the package manager, so that parallelism can be achieved over the entire build. (Another example is GHC's build system, which parallelizes compilation of all wired-in packages on a per-module basis.)
  • IDEs want in-depth information from the compiler beyond a -c style interface. But they cannot invoke the compiler directly, because the only way to invoke the compiler with the right flags and the right environment is via the build system / the package manager. Go's built-in build system means that it can more easily provide a tool like go oracle; otherwise, go oracle would need to be able to accommodate external build systems.
  • Certain language features are actively hostile to build systems; only the compiler has enough smarts to figure out how to manage the build. Good examples include macros (especially macros that can access the filesystem), other forms of compile-time metaprogramming, and compiler plugins.

Thus, the temptation is to roll up these components into a single monolithic tool that does everything. There are many benefits: a single tool is easier to develop, gives a more uniform user experience, and doesn't require the developers to specify a well-defined API between the different components. The downside? You can't swap out pieces of a monolithic system.

I think it is well worth considering how we can preserve this separation of concerns, even in the face of these features. Unfortunately, I don't know what the correct API is, but here is a strawman proposal: every compiler and build system writer should have an alternative mode which lets a user ask the query, "How do I make $output file?" This mode returns (1) the dependencies of that file, and (2) a recipe for how to make it. The idea is to place the dependency-finding logic in the compiler (the canonical place to put it), while letting an external tool actually handle building the dependencies. But there are a lot of details this proposal doesn't cover.
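As a sketch of what that query mode might return, one could imagine an interface along these lines (hypothetical types, not any existing compiler's API):

-- The answer to "How do I make this output file?"
data BuildRecipe = BuildRecipe
  { output       :: FilePath    -- the file being asked about
  , dependencies :: [FilePath]  -- files that must be up to date first
  , recipe       :: [String]    -- command line(s) that produce the output
  } deriving Show

An external build system or package manager would query the compiler for each output, assemble the dependency graph from the answers, and schedule the recipes itself (in parallel, with caching, and so on).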

What do you think about the convergence of compiler, build system and package manager? Do you think they should be monolithic? If not, what do you think the right API to support these new use cases should be? I'd love to know what you think.

  • December 7, 2015

What is Stateless User Interface?

The essence of stateless user interface is that actions you take with a program should not depend on implicit state. Stateless interfaces are easier to understand, because an invocation of a command with some arguments will always do the same thing, whereas in a stateful interface, the command may do something different than it did yesterday, because that implicit state has changed and is influencing the meaning of your program.

This philosophy is something any Haskeller should intuitively grasp... but Cabal and cabal-install today fail this ideal. Here are some examples of statefulness in Cabal today:

  1. Running cabal install, the built packages are installed into a "package database", which makes them available for use by GHC.
  2. Running cabal install, the choice of what packages and versions to install depends on the state of the local package database (the current solver attempts to reuse already installed software) and the state of the remote package repository (which says what packages and versions are available.)
  3. Running ./Setup configure saves a LocalBuildInfo to dist/setup-config, which influences further Setup commands (build, register, etc.)

Each of these instances of state imposes complexity on users: how many times do you think you have (1) blown away your local package database because it was irreversibly wedged, (2) had your project stop building because the dependency solver started picking versions of packages that were too new, or (3) had Cabal ask you to reconfigure because some feature wasn't enabled?

State has cost, but it is not here for no reason:

  1. The package database exists because we don't want to have to rebuild all of our packages from scratch every time we want to build something (indeed, this is the whole point of a package manager);
  2. The solver depends on the local package database because users are impatient and want to avoid building new versions of packages before they can build their software;
  3. The solver depends on the remote package repository because developers and users are impatient and want to get new releases to users as quickly as possible;
  4. The configure step caches its information because a user doesn't want to wait to reconfigure the package every time they try to build a package they're working on.

In the face of what is seemingly an inherently stateful problem domain, can stateless user interface prevail? Of course it can.

Sometimes the state is merely being used as a cache. If a cache is blown away, everything should still work, just more slowly. The package database (reason 1) and configuration cache (reason 4) both fall under this banner, but the critical mistake today's Cabal makes is that if you delete this information, things do not "just work". There must be sufficient information to rebuild the cache; e.g., the configuration cache should be supplemented with the actual input to the configure step. (Sometimes, separation of concerns means you simply cannot achieve this. What is ghc to do if you ask it to use the not-in-cache lens package?) Furthermore, the behavior of a system should not vary depending on whether or not the cached data is present; e.g., the solver (reason 2) should not make different (semantically meaningful) decisions based on what is cached or not.

Otherwise, it must be possible to explicitly manage the state in question: if the state is a remote package repository (reason 3), there must be a way to pin against some state. (There's a tool that does this and it's called Stack.) While sometimes necessary, explicit state complicates interface and makes it harder to describe what the system can do. Preferably, this state should be kept as small and as centralized as possible.

I don't think anything I've said here is particularly subtle. But it is something that you need to specifically think about; otherwise, you will be seduced by the snare of stateful interface. But if you refuse the siren call and put on the hair shirt, your users will thank you much more for it.

Acknowledgments. These thoughts are not my own: I have to give thanks to Johan Tibell, Duncan Coutts, and Simon Marlow, for discussions which communicated this understanding to me. Any mistakes in this article are my own. This is not a call to action: the Cabal developers recognize and are trying to fix this, see this hackathon wiki page for some mutterings on the subject. But I've not seen this philosophy written out explicitly anywhere on the Internet, and so I write it here for you.

  • November 27, 2015

Is no-reinstall Cabal coming to GHC 8.0?

You might be wondering: with the beta release of no-reinstall Cabal, is this functionality coming to GHC 8.0? (Or even a new release of Cabal, since the no-reinstall feature works with GHC 7.10). Unfortunately, there is a split among the Cabal developers over whether or not the actual no-reinstall behavior should go into Cabal by default as is. Duncan Coutts, in particular, has argued that it's a bad idea to enable no-reinstall without other (unimplemented) changes to Cabal. Since the extra needed changes are not fully implemented yet, it's unclear if Duncan will manage to finish them in time for GHC 8.0.

I've heard a smattering of feedback that no-reinstall Cabal actually is working just fine for people, so I suspect many people would be in favor of just biting the bullet and putting in the "good" (but not "best") solution into Cabal. But I want to foster an informed discussion, so I'd like to explain what the (known) problems with no-reinstall are.

What is no reinstall?

Currently, GHC and Cabal maintain an invariant in the installed package database that for any package name and version (i.e. (source) package ID), there is at most one matching package in the database:

/img/cabal-status/reinstall-old.png

The arrows indicate a "depends on" relationship: so if you have a database that has bar-0.1, bar-0.2 and an instance of foo-0.1 built against bar-0.1, you aren't allowed to install another instance of foo-0.1 built against bar-0.2 (though you are allowed to install foo-0.2 built against bar-0.2).

If cabal-install wants to install a package with the same package ID as a package already in the database, but with different dependencies, it must destructively overwrite the previous entry to maintain this invariant, pictured below:

/img/cabal-status/destructive-reinstall.png

No reinstall relaxes this invariant, so that "reinstalling" a package with different dependencies just works:

/img/cabal-status/no-reinstall.png

The recently beta released no-reinstall Cabal achieves this with two small features. First, in GHC 7.10, we added the flag --enable-multi-instance to ghc-pkg which makes ghc-pkg no longer error if you attempt to add multiple copies of the same package in the database. Second, in Vishal Agrawal's patchset for Cabal, cabal-install is modified to use this flag, so that the dependency solver no longer has to avoid reinstalls.

However, breaking this invariant has consequences. Let's look at some of them.

Problem 1: It doesn't work on old versions of GHC

Summary: In GHC 7.8 and earlier, it's not possible to directly implement no reinstall (because ghc-pkg will reject it.) And even if it were possible, installing a new instance of a package (which has the same source package ID of an existing package) either (1) causes the old package and all of its dependents to become hidden from the default view of GHC, even though they are still usable, or (2) fails to be exposed in the default view of GHC.

Suppose that a package foo-0.1, which defines a type Foo, has been compiled twice with different versions of its dependencies:

/img/cabal-status/ghc-7.8-id.png

GHC 7.8 could not distinguish between two compilations of the package: symbols from both packages would live in the foo-0.1 namespace, and colliding symbols would simply be considered the same. Disaster! To avoid this situation, GHC has a shadowing algorithm which removes incompatible packages from its visible set. Here is an example:

/img/cabal-status/ghci-shadowing.png

We have two package databases, the user database and the global database, laid side-to-side (the user database is "on top"). When there is a conflicting package ID in the combined database, GHC prefers the package from the topmost database: thus, in our example the global foo-0.1 is shadowed (any packages which transitively have it as a dependency are also shadowed). When a package is shadowed, it doesn't exist at all to GHC: GHC will not suggest it or make any mention it exists.

No reinstall requires us to allow these duplicate packages in the same database! In this case, GHC will apply shadowing; however, it is not well-defined which package should be shadowed. If GHC chooses to shadow the old package, they "vanish" from GHC's default view (it is as if they do not exist at all); if GHC chooses to shadow the new package, a package that a user just cabal-install'd will be mysteriously absent! Troublesome.

Problem 2: Using multiple instances of the same package is confusing

Summary: In GHC 7.10 or later, multiple instances of the same package may be used together in the same GHC/GHCi session, which can result in confusing type inequalities.

In GHC 7.10, we now use "package keys" to test for type identity. A package key is a source package ID augmented with a hash of the package keys of all the dependencies. This means that GHC no longer needs to apply shadowing for soundness, and you can register duplicates of a package using the --enable-multi-instance flag on ghc-pkg.

/img/cabal-status/ghc-7.10-id.png

However, this can still result in confusing behavior. Consider the previous example in GHC 7.10:

/img/cabal-status/ghci-multi-inst.png

Both versions of foo are visible, and so if we try to import Foo, GHC will complain that it doesn't know which Foo we want. This can be fixed by hiding one package or the other. However, suppose that both baz and qux are exposed, and furthermore, they both export a value foo which has type Foo. These types are "distinct", despite the fact that they are: (1) both named Foo, and (2) come from a package named foo-0.1: they are two different instances of foo-0.1. Confusing!

Problem 3: Nix hashing non-sdist'ed packages is difficult

It is easy to "trick" Cabal into hashing a set of source files which is not representative of the true input of the build system: for example, you can omit files in the other-modules field, or you can modify files in between the time Cabal has computed the source hash and the time it builds the files. And if you can't trust the Nix hash, you now have to worry about what happens when you really need to clobber an old entry in the Nix database (which incorrectly has the "same" hash as what you are attempting to install).

This problem doesn't exist for tarballs downloaded from Hackage, because you can simply hash the tarball and that is guaranteed to be the full set of source for building the file.

Duncan's patchset

To deal with these problems, Duncan has been working on a bigger patchset, with the following properties:

  1. To support old versions of GHC, he is maintaining a separate "default view" package database (which is used by bare invocations of GHC and GHCi) from the actual "Nix store" package database. cabal-install is responsible for maintaining a consistent default view, but also installs everything into the Nix store database.
  2. Nix-style hashing is only done on Hackage packages; local source trees are to be built and installed only into a sandbox database, but never the global database. Thus, an actual Nix hash is only ever computed by cabal-install.
  3. He also wants to make it so that cabal-install's install plan doesn't depend on the local state of the Nix database: it should give the same plan no matter what you have installed previously. This is done by dependency resolving without any reference to the Nix database, and then once IPIDs are calculated for each package, checking to see if they are already built. This plan would also make it possible to support cabal install --enable-profiling without having to blow away and rebuild your entire package database.

Vishal's patchset

Vishal was also cognizant of the problems with the default view of the package database, and he worked on some patches to GHC for support for modifying package environments, which would serve a similar role to Duncan's extra package databases. Unfortunately, these patches have been stuck in code review for a bit now, and they wouldn't help users of old versions of GHC. While the code review process for these patches may get unstuck in the near future, I'm hesitant to place any bets on these changes landing.

Conclusion

My view is that, historically, problems one and two have been the big stated reasons why "no reinstall", while being a simple change, hasn't been added to Cabal as the default mode of operation. However, there's been rising sentiment (I think I can safely cite Simon Marlow in this respect) among some that these problems are overstated, and that we should bite the bullet.

If we want to turn on "no reinstall" before Duncan finishes his patchset (whenever that will be—or maybe someone else will finish it), I think there will need to be some concerted effort to show that these problems are a small price to pay for no reinstall Cabal, and that the Haskell community is willing to pay... at least, until a better implementation comes around.

  • September 18, 2015

Help us beta test “no-reinstall Cabal”

Over this summer, Vishal Agrawal has been working on a GSoC project to move Cabal to a more Nix-like package management system. More simply, he is working to make it so that you'll never get one of these errors from cabal-install again:

Resolving dependencies...
In order, the following would be installed:
directory-1.2.1.0 (reinstall) changes: time-1.4.2 -> 1.5
process-1.2.1.0 (reinstall)
extra-1.0 (new package)
cabal: The following packages are likely to be broken by the reinstalls:
process-1.2.0.0
hoogle-4.2.35
haskell98-2.0.0.3
ghc-7.8.3
Cabal-1.22.0.0
...

However, these patches change a nontrivial number of moving parts in Cabal and cabal-install, so it would be very helpful to have willing guinea pigs to help us iron out some bugs before we merge it into Cabal HEAD. As your prize, you'll get to run "no-reinstall Cabal": Cabal should never tell you it can't install a package because some reinstalls would be necessary.

Here's how you can help:

  1. Make sure you're running GHC 7.10. Earlier versions of GHC have a hard limitation that doesn't allow you to reinstall a package multiple times against different dependencies. (Actually, it would be useful if you also test with older versions like GHC 7.8, but mostly just to make sure we haven't introduced any regressions here.)
  2. git clone https://github.com/ezyang/cabal.git (I've added some extra corrective patches on top of Vishal's version in the course of my testing) and git checkout cabal-no-pks.
  3. In the Cabal and cabal-install directories, run cabal install.
  4. Try building things without a sandbox and see what happens! (When I test, I've tried installing multiple versions of Yesod at the same time.)
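Concretely, the steps above amount to something like this (directory names assumed; adjust as needed):

    git clone https://github.com/ezyang/cabal.git
    cd cabal
    git checkout cabal-no-pks
    (cd Cabal && cabal install)
    (cd cabal-install && cabal install)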

It is NOT necessary to clear your package database before testing. If you completely break your Haskell installation (unlikely, but could happen), you can do the old trick of clearing out your .ghc and .cabal directories (don't forget to save your .cabal/config file) and rebootstrapping with an old cabal-install.

Please report problems here, or to this PR in the Cabal tracker. Or chat with me in person next week at ICFP. :)

  • August 29, 2015

Ubuntu Vivid upgrade (Xmonad)

Another half year, another Ubuntu upgrade. This upgrade went essentially smoothly: the only things that stopped working were my xbindkeys bindings for volume and suspend, which were easy to fix.

Volume up and down

If you previously had:

#Volume Up
"pactl set-sink-volume 0 -- +5%"
    m:0x10 + c:123
    Mod2 + XF86AudioRaiseVolume

this syntax no longer works: you must place the double dash earlier in the command, as so:

#Volume Up
"pactl -- set-sink-volume 0 +5%"
    m:0x10 + c:123
    Mod2 + XF86AudioRaiseVolume

Do the same for volume down.

Suspend

If you previously had:

#Sleep
"dbus-send --system --print-reply --dest="org.freedesktop.UPower" /org/freedesktop/UPower org.freedesktop.UPower.Suspend"
     m:0x10 + c:150
     Mod2 + XF86Sleep

UPower no longer handles suspend; you have to send the command to login:

#Sleep
"dbus-send --system --print-reply --dest=org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager.Suspend boolean:true"
    m:0x10 + c:150
    Mod2 + XF86Sleep
  • May 29, 2015

Width-adaptive XMonad layout

My usual laptop setup is I have a wide monitor, and then I use my laptop screen as a secondary monitor. For a long time, I had two XMonad layouts: one full screen layout for my laptop monitor (I use big fonts to go easy on the eyes) and a two-column layout when I'm on the big screen.

But I had an irritating problem: if I switched a workspace from the small screen to the big screen, XMonad would still be using the full screen layout, and I would have to Alt-Tab my way into the two column layout. To add insult to injury, if I moved it back, I'd have to Alt-Tab once again.

After badgering the fine folks on #xmonad, I finally wrote an extension to automatically switch layout based on screen size! Here it is:

{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

-----------------------------------------------------------------------------
-- |
-- Module      :  XMonad.Layout.PerScreen
-- Copyright   :  (c) Edward Z. Yang
-- License     :  BSD-style (see LICENSE)
--
-- Maintainer  :  <ezyang@cs.stanford.edu>
-- Stability   :  unstable
-- Portability :  unportable
--
-- Configure layouts based on the width of your screen; use your
-- favorite multi-column layout for wide screens and a full-screen
-- layout for small ones.
-----------------------------------------------------------------------------

module XMonad.Layout.PerScreen
    ( -- * Usage
      -- $usage
      PerScreen,
      ifWider
    ) where

import XMonad
import qualified XMonad.StackSet as W

import Data.Maybe (fromMaybe)

-- $usage
-- You can use this module by importing it into your ~\/.xmonad\/xmonad.hs file:
--
-- > import XMonad.Layout.PerScreen
--
-- and modifying your layoutHook as follows (for example):
--
-- > layoutHook = ifWider 1280 (Tall 1 (3/100) (1/2) ||| Full) Full
--
-- Replace any of the layouts with any arbitrarily complicated layout.
-- ifWider can also be used inside other layout combinators.

ifWider :: (LayoutClass l1 a, LayoutClass l2 a)
               => Dimension   -- ^ target screen width
               -> (l1 a)      -- ^ layout to use when the screen is wide enough
               -> (l2 a)      -- ^ layout to use otherwise
               -> PerScreen l1 l2 a
ifWider w = PerScreen w False

data PerScreen l1 l2 a = PerScreen Dimension Bool (l1 a) (l2 a) deriving (Read, Show)

-- | Construct new PerScreen values with possibly modified layouts.
mkNewPerScreenT :: PerScreen l1 l2 a -> Maybe (l1 a) ->
                      PerScreen l1 l2 a
mkNewPerScreenT (PerScreen w _ lt lf) mlt' =
    (\lt' -> PerScreen w True lt' lf) $ fromMaybe lt mlt'

mkNewPerScreenF :: PerScreen l1 l2 a -> Maybe (l2 a) ->
                      PerScreen l1 l2 a
mkNewPerScreenF (PerScreen w _ lt lf) mlf' =
    (\lf' -> PerScreen w False lt lf') $ fromMaybe lf mlf'

instance (LayoutClass l1 a, LayoutClass l2 a, Show a) => LayoutClass (PerScreen l1 l2) a where
    runLayout (W.Workspace i p@(PerScreen w _ lt lf) ms) r
        | rect_width r > w    = do (wrs, mlt') <- runLayout (W.Workspace i lt ms) r
                                   return (wrs, Just $ mkNewPerScreenT p mlt')
        | otherwise           = do (wrs, mlt') <- runLayout (W.Workspace i lf ms) r
                                   return (wrs, Just $ mkNewPerScreenF p mlt')

    handleMessage (PerScreen w bool lt lf) m
        | bool      = handleMessage lt m >>= maybe (return Nothing) (\nt -> return . Just $ PerScreen w bool nt lf)
        | otherwise = handleMessage lf m >>= maybe (return Nothing) (\nf -> return . Just $ PerScreen w bool lt nf)

    description (PerScreen _ True  l1 _) = description l1
    description (PerScreen _ _     _ l2) = description l2

I'm going to submit it to xmonad-contrib, if I can figure out their darn patch submission process...

  • May 2, 2015

An Eq instance for non de Bruijn terms

tl;dr A non-nameless term equipped with a map specifying a de Bruijn numbering can support an efficient equality without needing a helper function. More abstractly, quotients are not just for proofs: they can help efficiency of programs too.

The cut. You're writing a small compiler, which defines expressions as follows:

type Var = Int
data Expr = Var Var
          | App Expr Expr
          | Lam Var Expr

Where Var is provided from some globally unique supply. But while working on a common sub-expression eliminator, you find yourself needing to define equality over expressions.

You know the default instance won’t work, since it will not say that Lam 0 (Var 0) is equal to Lam 1 (Var 1). Your colleague Nicolaas teases you that the default instance would have worked if you used a nameless representation, but de Bruijn levels make your head hurt, so you decide to try to write an instance that does the right thing by yourself. However, you run into a quandary:

instance Eq Expr where
  Var v == Var v'          = v == v'
  App e1 e2 == App e1' e2' = e1 == e1' && e2 == e2'
  Lam v e == Lam v' e'     = _what_goes_here

If v == v', things are simple enough: just check if e == e'. But if they're not... something needs to be done. One possibility is to rename e' before proceeding, but this results in an equality which takes quadratic time. You crack open the source of one famous compiler, and you find that in fact: (1) there is no Eq instance for terms, and (2) an equality function has been defined with this type signature:

eqTypeX :: RnEnv2 -> Type -> Type -> Bool

Where RnEnv2 is a data structure containing renaming information: the compiler has avoided the quadratic blow-up by deferring any renaming until we need to test variables for equality.

“Well that’s great,” you think, “But I want my Eq instance, and I don’t want to convert to de Bruijn levels.” Is there anything to do?

Perhaps a change of perspective in order:

The turn. Nicolaas has the right idea: a nameless term representation has a very natural equality, but the type you've defined is too big: it contains many expressions which should be equal but structurally are not. But in another sense, it is also too small.

Here is an example. Consider the term x, which is a subterm of λx. λy. x. The x in this term is free; it is only through the context λx. λy. x that we know it is bound. However, in the analogous situation with de Bruijn levels (not indexes—as it turns out, levels are more convenient in this case) we have 0, which is a subterm of λ λ 0. Not only do we know that 0 is a free variable, but we also know that it binds to the outermost enclosing lambda, no matter the context. With just x, we don’t have enough information!

If you know you don’t know something, you should learn it. If your terms don’t know enough about their free variables, you should equip them with the necessary knowledge:

import qualified Data.Map as Map
import Data.Map (Map)

data DeBruijnExpr = D Expr NEnv

type Level = Int
data NEnv = N Level (Map Var Level)

lookupN :: Var -> NEnv -> Maybe Level
lookupN v (N _ m) = Map.lookup v m

extendN :: Var -> NEnv -> NEnv
extendN v (N i m) = N (i+1) (Map.insert v i m)

and when you do that, things just might work out the way you want them to:

instance Eq DeBruijnExpr where
  D (Var v) n == D (Var v') n' =
    case (lookupN v n, lookupN v' n') of
      (Just l, Just l')  -> l == l'
      (Nothing, Nothing) -> v == v'
      _ -> False
  D (App e1 e2) n == D (App e1' e2') n' =
    D e1 n == D e1' n' && D e2 n == D e2' n'
  D (Lam v e) n == D (Lam v' e') n' =
    D e (extendN v n) == D e' (extendN v' n')

(Though perhaps Coq might not be able to tell, unassisted, that this function is structurally recursive.)
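For example, using the definitions above, two alpha-equivalent terms compare equal once each is paired with an empty naming environment, while distinct free variables still compare unequal:

emptyN :: NEnv
emptyN = N 0 Map.empty

-- λx. x and λy. y differ structurally but are alpha-equivalent:
alphaEq :: Bool
alphaEq = D (Lam 0 (Var 0)) emptyN == D (Lam 1 (Var 1)) emptyN   -- True

-- Distinct free variables are still distinguished by name:
freeVarsEqual :: Bool
freeVarsEqual = D (Var 0) emptyN == D (Var 1) emptyN   -- False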

Exercise. Define a function with type DeBruijnExpr -> DeBruijnExpr' and its inverse, where:

data DeBruijnExpr' = Var' Var
                   | Bound' Level
                   | Lam' DeBruijnExpr'
                   | App' DeBruijnExpr' DeBruijnExpr'

The conclusion. What have we done here? We have quotiented a type—made it smaller—by adding more information. In doing so, we recovered a simple way of defining equality over the type, without needing to define a helper function, do extra conversions, or suffer quadratically worse performance.

Sometimes, adding information is the only way to get the minimal definition. This situation occurs in homotopy type theory, where equivalences must be equipped with an extra piece of information, or else it is not a mere proposition (has the wrong homotopy type). If you, gentle reader, have more examples, I would love to hear about them in the comments. We are frequently told that “less is more”, that the route to minimalism lies in removing things: but sometimes, the true path lies in adding constraints.

Postscript. In Haskell, we haven’t truly made the type smaller: I can distinguish two expressions which should be equivalent by, for example, projecting out the underlying Expr. A proper type system which supports quotients would oblige me to demonstrate that if two elements are equivalent under the quotienting equivalence relation, my elimination function can't observe it.

Postscript 2. This technique has its limitations. Here is one situation where I have not been able to figure out the right quotient: suppose that the type of my expressions are such that all free variables are implicitly universally quantified. That is to say, there exists some ordering of quantifiers on a and b such that a b is equivalent to b a. Is there a way to get the quantifiers in order on the fly, without requiring a pre-pass on the expressions using this quotienting technique? I don’t know!

  • January 30, 2015