ezyang’s blog

the arc of software bends towards understanding

Backpack and the PVP

In the PVP, you increment the minor version number if you add functions to a module, and the major version number if you remove functions from a module. Intuitively, this is because adding functions is a backwards compatible change, while removing functions is a breaking change; to put it more formally, if the new interface is a subtype of the older interface, then only a minor version number bump is necessary.

Backpack adds a new complication to the mix: signatures. What should the PVP policy for adding/removing functions from signatures be? If we interpret a package with required signatures as a function, theory tells us the answer: signatures are contravariant, so adding required functions is breaking (bump the major version), whereas removing required functions is backwards-compatible (bump the minor version).
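
To make the contravariance concrete, here is a hypothetical unit (the names are invented for illustration):

unit q-sig where
  signature Q where
    data Q
    push :: Int -> Q -> Q
    -- Adding a new requirement here (say, pop :: Q -> Maybe (Int, Q))
    -- means an implementation that previously filled this signature
    -- may no longer do so: a breaking change, so bump the major
    -- version.  Removing the requirement on push cannot invalidate
    -- any implementation: bump the minor version.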

However, that's not the end of the story. Signatures can be reused, in the sense that a package can define a signature, and then another package can reuse that signature:

unit sigs where
  signature A where
    x :: Bool
unit p where
  dependency sigs[A=<A>]
  module B where
    import A
    z = x

In the example above, we've placed a signature in the sigs unit, which p uses by declaring a dependency on sigs. B has access to all the declarations defined by the A in sigs.

But there is something very odd here: if sigs were to ever remove its declaration for x, p would break (x would no longer be in scope). In this case, the PVP rule from above is incorrect: p must always declare an exact version bound on sigs, as any addition or deletion would be a breaking change.

So we are in this odd situation:

  1. If we include a dependency with a signature, and we never use any of the declarations from that signature, we can specify a loose version bound on the dependency, allowing for it to remove declarations from the signature (making the signature easier to fulfill).
  2. However, if we ever import the signature and use anything from it, we must specify an exact bound, since removals are now breaking changes.

I don't think end users of Backpack should be expected to get this right on their own, so GHC (in this proposed patchset) tries to help users out by attaching warnings like this to declarations that come solely from packages that may have been specified with loose bounds:

foo.bkp:9:11: warning: [-Wdeprecations]
    In the use of ‘x’ (imported from A):
    "Inherited requirements from non-signature libraries
    (libraries with modules) should not be used, as this
    mode of use is not compatible with PVP-style version
    bounds.  Instead, copy the declaration to the local
    hsig file or move the signature to a library of its
    own and add that library as a dependency."

UPDATE. After this post was published, we ended up removing this warning, because it triggered in situations which were PVP-compatible. (The gory details: if a module reexported an entity from a signature, then a use of the entity from that module would have triggered the warning, due to how DEPRECATED notices work.)

Of course, GHC knows nothing about bounds, so the heuristic we use is that a package is a signature package with exact bounds if it does not expose any modules. A package like this is only ever useful by importing its signatures, so we never warn about this case. We conservatively assume that packages that do expose modules might be subject to PVP-style bounds, so we warn in that case, e.g., as in:

unit q where
  signature A where
    x :: Bool
  module M where -- Module!
unit p where
  dependency q[A=<A>]
  module B where
    import A
    z = x

As the warning suggests, this can be fixed by explicitly specifying x :: Bool in a local signature inside p, so that, even if q removes its requirement, no code will break:

unit q where
  signature A where
    x :: Bool
  module M where -- Module!
unit p where
  dependency q[A=<A>]
  signature A where
    x :: Bool
  module B where
    import A
    z = x

Or by putting the signature in a new library of its own (as was the case in the original example.)

This solution isn't perfect, as there are still ways you can end up depending on inherited signatures in PVP-incompatible ways. The most obvious is with regards to types. In the code below, we rely on the fact that the signature from q forces T to be type equal to Bool:

unit q where
  signature A where
    type T = Bool
    x :: T
  module Q where
unit p where
  dependency q[A=<A>]
  signature A where
    data T
    x :: T
  module P where
    import A
    y = x :: Bool

In principle, it should be permissible for q to relax its requirement on T, allowing it to be implemented as anything (and not just a synonym of Bool), but that change will break the usage of x in P. Unfortunately, there isn't any easy way to warn in this case.

A perhaps more principled approach would be to ban use of signature imports that come from non-signature packages. However, in my opinion, this complicates the Backpack model for not a very good reason (after all, some day we'll augment version numbers with signatures and it will be glorious, right?)

To summarize. If you want to reuse signatures from a signature package, specify an exact version bound on that package. If you use a component that is parametrized over signatures, do not import and use declarations from those signatures; GHC will warn you if you do so.

  • December 30, 2016

Left-recursive parsing of Haskell imports and declarations

Suppose that you want to parse a list separated by newlines, but you want to automatically ignore extra newlines (just in the same way that import declarations in a Haskell file can be separated by one or more newlines.) Historically, GHC has used a curious grammar to perform this parse (here, semicolons represent newlines):

decls : decls ';' decl
      | decls ';'
      | decl
      | {- empty -}

It takes a bit of squinting, but what this grammar does is accept a list of decls, interspersed with one or more semicolons, with zero or more leading/trailing semicolons. For example, ;decl;;decl; parses as:

{- empty -}                             (rule 4)
{- empty -} ';' decl                    (rule 1)
{- empty -} ';' decl ';'                (rule 2)
{- empty -} ';' decl ';' ';' decl       (rule 1)
{- empty -} ';' decl ';' ';' decl ';'   (rule 2)

(Rule 3 gets exercised if there is no leading semicolon.)

This grammar has two virtues: first, it only requires a single state, which reduces the size of the parser; second, it is left-recursive, which means that an LALR parser (like Happy) can parse it in constant stack space.
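
For contrast, here is a right-recursive sketch of a semicolon-separated list (simplified, and not taken from GHC): it accepts a similar language, but an LALR parser must shift every decl onto the stack before it can perform a single reduction, so stack usage grows linearly with the length of the list.

decls : decl ';' decls
      | decl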

This code worked quite well for a long time, but it finally fell over in complexity when we added annotations to GHC. Annotations are a feature which tracks the locations of all keywords/punctuation/whitespace in source code, so that we can reconstruct the source code byte-for-byte from the abstract syntax tree (normally, this formatting information is lost at abstract syntax). With annotations, we needed to save information about each semicolon; for reasons that I don't quite understand, we were expending considerable effort to associate each semicolon with the preceding declaration (leading semicolons were propagated up to the enclosing element.)

This led to some very disgusting parser code:

importdecls :: { ([AddAnn],[LImportDecl RdrName]) }
        : importdecls ';' importdecl
                                {% if null (snd $1)
                                     then return (mj AnnSemi $2:fst $1,$3 : snd $1)
                                     else do
                                      { addAnnotation (gl $ head $ snd $1)
                                                      AnnSemi (gl $2)
                                      ; return (fst $1,$3 : snd $1) } }
        | importdecls ';'       {% if null (snd $1)
                                     then return ((mj AnnSemi $2:fst $1),snd $1)
                                     else do
                                       { addAnnotation (gl $ head $ snd $1)
                                                       AnnSemi (gl $2)
                                       ; return $1} }
        | importdecl             { ([],[$1]) }
        | {- empty -}            { ([],[]) }

Can you tell what this does?! It took me a while to understand what the code is doing: the null test is to check if there is a preceding element we can attach the semicolon annotation to: if there is none, we propagate the semicolons up to the top level.

The crux of the issue was that, once annotations were added, the grammar did not match the logical structure of the syntax tree. That's bad. Let's make them match up. Here are a few constraints:

  1. The leading semicolons are associated with the enclosing AST element. So we want to parse them once at the very beginning, and then not bother with them in the recursive rule. Call the rule to parse zero or more semicolons semis:

    semis : semis ';'
          | {- empty -}
    
  2. If there are duplicate semicolons, we want to parse them all at once, and then associate them with the preceding declarations. So we also need a rule to parse one or more semicolons, which we will call semis1; then when we parse a single declaration, we want to parse it as decl semis1:

    semis1 : semis1 ';'
           | ';'
    

Then, we can build up our parser in the following way:

-- Possibly empty decls with mandatory trailing semicolons
decls_semi : decls_semi decl semis1
           | {- empty -}

-- Non-empty decls with no trailing semicolons
decls : decls_semi decl

-- Possibly empty decls with optional trailing semicolons
top1 : decls_semi
     | decls

-- Possibly empty decls with optional leading/trailing semicolons
top : semis top1

We've taken care not to introduce any shift-reduce conflicts. It was actually a bit non-obvious how to make this happen, because in Haskell source files, we need to parse a list of import declarations (importdecl), followed by a list of top-level declarations (topdecl). It's a bit difficult to define the grammar for these two lists without introducing a shift-reduce conflict, but this seems to work:

top : importdecls_semi topdecls_semi
    | importdecls_semi topdecls
    | importdecls

It looks so simple, but there are a lot of plausible looking alternatives which introduce shift/reduce conflicts. There's an important meta-lesson here, which is that when trying to work out how to do something like this, it is best to experiment on a smaller grammar, where re-checking is instantaneous (happy takes quite a bit of time to process all of GHC, which made the edit-recompile cycle a bit miserable.)

I'd love to know if there's an even simpler way to do this, or if I've made a mistake and changed the set of languages I accept. Let me know in the comments. I've attached below a simple Happy grammar that you can play around with (build with happy filename.y; ghc --make filename.hs).

{
module Main where

import Data.Char
}

%name parse
%expect 0
%tokentype { Token }
%error { parseError }

%token
      import          { TokenImport }
      decl            { TokenDecl }
      ';'             { TokenSemi }

%%

top     : semis top1                        { $2 }
top1    : importdecls_semi topdecls_semi    { (reverse $1, reverse $2) }
        | importdecls_semi topdecls         { (reverse $1, reverse $2) }
        | importdecls                       { (reverse $1, []) }

id_semi : importdecl semis1                 { $1 }
importdecls
        : importdecls_semi importdecl       { $2:$1 }
importdecls_semi
        : importdecls_semi id_semi          { $2:$1 }
        | {- empty -}                       { [] }

topdecls
        : topdecls_semi topdecl             { $2:$1 }
topdecls_semi
        : topdecls_semi topdecl semis1      { $2:$1 }
        | {- empty -}                       { [] }

semis   : semis ';'                         { () }
        | {- empty -}                       { () }

semis1  : semis1 ';'                        { () }
        | ';'                               { () }

importdecl
        : import                            { "import" }
topdecl : decl                              { "decl" }

{
parseError :: [Token] -> a
parseError p = error ("Parse error: " ++ show p)

data Token
      = TokenImport
      | TokenDecl
      | TokenSemi
 deriving Show

lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
      | isSpace c = lexer cs
      | isAlpha c = lexVar (c:cs)
lexer (';':cs) = TokenSemi : lexer cs

lexVar cs =
   case span isAlpha cs of
      ("import",rest) -> TokenImport : lexer rest
      ("decl",rest) -> TokenDecl : lexer rest

main = print . parse . lexer $ "import;;import;;decl"
}
  • December 21, 2016

The problem of reusable and composable specifications

It's not too hard to convince people that version bounds are a poor approximation of the particular API that we depend on. What do we mean when we say >= 1.0 && < 1.1? A version bound is a proxy for some set of modules and functions, with some particular semantics, that a library needs in order to be built. Version bounds are imprecise; what does a change from 1.0 to 1.1 mean? Clearly, we should instead write down the actual specification (either types or contracts) of what we need.

This all sounds like a good idea until you actually try to put it into practice, at which point you realize that version numbers had one great virtue: they're very short. Specifications, on the other hand, can get quite large: even just writing down the types of all the functions you depend on can take pages, let alone executable contracts describing more complex behavior. To make matters worse, the same function will be depended upon repeatedly; the specification must be provided in each case!

So we put on our PL hats and say, "Aha! What we need is a mechanism for reuse and composition of specifications. Something like... a language of specification!" But at this point, there is disagreement about how this language should work.

Specifications are code. If you talk to a Racketeer, they'll say, "Well, contracts are just code, and we know how to reuse and compose code!" You have primitive contracts to describe values, compose them together into contracts that describe functions, and then further compose these together to form contracts about modules. You can collect these contracts into modules and share them across your code.

There is one interesting bootstrapping problem: you're using your contracts to represent versions, but your contracts themselves live in a library, so should you version your contracts? Current thinking is that you shouldn't.

But maybe you shouldn't compose them the usual way. One of the things that stuck out to me when I was reading the frontmatter of Clojure's spec documentation is its principle that map specs should be of keysets only, and how spec deals with the consequences.

The core principle of spec's design is that specifications for records should NOT take the form { name: string, age: int }. Instead, the specification is split into two pieces: a set of keys { name, age }, and a mapping from keys to specifications which, once registered, apply to all occurrences of a key in all map specifications. (Note that keys are all namespaced, so it is not some insane free-for-all in a global namespace.) The justification for this:

In Clojure we gain power by dynamically composing, merging and building up maps. We routinely deal with optional and partial data, data produced by unreliable external sources, dynamic queries etc. These maps represent various sets, subsets, intersections and unions of the same keys, and in general ought to have the same semantic for the same key wherever it is used. Defining specifications of every subset/union/intersection, and then redundantly stating the semantic of each key is both an antipattern and unworkable in the most dynamic cases.

Back to the land of types. Contracts can do all this because they are code, and we know how to reuse code. But in (non-dependently) typed languages, the language of types tends to be far more impoverished than the language of values. To take Backpack as an (unusually expressive) example, the only operations we can perform on signatures are to define them (with full definitions for types) and to merge them together. So Backpack signatures run headlong into the redundancy problem identified by spec: because the signature of a module includes the signatures of its functions, you end up having to repeat these function signatures whenever you write slightly different iterations of a module.

To adopt the Clojure model, you would have to write a separate signature per module (each in their own package), and then have users combine them together by adding a build-depends on every signature they wanted to use:

-- In Queue-push package
signature Queue where
  data Queue a
  push :: a -> Queue a -> Queue a

-- In Queue-pop package
signature Queue where
  data Queue a
  pop :: Queue a -> Maybe (Queue a, a)

-- In Queue-length package
signature Queue where
  data Queue a
  length :: Queue a -> Int

-- Putting them together (note that Queue is defined
-- in each signature; mix-in linking merges these
-- abstract data types together)
build-depends: Queue-push, Queue-pop, Queue-length

In our current implementation of Backpack, this is kind of insane: to write the specification for a module with a hundred methods, you'd need a hundred packages. The ability to concisely define multiple public libraries in a single package might help but this involves design that doesn't exist yet. (Perhaps the cure is worse than the disease. The package manager-compiler stratification rears its ugly head again!) (Note to self: signature packages ought to be treated specially; they really shouldn't be built when you instantiate them.)

Conclusions. A lot of my thinking here did not crystallize until I started reading about how dynamic languages like Clojure were grappling with the specification problem: I think this just goes to show how much we can learn by paying attention to other systems, even if their context is quite different. (If Clojure believed in data abstraction, I think they could learn a thing or two from how Backpack mix-in links abstract data declarations.)

In Clojure, the inability to reuse specs is a deal breaker which led them to spec's current design. In Haskell, the inability to reuse type signatures flirts on the edge of unusability: types are just short enough and copy-pasteable enough to be tolerable. Documentation for these types, less so; this is what led me down my search for better mechanisms for signature reuse.

Although Backpack's current design is "good enough" to get things done, I still wonder if we can't do something better. One tempting option is to allow for downstream signatures to selectively pick out certain functions from a larger signature file to add to their requirements. But if you require Queue.push, you had better also require Queue.Queue (without which, the type of push cannot even be stated: the avoidance problem); this could lead to a great deal of mystery as to what exactly is required in the end. Food for thought.

  • December 17, 2016

Thoughts about Spec-ulation (Rich Hickey)

Rich Hickey recently gave a keynote at Clojure/conj 2016, meditating on the problems of versioning, specification and backwards compatibility in language ecosystems. In it, Rich considers the "extremist" view: what if we built a language ecosystem where you never, ever broke backwards compatibility?

A large portion of the talk is spent grappling with the ramifications of this perspective. For example:

  1. Suppose you want to make a backwards-compatibility breaking change to a function. Don't mutate the function, Rich says: give the function another name.
  2. OK, but how about if there is some systematic change you need to apply to many functions? That's still not an excuse: create a new namespace, and put all the functions there.
  3. What if there's a function you really don't like, and you really want to get rid of it? No, don't remove it, create a new namespace with that function absent.
  4. Does this sound like a lot of work to remove things? Yeah. So don't remove things!

In general, Rich wants us to avoid breakage by turning all changes into accretion, where the old and new can coexist. "We need to bring functional programming [immutability] to the library ecosystem," he says, "dependency hell is just mutability hell." And to do this, there need to be tools for you to make a commitment to what it is that a library provides and requires, and not accidentally breaking this commitment when you release new versions of your software.

He says a lot more in the talk, so I encourage you to give it a watch if you want to hear the whole picture.


In general, I'm in favor of this line of thinking, because my feeling is that a large amount of the breakage associated with software change is just a product of negligence; breakage not for any good reason, breakage that could have been avoided if there was a little more help from tooling.

That being said, I do have some thoughts about topics that are not so prominently featured in his talk.

Accretion is not a silver bullet... if you believe in data hiding. In his talk, Rich implies that backwards compatibility can be maintained simply by committing not to "remove things". As a Haskeller, this sounds obviously false to me: if I change the internal representation of some abstract type (or even the internal invariants), I cannot just load up both old and new copies of the library and expect to pass values of this type between the two. Indeed, the typechecker won't even let you do this even if the representation hasn't changed.

But, at least for Clojure, I think Rich is right. The reason is this: Clojure doesn't believe in data hiding! The prevailing style of Clojure code is that data types consist of immutable records with public fields that are passed around. And so a change to the representation of the data is always a breaking change; non-breaking representation changes are simply not done. (I suspect a similar ethos explains why duplicated dependencies in node.js work as well as they do.)

I am not sure how I feel about this. I am personally a big believer in data abstraction, but I often admire the pragmatics of "everything is a map". (I tweeted about this earlier today, which provoked some thoughtful discussion.)

Harmful APIs. At several points in the talk, Rich makes fun of developers who are obsessed with taking away features from their users. ("I hate this function. I hate it, I hate it, I hate that people call it, I just want it out of my life.") This downplays the very real, very important reasons why infinite backwards compatibility has been harmful to the software we write today.

One need look no further than the systems with decades of backwards compatibility that Rich cites: the Unix APIs, Java and HTML. In all these cases, backwards compatibility has led to harmful APIs sticking around far longer than they should: strncpy, gets, legacy parsers of HTML (XSS), Java antipatterns, etc. And there are examples galore in Android, C libraries, everywhere.

In my opinion, library authors should design APIs in such a way that it is easy to do the right thing, and hard to do the wrong thing. And yes, sometimes that means you need to stop people from using insecure or easy-to-get-wrong library calls.

Semantic versioning doesn't cause cascading version bumps; lack of version ranges is the cause. In the slide "Do deps force Versioning?", Rich describes a problem in the Clojure ecosystem: when following semantic versioning, a new release of a package often causes cascading version bumps in the system.

While the problem of cascading version bumps is a real question that applies to semantic versioning in general, the "cascading version bumps" Rich is referring to in the Clojure ecosystem stem from a much more mundane source: best practice is to specify an exact version of each dependency in your package metadata. When a new version of a dependency comes out, you need to bump the version of your package so that you can update the recorded version of the dependency... and so forth.
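
To illustrate the difference, consider these two hypothetical dependency specifications (written in Cabal syntax for familiarity; the package foo is invented):

-- Exact pin: every release of foo forces a metadata change here,
-- hence a version bump of this package, and so on transitively.
build-depends: foo == 1.2.3

-- PVP-style range: new minor releases of foo require no change here.
build-depends: foo >= 1.2 && < 1.3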

I'm not saying that Clojure is wrong for doing things this way (version ranges have their own challenges), but in his talk Rich implies that this is a failure of semantic versioning... which it's not. If you use version ranges and aren't in the habit of reexporting APIs from your dependencies, updating the version range of a dependency is not a breaking change. If you have a solver that picks a single copy of a library for the entire application, you can even expose types from your dependency in your API.


Overall, I am glad that Clojure is thinking about how to put backwards compatibility first and foremost: often, it is in the most extreme applications of a principle that we learn the most. Is it the end of the story? No; but I hope that all languages continue slowly moving towards explicit specifications and tooling to help you live up to your promises.

  • December 16, 2016

Try Backpack: ghc --backpack

Backpack, a new system for mix-in packages in Haskell, has been released with GHC 8.2. Although Backpack is closely integrated with the Cabal package system, it's still possible to play around with toy examples using a new command ghc --backpack. Before you get started, make sure you have a recent enough version of GHC:

ezyang@sabre:~$ ghc-8.2 --version
The Glorious Glasgow Haskell Compilation System, version 8.2.1

By the way, if you want to jump straight into Backpack for real (with Cabal packages and everything), skip this tutorial and jump to Try Backpack: Cabal packages.

Hello World

GHC supports a new file format, bkp files, which let you easily define multiple modules and packages in a single source file, making it easy to experiment with Backpack. This format is not suitable for large scale programming (there isn't any integration of bkp files with Cabal, nor do we plan to add such integration), but we will use it for our tutorial because it makes it very easy to play around with Backpack without mucking about with lots of Cabal packages.

Here is a simple "Hello World" program:

unit main where
  module Main where
    main = putStrLn "Hello world!"

We define a unit (think package) with the special name main, and in it define a Main module (also specially named) which contains our main function. Place this in a file named hello.bkp, and then run ghc --backpack hello.bkp. This will produce an executable at main/Main which you can run; you can also explicitly specify the desired output filename using -o filename. Note that by default, ghc --backpack creates a directory with the same name as every unit, so -o main won't work (it'll give you a linker error; use a different name!)
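
Assuming the right GHC is first on your PATH, a session might look something like this (compiler output elided):

ezyang@sabre:~$ ghc --backpack hello.bkp
ezyang@sabre:~$ ./main/Main
Hello world!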

A Play on Regular Expressions

Let's write some nontrivial code that actually uses Backpack. For this tutorial, we will write a simple matcher for regular expressions as described in A Play on Regular Expressions (Sebastian Fischer, Frank Huch, Thomas Wilke). The matcher itself is inefficient (it checks for a match by testing all exponentially many decompositions of a string), but it will be sufficient to illustrate many key concepts of Backpack.

To start things off, let's go ahead and write a traditional implementation of the matcher by copy-pasting the code from this Functional Pearl into a Regex module in the Backpack file and writing a little test program to run it:

unit regex where
    module Regex where
        -- | A type of regular expressions.
        data Reg = Eps
                 | Sym Char
                 | Alt Reg Reg
                 | Seq Reg Reg
                 | Rep Reg

        -- | Check if a regular expression 'Reg' matches a 'String'
        accept :: Reg -> String -> Bool
        accept Eps       u = null u
        accept (Sym c)   u = u == [c]
        accept (Alt p q) u = accept p u || accept q u
        accept (Seq p q) u =
            or [accept p u1 && accept q u2 | (u1, u2) <- splits u]
        accept (Rep r) u =
            or [and [accept r ui | ui <- ps] | ps <- parts u]

        -- | Given a string, compute all splits of the string.
        -- E.g., splits "ab" == [("","ab"), ("a","b"), ("ab","")]
        splits :: String -> [(String, String)]
        splits [] = [([], [])]
        splits (c:cs) = ([], c:cs):[(c:s1,s2) | (s1,s2) <- splits cs]

        -- | Given a string, compute all possible partitions of
        -- the string (where all partitions are non-empty).
        -- E.g., partitions "ab" == [["ab"],["a","b"]]
        parts :: String -> [[String]]
        parts [] = [[]]
        parts [c] = [[[c]]]
        parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]

unit main where
    dependency regex
    module Main where
        import Regex
        nocs = Rep (Alt (Sym 'a') (Sym 'b'))
        onec = Seq nocs (Sym 'c')
        -- | The regular expression which tests for an even number of cs
        evencs = Seq (Rep (Seq onec onec)) nocs
        main = print (accept evencs "acc")

If you put this in regex.bkp, you can once again compile it using ghc --backpack regex.bkp and invoke the resulting executable at main/Main. It should print True.

Functorizing the matcher

The previously shown code isn't great because it hardcodes String as the type to do regular expression matching over. A reasonable generalization (which you can see in the original paper) is to match over arbitrary lists of symbols; however, we might also reasonably want to match over non-list types like ByteString. To support all of these cases, we will instead use Backpack to "functorize" (in ML parlance) our matcher.

We'll do this by creating a new unit, regex-indef, and writing a signature which provides a string type (we've decided to call it Str, to avoid confusion with String) and all of the operations which need to be supported on it. Here are the steps I took:

  1. First, I copy-pasted the old Regex implementation into the new unit. I replaced all occurrences of String with Str, and deleted splits and parts: we will require these to be implemented in our signature.

  2. Next, we create a new Str signature, which is imported by Regex, and defines our type and operations (splits and parts) which it needs to support:

    signature Str where
      data Str
      splits :: Str -> [(Str, Str)]
      parts :: Str -> [[Str]]
    
  3. At this point, I ran ghc --backpack to typecheck the new unit. But I got two errors!

    regex.bkp:90:35: error:
        • Couldn't match expected type ‘t0 a0’ with actual type ‘Str’
        • In the first argument of ‘null’, namely ‘u’
          In the expression: null u
          In an equation for ‘accept’: accept Eps u = null u
    
    regex.bkp:91:35: error:
        • Couldn't match expected type ‘Str’ with actual type ‘[Char]’
        • In the second argument of ‘(==)’, namely ‘[c]’
          In the expression: u == [c]
          In an equation for ‘accept’: accept (Sym c) u = u == [c]
    

    Foldable null nonsense aside, the errors are quite clear: Str is a completely abstract data type: we cannot assume that it is a list, nor do we know what instances it has. To solve these type errors, I introduced the combinators null and singleton, an instance Eq Str, and rewrote Regex to use these combinators (a very modest change.) (Notice we can't write instance Foldable Str; it's a kind mismatch.)

Here is our final indefinite version of the regex unit:

unit regex-indef where
    signature Str where
        data Str
        instance Eq Str
        null :: Str -> Bool
        singleton :: Char -> Str
        splits :: Str -> [(Str, Str)]
        parts :: Str -> [[Str]]
    module Regex where
        import Prelude hiding (null)
        import Str

        data Reg = Eps
                 | Sym Char
                 | Alt Reg Reg
                 | Seq Reg Reg
                 | Rep Reg

        accept :: Reg -> Str -> Bool
        accept Eps       u = null u
        accept (Sym c)   u = u == singleton c
        accept (Alt p q) u = accept p u || accept q u
        accept (Seq p q) u =
            or [accept p u1 && accept q u2 | (u1, u2) <- splits u]
        accept (Rep r) u =
            or [and [accept r ui | ui <- ps] | ps <- parts u]

(To keep things simple for now, I haven't parametrized Char.)

Instantiating the functor (String)

This is all very nice but we can't actually run this code, since there is no implementation of Str. Let's write a new unit which provides a module which implements all of these types and functions with String, copy pasting in the old implementations of splits and parts:

unit str-string where
    module Str where
        import Prelude hiding (null)
        import qualified Prelude as P

        type Str = String

        null :: Str -> Bool
        null = P.null

        singleton :: Char -> Str
        singleton c = [c]

        splits :: Str -> [(Str, Str)]
        splits [] = [([], [])]
        splits (c:cs) = ([], c:cs):[(c:s1,s2) | (s1,s2) <- splits cs]

        parts :: Str -> [[Str]]
        parts [] = [[]]
        parts [c] = [[[c]]]
        parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]

One quirk when writing Backpack implementations for functions is that Backpack does no subtype matching on polymorphic functions, so you can't implement Str -> Bool with a polymorphic function Foldable t => t a -> Bool (adding this would be an interesting extension, and not altogether trivial). So we have to write a little impedance-matching binding which monomorphizes null to the expected type.

To instantiate regex-indef with str-string:Str, we modify the dependency in main:

-- dependency regex -- old
dependency regex-indef[Str=str-string:Str]

Backpack files require instantiations to be explicitly specified (this is as opposed to Cabal files, which do mix-in linking to determine instantiations). In this case, the instantiation specifies that regex-indef's signature named Str should be filled with the Str module from str-string.

After making these changes, give ghc --backpack a run; you should get out an identical looking result.
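
(For comparison, a sketch of how this would look in the Cabal universe: you would not write the instantiation at all. You would simply add both packages to build-depends, and mix-in linking would match the Str module provided by str-string against the Str signature required by regex-indef by name:

build-depends: regex-indef, str-string

More on this in Try Backpack: Cabal packages.)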

Instantiating the functor (ByteString)

The whole point of parametrizing regex was to enable us to have a second implementation of Str. So let's go ahead and write a bytestring implementation. After a little bit of work, you might end up with this:

unit str-bytestring where
    module Str(module Data.ByteString.Char8, module Str) where
        import Prelude hiding (length, null, splitAt)
        import Data.ByteString.Char8
        import Data.ByteString

        type Str = ByteString

        splits :: Str -> [(Str, Str)]
        splits s = fmap (\n -> splitAt n s) [0..length s]

        parts :: Str -> [[Str]]
        parts s | null s    = [[]]
                | otherwise = do
                    n <- [1..length s]
                    let (l, r) = splitAt n s
                    fmap (l:) (parts r)

There are two things to note about this implementation:

  1. Unlike str-string, which explicitly defined every needed method in its module body, str-bytestring provides null and singleton simply by reexporting all of the entities from Data.ByteString.Char8 (which are appropriately monomorphic). We've cleverly picked our names to abide by the naming conventions of existing string packages!
  2. Our implementations of splits and parts are substantially more optimized than if we had done a straight up transcription of the consing and unconsing from the original String implementation. I often hear people say that String and ByteString have very different performance characteristics, and thus you shouldn't mix them up in the same implementation. I think this example shows that as long as you have sufficiently high-level operations on your strings, these performance changes smooth out in the end; and there is still a decent chunk of code that can be reused across implementations.

To instantiate regex-indef with str-bytestring:Str, we once again modify the dependency in main:

-- dependency regex -- oldest
-- dependency regex-indef[Str=str-string:Str] -- old
dependency regex-indef[Str=str-bytestring:Str]

We also need to add an {-# LANGUAGE OverloadedStrings #-} pragma so that "acc" gets interpreted as a ByteString (unfortunately, the bkp file format only supports language pragmas that get applied to all modules it defines; so put this pragma at the top of the file). But otherwise, everything works as it should!

Using both instantiations at once

There is nothing stopping us from using both instantiations of regex-indef at the same time, simply by uncommenting both dependency declarations, except that the module names provided by each dependency conflict with each other and are thus ambiguous. Backpack files thus provide a renaming syntax for modules which lets you give each exported module a different name:

dependency regex-indef[Str=str-string:Str]     (Regex as Regex.String)
dependency regex-indef[Str=str-bytestring:Str] (Regex as Regex.ByteString)

How should we modify Main to run our regex on both a String and a ByteString? Is Regex.String.Reg the same type as Regex.ByteString.Reg? A quick query to the compiler will reveal that they are not. The reason for this is Backpack's type identity rule: the identity of all types defined in a unit depends on how all signatures are instantiated, even if the type doesn't actually depend on any types from the signature. If we want there to be only one Reg type, we will have to extract it from regex-indef and give it its own unit, with no signatures.

After the refactoring, here is the full final program:

{-# LANGUAGE OverloadedStrings #-}

unit str-bytestring where
    module Str(module Data.ByteString.Char8, module Str) where
        import Prelude hiding (length, null, splitAt)
        import Data.ByteString.Char8
        import Data.ByteString

        type Str = ByteString

        splits :: Str -> [(Str, Str)]
        splits s = fmap (\n -> splitAt n s) [0..length s]

        parts :: Str -> [[Str]]
        parts s | null s    = [[]]
                | otherwise = do
                    n <- [1..length s]
                    let (l, r) = splitAt n s
                    fmap (l:) (parts r)

unit str-string where
    module Str where
        import Prelude hiding (null)
        import qualified Prelude as P

        type Str = String

        null :: Str -> Bool
        null = P.null

        singleton :: Char -> Str
        singleton c = [c]

        splits :: Str -> [(Str, Str)]
        splits [] = [([], [])]
        splits (c:cs) = ([], c:cs):[(c:s1,s2) | (s1,s2) <- splits cs]

        parts :: Str -> [[Str]]
        parts [] = [[]]
        parts [c] = [[[c]]]
        parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]

unit regex-types where
    module Regex.Types where
        data Reg = Eps
                 | Sym Char
                 | Alt Reg Reg
                 | Seq Reg Reg
                 | Rep Reg

unit regex-indef where
    dependency regex-types
    signature Str where
        data Str
        instance Eq Str
        null :: Str -> Bool
        singleton :: Char -> Str
        splits :: Str -> [(Str, Str)]
        parts :: Str -> [[Str]]
    module Regex where
        import Prelude hiding (null)
        import Str
        import Regex.Types

        accept :: Reg -> Str -> Bool
        accept Eps       u = null u
        accept (Sym c)   u = u == singleton c
        accept (Alt p q) u = accept p u || accept q u
        accept (Seq p q) u =
            or [accept p u1 && accept q u2 | (u1, u2) <- splits u]
        accept (Rep r) u =
            or [and [accept r ui | ui <- ps] | ps <- parts u]

unit main where
    dependency regex-types
    dependency regex-indef[Str=str-string:Str]     (Regex as Regex.String)
    dependency regex-indef[Str=str-bytestring:Str] (Regex as Regex.ByteString)
    module Main where
        import Regex.Types
        import qualified Regex.String
        import qualified Regex.ByteString
        nocs = Rep (Alt (Sym 'a') (Sym 'b'))
        onec = Seq nocs (Sym 'c')
        evencs = Seq (Rep (Seq onec onec)) nocs
        main = print (Regex.String.accept evencs "acc") >>
               print (Regex.ByteString.accept evencs "acc")

And beyond!

Read on to the next blog post, Try Backpack: Cabal packages, where I tell you how to take this prototype in a bkp file, and scale it up into a set of Cabal packages.

Postscript. If you are feeling adventurous, try further parametrizing regex-types so that it no longer hard-codes Char as the element type, but some arbitrary element type Elem. It may be useful to know that you can instantiate multiple signatures using the syntax dependency regex-indef[Str=str-string:Str,Elem=str-string:Elem] and that if you depend on a package with a signature, you must thread the signature through using the syntax dependency regex-types[Elem=<Elem>]. If this sounds user-unfriendly, it is! That is why in the Cabal package universe, instantiation is done implicitly, using mix-in linking.
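
To get you started, here is a hedged, untested sketch of the first step, making regex-types indefinite in its element type:

unit regex-types where
    signature Elem where
        data Elem
    module Regex.Types where
        import Elem
        data Reg = Eps
                 | Sym Elem
                 | Alt Reg Reg
                 | Seq Reg Reg
                 | Rep Reg

-- regex-indef must now thread the requirement through with
--     dependency regex-types[Elem=<Elem>]
-- and its Str signature needs to mention Elem as well
-- (e.g., singleton :: Elem -> Str).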

  • October 10, 2016

Seize the Means of Production (of APIs)

There's a shitty API and it's ruining your day. What do you do?

Without making a moral judgment, I want to remark that there is something very different about these two approaches. In Dropbox's case, Dropbox has no (direct) influence on what APIs Apple provides for its operating system. So it has no choice but to work within the boundaries of the existing API. (When Apple says jump, you say, "How high?") But in Adam's case, POSIX is implemented by an open source project Linux, and with some good ideas, Adam could implement his new interface on top of Linux (avoiding the necessity of writing an operating system from scratch.)

APIs cross social boundaries: there is the proletariat that produces software using an API and the bourgeoisie that controls the API. When the Man(TM) is a big corporation, our only choices are to work around their shitty API or pay them sufficiently large amounts of money to fix their shitty APIs. But when the Man(TM) is an open source project, your choices change. True, you could work around their shitty API. Or you could seize the means of production, thereby enabling you to fix the shitty API.

What do I mean by seize the means of production? Indeed, what are the means of production? An open source API does not live in a vacuum; it is made useful by the software that provides the API, the developers who contribute their time and expertise to maintain it, even the publicity platform which convinces people to use the API. To seize the means of production is to gain control of all of these aspects. If you can convince the establishment that you are a core contributor of a piece of open source software, you in turn gain the ability to fix the shitty APIs. If you are unwilling or unable to do so, you might still fork, vendor or rewrite the project, but this is not so much seizing the means of production as it is recreating it from scratch. Another possibility is to build the abstraction you need on top of the existing APIs (as Adam did), though you are always at risk of the original project not being sensitive to your needs.

Time and time again, I see people working with open source projects who refuse to seize the means of production. Instead, they are willing to write increasingly convoluted workarounds to solve problems, all to stay within the boundaries. You might say, "This is just Getting Things Done(TM)", but at some point you will have done more work working around the problem, than you would have spent just fixing the damn thing.

So stop it. Stop putting up with shitty APIs. Stop working within the boundaries. Seize the means of production.

Counterpoint.

  1. What is advocated for in this post is nothing less than infinite yak shaving; if you take the advice seriously you will proceed to never get anything done ever again.
  2. It may be true that in aggregate, the cost of working around a bad API exceeds the cost to fix it, but for any individual the cost is generally less. Thus, even if you could perfectly predict the cost of a workaround versus a proper fix, individual incentives would prevent the proper fix.
  3. Users (including developers) don't know anything about the software they use and are ill-equipped to design better APIs, even if they know where it hurts.
  4. Rarely can you unilaterally seize the means of production. In an ideal world, to become a core contributor, it would merely be sufficient to demonstrate sustained, useful contributions to a project. We all know the real world is more messy.
  • September 13, 2016

The Base of a String Theory for Haskell

One of the early posts from this blog, from 2010, was on the subject of how to pick your string library in Haskell. Half a decade later, the Haskell ecosystem is still largely in the same situation, with most of the boot libraries shipped with GHC (e.g., base) still using the String type, despite the existence of superior string types. The problem is twofold:

  1. No one wants to break all of the existing code, which means libraries like base have to keep String versions of all their code. You can't just search-replace every occurrence of String with Text.
  2. No one wants to be in the business of maintaining two copies of any piece of code, which are copy-pastes of each other but subtly different. In practice, we must: e.g., unix has ByteString variants of all of its functions (done by copy-paste); text provides some core IO functionality (also done by copy-paste). But it is terrible and scales poorly: every downstream library that wants to support two string types (or more) now has to publish two copies of itself, and any new string implementation has the unenviable task of reimplementing the world to make itself useful.

Backpack solves these problems, by allowing you to parametrize over a signature rather than a concrete implementation of a string type, and instantiate such an indefinite library whenever you want. This solves both problems:

  1. Because you are allowed to instantiate an indefinite library whenever you want, we can eagerly instantiate a posix-indef using String and ship it as posix, keeping backwards compatibility with all packages which are Backpack ignorant.
  2. At the same time, if packages depend directly on posix-indef, they themselves are parametrizable over a string type. Entire library ecosystems can defer the choice of string type to the end user, which on a sufficiently new version of GHC offers a backwards compatible way of adding support for new string types to a library. (I don't want to say, support multiple string types, because this is not necessarily a virtue in-and-of-itself.)

To this end, I would like to propose a string theory, for the base of GHC Haskell: namely the core boot libraries that are distributed with GHC today. These packages will set the tone for the eventual Backpackification of the rest of the ecosystem.

But first, what is it that we are parametrizing over? A string is not so simple...

A digression on file paths (and OS strings)

File paths (FilePath) are an important form of String which aren't really Unicode strings at all. POSIX specifies that file paths can be arbitrary C strings, thus, code which decodes a file path as Unicode must be cognizant of the fact that the underlying ByteString could contain arbitrary, undecodable nonsense. To make matters worse, even the encoding can vary: on Windows file paths are encoded in UTF-16 (with unpaired surrogates, eek!), while in modern Linux environments the encoding is dictated by the locale (base uses locale_charset to determine how to interpret file paths; the locale is often UTF-8, but not always).

Thus, the definition type FilePath = String is very questionable indeed. There is an existing proposal, the Abstract FilePath Proposal to turn FilePath into an abstract type, and not just a type synonym for String. Unfortunately, a change like this is a BC-breaking one, so it will take some time to implement, since GHC must first be taught to warn when FilePath is used as if it were a String, to help people find out that they are using it incorrectly.

Backpack offers a more decentralized way to move into the future: just define an abstract signature for FilePath to depend upon. The low level signature might look like this:

signature FilePath where

-- | File and directory names, whose precise
-- meaning is operating system dependent. Files can be opened, yielding a
-- handle which can then be used to operate on the contents of that file.
data FilePath

-- | A C string (pointer to an array of C characters terminated by NUL)
-- representing a file path, suitable for use with the operating system
-- C interface for file manipulation.  This exact type is architecture
-- dependent.
type CFilePath =
#ifdef mingw32_HOST_OS
        CWString
#else
        CString
#endif

withFilePath :: FilePath -> (CFilePath -> IO a) -> IO a
newFilePath  :: FilePath -> IO CFilePath
peekFilePath :: CFilePath -> IO FilePath
-- peekFilePath >=> newFilePath should be identity
-- (this is tricky to achieve if FilePath is a
-- Unicode-based type, like String)

And of course, you would want all of the FilePath manipulation functions that people use.

To maintain compatibility with the existing ecosystem, you would likely instantiate your library with type FilePath = String. But there is nothing stopping you from picking your own abstract FilePath type and using it instead.

File paths are not unique in this sense; there are other strings (such as the values of environment variables) which have similar properties: I've taken to calling these OSStrings (as they are called in Rust.)

Axes of parametrization

With this in mind, there are three "string variants" any given library can be parametrized over:

  1. They can be parametrized over FilePath, for modules which deal with the file system (e.g., System.Posix.Directory)
  2. They can be parametrized over an OSString, because they deal with various operating system specific APIs (e.g., System.Posix.Env)
  3. They can be parametrized over a String, because, well, sometimes a string is just a string. (e.g., Text.ParserCombinators.ReadP)

Some libraries may be parametrized in multiple ways: for example, readFile needs to be parametrized over both FilePath and String.
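
Concretely, the types in such a parametrized library would mention both abstract types (a sketch, assuming FilePath and Str signatures along the lines sketched above):

readFile  :: FilePath -> IO Str
writeFile :: FilePath -> Str -> IO ()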

Split base (and friends) for Backpack

For technical reasons, Backpack cannot be used to parametrize specific modules; you have to parametrize over an entire library. So a side-effect of Backpack-ing the core libraries is that they will be split into a number of smaller libraries. Using module reexports, you can still keep the old libraries around as shims.
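
For example, a shim could be an ordinary package with no modules of its own (a hedged sketch: reexported-modules is an existing Cabal field, while the package names are the split libraries proposed below):

library
  build-depends:      unix-fs, unix-env, unix-process
  reexported-modules: System.Posix.Directory,
                      System.Posix.Files,
                      System.Posix.Env,
                      System.Posix.Process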

There are four GHC boot libraries which would most benefit from modularization on strings:

  • base
    • base-io (System.IO and submodules; parametrized over FilePath and String)
    • There are a few other modules which could be stringified, but the marginal benefit may not justify making a new package for each (Data.String, System.Console.GetOpt, Text.ParserCombinators.ReadP, Text.Printf). Each of these only needs to be parametrized over String.
    • Control.Exception, Text.Read and Text.Show are explicit non-goals, they are too deeply wired into GHC at present to muck about with.
  • unix
    • unix-env (System.Posix.Env, parametrized over OSString)
    • unix-fs (System.Posix.Directory, System.Posix.Files, System.Posix.Temp parametrized over FilePath)
    • unix-process (System.Posix.Process, parametrized over FilePath and OSString)
  • pretty (parametrized over String; then GHC could use it rather than roll its own copy!)
  • process (parametrized over String, OSString and FilePath)

The naming scheme I propose is that, e.g., the package unix continues to be the package instantiated with old-fashioned Strings. Then unix-indef is a package which is uninstantiated (the user can instantiate it to what they want, or pass on the decision to their users). Some packages may choose to also provide shims of their package instantiated with specific types, e.g., base-io-bytestring, which would be base-io instantiated with ByteString rather than String, though these names could get to be quite long, so it's uncertain how useful this would be.

Closing remarks

Of all the packages mentioned here, only base could take the bold step of using Backpack in the next GHC release (although it won't; at least, not for GHC 8.2); the rest need to maintain backwards compatibility with old versions of GHC and so would have to be forked to use Backpack.

The real test for Backpack will be whether or not string-using packages in the ecosystem decide to sign on, and parametrize themselves over signatures. I hope that eventually, you can use any library with ByteString or Text with the same ease that you can use libraries with String (and maybe even use your own, home-grown type.) The problem with module systems is that you rarely see the benefits until you use them for big systems, but that makes it difficult to evaluate them beforehand. But the benefits seem tantalizing: let's boldly backpack forth to a brighter future!

  • September 7, 2016

The Edit-Recompile Manager

A common claim I keep seeing repeated is that there are too many language-specific package managers, and that we should use a distribution's package manager instead. As an example, I opened the most recent HN discussion related to package managers, and sure enough the third comment was on this (very) dead horse. (But wait! There's more.) But it rarely feels like there is any forward progress on these threads. Why?

Here is my hypothesis: these two camps of people are talking past each other, because the term "package manager" has been overloaded to mean two things:

  1. For end-users, it denotes an install manager, primarily responsible for installing some useful software so that they can use it. Software here usually gets installed once, and then used for a long time.
  2. For developers, it denotes an edit-recompile manager: a piece of software for letting you take a software project under development and (re)build it, as quickly as possible. The installation of packages is a means, but it is not the end.

It should be clear that while these two use-cases have some shared mechanism, the priorities are overwhelmingly different:

  • End-users don't care about how a package is built, just that the things they want to install have been built. For developers, speed on rebuild is an overriding concern. To achieve this performance, a deep understanding of the structure of the programming language is needed.
  • End-users usually just want one version of any piece of software. Developers use multiple versions, because that is the cost of doing business with a diverse, rapidly updated, decentralized package ecosystem.
  • End-users care about it "just working": thus, a distribution package manager emphasizes control over the full stack (usually requiring root.) Developers care about flexibility for the software they are rebuilding and don't mind if a little setup is needed.

So the next time someone says that there are too many language-specific package managers, mentally replace "package manager" with "edit-recompile manager". Does the complaint still make sense? Maybe it does, but not in the usual sense: what they may actually be advocating for is an interface between these two worlds. And that seems like a project that is both tractable and worth doing.

  • September 2, 2016

Backpack and separate compilation

When building a module system which supports parametrizing code over multiple implementations (i.e., functors), you run into an important implementation question: how do you compile said parametric code? Among existing language implementations, there are three major schools of thought:

  1. The separate compilation school says that you should compile your functors independently of their implementations. This school values compilation time over performance: once a functor is built, you can freely swap out implementations of its parameters without needing to rebuild the functor, leading to fast compile times. Pre-Flambda OCaml works this way. The downside is that it's not possible to optimize the functor body based on implementation knowledge (unless, perhaps, you have a just-in-time compiler handy).
  2. The specialize at use school says, well, you can get performance by inlining functors at their use-sites, where the implementations are known. If the functor body is not too large, you can transparently get good performance benefits without needing to change the architecture of your compiler in major ways. Post-Flambda OCaml and C++ templates in the Borland model both work this way. The downside is that the code must be re-optimized at each use site, and there may end up being substantial duplication of code (though this can be reduced at link time).
  3. The repository of specializations school says that it's dumb to keep recompiling the instantiations: instead, the compiled code for each instantiation should be cached somewhere globally; the next time the same instance is needed, it should be reused. C++ templates in the Cfront model and Backpack work this way.
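
To make the trade-off concrete, here is a toy indefinite unit (a hedged sketch in bkp syntax; not from any real library):

unit ord-indef where
    signature Elem where
        data Elem
        cmp :: Elem -> Elem -> Ordering
    module Sort where
        import Elem
        -- Under separate compilation, cmp is an opaque call that
        -- cannot be inlined.  Specializing at the use site, or
        -- caching one compiled copy per instantiation, lets the
        -- compiler optimize this body for a known cmp.
        sort2 :: Elem -> Elem -> (Elem, Elem)
        sort2 x y = case cmp x y of
            GT -> (y, x)
            _  -> (x, y)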

The repository perspective sounds nice, until you realize that it requires major architectural changes to the way your compiler works: most compilers don't try to write intermediate results into some shared cache, and adding support for this can be quite complex and error-prone.

Backpack sidesteps the issue by offloading the work of caching instantiations to the package manager, which does know how to cache intermediate products. The trade off is that Backpack is not as integrated into Haskell itself as some might like (it's extremely not first-class.)

  • September 1, 2016

cabal new-build is a package manager

An old article I occasionally see cited today is Repeat after me: "Cabal is not a Package Manager". Many of the complaints don't apply to cabal-install 1.24's new Nix-style local builds. Let's set the record straight.

Fact: cabal new-build doesn't handle non-Haskell dependencies

OK, so this is one thing that hasn't changed since Ivan's article. Unlike Stack, cabal new-build will not handle downloading and installing GHC for you, and like Stack, it won't download and install system libraries or compiler toolchains: you have to do that yourself. This is definitely a case where you should lean on your system package manager to bootstrap a working installation of Cabal and GHC.

Fact: The Cabal file format can record non-Haskell pkg-config dependencies

Since 2007, the Cabal file format has a pkgconfig-depends field which can be used to specify dependencies on libraries understood by the pkg-config tool. It won't install the non-Haskell dependency for you, but it can let you know early on if a library is not available.

In fact, cabal-install's dependency solver knows about the pkgconfig-depends field, and will pick versions and set flags so that we don't end up with a package with an unsatisfiable pkg-config dependency.
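
For reference, such a stanza might look like this (the particular libraries are for illustration only):

library
  build-depends:     base
  pkgconfig-depends: gtk+-2.0, cairo >= 1.0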

Fact: cabal new-build 2.0 handles build-tools dependencies

As of writing, this feature is unreleased (if you are impatient, get a copy of HEAD from the GitHub repository or install cabal-install-head from hvr's PPA). However, in cabal-install 2.0, build-tools dependencies will be transparently built and added to your PATH. Thus, if you want to install a package which has build-tools: happy, cabal new-build will automatically install happy and add it to the PATH when building this package. These executables are tracked by new-build and we will avoid rebuilding the executable if it is already present.

Since build-tools identify executable names, not packages, there is a set of hardcoded build-tools which are treated in this way, coinciding with the set of build-tools that simple Setup scripts know how to use natively. They are hscolour, haddock, happy, alex, hsc2hs, c2hs, cpphs and greencard.
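
For illustration, a hypothetical package description that would trigger this behavior:

executable my-program
  main-is:       Main.hs
  build-depends: base
  build-tools:   happy, alex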

Fact: cabal new-build can upgrade packages without breaking your database

Suppose you are working on some project which depends on a few dependencies. You decide to upgrade one of your dependencies by relaxing a version constraint in your project configuration. After making this change, all it takes is a cabal new-build to rebuild the relevant dependency and start using it. That's it! Even better, if you had an old project using the old dependency, well, it still is working, just as you would hope.

What is actually going on is that cabal new-build doesn't do anything like a traditional upgrade. Packages installed to cabal new-build's global store are uniquely identified by a Nix-style identifier which captures all of the information that may have affected the build, including the specific versions that were built against. Thus, a package "upgrade" actually is just the installation of a package under a different unique identifier which can coexist with the old one. You will never end up with a broken package database because you typed new-build.
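
As a sketch (the hashes are invented), after such an "upgrade" the two builds simply coexist in the store:

~/.cabal/store/ghc-8.0.1/text-1.2.2.1-a1b2c3.../
~/.cabal/store/ghc-8.0.1/text-1.2.2.2-d4e5f6.../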

There is not presently a mechanism for removing packages besides deleting your store (.cabal/store), but it is worth noting that deleting your store is a completely safe operation: cabal new-build won't decide that it wants to build your package differently if the store doesn't exist; the store is purely a cache and does not influence the dependency solving process.

Fact: Hackage trustees, in addition to package authors, can edit Cabal files for published packages to fix bugs

If a package is uploaded with bad version bounds and a subsequent new release breaks them, a Hackage Trustee can intervene, making a modification to the Cabal file to update the version bounds in light of the new information. This is a more limited form of intervention than the patches of Linux distributions, but it is similar in nature.

Fact: If you can, use your system package manager

cabal new-build is great, but it's not for everyone. If you just need a working pandoc binary on your system and you don't care about having the latest and greatest, you should download and install it via your operating system's package manager. Distro packages are great for binaries; they're less good for libraries, which are often too old for developers (though it is often the easiest way to get a working install of OpenGL). cabal new-build is oriented at developers of Haskell packages, who need to build and depend on packages which are not distributed by the operating system.

I hope this post clears up some misconceptions!

  • August 29, 2016