ezyang’s blog

the arc of software bends towards understanding

The Base of a String Theory for Haskell

One of the early posts from this blog, from 2010, was on the subject of how to pick your string library in Haskell. Half a decade later, the Haskell ecosystem is still largely in the same situation: most of the boot libraries shipped with GHC (e.g., base) still use the String type, despite the existence of superior string types. The problem is twofold:

  1. No one wants to break all of the existing code, which means libraries like base have to keep String versions of all their code. You can't just search-replace every occurrence of String with Text.
  2. No one wants to be in the business of maintaining two copies of any piece of code, which are copy-pastes of each other but subtly different. In practice, we end up doing so anyway: e.g., unix has ByteString variants of all of its functions (done by copy-paste); text provides some core IO functionality (also done by copy-paste). But it is terrible and scales poorly: every downstream library that wants to support two string types (or more) now has to publish two copies of itself, and any new string implementation has the unenviable task of reimplementing the world to make itself useful.

Backpack solves these problems by allowing you to parametrize over a signature rather than a concrete implementation of a string type, and to instantiate such an indefinite library whenever you want. This solves both problems:

  1. Because you are allowed to instantiate an indefinite library whenever you want, we can eagerly instantiate a posix-indef using String and ship it as posix, keeping backwards compatibility with all packages which are Backpack ignorant.
  2. At the same time, if packages depend directly on posix-indef, they themselves are parametrizable over a string type. Entire library ecosystems can defer the choice of string type to the end user, which on a sufficiently new version of GHC offers a backwards-compatible way of adding support for new string types to a library. (I don't want to say "support multiple string types", because this is not necessarily a virtue in and of itself.)
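
To make the two steps concrete, here is a minimal sketch in the single-file format accepted by ghc --backpack. Every unit, module, and function name in it (str-string, posix-indef, Posix.Describe, toString, describe) is made up for illustration and is not part of any real package; a real library would normally express the same wiring through the signatures and mixins fields of its Cabal file instead.

```haskell
-- Hypothetical sketch; compile with: ghc --backpack sketch.bkp
-- All unit, module, and function names below are illustrative.

unit str-string where
  module Str where
    -- The old-fashioned choice: a string is a list of Char.
    type Str = String

    toString :: Str -> String
    toString = id

unit posix-indef where
  -- The hole: any module providing these names can fill it.
  signature Str where
    data Str
    toString :: Str -> String

  module Posix.Describe where
    import Str

    -- Written once, against the signature only.
    describe :: Str -> String
    describe s = "path: " ++ toString s

unit main where
  -- Eagerly instantiate the indefinite unit with the String implementation.
  dependency posix-indef[Str=str-string:Str]

  module Main where
    import Posix.Describe (describe)

    main :: IO ()
    main = putStrLn (describe "/etc/passwd")
```

The main unit plays the role of the Backpack-ignorant client: it only sees Posix.Describe at the String instantiation. A second client could fill the same hole with a different Str module without posix-indef changing at all.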

To this end, I would like to propose a string theory, for the base of GHC Haskell: namely the core boot libraries that are distributed with GHC today. These packages will set the tone for the eventual Backpackification of the rest of the ecosystem.

But first, what is it that we are parametrizing over? A string is not so simple...

A digression on file paths (and OS strings)

File paths (FilePath) are an important form of String which aren't really Unicode strings at all. POSIX specifies that file paths can be arbitrary C strings; thus, code which decodes a file path as Unicode must be cognizant of the fact that the underlying ByteString could contain arbitrary, undecodable nonsense. To make matters worse, even the encoding can vary: on Windows file paths are encoded in UTF-16 (with unpaired surrogates, eek!), while in modern Linux environments the encoding is dictated by the locale (base uses locale_charset to determine how to interpret file paths; the locale is often UTF-8, but not always).
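
As a small illustration of the "undecodable nonsense" problem, the sketch below pushes a byte sequence that is not valid UTF-8 through a String and back. It uses only base and assumes that mkTextEncoding accepts the "//ROUNDTRIP" suffix, which asks GHC to map undecodable bytes to escape characters rather than fail. The names rawName, decodeWith and encodeWith are hypothetical helpers, not existing library functions.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (mkTextEncoding)

-- The bytes of "a", then 0xFF (which never occurs in valid UTF-8), then "b".
-- POSIX accepts this as a file name.
rawName :: [Word8]
rawName = [0x61, 0xFF, 0x62]

-- Decode raw bytes to a String using the named encoding.
decodeWith :: String -> [Word8] -> IO String
decodeWith encName bytes = do
  enc <- mkTextEncoding encName
  withArrayLen bytes $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)

-- Encode a String back to raw bytes using the named encoding.
encodeWith :: String -> String -> IO [Word8]
encodeWith encName str = do
  enc <- mkTextEncoding encName
  GHC.withCStringLen enc str $ \(ptr, len) ->
    peekArray len (castPtr ptr)
```

With the encoding name "UTF-8//ROUNDTRIP", decoding rawName and encoding the result again should give back the original bytes. The intermediate String is not meaningful Unicode text, though, which is exactly why treating it as an ordinary String is questionable.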

Thus, the definition type FilePath = String is very questionable indeed. There is an existing proposal, the Abstract FilePath Proposal, to turn FilePath into an abstract type, and not just a type synonym for String. Unfortunately, a change like this is backwards-incompatible, so it will take some time to implement, since GHC must first be taught to warn when FilePath is used as if it were a String, to help people find out that they are using it incorrectly.

Backpack offers a more decentralized way to move into the future: just define an abstract signature for FilePath to depend upon. The low level signature might look like this:

{-# LANGUAGE CPP #-}
signature FilePath where

import Foreign.C.String (CString, CWString)

-- | File and directory names, whose precise
-- meaning is operating system dependent. Files can be opened, yielding a
-- handle which can then be used to operate on the contents of that file.
data FilePath

-- | A C string (pointer to an array of C characters terminated by NUL)
-- representing a file path, suitable for use with the operating system
-- C interface for file manipulation.  This exact type is architecture
-- dependent.
type CFilePath =
#ifdef mingw32_HOST_OS
        CWString
#else
        CString
#endif

withFilePath :: FilePath -> (CFilePath -> IO a) -> IO a
newFilePath  :: FilePath -> IO CFilePath
peekFilePath :: CFilePath -> IO FilePath
-- peekFilePath >=> newFilePath should be identity
-- (this is tricky to achieve if FilePath is a
-- Unicode-based type, like String)

And of course, you would want all of the FilePath manipulation functions that people use.

To maintain compatibility with the existing ecosystem, you would likely instantiate your library with type FilePath = String. But there is nothing stopping you from picking your own abstract FilePath type and using it instead.
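
As a rough illustration of the backwards-compatible choice, a String-backed module filling the signature might look like the following. This is only a sketch: it assumes a POSIX host (so CFilePath is CString) and uses the marshalling functions from Foreign.C.String, which encode with the current locale and may not round-trip undecodable bytes, so it does not by itself satisfy the identity law noted in the signature.

```haskell
-- Hypothetical String-backed implementation of the FilePath signature.
-- In a Backpack setup this would be a module named FilePath; the module
-- header is omitted here so the file loads on its own.
-- Assumes a POSIX host, where CFilePath is CString.
import Prelude hiding (FilePath)
import Foreign.C.String (CString, newCString, peekCString, withCString)

-- The compatibility choice: file paths are plain Strings.
type FilePath = String

type CFilePath = CString

-- Marshal the path to a temporary C string for the duration of the action.
withFilePath :: FilePath -> (CFilePath -> IO a) -> IO a
withFilePath = withCString

-- Allocate a C copy of the path; the caller must release it with
-- Foreign.Marshal.Alloc.free.
newFilePath :: FilePath -> IO CFilePath
newFilePath = newCString

-- Read a NUL-terminated C string back into a FilePath.
peekFilePath :: CFilePath -> IO FilePath
peekFilePath = peekCString
```

A different implementation could keep the same three function names but define FilePath as an abstract wrapper around raw bytes, which is the kind of swap the signature is meant to allow.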

File paths are not unique in this sense; there are other strings (such as the values of environment variables) which have similar properties. I've taken to calling these OSStrings (after Rust's OsString).

Axes of parametrization

With this in mind, there are three "string variants" that any given library can be parametrized over:

  1. They can be parametrized over FilePath, for modules which deal with the file system (e.g., System.Posix.Directory)
  2. They can be parametrized over an OSString, because they deal with various operating system specific APIs (e.g., System.Posix.Env)
  3. They can be parametrized over a String, because, well, sometimes a string is just a string. (e.g., Text.ParserCombinators.ReadP)

Some libraries may be parametrized in multiple ways: for example, readFile needs to be parametrized over both FilePath and String.

Split base (and friends) for Backpack

For technical reasons, Backpack cannot be used to parametrize specific modules; you have to parametrize over an entire library. So a side-effect of Backpack-ing the core libraries is that they will be split into a number of smaller libraries. Using module reexports, you can still keep the old libraries around as shims.

There are four GHC boot libraries which would most benefit from modularization on strings:

  • base
    • base-io (System.IO and submodules; parametrized over FilePath and String)
    • There are a few other modules which could be stringified, but the marginal benefit may not justify making a new package for each (Data.String, System.Console.GetOpt, Text.ParserCombinators.ReadP, Text.Printf). Each of these only needs to be parametrized over String.
    • Control.Exception, Text.Read and Text.Show are explicit non-goals: they are too deeply wired into GHC at present to muck about with.
  • unix
    • unix-env (System.Posix.Env, parametrized over OSString)
    • unix-fs (System.Posix.Directory, System.Posix.Files, System.Posix.Temp; parametrized over FilePath)
    • unix-process (System.Posix.Process, parametrized over FilePath and OSString)
  • pretty (parametrized over String; then GHC could use it rather than roll its own copy!)
  • process (parametrized over String, OSString and FilePath)

The naming scheme I propose is that, e.g., the package unix continues to be the package instantiated with old-fashioned Strings. Then unix-indef is a package which is uninstantiated (the user can instantiate it to what they want, or pass on the decision to their users). Some packages may choose to also provide shims of their package instantiated with specific types, e.g., base-io-bytestring, which would be base-io instantiated with ByteString rather than String, though these names could get to be quite long, so it's uncertain how useful this would be.

Closing remarks

Of all the packages mentioned here, only base could make the bold step of using Backpack in the next GHC release (although it won't; at least, not for GHC 8.2); the rest need to maintain backwards compatibility with old versions of GHC and so would have to be forked to use Backpack.

The real test for Backpack will be whether or not string-using packages in the ecosystem decide to sign on, and parametrize themselves over signatures. I hope that eventually, you can use any library with ByteString or Text with the same ease that you can use libraries with String (and maybe even use your own, home-grown type). The problem with module systems is that you rarely see the benefits until you use them for big systems, which makes it difficult to evaluate them beforehand. But the benefits seem tantalizing: let's boldly backpack forth to a brighter future!

4 Responses to “The Base of a String Theory for Haskell”

  1. David Turner says:

    Hi Edward,

    Windows file paths are a strict superset of UTF-16 because it is legal to have unpaired surrogates in them. Strange, but legal.

    You mention having to handle round-tripping illegal UTF-8 sequences on UNIX (while otherwise treating paths as text). Just thought I’d point out that you would also have to do the same on Windows.

    IMO the only good way to represent paths is as ByteStrings; when you need to convert them to something human-readable then the conversion should be platform dependent.

    Cheers,

  2. David: Thanks, I was actually aware of this fact, but didn’t state it clearly (reworded the post!). While ByteStrings are a compelling representation for file paths, I think there are also good reasons to want a Unicode-aware representation in transit; for one, you might need a type that can be interpreted as a Unicode string in a platform-independent way (with a ByteString, you are relying on the implicit knowledge of the “current OS” to give them meaning). So there’s not obviously a right choice, and languages have gone either way.

  3. cblp says:

    Hi,

    Why do we need Backpack? Can’t we achieve the same results with type classes?

  4. The biggest problem with using typeclasses for large scale modularity
    of this form is that typeclasses are nominal rather than structural.
    This can be seen most clearly if you look at an earlier proposed
    StringLike type class. Look how many methods it has! I guarantee
    you that any given user of the type class isn’t going to use all
    of these methods, but if you want to use this type class, if you
    have a string type, you better have implemented all of the methods.
    And god forbid you should want a new method, which is not in the
    type class?

    So, if it is not a priori obvious what the correct set of methods
    to include in a type class are, things tend to work out poorly.
    Structural signatures do a bit better: you can add things when you
    need them. I’m not saying signatures are perfectly egalitarian
    (if there’s a name clash, you’ll have to keep two signatures
    separate), but they scale quite a bit better, because anyone can
    write a signature saying “here is what I need” and anyone else
    can fill that need, and there doesn’t have to be a centralized
    source of truth for the interface.
