ezyang’s blog

the arc of software bends towards understanding

The AST Typing Problem

This Lambda the Ultimate post (dated 2010) describes a rather universal problem faced by compiler writers: how does one go about adding “extra information” (e.g. types) to an AST? (The post itself divides the problem into three components: adding the information to the data types, using the information to inform the construction of the node, and using the information to inform the destruction of a node—but I’m really only interested in the question of how you define your data type, not do things to it.) In this post, I want to sum up ways of solving the problem which were described in this post, and also take a look at what some real world compilers do. The running example lambda calculus looks like the following:

data Exp = Num Int
         | Bool Bool
         | Var Var
         | If Exp Exp Exp
         | Lambda Var Exp
         | App Exp Exp
data Type = TyInt | TyBool | TyArrow Type Type

Separate IR where nodes are decorated with types

The low-tech solution: if you need a new version of the IR with more information, just define a new IR type where each node can also carry the information. A trick to make these definitions more concise is to make a mutually recursive data structure. [1]

type TExp = (TExp', Type)
data TExp' = TNum Int
           | TBool Bool
           | TVar Var
           | TIf TExp TExp TExp
           | TLambda Var TExp
           | TApp TExp TExp

Despite (or perhaps because of) it’s simplicity, this approach is extremely popular among many compilers, especially in the ML community. A few examples include OCaml (parsetree/typedtree), MLton (AST/CoreML) and Ikarus Scheme. Part of the reason for this is that the transition from frontend language to typed language also comes with some other changes, and when a new AST is defined those changes can be combined in too.

Nullable field

The unprincipled solution: use one AST, but have an optional field in which you can slot in the information. [2]

type TExp = (TExp', Maybe Type)
data TExp' = TNum Int
           | TBool Bool
           | TVar Var
           | TIf TExp TExp TExp
           | TLambda Var TExp
           | TApp TExp TExp

Presented without further comment.

Explicit typing

While closely related to the separate IR solution, an explicitly typed IR takes the approach of not decorating each node with a type, but arranging that the type of any given node can be quickly computed using only local information. [3]

data TExp = TNum Int
          | TBool Bool
          | TVar Var
          | TIf TExp TExp TExp
          | TLambda Var Type TExp
          | TApp TExp TExp

Here, the difference between TExp and Exp is very slight; the TLambda is annotated with an explicit type for the binder. As far as type-checking is concerned, this makes a world of difference: we no longer need to look outside a lambda to figure out what the binder could be.

Forcing your IR to be explicitly typed is often a good idea for metatheoretic reasons, as complicated type systems often don’t have decidable inference algorithms. Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.

Two-level types

By deferring when you tie the knot of a recursive data-structure, you can arrange for the base functor to do double-duty for the untyped and typed representations. [4]

data ExpF a = Num Int
            | Bool Bool
            | Var Var
            | If a a a
            | Lambda Var a
            | App a a
newtype Exp = Exp (ExpF Exp)
newtype TExp = TExp (ExpF TExp, Type)

The Coq kernel uses this to define its expression type, although it doesn’t use it to define an untyped variant.

(Lazy) Attribute grammars

I don’t claim to understand this approach too well, but essentially it is a programming model distinct from usual algebraic data types which associates attributes over nodes of a tree. In some sense, it can be thought as a memoized function from AST nodes to the attributes. Many compilers do utlize maps, but only for top-level declarations. [5]

Closing remarks

There were a few things that I did not mention here which came up in the discussion. One participant suggested using polymorphic variants to define the data type; this doesn’t help much with adding extra information but allows for different ways of writing traversal functions. Indeed, traversal is one of the big concerns, and the mention of generic programming also is targeted at this problem.

As for my preference? It’s hard to say. I’ve worked with compilers mostly written in the “define a new IR style”, and while the initial outlay of defining two data structures is quite annoying, it is mostly a fixed cost. What’s yours?

Also, a question. Is there a presentation of the conventional set of annotations needed to get explicitly typed System F?

18 Responses to “The AST Typing Problem”

  1. Adam Chlipala says:

    FWIW, Ur/Web actually uses “explicit typing,” and I think Coq does too, though I’m less sure on the latter.

  2. Good catch, I think you’re right on both.

  3. Gabriel says:

    You seem to have forgotten to delete (or rewrite) the paragraph starting with “GHC – lots of IRs”. It looks like a bunch of scribbled notes.

  4. Haha, thanks. “That’s what I get for writing blog posts at 4am in the morning.”

  5. Pseudonym says:

    The two-level type approach is under-used, I think. I’ve used it for making mutable structures (by recursing through an MVar or IOVar) and to implement hash consing. You could also do value numbering using that approach.

  6. Pseudonym says:

    Oh, having said that, a common scenario is that rich expressive IRs (where a close match between the source program and the IR must be maintained, say for reporting errors to the user) are typically compiled down to desugared lower-level IRs (where fewer operations is better). If you’re using multiple IRs anyway, there’s no point in making things too generic.

  7. José Pedro Magalhães says:

    Generic programming can make the “two-level types” approach easier to use, in particular allowing you to define only the original Exp and Type datatypes, and getting ExpF for free. For example:
    http://hackage.haskell.org/package/Annotations (adding position information)
    http://hackage.haskell.org/package/rewriting (adding meta-variables)
    http://www.andres-loeh.de/GenericStorage/ (adding pointers)

  8. exDM69 says:

    I’ve encountered this problem too. And I’ve used yet another kind of solution for this.

    I used a type declaration expression (x :: Int), so I added another type of expression to the AST, data Exp = TypeDecl Type Exp | Num Int | If Exp Exp Exp | .... After the type checking, I rebuild the AST and add TypeDecl nodes to every branch.

    I also tried not putting the type information in the AST in the first place, and passing the type and type environment along with the expression to the code generator tree walk and doing partial type checking as I go. This is crazy inefficient but it worked.

  9. Pseudonym: Hmm, I wonder if this approach would work for types, so that types can be temporarily be replaced with IORefs for unification.

  10. heatsink says:

    The two-level type approach makes it awkward when your AST is several mutually recursive types, because each type in the recursive cycle becomes a type parameter. Suppose that you wanted a separate AST type for functions. Now everything is parameterized over two types.

    data ExpF e f = … | Lambda f | App e e
    data FunF e f = Fun Var e
    newtype Exp = Exp (ExpF Exp Fun)
    newtype Fun = Fun (FunF Exp Fun)
    newtype ExpT = Exp (ExpF ExpT FunT, Type)
    newtype FunT = Fun (FunF ExpT FunT, Type)

    The two type parameters cannot vary independently, which makes me think there may be a single-parameter solution using type families.

  11. Pseudonym says:

    Edward, you’ve inspired me to try it out. I’ve been looking for something small but nontrivial to try with automatically extracting Haskell from Coq. This looks perfect.

  12. Pseudonym says:

    heatsink: That’s the sort of situation where you really want O’Caml modules. :-)

  13. Jimmy Koppel says:

    heatsink:

    There is indeed a single-parameter solution. You’ve shown how to construct a two-sorted AST using a fixpoint for pairs. In this encoding, separate constructions would be needed for more sorts. However, pairs (*,*) are isomorphic to functions (2 -> *), and an n-tuple is isomorphic to (n -> *). If we approximate n with *, we can get away with the single, mono-kinded higher order fixpoint of kind ((* -> *) -> (* -> *)) -> (* -> *). We can thus rewrite this:

    newtype Fix f i = f (Fix f l) i
    data (:+:) f g e l = L (f e l) | R (g e l)

    data ExpL
    data FunL

    data ExpF e i where

    Lambda :: e FunL -> ExpF e ExpL
    App :: e ExpL -> e ExpL -> ExpF e ExpL

    data FunF e i where
    Fun :: Var -> e ExpL -> FunF e FunL

    type Sig = FunF :+: ExpF

    type Term = Fix Sig

    type Exp = Term ExpL
    type Fun = Term FunL

    type (:&:) f a e i = (f e i) :&: a
    type TypeTerm = Fix (Sig :&: Type)

    Here, Exp has the rather pleasing interpretation of labeled trees whose topmost label is “ExpL.”

    This idea is laid out in more detail in “Generic programming with fixed points for mutually recursive datatypes” (ICFP ’09), and is used in the multirec and compdata libraries.

  14. mpickering says:

    Note that now with Pattern Synonyms (in haskell since 7.8) have made it much easier to maintain an interface whilst performing these changes than before.

  15. Rich says:

    I’ll give you another solution I tried, with the caveat that it’s *not* a good solution:

    Keep a weak hashtable around which maps your original (untyped) tree nodes, by address, to your node types.

    As I said, I tried this approach for an HTML/CSS parser I was writing, but I couldn’t get the implementation of the weak hashtable to work well. In particular for whatever reason my hashtable never freed any entries when the tree was freed.

  16. […] Can you define your AST in a way that all semantic errors become type errors? This blog post The AST Typing Problem by Edward Z. Yang is a good introduction. I might write another post sometime later explaining a […]

  17. Edward Kmett says:

    Tying this together with some of your more recent work:

    This encodes rather nicely if you adapt the “Trees that Grow” approach to Backpack!

  18. Benedict Gaster says:

    Edward Kmett said:
    > This encodes rather nicely if you adapt the “Trees that Grow” approach to Backpack!

    I’d been thinking about a solution to this problem for the last couple of days, found this blog via GHC comments and then found this paper via your comment, which was just the solution I was looking for. A couple of hours this morning refactoring… Thanks!

Leave a Comment