The AST Typing Problem : ezyang’s blog

The AST Typing Problem

This Lambda the Ultimate post (dated 2010) describes a rather universal problem faced by compiler writers: how does one go about adding “extra information” (e.g. types) to an AST? (The post itself divides the problem into three components: adding the information to the data types, using the information to inform the construction of the node, and using the information to inform the destruction of a node—but I’m really only interested in the question of how you define your data type, not do things to it.) In this post, I want to sum up ways of solving the problem which were described in this post, and also take a look at what some real world compilers do. The running example lambda calculus looks like the following:

data Exp = Num Int
         | Bool Bool
         | Var Var
         | If Exp Exp Exp
         | Lambda Var Exp
         | App Exp Exp
data Type = TyInt | TyBool | TyArrow Type Type

Separate IR where nodes are decorated with types

The low-tech solution: if you need a new version of the IR with more information, just define a new IR type where each node can also carry the information. A trick to make these definitions more concise is to make a mutually recursive data structure. [1]

type TExp = (TExp', Type)
data TExp' = TNum Int
           | TBool Bool
           | TVar Var
           | TIf TExp TExp TExp
           | TLambda Var TExp
           | TApp TExp TExp

Despite (or perhaps because of) it’s simplicity, this approach is extremely popular among many compilers, especially in the ML community. A few examples include OCaml (parsetree/typedtree), MLton (AST/CoreML) and Ikarus Scheme. Part of the reason for this is that the transition from frontend language to typed language also comes with some other changes, and when a new AST is defined those changes can be combined in too.

Nullable field

The unprincipled solution: use one AST, but have an optional field in which you can slot in the information. [2]

type TExp = (TExp', Maybe Type)
data TExp' = TNum Int
           | TBool Bool
           | TVar Var
           | TIf TExp TExp TExp
           | TLambda Var TExp
           | TApp TExp TExp

Presented without further comment.

Explicit typing

While closely related to the separate IR solution, an explicitly typed IR takes the approach of not decorating each node with a type, but arranging that the type of any given node can be quickly computed using only local information. [3]

data TExp = TNum Int
          | TBool Bool
          | TVar Var
          | TIf TExp TExp TExp
          | TLambda Var Type TExp
          | TApp TExp TExp

Here, the difference between TExp and Exp is very slight; the TLambda is annotated with an explicit type for the binder. As far as type-checking is concerned, this makes a world of difference: we no longer need to look outside a lambda to figure out what the binder could be.

Forcing your IR to be explicitly typed is often a good idea for metatheoretic reasons, as complicated type systems often don’t have decidable inference algorithms. Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.

Two-level types

By deferring when you tie the knot of a recursive data-structure, you can arrange for the base functor to do double-duty for the untyped and typed representations. [4]

data ExpF a = Num Int
            | Bool Bool
            | Var Var
            | If a a a
            | Lambda Var a
            | App a a
newtype Exp = Exp (ExpF Exp)
newtype TExp = TExp (ExpF TExp, Type)

The Coq kernel uses this to define its expression type, although it doesn’t use it to define an untyped variant.

(Lazy) Attribute grammars

I don’t claim to understand this approach too well, but essentially it is a programming model distinct from usual algebraic data types which associates attributes over nodes of a tree. In some sense, it can be thought as a memoized function from AST nodes to the attributes. Many compilers do utlize maps, but only for top-level declarations. [5]

Closing remarks

There were a few things that I did not mention here which came up in the discussion. One participant suggested using polymorphic variants to define the data type; this doesn’t help much with adding extra information but allows for different ways of writing traversal functions. Indeed, traversal is one of the big concerns, and the mention of generic programming also is targeted at this problem.

As for my preference? It’s hard to say. I’ve worked with compilers mostly written in the “define a new IR style”, and while the initial outlay of defining two data structures is quite annoying, it is mostly a fixed cost. What’s yours?

Also, a question. Is there a presentation of the conventional set of annotations needed to get explicitly typed System F?

« Previous Post Next Post »