### The AST Typing Problem

#### by Edward Z. Yang

This Lambda the Ultimate post (dated 2010) describes a rather universal problem faced by compiler writers: how does one go about adding “extra information” (e.g. types) to an AST? (The post itself divides the problem into three components: adding the information to the data types, using the information to inform the construction of the node, and using the information to inform the destruction of a node—but I’m really only interested in the question of how you define your data type, not do things to it.) In this post, I want to sum up ways of solving the problem which were described in this post, and also take a look at what some real world compilers do. The running example lambda calculus looks like the following:

data Exp = Num Int | Bool Bool | Var Var | If Exp Exp Exp | Lambda Var Exp | App Exp Exp data Type = TyInt | TyBool | TyArrow Type Type

### Separate IR where nodes are decorated with types

The low-tech solution: if you need a new version of the IR with more information, just define a new IR type where each node can also carry the information. A trick to make these definitions more concise is to make a mutually recursive data structure. [1]

type TExp = (TExp', Type) data TExp' = TNum Int | TBool Bool | TVar Var | TIf TExp TExp TExp | TLambda Var TExp | TApp TExp TExp

Despite (or perhaps because of) it’s simplicity, this approach is extremely popular among many compilers, especially in the ML community. A few examples include OCaml (parsetree/typedtree), MLton (AST/CoreML) and Ikarus Scheme. Part of the reason for this is that the transition from frontend language to typed language also comes with some other changes, and when a new AST is defined those changes can be combined in too.

### Nullable field

The unprincipled solution: use one AST, but have an optional field in which you can slot in the information. [2]

type TExp = (TExp', Maybe Type) data TExp' = TNum Int | TBool Bool | TVar Var | TIf TExp TExp TExp | TLambda Var TExp | TApp TExp TExp

Presented without further comment.

### Explicit typing

While closely related to the separate IR solution, an explicitly typed IR takes the approach of not decorating each node with a type, but arranging that the type of any given node can be quickly computed using only local information. [3]

data TExp = TNum Int | TBool Bool | TVar Var | TIf TExp TExp TExp | TLambda Var Type TExp | TApp TExp TExp

Here, the difference between `TExp` and `Exp` is very slight; the `TLambda` is annotated with an explicit type for the binder. As far as type-checking is concerned, this makes a world of difference: we no longer need to look outside a lambda to figure out what the binder could be.

Forcing your IR to be explicitly typed is often a good idea for metatheoretic reasons, as complicated type systems often don’t have decidable inference algorithms. Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.

### Two-level types

By deferring when you tie the knot of a recursive data-structure, you can arrange for the base functor to do double-duty for the untyped and typed representations. [4]

data ExpF a = Num Int | Bool Bool | Var Var | If a a a | Lambda Var a | App a a newtype Exp = Exp (ExpF Exp) newtype TExp = TExp (ExpF TExp, Type)

The Coq kernel uses this to define its expression type, although it doesn’t use it to define an untyped variant.

### (Lazy) Attribute grammars

I don’t claim to understand this approach too well, but essentially it is a programming model distinct from usual algebraic data types which associates attributes over nodes of a tree. In some sense, it can be thought as a memoized function from AST nodes to the attributes. Many compilers do utlize maps, but only for top-level declarations. [5]

### Closing remarks

There were a few things that I did not mention here which came up in the discussion. One participant suggested using polymorphic variants to define the data type; this doesn’t help much with adding extra information but allows for different ways of writing traversal functions. Indeed, traversal is one of the big concerns, and the mention of generic programming also is targeted at this problem.

As for my preference? It’s hard to say. I’ve worked with compilers mostly written in the “define a new IR style”, and while the initial outlay of defining two data structures is quite annoying, it is mostly a fixed cost. What’s yours?

**Also, a question.** Is there a presentation of the conventional set of annotations needed to get explicitly typed System F?

Did you enjoy this post? Please subscribe to my feed!

FWIW, Ur/Web actually uses “explicit typing,” and I think Coq does too, though I’m less sure on the latter.

Good catch, I think you’re right on both.

You seem to have forgotten to delete (or rewrite) the paragraph starting with “GHC – lots of IRs”. It looks like a bunch of scribbled notes.

Haha, thanks. “That’s what I get for writing blog posts at 4am in the morning.”

The two-level type approach is under-used, I think. I’ve used it for making mutable structures (by recursing through an MVar or IOVar) and to implement hash consing. You could also do value numbering using that approach.

Oh, having said that, a common scenario is that rich expressive IRs (where a close match between the source program and the IR must be maintained, say for reporting errors to the user) are typically compiled down to desugared lower-level IRs (where fewer operations is better). If you’re using multiple IRs

anyway, there’s no point in making things too generic.Generic programming can make the “two-level types” approach easier to use, in particular allowing you to define only the original Exp and Type datatypes, and getting ExpF for free. For example:

http://hackage.haskell.org/package/Annotations (adding position information)

http://hackage.haskell.org/package/rewriting (adding meta-variables)

http://www.andres-loeh.de/GenericStorage/ (adding pointers)

I’ve encountered this problem too. And I’ve used yet another kind of solution for this.

I used a type declaration expression

`(x :: Int)`

, so I added another type of expression to the AST,`data Exp = TypeDecl Type Exp | Num Int | If Exp Exp Exp | ....`

After the type checking, I rebuild the AST and add TypeDecl nodes to every branch.I also tried not putting the type information in the AST in the first place, and passing the type and type environment along with the expression to the code generator tree walk and doing partial type checking as I go. This is crazy inefficient but it worked.

Pseudonym: Hmm, I wonder if this approach would work for types, so that types can be temporarily be replaced with IORefs for unification.

The two-level type approach makes it awkward when your AST is several mutually recursive types, because each type in the recursive cycle becomes a type parameter. Suppose that you wanted a separate AST type for functions. Now everything is parameterized over two types.

data ExpF e f = … | Lambda f | App e e

data FunF e f = Fun Var e

newtype Exp = Exp (ExpF Exp Fun)

newtype Fun = Fun (FunF Exp Fun)

newtype ExpT = Exp (ExpF ExpT FunT, Type)

newtype FunT = Fun (FunF ExpT FunT, Type)

The two type parameters cannot vary independently, which makes me think there may be a single-parameter solution using type families.

Edward, you’ve inspired me to try it out. I’ve been looking for something small but nontrivial to try with automatically extracting Haskell from Coq. This looks perfect.

heatsink: That’s the sort of situation where you really want O’Caml modules. :-)

heatsink:

There is indeed a single-parameter solution. You’ve shown how to construct a two-sorted AST using a fixpoint for pairs. In this encoding, separate constructions would be needed for more sorts. However, pairs (*,*) are isomorphic to functions (2 -> *), and an n-tuple is isomorphic to (n -> *). If we approximate n with *, we can get away with the single, mono-kinded higher order fixpoint of kind ((* -> *) -> (* -> *)) -> (* -> *). We can thus rewrite this:

newtype Fix f i = f (Fix f l) i

data (:+:) f g e l = L (f e l) | R (g e l)

data ExpL

data FunL

data ExpF e i where

…

Lambda :: e FunL -> ExpF e ExpL

App :: e ExpL -> e ExpL -> ExpF e ExpL

data FunF e i where

Fun :: Var -> e ExpL -> FunF e FunL

type Sig = FunF :+: ExpF

type Term = Fix Sig

type Exp = Term ExpL

type Fun = Term FunL

type (:&:) f a e i = (f e i) :&: a

type TypeTerm = Fix (Sig :&: Type)

Here, Exp has the rather pleasing interpretation of labeled trees whose topmost label is “ExpL.”

This idea is laid out in more detail in “Generic programming with fixed points for mutually recursive datatypes” (ICFP ’09), and is used in the multirec and compdata libraries.