Type Kata: Local data type : ezyang’s blog

Type Kata: Local data type

The imperative. When should you create a custom data type, as opposed to reusing pre-existing data types such as Either, Maybe or tuples? Here are some reasons you should reuse a generic type:

It saves typing (both in declaration and in pattern matching), making it good for one-off affairs,
It gives you a library of predefined functions that work with that type,
Other developers have expectations about what the type does that make understanding quicker.

On the flip side of the coin:

You may lose semantic distinction between types that are the same but have different meanings (the newtype argument),
The existing type may allow more values than you intended to allow,
Other developers have expectations about what the type does that can cause problems if you mean something different.

In this post, I’d like to talk about those last two problems with reusing custom types, using two case studies from the GHC codebase, and how the problems where alleviated by defining a data type that was only used by a small section of the codebase.

Great expectations. The Maybe type, by itself, has a very straight-forward interpretation: you either have the value, or you don’t. Even when Nothing means something like Wildcard or Null or UseTheDefault, the meaning is usually fairly clear.

What is more interesting, however, is when Maybe is placed in another data type that has its own conception of nothing-ness. A very simple example is Maybe (Maybe a), which admits the values Nothing, Just Nothing or Just (Just a). What is Just Nothing supposed to mean? In this case, what we really have masquerading here is a case of a data type with three constructors: data Maybe2 a = Nothing1 | Nothing2 | Just a. If we intend to distinguish between Nothing and Just Nothing, we need to assign some different semantic meaning to them, meaning that is not obvious from the cobbled together data type.

Another example, which comes from Hoopl and GHC, is the curious case of Map Var (Maybe Lit). A map already has its own conception of nothingness: that is, when the key-value pair is not in the map at all! So the first thing a developer who encounters this code may ask is, “Why isn’t it just Map Var Lit?” The answer to this question, for those of you have read the dataflow lattice post, is that Nothing in this circumstance, represents top (there is variable is definitely not constant), which is different from absence in the map, which represents bottom (the variable is constant or not constant). I managed to confuse both the Simons with this strange piece of code, and after taking some time to explain the situation, and they immediately recommended that I make a custom data type for the purpose. Even better, I found that Hoopl already provided this very type: HasTop, which also had a number of utility functions that reflected this set of semantics. Fortuitous indeed!

Uninvited guests. Our example of a data type that allows too many values comes from the linear register allocator in GHC (compiler/nativeGen/RegAlloc/Linear/Main.hs). Don’t worry, you don’t need to know how to implement a linear register allocator to follow along.

The linear register allocator is a rather large and unwieldy beast. Here is the function that actually allocates and spills the registers:

allocateRegsAndSpill reading keep spills alloc (r:rs)
 = do   assig <- getAssigR
        case lookupUFM assig r of
                -- case (1a): already in a register
                Just (InReg my_reg) ->
                        allocateRegsAndSpill reading keep spills (my_reg:alloc) rs

                -- case (1b): already in a register (and memory)
                -- NB1. if we're writing this register, update its assignment to be
                -- InReg, because the memory value is no longer valid.
                -- NB2. This is why we must process written registers here, even if they
                -- are also read by the same instruction.
                Just (InBoth my_reg _)
                 -> do  when (not reading) (setAssigR (addToUFM assig r (InReg my_reg)))
                        allocateRegsAndSpill reading keep spills (my_reg:alloc) rs

                -- Not already in a register, so we need to find a free one...
                loc -> allocRegsAndSpill_spill reading keep spills alloc r rs loc assig

There’s a bit of noise here, but the important things to notice is that it’s a mostly recursive function. The first two cases of lookupUFM directly call allocateRegsAndSpill, but the last case needs to do something complicated, and calls the helper function allocRegsAndSpill_spill. It turns out that this function will always eventually call allocateRegsAndSpill, so we have two mutually recursive functions. The original was probably just recursive, but the long bit handling “Not already in a register, so we need to find a free one” got factored out at some point.

This code is reusing a data type! Can you see it? It’s very subtle, because the original use of the type is legitimate, but it is then reused in an inappropriate manner. The answer is loc, in the last case statement. In particular, because we’ve already case-matched on loc, we know that it can’t possibly be Just (InReg{}) or Just (InBoth{}). If we look at the declaration of Loc, we see that there are only two cases left:

data Loc
        -- | vreg is in a register
        = InReg   !RealReg

        -- | vreg is held in a stack slot
        | InMem   {-# UNPACK #-}  !StackSlot


        -- | vreg is held in both a register and a stack slot
        | InBoth   !RealReg
                   {-# UNPACK #-} !StackSlot
        deriving (Eq, Show, Ord)

That is, the only remaining cases are Just (InMem{}) and Nothing. This is fairly important, because we lean on this invariant later in allocRegsAndSpill_spill:

let new_loc
        -- if the tmp was in a slot, then now its in a reg as well
        | Just (InMem slot) <- loc
        , reading
        = InBoth my_reg slot

        -- tmp has been loaded into a reg
        | otherwise
        = InReg my_reg

If you hadn’t seen the original case split in allocateRegsAndSpill, this particular code may have got you wondering if the last guard also applied when the result was Just (InReg{}), in which case the behavior would be very wrong. In fact, if we are reading, loc must be Nothing in that last branch. But there’s no way of saying that as the code stands: you’d have to add some panics and it gets quite messy:

let new_loc
        -- if the tmp was in a slot, then now its in a reg as well
        | Just (InMem slot) <- loc
        = if reading then InBoth my_reg slot else InReg my_reg

        -- tmp has been loaded into a reg
        | Nothing <- loc
        = InReg my_reg

        | otherwise = panic "Impossible situation!"

Furthermore, we notice a really interesting extra invariant: what happens if we’re reading from a location that has never been assigned to before (that is, reading is True and loc is Nothing)? That is obviously bogus, so in fact we should check if that’s the case. Notice that the original code did not enforce this invariant, which was masked out by the use of otherwise.

Rather than panic on an impossible situation, we should statically rule out this possibility. We can do this simply by introducing a new data type for loc, and pattern-matching appropriately:

-- Why are we performing a spill?
data SpillLoc = ReadMem StackSlot  -- reading from register only in memory
              | WriteNew           -- writing to a new variable
              | WriteMem           -- writing to register only in memory
-- Note that ReadNew is not valid, since you don't want to be reading
-- from an uninitialized register.  We also don't need the location of
-- the register in memory, since that will be invalidated by the write.
-- Technically, we could coalesce WriteNew and WriteMem into a single
-- entry as well. -- EZY

allocateRegsAndSpill reading keep spills alloc (r:rs)
 = do       assig <- getAssigR
        let doSpill = allocRegsAndSpill_spill reading keep spills alloc r rs assig
        case lookupUFM assig r of
                -- case (1a): already in a register
                Just (InReg my_reg) ->
                        allocateRegsAndSpill reading keep spills (my_reg:alloc) rs

                -- case (1b): already in a register (and memory)
                -- NB1. if we're writing this register, update its assignment to be
                -- InReg, because the memory value is no longer valid.
                -- NB2. This is why we must process written registers here, even if they
                -- are also read by the same instruction.
                Just (InBoth my_reg _)
                 -> do      when (not reading) (setAssigR (addToUFM assig r (InReg my_reg)))
                        allocateRegsAndSpill reading keep spills (my_reg:alloc) rs

                -- Not already in a register, so we need to find a free one...
                Just (InMem slot) | reading   -> doSpill (ReadMem slot)
                                  | otherwise -> doSpill WriteMem
                Nothing | reading   -> pprPanic "allocateRegsAndSpill: Cannot read from uninitialized register" (ppr r)
                        | otherwise -> doSpill WriteNew

And now the pattern match inside allocateRegsAndSpill_spill is nice and tidy:

-- | Calculate a new location after a register has been loaded.
newLocation :: SpillLoc -> RealReg -> Loc
-- if the tmp was read from a slot, then now its in a reg as well
newLocation (ReadMem slot) my_reg = InBoth my_reg slot
-- writes will always result in only the register being available
newLocation _ my_reg = InReg my_reg

gasche says:

April 7, 2011 at 5:43 am

Re. the difference between (2 + a) and (1 + (1 + a)) : it is not true that those are exactly the same. (1 + (1 + a)) has a *hierarchized* notion of “failure” (or more generally pointedness). If you have a primary failure you get None, and if you have a secondary failure you get Just None. (2 + a) remove this distinction. It is the right thing to do where there is no hierarchy, or it is not apparent enough, but not in all situations.

For example, consider I may a query to a remote database. I know that the query may return at most one result, but I also know that the remote database may be unavailable. It is natural that “no connection” is None, and “Just result” is a (possibly empty) result. Furthermore it will mesh better with the rest of my code that uses (Maybe value) as a possibly-empty value.

Edward Z. Yang says:

April 7, 2011 at 5:47 am

I agree, and the difference becomes more clear when you add laziness to the mix (so Nothing and Just Nothing are actually very different.) However, my opinion is that the potential for semantic confusion is such that at least one of the layers should use its own data type.

DSC says:

April 8, 2011 at 3:52 pm

s/When should create a custom data type/When should you create a custom data type/

Thanks for the post!

April 8, 2011 at 4:36 pm

Thanks, fixed!