ezyang’s blog

the arc of software bends towards understanding

Idiomatic algebraic data types in Python with dataclasses and Union

One of the features I miss most in non-Haskell programming languages is algebraic data types (ADTs). ADTs fulfill a similar role to objects in other languages, but with more restrictions: objects are an open universe, where clients can implement new subclasses that were not known at definition time; ADTs are a closed universe, where the definition of an ADT specifies precisely all the cases that are possible. We often think of restrictions as a bad thing, but in the case of ADTs, the restriction of being a closed universe makes programs easier to understand (a fixed set of cases to understand, as opposed to a potentially infinite set of cases) and allows for new modes of expression (pattern matching). ADTs make it really easy to accurately model your data structures; they encourage you to go for precise types that make illegal states unrepresentable. Still, it is generally not a good idea to try to manually reimplement your favorite Haskell language feature in every other programming language you use, and so for years I've suffered in Python under the impression that ADTs were a no-go.

Recently, however, I have noticed that a number of new features in Python 3 have made it possible to use objects in the same style as ADTs, in idiomatic Python with virtually no boilerplate. The key features:

  • A structural static type checking system with mypy; in particular, the ability to declare Union types, which let you represent values that could be one of a fixed set of other types, and the ability to refine the type of a variable by performing an isinstance check on it.
  • The dataclasses library, which allows you to conveniently define (possibly immutable) structures of data without having to write boilerplate for the constructor.

The key idea: define each constructor as a dataclass, put the constructors together into an ADT using a Union type, and use isinstance tests to do pattern matching on the result. The result is just as good as an ADT (or better, perhaps; their structural nature bears more similarity to OCaml's polymorphic variants).

Here's how it works. Let's suppose that you want to define an algebraic data type with two constructors, which in Haskell would look like this:

data Result
   = OK Int
   | Failure String

showResult :: Result -> String
showResult (OK result) = show result
showResult (Failure msg) = "Failure: " ++ msg

First, we define each constructor as a dataclass:

from dataclasses import dataclass

@dataclass(frozen=True)
class OK:
    result: int

@dataclass(frozen=True)
class Failure:
    msg: str

Using the automatically generated constructors from dataclasses, we can construct values of these dataclasses using OK(2) or Failure("something wrong"). Next, we define a type synonym for the union of these two classes:
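Here's a small sketch of what the generated methods buy you. It assumes the classes are declared with @dataclass(frozen=True); the frozen flag is an assumption on my part, and it is what makes the instances immutable:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class OK:
    result: int

@dataclass(frozen=True)
class Failure:
    msg: str

ok = OK(2)
err = Failure("something wrong")

# The generated __repr__ and __eq__ work field by field.
print(ok)           # OK(result=2)
print(ok == OK(2))  # True
print(ok == err)    # False: different classes never compare equal

# With frozen=True, assigning to a field raises FrozenInstanceError.
try:
    ok.result = 3  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
print(mutated)      # False
```

Without frozen=True the classes still work as constructors, but values can be mutated after construction, which is further from how ADT values behave in Haskell.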

from typing import Union

Result = Union[OK, Failure]

Finally, we can do pattern matching on Result by doing isinstance tests:

from typing import NoReturn

def assert_never(x: NoReturn) -> NoReturn:
    raise AssertionError("Unhandled type: {}".format(type(x).__name__))

def showResult(r: Result) -> str:
    if isinstance(r, OK):
        return str(r.result)
    elif isinstance(r, Failure):
        return "Failure: " + r.msg
    else:
        # Only type checks if every constructor was handled above.
        assert_never(r)

assert_never is a well-known trick for doing exhaustiveness checking in mypy. If we haven't covered all cases with enough isinstance checks, mypy will complain that assert_never was given a type like UnhandledCtor when it expected NoReturn (which is the uninhabited type in Python).
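To see the check in action, here's a sketch that adds a hypothetical third constructor, Pending (invented for this illustration), and forgets to handle it. mypy should reject the assert_never call, because r is still a Pending in the else branch; at runtime the same call raises:

```python
from dataclasses import dataclass
from typing import NoReturn, Union

@dataclass(frozen=True)
class OK:
    result: int

@dataclass(frozen=True)
class Failure:
    msg: str

@dataclass(frozen=True)
class Pending:  # hypothetical extra case, for illustration only
    pass

Result = Union[OK, Failure, Pending]

def assert_never(x: NoReturn) -> NoReturn:
    raise AssertionError("Unhandled type: {}".format(type(x).__name__))

def show_result(r: Result) -> str:
    if isinstance(r, OK):
        return str(r.result)
    elif isinstance(r, Failure):
        return "Failure: " + r.msg
    else:
        # Pending is not handled above, so a type checker is expected to
        # flag this call; at runtime it raises AssertionError.
        assert_never(r)  # type: ignore[arg-type]

print(show_result(OK(2)))  # 2
try:
    show_result(Pending())
except AssertionError as e:
    print(e)  # Unhandled type: Pending
```

The type: ignore comment is only there so the deliberately incomplete example passes a type checker; in real code you would leave it off and add the missing isinstance branch instead.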

That's all there is to it. As an extra bonus, this style of writing unions is compatible with the structural pattern matching PEP, if it actually gets accepted. I've been using this pattern to good effect in our recent rewrite of PyTorch's code generator. If you have the opportunity to work in a statically typed Python codebase, give this style of code a try!

10 Responses to “Idiomatic algebraic data types in Python with dataclasses and Union”

  1. dlax says:

    An alternative to explicit pattern matching is to use functools.singledispatch, e.g.:

    from functools import singledispatch

    @singledispatch
    def showResult(r: NoReturn) -> NoReturn:
        raise AssertionError("Unhandled type: {}".format(type(r).__name__))

    @showResult.register
    def show_ok(r: OK) -> str:
        return str(r.result)

    @showResult.register
    def show_failure(r: Failure) -> str:
        return "Failure: " + r.msg

  2. Dimi says:

    Except that you want Result to be polymorphic. If you define Ok for int, what will you do for all other types? Maybe some kind of magic with a TypeVar is possible? Moreover you cannot do isinstance(OK(5), Result), try it.

    Now about this sentence: “The result is just as good as an ADT (or better, perhaps; their structural nature bears more similarity to OCaml’s polymorphic variants).” You need to study.

  3. Dimi says:

    PS. Good tips though. I’ll use your way :)

  4. > Moreover you cannot do isinstance(OK(5), Result), try it.

    Yes, but in a fully statically typed program, you shouldn’t need to do so, since you should statically know from context that something is a Result, without having to refine it manually with an isinstance test. (So for example, don’t do something like `Union[Result, Result2]`, you want a tagged union for this case.)

    One thing that mypy-style type checking can’t do for you is type-driven metaprogramming, à la type classes, which is honestly pretty useful. But there are other ways (e.g., object-oriented programming) to get what you want in a language like Python.

  5. Franklin Chen says:

    Is there a reason you choose to use dataclass rather than NamedTuple?

  6. Yes: I typically don’t want positional access to work :)

  7. Andreas Abel says:

    A bit like you do in Java: use an abstract class for the data type and one subclass for each of its constructors.
    Alternative to case with `instanceof` is the so-called visitor pattern.

  8. Felipe Gusmao says:

    This would now work even better with the pattern matching that is coming to Python 3.10

  9. Jeroen says:

    > Moreover you cannot do isinstance(OK(5), Result), try it.

    In Python 3.10, you will be able to write the Union using a new syntax with ‘|’ which will also be supported by isinstance:

    Result = OK | Failure

    isinstance(OK(3), Result)
    => True

    See PEP 604 for more on this.

  10. David Froger says:

    Very interesting, assert_never is what I was looking for!

    What about serialization, let’s say in JSON? There is no “tag”
    to include, and including the class name does not seem clean (as
    the class name may be an implementation detail).

