ezyang's blog

the arc of software bends towards understanding

PyTorch

State of torch.compile for training (August 2025)

The purpose of this post is to sum up, in one place, the state of torch.compile for training as of August 2025. Nothing in here is something you couldn’t already find elsewhere on the Internet, but we rarely put everything together in one place. The target audience for this document is teams who are evaluating the use of torch.compile for large scale training runs.

First, the basics. torch.compile (also known as PT2) is a compiler for PyTorch eager programs for both inference and training workloads. Speedups of 1.5-2x compared to eager code are typical, and torch.compile also makes it possible to do global optimizations for memory (e.g., automatic activation checkpointing) and distributed communications (e.g., async tensor parallelism).

Read more...

New Years resolutions for PyTorch in 2025

In my previous two posts “`Ways to use torch.compile <http://blog.ezyang.com/2024/11/ways-to-use-torch-compile/>`_” and “`Ways to use torch.export <http://blog.ezyang.com/2024/12/ways-to-use-torch-export/>`_”, I often said that PyTorch would be good for a use case, but there might be some downsides. Some of the downsides are foundational and difficult to remove. But some… just seem like a little something is missing from PyTorch. In this post, here are some things I hope we will end up shipping in 2025!

Read more...

Ways to use torch.export

Previously, I discussed the value proposition of torch.compile. While doing so, I observed a number of downsides (long compile time, complicated operational model, lack of packaging) that were intrinsic to torch.compile’s API contract, which emphasized being able to work on Python code as is, with minimal intervention from users. torch.export occupies a different spot in the tradeoff space: in exchange for more upfront work making a model exportable, it allows for use of PyTorch models in environments where using torch.compile as is would be impossible.

Read more...

Ways to use torch.compile

On the surface, the value proposition of torch.compile is simple: compile your PyTorch model and it runs X% faster. But after having spent a lot of time helping users from all walks of life use torch.compile, I have found that actually understanding how this value proposition applies to your situation can be quite subtle! In this post, I want to walk through the ways to use torch.compile, and within these use cases, what works and what doesn’t. By the way, some of these gaps are served either by export or by missing features we are actively working on; those will be covered in other posts!

Read more...

Tensor programming for databases, with first class dimensions

Tensor libraries like PyTorch and JAX have developed compact and accelerated APIs for manipulating n-dimensional arrays. N-dimensional arrays are kind of similar to tables in a database, which raises a natural question: could you set up a Tensor-like API to do queries on databases that would normally be done with SQL? We have two challenges:

  • Tensor computation is typically uniform and data-independent. But SQL relational queries are almost entirely about filtering and joining data in a data-dependent way.
  • JOINs in SQL can be thought of as performing outer joins, which is not a very common operation in tensor computation.

However, we have a secret weapon: first class dimensions were primarily designed as a new frontend syntax that made it easy to express einsum, batching and tensor indexing expressions. They might be good for SQL too.

Read more...

PyTorch Developer Podcast

I’m launching a new podcast, the PyTorch Developer Podcast. The idea is to be a place for the PyTorch dev team to cover bite-sized (10-20 min) topics on all sorts of internal development in PyTorch. For now, it’s just me monologuing for fifteen minutes about whatever topic I decide. The plan is to release an episode daily, five days a week, until I run out of things to say (probably not for a while, I have SO MANY THINGS TO SAY). I don’t edit the podcasts and do minimal planning, so they’re a bit easier to do than blog posts. Check it out! There are two episodes out already, one about how we do Python bindings for our C++ objects and another about the history and constraints of the dispatcher. If there are any topics you’d like me to cover, give a shout.

Read more...

The PyTorch open source process

PyTorch is a fairly large and active open source project, and sometimes we have people come to us and ask if there are any lessons from how we run PyTorch that they could apply to their own projects. This post is an attempt to describe some of the processes as of 2021 that help PyTorch operate effectively as an open source project. I won’t claim that everything we do is necessarily the best way to go about doing things, but at the very least, everything I describe here is working in practice.

Read more...

Let's talk about the PyTorch dispatcher

If this is your first time reading about PyTorch internals, you might want to check out my PyTorch internals post first. In this post, I want to talk about one particular part of PyTorch’s internals: the dispatcher. At first glance, the dispatcher is just a glorified if statement: based on some information about the tensor inputs, decide what piece of code should be called. So why should we care about the dispatcher?
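
To make the "glorified if statement" picture concrete, here is a toy sketch in plain Python. It is not PyTorch's actual dispatcher: ToyTensor, add_cpu, add_cuda, KERNELS and dispatch are all hypothetical names invented for illustration, and the only "information about the tensor inputs" it looks at is a device string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTensor:
    """Hypothetical stand-in for a tensor: a device tag plus some values."""
    device: str
    data: tuple


def add_cpu(a, b):
    # Pretend CPU kernel: elementwise add.
    return ToyTensor("cpu", tuple(x + y for x, y in zip(a.data, b.data)))


def add_cuda(a, b):
    # Pretend CUDA kernel: same math, standing in for a GPU launch.
    return ToyTensor("cuda", tuple(x + y for x, y in zip(a.data, b.data)))


def add_if(a, b):
    # The "glorified if statement" view: branch on tensor metadata.
    if a.device == "cpu":
        return add_cpu(a, b)
    elif a.device == "cuda":
        return add_cuda(a, b)
    raise NotImplementedError(f"add: no kernel for device {a.device!r}")


# The same decision as a lookup table keyed on (operator name, device).
# New kernels can be registered without editing a central if/elif chain.
KERNELS = {
    ("add", "cpu"): add_cpu,
    ("add", "cuda"): add_cuda,
}


def dispatch(op, *tensors):
    # Assumption of this sketch: the first argument's device picks the kernel.
    kernel = KERNELS.get((op, tensors[0].device))
    if kernel is None:
        raise NotImplementedError(
            f"{op}: no kernel for device {tensors[0].device!r}"
        )
    return kernel(*tensors)
```

Calling dispatch("add", x, y) with two "cpu" ToyTensors runs add_cpu, exactly as add_if would. The table version only hints at why a real dispatcher is more than an if statement: the set of keys and kernels is open-ended.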

Read more...

A brief taxonomy of PyTorch operators by shape behavior

I’ve recently been working on a revamp of how we specify tensor shape formulas in PyTorch. As part of this process, I classified every single operator in PyTorch by its shaping behavior; yes, that’s all 1364 of them (this includes each variant of an operator; e.g., inplace and out= keyword variants). During the process, I tried to come up with categories to help classify what operators did. One of the surprises from the process was discovering that shaping behaviors I previously thought were uncommon actually showed up a bit more often than one might have expected.

Read more...

vmap in Haskell

vmap is an interface popularized by JAX which offers you a vectorizing map. Semantically, a vmap is exactly equivalent to a map in Haskell; the key difference is that operations run under a vmap are vectorized. If you map a convolution and a matrix multiply, you will have one big loop which repeatedly calls convolution and matrix multiply for each entry in your batch. If you vmap a convolution and matrix multiply, you’ll call the batched versions of convolution and matrix multiply once. Unless you have a fuser, on most modern deep learning frameworks, calling the batched implementations of these operations will be much faster.
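
The difference between map and vmap can be sketched in plain Python (used here instead of Haskell or JAX to keep the example dependency-free). This is a toy under stated assumptions, not JAX's API: dot, batched_dot, BATCH_RULES and toy_vmap are hypothetical names, and the batched rule is registered by hand, whereas a real vmap derives it. A call counter shows the point: both versions produce the same values, but the mapped version calls the per-example operation once per batch entry, while the vmapped version calls the batched operation once.

```python
# Count how many times each implementation is invoked.
calls = {"dot": 0, "batched_dot": 0}


def dot(u, v):
    """Per-example operation: dot product of two vectors."""
    calls["dot"] += 1
    return sum(a * b for a, b in zip(u, v))


def batched_dot(us, vs):
    """Batched operation: handles the whole batch in a single call."""
    calls["batched_dot"] += 1
    return [sum(a * b for a, b in zip(u, v)) for u, v in zip(us, vs)]


# Hand-registered batching rules (an assumption of this sketch).
BATCH_RULES = {dot: batched_dot}


def toy_vmap(f):
    # Swap the per-example function for its batched counterpart.
    return BATCH_RULES[f]


us = [[1, 2], [3, 4], [5, 6]]
vs = [[1, 0], [0, 1], [1, 1]]

mapped = list(map(dot, us, vs))    # one call to dot per batch entry
vmapped = toy_vmap(dot)(us, vs)    # one call to batched_dot in total
```

In a real framework the single batched call is where the speedup comes from, since it can run one large kernel instead of many small ones.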

Read more...