ezyang's blog

the arc of software bends towards understanding

2026

Vibe Coding Design Study: tlparse

I recently received the following question about vibe-coding for tlparse, a structured log parser for torch.compile (slightly edited for ease of reading):

Hi Ed, I have some thoughts on vibe-coding tlparse and I’m curious what your opinions are. I think it’s fairly easy to vibe code and improve tlparse, but it’s hard to find people who know Rust or HTML or JavaScript well to properly review the vibe-coded stuff. The Rust PRs are not impossible to review, but the JavaScript ones are certainly hard… YOLO-landing PRs that we can’t review is certainly bad, I guess one option is just not landing any changes, and tell people to vibe-code themselves…?

Read more...

Replicate Forwards, Partial Backwards

A central thesis of sharding in types is that the backward sharding can be directly computed from the forward sharding. This is not true for DTensor today (e.g., as seen in sum to Partial), and it leads to confusion: users cannot easily predict the shardings of the tensors in their program. The question now arises: given a forward sharding, what should its backward sharding be? There are some easy cases to fill in:
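As a sketch of the easy cases (my own toy placement model, not the real DTensor API): Shard stays Shard, while Replicate and Partial dualize, since a replicated forward tensor accumulates gradient contributions from every device in backwards, and vice versa.

```python
# Hypothetical placement model (NOT DTensor's actual classes): a sketch of
# the "easy cases" for deriving a gradient's sharding from a forward sharding.
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:      # tensor is split along `dim` across devices
    dim: int

@dataclass(frozen=True)
class Replicate:  # every device holds the full tensor
    pass

@dataclass(frozen=True)
class Partial:    # devices hold addends; the true value is their sum
    pass

def backward_placement(fwd):
    """Easy cases: Shard stays Shard; Replicate and Partial dualize.
    A replicated forward tensor receives gradient contributions from
    every device, so its gradient is naturally Partial, and vice versa."""
    if isinstance(fwd, Shard):
        return Shard(fwd.dim)
    if isinstance(fwd, Replicate):
        return Partial()
    if isinstance(fwd, Partial):
        return Replicate()
    raise TypeError(fwd)

print(backward_placement(Replicate()))  # Partial()
```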

Read more...

DTensor erasure

DTensor has famously terrible eager mode performance; for example, this paper measured a 35-60% slowdown in end-to-end training performance comparing training with and without DTensor (with DTensor operations taking at least 7x longer than the underlying computation itself). While it is possible to alleviate some of this slowdown via optimizations (in the paper, veScale shows that fast bypass of sharding propagation, improved cache lookups and a C++ implementation can bring dispatch overhead down to 30us), this is still too high for some settings.

Read more...

The JAX sharding type system

Conventionally, a type system is something that classifies values into data types like float32 or int64. However, fancy type systems go beyond data types, allowing us to talk about potentially arbitrary invariants on data; for example, if we were to talk about the “type” of an array, it would cover not only its data type, but also its shape, e.g., f32[40, 20]. JAX’s type system of abstract values (avals) goes further than just data types and shapes and is equipped to reason about sharding-related invariants. However, this type system is poorly documented, especially recent additions like reduced/unreduced axes (circa June 2025). In this blog post, I want to give a consolidated description of the sharding-related aspects of JAX’s typing in explicit sharding mode, as of 2026. Disclaimer: I am not a JAX developer, and there may potentially be mistakes in this presentation; please let me know about errors on Twitter. I will assume that you have some knowledge about how to work with JAX sharding in the frontend; please refer to Distributed arrays and automatic parallelization, Explicit sharding and Manual parallelism with shard_map for a refresher on these topics.
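To make the notation concrete, here is a toy Python model (my own illustration, not JAX's internal representation) of an aval carrying a dtype, a shape, and a per-dimension sharding, printed in the f32[40@x, 20] style that JAX's explicit sharding mode uses:

```python
# Toy model of a JAX-style abstract value (aval): dtype + shape + sharding.
# Illustrative only; these are not JAX's actual internal classes.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Aval:
    dtype: str                       # e.g. "f32"
    shape: Tuple[int, ...]           # e.g. (40, 20)
    spec: Tuple[Optional[str], ...]  # mesh axis per dim, None = unsharded

    def __str__(self):
        dims = ", ".join(
            f"{n}@{ax}" if ax is not None else str(n)
            for n, ax in zip(self.shape, self.spec)
        )
        return f"{self.dtype}[{dims}]"

# An unsharded 40x20 float32 array:
print(Aval("f32", (40, 20), (None, None)))  # f32[40, 20]
# The same array with dim 0 sharded over mesh axis "x":
print(Aval("f32", (40, 20), ("x", None)))   # f32[40@x, 20]
```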

Read more...

Global vs Local SPMD

Global SPMD (also known as the “global view”, exposed by code using DTensor or jax.Array) refers to writing multi-device code as if it were on a single device, with an orthogonal mechanism for expressing how these full tensors are distributed over multiple devices (this mechanism can be implicit or explicit, e.g., as seen in this table).

Local SPMD (also known as “per-device view”, and exposed by local_map and shard_map, and also traditional PyTorch distributed code operating on plain Tensors, e.g., Megatron-style) refers to writing code from the “local” view on a single device, with explicit collectives when communicating across devices.
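A toy illustration of the two views (plain Python, hypothetical helper names, no framework involved): the same length-8 "tensor" summed once from the global view, and once shard-by-shard with an explicit stand-in for an all-reduce.

```python
# Global view: you write code against the full tensor; the system tracks
# how it is laid out across devices. Local view: each device sees only its
# own shard and communicates explicitly. Illustrative sketch only.

NUM_DEVICES = 4
global_tensor = list(range(8))  # the "full" tensor, global view

def shard(tensor, num_devices):
    """Split a global tensor evenly into per-device local shards."""
    n = len(tensor) // num_devices
    return [tensor[i * n:(i + 1) * n] for i in range(num_devices)]

local_shards = shard(global_tensor, NUM_DEVICES)

# Global SPMD: written once, against the full tensor.
global_sum = sum(global_tensor)

# Local SPMD: each device sums its shard, then an explicit "all-reduce"
# (here just a sum over devices) combines the per-device partial results.
partial_sums = [sum(s) for s in local_shards]  # per-device work
local_sum = sum(partial_sums)                  # stand-in for all_reduce

assert global_sum == local_sum == 28
```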

Read more...

Megatron via shard_map

In Computing sharding with einsum, we worked through an example of Megatron-style tensor parallelism, where we discovered that the ordinary backwards formula for linear results in a pending reduction on grad_input, even though the input was replicated and no communications happened in forwards. In Megatron, which is implemented with plain Tensors and manual collectives, you just have to know that this reduction is necessary and manually insert it with a custom autograd function.
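To see the pending reduction concretely, here is a small pure-Python check (toy shapes of my choosing, no framework involved): with the input replicated and the weight column-sharded across two "devices", each device's backward produces only a partial grad_input, and only their sum (the all-reduce Megatron inserts by hand) recovers the true gradient.

```python
# Pure-Python demo: input replicated, weight column-sharded over 2 devices.
# The backward for linear leaves grad_input as a pending sum over devices.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

inp = [[1.0, 2.0, 3.0]]              # replicated on both devices, [1, 3]
W = [[1.0, 0.0, 2.0],                # full weight, [4, 3] (out x in)
     [0.0, 1.0, 1.0],
     [2.0, 2.0, 0.0],
     [1.0, 1.0, 1.0]]
W0, W1 = W[:2], W[2:]                # column-parallel: split output rows
grad_out = [[1.0, 1.0, 1.0, 1.0]]    # upstream gradient, [1, 4]
g0, g1 = [grad_out[0][:2]], [grad_out[0][2:]]

# Reference: full backward, grad_input = grad_out @ W
grad_input = matmul(grad_out, W)

# Sharded: each device computes only a *partial* grad_input from its shard...
partial0 = matmul(g0, W0)
partial1 = matmul(g1, W1)
# ...and only their sum (an all-reduce) recovers the true gradient.
assert add(partial0, partial1) == grad_input
```

Note that the forward needed no communication here (the input was already replicated and each device produced its own output columns), which is exactly why the backward reduction is easy to forget.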

Read more...

Computing sharding with einsum

Mental arithmetic in grade school (e.g., memorizing your times tables) is typically justified on the grounds that facility in basic calculations makes it easier to focus on higher-level problems that require being able to do these manipulations. When working on DTensor, I have also found it important to be able to quickly calculate what shardings you get when you do matrix multiplies on sharded tensors. Without being able to do this quickly and accurately, working through examples becomes a slog. I’ve also found that while diagrammatic approaches (e.g., drawing a matrix and slicing it into shards) are intuitive, they are slow and unwieldy to do calculations with.

Read more...

Hugo Migration

This blog has lived on WordPress since it was initially created during a social challenge at MIT to write a blog post a week or pay up with beer. I remember a very important piece of advice I had been given at that time: don’t fuck around with your blog authoring software, just do the minimum viable thing (use WordPress) and focus on writing posts.

It’s 2026 now, the world is different, and in particular the existence of coding agents means that this particular advice falls flat now: it has never been easier to vibe code your own blog software and be done in an afternoon of token generation. Similarly, over the years, I had been increasingly unhappy about my WordPress setup (too hard to add images, ancient version of WordPress, Markdown has taken over the world why am I still writing in ReST, I love scripts.mit.edu but I definitely don’t want to use it to host serious things). So I typed this into ChatGPT and Claude and asked what I should migrate to.

Read more...

The gap between a Helpful Assistant and a Senior Engineer

Let’s suppose you asked an AI coding agent to “implement a CLI calculator”. Imagine if, instead of only writing a short Python script, it also started building an automated test suite, a crash reporting mechanism and a telemetry subsystem. You’d be like, “What the fuck is this?”

But now let’s say that you were planning to release this project to users. It would be clearly negligent to not have an automated test suite. A crash reporting mechanism might be overkill for a simple calculator, but for more complicated CLIs interacting with the real world, it may not always be feasible to have a reproducer, in which case crash logs are essential. Similarly, a telemetry subsystem would be wildly inappropriate for an open source local-only calculator, but it could make sense for a networked application or a corporate tool whose users have all consented. One of the important functions of a senior engineer is to be able to evaluate the context a software project lives in and figure out whether we need to do something, even if it isn’t explicitly asked for. This is in contrast to a helpful assistant, who is first and foremost obligated to follow the user’s instructions. This leads to a gap between a Helpful Assistant and a Senior Engineer.

Read more...