February 3, 2026

I recently received the following question about vibe-coding for tlparse, a structured log parser for torch.compile (slightly edited for ease of reading):
Hi Ed, I have some thoughts on vibe-coding tlparse and I’m curious what your opinions are. I think it’s fairly easy to vibe code and improve tlparse, but it’s hard to find people who know Rust or HTML or JavaScript well to properly review the vibe-coded stuff. The Rust PRs are not impossible to review, but the JavaScript ones are certainly hard… YOLO-landing PRs that we can’t review is certainly bad, I guess one option is just not landing any changes, and tell people to vibe-code themselves…?
I wonder if you have any opinion on this? I saw one of your BE week proposals is for a more vibe-coding friendly tlparse. Do you think we should just not attempt to review or land any front-end features (which most likely we cannot review), and all-in on the “custom vibe-coded” frontend route?
Oh boy, do I have opinions!
When is it acceptable not to review code? I find this is the wrong question to ask, because code review is only a proxy for the underlying measure we actually care about: does it matter if this code is good? And to answer this question, we have to understand whether we are in a high stakes or low stakes situation.
Here are some things that suggest high stakes:
- It does destructive/irreversible actions
- It is used by many users (human or computer)
- It covers a large surface area
- It constitutes BC surface
- It handles money, private data, secrets, lives
- It runs automatically
- There are no automated tests / it is tested in prod
Here are some things that suggest low stakes:
- It is personal use only
- It is disposable or exploratory
- The output is easy to verify
- It is trivial to rollback to old versions (or better yet, you can run as many versions as you want at the same time)
- It comes with no warranty
- It is a starting point, rather than a complete product
- No side effects, no persistent data
Many problems won’t neatly fall into one bucket or another, but it’s still helpful to know what the high stakes aspects are, because you can spend more effort (e.g., doing code review) on those aspects and less on the low stakes things. Also, being aware of why something is high stakes can push you towards restructuring things so that a problem becomes lower stakes.
Let’s run this exercise for tlparse. tlparse is generally a low stakes project: it takes structured logs and generates HTML files for viewing them. You can (in principle) run newer or older versions of it, and as a program with no side effects it is basically as simple as these things get.
However, there are some things that are high stakes about it. It is used extremely widely internally at Meta; if it were broken, we would likely get a message within a day from someone who was relying on it to diagnose problems in training jobs (including production SEVs.) The way it is deployed internally is as a classic executable which is automatically rolled out as new versions are published; most users don’t know how to run an old version of it–so while in principle it can be rolled back trivially, in practice you would need to instruct users on how to do so. Finally, although it doesn’t do that much (just generate HTML from a structured log), there is a large variety of actual logs in production, which makes it difficult to comprehensively test. A memorable example from the past is someone adding syntax highlighting to the tool. I reverted this because it caused extremely long files to take a long time to parse, and it also made it more difficult to grep for specific lines in the generated folder.
How can you lower the stakes? In the case of tlparse, here are some ideas:
- Have a separation between prod (the stable stuff that is shown by default) and experimental (the wacky untested stuff that needs some dogfooding to see if it works)
- Make it easy to have multiple versions of tlparse; update broke something, just go back to your favorite version
- Don’t deploy tlparse via a single rollout mechanism; have it as a local app that people can run / vibe code on without a deployment step
- Improve testing to ensure features don’t break
- Keep it simple: don’t add lots of features, so the surface area of things to test stays small
Another trap is to think of code review as an all or nothing option: “if I haven’t done a careful line-by-line review of this code, I might as well just not review it at all.” There are lots of ways to evaluate LLM generated code, and some of these don’t involve reviewing the code at all.
First, let’s talk about evaluating LLM generated code in high stakes situations. The bar here is that your LLM generated code should be indistinguishable from the code you would have written yourself: the LLM just helped you type faster. This is a very high bar: imagine a mathematical proof written by someone other than you. Can you just read the proof and understand it? Typically not: the only way to a real, durable understanding is to work through the steps of the proof yourself. It is difficult, intellectually taxing work–and you cannot offload this to the LLM, because the LLM isn’t the owner of the code, you are the owner of the code! I find in a situation like this it is best to steer the LLM very closely during authoring: I should have a clear idea of what I want written, the LLM is just typing, and I am pausing and correcting it when it does things I don’t want. If you have the LLM run by itself for an hour, you’ll end up with code that is not yours, and you will have to spend the effort to make it yours. It is much easier, with both LLMs and regular colleagues, to own code if you’re involved at every step of its conception.
A lower stakes situation is when the exact details of the code don’t matter, but you’re still thinking architecturally about the problem. I am often in this mode when I’m doing exploration: I don’t care about the exact things the LLM is typing, but I do care that it roughly has the right shape. After you’ve used LLMs a bunch, you get a sense for what kinds of mistakes they do and don’t make. You can just skip reviewing all the things that you know LLMs generally won’t mess up, and just look for the big strokes. To do this, however, you need to have a good, high level understanding of how things should work. If there’s a big pile of JavaScript code and you know absolutely nothing about how the DOM works, this isn’t going to work! And if there is a way to do something without reams of JavaScript, you should absolutely steer the LLM toward that approach.
In a situation where you are vibe coding and not looking at the generated code at all, you still have a very important job. You need to actually QA the feature and see if it actually works. I find for my pure vibe coding projects, this is the most time consuming part: I can ask the LLM to do anything, but then actually checking if it works (and transmitting feedback to the LLM) takes up most of my time. You can ask the LLM to write tests, but then you have to check if the tests are actually testing the important thing (to be clear, reviewing LLM generated tests is very high leverage.) The golden city in the distance is the LLM being able to QA the tool for you, but for something like tlparse HTML, we are still in the very early days of this kind of sophisticated browser use. (Claude has come a long way here, though, and I expect it to get better.)
Let’s look at some concrete PRs:
- Add create symbols logs to compilation metrics artifact - I agree with the outcome (an approval) here. It is not super intrusive; the new indexes are a little messy, but this is unlikely to cause problems with other parts of the system.
- Add style to provenance tracking - Big JavaScript changes, likely all vibe coded. To get out of draft, there needs to be evidence of dogfooding to show it actually works. There will likely be bugs. It might be worth getting this into a lower stakes setting to iterate more quickly on real data.
It is perhaps unsatisfying that you have to evaluate things on a case-by-case basis. But hopefully this post gives you some framework to think about it.
February 3, 2026

A central thesis of sharding in types is that the backward sharding can be directly computed from the forward sharding. This is not true for DTensor today, e.g., as seen in sum to Partial, and it leads to confusion where users cannot easily predict what the shardings of the tensors in their program are. The question now arises: given a forward sharding, what should its backward sharding be? There are some easy cases to fill in:
- If forwards is sharded, backwards is sharded.
- If forwards is partial, backwards should be replicate (the easiest example to see this is local loss backwards: your local loss is partial over DP, but when you generate a ones tensor as grad_loss, this should clearly be a replicated tensor).
But if your forwards is replicate, what should your backwards be? One answer to this question, adopted by JAX, is that the backwards should also be replicate. From a theoretical standpoint, this semantics is easily motivated by the following degenerate example: if my forwards computation is replicated across all my nodes (e.g., everyone is performing the same compute on exactly the same tensor), then the gradient should clearly be replicated across all nodes (and this is the only choice that allows us to avoid communications entirely). However, this does lead to an irritating problem where if you have a forwards that is replicated, and want your backwards to be partial, you need to introduce an entirely new forwards sharding (JAX calls it “reduced”) to indicate this. JAX chose to preserve replicate-to-replicate for backwards compatibility reasons.
The purpose of this post is to argue that Replicate Forwards, Partial Backwards is a better default (concretely, in JAX, swap the default so that all axes are “reduced” by default, and you have to explicitly make some “not” reduced–not to be confused with unreduced!) To see this, let’s look carefully at the sharding of a gated MLP with DP and TP (without the w1-w3 concatenation) in JAX. Here is the end-to-end example, done with einsums for ease of differentiation, written as explicitly as possible (the reshards can be elided in JAX, but they play an important role for DTensor erasure):
import jax
import jax.numpy as jnp
from jax import lax

jax.config.update('jax_num_cpu_devices', 4)
jax.set_mesh(jax.make_mesh((2, 2), ('dp', 'tp')))

def mlp(x, w1, w3, w2):
    print(f"{jax.typeof(x)=}, {jax.typeof(w1)=}, {jax.typeof(w3)=}, {jax.typeof(w2)=}")
    # !!! ATTENTION !!!
    rx = jax.reshard(x, jax.P(None, 'dp', None, reduced={'tp'}))
    rw1 = jax.reshard(w1, jax.P(None, 'tp', reduced={'dp'}))
    rw3 = jax.reshard(w3, jax.P(None, 'tp', reduced={'dp'}))
    rw2 = jax.reshard(w2, jax.P('tp', None, reduced={'dp'}))
    print(f"{jax.typeof(rx)=}, {jax.typeof(rw1)=}, {jax.typeof(rw3)=}, {jax.typeof(rw2)=}")
    h1 = jnp.einsum("sbh,hi->sbi", rx, rw1)
    print(f"{jax.typeof(h1)=}")
    h3 = jnp.einsum("sbh,hi->sbi", rx, rw3)
    print(f"{jax.typeof(h3)=}")
    h = jnp.einsum("sbi,sbi->sbi", jax.nn.silu(h1), h3)
    print(f"{jax.typeof(h)=}")
    out = jnp.einsum("sbi,ih->sbh", h, rw2, out_sharding=jax.P(None, 'dp', None, unreduced={'tp'}))
    print(f"{jax.typeof(out)=}")
    return out

seq = 4
batch = 8
hidden = 16
intermediate = 32

x = jax.device_put(
    jnp.ones((seq, batch, hidden), dtype=jnp.float32),
    jax.P(None, 'dp', None)
)
w1 = jax.device_put(
    jnp.ones((hidden, intermediate), dtype=jnp.float32),
    jax.P(None, 'tp')
)
w3 = jax.device_put(
    jnp.ones((hidden, intermediate), dtype=jnp.float32),
    jax.P(None, 'tp')
)
w2 = jax.device_put(
    jnp.ones((intermediate, hidden), dtype=jnp.float32),
    jax.P('tp', None)
)

mlp(x, w1, w3, w2)
If you take the prints and annotate them inline in the program, it looks like this:
x: f32[seq, batch@dp, hidden]
w1: f32[hidden, intermediate@tp]
w3: f32[hidden, intermediate@tp]
w2: f32[intermediate@tp, hidden]
rx: f32[seq, batch@dp, hidden]{R:tp} = jax.reshard(x, jax.P(None, 'dp', None, reduced={'tp'}))
rw1: f32[hidden, intermediate@tp]{R:dp} = jax.reshard(w1, jax.P(None, 'tp', reduced={'dp'}))
rw3: f32[hidden, intermediate@tp]{R:dp} = jax.reshard(w3, jax.P(None, 'tp', reduced={'dp'}))
rw2: f32[intermediate@tp, hidden]{R:dp} = jax.reshard(w2, jax.P('tp', None, reduced={'dp'}))
h1: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbh,hi->sbi", x, rw1)
h3: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbh,hi->sbi", x, rw3)
h: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbi,sbi->sbi", jax.nn.silu(h1), h3)
out: f32[seq, batch@dp, hidden]{U:tp} = jnp.einsum("sbi,ih->sbh", h, rw2, out_sharding=jax.P(None, 'dp', None, unreduced={'tp'}))
Each reshard corresponds to a situation where we have a no-op in forwards, but an all-reduce in backwards. Why does JAX’s primal-cotangent rule for sharding in types imply this? I have three arguments.
The intuitive argument. In a DP+TP gated MLP, you expect to need to do a TP all-reduce on grad_input (because as you leave the sharded TP region you need to aggregate gradients from the TP shards), and you need to do the traditional DP all-reduces on all the parameters. In PyTorch, the DP all-reduces are typically handled by the DDP/FSDP wrapper outside of this code, but when we accept JAX semantics, grad_w1: f32[hidden, intermediate@tp] (it’s replicated over dp!), so we are obligated to ensure the all-reduce occurs before we exit this region.
The peephole argument. Let’s just look at one specific backwards and work it out by hand.
# Recall:
rw1: f32[hidden, intermediate@tp]{R:dp}
h1: f32[seq, batch@dp, intermediate@tp]
# Therefore: (reduced->unreduced)
grad_rw1: f32[hidden, intermediate@tp]{U:dp}
grad_h1: f32[seq, batch@dp, intermediate@tp]
# Recall:
h1: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbh,hi->sbi", x, rw1)
# Einsum backwards says:
grad_rw1: f32[hidden, intermediate@tp]{U:dp} = jnp.einsum("sbh,sbi->hi", x, grad_h1)
# Contraction is on replicated 's' and sharded 'b', so the result is unreduced on dp axis
The conversion to ‘reduced’ in forwards turns into an all-reduce to compute grad_w1: f32[hidden, intermediate@tp]. If we want to be extremely explicit about our code, we are obligated to convert w1 to “reduced”, so that its backward is “unreduced” as is implied by einsum backwards. By the way, inside of a shard_map region, a very similar thing occurs; as w1 is invariant in DP, but x is varying in DP, we must pcast w1 from invariant to varying to get correct gradients.
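In shard_map terms, that conversion is just a pcast of the weight. Here is a minimal sketch using the pcast spelling that appears later on this page; the names w1_local, x_local, and h1_local are hypothetical per-device shards, not from the original example:

# Inside a shard_map over ('dp', 'tp'): w1_local is invariant over dp, while x_local
# varies over dp, so we cast w1_local to varying before the contraction. This is a
# no-op in forwards, and it is exactly what forces the dp all-reduce on grad_w1
# in backwards.
rw1_local = jax.lax.pcast(w1_local, 'dp', to='varying')
h1_local = jnp.einsum("sbh,hi->sbi", x_local, rw1_local)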
The exhaustive argument. We can write out the full backwards (for brevity, I use g_ instead of grad_ for the gradients):
g_out: f32[seq, batch@dp, hidden]{R:tp}
# out: f32[seq, batch@dp, hidden]{U:tp} = jnp.einsum("sbi,ih->sbh", h, rw2, out_sharding=jax.P(None, 'dp', None, unreduced={'tp'}))
g_h: f32[seq, batch@dp, intermediate@tp] = einsum("sbh,ih->sbi", g_out, rw2)
g_rw2: f32[intermediate@tp, hidden]{U:dp} = einsum("sbi,sbh->ih", h, g_out, out_sharding=jax.P('tp', None, unreduced={'dp'}))
# h: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbi,sbi->sbi", jax.nn.silu(h1), h3)
g_silu_h1: f32[seq, batch@dp, intermediate@tp] = einsum("sbi,sbi->sbi", g_h, h3)
g_h3: f32[seq, batch@dp, intermediate@tp] = einsum("sbi,sbi->sbi", g_h, silu(h1))
g_h1: f32[seq, batch@dp, intermediate@tp] = silu_backward(g_silu_h1, h1)
# h3: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbh,hi->sbi", x, rw3)
g_rx_from_h3: f32[seq, batch@dp, hidden]{U:tp} = einsum("sbi,hi->sbh", g_h3, rw3, out_sharding=jax.P(None, 'dp', None, unreduced={'tp'}))
g_rw3: f32[hidden, intermediate@tp]{U:dp} = einsum("sbh,sbi->hi", rx, g_h3, out_sharding=jax.P(None, 'tp', unreduced={'dp'}))
# h1: f32[seq, batch@dp, intermediate@tp] = jnp.einsum("sbh,hi->sbi", x, rw1)
g_rx_from_h1: f32[seq, batch@dp, hidden]{U:tp} = einsum("sbi,hi->sbh", g_h1, rw1, out_sharding=jax.P(None, 'dp', None, unreduced={'tp'}))
g_rw1: f32[hidden, intermediate@tp]{U:dp} = einsum("sbh,sbi->hi", rx, g_h1, out_sharding=jax.P(None, 'tp', unreduced={'dp'}))
g_rx: f32[seq, batch@dp, hidden]{U:tp} = g_rx_from_h1 + g_rx_from_h3
g_w2: f32[intermediate@tp, hidden] = reshard(g_rw2, P('tp', None))
g_w1: f32[hidden, intermediate@tp] = reshard(g_rw1, P(None, 'tp'))
g_w3: f32[hidden, intermediate@tp] = reshard(g_rw3, P(None, 'tp'))
g_x: f32[seq, batch@dp, hidden] = reshard(g_rx, P(None, 'dp', None))
You can individually verify that each unreduced gradient is implied by the einsum in question.
The upshot. The real point of this example is to see that the “intuitive” sharding type for the arguments of mlp actually forces a lot of communications in backwards, because we must make the gradients replicated, and that implies all-reduces. This can actually result in a suboptimal communication pattern: when both TP and SP are being used, the unreduced grad_input can be delayed all the way to the forwards all-gather between the SP and TP region. In backwards, we can directly do a reduce-scatter rather than doing an all-reduce and then later throwing out most of the result in the all-gather backwards. (Arguably, this isn’t a huge deal if you have a compiler like XLA, since you would expect it to know how to optimize the comms here, but the whole point of sharding-in-types is to give more control over when comms occur.)
A better type signature for this function is mlp(rx, rw1, rw3, rw2), where all of these arguments are reduced (rx on tp, and rw{1,3,2} on dp). Now the reshards can be controlled by the user; you can do exactly the same communication pattern as our original implementation, or you can delay them until later. And the best way to encourage people to write their code this way is to have replicate forwards imply partial backwards. (P.S. It is still useful to have another variant of replicate which really does have a replicate backwards. I don’t have a good name for it, but it could occasionally be used to do an all-reduce early before fan-out would imply you have to do multiple all-reduces.)
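To make the suggestion concrete, here is a sketch of what that signature looks like, reusing the definitions from the example above: the reshards move out of mlp and into the caller, who can keep this exact communication pattern or delay the conversions to a better spot.

def mlp_reduced(rx, rw1, rw3, rw2):
    # All arguments are already reduced (rx over tp, rw{1,3,2} over dp);
    # the body is unchanged from the original mlp.
    h1 = jnp.einsum("sbh,hi->sbi", rx, rw1)
    h3 = jnp.einsum("sbh,hi->sbi", rx, rw3)
    h = jnp.einsum("sbi,sbi->sbi", jax.nn.silu(h1), h3)
    return jnp.einsum("sbi,ih->sbh", h, rw2,
                      out_sharding=jax.P(None, 'dp', None, unreduced={'tp'}))

# The caller now controls where (and whether) the reduced conversions happen.
rx = jax.reshard(x, jax.P(None, 'dp', None, reduced={'tp'}))
rw1 = jax.reshard(w1, jax.P(None, 'tp', reduced={'dp'}))
rw3 = jax.reshard(w3, jax.P(None, 'tp', reduced={'dp'}))
rw2 = jax.reshard(w2, jax.P('tp', None, reduced={'dp'}))
mlp_reduced(rx, rw1, rw3, rw2)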
Thanks Natalia Gimelshein, Tianyu Lu, and Ailing Zhang for detailed discussions that helped me reach this position. Thanks Twitter for giving this position a sanity check. Any mistakes are my own.
Patrick Toulme requested the HLO and Shardy MLIR for the JAX program. Here they are, generated from this script. The raw output is here. Below, I have posted annotated versions, courtesy of Claude.
Full pre-partition HLO. The most interesting thing is you can see the sharding constraints added which trigger reductions:
%16 = sdy.sharding_constraint %15 <@mesh, [{"tp"}, {}], unreduced={"dp"}> : tensor<32x16xf32>
%27 = sdy.sharding_constraint %26 <@mesh, [{}, {"tp"}], unreduced={"dp"}> : tensor<16x32xf32>
%29 = sdy.sharding_constraint %28 <@mesh, [{}, {"dp"}, {}], unreduced={"tp"}> : tensor<4x8x16xf32>
%33 = sdy.sharding_constraint %32 <@mesh, [{}, {"tp"}], unreduced={"dp"}> : tensor<16x32xf32>
%35 = sdy.sharding_constraint %34 <@mesh, [{}, {"dp"}, {}], unreduced={"tp"}> : tensor<4x8x16xf32>
%36 = stablehlo.add %29, %35 : tensor<4x8x16xf32>
...
# ========== REDUCE WEIGHT GRADIENTS (dp reduction) ==========
# Reduce g_w2 across dp dimension
%37 = sdy.sharding_constraint %16 <@mesh, [{"tp"}, {}]> : tensor<32x16xf32>
# Reduce g_w3 across dp dimension
%38 = sdy.sharding_constraint %27 <@mesh, [{}, {"tp"}]> : tensor<16x32xf32>
# Reduce g_w1 across dp dimension
%39 = sdy.sharding_constraint %33 <@mesh, [{}, {"tp"}]> : tensor<16x32xf32>
# ========== REDUCE g_x (tp reduction) ==========
# Reduce g_x across tp dimension
%40 = sdy.sharding_constraint %36 <@mesh, [{}, {"dp"}, {}]> : tensor<4x8x16xf32>
# ========== RETURN ==========
# Returns: (out, grad_x, grad_w1, grad_w3, grad_w2)
return %12, %40, %39, %38, %37 : tensor<4x8x16xf32>, tensor<4x8x16xf32>, tensor<16x32xf32>, tensor<16x32xf32>, tensor<32x16xf32>
Full post-partition HLO. The most notable thing to observe here is that the DP all-reduces have been bucketed together into a single all-reduce, which is what you’d expect any self-respecting DP implementation to do. Also interestingly, the TP reduction is actually done first before summing the gradients together on both paths; you could probably also sum the gradients together first before all-reducing.
# ============================================================================
# MAIN ENTRY POINT
# Returns: (out, grad_x, grad_w1, grad_w3, grad_w2)
# ============================================================================
ENTRY %main.11_spmd (param.5: f32[4,4,16], param.6: f32[16,16], param.7: f32[16,16], param.8: f32[16,16], param.9: f32[4,4,16]) -> (f32[4,4,16], f32[4,4,16], f32[16,16], f32[16,16], f32[16,16]) {
# ========== BACKWARD: All-reduce weight gradients across TP ==========
# Sum weight gradients across tensor parallel devices (replica_groups={{0,2},{1,3}})
# Returns: (g_w1, g_w3, g_w2)
%all-reduce.1 = (f32[16,16]{1,0}, f32[16,16]{1,0}, f32[16,16]{1,0}) all-reduce(%dot.27, %dot.28, %dot.29), channel_id=3, replica_groups={{0,2},{1,3}}, use_global_device_ids=true, to_apply=%region_2.5, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/transpose/jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/transpose" stack_frame_id=5}
# line 16 backward: reshape g_x_h3
%bitcast.8 = f32[4,4,16]{2,1,0} bitcast(%dot.18), metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general" stack_frame_id=8}
# Extract weight gradients from all-reduce tuple
# line 15 backward: g_w1 (reduced)
%get-tuple-element.4 = f32[16,16]{1,0} get-tuple-element(%all-reduce.1), index=0, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/transpose/jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/transpose" stack_frame_id=5}
# line 16 backward: g_w3 (reduced)
%get-tuple-element.6 = f32[16,16]{1,0} get-tuple-element(%all-reduce.1), index=1, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/transpose/jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/transpose" stack_frame_id=8}
# line 18 backward: g_w2 (reduced)
%get-tuple-element.8 = f32[16,16]{1,0} get-tuple-element(%all-reduce.1), index=2, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbi,ih->sbh))/transpose/jit(forward_and_backward)/transpose(jvp(sbi,ih->sbh))/transpose" stack_frame_id=11}
# ========== BACKWARD: All-reduce input gradients across TP ==========
# Sum input gradients across tensor parallel devices (replica_groups={{0,1},{2,3}})
# Returns: (g_x_h3, g_x_h1)
%all-reduce = (f32[4,4,16]{2,1,0}, f32[4,4,16]{2,1,0}) all-reduce(%bitcast.8, %bitcast.11), channel_id=1, replica_groups={{0,1},{2,3}}, use_global_device_ids=true, to_apply=%region_0.1, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general/jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general" stack_frame_id=8}
# Extract input gradients from all-reduce tuple
# line 16 backward: g_x_h3 (reduced across TP)
%get-tuple-element = f32[4,4,16]{2,1,0} get-tuple-element(%all-reduce), index=0, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general/jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general" stack_frame_id=8}
# line 15 backward: g_x_h1 (reduced across TP)
%get-tuple-element.2 = f32[4,4,16]{2,1,0} get-tuple-element(%all-reduce), index=1, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general/jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/dot_general" stack_frame_id=5}
# ========== BACKWARD: Sum input gradients from both paths ==========
# line 15+16 backward: g_x = g_x_h1 + g_x_h3
%wrapped_add = f32[4,4,16]{2,1,0} fusion(%get-tuple-element, %get-tuple-element.2), kind=kLoop, calls=%wrapped_add_computation, metadata={op_name="jit(forward_and_backward)/transpose(jvp(sbh,hi->sbi))/add_any" stack_frame_id=5}
# ========== FINAL OUTPUT ==========
# Returns: (out, grad_x, grad_w1, grad_w3, grad_w2)
ROOT %tuple.6 = (f32[4,4,16]{2,1,0}, f32[4,4,16]{2,1,0}, f32[16,16]{1,0}, f32[16,16]{1,0}, f32[16,16]{1,0}) tuple(%bitcast.4, %wrapped_add, %get-tuple-element.4, %get-tuple-element.6, %get-tuple-element.8)
}
February 1, 2026

DTensor has famously terrible eager mode performance; for example, this paper measured a 35-60% slowdown in end-to-end training performance when running with DTensor versus without it (with DTensor operations taking at least 7x longer than actually running the computation for real). While it is possible to alleviate some of this slowdown via optimizations (in the paper, veScale shows that fast bypass of sharding propagation, improved cache lookups and C++ code can take dispatch overhead down to 30us), this is still too high for some settings.
- You could eliminate this overhead by rewriting your code without DTensors at all, but this gives up the benefits of expressing your code in global SPMD form.
- You could eliminate the overhead by using a compiler or CUDA graphs, but this requires your code to be traceable.
Is there a way to have our cake (global SPMD) while eating it too (eager mode with no DTensor)? veScale proposed Static Eager mode as a way of eliminating DTensor at runtime, by observing that DTensor placements are largely static in a program, which means that you can just drop DTensor at runtime as long as you manually insert any communication that would have occurred if you had run with DTensor (veScale does extra work under the hood to add hooks for these communications). However, it is quite difficult for researchers to understand how to insert redistributes / gradient redistributes. In this blog post, I want to reimagine their system under the following constraint: what if you could just erase the DTensors, without having to add any hooks at all? Spoiler alert: JAX-style sharding in types without implicit conversions is enough to get the job done.
First, let’s describe the problem more carefully. Typically, a desirable property for type systems is that the types can be erased before execution without changing the runtime behavior of a program. In the case of DTensor, we want to erase all of the placements in a program, with the hope that we can still run everything we need to without them. Actually, most of the time in DTensor this will work, because the ideal situation for DTensor is that you just run the operation as-is on the local tensors. The problem is, of course, redistribute, which needs to know what the original placement of a DTensor is to issue the collectives to get it into the new desired placement. Even worse, to detect if an implicit redistribution would occur, you need to determine that the input placements are illegal and work out how to insert redistributes to make them legal.
The explicit redistribute problem is easy enough to fix: a user could specify a specific collective (which, with DTensors, would be checked for consistency against the actual input/output placements), or we could ask the user to specify the input placements so that we can still compute the sequence of collectives to get to the output placement. To avoid implicit redistributions in forwards, one can simply ensure you insert explicit redistributes anywhere they are needed; to avoid implicit redistributions in backwards, you need a type system that guarantees that the backwards collectives correspond precisely to forward collectives. You need two ideas from the JAX sharding type system: first, the backward gradient placement should always be computable from the forward primal placement, so you can always reason locally about whether comms are needed (as you always know the placement of grad_output.) Second, you need enough vocabulary (e.g., reduced/unreduced) to ensure that these forced backward gradient placements don’t lead to communication inefficiencies.
Once your user program has a DTensor erasure property, you now have code that can be run with either plain Tensor or DTensor. Running with DTensor is akin to running a (dynamic) type checker on the program: in explicit mode it will error if implicit redistributes occur, and it will also error if you messed up and claimed that something was in some placement when it was not. Running with Tensor, you just elide all of the shard checking and trust that the user program was written correctly. This works if the user program doesn’t branch; if you are worried about inexhaustive testing, you could run under both DTensor and torch.compile, where guards and Dynamo can help you identify whether you have actually exercised all potential inputs.
The resulting code you write is very similar to a Megatron-style training framework, but with the twist that you can check that you inserted all of your collectives by running with DTensors rather than Tensors. More generally, this is an interesting pattern for gradual compilation; your program can be entirely run in eager mode, and the compiler is relegated to a sideband static analysis tool (akin to an optional type checker), which still can be an essential part of the workflow for catching and debugging problems.
Sum should not cast to Partial. In classic PyTorch DTensor, a sum on a sharded dimension performs no communications and produces a Partial placement. This behavior cannot be supported under DTensor erasure. Let us reason through it:
input: Shard(0)
sum = input.sum(0)
sum: Partial()
grad_sum = torch.ones_like(sum)
grad_sum: Replicate() # if we suppose Partial() <-> Replicate()
grad_input = grad_sum.expand(input.shape)
grad_input: Replicate()
We see that the backwards formula for sum is a plain eager operation that expands grad_sum. This operation will produce a Replicate tensor. But the primal-cotangent sharding mapping specifies that grad_input should be a Shard(0) tensor; a (free) redistribute from Replicate to Shard(0) is required. In ordinary DTensor, we discover this redistribution happens later when there is a shard-replicate interaction; however, with DTensor erasure, there is no way to discover this, and we must ban this sum.
To resolve the problem, we simply need to distinguish between two distinct APIs for sum. Standard torch.sum should only ever do a local summation and error out if the reduction is across a sharded dimension. Cast to partial (aka lax.pcast(to='unreduced') in JAX) is a separate function that only works on sharded tensors. The distinct function can now be associated with a custom autograd function that triggers the re-sharding in backwards.
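As a rough illustration of the second API, here is a minimal local-SPMD sketch of such a custom autograd function. This is a hypothetical helper, not the actual DTensor implementation: the forward does only the local summation (leaving the result "partial", i.e., a pending reduction the caller must perform), and the backward produces the sharded gradient directly, so no implicit redistribute is ever needed.

import torch

class SumToPartial(torch.autograd.Function):
    # Hypothetical sketch: sum over a dimension that is sharded across ranks,
    # leaving the result partial/unreduced. No communication is issued here.
    @staticmethod
    def forward(ctx, local_x, dim):
        ctx.dim = dim
        ctx.local_len = local_x.shape[dim]
        return local_x.sum(dim)  # local contribution only

    @staticmethod
    def backward(ctx, grad_out):
        # The gradient of the global sum is a broadcast of grad_out to the global
        # shape; under sharding-in-types grad_input must be Shard(dim), so we only
        # materialize the local slice (the "free" Replicate -> Shard(dim) redistribute).
        shape = list(grad_out.shape)
        shape.insert(ctx.dim, ctx.local_len)
        return grad_out.unsqueeze(ctx.dim).expand(*shape), None

On each rank, out_local = SumToPartial.apply(x_local, 0) gives the local contribution; an explicit all-reduce (or the cast-to-partial bookkeeping in DTensor) then decides when the pending reduction actually happens.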
Broadcasts over shards require conversion to ‘reduced’. Above, I claimed that JAX’s explicit mode for global SPMD only issues collectives when explicitly asked to do so. But this isn’t completely true. When I do an operation between a sharded and replicated tensor (under JAX semantics), this can result in an implicit all-reduce in backwards:
import jax
import jax.numpy as jnp

jax.config.update('jax_num_cpu_devices', 4)
jax.set_mesh(jax.make_mesh((4,), ('dp',)))

batch = 8
hidden = 16
out_dim = 4

x = jax.device_put(
    jnp.ones((batch, hidden), dtype=jnp.float32),
    jax.P('dp', None)
)
w = jax.device_put(
    jnp.ones((hidden, out_dim), dtype=jnp.float32),
    jax.P(None, None)
)
grad_out = jax.device_put(
    jnp.ones((batch, out_dim), dtype=jnp.float32),
    jax.P('dp', None)
)

def forward(x, w):
    return jnp.einsum("bh,ho->bo", x, w)

def backward(x, w, grad_out):
    _, vjp_fn = jax.vjp(forward, x, w)
    return vjp_fn(grad_out)[1]  # gradient w.r.t. w

compiled = jax.jit(backward).lower(x, w, grad_out).compile()
print(compiled.as_text())
This prints:
%add.clone (x.3: f32[], y.1: f32[]) -> f32[] {
%x.3 = f32[] parameter(0)
%y.1 = f32[] parameter(1)
ROOT %add.1 = f32[] add(%x.3, %y.1)
}
%fused_computation (param_0.1: f32[4,16]) -> f32[16,4] {
%param_0.1 = f32[4,16]{1,0} parameter(0)
%transpose.7 = f32[16,4]{0,1} transpose(%param_0.1), dimensions={1,0}, metadata={op_name="jit(backward)/transpose(jvp(bh,ho->bo))/transpose" stack_frame_id=3}
ROOT %copy.1 = f32[16,4]{1,0} copy(%transpose.7), metadata={op_name="jit(backward)/transpose(jvp(bh,ho->bo))/transpose" stack_frame_id=3}
}
ENTRY %main.0_spmd (param.1: f32[2,16], param: f32[2,4]) -> f32[16,4] {
%param = f32[2,4]{1,0} parameter(1), sharding={devices=[4,1]<=[4]}, metadata={op_name="grad_out"}
%param.1 = f32[2,16]{1,0} parameter(0), sharding={devices=[4,1]<=[4]}, metadata={op_name="x"}
%dot = f32[4,16]{1,0} dot(%param, %param.1), lhs_contracting_dims={0}, rhs_contracting_dims={0}, metadata={op_name="jit(backward)/transpose(jvp(bh,ho->bo))/dot_general" stack_frame_id=3}
%all-reduce = f32[4,16]{1,0} all-reduce(%dot), channel_id=1, replica_groups=[1,4]<=[4], use_global_device_ids=true, to_apply=%add.clone, metadata={op_name="jit(backward)/transpose(jvp(bh,ho->bo))/dot_general" stack_frame_id=3}
ROOT %transpose_copy_fusion = f32[16,4]{1,0} fusion(%all-reduce), kind=kLoop, calls=%fused_computation, metadata={op_name="jit(backward)/transpose(jvp(bh,ho->bo))/transpose" stack_frame_id=3}
}
We can see the original program has no collectives, but the HLO has an all-reduce. Actually, this is just the classic all-reduce you have to do to the gradients of all parameters when doing DP, so this isn’t exactly surprising. This is bad for DTensor erasure, though, because the way JAX knows to insert the collective here is by noticing that the backwards of the linear produces an unreduced result, but sharding-in-types demands that the gradient be replicated.
Now, we could fix the particular case of DP by arguing that DP sharding is special and it’s the responsibility of the DP framework to know that a reduction is necessary (this is how torchtitan on DTensor classically operates: we don’t represent DP directly in the DTensor, and FSDP is responsible for actually doing the all-reduces and grad scaling.) A more theoretically sound solution, however, is to simply say that einsum should forbid broadcasting a replicated tensor with a sharded tensor (even though in forwards this can be done without any communication); instead, in this case, you must have a reduced tensor on the sharded mesh axis, so that the gradient is an unreduced tensor on that mesh axis. A conversion from replicate to reduced will trigger the all-reduce in backwards.
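Here is a minimal sketch of what that looks like for the DP example above (same mesh and shapes; the reduced= keyword is the one used throughout this page): reshard w before the contraction, which is a no-op in forwards but makes the backward all-reduce correspond to an explicit conversion.

def forward_explicit(x, w):
    # Convert the replicated weight to 'reduced' over dp: a no-op in forwards, but its
    # cotangent is now unreduced over dp, so the all-reduce in backwards is implied by
    # this explicit conversion rather than inserted silently.
    rw = jax.reshard(w, jax.P(None, None, reduced={'dp'}))
    return jnp.einsum("bh,ho->bo", x, rw)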
We can do an analysis on einsum to see that the broadcast situation is the only situation where these extra reductions can occur. Recall from Computing sharding with einsum that when we compute the gradient for an input, we interchange the indices for that input with the indices for the output in the einsum formula. We can do a case-by-case analysis for every valid input sharding to see what communication happens in the backwards:
- Shard("batch"), Shard("batch") -> Shard("batch"): interchanging an input with the output still results in a batch pass-through, no comms.
- Shard("contract"), Shard("contract") -> Partial(): interchanging an input with the output results in Shard("contract"), Replicate() -> Shard("contract") (recall that the cotangent type for unreduced is reduced, which is replicate-like), no comms.
- Shard("broadcast"), Replicate() -> Shard("broadcast"): there are two cases here:
  - Shard("broadcast"), Replicate() -> Shard("broadcast") (grad of the first input): this is the same as forwards, no comms.
  - Shard("broadcast"), Shard("broadcast") -> Partial() (grad of the second input): contraction over a sharded dimension produces partial, yes comms.
So this really is the only situation in einsum that has this problem. (Pointwise ops that broadcast would also have this problem.)
January 28, 2026

Conventionally, a type system is something that classifies values into data types like float32 or int64. However, fancy type systems go beyond data types, allowing us to talk about potentially arbitrary invariants on data; for example, if we were to talk about the “type” of an array, it would cover not only its data type, but also its shape, e.g., f32[40, 20]. JAX’s type system of abstract values (avals) goes further than just data types and shapes and is equipped to reason about sharding related invariants. However, this type system is poorly documented, especially recent additions like reduced/unreduced axes (circa June 2025). In this blog post, I want to give a consolidated description of the sharding related aspects of JAX’s typing in explicit sharding mode, as of 2026. Disclaimer: I am not a JAX developer, and there may potentially be mistakes in this presentation; please let me know about errors on Twitter. I will assume that you have some knowledge about how to work with JAX sharding in the frontend; please refer to Distributed arrays and automatic parallelization, Explicit sharding and Manual parallelism with shard_map for a refresher on these topics.
Warning: All of this discussion is about Explicit sharding. The avals do not store accurate sharding information in the (default) auto mode. For example:
import jax
import jax.numpy as jnp
from jax.sharding import AxisType
jax.config.update('jax_num_cpu_devices', 2)
mesh1 = jax.make_mesh((2,), ('i',), axis_types=(AxisType.Auto,))
jax.set_mesh(mesh1)
array1 = jax.device_put(jnp.ones((8,)), jax.P("i"))
print(array1.aval.sharding)
mesh2 = jax.make_mesh((2,), ('i',), axis_types=(AxisType.Explicit,))
jax.set_mesh(mesh2)
array2 = jax.device_put(jnp.ones((8,)), jax.P("i"))
print(array2.aval.sharding)
will print:
NamedSharding(mesh=AbstractMesh('i': 2, axis_types=(Auto,), device_kind=cpu, num_cores=None), spec=PartitionSpec(None,))
NamedSharding(mesh=AbstractMesh('i': 2, axis_types=(Explicit,), device_kind=cpu, num_cores=None), spec=PartitionSpec('i',))
The new jax.make_mesh API defaults to Explicit, but the direct jax.sharding.Mesh constructor defaults to Auto. Beware!
The name of a type in JAX is AbstractValue. The important concrete subclass of this type is ShapedArray, whose type I have simplified below:
class ShapedArray(AbstractValue):
    shape: tuple[int, ...]
    dtype: dtype
    weak_type: bool  # when True, don't force type promotion
    sharding: NamedSharding
    vma: frozenset[MeshAxisName]
    memory_space: MemorySpace  # see https://github.com/jax-ml/jax/pull/30556
You can see ordinary things like shape and dtype, as well as some weirder things. In this post, we are going to focus on sharding and vma. Here are simplified definitions for NamedSharding and PartitionSpec:
MeshAxisName = str

class NamedSharding:
    mesh: Mesh
    spec: PartitionSpec

class PartitionSpec(tuple[None | MeshAxisName | tuple[MeshAxisName, ...], ...]):
    unreduced: frozenset[MeshAxisName]
    reduced: frozenset[MeshAxisName]
Let us now describe these types in more detail. For each, we will ask:
- What does it mean?
- Is the type applicable for global SPMD or local SPMD (or both?)
- How does the type propagate across operations?
- How is the type transformed by autograd (what is the mapping from primal to cotangent type?)
PartitionSpec is the most user visible concept. In this section, we’ll ignore unreduced/reduced for now. Without those fields, it is simply a tuple with one entry per dimension of the array (in PyTorch I often refer to this as tensor-oriented sharding, as opposed to mesh-oriented sharding). For each array dimension, you specify which mesh axes shard it. There can be zero (None), one ("i") or many (("i", "j")) mesh axes sharding a dimension; sharding is applied from left to right. You recover the global view (i.e., the one whose shape is described in ShapedArray) by stacking the arrays distributed across those mesh axes. PartitionSpec is commonly abbreviated as just P in JAX code. When you print the type of an array in explicit mode, JAX will inline the partition spec into the shape: e.g., float32[8,16@tp] implies a PartitionSpec of P(None, "tp"). There’s a much longer description with pretty pictures at Distributed arrays and automatic parallelization.
PartitionSpecs propagate according to shard propagation rules, which must be defined on a per-operator basis. If you want to perform an operation without performing communication, it must be the case that running that operation locally on the sharded tensors and then stacking the results (per the output sharding) would be the same as stacking the arrays first (per the input sharding) and then running the operation globally. The output sharding is not always the same as the input sharding, and the shard propagation rule is also responsible for computing this output sharding.
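Here is a small sketch of these rules in practice (explicit mode, two devices on a mesh axis 'i'; the expected outputs in the comments are my reading of the rules above, not captured output):

import jax
import jax.numpy as jnp

jax.config.update('jax_num_cpu_devices', 2)
jax.set_mesh(jax.make_mesh((2,), ('i',)))

x = jax.device_put(jnp.ones((8, 4)), jax.P('i', None))   # sharded on dim 0
w = jax.device_put(jnp.ones((4, 4)), jax.P(None, None))  # replicated

print(jax.typeof(x * 2))   # elementwise op: sharding passes through, still 8@i on dim 0
print(jax.typeof(x @ w))   # contraction over a replicated dim: output stays sharded, no comms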
How does PartitionSpec interact with autograd? The PartitionSpec of primals and cotangents matches: a value that is replicated in forwards, will also be replicated in backwards; similarly, if it is sharded in forwards, it will be sharded in backwards. (Reduced/unreduced will be an exception to this rule, discussed below.)
I want to take a moment to discuss a subtlety of PartitionSpec with respect to shard_map. As I discussed in Global vs Local SPMD, inside of a shard_map region, it’s not really well defined what the global shape of an array is: you only have to specify a PartitionSpec on the inputs and outputs. By default, inside of a shard_map, the shape is the local shape of an array, the mesh says your axis is Manual, and the PartitionSpec says there is no sharding on the array anymore (perhaps confusingly, since None here in the global SPMD view would imply it’s replicated, but that’s not at all the meaning here).
As a small example:
import jax
import jax.numpy as jnp

jax.config.update('jax_num_cpu_devices', 2)
jax.set_mesh(jax.make_mesh((2,), ('i',)))

x = jax.device_put(jnp.ones((8,)), jax.P("i"))

@jax.shard_map(out_specs=jax.P("i"))
def f(x_local):
    print(jax.typeof(x_local))
    print(jax.typeof(x_local).sharding)
    return x_local

f(x)
prints:
float32[4]{V:i}
NamedSharding(mesh=AbstractMesh('i': 2, axis_types=(Manual,), device_kind=cpu, num_cores=None), spec=PartitionSpec(None,))
The PartitionSpec is along for the ride, but it no longer contains sharding information for the manual axes (as you would expect.) (Wondering about the {V:i}? See the next section.) Another subtlety is that JAX allows for mixing Manual mode with not-Manual mode, via axis_names. So you can actually potentially see nontrivial PartitionSpec inside of a shard_map region! The footnote has a worked example.
VMA is short for “varying manual axes”. The motivation for this feature is described in Efficient transposition of replication-inducing collectives and the feature has been around long enough that the shard_map docs have a section about it.
Unlike PartitionSpec, VMA is a shard_map only concept; as its name suggests, it only applies to Manual axes. VMA tracks whether values are varying or invariant across a given mesh dimension. Actually, we can think of VMA as an approximation of PartitionSpec. PartitionSpec tells us exactly how to stack the local arrays to form a global array when they vary across a mesh dimension; if a mesh axis does not appear in the PartitionSpec (the relevant array dims are None), it instead says the array is invariant across that mesh dimension (no stacking needed, everything is the same!) With VMA, we don’t know how to stack the arrays (because the whole point of local SPMD is that the global view is undefined), but we do know if they are the same or different across a mesh dimension. You don’t need to track VMA for non-Manual axes, because PartitionSpec subsumes it. In the print of jax.typeof, the {V:i} indicates all of the varying mesh axes (V is for varying). When all axes are invariant, we just elide this from the print.
Because VMA is an approximation of PartitionSpec, its propagation rules are simpler as well. For non-communication ops, it is simply required that the VMA of all inputs match exactly. If one input is invariant while another is varying, JAX will insert an implicit conversion from invariant to varying to make the VMA match. In Megatron via shard_map, I have a worked example of this, where VMA is how JAX triggers an all-reduce on gradients from the TP region into a replicated grad_input.
How does VMA interact with autograd? The VMA of primals and cotangents matches: a value that is varying in forwards, will also be varying in backwards; similarly, if it is invariant in forwards, it will be invariant in backwards. If you are skeptical that forcing this constraint is an efficient thing to do, you would be right! (See reduced/unreduced below.)
Collective operations are more complex regarding VMA, because we must reason about what happens to the variance of the local tensors before and after the collective. At the bottom of this section there is a table of how all the collectives affect variance. Beyond the standard collectives, VMA also introduces the necessity to represent a conversion from invariant to varying (which is a no-op in forwards and an all-reduce in backwards.) As a type system nerd, I think this is a very cool use of type systems to rule out illegal programs (that forget to do required all-reduces in backwards!)
One side note: VMA is not actually required for correctness, and you can disable this type system with check_vma=False. It was introduced as a way to make it possible to write more efficient programs that were impossible to write without it. Without VMA, JAX conservatively assumes that all axes are potentially varying, and it will insert extra collectives to ensure you get the correct result. By actually modeling VMA, we can tell when something is invariant and potentially skip this collective.
The unreduced and reduced fields in PartitionSpec are quite new and are not documented in the public JAX documentation. However, we think of them as quite important when doing work with explicit sharding (for example, Wanchao, one of the original authors of DTensor, has told me the need to represent Partial placements is one of the big reasons why DTensor has mesh-oriented placements rather than tensor-oriented placements.) Unlike PartitionSpec/VMA, these fields apply for both Explicit and Manual axes.
It is easiest to first describe what unreduced means. Unreduced means there is a pending reduction (summation) on a device mesh axis that is necessary to get the global view of the array. Outside of shard_map, the most common way to generate an unreduced array in JAX is to use jnp.einsum with out_sharding that specifies unreduced. For example:
import jax
import jax.numpy as jnp
jax.config.update('jax_num_cpu_devices', 2)
jax.set_mesh(jax.make_mesh((2,), ('i',)))
x = jax.device_put(jnp.ones((4, 8,)), jax.P(None, "i"))
y = jax.device_put(jnp.ones((4, 8,)), jax.P(None, "i"))
u = jnp.einsum('bx,bx->b', x, y, out_sharding=jax.P(unreduced={'i'}))
print("u", jax.typeof(u))
prints:
float32[4]{U:i}
As I described in Computing sharding with einsum, when you do a contraction on two dimensions that are sharded on the same mesh axis, this can be done locally as long as you remember that there is a pending reduction (aka, partial/unreduced) that you need to do across that mesh axis to get the final result. Prior to the existence of unreduced in JAX, it wasn’t possible to express that you wanted no communication: the output sharding could only express replicated/sharded states, so you were going to end up doing an all-reduce or reduce-scatter.
You can also have unreduced axes in shard_map. For example, you can cast a varying array into an unreduced array, and then trigger the reduction. (Warning: unreduced doesn’t work with shard_map without jax.jit, see issue #34684)
import jax
import jax.numpy as jnp
from jax import lax

jax.config.update('jax_num_cpu_devices', 2)
jax.set_mesh(jax.make_mesh((2,), ('i',)))

x = jax.device_put(jnp.ones((8,)), jax.P("i"))

@jax.jit
@jax.shard_map(out_specs=jax.P(None))
def f(x_local):
    u = lax.pcast(x_local, to='unreduced', axis_name='i')
    print(jax.typeof(u))
    return lax.psum(u, axis_name='i')

f(x)
prints:
float32[4]{U:i}
How does unreduced propagate? Alas, this once again is something you have to define on a per operator basis, like in sharding propagation rules. One rule of thumb is that linear functions can always propagate unreduced, since linearity means f(x + y) == f(x) + f(y), but at time of writing JAX just writes all the unreduced rules out by hand.
How does unreduced interact with VMA and PartitionSpec? It is mutually exclusive from varying/sharded. If an array is unreduced on a mesh axis, it cannot also be varying or sharded on that mesh axis. (Technically, one might argue that if something is unreduced on a mesh axis, it is obviously varying on that axis, but varying and unreduced need to be treated differently for AD purposes, so it’s best to keep these distinct.)
Inside of shard_map, there are a few functions for working with unreduced:
- lax.pcast will let you directly convert a varying axis into an unreduced axis, declaring your intent to do a reduction later. However, you can’t do the “no-op” pcast from unreduced to varying, because this is not well-defined from a global semantics perspective: in general it is not meaningful to define a function as “take the input, decompose it into x + y, and then run an arbitrary function on x and y individually.”
- The existing reduction collectives like lax.psum and lax.psum_scatter will accept axes that are varying as well as unreduced, and do the obvious thing.
So what is reduced? As described in the original PR to make unreduced + AD work, reduced is like replicate, but it causes the cotangent (gradient) to be unreduced (and vice versa). Remember that in sharding with types, the cotangent sharding is always a function of the primal sharding. When you have a replicated primal, it is ambiguous whether you want the cotangent to be replicated or unreduced, so JAX introduces a new type (reduced) to let you distinguish the two cases just by looking at the primal type. The rule in JAX is replicate goes to replicate, and reduced goes to unreduced (and vice versa). Like unreduced, reduced is tracked both in Explicit and Manual mode.
How can you interact with reduced shardings? Unlike unreduced, you can directly device_put some data as reduced (since reduced is the same as replicated), or jax.reshard a tensor to a reduced placement. A transition from invariant/replicated to reduced is a no-op in forwards but triggers an all-reduce in backwards; similarly, you can pcast reduced to varying, which is a no-op for both forwards and backwards (invariant to varying forces an all-reduce immediately, since invariant’s cotangent is invariant, but reduced can delay the all-reduce since its cotangent type is unreduced). Separately, inside of a shard_map, lax.all_gather can be instructed to directly go to reduced (which will result in a reduce-scatter in backwards.)
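A small sketch of the transitions described above (two devices on a mesh axis 'i', as elsewhere on this page; a sketch under those assumptions, not captured output):

# device_put directly as reduced: the data is stored as if replicated, but its
# cotangent will be unreduced over 'i'.
w = jax.device_put(jnp.ones((4, 4)), jax.P(None, None, reduced={'i'}))

# reshard from replicated to reduced: a no-op in forwards, an all-reduce in backwards.
r = jax.device_put(jnp.ones((4, 4)), jax.P(None, None))
rw = jax.reshard(r, jax.P(None, None, reduced={'i'}))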
How does reduced propagate? The propagation rule is simple: reduced is a statement about a set of mesh axes (not array dims), so we simply pass through the set of reduced axes whenever we do an operation. If an operation is N-ary, it’s required that all inputs are reduced on the same mesh axes. Unlike replicate, inputs cannot be replicated or sharded on a reduced mesh axis, and JAX will force you to add conversions to make it typecheck. (It’s actually not clear to me that there isn’t a good default choice for implicit conversions here, but it’s certainly a lot safer to not allow it to start!)
January 27, 2026

Global SPMD (also known as the “global view”, exposed by code using DTensor or jax.Array) refers to writing multi-device code as if it was on a single device, with an orthogonal mechanism for expressing how these full tensors are distributed over multiple devices (this mechanism can be implicit or explicit, e.g., as seen in this table).
Local SPMD (also known as “per-device view”, and exposed by local_map and shard_map, and also traditional PyTorch distributed code operating on plain Tensors, e.g., Megatron-style) refers to writing code from the “local” view on a single device, with explicit collectives when communicating across devices.
The big question I want to address in this post is, how do I pick between these two modes? Conventional wisdom in the JAX ecosystem is that you should default to global SPMD (either via auto or explicit sharding mode), and drop down to manual local SPMD if the compiler isn’t giving you the correct communications and you want to do it by hand. I want to give a more nuanced version of this take.
First, there is nothing about global SPMD that precludes fine-grained control over when collectives happen. JAX doesn’t directly support this kind of mode, but it’s not difficult to imagine how it would work: take JAX with explicit mesh axes, but instead of only erroring out on ambiguous communication, error out when any implicit communication happens (e.g., you must explicitly call reshard to trigger communication). We actually added an explicit mode to DTensor along these lines, although it currently doesn’t work super well because we lack some other important aspects of JAX sharding in types.
For me, the more important difference is that global and local SPMD are actually different semantics. An obvious divergence is that in local SPMD, there isn’t any source of truth about what the “global” view of the Tensor is: the local tensor just exists, you know that there are different versions of it on the other nodes. You don’t know how you’re supposed to stack these tensors together to get a global tensor: you typically only know this at the boundary of the local region using out_specs / out_placements. And even if you knew how to stack the tensors together, local SPMD has different semantics than global SPMD, as the exact computation you perform depends on how exactly the local tensor is sharded. You’re not doing an operation on the global tensor: you’re chunking the tensor, running the operation on each chunk, and then stacking it back together. The whole point of sharding propagation in global SPMD is to figure out if this is equivalent to running the operation on the full tensor, and there are many cases when it is not.
If you are not thinking carefully about your distributed computation, local SPMD can be a source of bugs. It is common to write distributed code where certain parallelisms are enabled or disabled. If you do a reduction over an axis, when that axis is replicated the result is replicated, but when it is sharded you will end up with a partial reduction that has to be accounted for in some other way. If you forget, the code will work when the parallelism is turned off and silently break when the parallelism is turned on. A bug like this is horrible enough that frameworks invest in ways to deal with this situation.
This is perhaps the reason why Megatron is sometimes considered unfriendly for experimentation. Everything is written in local SPMD (as it doesn’t use DTensor), and if you want to experiment on something new you must upfront resolve all of the interactions with parallelism in the implementation of your code. This is all doable, but it can be pretty confusing if you are not a parallelism expert and easy to get wrong.
There is a flip side to this, however: if you are thinking carefully about your parallelism and are carefully orchestrating your local compute with your communications, it is much more natural to write things in local SPMD style. The local SPMD style only gives you operations that can be efficiently computed (since they are always local) and doesn’t force you to say what the global interpretation of a Tensor is (especially when it’s unnatural, like online softmax.) So once you get out of the experimentation phase and are working on efficiency, if you need some nontrivial communication pattern, it would be pretty normal to switch from global SPMD to local SPMD. But there’s also a lot of pedestrian modules that don’t need anything fancy, and it is better to keep them in global SPMD in that case.
In the PyTorch ecosystem, there are some more low level reasons why you might prefer local SPMD over global SPMD. The most prominent is DTensor’s eager overhead. Many parts of DTensor are implemented in Python rather than C++, and on the first invocation we must compute shard propagation rules, which is entirely in Python and quite expensive. It is possible to get reasonable performance with DTensor: if you torch.compile you can eliminate the overhead entirely, CUDA graphs also work, and FSDP2 shows that careful, minimal use of DTensor can still have acceptable CPU overhead. But this is perhaps one of the big reasons why distributed code with plain Tensors remains quite popular today.
January 26, 2026

In Computing sharding with einsum, we worked an example of Megatron style tensor parallelism where we discover that the ordinary backwards formula for linear results in a pending reduction on grad_input, even though the input was replicated and no communications happened in forwards. In Megatron, which is implemented with plain Tensors and manual collectives, you just have to know that this reduction is necessary and manually insert it with a custom autograd function.
If we wanted to write a similar explicit-collective-style Megatron implementation in JAX, we would use shard_map. Like in Megatron, you have to call a function which is a no-op in forwards and an all-reduce in backwards. However, JAX has built this function into its core library, and has an interesting name for it: jax.lax.pcast(..., to='varying') (previously known as pvary, although apparently this is deprecated now.) Why is this a “cast”? The answer is that JAX’s shard_map actually comes with an optional type system (enabled with check_vma=True) which rejects your program if you forget to insert an all-reduce in your backwards!
Let’s see how this works. As a reminder, our shapes are:
input: [sequence, batch, in_features]
weight: [in_features, out_features]
output: [sequence, batch, out_features]
We can describe the input and output sharding of these in JAX style (tensor-dim oriented), which is what we would feed into the in_specs and out_specs of shard_map:
input: P(None, None, None)
weight: P(None, "tp")
output: P(None, None, "tp")
Although on the boundaries of shard_map we can say exactly what the sharding of the input/output tensors are (e.g., how to reassemble them back into full tensors), on the inside of a shard_map this is not a well-defined question: you only have the local tensors and can do whatever you want with them before you reassemble them back into global tensors with shardings.
However, when check_vma=True, JAX will still keep track of something weaker than sharding: whether or not the tensors are varying (i.e., different) across a mesh dimension. This is conventionally notated as dtype[local_shape]{varying axes}, e.g., f32[3]{i} means that this tensor varies across the mesh axis i (the braces are omitted when nothing varies). Let’s write down the local shapes of our input/output tensors with variance information (note that the sharded tensor dimensions have shapes that are divided by the mesh they are sharded over):
input: f32[sequence, batch, in_features] # nothing varying
weight: f32[in_features, out_features/tp]@{tp} # varies over tp dim
output: f32[sequence, batch, out_features/tp]@{tp} # varies over tp dim
The point of a type system is that you have typing rules that say whether or not an operation is legal between two types, rejecting programs that are ill-typed. In particular, in jaxpr it’s illegal to have an operation like matrix multiply between two tensors with differing VMA: you have to insert a cast to make the VMAs match before you can do the operation. Actually, JAX will typically insert these casts implicitly for you, but for clarity we’re going to insert the cast explicitly here:
import jax
import jax.numpy as jnp
jax.config.update('jax_num_cpu_devices', 2)
jax.set_mesh(jax.make_mesh((2,), ('tp',)))
sequence, batch, in_features, out_features = 4, 2, 8, 16
input = jnp.ones((sequence, batch, in_features))
# NB: the full weight, shard_map will automatically shard weight for us given the
# input spec
weight = jnp.ones((in_features, out_features))
@jax.shard_map(in_specs=(jax.P(None, None, None), jax.P(None, "tp")), out_specs=jax.P(None, None, "tp"))
def colwise_linear(input, weight):
    print('input', jax.typeof(input))
    print('weight', jax.typeof(weight))
    input = jax.lax.pcast(input, "tp", to="varying")
    print('pcast_input', jax.typeof(input))
    output = jnp.einsum("sbi,io->sbo", input, weight)
    print('output', jax.typeof(output))
    return output
output = colwise_linear(input, weight)
This prints:
input float32[4,2,8]
weight float32[8,8]{V:tp}
pcast_input float32[4,2,8]{V:tp}
output float32[4,2,8]{V:tp}
It’s a little difficult to see why the program “doesn’t typecheck” when you omit the pcast, because even if you don’t pcast, JAX will implicitly insert it for you; but you can verify that the cast happens anyway by inspecting the HLO with and without the explicit pcast (this is left as an exercise to the reader; an LLM can one-shot this program transformation). The type system here is also not that invasive: local operations simply propagate variance (varying to varying, invariant to invariant), and you only need a small menagerie of collective operations to take you between varying and invariant.
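If you want to try the exercise, one way to dump the lowered IR (building on the colwise_linear example above; the exact textual output depends on your JAX version) is via jit’s lowering API:
# Dump the lowered HLO so you can diff the program with and without the explicit pcast.
print(jax.jit(colwise_linear).lower(input, weight).as_text())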
Although JAX’s type system here is a bit difficult to explain from first principles, it seems quite beneficial as it makes it impossible to forget backwards reductions that are required when you do operations between sharded and replicated tensors. We’re hoping to bring a similar capability to PyTorch DTensor in the near future, introducing a new “LTensor” subclass that is able to track metadata along these lines.
January 25, 2026Mental arithmetic in grade school (e.g., memorizing your times tables) is typically justified on the grounds that facility in basic calculations makes it easier to focus on higher-level problems that require being able to do these manipulations. When working on DTensor, I have also found it important to be able to quickly calculate what shardings you get when you do matrix multiplies on sharded tensors. Without being able to do this quickly and accurately, working through examples becomes a slog. I’ve also found that while diagrammatic approaches (e.g., drawing a matrix and slicing it into shards) are intuitive, they are slow and unwieldy to do calculations with.
Recently, I’ve found that working on sharding with einsum is nice and efficient, and I hope to persuade you to do it this way when you need to reason about sharding! This post somewhat overlaps with Sharded Matrices and How to Multiply Them, but with some different emphasis and some different notation.
Einstein summation is a compact way of representing many multi-dimensional linear algebra operations, including matrix multiplies. It is nice because you don’t have to puzzle through the abstruse differences between matrix multiply operations like @, torch.matmul, torch.bmm, torch.mm: for any “matrix multiply”, as long as you know the input and output shapes of your tensors, you can directly write out an einsum equation. For example, classic matrix multiply as you see it in math has a signature like mm(x: f32[A, B], y: f32[B, C]) -> f32[A, C]. In einsum notation, you would simply write torch.einsum("ij,jk->ik", x, y): each of the indices lines up exactly with the input sizes. As another example, in nn.Linear, your weight has shape (out_features, in_features). You don’t have to remember how to set up the transposition; just write torch.einsum("bi,oi->bo", input, weight).
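As a quick sanity check (a sketch with made-up sizes), the einsum spelling agrees with the functional form of nn.Linear:
import torch
import torch.nn.functional as F
batch, in_features, out_features = 3, 5, 7
input = torch.randn(batch, in_features)
weight = torch.randn(out_features, in_features)        # nn.Linear's weight layout
out_functional = F.linear(input, weight)               # input @ weight.T
out_einsum = torch.einsum("bi,oi->bo", input, weight)  # no transpose bookkeeping
assert torch.allclose(out_functional, out_einsum, atol=1e-6)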
A useful piece of terminology that pops up for einsum is a contraction dimension. This is any index that appears in the input tensors but not the output tensors. The ones that show up in both inputs and outputs are free dimensions: if the free dimension is in all inputs it’s a batch dimension, and if it’s missing from some inputs we will broadcast those tensors.
Do you always forget how exactly you should transpose your tensors in the backward formula for matrix multiply? As long as you aren’t doing weird things in your einsum (e.g., no repeated indices, every input index is paired with another index), there is a very simple way to compute backwards: keep every input constant except the one you want to compute the gradient for, and swap its index set with the output index set.
For example, linear is "bi,oi->bo" for (input, weight -> output). Then we have:
grad_input = torch.einsum("bo,oi->bi", grad_output, weight)
grad_weight = torch.einsum("bi,bo->oi", input, grad_output)
Intuitively, the reason this works is because reverse-mode AD actually is just transposing the linear function defined by our einsum, and transposed matrix multiplies can be implemented by just reading off its shapes.
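Here is a small autograd check of the swap rule for the linear example (a sketch with arbitrary sizes):
import torch
batch, in_features, out_features = 3, 5, 7
input = torch.randn(batch, in_features, requires_grad=True)
weight = torch.randn(out_features, in_features, requires_grad=True)
output = torch.einsum("bi,oi->bo", input, weight)
grad_output = torch.randn_like(output)
output.backward(grad_output)
# Swap each input's index set with the output's index set:
grad_input = torch.einsum("bo,oi->bi", grad_output, weight)
grad_weight = torch.einsum("bi,bo->oi", input, grad_output)
assert torch.allclose(input.grad, grad_input, atol=1e-5)
assert torch.allclose(weight.grad, grad_weight, atol=1e-5)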
Now that we’re thinking in terms of einsum formulas, all we need is the sharding rule for einsum. The sharding rule tells us under what situations we can perform a matrix multiply by simply doing matrix multiplies on the local shards, producing the output matrix under some output placement.
There are not too many rules. Taking "abi,aoi->abo" as a running example, we can write down these valid placements for a particular mesh dimension (I’ve replaced numeric dim indices with the einsum character index for readability):
- If everything is replicated, the output is replicated: Replicate(), Replicate() -> Replicate()
- If a batch dimension is sharded, the output batch dimension is also sharded: Shard("a"), Shard("a") -> Shard("a")
- If a free dimension is sharded, the output free dimension is sharded, but any broadcasted input must be replicated: Shard("b"), Replicate() -> Shard("b")
- If a contraction dimension is sharded, we will have a pending reduction: Shard("i"), Shard("i") -> Partial()
You can look at Computation With Sharded Arrays for a more detailed explanation for each of these cases.
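If it helps, here is a toy implementation of these rules for a single mesh dimension (my own sketch, not DTensor’s actual propagation code); each input’s placement is either "R" for replicated or the einsum character that is sharded on that mesh dimension:
def einsum_output_placement(equation, placements):
    # placements: one per einsum input; "R" means replicated on this mesh dim,
    # otherwise the einsum index character that is sharded on this mesh dim.
    lhs, out = equation.replace(" ", "").split("->")
    ins = lhs.split(",")
    sharded = {p for p in placements if p != "R"}
    if not sharded:
        return 'Replicate()'              # everything replicated
    assert len(sharded) == 1, "sketch only handles one sharded index per mesh dim"
    c = sharded.pop()
    for spec, p in zip(ins, placements):
        # any input that mentions c must be sharded on it; broadcasted inputs stay "R"
        assert (c in spec) == (p == c), "placements don't match one of the four rules"
    if c in out:
        return f'Shard("{c}")'            # batch or free dimension sharded
    return 'Partial()'                    # contraction dimension sharded
eq = "abi,aoi->abo"
print(einsum_output_placement(eq, ["R", "R"]))   # Replicate()
print(einsum_output_placement(eq, ["a", "a"]))   # Shard("a")
print(einsum_output_placement(eq, ["b", "R"]))   # Shard("b")
print(einsum_output_placement(eq, ["i", "i"]))   # Partial()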
In 2019, Xiaolin Li asked this question about CopyToModelParallelRegion in Megatron:
Why the backward function of _CopyToModelParallelRegion calls reduce fuction? Can somebody share the mathematical proof?
Let’s answer Xiaolin’s question. In Megatron, ColumnParallelLinear is defined as:
input: [sequence, batch, in_features]
weight: [in_features, out_features]
output: [sequence, batch, out_features]
In einsum notation, this is torch.einsum("sbi,io->sbo", input, weight).
On the TP mesh dimension, we have this sharding:
input: Replicate()
weight: Shard("out_features")
output: Shard("out_features")
Let us assume that grad_output: Shard("out_features"). Let’s compute the placements of grad_weight and grad_input. First the derivative formulas:
grad_input = torch.einsum("sbo,io->sbi", grad_output, weight)
grad_weight = torch.einsum("sbi,sbo->io", input, grad_output)
So we see:
grad_input: Partial() # o is sharded and a contraction dim
grad_weight: Shard("out_features") # o is sharded and a free dim
We see that grad_input has a pending reduction, and if downstream backwards is expecting to receive replicated tensors, we must trigger an all-reduce (e.g., in Megatron this all-reduce is manually triggered by _CopyToModelParallelRegion; if you use DTensor, it will just propagate the Partial() until a redistribution to Replicate() is required.)
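To double-check numerically, here is a sketch that fakes tp=2 by chunking the tensors, and confirms that grad_input only comes out right after summing (i.e., all-reducing) the per-shard partials:
import torch
sequence, batch, in_features, out_features, tp = 4, 2, 8, 16, 2
weight = torch.randn(in_features, out_features)
grad_output = torch.randn(sequence, batch, out_features)
# Reference: the unsharded grad_input
grad_input_full = torch.einsum("sbo,io->sbi", grad_output, weight)
# Fake tp=2: shard weight and grad_output along out_features
w_shards = weight.chunk(tp, dim=1)
go_shards = grad_output.chunk(tp, dim=2)
# Each rank computes grad_input from only its local shards: a Partial()
partials = [torch.einsum("sbo,io->sbi", go, w) for go, w in zip(go_shards, w_shards)]
assert not torch.allclose(partials[0], grad_input_full)           # a single shard is wrong
assert torch.allclose(sum(partials), grad_input_full, atol=1e-5)  # the all-reduce fixes it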
In sequence parallel, we will shard the sequence dimension of an input, but not the weight. Let’s say we have a learnable scaling factor:
input: [sequence, batch, hidden]
weight: [hidden]
output: [sequence, batch, hidden]
In einsum notation, this is torch.einsum("sbh,h->sbh", input, weight).
On the SP mesh dimension, we have this sharding:
input: Shard("sequence")
weight: Replicate()
output: Shard("sequence")
Then we have:
grad_input = torch.einsum("sbh,h->sbh", grad_output, weight)
grad_weight = torch.einsum("sbh,sbh->h", input, grad_output)
So we see:
grad_input: Shard("sequence") # s is sharded and a free dim
grad_weight: Partial() # s is sharded and a contraction dim
Here, we must do an all-reduce over grad_weight to get the true replicated gradient.
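The same kind of numerical check works here (a sketch faking sp=2 by chunking the sequence dimension):
import torch
sequence, batch, hidden, sp = 8, 2, 4, 2
input = torch.randn(sequence, batch, hidden)
grad_output = torch.randn(sequence, batch, hidden)
grad_weight_full = torch.einsum("sbh,sbh->h", input, grad_output)
# Each rank only sees its chunk of the sequence dimension: a Partial()
partials = [
    torch.einsum("sbh,sbh->h", i, go)
    for i, go in zip(input.chunk(sp, dim=0), grad_output.chunk(sp, dim=0))
]
assert torch.allclose(sum(partials), grad_weight_full, atol=1e-5)  # all-reduce over the SP group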
Notice that this example is very similar to the tensor parallelism one, but the roles of input and weight have been swapped!
January 4, 2026This blog has lived on WordPress since it was initially created during a social challenge at MIT to write a blog post a week or pay up with beer. I remember a very important piece of advice I had been given at that time: don’t fuck around with your blog authoring software, just do the minimum viable thing (use WordPress) and focus on writing posts.
It’s 2026 now, the world is different, and in particular the existence of coding agents means that this particular advice falls flat: it has never been easier to vibe code your own blog software and be done in an afternoon of token generation. Similarly, over the years I had become increasingly unhappy with my WordPress setup (too hard to add images, ancient version of WordPress, Markdown has taken over the world so why am I still writing in ReST, I love scripts.mit.edu but I definitely don’t want to use it to host serious things). So I typed this into ChatGPT and Claude and asked them what I should migrate to.
I currently have a Wordpress blog whose 633 posts are written in ReST using rest-wordpress with some manual code edits, and a theme based on Ashley that I also customized. I’d like to migrate to another blogging solution. I care a lot about ensuring the URLs are preserved. To a lesser extent, I also care about the comments, although I’m willing to compromise here (e.g., an offline flow where I have to explicitly publish comments might be OK; I know static site is difficult to support comments; I also know that email newsletter is popular and I’d like to support this modality if possible. I don’t use a WYSIWYG editor. It’s on Wordpress 5.1.19. It would be nice to have a way for people. Some more niche things plugins I’ve used is WP LaTeX and Share a Draft but I’m willing to do a lossy conversion if necessary (I don’t use LaTeX that much now; it’s just important to make sure the old posts still format correctly). Many of my posts have images and I’d like an easier flow than my current flow (where I have to manually upload my images to my server and then hyperlink them into the post). What do you recommend?
It suggested Hugo, which I had played around with before in AI Blindspots, and I figured, “Why not, I’ll just ask Claude to do the migration.” A few hours, two pro sessions’ worth of tokens, and some PHP export scripts later, the entire blog was moved over, no muss, no fuss. I live streamed a portion of this migration process, although there’s nothing that special about it.
I actually wasn’t going to write a blog post about this, but then I saw Jeff Geerling’s blog had also made the front page of Hacker News. I too haven’t figured out how I am going to solve the comments problem on the new format; I also still intend to figure out how to get an email newsletter going from the blog. Here’s hoping this encourages you to use LLMs to make the jump for your own personal site!
January 4, 2026Let’s suppose you asked an AI coding agent to “implement a CLI calculator”. Imagine if, instead of only writing a short Python script, it also started building an automated test suite, a crash reporting mechanism, and a telemetry subsystem. You’d be like, “What the fuck is this?”
But now let’s say that you were planning to release this project to users. It would be clearly negligent to not have an automated test suite. A crash reporting mechanism might be overkill for a simple calculator, but for more complicated CLIs interacting with the real world, it may not always be feasible to have a reproducer, in which case crash logs are essential. Similarly, a telemetry subsystem would be wildly inappropriate for an open source local-only calculator, but it could make sense for a networked application or a corporate tool whose users have all consented. One of the important functions of a senior engineer is to be able to evaluate the context a software project lives in and figure out if we need to do something, even if it isn’t explicitly asked for. This is in contrast to a helpful assistant, who is first and foremost obligated to follow the user’s instructions. This leads to a gap between a Helpful Assistant and a Senior Engineer.
In principle, you could prompt the LLM agent to act like a Senior Engineer. In fact, why stop at Senior, let’s tell the LLM to be a Staff Engineer! Imagine that scaling continues: what would you expect the LLM to do when instructed to act in this way? Well, imagine a human L7 engineer who has just been hired by a big tech company to head up some big, new, multi-year initiative. Will they say, “Sure, I can help with that!” and start busily coding away? Of course not: they will go out and start reviewing code, reading docs, talking to people, asking questions, shadowing oncalls, doing small starter tasks–they will start by going out and building context. Here, the “helpful assistant” frame for LLMs is limiting: sure, Claude might ask you a few questions to clarify the task upfront, but if your coding agent starts asking you about “relevant partner teams” and “org-wide priorities for this half” you are definitely going to raise an eyebrow.
What would it take for an LLM to be able to act like a Senior Engineer?
Perhaps prompting is all you need: just write enough information about the surrounding context for a project, and once you feed in enough tokens, a smart model can infer the rest of the details you didn’t explicitly write down. But this context would be bespoke for every project; you would have to redo the exercise every time you had a new project!
Perhaps you can instead prompt a model on how to operate agentically to get the context it needs. This prompt might be more reusable.
But the model may need to actually do wetwork (e.g., talk to humans) to get all of the information it needs. And remember the old saying: the more generic the advice is, the less useful it is. Specificity is king, which leads to…
Let’s say we solve continual learning. Instead of crafting the perfect prompt upfront, you could just drop the model in as an “embodied” software developer. It reads code, talks to people, does projects, and in doing so slowly develops its latent context, in the same way a human engineer does. Building context will often be bottlenecked in the same way it is for humans: you can’t get experience pushing a feature to production until you’ve actually pushed the feature to production (however long that takes).
But just like how you shouldn’t micromanage a Senior Engineer, all of these approaches involve fundamentally different expectations about what an AI coding agent should do, and so even if a model and scaffold are capable of doing these things, it is another question altogether whether it will be asked to behave in this way. So let’s not take it as a foregone conclusion that METR task times will keep following the empirical trendline: I expect a phase transition when the context an LLM needs to do a good job exceeds what scaffolding can provide on the fly.
December 20, 2025I’ve recently been doing a lot of both submitting and reviewing pull requests to PyTorch that were authored with substantial LLM assistance. This is a big difference from earlier this year, when it was clear LLMs worked well for greenfield projects but the code was too hopelessly sloppy for a production codebase. Here are my merged PRs that mention claude code in their description; Jason Ansel has also had a similar experience (Meta only link, here is the list of issues he referenced in his writeup). There has already been increasing discourse (Simon Willison, LLVM) on how code review should adapt to this new era of LLMs. My contribution to this discourse is this: within teams, code review should change to being primarily a human alignment mechanism.
Here is a simple example: it is well known that LLMs are prone to generating overly defensive code: e.g., they will be constantly sprinkling try...catch everywhere or testing if a variable is some type when system invariants imply that it should always be that type. If someone sends me a PR with these problems, I am not commenting on these problems solely because I want them to be fixed. If that’s all I cared about, I could have just fed my comments directly to claude code. The real problem is that the human who was operating the LLM didn’t agree with me that this defensive code was bad, and the point of the review is to align them with me on what is overly defensive versus not. In the most trivial cases, maybe the engineer didn’t read the LLM output, in which case the remedy is to make them actually read the code. But sometimes real human work has to happen; for example, maybe there is a global system invariant that one has to understand to know if the defensiveness is necessary or not. If we agree about the global system invariants, there’s no reason the code review has to go through me: the original code author can just instruct the LLM to fix problems and keep me out of the loop until they have aligned the LLM output to themselves–at which point we should do the more expensive human to human alignment. The ideal is that I don’t need to ever write review comments about mechanical problems, because they have already been fixed by the original author ahead of time.
Conversely, when I am putting up an LLM generated PR for human review, I am trying to transmit higher level information. How does the new code work? What do I need to know about the existing system to understand this code? This doesn’t even have to be in the PR description: if the LLM proposes a fix that I myself don’t understand, or seems difficult to understand, I will simply instruct it to try it a different way, until the resulting diff is obviously correct. Tokens are cheap: we should expect more out of the author of code, because the cost of generating these PRs has gone way down. Similarly, I am willing to throw out the code and start again; you don’t have to feel bad about wasting my time (I didn’t type it! I spent my time understanding the problem, and none of that is regretted.)
There is a lot of scaremongering about how engineers who don’t pick up AI tools will be left behind. My take on this is that there are a number of different skills that make up what it means to be a good software engineer, and LLM coding, even today, is clearly reweighting the relative importance of these skills. I care a lot more about your ability to read code, reason about the big picture, communicate clearly, and have good taste than I care about your ability to mechanically write code. There is an archetype of junior engineer who is not that good at coding but very good at the softer, higher level skills, and I think they will be very valuable in this new world order. Conversely, I think going forward I will have substantially less patience if I have to keep telling you the same things over and over, because I just don’t value raw “ability to code” as much anymore. My ideal state is like that with long-time senior teammates: I can trust that they have made good low level decisions, and I can focus on understanding the bigger picture and updating my mental model of how the system works.
Today’s LLMs have no memory: they have to rediscover everything in the system from first principles every time they are run. The purpose of the humans, of the team, is to collectively maintain a shared vision of what, platonically, the system should do. I want code review to reconfigure itself around this purpose.