I was coaching some juniors on effective use of AI coding agents (like, looking over their shoulders as they were prompting the LLMs) and one recurring theme was that the AI agents were demanding a lot more of their code-reading skills, compared to how they used to operate. “How,” they asked, “can we get better at reading code?”
I had to think about my answer to this, because even before AI, it’s always been difficult to make the jump from “I mostly write code” to “I spend all my time reviewing code other people have written.” There is some intrinsic difficulty here; there has always been a shortage of code reviewers, even without LLMs exacerbating the problem!
Read more...
Previously, in AI-assisted programming for spmd_types, I mentioned that I have been enjoying using Sapling (the version control system Meta uses internally) to manage parallel agents on worktrees. In this post, I want to describe this workflow in more detail. The basic development flow is that I have three work streams going in parallel:
- I am working on enabling an end-to-end use case; in this case, enabling strict type-checking on a realistic codebase.
- I am adding new features to, and fixing bugs in, the core library itself, as identified by (1).
- I am addressing code review comments and CI failures for the diffs produced by (2).
All of this is in the context of a single stack, the bottom of which is the feature diffs, and the top of which is the enablement diffs. I have multiple worktrees: one at the top for E2E enablement, one or more at the top of the feature diffs to add new features, and one or more inside the stack to address code review and fix CI. All of these are going in parallel, and so the problem is how to coordinate these parallel changes on a single source-of-truth stack. Git does terribly at this, but if you know the trick, Sapling does really well!
Read more...
spmd_types is a type checker for distributed PyTorch programs. I’ll write a blog post about the system itself as it gets closer to completion, but in this post I want to talk about my workflow for AI-assisted programming in the project. In particular, spmd_types is my first greenfield project that both:
- I have written entirely with AI, and
- I am comfortable being accountable for (in the same way I am accountable for my handwritten code).
I have encouraged others to also use AI as the primary mechanism of writing code for the codebase, but I have noticed that people get different outcomes (in terms of diff volume and number of review cycles), so I wanted to write down my workflow in case it was helpful for somebody.
Read more...
A silent BC-breaking change is a change in the semantics of an API that is not immediately obvious; e.g., old usages of the changed behavior won’t raise a compile-time or runtime error on upgrade. If your project cares about backwards compatibility, it’s generally a bad idea to ship silent BC-breaking changes, because they manifest as users silently getting incorrect results and then having to painfully root-cause the difference to a particular version upgrade. However, many bug fixes are technically silent BC-breaking changes, especially when the old buggy behavior could conceivably be useful in certain circumstances. Load-bearing bugs are often the reason bug-for-bug compatibility is sometimes required.
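To make this concrete, here is a minimal, entirely hypothetical sketch (not from any real library) of a bug fix that is a silent BC break: a statistics helper that switches from population variance to sample variance between versions. No error is raised on upgrade; callers just silently get different numbers.

```python
# Hypothetical example: a "bug fix" that silently changes results for
# existing callers of a made-up statistics helper.

def variance_v1(xs):
    # v1.0: divides by n (population variance) -- arguably a bug if the
    # function was documented as computing sample variance.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_v2(xs):
    # v2.0: "fixed" to divide by n - 1 (sample variance). No compile or
    # runtime error on upgrade; old call sites just start getting
    # larger values.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [1.0, 2.0, 3.0, 4.0]
# Same call, same inputs, different answer across versions -- the user
# has to root-cause the discrepancy to the version upgrade themselves.
print(variance_v1(data))  # 1.25
print(variance_v2(data))  # 1.666...
```

If someone was (perhaps unknowingly) relying on the v1 behavior, that reliance is exactly the kind of load-bearing bug that forces bug-for-bug compatibility.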
Read more...
I recently received the following question about vibe-coding for tlparse, a structured log parser for torch.compile (slightly edited for ease of reading):
Hi Ed, I have some thoughts on vibe-coding tlparse and I’m curious what your opinions are. I think it’s fairly easy to vibe code and improve tlparse, but it’s hard to find people who know Rust or HTML or JavaScript well to properly review the vibe-coded stuff. The Rust PRs are not impossible to review, but the JavaScript ones are certainly hard… YOLO-landing PRs that we can’t review is certainly bad, I guess one option is just not landing any changes, and tell people to vibe-code themselves…?
Read more...
A central thesis of sharding in types is that the backward sharding can be directly computed from the forward sharding. This is not true for DTensor today (e.g., as seen in sum to Partial), and it leads to confusion where users cannot easily predict what the shardings of the tensors in their program will be. The question now arises: given a forward sharding, what should its backward sharding be? There are some easy cases to fill in:
Read more...
DTensor has famously terrible eager-mode performance; for example, this paper measured a 35-60% slowdown in end-to-end training performance with DTensor relative to without it (with DTensor operations taking at least 7x longer than actually running the computation). While it is possible to alleviate some of this slowdown via optimizations (in the paper, veScale shows that fast-path bypass of sharding propagation, improved cache lookups, and C++ code can bring dispatch overhead down to 30us), this is still too high for some settings.
Read more...
Conventionally, a type system is something that classifies values into data types like float32 or int64. However, fancy type systems go beyond data types, allowing us to talk about potentially arbitrary invariants on data; for example, if we were to talk about the “type” of an array, it would cover not only its data type but also its shape, e.g., f32[40, 20]. JAX’s type system of abstract values (avals) goes further than just data types and shapes and is equipped to reason about sharding-related invariants. However, this type system is poorly documented, especially recent additions like reduced/unreduced axes (circa June 2025). In this blog post, I want to give a consolidated description of the sharding-related aspects of JAX’s typing in explicit sharding mode, as of 2026. Disclaimer: I am not a JAX developer, and there may be mistakes in this presentation; please let me know about errors on Twitter. I will assume that you have some knowledge of how to work with JAX sharding in the frontend; please refer to Distributed arrays and automatic parallelization, Explicit sharding, and Manual parallelism with shard_map for a refresher on these topics.
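As a simplified illustration of a “type” that covers both data type and shape, here is a NumPy sketch of what classifying an array as f32[40, 20] means. The `aval` helper below is my own stand-in for the idea, not JAX’s actual API; JAX’s real avals additionally carry sharding information in explicit sharding mode.

```python
import numpy as np

# A "type" in the fancy sense: data type plus shape, written f32[40, 20].
x = np.zeros((40, 20), dtype=np.float32)

def aval(arr):
    # Simplified stand-in for an abstract value: just (dtype, shape).
    # JAX's real avals also track sharding-related invariants.
    return (arr.dtype, arr.shape)

print(aval(x))  # (dtype('float32'), (40, 20))

# Two arrays with the same aval are interchangeable at the type level,
# even though their values differ.
y = np.ones((40, 20), dtype=np.float32)
assert aval(x) == aval(y)
```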
Read more...
Global SPMD (also known as the “global view”, exposed by code using DTensor or jax.Array) refers to writing multi-device code as if it were running on a single device, with an orthogonal mechanism for expressing how these full tensors are distributed over multiple devices (this mechanism can be implicit or explicit, e.g., as seen in this table).
Local SPMD (also known as the “per-device view”, exposed by local_map and shard_map, as well as by traditional PyTorch distributed code operating on plain Tensors, e.g., Megatron-style) refers to writing code from the “local” view of a single device, with explicit collectives when communicating across devices.
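The contrast between the two views can be sketched with a toy model: NumPy arrays stand in for on-device shards of two hypothetical “devices”, and an explicit sum stands in for an all-reduce. This is only an illustration of the programming models, not real distributed code.

```python
import numpy as np

full = np.arange(8.0).reshape(4, 2)

# Global SPMD: write the program against the full tensor; some orthogonal
# mechanism (implicit or explicit) decides how `full` is laid out.
global_result = full.sum(axis=0)

# Local SPMD: each device sees only its shard, and the programmer
# inserts collectives explicitly.
shards = np.split(full, 2, axis=0)   # "device 0" gets rows 0-1, "device 1" rows 2-3
local_partials = [s.sum(axis=0) for s in shards]
# Manual "all-reduce": combine the per-device partial results.
local_result = local_partials[0] + local_partials[1]

assert np.array_equal(global_result, local_result)
print(global_result)  # [12. 16.]
```

The same computation, but in the global view the reduction across devices is the framework’s problem, while in the local view it is yours.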
Read more...
In Computing sharding with einsum, we worked through an example of Megatron-style tensor parallelism, where we discovered that the ordinary backward formula for linear results in a pending reduction on grad_input, even though the input was replicated and no communication happened in the forward pass. In Megatron, which is implemented with plain Tensors and manual collectives, you just have to know that this reduction is necessary and manually insert it with a custom autograd function.
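The pending reduction can be demonstrated with a small NumPy sketch of column-parallel linear on two hypothetical “devices” (no real collectives; the explicit sum stands in for the all-reduce you would insert in the custom autograd function):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))    # input, replicated on both "devices"
W = rng.standard_normal((4, 6))    # full weight
W0, W1 = W[:, :3], W[:, 3:]        # column-sharded across 2 devices

# Forward: each device computes its slice of the output; no communication.
y0, y1 = x @ W0, x @ W1            # concat(y0, y1) == x @ W

# Backward for grad_input: the full formula is grad_out @ W.T, but each
# device only holds its shard of grad_out and W, so it can only produce
# a partial sum.
grad_out = rng.standard_normal((3, 6))
g0, g1 = grad_out[:, :3], grad_out[:, 3:]
partial0 = g0 @ W0.T
partial1 = g1 @ W1.T

# The pending reduction: grad_input is only correct after an all-reduce
# (here, an explicit sum) over the per-device partials.
grad_input = partial0 + partial1
assert np.allclose(grad_input, grad_out @ W.T)
```

Even though the forward pass communicated nothing, the backward pass is wrong until you reduce; forgetting this is exactly the bug the custom autograd function exists to prevent.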
Read more...