ezyang's blog

the arc of software bends towards understanding

AI Coding

Read Less, Steer More

I was coaching some juniors on effective use of AI coding agents (like, looking over their shoulders as they were prompting the LLMs), and one recurring theme was that the AI agents were demanding a lot more of their code-reading skills than their old workflow did. “How,” they asked, “can we get better at reading code?”

I had to think about my answer to this, because even before AI, it’s always been difficult to make the jump from “I mostly write code” to “I spend all my time reviewing code other people have written.” There is some intrinsic difficulty to this; there has always been a shortage of code reviewers, even without LLMs exacerbating the problem!

Read more...

Parallel Agents ❤️ Sapling

Previously in AI-assisted programming for spmd_types, I mentioned that I have been enjoying using Sapling (the version control system Meta uses internally) to manage parallel agents on worktrees. In this post, I want to describe this workflow in more detail. The basic development flow here is I have three work streams going in parallel:

  1. I am working on enabling an end-to-end use case; in this case, enabling strict type-checking on a realistic codebase.
  2. I am adding new features to, and fixing bugs in, the core library itself, as identified by (1).
  3. I am addressing code review comments and CI failures for the diffs produced by (2).

All of this is in the context of a single stack, the bottom of which is the feature diffs and the top of which is the enablement diffs. I have multiple worktrees: one at the top for E2E enablement, one or more at the top of the feature diffs to add new features, and one or more inside the stack to address code review and fix CI. All of these are going in parallel, so the problem is how to coordinate these parallel changes on a single source-of-truth stack. Git does terribly at this, but if you know the trick, Sapling does really well!

Read more...

AI-assisted programming for spmd_types

spmd_types is a type checker for distributed PyTorch programs. I’ll write a blog post about the system itself as it gets closer to completion, but in this post I want to talk about my workflow for AI-assisted programming in the project. In particular, spmd_types is my first greenfield project that both:

  1. I have written it entirely with AI, and
  2. I am comfortable being accountable for it (in the same way I am accountable for my handwritten code).

I have encouraged others to also use AI as the primary mechanism of writing code for the codebase, but I have noticed that people get different outcomes (in terms of diff volume and number of review cycles), so I wanted to write down my workflow in case it was helpful for somebody.

Read more...

Vibe Coding Design Study: tlparse

I recently received the following question about vibe-coding for tlparse, a structured log parser for torch.compile (slightly edited for ease of reading):

Hi Ed, I have some thoughts on vibe-coding tlparse and I’m curious what your opinions are. I think it’s fairly easy to vibe code and improve tlparse, but it’s hard to find people who know Rust or HTML or JavaScript well to properly review the vibe-coded stuff. The Rust PRs are not impossible to review, but the JavaScript ones are certainly hard… YOLO-landing PRs that we can’t review is certainly bad, I guess one option is just not landing any changes, and tell people to vibe-code themselves…?

Read more...

Hugo Migration

This blog has lived on WordPress since it was initially created during a social challenge at MIT to write a blog post a week or pay up with beer. I remember a very important piece of advice I had been given at that time: don’t fuck around with your blog authoring software, just do the minimum viable thing (use WordPress) and focus on writing posts.

It’s 2026 now, the world is different, and in particular the existence of coding agents means that this particular advice falls flat: it has never been easier to vibe code your own blog software and be done in an afternoon of token generation. Similarly, over the years, I had become increasingly unhappy with my WordPress setup (too hard to add images; ancient version of WordPress; Markdown has taken over the world, so why am I still writing in ReST; I love scripts.mit.edu but I definitely don’t want to use it to host serious things). So I typed all this into ChatGPT and Claude and asked them what I should migrate to.

Read more...

The gap between a Helpful Assistant and a Senior Engineer

Let’s suppose you asked an AI coding agent to “implement a CLI calculator”. Imagine if, instead of only writing a short Python script, it also started building an automated test suite, a crash reporting mechanism, and a telemetry subsystem. You’d be like, “What the fuck is this?”

But now let’s say that you were planning to release this project to users. It would be clearly negligent not to have an automated test suite. A crash reporting mechanism might be overkill for a simple calculator, but for more complicated CLIs interacting with the real world, it may not always be feasible to have a reproducer, in which case crash logs are essential. Similarly, a telemetry subsystem would be wildly inappropriate for an open source local-only calculator, but it could make sense for a networked application or a corporate tool used by consenting users. One of the important functions of a senior engineer is to be able to evaluate the context a software project lives in and figure out if we need to do something, even if it isn’t explicitly asked for. This is in contrast to a helpful assistant, who is first and foremost obligated to follow the user’s instructions. This leads to a gap between a Helpful Assistant and a Senior Engineer.

Read more...

Code review as human alignment, in the era of LLMs

I’ve recently been doing a lot of both submitting and reviewing pull requests to PyTorch that were authored with substantial LLM assistance. This is a big difference from earlier this year, when it was clear LLMs worked well for greenfield projects but the code was too hopelessly sloppy for a production codebase. Here are my merged PRs that mention claude code in their description; Jason Ansel has also had a similar experience (Meta only link, here is the list of issues he referenced in his writeup). There has already been increasing discourse (Simon Willison, LLVM) on how code review should adapt to this new era of LLMs. My contribution to this discourse is this: within teams, code review should shift to being primarily a human alignment mechanism.

Read more...

Vibe coding case study: ScubaDuck

A lot of strong engineers that I know haven’t really taken a serious look at AI coding; they’ve used LLMs to ask questions or write simple scripts and appreciate that it is a useful tool, but haven’t actually tried building a nontrivial application entirely from scratch in vibe coding style (here, I use the term in its original meaning: when you do AI coding without carefully reviewing the output). This is understandable: if you’re not working on a greenfield project, there aren’t that many opportunities to write code in this style. Standard practice for established projects is that someone else needs to review all of the code you write, which is a bad match for vibe coding! So in this post, I want to give a concrete case study of a nontrivial system that was entirely vibe coded (ScubaDuck), to argue the following claims:

Read more...

Why you should maintain a personal LLM coding benchmark

Do you use an LLM for coding? Do you maintain a personal benchmark based on problems you have posed the LLM? The purpose of this blog post is to convince you that you should: that you can do so with marginal effort on top of your day-to-day vibe coding, and that you will get both short and long term benefits from making your own personal benchmark exist.

I started thinking about benchmarks for coding in part out of frustration with the discourse around LLMs in the public squares I frequent (Reddit and Twitter). People often want to know “what’s the best model?” or “what’s the best coding IDE?” One might imagine that the way to answer this question would be to test the models on a variety of problems from real-world uses of the LLM for coding, and then compare how well various systems do. Indeed, whenever a new SOTA model releases, the lab will usually tell you about the model’s performance against a few well known coding benchmarks. Problem solved?

Read more...