This blog has lived on WordPress since it was initially created during a social challenge at MIT to write a blog post a week or pay up with beer. I remember a very important piece of advice I had been given at that time: don’t fuck around with your blog authoring software, just do the minimum viable thing (use WordPress) and focus on writing posts.
It’s 2026 now, the world is different, and in particular the existence of coding agents means this advice no longer holds: it has never been easier to vibe code your own blog software and be done in an afternoon of token generation. Meanwhile, over the years, I had grown increasingly unhappy with my WordPress setup (too hard to add images; ancient version of WordPress; Markdown has taken over the world, so why am I still writing in ReST; I love scripts.mit.edu, but I definitely don’t want to use it to host serious things). So I typed all this into ChatGPT and Claude and asked them what I should migrate to.
Read more...
Let’s suppose you asked an AI coding agent to “implement a CLI calculator”.
Imagine if, instead of only writing a short Python script, it also started
building an automated test suite, a crash reporting mechanism and a telemetry
subsystem. You’d be like, “What the fuck is this?”
But now let’s say that you were planning to release this project to users. It
would be clearly negligent to not have an automated test suite. A crash
reporting mechanism might be overkill for a simple calculator, but for more
complicated CLIs interacting with the real world, it may not always be
feasible to have a reproducer, in which case crash logs are essential.
Similarly, a telemetry subsystem would be wildly inappropriate for an open
source local-only calculator, but it could make sense for a networked
application or a corporate tool where all users have consented. One of the important
functions of a senior engineer is to be able to evaluate the context a
software project lives in and figure out if we need to do something, even if
it isn’t explicitly asked for. This is in contrast to a helpful assistant, who
is first and foremost obligated to follow the user’s instructions. This
leads to a gap between a Helpful Assistant and a Senior Engineer.
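To make the thought experiment a bit more concrete, here is roughly what that gap looks like in code. Both files below are hypothetical sketches of mine, not output from any actual agent run: the first is the “short Python script” that was asked for, and the second is the automated test suite that wasn’t.

```python
# calc.py -- the "short Python script" version of the CLI calculator (illustrative)
import argparse
import operator

# The four operations this toy calculator supports.
OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}

def compute(op: str, a: float, b: float) -> float:
    return OPS[op](a, b)

def main() -> None:
    parser = argparse.ArgumentParser(description="Tiny CLI calculator")
    parser.add_argument("op", choices=sorted(OPS))
    parser.add_argument("a", type=float)
    parser.add_argument("b", type=float)
    args = parser.parse_args()
    print(compute(args.op, args.a, args.b))

if __name__ == "__main__":
    main()
```

```python
# test_calc.py -- the unasked-for "automated test suite"
import pytest
from calc import compute

def test_basic_ops():
    assert compute("add", 2, 3) == 5
    assert compute("mul", 4, 2.5) == 10

def test_div_by_zero():
    with pytest.raises(ZeroDivisionError):
        compute("div", 1, 0)
```

Whether the second file belongs in the deliverable at all is exactly the judgment call the rest of this post is about.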
Read more...
I’ve recently been doing a lot of both submitting and reviewing pull requests to PyTorch that were authored with substantial LLM assistance. This is a big change from earlier this year, when it was clear LLMs worked well for greenfield projects but the code was too hopelessly sloppy for a production codebase. Here are my merged PRs that mention claude code in their description; Jason Ansel has had a similar experience (Meta-only link; here is the list of issues he referenced in his writeup). There has already been increasing discourse (Simon Willison, LLVM) on how code review should adapt to this new era of LLMs. My contribution to this discourse is this: within teams, code review should change to being primarily a human alignment mechanism.
Read more...
A lot of strong engineers that I know haven’t really taken a serious look at AI coding; they’ve used LLMs to ask questions or write simple scripts and appreciate that it is a useful tool, but they haven’t actually tried building a nontrivial application entirely from scratch in vibe coding style (here, I use the term in its original meaning: AI coding where you don’t carefully review the output). This is understandable: if you’re not working on a greenfield project, there aren’t that many opportunities to write code in this style, since standard practice for established projects is that someone else needs to review all of the code you write, which is a bad match for vibe coding! So in this post, I want to give a concrete case study of a nontrivial system that was entirely vibe coded (ScubaDuck), to argue the following claims:
Read more...
Do you use an LLM for coding? Do you maintain a personal benchmark based on problems you have posed the LLM? The purpose of this blog post is to convince you that you should do this: that you can do so with marginal effort on top of your day-to-day vibe coding, and that you will get both short and long term benefits from making your own personal benchmark exist.
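To give a sense of how little “marginal effort” this needs to be, here is a hypothetical sketch of what a personal benchmark harness could look like. The case names, prompts, checker, and file layout are my own assumptions, not anything prescribed here; it assumes you save each model’s answer as one text file per case and just want a pass/fail tally.

```python
# bench.py -- hypothetical skeleton for a personal coding benchmark
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class Case:
    name: str                      # short identifier for the problem
    prompt: str                    # the exact prompt you posed the LLM
    check: Callable[[str], bool]   # does the saved answer meet your bar?

CASES = [
    Case(
        name="csv-dedupe",
        prompt="Write a Python script that removes duplicate rows from a CSV.",
        check=lambda out: "csv" in out and "def " in out,  # crude smoke test
    ),
    # Add a new Case whenever a real prompt trips a model up.
]

def run(results_dir: Path) -> None:
    # Expects one saved model answer per case, e.g. results/model-x/csv-dedupe.txt
    for case in CASES:
        path = results_dir / f"{case.name}.txt"
        ok = path.exists() and case.check(path.read_text())
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")

if __name__ == "__main__":
    import sys
    run(Path(sys.argv[1]))
```

Run it as, say, `python bench.py results/model-x/` after pasting each answer into that directory; the point is that the harness stays small enough to grow alongside your day-to-day prompting.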
I started thinking about benchmarks for coding in part out of frustration with the discourse around LLMs in the public squares I frequent (Reddit and Twitter). People often want to know “what’s the best model?” or “what’s the best coding IDE?” One might imagine that the way to answer these questions would be to test the models on a variety of problems drawn from real-world uses of LLMs for coding, and then compare how well the various systems do. Indeed, whenever a new SOTA model is released, the lab will usually tell you about its performance against a few well known coding benchmarks. Problem solved?
Read more...