The PyTorch open source process
PyTorch is a fairly large and active open source project, and sometimes we have people come to us and ask if there are any lessons from how we run PyTorch that they could apply to their own projects. This post is an attempt to describe some of the processes as of 2021 that help PyTorch operate effectively as an open source project. I won't claim that everything we do necessarily the best way to go about doing things, but at the very least, everything I describe here is working in practice.
Background. Not all open source projects are the same, and there are some peculiarities to PyTorch which may reduce the applicability of some of what I describe below in other contexts. Here are some defining features of PyTorch, as a project:
- The majority of full time PyTorch developers work at Facebook. To be clear, there are many full time PyTorch developers that work at other companies: NVIDIA, Intel, Quansight, Microsoft, AMD, IBM, Preferred Networks, Google and Amazon all employ people whose job it is to work on PyTorch. But the majority of full timers are at Facebook, distinguishing PyTorch from hobbyist open source projects or projects run by a foundation of some sort.
- PyTorch is a federation. As coined by Nadia Eghbal, PyTorch is a project with high contributor growth and user growth. In my State of PyTorch (2020) talk, I go into more details, but suffice to say, we have over nine companies contributing to PyTorch, and a long tail of other contributors (making up 40% of all of our commits). This makes managing PyTorch sometimes particularly challenging, and many of the processes I will describe below arose from growing pains scaling this level of activity.
- PyTorch has a lot of surface area. CPU, CUDA, ROCm, ONNX, XLA, serving, distributions, quantization, etc. It's impossible for a single contributor to be well-versed in every area of the project, and so some of the challenge is just making sure the right people see the things they need to see.
Alright, so how does PyTorch deal with its scale? Here are some of the things we do.
Issue triage. PyTorch receives too many bug reports a day for any one person to keep track of all of them. Largely inspired by this apenwarr post, we setup an oncall rotation amongst Facebook contributors to serve as first line triage for all of these issues. The golden rule of issue triage is that you DO NOT fix bugs in triage; the goal of triage is to (1) route bugs to the correct people via appropriate GitHub labels, and (2) look for high priority bugs and raise awareness of these bugs. Every week, we have a meeting to review high priority bugs (and other bugs marked for triage review) and talk about them. The oncall itself rotates daily, to discourage people from letting a week's worth of issues pile up in the backlog, and we use a relatively intricate search query to make sure only relevant issues show up for the oncall to handle.
The most important consequence of issue triage is that you can unwatch PyTorch repository as a whole. Instead, by watching various labels (using our cc bot), you can trust that you will get CC'ed to issues related to topics, even if the triager doesn't know that you're interested in the issue! The weekly meeting makes sure that all maintainers collectively have an idea about what major issues are currently affecting PyTorch, and helps socialize what we as a project think of as a "high priority" issue. Finally, the high priority label is a good way to find impactful problems to work on in the project, even if you don't know much else about the project.
Pull request triage. Similarly, we receive a decent number of drive by pull requests from one time contributors. Those people are not in a good position to find reviewers for their contributions, so we also have a triager look through these pull requests and make sure someone is assigned to review them. If the PR is particularly simple, the triager might just go ahead and merge it themselves. There's actually some good automation for doing this (e.g., homu) but we've been too lazy to set any of it up, and by hand reviewer assignment doesn't seem to be too much burden on top of the existing oncall.
Tree hugging oncall. PyTorch has a huge CI system covering many different system configurations which most contributors rely on to test if their changes are safe. Sometimes people break master. Separate from the triage oncall, we have a tree hugging oncall whose job it is to revert jobs if they break master. This oncall involves mostly paying attention to the CI HUD and reverting commits if they result in master breakage in one of the configurations.
Importing to Facebook infrastructure. We actually run Facebook infrastructure directly off of the HEAD branch in PyTorch. The tooling that makes this possible is fbshipit, which mirrors commits between Facebook's internal monorepo and our public GitHub repository. This setup has been something of a double-edged sword for us: requiring Facebook and GitHub to be in sync means that only Facebook employees can actually land pull requests (we try to streamline the process as much as possible for external maintainers, but at the end of the day someone at Facebook has to actually push the green button), but it means we don't have to worry about doing periodic "mega-imports" into Facebook infrastructure (which we have done in the past and were quite difficult to do). We are very interested in fixing this situation and have floated some proposals on changing how we do internal releases to make it possible to let external contributors land PRs directly.
RFCs. Most feature discussion happens on GitHub issues, but sometimes, a feature is too big and complicated to adequately discuss in a GitHub issue. In those cases, they can be discussed in the rfcs repository (inspired by the Rust RFCs process). The formal process on this repository isn't too solidified yet, but generally people go there if they feel that it is too difficult to discuss the issue in GitHub issues. We don't yet have a process for shepherding unsolicited RFCs.
Conclusion. PyTorch's open source process isn't rocket science: there's an oncall, the oncall does some things. The devil is in the details: all of PyTorch's oncall responsibilities are carefully scoped so that your oncall responsibilities aren't something that will take an unbounded amount of time; they're something you can knock out in an hour or two and call it a day. You could make the argument that we rely excessively on oncalls when automation is possible, but what we have found is that oncalls require less infrastructure investment, and integrate well with existing processes and flows at Facebook. They might not be right everywhere, but at least for us they seem to be doing a good job.