The secret to successful autogenerated docs

by Edward Z. Yang

I've had a rather successful tenure with autogenerated documentation, both as a writer and a reader. So when Jacob Kaplan Moss's articles on writing “great documentation” resurfaced on Reddit and had some harsh words about auto-generated documentation, I sat back a moment and thought about why autogenerated documentation leave developers with a bad taste in their mouths.

I interpreted Moss's specific objections (besides asserting that they were “worthless” as the following:

  1. They usually didn't contain the information you were looking for (“At best it’s a slightly improved version of simply browsing through the source”),
  2. They were verbose (“good for is filling printed pages when contracts dictate delivery of a certain number of pages of documentation”),
  3. The writers skipped the “writing” part (“There’s no substitute for documentation written...”),
  4. The writers skipped the “organizing” part (“...organized...”),
  5. The writers skipped the “editing” part (“...and edited by hand.”),
  6. It gave the illusion of having documentation (“...it lets maintainers fool themselves into thinking they have documentation”).

Thus, the secret to successful autogenerated docs is:

Remember your audience.

No doubt, gentle reader, you are lifting your eyebrow at me, thinking to yourself, “Of course you should remember your audience; that's what they always teach you in any writing class. You haven't told us anything useful!” So, let me elaborate.

Why do developers forget to “remember their audience”? A defining characteristic of autogenerated documentation is that the medium it is derived from is source code: lines of programming language and docblocks interleaved together. This has certain benefits: for one, keeping the comments close to the code they describe helps ward off documentation rot as code changes, additionally, source divers have easy access to documentation pertinent to a source file they are reading. But documentation is frequently oriented towards people who are not interested in reading the codebase, and thus the act of writing code and documentation at the same time puts the writer into the wrong mindset. Compare this with sitting down to a tutorial, the text flowing into an empty document unprejudiced by such petty concerns as code.

This is a shame, because in the case of end-user developer documentation (really the only appropriate time for autodocs), the person who originally wrote the code is most likely to have the relevant knowledge to share about the interface being documented.

What does it mean to “remember my audience”? Concisely, it's putting yourself into your end-user’s shoes and asking yourself, “If I wanted to find out information about this API, what would I be looking for?” This can be hard (and unfortunately, there are no secrets here), but the first step is to be thinking about it at all.

How can I remember to consider the audience when writing docblocks? While it would be nice if I could just snap my fingers and say, “I'm going to write docblocks with my audience in mind,” I know that I'm going to forget and write a snippy docblock because I was in a rush one day or omit the docblock entirely and forget about it. Writing documentation immediately after the code is written can be frustrating if five minutes later you decide that function did the wrong thing and needs to be axed.

Thus, I've set up these two rules for myself:

  1. It's OK not to write documentation immediately after writing code (better not yet than poorly).

    Like many people working in high-level languages, I like using code to prototype API designs. I'll write something, try to use it, change it to fit my use-case, write some more, and eventually I'll have both working code and a working design. If I don't write documentation while I'm prototyping, that's fine, but when that's all over I need to write the documentation (hopefully before the code slips out of my active mindspace.) The act of writing the documentation at the end helps finalize the API, and can suggest finishing touches. I also use my toolchain to tell me when I've left code undocumented (with Sphinx, this is using the coverage plugin).

  2. When writing documentation, constantly look at the output the end-user will see.

    You probably have a write/preview authoring cycle when you edit any sort of text that contains formatting. This cycle should carry over to docblocks: you edit your docblock, run your documentation build script, and inspect the results in your browser. It helps if the output you're producing is beautiful! It also means that your documentation toolchain should be smart about what it needs to recompile when you make changes. The act of inspecting what a live user will see helps put you in the right mindset, and also force you to say, “Yeah, these docs are not actually acceptable.”

My autodocumentor produces verbose and unorganized output! I've generally found autogenerated documentation from Python or Haskell to be much more pleasant to read than that from Java or C++. The key difference between these languages is that Python and Haskell organize their modules into files; thus, programmers in those language find it easier to remember the module docblock!

The module docblock is one of great import. If your code is well-written and well-named, a competent source-diver can usually figure out what a particular function does in only a few times longer than it would take for them to read your docblock. The module is the first organizational unit above class and function, precisely where documentation starts becoming the most useful. It is the first form of “high-level documentation” that developers pine for.

So, in Python and Haskell, you write all of the functionality involved in a module in a file, and you can stick a docblock up top that says what the entire file does. Easy! But in Java and C++, each file is a class (frequently a small one), so you don't get a chance to do that. Java and recent C++ have namespaces, which can play a similar role, but where are you supposed to put the docblock for what in Java is effectively a directory?

There is also substantial verbosity pollution that comes from an autodocumenting tool attempting to generate documentation for classes and functions that were intended to not be used by the end-user. Haddock (Haskell autodocumentor) strongly enforces this by not generating documentation for any function that the module doesn't export. Sphinx (Python autodocumentor) will ignore by default functions prefixed with an underscore. People documenting Java, which tends to need a lot of classes, should think carefully about which classes they actually want people to use.

Final thoughts. The word “autogenerated documentation” is a misnomer: there is no automatic generation of documentation. Rather, the autodocumentor should be treated as a valuable documentation building tool that lets you get the benefits of cohesive code and comments, as well as formatting, interlinking and more.