The secret to successful autogenerated docs : ezyang’s blog

The secret to successful autogenerated docs

I've had a rather successful tenure with autogenerated documentation, both as a writer and a reader. So when Jacob Kaplan-Moss's articles on writing “great documentation” resurfaced on Reddit and had some harsh words about auto-generated documentation, I sat back a moment and thought about why autogenerated documentation leaves developers with a bad taste in their mouths.

I interpreted Kaplan-Moss's specific objections (besides asserting that autogenerated docs were “worthless”) as the following:

They usually didn't contain the information you were looking for (“At best it’s a slightly improved version of simply browsing through the source”),
They were verbose (“good for is filling printed pages when contracts dictate delivery of a certain number of pages of documentation”),
The writers skipped the “writing” part (“There’s no substitute for documentation written...”),
The writers skipped the “organizing” part (“...organized...”),
The writers skipped the “editing” part (“...and edited by hand.”),
It gave the illusion of having documentation (“...it lets maintainers fool themselves into thinking they have documentation”).

Thus, the secret to successful autogenerated docs is:

Remember your audience.

No doubt, gentle reader, you are lifting your eyebrow at me, thinking to yourself, “Of course you should remember your audience; that's what they always teach you in any writing class. You haven't told us anything useful!” So, let me elaborate.

Why do developers forget to “remember their audience”? A defining characteristic of autogenerated documentation is that the medium it is derived from is source code: lines of programming language and docblocks interleaved together. This has certain benefits. For one, keeping the comments close to the code they describe helps ward off documentation rot as code changes. Additionally, source divers have easy access to documentation pertinent to the source file they are reading. But documentation is frequently oriented towards people who are not interested in reading the codebase, and thus the act of writing code and documentation at the same time puts the writer into the wrong mindset. Compare this with sitting down to write a tutorial, the text flowing into an empty document unprejudiced by such petty concerns as code.

This is a shame, because in the case of end-user developer documentation (really the only appropriate time for autodocs), the person who originally wrote the code is most likely to have the relevant knowledge to share about the interface being documented.

What does it mean to “remember my audience”? Concisely, it's putting yourself into your end-user’s shoes and asking yourself, “If I wanted to find out information about this API, what would I be looking for?” This can be hard (and unfortunately, there are no secrets here), but the first step is to be thinking about it at all.
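To make that question concrete, here is a small sketch contrasting a docblock written in a rush with one written for a reader who has never seen the source. The function `parse_duration`, its format, and its behaviour are all invented for this illustration; the field markup is the reStructuredText style that Sphinx renders.

```python
import re


def parse_duration_terse(text):
    """Parse duration."""  # only restates the name; the reader learns nothing
    return parse_duration(text)


def parse_duration(text):
    """Convert a string such as ``"1h30m"`` into a number of seconds.

    Accepts hours (``h``), minutes (``m``) and seconds (``s``), in that
    order. Each part is optional, but at least one must be present.

    :param text: the duration to parse, e.g. ``"45s"`` or ``"2h5m"``
    :returns: the total duration in seconds, as an ``int``
    :raises ValueError: if ``text`` is empty or not in the format above
    """
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"not a duration: {text!r}")
    # Parts that were left out come back as None and count as zero.
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds
```

The second docblock answers the things I would go looking for as a user: what goes in, what comes out, and what happens when I get it wrong.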

How can I remember to consider the audience when writing docblocks? While it would be nice if I could just snap my fingers and say, “I'm going to write docblocks with my audience in mind,” I know that I'm going to forget and write a snippy docblock because I was in a rush one day or omit the docblock entirely and forget about it. Writing documentation immediately after the code is written can be frustrating if five minutes later you decide that function did the wrong thing and needs to be axed.

Thus, I've set up these two rules for myself:

It's OK not to write documentation immediately after writing code (better not yet than poorly).

Like many people working in high-level languages, I like using code to prototype API designs. I'll write something, try to use it, change it to fit my use-case, write some more, and eventually I'll have both working code and a working design. If I don't write documentation while I'm prototyping, that's fine, but when that's all over I need to write the documentation (hopefully before the code slips out of my active mindspace). The act of writing the documentation at the end helps finalize the API, and can suggest finishing touches. I also use my toolchain to tell me when I've left code undocumented (with Sphinx, this means using the coverage extension).

When writing documentation, constantly look at the output the end-user will see.

You probably have a write/preview authoring cycle when you edit any sort of text that contains formatting. This cycle should carry over to docblocks: you edit your docblock, run your documentation build script, and inspect the results in your browser. It helps if the output you're producing is beautiful! It also means that your documentation toolchain should be smart about what it needs to recompile when you make changes. The act of inspecting what a live user will see helps put you in the right mindset, and also forces you to say, “Yeah, these docs are not actually acceptable.”
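For the Sphinx case, the toolchain side of both rules can be quite small. The fragment below is only a sketch of a `conf.py`: the two extension names are real Sphinx extensions, but everything else a real project needs (project name, theme, paths) is left out.

```python
# conf.py (fragment): a minimal sketch, project-specific settings omitted
extensions = [
    "sphinx.ext.autodoc",   # pulls docblocks out of the source code
    "sphinx.ext.coverage",  # reports objects that have no documentation
]
```

With this in place, something like `sphinx-build -b coverage docs docs/_build/coverage` writes a report of undocumented Python objects to `python.txt` in the output directory, and `sphinx-build -b html docs docs/_build/html` produces the pages to preview in a browser. The directory names here are assumptions; use whatever layout your project has.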

My autodocumentor produces verbose and unorganized output! I've generally found autogenerated documentation from Python or Haskell to be much more pleasant to read than that from Java or C++. The key difference between these languages is that Python and Haskell organize their modules into files; thus, programmers in those languages find it easier to remember the module docblock!

The module docblock is of great import. If your code is well-written and well-named, a competent source-diver can usually figure out what a particular function does in only a few times as long as it would take them to read your docblock. The module is the first organizational unit above class and function, precisely where documentation starts becoming the most useful. It is the first form of “high-level documentation” that developers pine for.

So, in Python and Haskell, you write all of the functionality involved in a module in a file, and you can stick a docblock up top that says what the entire file does. Easy! But in Java and C++, each file is typically a class (frequently a small one), so you don't get a chance to do that. Java has packages and C++ has namespaces, which can play a similar role, but where are you supposed to put the docblock for what in Java is effectively a directory?
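As a sketch of what such a docblock up top might look like in Python, here is a tiny, made-up settings module. The module's purpose and the names `loads` and `dumps` are hypothetical; the point is that the first thing in the file tells an end-user what the whole module is for and how the pieces fit together.

```python
"""Read and write simple ``key = value`` settings files.

This module is the only part of the (hypothetical) package that end
users need. Typical use is a round trip::

    settings = loads(text)
    settings["debug"] = "yes"
    text = dumps(settings)

Values are always strings; converting them is left to the caller.
Blank lines and lines starting with ``#`` are ignored on input and
are not preserved on output.
"""


def loads(text):
    """Parse ``text`` and return its settings as a ``dict``."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def dumps(settings):
    """Render ``settings`` back into ``key = value`` lines."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())
```

Notice that the per-function docblocks can stay short, because the module docblock has already explained the overall shape and the caveats.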

There is also substantial verbosity pollution that comes from an autodocumenting tool attempting to generate documentation for classes and functions that were not intended to be used by the end-user. Haddock (Haskell autodocumentor) strongly enforces this by not generating documentation for any function that the module doesn't export. Sphinx (Python autodocumentor) will ignore by default functions prefixed with an underscore. People documenting Java, which tends to need a lot of classes, should think carefully about which classes they actually want people to use.
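The filtering idea is simple enough to sketch in a few lines. To be clear, this is not how Sphinx or Haddock are implemented; `documented_names` is a hypothetical helper that just mimics the two conventions mentioned above, an explicit export list (`__all__` in Python) and the leading-underscore rule.

```python
import inspect
import types


def documented_names(module):
    """Return the names an autodoc-style tool might choose to document.

    If the module defines ``__all__``, that list is taken as the public
    interface. Otherwise every function or class whose name does not
    start with an underscore is considered public.
    """
    exported = getattr(module, "__all__", None)
    if exported is not None:
        return sorted(exported)
    return sorted(
        name
        for name, value in vars(module).items()
        if not name.startswith("_")
        and (inspect.isfunction(value) or inspect.isclass(value))
    )


# A throwaway module built in memory, purely for demonstration.
demo = types.ModuleType("demo")
exec(
    "def connect(): pass\n"
    "def _retry(): pass\n"
    "class Session: pass\n",
    demo.__dict__,
)
print(documented_names(demo))  # prints ['Session', 'connect']
```

The private helper `_retry` never reaches the output, which is exactly the kind of noise reduction that keeps generated docs readable.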

Final thoughts. The term “autogenerated documentation” is a misnomer: there is no automatic generation of documentation. Rather, the autodocumentor should be treated as a valuable documentation building tool that lets you get the benefits of cohesive code and comments, as well as formatting, interlinking and more.

Etienne Millon says:

June 21, 2010 at 10:05 am

> where are you supposed to put the docblock for what in Java is effectively a directory?

If you are referring to Java packages, you can use a special file named “package-info.java”, which is used only by javadoc. It should provide general information about the package, and it is the best place to describe the general invariants and responsibilities of that package.

Edward Z. Yang says:

June 21, 2010 at 7:06 pm

Etienne, I was not aware of that, but that is excellent! Easily forgotten, of course, but definitely the right thing to do.

Andy Wingo says:

June 22, 2010 at 9:19 am

Do you have a link to example autogenerated docs that you are happy with?

I would like to rely on autogenerated docs, for reasons I wrote about (6 years ago!) here, but the output is currently not nice enough.

Andy Wingo says:

June 22, 2010 at 9:20 am

Ah sorry, I meant this link instead: hither. They are related but not the same.

Edward Z. Yang says:

June 22, 2010 at 9:28 am

Hey Andy, I’m particularly pleased with the output that Sphinx in combination with autodoc produces; here is an example of such output.

I notice that your linked posts discuss documenting Scheme code; having hacked on Scheme a bit myself, I suppose I can’t be too surprised if toolchain support in the language is a bit immature.

Andy Wingo says:

June 22, 2010 at 11:40 am

Hi Edward, thanks for the pointer. However I am still a bit unconvinced; the arguments made by the GNU coding standards are strong.

GNU (where I come from) is pretty weak on web design, but they do seem to have a good grasp of text-wrangling and story-telling.

Edward Z. Yang says:

June 22, 2010 at 7:00 pm

I think GNU is talking about docstrings, which are quite different in character from docblocks. For example, docstrings are kept around while the program runs and are intended to let you do things like help(function_name), which is an extremely specific use-case, and results in them needing to be standalone. I don’t really see any fundamental use-case that requires docblocks to work standalone.

Edward Z. Yang says:

June 27, 2010 at 10:29 pm

I should also add, Python docstrings are indeed docstrings, but I never use them like that (i.e., use the help function).

Für die Schublade generieren | Textmulch says:

July 12, 2010 at 8:44 am

[…] The secret to successful autogenerated docs notes something that really should be self-evident, but is readily overlooked (or even repressed?): The word “autogenerated documentation” is a misnomer: there is no automatic generation of documentation. Rather, the autodocumentor should be treated as a valuable documentation building tool that lets you get the benefits of cohesive code and comments, as well as formatting, interlinking and more. […]

ColonelPanic » Integrating doxygen and confluence says:

October 19, 2010 at 2:16 pm

[…] good documentation for use with doxygen is difficult, but if you work at it you can come up with some pretty good stuff. However, the index in doxygen isn’t quite what […]