Best documentation format

Software has and should have many kinds of documentation:

  • Source code itself should be as self-documenting as possible.
  • Comments embedded in source code files should explain what is not obvious from the code itself.
  • Formalized comments embedded in source code files should declare those parts of the semantics of a module's interface that are not implicit in the interface itself. Such formalized comments should be extracted by a tool to build the API specification of that module.
  • Requirements of software should be written in plain language, possibly with examples (a prototype, some sketches or screen-shots, some formulas, and, in case the required software is a software library, some code snippets that show how to use the library).
  • A user manual should explain how to use the software product. If the software product is a software library, the user manual is a tutorial on how to use the library, plus the API reference specification. This manual may be distributed in several formats:
      • as a printed book (or booklet, or leaflet);
      • as a file formatted as a printed book (typically in PDF format);
      • as a file to be read from the Web (typically in HTML format);
      • as a context-sensitive help window popped up by a command in the user interface of the program (in one of several formats, such as HTML or CHM).

Of course, the first three kinds of documentation are edited using the source code editor itself. But there are several practices for building the other kinds of documentation.

Some of them require that the source be in a proprietary binary format, like Microsoft Word’s or FrameMaker’s; others require an open binary format, like OpenDocument Text, used by OpenOffice; others require a proprietary text format, like Rich Text Format; others require an open text format, like DocBook.

Text formats may be classified according to several criteria: bracket-oriented or command-oriented; free-format or line-oriented; formatting-oriented or semantic-oriented.

In bracket-oriented formats, like RTF, DocBook and, in part, HTML, many elements (actually, the non-empty ones) have tags that open sections and corresponding tags that close them. In command-oriented formats, instead, most elements have only a command to open them, and none to close them; a section is closed when another section is opened or when the document ends.
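As a minimal sketch of the command-oriented idea, here is a small Python function that splits a document into sections. The "# " heading command and the name `split_sections` are assumptions made for this illustration, not part of any particular format: a heading opens a section, and nothing closes it except the next heading or the end of the input.

```python
def split_sections(lines):
    """Group lines into (title, body_lines) sections.

    A line starting with "# " opens a section; there is no closing command.
    A section ends when the next heading appears or when the input ends.
    Lines that come before the first heading are ignored in this sketch.
    """
    sections = []
    current = None
    for line in lines:
        if line.startswith("# "):
            # Opening a new section implicitly closes the previous one.
            current = (line[2:].strip(), [])
            sections.append(current)
        elif current is not None:
            current[1].append(line)
    return sections


doc = [
    "# Introduction",
    "First paragraph.",
    "# Usage",
    "Second paragraph.",
    "Third paragraph.",
]
sections = split_sections(doc)
```

A bracket-oriented format would instead need an explicit closing tag for each of the two sections.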

In free-format formats, a command may begin or end at any point of a line; therefore a single line may contain the whole book. Line-oriented formats, like the C preprocessor language, instead require that commands of many types each stand on a line of their own; therefore a given document source has a minimum number of lines.

Formatting-oriented formats specify the actual presentation of the document: the typeface, color and size of characters, the indentation and the line spacing between and inside paragraphs, the size of the page. Semantic-oriented formats, instead, specify only which text portions are section titles and of which level, which are plain text, and which are source code snippets, leaving the formatting decisions to a rendering or transforming program, possibly parameterized by a style sheet.
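To make the distinction concrete, here is a small Python sketch in which the document records only what each portion is, and two separate rendering functions decide how it looks. The element names ("title", "text", "code"), the sample content and both renderers are hypothetical choices made for this illustration.

```python
import html

# A semantic description: what each portion is, not how it looks.
document = [
    ("title", "Installation"),
    ("text", "Run the installer."),
    ("code", "setup --quiet"),
]


def render_html(doc):
    """Render for the Web: map each semantic kind to an HTML element."""
    tags = {"title": "h1", "text": "p", "code": "pre"}
    return "\n".join(
        f"<{tags[kind]}>{html.escape(content)}</{tags[kind]}>"
        for kind, content in doc
    )


def render_plain(doc):
    """Render for a plain-text medium: upper-case titles, indented code."""
    out = []
    for kind, content in doc:
        if kind == "title":
            out.append(content.upper())
        elif kind == "code":
            out.append("    " + content)
        else:
            out.append(content)
    return "\n".join(out)


as_html = render_html(document)
as_plain = render_plain(document)
```

The same source produces both outputs; changing the look of every title means changing one renderer, not the document.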

Here I argue that there are several disadvantages with proprietary, binary, bracket-oriented, free-format, formatting-oriented formats, and therefore it is better to choose a format that is open, textual, command-oriented, line-oriented, and semantic-oriented.

Here are the reasons.

First of all, proprietary formats create a dependency on the strategies of the copyright holder. If the holder decides not to support a format any more, those who prefer to stick to that format are left out in the cold. In addition, the specifications of such formats are often unpublished, or documents using such formats are accessible only through costly proprietary tools, or both. One then has to pay to access those formats and cannot freely share such documents without forcing other people to buy the tool; and if an open source tool has reverse-engineered the format, one is never sure that the document it shows is exactly the original one.

Of course, binary formats, even the open ones, require a specific program to edit them. While some users may like using that program, others may prefer to use their favorite text editor. Therefore a format editable as raw text is preferable.

But even some textual formats virtually require a specific program to edit them. Indeed, free-format text documents, like XML documents, are hard to read and write using a simple text editor if they are not well indented. Syntax highlighting and bracket matching may help, but line-oriented formats are still much easier to read and write than free-format ones.

In addition, it is always a good idea to keep every important document under a revision control system. The purpose of a revision control system is manifold: to avoid clashing simultaneous changes (or “edits”), to keep a commented history that allows rollbacks, and to allow merging parallel changes (made simultaneously by different users, or coming from different development branches).

To avoid clashing changes, some systems forbid them, while other systems allow them as different branches and then merge them. The latter approach is more powerful, provided there is a good merging facility. Some kinds of merges may be handled automatically by the revision control system, while others require manual intervention because the automatic system detects clashing changes. To allow automatic merges, merging two well-formed documents must produce a third well-formed document. As many systems cannot merge changes to the same line, the longer the lines are, the likelier a merge conflict is, and with it the need for manual intervention.
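The effect of line length on merging can be shown with a deliberately naive three-way merge in Python. This is only a sketch: the function `merge_lines` is hypothetical, and it assumes that edits change lines in place without inserting or deleting any, which real revision control systems do not assume.

```python
def merge_lines(base, ours, theirs):
    """Naive three-way merge of equally long lists of lines.

    Returns (merged_lines, conflict_line_numbers). A line is in conflict
    when both sides changed it, and changed it differently.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t or t == b:
            merged.append(o)  # both sides agree, or only ours changed
        elif o == b:
            merged.append(t)  # only theirs changed
        else:
            merged.append(b)  # clashing changes: keep base, report the line
            conflicts.append(number)
    return merged, conflicts


# The whole text on one long line: two unrelated edits collide.
long_base = ["The tool reads a file. The tool writes a report."]
long_ours = ["The tool reads a text file. The tool writes a report."]
long_theirs = ["The tool reads a file. The tool writes a short report."]
long_result = merge_lines(long_base, long_ours, long_theirs)

# The same text, one sentence per line: the same edits merge cleanly.
short_base = ["The tool reads a file.", "The tool writes a report."]
short_ours = ["The tool reads a text file.", "The tool writes a report."]
short_theirs = ["The tool reads a file.", "The tool writes a short report."]
short_result = merge_lines(short_base, short_ours, short_theirs)
```

In the first case line 1 is reported as a conflict; in the second case both edits are kept and no conflict is reported.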

In addition, both to analyze the history of a document and to merge two revisions, it is necessary to view two versions of the document side by side. Given that a modern computer screen cannot show much more than 160 characters in a line, if a line contains more than 80 characters, one can hardly see two versions of that line without scrolling the document horizontally. Therefore, short text lines and command-only lines are much easier to handle using a revision control system.
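A check for over-long lines is easy to automate. The following Python sketch uses a hypothetical helper, `long_lines`, with the 80-character limit discussed above as its default; it reports which lines of a text would not fit in a side-by-side view.

```python
def long_lines(text, limit=80):
    """Return (line_number, length) for each line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


sample = "short line\n" + "x" * 95 + "\nanother short line"
report = long_lines(sample)
```

Here `report` lists only the second line, which is 95 characters long.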

Finally, semantic-oriented markup is well known to be better on the Web than formatting-oriented markup, as it may subsequently be customized according to the size and resolution of the output medium (large screen, notebook, netbook, smart phone, paper printer), and according to user preferences.

I looked around for a perfect match to my requirements, and I decided that the best match is the tool named “Pandoc”, which handles several input and output formats, but whose main input format is a variation of the famous Markdown format. The main shortcoming of the Markdown format is that some formatting must be done by embedding HTML tags. That has two disadvantages: HTML tags are not very readable (think of a table), and HTML tags are not suitable for generating non-HTML output. Pandoc adds the necessary commands to remove the need to use HTML in most documents.
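As an illustration of the table case, Pandoc's Markdown accepts pipe tables (its `pipe_tables` extension), in which every row of the table is one line of text. The Python sketch below builds such a table from plain data; the function `pipe_table` is a hypothetical helper written for this example, it is not part of Pandoc, and it does not escape "|" characters inside cells.

```python
def pipe_table(header, rows):
    """Render a header and rows of strings as a Markdown pipe table."""
    def line(cells):
        return "| " + " | ".join(cells) + " |"

    separator = line(["---"] * len(header))
    return "\n".join([line(header), separator] + [line(row) for row in rows])


table = pipe_table(
    ["Kind of format", "Example"],
    [
        ["proprietary binary", "Microsoft Word"],
        ["open text", "DocBook"],
    ],
)
print(table)
```

The resulting text is line-oriented and stays readable in a plain text editor, unlike the equivalent nest of HTML table tags.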


About Carlo Milanesi

I am a software developer in Italy. I have developed financial, engineering and commercial software using many programming languages, mainly C, C++, Visual Basic, Java, and C#. Now I am interested in Rust and TypeScript.