Sort records, not numbers

Many texts about sorting show how to sort an array of numbers. Some even evaluate the performance of such algorithms.
But real programmers never need to sort a bare sequence of numbers. In real-world applications it is records that need to be sorted, using a sort key contained in each record. A record may be as small as a pair of bytes, one being the sort key and the other the associated value.

This distinction affects not only the actual algorithm code, but often also its performance. Of course the asymptotic behavior does not change, but the actual running time for a given number of records may depend on the size of each record. Some algorithms make fewer key comparisons but more record swaps, while others make more key comparisons but fewer record swaps. If a record is just the key number, swapping it is quite fast, while if the key is an integer but the record also contains some strings or tens of numbers, the time for the key comparisons is dwarfed by the time for the record moves.
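
To make the distinction concrete, here is a minimal C++ sketch of sorting records rather than bare numbers: the comparison looks only at the key, but the whole record is moved (the “Record” type and its fields are invented for illustration).

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// A record: the sort key plus a payload that makes record moves costly.
struct Record {
    int key;                 // the sort key
    std::string description; // associated data, irrelevant to the ordering
};

int main() {
    std::vector<Record> records = {
        {42, "answer"}, {7, "lucky"}, {13, "unlucky"}
    };

    // Compare by key only; the whole record is moved, not just the key.
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) { return a.key < b.key; });

    for (const Record& r : records)
        std::cout << r.key << ": " << r.description << '\n';
}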


New release of C++ Bolts

This weekend I released the first proper versions of two libraries of the C++ Bolts library set: “countingsort” and “memory_file”.

Before that, I had just uploaded a chaotic mass (or mess?) of code. That is still the case for the other libraries.

So now everyone can download and use these two useful libraries.

Countingsort is a very fast sorting algorithm, though it has limited applicability. You can use it only if the sort key is an integer whose range is not much larger than the number of items to sort. For example, if you need to sort a sequence containing about 400,000 items, you can use the countingsort algorithm if the range of the keys is not larger than, say, one million.
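
To give an idea of the algorithm itself (this is just a generic sketch of countingsort, not the interface of the library): count the occurrences of each key, turn the counts into starting positions, and place every record directly at its final position, in O(n + key range) time.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Illustrative record type; not a type from the "countingsort" library.
struct Item {
    int key;           // assumed to lie in [0, key_range)
    std::string value;
};

// Stable countingsort on the key.
std::vector<Item> counting_sort(const std::vector<Item>& input, int key_range) {
    std::vector<std::size_t> count(key_range + 1, 0);
    for (const Item& item : input) ++count[item.key + 1];        // histogram, shifted by one
    for (int k = 0; k < key_range; ++k) count[k + 1] += count[k]; // prefix sums = start positions
    std::vector<Item> output(input.size());
    for (const Item& item : input) output[count[item.key]++] = item; // scatter, preserving order
    return output;
}

int main() {
    std::vector<Item> items = {{3, "c"}, {1, "a"}, {3, "d"}, {0, "z"}};
    for (const Item& item : counting_sort(items, 4))
        std::cout << item.key << ' ' << item.value << '\n';
}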

The “memory_file” library is a wrapper over the memory-mapped file facilities of modern operating systems, like Microsoft Windows, Linux, and Mac OS X. It is an easy, fast, safe, and flexible way to handle the contents of binary (and also textual) files.
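
To give a rough idea of the underlying OS facility (this is the raw POSIX mmap interface, not the interface of the “memory_file” library, and it is not portable to Windows, which is exactly the kind of difference such a wrapper hides):

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

#include <iostream>

int main() {
    // Map an existing, non-empty file into memory, read-only.
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat info;
    if (fstat(fd, &info) != 0 || info.st_size == 0) { close(fd); return 1; }

    void* address = mmap(nullptr, info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (address == MAP_FAILED) { close(fd); return 1; }

    // The file contents are now accessible as ordinary memory.
    const char* bytes = static_cast<const char*>(address);
    std::cout << "First byte: " << int(bytes[0]) << '\n';

    munmap(address, info.st_size);
    close(fd);
}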


Directory organization for software development projects

Why directories?

Usually the files used for a medium-to-large software development project are not stored in a single directory, but in several subdirectories of a project root directory.

Why are files stored in several directories?

In general, there are four reasons to use directories:

  • To allow storing possibly different files having the same name.
  • To allow handling a group of files with a single command.
  • To allow long pathnames in limited filesystems (like MS-DOS).
  • To avoid slowdowns when many files are stored in a single directory in limited filesystems (like MS-DOS).

The last two reasons no longer apply to modern operating systems.

Development files have distinguishing suffixes (“.c”, “.h”, “.o” or “.obj”); therefore files of different types already have different names and can be handled as a separate group.

Therefore the above reasons become:

  • To allow storing possibly different files of the same type having the same name.
  • To allow handling a group of files of the same type with a single command.

Therefore, let’s ideally start a software development project by putting all the files in a single directory, and then find out which additional directories we need.

Separating source files

First of all, it is useful to split source files from generated files.

This is because we want to list source files, to back them up, and to handle them using a revision control system, while on the other hand we want to clear all generated files using a single command.

A generated file is not a kind of source file, and a source file is not a kind of generated file, and therefore it is not reasonable to put generated files in a subdirectory of the directory containing source files or vice versa.

So we have two parallel directories “source files”, or “src” for short, and “generated files”, or “gen” for short.

To simply run a program, we get into the “gen” directory, forgetting there is a “src” directory, and launch the generated program. Thus that directory should contain everything needed to run the program. Therefore, if a source file is actually needed “as is” to run the program, the build command should simply copy it from the “src” directory to the “gen” directory.

Separating generated configurations

With some development tools, like C and C++ compilers, there are several build options, like optimization flags, run-time checks (asserts), and processor architecture targets, and often a single developer wants to keep generated versions for several option sets. Let’s call a set of build options a “configuration”.
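
As a small illustration of why different option sets produce genuinely different generated files, consider the standard C/C++ assert macro: a debug configuration keeps the check, while a release configuration compiled with -DNDEBUG removes it entirely (a minimal sketch, independent of any particular build system).

#include <cassert>
#include <iostream>

int divide(int numerator, int denominator) {
    // In a debug build this check aborts on a zero denominator;
    // in a release build compiled with -DNDEBUG it disappears entirely.
    assert(denominator != 0);
    return numerator / denominator;
}

int main() {
    std::cout << divide(10, 2) << '\n';
}

Typically the debug configuration also disables optimizations (for example -O0 -g with GCC), while the release configuration enables them (for example -O2), so the two configurations cannot share the same object files.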

Typically, one wants to keep a “debug” configuration and a “release” configuration, but more are common. Therefore there is the need to keep one “generated files” directory for each supported configuration. Each of these directories “is a” generated-files directory, and therefore it is reasonable to put them under the “gen” directory.

Up to now we have the following directory structure:

<project>
|
+-- src
|
+-- gen
    |
    +-- debug
    |
    +-- release

In such a way we can erase with a single command all generated files for all configurations, or only all generated files for a single configuration.

What if we have two or more orthogonal criteria to classify configurations?

For example we may need debug and release versions both for x86-32bit and x86-64bit.

There are three options, shown by the following structures.

<project>
|
+-- src
|
+-- gen
    |
    +-- debug
    |   |
    |   +-- x86-32bit
    |   |
    |   +-- x86-64bit
    |
    +-- release
        |
        +-- x86-32bit
        |
        +-- x86-64bit

Here above, first a distinction between debug and release configurations is made, and then, for each of them, there is a distinction between the processor architectures.

<project>
|
+-- src
|
+-- gen
    |
    +-- x86-32bit
    |   |
    |   +-- debug
    |   |
    |   +-- release
    |
    +-- x86-64bit
        |
        +-- debug
        |
        +-- release

Here above, first a distinction between processor architectures is made, and then, for each of them, there is a distinction between the debug and release configurations.

<project>
|
+-- src
|
+-- gen
    |
    +-- x86-32bit-debug
    |
    +-- x86-32bit-release
    |
    +-- x86-64bit-debug
    |
    +-- x86-64bit-release

Here above, there is a “flat” structure.

I think the best structure is the last one, i.e. the flat structure.

One can still select all debug configurations with the pattern “*-debug”, and all the 64bit configurations with the pattern “*-64bit-*”.

The previous structures, instead, do not allow both selections.

Separating other types of file

And where do we put pictures or other binary source files, and developer documentation?

Binary source files are of course source files; their names cannot clash with source code names because of their filename extensions; they should be put under revision control even if they are binary; and collective commands can isolate them by their filename extension. Therefore they should stay in the “src” directory, together with the source code.

Developer documentation, like requirements or UML diagrams, may have a source version, for optimal editing and revision control, and a processed version, for optimal rendering. The source version, or the only version, should stay in the “src” directory, while any processed version should go in the “gen” tree, provided there will be no need to edit it manually.

This raises the following question.

If there is only one kind of processing for a document, but 8 different configurations for processing source code, should we replicate the processing of the documentation for every configuration?

Well, such a procedure wouldn’t be so awful, as documents are usually easy to process and do not require inordinate amounts of storage space. However, the following scheme may be better.

<project>
|
+-- src
|
+-- gen
    |
    +-- x86-32bit-debug
    |
    +-- x86-32bit-release
    |
    +-- x86-32bit-doc
    |
    +-- x86-64bit-debug
    |
    +-- x86-64bit-release
    |
    +-- x86-64bit-doc
    |
    +-- -doc

The x86-32bit-doc directory contains all the processed documents specific to the x86-32bit processor architecture; the x86-64bit-doc directory contains all the processed documents specific to the x86-64bit processor architecture; and the -doc directory contains all the processed documents independent of the processor architecture.

But there is another, harder, question. What about code generated by design tools like UML tools or GUI designers?

The best code generation tools are two-way tools, a.k.a. round-trip engineering tools: when a change is made to the model in the tool, that change is applied to the generated code without losing any code previously written by the user, provided that the user followed certain conventions; and when the generated code is edited in a text editor following those conventions, that change is automatically parsed and applied back to the model in the tool. For such two-way tools, both the model and the generated code should be considered source code, as both may be explicitly changed by a user.

There is a redundancy in that, but it cannot be easily avoided.

Actually I don’t think that code generating tools are a real advantage for constructing easily maintainable software. They are RAD tools, but I think that Rapid Application Development invariably means Slow Application Maintenance.

One-way tools are even worse for software quality, but at least their generated code may be put into the generated code directory tree, with no redundancy in the “src” directory.

Automated test programs

And where should we put test program sources, test data, and test program executables?

Regarding test program sources, they of course should be considered source code.

The only issue is whether they should be in the same directory as the SUT source code, in a subdirectory, or in a parallel directory.

It is rather typical that a test source file has the same name as a system-under-test source file, and in any case there is the need to handle, with a single command, all test source files or all non-test source files.

Five reasonable schemes (omitting the “gen” directory, and using the word “sut” for “System-Under-Test”, i.e. the main software product) come to my mind:

<project>
|
+-- sut
|
+-- test

Here above, test source code and SUT source code are parallel in the project root. This solution has the disadvantage of preventing a single command from handling all source files.

<project>
|
+-- src
    |
    +-- test

Here above, test source code is in a subdirectory of the directory containing the SUT source code.

<project>
|
+-- src
    |
    +-- file.cpp
    |
    +-- test_file.cpp

Here above, test source code and SUT source code are in the same directory. Test source code files are distinguished by the fact that their names begin with “test_”. This solution has the disadvantage of preventing a single command from handling all SUT (i.e. non-test) files.

<project>
|
+-- src
    |
    +-- sut_file.cpp
    |
    +-- test_file.cpp

Here above, test source code and SUT source code are in the same directory. Test source code files are distinguished by the fact that their names begin with “test_” and SUT source code file names begin with “sut_”. This solution has the disadvantage of forcing the use of prefixes on all file names.

<project>
|
+-- src
    |
    +-- sut
    |
    +-- test

Here above, test source code and SUT source code are parallel in the source code directory. This solution has the disadvantage of using the weird name “sut” for the directory containing non-test files. If possible, it is better to keep application files easily reachable and not in a hard-to-understand directory. Even renaming it to “app” or “main” is not really better.

The only solution for which no disadvantage has been pointed out is the second. Therefore it should be the solution of choice. If needed, several test directories may be created, like “unit_test”, “system_test”, and “stress_test”.

Regarding test data, well, there shouldn’t be much test data, as it would be hard to maintain.

If there is a need for stress tests with a lot of data, an algorithm should be implemented to generate it on the fly, or to generate it (in the “gen” subtree) as part of the build process.

Given that there is not a huge amount of test data stored persistently, it should be considered part of source data, and therefore stored in the “src” subtree.

Regarding test executable programs, they of course should go in the “gen” subtree. If they share the configuration options of the SUT they should be in the same directory as the SUT generated files. Otherwise one or more parallel directories may be created in the “gen” directory.

Libraries

A software product (program or library) may use libraries in several ways.

  • External non-standard libraries, i.e. libraries created and maintained by another organization.
  • In-house libraries, i.e. libraries maintained by the same organization, and used by several projects.
  • Project-specific libraries, i.e. libraries created as part of a single project.

Regarding external libraries, like Boost, it’s better to put them in specific directories outside of the application project directories. They may go where system libraries are installed, or elsewhere. Then give the build system references to the paths of those directories.

Regarding in-house libraries, they should be treated as external libraries, but not installed where system libraries are installed. In addition, though that’s not a directory issue, every major application should be registered as a dependent of such libraries, so that a single command can build a library, run the tests for the library, rebuild all dependent projects, and possibly run all the automated tests for all dependent projects. If all this passes with no errors, it is a guarantee that the change to the library does not break existing code. Of course the library maintainer should keep a recent copy of all dependent projects. In addition, every project should keep a dependency on the library in its build script.

Regarding project-specific libraries, they are covered in the following section.

Many-targets projects

A project may generate several targets. Such a situation is described by Microsoft Visual Studio as a “solution” referencing several “projects”, one per target.

There are many reasons for generating several targets (programs or libraries) from a single source code base. Many are wrong, but some are right.

Whatever the reason, there is the need to organize such a project. I think it is better to put all the source files in the same directory, and possibly all the target files in the same directory too. Of course every file must have a distinct name, but that’s not so bad.

Multi-branch projects

Sometimes there is the need to recreate the development environment of a previous revision of the software, for various reasons. And sometimes several different revisions of the same software must coexist on the same PC for years. How should they be organized?

Of course several branches share neither source code nor generated code, but they give the same names to different files; therefore they should reside in different subtrees, while sharing a revision control repository. It is reasonable to put them in parallel subtrees.

Conclusion

As a conclusion, here is a typical directory organization for a project.

<project name>
|
+-- trunk
|   |
|   +-- src
|   |   |
|   |   +-- u_test
|   |   |
|   |   +-- sys_test
|   |
|   +-- gen
|       |
|       +-- x86-32bit-debug
|       |
|       +-- x86-32bit-release
|       |
|       +-- x86-64bit-debug
|       |
|       +-- x86-64bit-release
|       |
|       +-- doc
+-- rev238
    ...
<in-house library name>
...

No object-oriented analysis

In the heyday of object orientation, when software gurus (stupidly) said that object orientation was good because the world is so full of objects, or when the operating system OS/400 was (stupidly) said to be object-oriented, many software engineers spoke about OOP (object-oriented programming), about OOD (object-oriented design), and even about OOA (object-oriented analysis). I read a couple of books about object-oriented analysis and design, and I could not find a clear difference between the two. In fact, object-oriented analysis appeared to be just a high-level kind of object-oriented design.

Another issue is that the word analysis by itself has zillions of technical meanings in different fields of knowledge, and therefore it is often quite ambiguous to talk about “analysis” tout court. Here we are talking about “requirement analysis”. Some people completely confused analysis and design, saying “requirement capture” for what I here call “software requirement analysis”, and “analysis” for what I here call “software design”. I even heard the word “microanalysis” used to mean low-level design. A software developer was classified as an “analyst”, a “programmer”, or a “programmer analyst” (a.k.a. “analyst programmer”). No mention of software designers. Then came the Web, and “Web designer” became the phrase referring to the professional expert in the artistic composition of Web pages and sites.

The distinction between analysis and design was so blurred that the phrase “analysis and design” became a buzzword, and the word “modeling” was borrowed from the fashion world to mean just “analysis or design”. For example, it was used in the acronym UML (Unified Modeling Language).

Now, at last, a more reasonable language usage has prevailed: for most people “software analysis” means “software requirement capture” and “software design” means “software architecture specification”, or “language-independent algorithm writing”.

Also the meaning of the phrase “object-oriented” has shrunk to the original one of “class-oriented”, which, in my opinion, is what it should have always been called.

Given these new/old meanings of “analysis” and of “object-oriented”, it is quite hard to make sense of the concept of “object-oriented analysis”. Software analysis is the work of talking with the domain experts (or marketing experts) to get their wishes about what the new software should do. They are not experts in software development technology and concepts, like objects and classes, and they shouldn’t have to become such. The purpose of such work is to write a document specifying the requirements of the software. This document must be fully understood by all the stakeholders. How is it thinkable to write such a document in an object-oriented fashion?

The conclusion is: there is object-oriented programming, that is, computer programming done using an object-oriented programming language; there is also object-oriented software design, that is, software architectural planning or algorithm invention done in an object-oriented way, possibly implemented in a non-object-oriented programming language; but there is no such thing as object-oriented requirements analysis.


Best documentation format

Software has and should have many kinds of documentation:

  • Source code itself should be as self-documenting as possible.
  • Comments embedded in source code files should explain what is not obvious from the code itself.
  • Formalized comments embedded in source code files should declare those parts of the semantics of a module’s interface that are not implicit in the interface itself. Such formalized comments should be extracted by a tool to build the API specification of that module (a sketch follows this list).
  • Requirements of the software should be written in plain language, possibly with examples (a prototype, some sketches or screen-shots, some formulas, and, in case the required software is a software library, some code snippets that show how to use the library).
  • A user manual should explain how to use the software product. In case the software product is a software library, the user manual is a tutorial on how to use the library, plus the API reference specification. This manual may be distributed in several formats:
      • as a printed book (or booklet, or leaflet);
      • as a file formatted as a printed book (typically in PDF format);
      • as a file to be read from the Web (typically in HTML format);
      • as a context-sensitive help window popped up by a command in the user interface of the program (several formats, such as HTML or CHM).
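
As an example of the third kind, formalized comments in the Doxygen style (one widespread convention; nothing here prescribes a specific tool) can be extracted to build an API reference. The function is invented for illustration.

/// Computes the greatest common divisor of two non-negative integers.
///
/// \param a  The first operand; must be non-negative.
/// \param b  The second operand; must be non-negative.
/// \return   The greatest common divisor of a and b; gcd(0, 0) is 0.
int greatest_common_divisor(int a, int b) {
    while (b != 0) {
        int remainder = a % b;
        a = b;
        b = remainder;
    }
    return a;
}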

Of course, the first three kinds of documentation are edited using the source code editor itself. But there are several practices for building the other kinds of documentation.

Some of them require that the source be in a proprietary binary format, like Microsoft Word’s or FrameMaker’s; others use an open binary format, like OpenDocument Text, used by OpenOffice; others a proprietary text format, like Rich Text Format; and others an open text format, like DocBook.

Text formats may be classified according to several criteria: bracket-oriented or command-oriented; free-format or line-oriented; formatting-oriented or semantic-oriented.

In bracket-oriented formats, like RTF and DocBook and, in part, HTML, many elements (actually, the non-empty elements) have tags that open sections and corresponding tags that close them. Instead, in command-oriented formats, most elements have only a command to open them and none to close them; a section is closed when another section is opened or when the document ends.

In free-format formats, a command may begin or end at any point of a line; therefore a single line may contain the whole book. Instead, line-oriented formats, like the C preprocessor, require that, for many command types, every command be isolated on its own line; therefore a given document source has a minimum number of lines.

Formatting-oriented formats specify the actual presentation of the document: the typeface, color and size of characters, the indentation and line spacing between and inside paragraphs, the size of the page. Instead, semantic-oriented formats specify only which text portions are section titles (and of which level), which are plain text, and which are source code snippets, leaving the formatting decisions to a rendering or transforming program, possibly parameterized by a style sheet.

Here I argue that there are several disadvantages with proprietary, binary, bracket-oriented, free-format, formatting-oriented formats, and therefore it is better to choose a format that is open, textual, command-oriented, line-oriented, and semantic-oriented.

Here are the rationales.

First of all, proprietary formats create a dependency on the strategies of the copyright holder. If the holder decides not to support the format any more, whoever prefers to stick with it is left out in the cold. In addition, as the specifications of such formats are often unpublished, or the documents using them are accessible only with costly proprietary tools, or both, one has to pay to access those formats and cannot freely share such documents without forcing other people to buy the tool; or, if the format has been reverse-engineered by an open source tool, one is never sure that the document is rendered exactly as the original.

Of course, binary formats, even the open ones, require a specific program to edit them. While some users may like using that program, others may prefer their favorite text editor. Therefore a format editable as raw text is preferable.

But even some textual formats virtually require a specific program to edit them. Actually, free-format text documents, like XML documents, if not well indented, are hard to read and write using a simple text editor. Well, syntax highlighting and bracket-matching features may help, but line-oriented formats are much easier to read and write than free-format documents.

In addition, it is always a good idea to keep every important document under a revision control system. The purpose of a revision control system is manifold: to avoid clashing simultaneous changes (or “edits”), to keep a commented history that allows rollbacks, and to allow merging parallel changes (made simultaneously by different users, or coming from different development branches).

To avoid clashing changes, some systems forbid them, while other systems allow them as different branches and then merge them. The latter approach is more powerful, provided there is a good merging facility. Some kinds of merges may be handled automatically by the revision control system, while others require manual intervention when the automatic system detects clashing changes. To allow automatic merges, it is necessary that merging two well-formed documents yields a third well-formed document. As many systems cannot merge changes to the same line, the longer the lines, the likelier a merge conflict and the need for manual intervention.

In addition, both to analyze the history of a document and to merge two revisions, it is necessary to view two versions of the document side by side. Given that a modern computer screen cannot show much more than 160 characters per line, if a line contains more than 80 characters one can hardly see two versions of that line without scrolling the document horizontally. Therefore short text lines and command-only lines are much easier to handle with a revision control system.

Finally, semantic-oriented markup is well known to be better on the Web than formatting-oriented markup, as it may be subsequently customized according to the output medium’s size and resolution (large screen, notebook, netbook, smart phone, paper printer), and according to user preferences.

I looked around for a perfect match to my requirements, and I have just decided that the best match is the tool named “Pandoc”, which handles several input and output formats, but whose main input format is a variation of the famous Markdown format. The main shortcoming of the Markdown format is that some formatting must be done by embedding HTML tags. That has two disadvantages: HTML tags are not very readable (think of a table), and HTML tags are not good for generating non-HTML output. Pandoc adds the necessary commands to remove the need for HTML in most documents.


Best C++ testing framework

For several years I’ve been searching for the best testing framework for C++ software, and now I have reached a decision.

Actually, there are (at least) three kinds of automatic testing for C++ software:

  • Static (unit) testing: stating which statements are legal and which are not, and letting the compiler check that they actually are as stated.
  • Dynamic unit testing: stating the expected results of some operations, compiling and running the program, and letting the testing framework check that the operations return the expected values.
  • (Dynamic) system testing: running the actual hardware/software system, or an emulation of it, running a testing program that submits to the system under test the input that would be provided by the user, by another application, or by hardware devices, and checking that the system under test behaves as expected. Sometimes it requires that the system under test have a backdoor to receive test input, and, as the check of the system’s behavior is manual, it is only a semi-automatic testing technique.

System testing is not C++-specific, and therefore I think there shouldn’t be C++-only system testing tools. In addition, it is rather application-kind-specific, meaning that system testing of a microcontroller system is quite different from system testing of a videogame, and both are quite different from system testing of an e-commerce website; therefore I think there shouldn’t be completely generic system testing tools.

On the contrary, static testing makes sense almost only for metaprogramming, i.e. generic source code that, according to directives from other, application-oriented code, generates more specific source code, which may then be compiled in an optimized way into machine code.
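
As a minimal sketch of what such static testing can look like, using only standard C++ (this is generic code for illustration, not the tool mentioned below), one can assert at compile time which claims about a piece of metaprogramming hold; the compiler itself performs the checks, and nothing needs to run.

#include <type_traits>

// A tiny piece of metaprogramming under test: pick the larger of two types.
template <typename A, typename B>
using larger_of = std::conditional_t<(sizeof(A) >= sizeof(B)), A, B>;

// Static tests: compiling this file successfully is the test.
static_assert(std::is_same_v<larger_of<double, char>, double>,
              "double is at least as large as char");
static_assert(std::is_same_v<larger_of<char, long long>, long long>,
              "long long is larger than char");
static_assert(!std::is_same_v<larger_of<char, int>, char>,
              "the smaller type must not be chosen");

int main() {} // nothing to do at run time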

I still haven’t found a static testing tool superior to my own tool published on Dr. Dobb’s Journal.

When someone speaks about a “unit testing framework or tool”, they invariably mean a dynamic unit testing framework or tool. There are a great many of them, for many programming languages. The most famous ones are those of the xUnit family, derived from the SUnit framework designed and implemented for the Smalltalk language.

For C++ alone, there are several unit testing frameworks that claim to belong to the xUnit family.

After a feature comparison, and after having read some comments and opinions around the Web, I decided that my framework of choice is the Google C++ Testing Framework.
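
Here is a minimal sketch of a test written with it; the macros and the initialization call are googletest’s own, while the “add” function is invented for illustration.

#include <gtest/gtest.h>

// Hypothetical function under test.
int add(int a, int b) { return a + b; }

// Each TEST is discovered and registered automatically; no manual enumeration is needed.
TEST(AddTest, HandlesPositiveNumbers) {
    EXPECT_EQ(add(2, 3), 5);
}

TEST(AddTest, HandlesNegativeNumbers) {
    EXPECT_EQ(add(-2, -3), -5);
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv); // the initialization function mentioned below
    return RUN_ALL_TESTS();
}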

Its main defect is that it has a company name in its own name and in the name of the initialization function. But, as it is open source, those names can be removed.

Its main advantages are the following:

  • It is open source.
  • It is multiplatform.
  • It is also available for Microsoft tools and operating systems.
  • It doesn’t require enumerating the tests.
  • It is quite fast to start.
  • Its macros allow really terse, yet self-documenting, code.
  • There is no need to duplicate information.

I just created “C++ Bolts” on SourceForge

Over the years I have written some C++ code that I now want to share with the world. Some of it has already been published on Wikibooks.

Therefore I have just created a new project named “C++ Bolts” on the SourceForge code sharing site ( https://sourceforge.net/projects/cppbolts/ ). For now it doesn’t contain much, but stay tuned.
