Tips for creating and maintaining open source software for science
7 rules of thumb for your open science project
The following rules of thumb arose during the development of the Empirical Gramian Framework (emgr), a young open source software project in the Workgroup for Numerical Analysis & Scientific Computing at the University of Münster, which targets algorithmic model order reduction for control systems. emgr is written in the often, and wrongfully, belittled MATLAB programming language, whose almost pseudocode-like syntax makes it easy to understand while still performing very well. The guidelines below, many of which are related to the Science Code Manifesto, are given from a MATLAB perspective, but they apply to other programming languages and environments as well.
1. Be compatible
emgr is compatible with the two major interpreters of the MATLAB programming language, MathWorks MATLAB and GNU Octave, and, with an additional utility script, also with FreeMat. While MATLAB performs better in many scenarios, Octave provides an open source alternative, so emgr can run on a fully open source stack such as Octave on Linux. Personally, I feel the MATLAB programming language benefits from this proprietary versus open source rivalry. Octave keeps MathWorks on its toes.
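Supporting both interpreters from one code base sometimes requires knowing which one is running. The following is a minimal sketch of a commonly used check, not taken from emgr itself: it relies on OCTAVE_VERSION being a built-in function in GNU Octave that does not exist in MATLAB. The variable names isOctave and interpreterName are made up for this illustration, and FreeMat is not covered.

```matlab
% Sketch: detect the interpreter so that one code base can serve both.
% OCTAVE_VERSION is a built-in function in GNU Octave only, so exist()
% returns a nonzero value there and 0 in MATLAB.
isOctave = exist('OCTAVE_VERSION', 'builtin') ~= 0;

if isOctave
    interpreterName = 'GNU Octave';
else
    interpreterName = 'MATLAB';
end

% version() returns the interpreter's version string in both environments.
fprintf('Running on %s %s\n', interpreterName, version());
```

A flag like this is best confined to the few places where the interpreters really differ, so that the bulk of the code stays identical for both.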
2. Be available
Of course, the emgr code is available from the project's website, but also from the GitHub repository, Zenodo, and the MATLAB Central File Exchange. This used to create some work for each release, but with the GitHub integration for File Exchange and Zenodo, it's becoming easier. Apart from such convenience features, all code that is supposed to be maintainable should be under some kind of version control, such as Git.
3. Be reproducible
Providing the source code together with the results makes it easier to showcase those results and lets others check them. A platform that promotes reproducibility is RunMyCode, on which code accompanying a publication can be deposited. Depositing the code there also relieves the author of having to keep that specific version available.
4. Be compact
Overall, the emgr source code is about 400 lines long and shouldn't grow much. Designed like an app, emgr focuses on one particular task. Thanks to its lean code base, the whole program and all its features can be explained to a reasonably experienced programmer within two hours.
5. Be fast
Blindly "tweaking" the code rarely speeds up the computation. For emgr, no performance optimization is done without first consulting either a statistical or an instrumenting profiler; both MATLAB and Octave provide such a tool. Furthermore, many performance problems are already exposed by a static code analyzer like mlint.
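As an illustration of measuring before optimizing, here is a minimal profiling sketch that should run in both MATLAB and Octave. The workload (repeated matrix exponentials of a random matrix) is a made-up stand-in, not emgr code, and the variable names are hypothetical. The sketch assumes that the structure returned by profile('info') has a FunctionTable field with FunctionName, TotalTime, and NumCalls entries, which is the case in both interpreters to my knowledge. Which functions appear in the table depends on the interpreter.

```matlab
% Sketch: profile a stand-in workload and list the most expensive functions.
profile on;                      % start collecting timing data

A = rand(100) / 100;             % hypothetical system matrix
for k = 1:20
    E = expm(k * A);             % stand-in for the real computation
end

profile off;                     % stop collecting timing data
info = profile('info');          % struct holding the collected data

% Print up to five entries, sorted by total time spent.
entries = info.FunctionTable;
[~, order] = sort([entries.TotalTime], 'descend');
for k = order(1:min(5, numel(order)))
    fprintf('%-30s %8.4f s %6d calls\n', entries(k).FunctionName, ...
            entries(k).TotalTime, entries(k).NumCalls);
end
```

Only the entries at the top of such a list are worth optimizing; MATLAB users can alternatively open the graphical report with profile viewer, and Octave users can call profshow(info).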
6. Be informative
The file or function header can be densely packed with documentation and meta information. Apart from the obligatory author, license, version, and link to the project website, the header can hold information about argument and return types, additional options, related documentation or functions, a citation hint, and keywords.
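A header along these lines might look as follows. This is a hypothetical template rather than the actual emgr header: the function scale_state, its arguments, and the field labels are invented for illustration, and the entries in angle brackets are placeholders to be filled in.

```matlab
function y = scale_state(x, a)
% SCALE_STATE - scale a state vector (hypothetical example function)
%
% VERSION:  0.1 (placeholder)
% AUTHOR:   <author name>
% LICENSE:  <license identifier>
% WEBSITE:  <project URL>
%
% SYNTAX:
%   y = scale_state(x)
%   y = scale_state(x, a)
%
% ARGUMENTS:
%   x  (numeric vector)  state to be scaled
%   a  (numeric scalar)  optional scaling factor, default 2
%
% RETURNS:
%   y  (numeric vector)  scaled state, same size as x
%
% SEE ALSO: <related functions or documentation>
% CITE AS:  <citation hint, for example a DOI>
% KEYWORDS: <comma-separated keywords>

    if nargin < 2 || isempty(a)
        a = 2;                   % default scaling factor
    end
    y = a * x;                   % scale every component of x
end
```

Because the comment block directly follows the function line, typing help scale_state prints it in both MATLAB and Octave, so the documentation travels with the code.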
7. Be citable
Finally, an important factor in science is being cited. Together with GitHub, Figshare and Zenodo recently started providing DOIs for academic software projects, thus making software citable. Also, to cite the project website adequately, one should refer to a dated snapshot made, for example, with an archiving service like archive.today.