Proposal: Capacity building for sustainable scientific programming

Summary

Many workgroups complain about the lack of a culture of code sharing and code reuse, which leads to inefficiencies in the workflow of scientific production. We propose an initiative with two parts: a) immediately useful pieces of code, of a quality suited for sharing, and b) associated teaching activities that use this code to develop scientific programming skills and know-how. The aim is to foster a culture of scientific programming that produces high-quality code, embraces code reuse, and is aware of state-of-the-art development tools and practices.

Problem statement

Many groups observe that PhD students and postdocs are constantly “re-inventing the wheel”. In any group there are standard tasks that should be solved in a sufficiently generic way, implemented cleanly, and documented thoroughly enough that dealing with the same task in the future does not involve writing a new program. In general this does not happen, which reduces the efficiency of producing research output.
Standardized plots and diagnostics can help to interpret results across different model runs, researchers, and workgroups. Often the choice of the color scale alone, not to mention possible transformations of values or axes (projections), makes the comparison of different plots and their interpretation unnecessarily difficult.
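One way to picture a shared plotting standard is a small module that defines the color scale once, so that every script in a group bins values identically. The following is a minimal sketch under stated assumptions, not an existing C2SM tool: the class `PlotConvention`, the constant `T2M_ANOMALY`, and the level values are hypothetical, and `RdBu_r` is simply the name of a diverging colormap as found in common plotting libraries.

```python
"""Shared plotting conventions for a workgroup (illustrative sketch)."""
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PlotConvention:
    """Color scale settings agreed on once and reused by every script."""

    variable: str
    units: str
    colormap: str   # name of a colormap in the plotting library of choice
    levels: tuple   # fixed, ascending boundaries between color bins

    def level_index(self, value):
        """Return the index of the color bin that `value` falls into.

        Values below the first boundary map to 0; values at or above the
        last boundary map to len(levels).
        """
        return bisect_right(self.levels, value)


# Hypothetical convention for a temperature anomaly; the numbers are made up.
T2M_ANOMALY = PlotConvention(
    variable="t2m_anomaly",
    units="K",
    colormap="RdBu_r",
    levels=(-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0),
)

if __name__ == "__main__":
    for value in (-5.0, -0.5, 0.0, 3.2):
        print(value, "->", T2M_ANOMALY.level_index(value))
```

Because the convention is frozen and imported rather than retyped, two plots of the same variable made by different researchers would share the same bins and colormap by construction.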

What does not work

The development of a software package that solves all problems once and for all is tempting, frequently attempted, and always fails. In fact, a great many software packages for the tasks mentioned above already exist. Their application however remains fragmented, sometimes because researchers are not aware of the existence of the particular package they need, but frequently because the “Not-Invented-Here” Syndrome is (for good reasons) a strong force in research environments.

One reason the monolithic approach does not work is that researchers need to understand exactly what they are doing, and often want to control a software’s operation down to the smallest detail.

Therefore any pre-canned “black-box”¹ approach reaches its limit quickly, and researchers resort to implementing the algorithm themselves from scratch. The result is an ad-hoc piece of code, undocumented, opaque to everybody else, non-reusable, written in whatever language the researcher is most comfortable with, and frequently containing bugs that cost time later in the project.

Our approach

The hypothesis of this proposal is that the solution to the problem does not lie in writing just the right software, but in influencing workgroup research culture and practices and in promoting the development of relevant, state-of-the-art individual programming skills.

There are workgroups that have such a culture. Anecdotal evidence suggests that this always involves a person whose dedicated job is to maintain an archive or software repository and to curate the group’s software output (e.g. dyntools in the Atmospheric Dynamics group). At a recent meeting there was a strong consensus across many C2SM workgroups that structural changes in hiring policy are required for sustainable improvements of the scientific code and data situation. Until then, however, C2SM could play a similar (yet clearly limited) role for groups that do not have the necessary resources.

The focus of the proposed project lies in anchoring a culture of software reuse in the respective workgroups and in enabling the development of relevant programming skills through a tutorial, workshops, and relevant example code. The “example code” plays a key role: besides being a teaching device, it should serve as an immediately applicable program that addresses specific recurring tasks in the respective workgroup. It differs from “pre-canned” solutions in that it is very well written, extremely well documented, and interlinked with other programming resources.
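To make the idea of such example code concrete, here is a minimal sketch of what one small, documented building block might look like. It is an assumption for illustration only: the task (an area-weighted mean over latitude) and the function name `area_weighted_mean` are hypothetical and not taken from any existing workgroup code.

```python
"""Area-weighted mean over latitude (illustrative example script)."""
import math


def area_weighted_mean(values, latitudes_deg):
    """Average `values` with weights proportional to cos(latitude).

    Parameters
    ----------
    values : sequence of float
        One value per latitude, e.g. a zonal-mean temperature.
    latitudes_deg : sequence of float
        Latitude of each value in degrees north, between -90 and 90.

    Returns
    -------
    float
        The weighted mean, in the same units as `values`.

    Raises
    ------
    ValueError
        If the inputs are empty or differ in length.
    """
    if len(values) == 0:
        raise ValueError("at least one value is required")
    if len(values) != len(latitudes_deg):
        raise ValueError("values and latitudes_deg must have the same length")

    # Grid cells shrink towards the poles, so each latitude is weighted
    # by the cosine of its latitude.
    weights = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # A constant field must average to the same constant.
    print(area_weighted_mean([1.0, 1.0, 1.0], [-60.0, 0.0, 60.0]))
```

A tutorial exercise could then ask participants to make one small change, for example accepting latitudes in radians, and to commit that change to the shared repository.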

The tutorial will, for example, cover a simple modification to the example script, including its submission to a version control system.

The overall aim is to supply a useful piece of software to lure (mostly beginning) researchers into building their individual code on tried and tested foundations, instead of re-inventing the wheel, and ideally to do so in a way that lets others profit from their work.

The project would use Python and R as its main languages, which does not prevent the inclusion of CDO, NCO, bash, or other external tools if necessary. In particular, exploring toolchain options, ranging from the NetCDF-focused command-line tools of the climate modelling community to GIS tools such as QGIS, used by many impact modelling groups, could boost the technical know-how available.

Requirement

A sufficiently large number of C2SM groups (min. 4?) that actively collaborate, i.e. contribute (wo)man-power towards the specification of the example code, coding, testing, documenting, …

Deliverables

Code that solves specific problems (non-exhaustive!) for each participating workgroup “out-of-the-box”, yet is well written and documented in a way that makes it an attractive starting point for researchers’ own developments.
A tutorial that uses these examples to teach basic techniques and enable and motivate researchers to expand the stock of useful example-scripts.
A software repository that holds code and documentation, can be read and written by C2SM-affiliated researchers, and is maintained by C2SM.
A C2SM-run course/workshop based on the tutorial.

¹ “Black-box” here also refers to code that is available, e.g. cosmolib, but is too costly to understand well enough to have confidence that it is doing the right thing and to be able to modify it. Cosmolib’s main script consists of about 5000 lines of NCL and is incompletely documented. But it makes extremely nice figures!

Climate Data Fragments