Code and Software
Last updated on 2023-08-31 | Edit this page
Estimated time: 40 minutes
Overview
Questions
- What is research code and software?
- What can you do to make code usable and reusable?
- What are the characteristics of readable code?
Objectives
- Describe what is research software and its purposes
- Decompose a workflow into identifiable components
- Know when to separate a script into several functions
- Write code that is easy to run by others by including all dependencies, requirements, documentation, and examples
What your future self may think…
The twitter thread illustrates a real example of research software problems, from Prof. Neil Ferguson. In that case, the research software written for one purpose was suddenly in high demand, for important public health reasons, over a decade later. And that software was hard for others to use.
Smaller-scale versions of this problem are more common:
- you want to re-run a data analysis that you did six months ago
- a new post-doc starts working on a related project and needs to adapt your analysis to a new dataset
- you publish a paper, and a masters student from the other side of the world emails you to reproduce the results for their project
What is research code and software?
There are many different shapes and sizes of research software:
- Any code that runs in order to process your research data.
- A record of all the steps used to process your data (scripts and workflow such data analysis are software).
- R, Python, MATLAB, unix shell, OpenRefine, ImageJ, etc. are all scriptable. So are Microsoft Excel macros.
- Standalone programs or scripts that do particular research tasks are also research software.
There are extended discussions about research software at the Software Sustainability Institute.
Potential problems writing code
Discussion
What can go wrong with writing research code?
- I don’t remember what this code does
- I don’t remember why I made this choice
- This code doesn’t work any more
- This code doesn’t work on an updated version of my dataset
- This code doesn’t work on a different computing system
- I’m not sure if this calculation is correct
If you or your group are creating ten thousands lines of code for use by hundreds of people you have never met, you are doing software engineering. If you’re writing a few dozen lines now and again, and are probably going to be its only user, you may not be doing engineering, but you can still make things easier on yourself by adopting a few key engineering practices. What’s more, adopting these practices will make it easier for other people and your future self to understand and (re)use your code.
The core realization in these practices is that readable, reusable, and testable are all side effects of writing modular code, i.e., of building programs out of short, single-purpose functions with clearly-defined inputs and outputs [hunt1999]. Much has been written on this topic, and this section focuses on practices that best balance ease of use with benefit for you and collaborators.
Programs themselves are modular, and can be written in scripts that run a clearly-defined set of functions on defined inputs.
Place a brief explanatory comment at the start of every script or program
Short is fine; always include at least one example of how the program is used. Remember, a good example is worth a thousand words. Where possible, the comment should also indicate reasonable values for parameters like in this example.
Synthesize image files for testing circularity estimation algorithm.
Usage: make_images.py -f fuzzing -n flaws -o output -s seed -v -w size
where:
-f fuzzing = fuzzing range of blobs (typically 0.0-0.2)
-n flaws = p(success) for geometric distribution of # flaws/sample (e.g. 0.5-0.8)
-o output = name of output file
-s seed = random number generator seed (large integer)
-v = verbose
-w size = image width/height in pixels (typically 480-800)
-h = show help message
The reader doesn’t need to know what all these words mean for a comment to be useful. This comment tells the reader what words they need to look up: fuzzing, blobs, and so on.
Writing helpful explanatory comments
Multiple Choice
An example function GetData
reads in data files of a
particular type. Which of the following should be included in an
explanatory comment for this function?
- file type - it is good to know what the function is tailored to process
- output type - it is good to know how to integrate a function in a workflow
- data columns or other properties - it is good to know the internal structure of the data
- expected file path / address (for example a specific directory or web address) - it is good to know the origin to evaluate the validity of the data
Decompose programs into functions
A function is a reusable section of software that can be treated as a black box by the rest of the program. This is like the way we combine actions in everyday life. Suppose that it is teatime. You could get a teabag, put the teabag in a mug, boil the kettle, pour the boiling water into the mug, wait 3 minutes for the tea to brew, remove the teabag, and add milk if desired. It is much easier to think of this as a single function, “make a cup of tea”.
Software programming languages also allow you to combine many steps into a single function. The syntax for creating functions depends on programming language, but generally you:
- name the function
- list its input parameters
- describe what information it produces
- write some lines of code that produce the desired output.
Good functions should have only one main task: for example, “make a cup of tea” does not also specify how to make a sandwich. Functions can also be built up from other functions: for example, “boil the kettle” involves checking if there is water in the kettle, filling the kettle if not, and then turning the kettle on. Having one main task means that functions should take no more than five or six input parameters and should not reference outside information. Functions should be no more than one page (about 60 lines) long: you should be able to see the entire function in a standard (~10pt) font on a laptop screen. If your function grows larger than this, it is usually best to break that up into simpler functions.
The key motivation here is to fit the program into the most limited memory of all: ours. Human short-term memory is famously incapable of holding more than about seven items at once [miller1956]. If we are to understand what our software is doing, we must break it into chunks that obey this limit, then create programs by combining these chunks. Putting code into functions also makes it easier to test and troubleshoot when things go wrong.
Pseudocode is a plain language description of code or analysis steps. Writing pseudocode can be useful to think through the logic of your analysis, and how to decompose it into functions.
The “make a cup of tea” example above might look like this:
make_cup_of_tea = function(sugar, milk)
if kettle is not full
fill kettle
boil kettle
put teabag in cup
add water from kettle to cup
wait 2 minutes
if sugar is true
add sugar to cup
if milk is true
add milk to cup
stir with spoon
return cup
Using pseudocode
In this scenario, you’re managing fruit production on a set of islands. You have written a pseudocode function that tells you how to count how much fruit of a particular type is available to harvest on a given island.
count_fruit_on_island = function(fruit type, island)
total fruit = 0
for every tree of fruit type on the island
total fruit = total fruit + number of fruit on tree
end for loop
return total fruit
Write the commands to call this function to count how many coconuts there are on Sam’s island, how many cherries there are on Sam’s island, and how many cherries there are on Charlie’s island.
Write a pseudocode for loop like the one above that uses this function to count all the cherries on every island.
sams coconuts = count_fruit_on_island(coconuts, Sam's island)
sams cherries = count_fruit_on_island(cherries, Sam's island)
charlies cherries = count_fruit_on_island(cherries, Charlie's island)
To count all the cherries on every island:
total cherries = 0
for every island
total cherries = total cherries + count_fruit_on_island(cherries, island)
end for loop
print "There are " + total cherries + " cherries on all the islands"
Be ruthless about eliminating duplication
Write and re-use functions instead of copying and pasting code, and
use data structures like lists instead of creating many closely-related
variables, e.g. create score = (1, 2, 3)
rather than
score1
, score2
, and score3
.
Also look for well-maintained libraries that already do what you’re trying to do. All programming languages have libraries that you can import and use in your code. This is code that people have already written and made available for distribution that have a particular function. For instance, there are libraries for statistics, modeling, mapping and many more. Many languages catalog the libraries in a centralized source, for instance R has CRAN, Python has PyPI, and so on. So always search for well-maintained software libraries that do what you need before writing new code yourself, but test libraries before relying on them.
Give functions and variables meaningful names
Meaningful names for functions and variables document their purpose
and make the program generally easy to read. As a rule of thumb, the
greater the scope of a variable, the more informative its name should
be: while it’s acceptable to call the counter variable in a loop
i
or j
, things that are re-used often, such as
the major data structures in a program should not have
one-letter names.
Name that function
An example function is defined in the format
functionName (variableName)
This function cubes every third
number in a sequence. What are the most meaningful names for
functionName
and variableName
? Choose one from
each of the following sections:
functionName
- processFunction
- computeCubesOfThird
- cubeEveryThirdNumberInASequence
- cubeEachThird
- 3rdCubed
variableName
- arrayOfNumbersToBeCubed
- input
- numericSequence
- S
functionName
- processFunction - incorrect, too vague
- computeCubesOfThird - incorrect, doesn’t imply every third in sequence
- cubeEveryThirdNumberInASequence - incorrect, too long
- cubeEachThird - correct, short and includes information on the data and calculation performed
- 3rdCubed - incorrect, bad practice to put a number at the beginning of a function name (and not allowed by some programming languages)
variableName
- arrayOfNumbersToBeCubed - incorrect, too long
- input - incorrect, too vague
- numericSequence - correct, short and included information about the type of input
- S - incorrect, too vague
Language style guides
Remember to follow each language’s conventions for names, such as
net_charge
for Python and NetCharge
for Java.
These conventions are often described in “style guides”, and can even be
checked automatically.
Callout
Tab Completion
Almost all modern text editors provide tab completion, so that typing the first part of a variable name and then pressing the tab key inserts the completed name of the variable. Employing this means that meaningful longer variable names are no harder to type than terse abbreviations.
Make dependencies and requirements explicit.
This is usually done on a per-project rather than per-program basis,
i.e., by adding a file called something like
requirements.txt
to the root directory of the project, or
by adding a “Getting Started” section to the README
file.
Do not comment and uncomment sections of code to control a program’s behavior
This is error prone and makes it difficult or impossible to automate
analyses. Instead, put if/else statements in the program to control what
it does, and use input arguments on the command line to select
particular behaviour. For example, including the input argument
--option
and corresponding if/else statements to control
running an optional piece of the program. Remember to use descriptive
names for input arguments.
Provide a simple example or test data set
Users (including yourself) can run your program on this set to determine whether it is working and whether it gives a known correct output for a simple known input. Such a test is particularly helpful when supposedly-innocent changes are being made to the program, or when it has to run on several different machines, e.g., the developer’s laptop and the department’s cluster. This type of test is called an integration test.
Code can be managed like data
Your code is like your data and also needs to be managed, backed up, and shared.
Your software is as much a product of your research as your papers, and should be as easy for people to credit. Submit code to a reputable DOI-issuing repository, just as you do with data. DOIs for software are provided by Figshare and Zenodo, for example. Both Figshare and Zenodo integrate directly with GitHub.
Attribution
This episode was adapted from and includes material from Wilson et al. Good Enough Practices for Scientific Computing.
Key Points
- Any code that runs on your research data is research software
- Write your code to be read by other people, including future you
- Decompose your code into modules: scripts and functions, with meaningful names
- Be explicit about requirements and dependencies such as input files, arguments and expected behaviour