Project Organization
Last updated on 2023-09-23 | Edit this page
Estimated time: 25 minutes
Overview
Questions
- How should I name my files?
- How does folder organization help me
Objectives
- Understand elements of good naming strategy
- Evaluate pros and cons of different project organizations
Organizing the files that make up a project in a modular, logical, and consistent directory structure will help you and others keep track of them.
Project organisation problems
Discuss what can go wrong with project organisation:
- Struggling to find the code that creates a particular figure
- Hating to look at or even think about your project because of how badly organised it is
README files are magic
You look at a directory (or project), you read it, and it tells you what you need to know
… as long as you keep it updated!
Put each project in its own directory, which is named after the project
Similar to deciding when a chunk of code should be made a function, the ultimate goal of dividing research into distinct projects is to help you and others best understand your work. Some researchers create a separate project for each manuscript they are working on, while others group all research on a common theme, data set, or algorithm into a single project.
As a rule of thumb, divide work into projects or modules based on the overlap in data and code files. If two research efforts share no data or code, they will probably be easiest to manage independently. If they share more than half of their data and code, they are probably best managed together, while if you are building tools that are used in several projects, the common code should probably be in a project of its own.
Projects do often require their own organizational model. The below recommendations on how you can structure data, code, analysis outputs and other files, are drawn primarily from [noble2009, gentzkow2014]. Other structures to consider are:
The important concepts are that is useful to organize the project in modules by the types of files and that consistent planning and good names help you effectively find and use things later. Your lab or organization may have a template to use.
Put text documents associated with the project in the
doc
directory.
This includes files for manuscripts, documentation for source code, and/or an electronic lab notebook recording your experiments. Subdirectories may be created for these different classes of files in large projects.
Put raw data and metadata in a data
directory, and
files generated during cleanup and analysis in a results
directory
When we refer to “generated files”, this includes intermediate results, such as cleaned data sets or simulated data, as well as final results such as figures and tables.
The results
directory will usually require
additional subdirectories for all but the simplest projects.
Intermediate files such as cleaned data, statistical tables, and final
publication-ready figures or tables should be separated clearly by file
naming conventions or placed into different subdirectories; those
belonging to different papers or other publications should be grouped
together. Similarly, the data
directory might require
subdirectories to organize raw data based on time, method of collection,
or other metadata most relevant to your analysis.
Put project source code in the src
directory
src
contains all of the code written for the project.
This includes programs written in interpreted languages such as R or
Python; those written compiled languages like Fortran, C++, or Java; as
well as shell scripts, snippets of SQL used to pull information from
databases; and other code needed to regenerate the results.
This directory may contain two conceptually distinct types of files that should be distinguished either by clear file names or by additional subdirectories. The first type are files or groups of files that perform the core analysis of the research, such as data cleaning or statistical analyses. These files can be thought of as the “scientific guts” of the project.
The second type of file in src
is controller or driver
scripts that contains all the analysis steps for the entire project from
start to finish, with particular parameters and data input/output
commands. A controller script for a simple project, for example, may
read a raw data table, import and apply several cleanup and analysis
functions from the other files in this directory, and create and save a
numeric result. For a small project with one main output, a single
controller script should be placed in the main src
directory and distinguished clearly by a name such as “runall”. The
short example below is typical of scripts of this kind; note how it uses
one variable, TEMP_DIR
, to avoid repeating the name of a
particular directory four times.
TEMP_DIR = ./temp_zip_files
echo "Packaging zip files required by analysis tool..."
mkdir $(TEMP_DIR)
./src/make-zip-files.py $(TEMP_DIR) *.dat
echo "Analyzing..."
./bin/sqr_mean_analyze -i $(TEMP_DIR) -b "temp"
echo "Cleaning up..."
rm -rf $(TEMP_DIR)
Put compiled programs in the bin
directory
bin
contains executable programs compiled from code in
the src
directory. Projects that do not have any will not
require bin
.
Scripts vs. Programs
We use the term “script” to mean “something that is executed directly as-is”, and “program” to mean “something that is explicitly compiled before being used”. The distinction is more one of degree than kind—libraries written in Python are actually compiled to bytecode as they are loaded, for example—so one other way to think of it is “things that are edited directly” and “things that are not”.
External Scripts
If
src
is for human-readable source code, andbin
is for compiled binaries, where should projects put scripts that are executed directly—particularly ones that are brought in from outside the project? On the one hand, these are written in the same languages as the project-specific scripts insrc
; on the other, they are executable, like the programs inbin
. The answer is that it doesn’t matter, as long as each team’s projects follow the same rule. As with many of our other recommendations, consistency and predictability are more important than hair-splitting.
Name all files to reflect their content or function.
For example, use names such as bird_count_table.csv
,
manuscript.md
, or sightings_analysis.py
. Do
not use sequential numbers (e.g., result1.csv
,
result2.csv
) or a location in a final manuscript (e.g.,
fig_3_a.png
), since those numbers will almost certainly
change as the project evolves.
File names should be:
- Machine readable
- Human readable
- Descriptive of their contents
- Optional: Consistent
- Optional: Play well with default ordering
Example
The diagram below provides a concrete example of how a simple project might be organized following these recommendations:
.
|-- CITATION
|-- README
|-- LICENSE
|-- requirements.txt
|-- data
| -- birds_count_table.csv
|-- doc
| -- notebook.md
| -- manuscript.md
| -- changelog.txt
|-- results
| -- summarized_results.csv
|-- src
| -- sightings_analysis.py
| -- runall.py
The root directory contains a README
file that provides
an overview of the project as a whole, a CITATION
file that
explains how to reference it, and a LICENSE
file that
states the licensing. The requirements.txt
file lists the
software that is required to run the data analysis. The
data
directory contains a single CSV file with tabular data
on bird counts (machine-readable metadata could also be included here).
The src
directory contains
sightings_analysis.py
, a Python file containing functions
to summarize the tabular data, and a controller script
runall.py
that loads the data table, applies functions
imported from sightings_analysis.py
, and saves a table of
summarized results in the results
directory.
This project doesn’t have a bin
directory, since it does
not rely on any compiled software. The doc
directory
contains two text files written in Markdown, one containing a running
lab notebook describing various ideas for the project and how these were
implemented and the other containing a running draft of a manuscript
describing the project findings.
Naming and sorting (5 minutes)
Have a look at the example files from a project, similar to the one from the previous metadata episode.
All the files have been sorted by name and demonstrate consequences of different naming strategies.
For your information, to encode experimental details the following conventions were taken:
- phyB/phyA are sample genotypes (that is, which gene is mutated)
- sXX is the sample number
- LD/SD are different light conditions (long or short day)
- on/off are different media (on sucrose, off sucrose)
- measurement date
- other details are timepoint and raw or normalized data
2020-07-14_s12_phyB_on_SD_t04.raw.csv
2020-07-14_s1_phyA_on_LD_t05.raw.csv
2020-07-14_s2_phyB_on_SD_t11.raw.csv
2020-08-12_s03_phyA_on_LD_t03.raw.csv
2020-08-12_s12_phyB_on_LD_t01.raw.csv
2020-08-13_s01_phyB_on_SD_t02.raw.csv
2020-7-12_s2_phyB_on_SD_t01.raw.csv
AUG-13_phyB_on_LD_s1_t11.raw.csv
JUL-31_phyB_on_LD_s1_t03.raw.csv
LD_phyA_off_t04_2020-08-12.norm.csv
LD_phyA_on_t04_2020-07-14.norm.csv
LD_phyB_off_t04_2020-08-12.norm.csv
LD_phyB_on_t04_2020-07-14.norm.csv
SD_phyB_off_t04_2020-08-13.norm.csv
SD_phyB_on_t04_2020-07-12.norm.csv
SD_phya_off_t04_2020-08-13.norm.csv
SD_phya_ons_t04_2020-07-12.norm.csv
ld_phyA_ons_t04_2020-08-12.norm.csv
- What are the problems with having the date first?
- How do different date formats behave once sorted?
- Can you tell the importance of a leading 0 (zeros)?
- Is it equally easy to find all data from LD conditions as ON media?
- Can you spot the problem when using different cases (upper/lower)?
- Do you see benefits of keeping consistent lengths of the naming conventions?
- Do you see what happens when you mix conventions?
- Using dates up front makes it difficult to quickly find data for particular conditions or genotypes. It also masks the “logical” order of samples or timepoints.
- Named months break the “expected” sorting, same as dates without leading 0
- Without leading zeros, ‘s12’ appear before s1 and s2
- the first (and second) part of the name are easiest to spot
- the last file is also from LD conditions, but appears after SD, same with ‘phya’ genotypes
- the last 3 file names are easiest to read as all parts appear on top of each other due to the same 3 letter-length codes ons and off
- The lack of consistency makes it very difficult to get data from related samples/conditions.
Some helpful organisation tools
- Integrated Development Environments (IDEs) combine many features to
write code and organise projects:
- PyCharm for python
- RStudio for R
- VSCode for python and other languages, etc.
- Notebooks collect data, code, results, thinking in single documents:
- jupyter
- R markdown
- quarto, etc.
- Cookie Cutter project templates for reproducible science
Attribution
This episode was adapted from and includes material from Wilson et al. Good Enough Practices for Scientific Computing.
Some content was adapted from FAIR in Biological Practice episode on files and organisation. That material gives a slightly different and also useful perspective.
Key Points
- A good file name suggests the file content
- Good project organization saves you time