In a Data Carpentry Workshop
This workshop is designed to run on pre-imaged Amazon Web Services (AWS) instances, that is, remote computers that already hold all the required programs and files and that you access from your own computer. Except for a spreadsheet program and an internet browser, all of the command-line software and data used in the workshop are hosted on an Amazon Machine Image (AMI). If you are signed up for a Metagenomics Data Carpentry Workshop, you do not need to set up an AMI instance yourself. The Carpentries staff will create an instance for you at no cost. This applies to both self-organized and centrally-organized workshops. Your Instructor will provide instructions for connecting to the AMI instance at the workshop.
If you are in a Carpentries workshop, you do not even need to install a Bash terminal; the RStudio terminal provided in the AWS AMI is enough to run all the commands in the lesson. Instead of connecting by ssh, you can use the RStudio terminal on the instance.
This lesson requires a working spreadsheet program. If you don’t have a spreadsheet program already, you can use LibreOffice. It’s a free, open-source spreadsheet program.
Running the lesson by yourself (Not in a Data Carpentry Workshop)
Required software
If you are not in a Data Carpentry Workshop, the software you need is listed in the table below. Follow the instructions in Option A or Option B to have access to these programs.
Software | Version in Conda | Manual | Available for | Description |
---|---|---|---|---|
FastQC | 0.11.9 | Help | Linux, macOS, Windows | Quality control tool for high-throughput sequence data. |
Trimmomatic | 0.39 | GitHub | Linux, macOS, Windows | A flexible read trimming tool for Illumina NGS data. |
Kraken | 2.1.2 | GitHub | Linux, macOS | A tool for taxonomic assignment of metagenomic reads. |
KronaTools | 2.8.1 | GitHub | Linux, macOS, Windows | A tool for visualizing taxonomy in hierarchical pie charts. |
MaxBin2 | 2.2.7 | SourceForge | Linux, macOS | A tool for reconstructing metagenome-assembled genomes (MAGs). |
SPAdes | 3.15.2 | GitHub | Linux, macOS | A genome assembly tool. |
Kraken-biom | 1.2.0 | GitHub | Linux, macOS, Windows | A tool to convert Kraken reports into BIOM files that R can read. |
CheckM-genome | 1.2.1 | Wiki | Linux, macOS, Windows | A tool to check completeness and contamination in MAGs. |
Option A: Using the lessons with Amazon Web Services (AWS)
Follow these instructions on creating an Amazon instance. Use the AMI ami-028155394f1e36b0d named The Carpentries Lab Metagenomics v1.0, listed on the Community AMIs page. Please note that you must set your location as N. Virginia to access this community AMI. You can change your location in the upper right corner of the main AWS menu bar. The cost of using this AMI for a few days, with the t2.medium instance type, is very low (about USD $2.00 per user per day). Data Carpentry has no control over AWS pricing structure and provides this cost estimate without guarantees. Please read AWS documentation on pricing for up-to-date information.
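As a rough illustration of how that estimate scales, the sketch below multiplies users, days, and a daily rate. The default rate is the approximate USD 2.00 per user per day quoted above, not a live AWS price; the function name `estimate_cost` and the example numbers are hypothetical.

```shell
#!/bin/sh
# Rough AWS cost estimate in USD: users x days x daily rate.
# The default rate of 2.00 is the approximate figure quoted above (an assumption,
# not a current AWS price); pass a third argument to override it.
estimate_cost() {
    users=$1
    days=$2
    rate=${3:-2.00}
    LC_ALL=C awk -v u="$users" -v d="$days" -v r="$rate" \
        'BEGIN { printf "%.2f\n", u * d * r }'
}

# Hypothetical example: 20 learners for a 3-day workshop
estimate_cost 20 3   # prints 120.00
```

Treat the result as a planning figure only and check the AWS pricing documentation before launching instances.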
If you’re an Instructor or Maintainer or want to contribute to these lessons, don’t hesitate to contact us at team@carpentries.org, and we will start instances for you.
In this instance, you can use the terminal available in RStudio, so users won’t need to install their own terminals or use ssh (see Instructor Notes). If, nevertheless, you prefer that users install their own terminals, directions for Windows, macOS, and Linux are included below in the Option B section. For Windows, you will need to install Git Bash, PuTTY, or the Windows Subsystem for Linux (WSL) with Ubuntu.
Option B: Following the lessons on your local machine
If your computer is powerful enough and you want to have all the programs installed locally, you can follow the whole workshop without using a remote AWS machine. To do this, you will need to install all of the software used in the workshop and obtain a copy of the dataset. Instructions for doing this are below.
Data
The data used in this workshop are available on Zenodo. Please read the Zenodo page linked below for information about the data and access to the data files. Because this workshop works with real data, be aware that file sizes for the data are large.
More information about these data will be presented in the first episode of the Data processing and visualization for metagenomics lesson.
Install a Bash terminal
Windows
macOS
Linux
Install Miniconda3
These instructions assume familiarity with the command line and with installation in general. There are different operating systems and many different versions of operating systems and environments, so these may not work on your computer. If an installation doesn’t work for you, please refer to the user guide for the tool listed in the table above. If you have difficulties with the installations or find better ways to install things in your operating system, please raise an Issue to let us know.
To make a Conda environment, you first need to install Conda. We recommend installing Miniconda3. Miniconda is a minimal installer that provides Conda, the package and environment manager, together with Python and a small set of dependencies, which keeps the installation simple. Please install Miniconda3 first (installation instructions below) and then proceed to the installation of the environment.
Linux
macOS
WSL
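Whichever installer you use, it can help to confirm that Conda is reachable from your terminal before going further. The following is a minimal sketch, assuming a POSIX shell; the helper name `have_command` is hypothetical.

```shell
#!/bin/sh
# Return success if the named program can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command conda; then
    # Print the installed Conda version
    conda --version
else
    echo "conda was not found on your PATH; open a new terminal or revisit the Miniconda3 installation."
fi
```

If the message appears right after installing Miniconda3, opening a new terminal is often enough, because the installer changes your shell start-up file.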
Install the metagenomics environment
Once your Miniconda3 is ready, follow these instructions to install and activate the metagenomics environment.
Linux: Option 1 (recommended)
Linux: option 2
macOS
WSL
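If you prefer to assemble the environment by hand, one possible approach is to pin the versions from the software table in a single `conda create` call. This is a sketch, not the lesson's official procedure: the channel names (`conda-forge`, `bioconda`) and the Conda package names (for example `kraken2` for Kraken and `krona` for KronaTools) are assumptions that you should confirm with `conda search`. The script only prints the command so that you can review it before running it.

```shell
#!/bin/sh
# Build (but do not run) a conda command that pins the versions from the table above.
# Package names and channels are assumptions; verify each with `conda search`.
ENV_NAME="metagenomics"
PACKAGES="fastqc=0.11.9 trimmomatic=0.39 kraken2=2.1.2 krona=2.8.1 maxbin2=2.2.7 spades=3.15.2 kraken-biom=1.2.0 checkm-genome=1.2.1"

build_conda_command() {
    # $PACKAGES is intentionally unquoted so each spec becomes a separate word
    echo conda create --name "$ENV_NAME" --channel conda-forge --channel bioconda $PACKAGES
}

# Print the command for review
build_conda_command
```

After running the printed command, `conda activate metagenomics` switches your terminal to the new environment.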
Execute some remaining installation scripts
Replace dcuser with your own username in the paths below, then run all of these lines:
bash /home/dcuser/.miniconda3/envs/metagenomics/opt/krona/updateTaxonomy.sh
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz
mkdir -p /home/dcuser/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp /home/dcuser/.taxonkit
rm *dmp readme.txt taxdump.tar.gz gc.prt
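A quick check that the copy step worked can save debugging later. The sketch below reports any of the four taxonomy files that are missing from a directory. The helper name `check_taxdump` is hypothetical, and the default of `$HOME/.taxonkit` assumes that your home directory is the one you substituted for /home/dcuser.

```shell
#!/bin/sh
# Report which of the NCBI taxonomy dump files copied above are missing from a directory.
# Returns 0 when all four files are present and 1 otherwise.
check_taxdump() {
    dir=${1:-"$HOME/.taxonkit"}
    missing=0
    for f in names.dmp nodes.dmp delnodes.dmp merged.dmp; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Prints one line per missing file, or a confirmation when all four are present
if check_taxdump "$HOME/.taxonkit"; then
    echo "taxonomy files are in place"
fi
```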
Install R and RStudio
R and RStudio are two separate pieces of software:
R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. RStudio is an integrated development environment (IDE) that makes using R easier. In this course, we use RStudio to interact with R.
macOS
Windows
Linux
Install R libraries
Software | Version | Manual | Description |
---|---|---|---|
phyloseq | 1.39.1 | GitHub | Explore, manipulate and analyze microbiome profiles with R |
ggplot2 | 3.3.6 | GitHub | System for declaratively creating graphics, based on The Grammar of Graphics |
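Type these commands in your R console. phyloseq is distributed through Bioconductor rather than CRAN, so it is installed with BiocManager:
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("phyloseq")
> install.packages("ggplot2")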