This lesson has passed peer-review! See the publication in JOSE.

Data Processing and Visualization for Metagenomics: Glossary - Data processing and visualization for metagenomics

Key Points

Starting a Metagenomics Project
  • Shotgun metagenomics can be used for taxonomic and functional studies.

  • Metabarcoding can be used for taxonomic studies.

  • Collecting metadata beforehand is fundamental for downstream analysis.

  • We will use data from a Cuatro Ciénegas project to learn about shotgun metagenomics.

Assessing Read Quality
  • It is important to know the quality of our data to make decisions in the subsequent steps.

  • FastQC is a program that allows us to know the quality of FASTQ files.

  • for loops let you perform the same operations on multiple files with a single command.

Trimming and Filtering
  • The options you set for the command-line tools you use are important!

  • Data cleaning is essential at the beginning of metagenomics workflows.

  • Use Trimmomatic to get rid of adapters and low-quality bases or reads.

  • Carefully fill in the parameters and options required to run a command in the Bash shell.

  • Automate repetitive workflows using for loops.

Metagenome Assembly
  • Assembly groups reads into contigs.

  • De Bruijn graphs use k-mers to assemble cleaned reads.

  • The program screen lets you keep remote sessions open.

  • MetaSPAdes is a metagenome assembler.

  • Assemblers take FASTQ files as input and produce a FASTA file as output.
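
To make the k-mer idea concrete, here is a minimal Python sketch of how a De Bruijn graph can be built from reads. It is an illustration only and not how MetaSPAdes is implemented; the function names `kmers` and `de_bruijn_graph` and the two toy reads are assumptions made for this example.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return every substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Each k-mer is an edge from its prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two toy reads (made up for illustration) that overlap by four bases.
reads = ["ATGGC", "TGGCA"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from `AT` spells out `ATGGCA`, the sequence that both reads came from. Real assemblers also have to handle sequencing errors, repeats, and uneven coverage, which this sketch ignores.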

Metagenome Binning
  • Metagenome-Assembled Genomes (MAGs) can sometimes be obtained from curated contigs grouped into bins.

  • Use MaxBin to assign the contigs to bins of different taxa.

  • Use CheckM to evaluate the quality of each Metagenome-Assembled Genome.

Taxonomic Assignment
  • A database with previously gathered knowledge (genomes) is needed for taxonomic assignment.

  • Taxonomic assignment can be done using Kraken.

  • Krona and Pavian are web-based tools to visualize the assigned taxa.

Exploring Taxonomy with R
  • kraken-biom converts the Kraken output files of several samples into a single .biom file that serves as input for phyloseq.

  • The library phyloseq manages metagenomics objects and computes analyses.

  • A phyloseq object stores a table with the taxonomic information of each OTU and a table with the abundance of each OTU.

Diversity Tackled With R
  • Alpha diversity measures the intra-sample diversity.

  • Beta diversity measures the inter-sample diversity.

  • Phyloseq includes diversity analyses such as alpha and beta diversity calculation.
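
As a rough illustration of what these diversity measures compute, the Python sketch below implements one common alpha diversity metric (the Shannon index) and one common beta diversity metric (Bray-Curtis dissimilarity) on abundance counts. This is not the phyloseq API; the function names and the abundance vectors are assumptions made for the example.

```python
import math


def richness(counts):
    """Number of taxa observed at least once in a sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p * ln(p)) over the observed taxa."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 for no shared taxa."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / (sum(counts_a) + sum(counts_b))


# Made-up abundances of the same four taxa in two samples.
even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 1]

print(richness(even_sample), richness(skewed_sample))
print(shannon_index(even_sample), shannon_index(skewed_sample))
print(bray_curtis(even_sample, skewed_sample))
```

Both samples have the same richness, but the evenly distributed sample has the higher Shannon index, which is why alpha diversity is often reported with more than one metric.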

Taxonomic Analysis with R
  • Depths and abundances can be visualized using phyloseq.

  • The library phyloseq lets you manipulate metagenomic data from a taxon-specific perspective.

Other Resources
  • Enjoy metagenomics.

Glossary

adapters
Artificial sequences of small length that are attached to both ends of a biological sequence for methodological purposes.
Alpha diversity (α-diversity)
mean species diversity in a site at a local scale
Assembly (Metagenomics)
stitching together of individual DNA reads into more complex and complete objects (contig, scaffold), which could lead to the complete representation of a gene or an entire genome.
Beta diversity (β-diversity)
the extent of change in community composition, or degree of community differentiation, in relation to a complex-gradient of environment, or a pattern of environments
bin
Group of reads, contigs, or scaffolds hypothetically assigned to an individual genome.
binning
The process of grouping DNA sequences according to intrinsic characteristics of the sequence.
contig
A contiguous fragment of DNA sequence from an incomplete draft genome, obtained by assembling reads.
Environment (conda)
A directory that contains a specific collection of packages that the user has installed.
fasta (format)
A text-based format for representing biological sequences.
fastq
A file storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores
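
A FASTQ record usually spans four lines: a header starting with `@`, the sequence, a separator starting with `+`, and the quality string. The Python sketch below reads such records from a string; the function name `parse_fastq` and the example record are made up for illustration, and multi-line records are not handled.

```python
def parse_fastq(text):
    """Split FASTQ text into records of id, sequence, and quality string."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence and quality lengths differ for {header}")
        records.append({"id": header[1:], "sequence": sequence, "quality": quality})
    return records


# A made-up record: one quality character per base.
example = """@read_1
GATTACA
+
IIIIHH5
"""

for record in parse_fastq(example):
    print(record["id"], record["sequence"], record["quality"])
```
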
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
GC-content
The percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C).
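
As a small worked example of this definition, the following Python sketch computes GC-content as a percentage. The function name `gc_content` and the test sequences are assumptions made for illustration; ambiguous bases such as N are counted in the length but not as G or C.

```python
def gc_content(sequence):
    """Percentage of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_bases = sequence.count("G") + sequence.count("C")
    return 100 * gc_bases / len(sequence)


print(gc_content("ATGC"))      # 2 of 4 bases are G or C
print(gc_content("ggccggcc"))  # lowercase input is handled
```
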
gene
A sequence of nucleotides that contains the information to specify a trait.
genome
All genetic information of an organism.
Illumina (sequencing)
A technique used to determine the series of base pairs in DNA.
Lowest common ancestor (LCA)
The lowest node in a tree that has all the nodes of interest as descendants.
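
The sketch below shows one simple way to find a lowest common ancestor in Python when the tree is stored as a child-to-parent mapping. The function names and the heavily simplified taxonomy (most ranks are skipped) are assumptions made for illustration.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """First node on the path from taxa[0] to the root shared by every lineage."""
    other_lineages = [set(lineage(t, parent)) for t in taxa[1:]]
    for node in lineage(taxa[0], parent):
        if all(node in other for other in other_lineages):
            return node
    return None


# Simplified toy taxonomy: each key points to its parent node.
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
}

print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], parent))
print(lowest_common_ancestor(["Escherichia coli", "Bacillus subtilis"], parent))
```

The more distantly related the taxa of interest are, the higher in the tree their lowest common ancestor sits.
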
k-mer
A contiguous sequence of k characters contained within a biological sequence.
Mapping
The process of establishing where a set of nucleotide sequences, such as reads, is located on a reference sequence.
Metabarcoding
Collection of a specific gene region of a set of organisms.
Metadata
Information concerning how the samples and data were treated.
Metagenome-Assembled Genome (MAG)
A single-taxon assembly based on one or more binned metagenomes that has been asserted to be a close representation of an actual individual genome.
Metagenomics (shotgun metagenomics)
collection of genomic sequences from various (micro)organisms that coexist in any given space.
Next generation sequencing (NGS)
Technology used to determine the order of nucleotides in entire genomes or targeted regions of DNA or RNA, characterized by massively parallel processing.
Operational Taxonomic Unit (OTU)
A collection of sequences that share a certain percentage of similarity and are thus classified into groups of closely related individuals.
quality control
any process which removes problematic data from a dataset
quality (Phred) scores
An integer value representing the estimated probability of an error, i.e., that the base is incorrect.
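
A Phred score Q and the error probability P are related by Q = -10 log10(P), so Q = 20 corresponds to a 1 in 100 chance that the base is wrong. The Python sketch below converts between the two and decodes a quality string, assuming the common Phred+33 ASCII encoding; the function names and the example string are made up for illustration.

```python
def phred_to_error_probability(score):
    """Error probability P = 10 ** (-Q / 10) for a Phred score Q."""
    return 10 ** (-score / 10)


def decode_quality(quality_string, offset=33):
    """Turn each quality character into its Phred score (Phred+33 assumed)."""
    return [ord(character) - offset for character in quality_string]


# 'I' encodes Q40, '5' encodes Q20, and '!' encodes Q0 under Phred+33.
scores = decode_quality("I5!")
print(scores)
print([phred_to_error_probability(q) for q in scores])
```
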
read(s)
DNA sequence from one fragment (a small section of DNA).
read quality
the probability of error assigned to the sequencing of a given read
sequencing (genomics)
the process of determining the nucleic acid sequence – the order of nucleotides in DNA
Species diversity
The number of different species that are represented in a given community.
taxonomic assignment
Method of determining that a specific sequence belongs to a recognized taxon at different levels of the classification of living organisms (e.g., phylum, genus, and species). This is usually done by comparing the sequence of interest against a set of reference sequences.
thread
A thread is the unit of execution within a process. A process (the execution of a program) can have anywhere from just one thread to many threads.
Oligotrophic (environment)
A space that offers low levels of nutrients.
PCR (polymerase chain reaction)
method used to rapidly make millions of copies of a specific DNA sequence.
rRNA (Ribosomal ribonucleic acid)
a type of non-coding RNA which is the primary component of ribosomes.
scaffold
A portion of the genome sequence reconstructed from sequence fragments. Scaffolds are composed of contigs and gaps.
Sequencing depth (coverage)
The number of unique reads that include a given nucleotide in the reconstructed sequence.
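
To illustrate this definition, the Python sketch below counts how many reads cover each position of a reconstructed sequence. Reads are represented as hypothetical (start, end) intervals with an exclusive end; the function name and the coordinates are made up for the example.

```python
def depth_per_position(reads, length):
    """Count how many reads cover each position of a sequence of given length."""
    depth = [0] * length
    for start, end in reads:
        # Clip each read to the sequence boundaries before counting.
        for position in range(max(start, 0), min(end, length)):
            depth[position] += 1
    return depth


# Three made-up reads aligned to a sequence of 8 nucleotides.
reads = [(0, 4), (2, 6), (3, 8)]
depth = depth_per_position(reads, length=8)
print(depth)
print(sum(depth) / len(depth))  # mean depth across the sequence
```
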
Species abundance
The number of individuals of each species inside the environment.
Species richness
Number of different species in an environment.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.
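
The difference between the two kinds of loop can be sketched in Python as follows. The lesson itself uses loops in the Bash shell, so this is only an analogy, and the file names are hypothetical.

```python
# Hypothetical FASTQ file names, used only for illustration.
fastq_files = ["sample1_R1.fastq", "sample1_R2.fastq", "sample2_R1.fastq"]

# for loop: the body runs once for each item in the list.
trimmed_names = []
for name in fastq_files:
    trimmed_names.append(name.replace(".fastq", ".trim.fastq"))

# while loop: the body runs for as long as the condition is true.
pending = list(fastq_files)
processed = 0
while pending:
    pending.pop()
    processed += 1

print(trimmed_names)
print(processed)
```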