Category: Bioinformatics

NGS

USEARCH-based de-novo OTU picking ITS

Method

In the USEARCH-based ITS pipeline, the R1 and R2 raw reads are processed by removing the primer sequences with Cutadapt v1.14 and then merged with USEARCH v10. The de-novo UPARSE-OTU algorithm is used to pick OTUs at 97% similarity and to remove chimeras. OTUs are identified against the UNITE database (December 2017 release), collected in a .biom file and filtered at 0.005% abundance (Bokulich et al., 2013) to eliminate spurious OTUs present at low frequency.
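
The abundance filter in the last step can be illustrated with a short Python sketch. This is a minimal example, assuming a plain tab-separated OTU table with OTU IDs in the first column and one column of read counts per sample; the file names are placeholders, not the actual files produced by the pipeline.

    import pandas as pd

    # Hypothetical tab-separated OTU table: OTU IDs in the first column,
    # one column of read counts per sample.
    otu_table = pd.read_csv("otu_table.tsv", sep="\t", index_col=0)

    total_counts = otu_table.values.sum()
    threshold = total_counts * 0.005 / 100   # 0.005% of all reads

    # Keep only the OTUs whose total count across samples reaches the threshold
    filtered = otu_table[otu_table.sum(axis=1) >= threshold]
    filtered.to_csv("otu_table_filtered.tsv", sep="\t")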

Output

We deliver an archive via the cloud storage provider Mega. The most important file in the folder is an .html file that opens an interactive web page with a sunburst chart, per-sample pie charts, and links to download the taxonomy table and the .fastq reads. An example of the output is available here.

When to use it

You should use this method when you need to analyze various sample matrices. Moreover, as it is a de-novo OTU-picking strategy, it can be used to discover new OTUs.


NGS

USEARCH-based de-novo OTU picking

Method

In the USEARCH-based BMR pipeline, the R1 and R2 raw reads are processed by removing the adapter sequences with Cutadapt v1.14 and then merged with USEARCH v10. The de-novo UPARSE-OTU algorithm is used to pick OTUs at 97% similarity and to remove chimeras. OTUs are identified against the RDP training set v16, collected in a .biom file and filtered at 0.005% abundance (Bokulich et al., 2013) to eliminate spurious OTUs present at low frequency.

Output

We deliver an archive via the cloud storage provider Mega. The most important file in the folder is an .html file that opens an interactive web page with a sunburst chart, per-sample pie charts, and links to download the taxonomy table and the .fastq reads. An example of the output is available here.

When to use it

You should use this method when you need to analyze various sample matrices, mainly those derived from humans, food and the environment. Moreover, as it is a de-novo OTU-picking strategy, it can be used to discover new OTUs.


NGS

Qiime-based closed-reference OTU picking

Method

In the standard BMR pipeline, the R1 and R2 raw reads are merged with the FLASH software v1.2.11 and quality filtered (Q>30). The reference-based UCLUST algorithm is used to pick OTUs at 97% similarity. OTUs are identified against the Greengenes database v13_8, collected in a .biom file and filtered at 0.005% abundance (Bokulich et al., 2013) to eliminate spurious OTUs present at low frequency.

Output

We deliver an archive via the cloud storage provider Mega. The most important file in the folder is an .html file that opens an interactive web page with a sunburst chart, per-sample pie charts, and links to download the taxonomy table and the .fastq reads. An example of the output is available here.

When to use it

You should use this method when you need to analyze human-derived matrices. Note that, as it is a closed-reference approach, it cannot detect new OTUs beyond those present in the reference database (Greengenes).


NGS

De novo bacterial genome: bioinformatic output

Our standard pipeline for genome analysis performs the following steps:

  1. Quality check and filtering of the raw data (FASTQ format)
  2. De novo assembly of the reads, producing the contigs (in FASTA format)
  3. Gene prediction: identification of putative coding regions (ORFs) and putative non-coding genes (rRNAs, tRNAs…), producing a tabular output in GFF format (see the sketch after this list)
  4. Automatic annotation of the predicted genes in GBK (GenBank), GFF and FASTA formats.
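
As a small example of working with the gene-prediction output (step 3), the sketch below counts the predicted features by type in a GFF file. The file name is a placeholder; column 3 is assumed to hold the feature type, as in standard GFF.

    from collections import Counter

    feature_types = Counter()
    with open("annotation.gff") as gff:              # placeholder file name
        for line in gff:
            if line.startswith("#") or not line.strip():
                continue                             # skip comments and blank lines
            columns = line.rstrip("\n").split("\t")
            feature_types[columns[2]] += 1           # GFF column 3 = feature type (CDS, tRNA, rRNA, ...)

    for feature, count in feature_types.most_common():
        print(feature, count)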



NGS

Information about the output of the Qiime analysis

The Qiime 16S analysis pipeline

The R1 and R2 reads are merged with the FLASH software v1.2.11 and quality filtered (Q>30). The analysis is based on the pick_closed_reference_otus method of Qiime 1.9.1, using the GreenGenes database v13_8. The OTUs collected in the .biom file are filtered at 0.005% abundance (Bokulich et al., 2013) to eliminate spurious low-frequency OTUs.

The alpha-diversity analyses (within-sample diversity) are performed with the alpha_rarefaction command, which rarefies the samples at different depths (number of reads) and computes the diversity indices (phylogenetic and non-phylogenetic). The beta_diversity_through_plots command generates the distance matrices and the 3D PCoA plots of the samples at the optimal rarefaction depth. The jackknifed_beta_diversity command generates the UPGMA trees (hierarchical clustering of the samples) and the jackknife-supported 3D PCoA plots (repeated subsampling of each sample's data to highlight its variability). Finally, the make_distance_boxplots command produces boxplots comparing the experimental conditions, on which a Student's t-test is performed to assess the significance of the difference between groups.

The folders you will find in the output are:

  • OTUs: contains the OTUs found (groups of sequences that represent a bacterial species), the otu_table.biom, the filtered OTU table (filtered.biom) with counts, and the phylogenetic tree of the OTUs (.tree);

  • Heatmap.pdf: the heatmap of the OTU table in graphical form.

  • Taxa_Summary: contains tabular files and plots describing the final taxonomic composition of each sample, including interactive HTML files (in the “taxa_summary_plots” subfolder); the folder contains the taxonomic assignment of each sample in .txt, .biom and .html format (interactive, as an area chart or bar chart), split by taxonomic level (L2 = Phylum, …, L6 = Genus, L7 = Species);

  • AlphaRarefaction: contains the rarefaction analyses and plots for several alpha-diversity indices; inside it, “alpha_rarefaction_plots/rarefaction_plots.html” is an interactive file for viewing the rarefaction plot computed for each sample on the different alpha-diversity metrics;

  • BetaDiversity: contains the analyses and the 3D PCoA plots of the samples and the hierarchical clustering of the samples with UPGMA trees, highlighting which samples have the most similar bacterial communities; the index.html files inside the folders can be viewed in a browser (“Emperor” tool);

  • JakknifedBetaDiversity: similar to the beta diversity analysis, but the analysis is reiterated to highlight the reliability of the result and its variability; the index.html files inside the folders can be viewed in a browser (“Emperor” tool);

  • Boxplot (Weighted, Unweighted): comparison between groups of experimental conditions, reporting the significance of the difference between the compared groups.

Other files of interest (raw data):

  • filtered.biom and filtered.tsv: the files containing the OTU table (showing the proportions of the OTUs in the samples) in .biom and .tsv format (the latter readable with programs such as Notepad or Excel), filtered at 0.005% (Bokulich et al., 2013); a minimal example of reading this table is sketched after this list;

  • OTUs/filtered_pruned.tre: file containing the phylogenetic tree of the OTUs (readable with programs such as FigTree), with OTUs below 0.005% abundance removed (the original tree, rep_set.tre, is in the same folder);

  • reads_counts.csv: the number of reads per sample.
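
A minimal sketch of reading the filtered.tsv table in Python, assuming it is a plain tab-separated table with OTU IDs in the first column and one column of read counts per sample; it converts the counts into per-sample relative abundances.

    import pandas as pd

    # filtered.tsv is assumed to be a tab-separated OTU table:
    # OTU IDs in the first column, one column of read counts per sample.
    otus = pd.read_csv("filtered.tsv", sep="\t", index_col=0)

    # Convert counts to per-sample relative abundances (percentages)
    relative = otus.div(otus.sum(axis=0), axis=1) * 100

    # Show the five most abundant OTUs in the first sample
    first_sample = relative.columns[0]
    print(relative[first_sample].sort_values(ascending=False).head())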

The names of the reads and of the samples in the output folder contain, in order: the file ID and the word FROM followed by the ID of the original sample you sent us.


NGS

The FASTQ file format

The output of NGS sequencers is a set of files in the FASTQ format; an in-depth description of the format is available on Wikipedia. In the FASTQ format, each sequence is saved using four lines:

  1. Sequence name, after a “@” prefix (e.g. “@MNL12948:2031:123”)
  2. The sequence itself (e.g. “ACGTACGACGGGAGGTACTCTACGATAT”)
  3. The “+” symbol, optionally followed by the sequence name again
  4. The sequence quality, encoded with a character per base (e.g. “@C@,AFDG@CAFGGGGGGGEFCEGGFG+CD”)

Each file is usually compressed with GZip (it can be decompressed with the “gunzip” command on Mac OS X or Linux, or with the 7-Zip program on Windows).
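
A minimal sketch of reading such a file in Python, assuming a gzip-compressed FASTQ named reads.fastq.gz (the file name is a placeholder); it walks over the four-line records and reports the number of reads and their average length.

    import gzip

    n_reads = 0
    total_length = 0

    with gzip.open("reads.fastq.gz", "rt") as fastq:   # "rt" = open as text
        while True:
            header = fastq.readline()            # line 1: "@" + sequence name
            sequence = fastq.readline().strip()  # line 2: the sequence itself
            plus = fastq.readline()              # line 3: the "+" separator
            quality = fastq.readline()           # line 4: per-base quality string
            if not quality:                      # end of file reached
                break
            n_reads += 1
            total_length += len(sequence)

    print(f"{n_reads} reads, average length {total_length / n_reads:.1f} bp")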

File size

FASTQ files can be large, ranging from several megabytes to gigabytes.

As an example, a single amplicon sequenced with the Illumina MiSeq machine (2×300 bp) that yields 190,000 sequence pairs will be saved into two files (one per read of each pair, i.e. R1 and R2). Each file will be ~100 MB, or ~25 MB after compression.
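
As a rough sanity check on those numbers, the short sketch below estimates the uncompressed size of one such file from the read count and read length; the ~50-character header length is an assumption used only for illustration.

    # Rough FASTQ size estimate: four lines per record
    reads = 190_000      # sequence pairs, i.e. reads per file
    read_len = 300       # 2x300 bp MiSeq run
    header_len = 50      # assumed average header length (illustrative)

    # header + sequence + "+" line + quality string, each followed by a newline
    bytes_per_record = (header_len + 1) + (read_len + 1) + 2 + (read_len + 1)
    print(f"~{reads * bytes_per_record / 1e6:.0f} MB per file before compression")
    # ≈ 124 MB, in the same ballpark as the ~100 MB quoted above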


NGS

De novo assembly of (short) reads

The output of a Whole Genome Sequencing experiment performed with Illumina machines is a set of (relatively short) sequences, saved in the FASTQ format. Here we report three reads (60bp long) as they appear in a FASTQ file.

@M00961:44:000000000-ACVH6:1:1101:16881:1212 2:N:0:63
NCCTCCGCTTATTGATATGCTTAAGTTCATCGGGTAATCCTACTTGATCTGAGGTTGAAT
+
#8BCCGGGGGGGGGGGGGGD9CCEFFGGGGGGGGGGGGGGGGGEGGFGGGGGGGGGGGGG
@M00961:44:000000000-ACVH6:1:1101:13121:1227 2:N:0:63
NCCTCCGCTATTGATATGCTTAAGTTCAGCGGGTAATCCTACTTGATCTGAGGTTGAATA
+
#8BCCGGGFGGGGEGCFGGGCCCFEFGGGEF>CFFGGGGFGFFFGGGFGGEGGFGGGGGG
@M00961:44:000000000-ACVH6:1:1101:16888:1232 2:N:0:63
NCCTCCGCTTATTGATATGCTTAAGTTCATCGGGTGATCCTACTTGATCTGAGGTTGAAT
+
#8ACCGGGFGGGGGGGGGGEF<DEFCFFFGGGGGE+CFFGGGGF

The process of sequence assembly takes all the reads produced and tries to merge them into longer sequences called contigs. It's important to note that, even when the reads come from a single chromosome, the assembly process is usually unable to produce a single contig because of several limitations. Common problems are the presence of sequence repeats in every genome and the fact that genome coverage may not be uniform.

The final output of the assembly step is a set of contigs, saved in a multi-FASTA file. They are usually much longer than the sequencing reads: for a typical prokaryotic genome, when performing shotgun sequencing with 300 bp reads, it's common to obtain contigs as long as 500,000 bp. A common metric to describe the assembly is the N50.


NGS

N50, a common metric to describe de novo assemblies

When assembling short reads to produce contigs, it's very common to obtain a large number of very small contigs. This makes common metrics, such as the average or the median, not very useful for describing the final result in terms of contig size.

How to calculate it

To calculate the N50, a commonly used metric, we first sort the contig sizes from the largest to the smallest. We also calculate the total sum of the contig lengths and divide it by 2. Then we start summing the contig sizes, from the largest to the smallest, until we reach half of the total length. The length of the last contig picked is the N50.
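
A minimal sketch of this calculation in Python (the contig lengths at the bottom are made-up values used only for illustration):

    def n50(contig_lengths):
        """Return the N50 of a list of contig lengths."""
        sizes = sorted(contig_lengths, reverse=True)   # largest first
        half_total = sum(sizes) / 2
        running = 0
        for size in sizes:
            running += size
            if running >= half_total:                  # half of the assembly is covered
                return size

    # Example with made-up contig lengths (bp)
    print(n50([80_000, 70_000, 50_000, 30_000, 10_000, 5_000]))   # -> 70000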

Its meaning

If my assembly has an N50 of 30 kbp, this means that half of my assembly is contained in contigs of at least 30 kbp. Considering that the average ORF in a prokaryotic genome is usually ~1 kb, most of our genes should be contained in a single contig, not split across contigs. Moreover, several clusters of genes will be represented in our assembly.

Of course, with an N50 of 50-100 kbp we can be even happier.


NGS

RNA-Seq standard analysis and example output

Differentially expressed genes

The standard RNA-Seq service is designed for the identification of Differentially Expressed (DE) genes and is performed on sets of biological replicates (minimum 3 per condition, recommended 5 or more). To perform the bioinformatic analysis we need:
  • A reference genome and annotation (see below)
  • A list of experimental conditions (e.g. “control” and “treated”) and the samples belonging to each of them.
For each sequenced sample we perform:
  1. FastQC and adapter trimming. In this step, we evaluate the average quality of the reads at each base position.
  2. Alignment against the reference genome. In this step we map the reads to the reference genome (downloaded from Ensembl) using HISAT2, producing a BAM file for each sample.
  3. Raw read counts. We count the number of reads mapped within each feature (annotation of the reference, from the same Ensembl release) using featureCounts or similar software.
  4. Collection of the raw counts in a matrix (see the sketch after this list).
  5. Differential expression analysis. In this step, we use the edgeR package to evaluate whether a gene is significantly over- or under-expressed.
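
As an illustration of step 4, the sketch below collects hypothetical per-sample count files (two tab-separated columns: gene ID and count, as produced by featureCounts-like tools) into a single raw counts matrix with pandas. The sample and file names are placeholders, not the exact files we deliver.

    import pandas as pd

    # Hypothetical per-sample count files, without a header row:
    # column 1 = gene ID, column 2 = number of reads assigned to that gene
    samples = ["control_1", "control_2", "treated_1", "treated_2"]

    columns = []
    for sample in samples:
        counts = pd.read_csv(f"{sample}.counts.tsv", sep="\t",
                             index_col=0, names=["gene_id", sample])
        columns.append(counts[sample])

    # One row per gene, one column per sample
    matrix = pd.concat(columns, axis=1)
    matrix.to_csv("raw_counts_matrix.tsv", sep="\t")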

What can I choose to receive?

  • Raw data only – You’ll receive the sequencing output, i.e. the demultiplexed FASTQ files
  • Alignments and counts – Raw data plus steps 1-3 from the previous list
  • Differentially Expressed Genes – Raw data plus steps 1-5 from the previous list

What kind of files will I receive?

  • Sequencing data – Sequences produced with the Illumina NextSeq 500, in FASTQ format, usually compressed with GZip. These files are usually big, up to several gigabytes each. They are the raw output of the experiment and it’s very important to archive them safely. We highly recommend that you back them up: this means keeping two independent copies of the data (included in all packages). You can download an example from the ENCODE project: Reads in FASTQ format (8.0 GB)
  • Alignments – The output of the sequence alignment step is given in the BAM format, a compressed version of the SAM format (included in the Alignments and Counts and Differentially Expressed Genes packages). You can download an example from the ENCODE project: Alignment in BAM format.
  • Raw counts matrix – This is a simple text file containing a first column with all the genes included in the annotation, plus one column per sequenced sample. It’s a file that usually has about 30,000 lines. The first lines look like:

      ID1039508 ID1039509 ID039512 ID1039513 ID1039516
    ENSG00000000003 679 448 873 408 1138
    ENSG00000000005 0 0 0 0 0
    ENSG00000000419 467 515 621 365 587
    ENSG00000000457 260 211 263 164 245
    ENSG00000000460 60 55 40 35 78
    ENSG00000000938 0 0 2 0 1
  • edgeR output – The edgeR analysis produces three files: under.txt and over.txt, the lists of under-expressed and over-expressed genes, and DE.txt, a full matrix reporting for each gene the log fold change between the two conditions (logFC), the log counts per million reads (logCPM) and the corrected p-value (FDR). Please refer to the edgeR manual PDF for further details (included only in the Differentially Expressed Genes package). A minimal example of reading and filtering this table is sketched after this list. The first lines of the DE.txt file look like:

      logFC logCPM PValue FDR
    ENSG00000275852 0.01352 -2.59210 0.85455 1
    ENSG00000261140 -0.28245 -1.38742 0.44214 1
    ENSG00000236211 -3.76650 -0.64187 2.22330e-21 5.46551e-19
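
As a quick start, the sketch below reads a DE.txt-style table with pandas and keeps the genes with an FDR below 0.05, splitting them into over- and under-expressed; the 0.05 threshold and the whitespace-separated layout are assumptions you may want to adapt.

    import pandas as pd

    # DE.txt is assumed to be a whitespace-separated table with the gene IDs in
    # the first column, followed by the edgeR columns logFC, logCPM, PValue, FDR.
    de = pd.read_csv("DE.txt", sep=r"\s+", header=0,
                     names=["gene_id", "logFC", "logCPM", "PValue", "FDR"])

    significant = de[de["FDR"] < 0.05]               # assumed significance threshold
    over = significant[significant["logFC"] > 0]     # over-expressed genes
    under = significant[significant["logFC"] < 0]    # under-expressed genes

    print(f"{len(over)} over-expressed and {len(under)} under-expressed genes")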

What to do next

Once you receive the analysis of differentially expressed genes, you can analyze and annotate it using any tool that takes a list of gene IDs as input, such as the DAVID online tool. With DAVID you can, for example, annotate your results and perform Gene Set Enrichment Analysis (GSEA) to understand whether a functional category is enriched in your list of over- or under-expressed genes compared with the background (the whole transcriptome).