BMR Genomics

De novo bacterial genome: bioinformatic output

Our standard pipeline for the analysis of a bacterial genome performs the following steps:

  1. Quality check and filtering of the raw reads (FASTQ format)
  2. De novo assembly of the reads, producing the contigs (FASTA format)
  3. Gene prediction: identification of putative coding regions (ORFs) and putative non-coding genes (rRNAs, tRNAs…), producing a tabular output in GFF format
  4. Automatic annotation of the predicted genes, delivered in GenBank (GBK), GFF and FASTA formats.
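
To give an idea of what step 1 involves, here is a minimal Python sketch of a read filter. It is an illustration only, not the tool used in our pipeline: the function names, the mean-quality threshold (20), the minimum length (50) and the Phred+33 encoding are all assumptions made for the example.

```python
from typing import Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def parse_fastq(text: str) -> Iterator[Read]:
    """Yield reads from FASTQ text, assuming 4 lines per record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(text: str,
                 min_mean_quality: float = 20.0,  # hypothetical threshold
                 min_length: int = 50) -> Iterator[Read]:
    """Keep reads that are long enough and of sufficient mean quality."""
    for name, sequence, quality in parse_fastq(text):
        if len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality:
            yield name, sequence, quality


if __name__ == "__main__":
    # Two invented reads: one with high quality ('I' = Q40), one with low ('#' = Q2).
    example = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n"
    for name, _seq, _qual in filter_reads(example, min_length=4):
        print(name)
```

Real filtering tools also trim adapters and low-quality read ends, which this sketch leaves out.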

De novo assembly

Quality-checked and filtered reads are assembled with SPAdes 3.7 (or newer, if available). The main output is a multi-FASTA file containing the contigs: contiguous sequences, each representing a portion of the genome. The file could start like this:

>ctg_1
CCGCTGAAGCCGCCACCGCCACCGCCGCCGAATCCGCCGCCTCCGCCAAAGCCGCCCCCT
CCTCCGGAGCCACCCCGGCCGGCAGGCAGGATACCGAGCATCTGGCAGACAAACACCGTC
AGGATGAACAACATCACCAGGAACATGAACAGCGCAGGGTGCCGCGAGATGAAATCATCC
GCCGGGTCGCCGCTGGATTCGTACACGGTCGACGGTTCGTCCAGCGGATTGCCACCCAGC
ACCACCAGCATCGCCGCAACGCCATCGCTGATGCCTTTGCTGAAATTGCCGGCCTTGAAC
GCTGGCGTGATGACCTGATGAATGATCACCGACGACTGCGCATCGGTCAGGCGATCCTCC
AGGCCATAGCCGACTTCGATGCGCAGTTTGCGCTCGTCACGCGCGACGATCAGCAAGGCG
CCGTTGTTCTTGTCTTTCTGACCGATGCCCCAGTGCCGACCGAGCTGAACGCCGAAATCC
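
A file in this format can be loaded with a few lines of Python. The sketch below is one possible way to do it; the function name `read_fasta` and the second contig in the example (`ctg_2`) are invented for illustration.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Return {contig name: sequence} from multi-FASTA text."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            parts[name] = []
        elif name is not None:
            # Sequence lines are wrapped; collect them and join at the end.
            parts[name].append(line)
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


if __name__ == "__main__":
    example = (
        ">ctg_1\n"
        "CCGCTGAAGCCGCCACCGCCACCGCCGCCGAATCCGCCGCCTCCGCCAAAGCCGCCCCCT\n"
        ">ctg_2\n"   # hypothetical second contig
        "ACGTACGT\n"
        "ACGT\n"
    )
    for contig, sequence in read_fasta(example).items():
        print(contig, len(sequence))
```

For large assemblies a streaming parser, or a library such as Biopython, would be a more robust choice.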

For example, the sequencing of one bacterium produced 2 × 590,000 reads. Their assembly generated only 111 contigs, the longest being 530,000 bp (N50: 242,000 bp).
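
The N50 is the length of the contig at which, going from the longest contig to the shortest, the running total reaches at least half of the assembled bases. A minimal sketch of the calculation, using an invented list of contig lengths:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Contig length at which the cumulative sum reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


if __name__ == "__main__":
    # Total is 20 bases; 6 + 5 = 11 reaches half of it, so the N50 is 5.
    print(n50([2, 3, 4, 5, 6]))  # prints 5
```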

Gene prediction and annotation

The output of the gene prediction and annotation step is provided in both tabular (GFF) and GenBank (GBK) format. The latter looks like this:

LOCUS       ctg_1                 510358 bp    DNA     linear       20-JAN-2017
DEFINITION  Genus species strain strain.
ACCESSION
VERSION
KEYWORDS    .
SOURCE      Genus species
  ORGANISM  Genus species
            Unclassified.
FEATURES             Location/Qualifiers
     source          1..510358
                     /organism="Genus species"
                     /mol_type="genomic DNA"
                     /strain="strain"
     CDS             318..692
                     /locus_tag="PROKKA_00001"
                     /inference="ab initio prediction:Prodigal:2.6"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /translation="MNDHRRLRIGQAILQAIADFDAQFALVTRDDQQGAVVLVFLTDA
                     PVPTELNAEILDGGALQIGQRDHHQLLAGGLLVRLQLLAQLRANRWLEHLRLIDHAPA
                     QRRKRQFGPGADGEQPQHQHQA"
     CDS             complement(721..1329)
                     /locus_tag="PROKKA_00002"
                     /inference="ab initio prediction:Prodigal:2.6"
                     /inference="protein motif:Pfam:PF04011.6"
...
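
The coordinates in the GenBank file are 1-based and inclusive, and they refer to the contig sequence in the FASTA file. As a sketch of how the two files relate, the Python code below takes the first 480 bases of ctg_1 shown in the FASTA example, starts at position 318 (the start of the first CDS) and translates as far as the excerpt allows. The helper names are invented for the example. The codon table is the standard one, whose codon assignments are shared by translation table 11 (the two differ in the permitted start codons).

```python
# First 480 bases of ctg_1, copied from the FASTA example on this page.
CONTIG_EXCERPT = (
    "CCGCTGAAGCCGCCACCGCCACCGCCGCCGAATCCGCCGCCTCCGCCAAAGCCGCCCCCT"
    "CCTCCGGAGCCACCCCGGCCGGCAGGCAGGATACCGAGCATCTGGCAGACAAACACCGTC"
    "AGGATGAACAACATCACCAGGAACATGAACAGCGCAGGGTGCCGCGAGATGAAATCATCC"
    "GCCGGGTCGCCGCTGGATTCGTACACGGTCGACGGTTCGTCCAGCGGATTGCCACCCAGC"
    "ACCACCAGCATCGCCGCAACGCCATCGCTGATGCCTTTGCTGAAATTGCCGGCCTTGAAC"
    "GCTGGCGTGATGACCTGATGAATGATCACCGACGACTGCGCATCGGTCAGGCGATCCTCC"
    "AGGCCATAGCCGACTTCGATGCGCAGTTTGCGCTCGTCACGCGCGACGATCAGCAAGGCG"
    "CCGTTGTTCTTGTCTTTCTGACCGATGCCCCAGTGCCGACCGAGCTGAACGCCGAAATCC"
)

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract(sequence: str, start: int, end: int) -> str:
    """Return the subsequence for 1-based, inclusive GenBank coordinates."""
    return sequence[start - 1:end]


def translate(dna: str) -> str:
    """Translate complete codons until a stop codon or the end of the input."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# The CDS spans 318..692, but the excerpt ends at base 480,
# so only the first part of the protein can be reproduced here.
partial_protein = translate(extract(CONTIG_EXCERPT, 318, len(CONTIG_EXCERPT)))
print(partial_protein)
# prints MNDHRRLRIGQAILQAIADFDAQFALVTRDDQQGAVVLVFLTDAPVPTELNAEI
```

The result matches the beginning of the `/translation` qualifier of PROKKA_00001. A feature on the reverse strand, such as `complement(721..1329)`, would additionally need the extracted sequence to be reverse-complemented before translation.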

The GFF output, with one feature per line, looks like this:

ctg_1 . CDS 180224 181552 . - 0 ID=64;eC=6.3.4.19;gene=tilS;similar to AA sequence:UniProtKB:P52097;locus_tag=PROKKA_00184;product=tRNA(Ile)-lysidine synthase
ctg_1 . CDS 181686 182633 . - 0 ID=85;eC=6.4.1.2;gene=accA;similar to AA sequence:UniProtKB:Q886M7;locus_tag=PROKKA_00185;product=Acetyl-coenzyme A carboxylase carboxyl transferase subunit alpha
ctg_1 . CDS 182773 186294 . - 0 ID=76;eC=2.7.7.7;gene=dnaE;similar to AA sequence:UniProtKB:P10443;locus_tag=PROKKA_00186;product=DNA polymerase III subunit alpha
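
Each GFF line has nine columns: sequence name, source, feature type, start, end, score, strand, phase and a list of attributes separated by semicolons. The Python sketch below shows one way to read such a line. The function name and the returned dictionary layout are choices made for this example. Attributes without a `key=value` form, as in the lines above, are collected in a separate list.

```python
from typing import Any, Dict, List


def parse_gff_line(line: str) -> Dict[str, Any]:
    """Split one GFF feature line into its nine columns and its attributes."""
    # Split on whitespace for the first eight columns only, so that spaces
    # inside the attribute column (e.g. in product names) are preserved.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 columns, found %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields

    attributes: Dict[str, str] = {}
    notes: List[str] = []  # attribute entries that have no 'key=' part
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, separator, value = item.partition("=")
        if separator:
            attributes[key] = value
        else:
            notes.append(item)

    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes, "notes": notes,
    }


if __name__ == "__main__":
    example = ("ctg_1 . CDS 182773 186294 . - 0 ID=76;eC=2.7.7.7;gene=dnaE;"
               "similar to AA sequence:UniProtKB:P10443;"
               "locus_tag=PROKKA_00186;product=DNA polymerase III subunit alpha")
    feature = parse_gff_line(example)
    length = feature["end"] - feature["start"] + 1  # coordinates are inclusive
    print(feature["attributes"]["gene"], feature["strand"], length)
```

GFF files are normally tab-separated, so a stricter parser would split on the tab character instead of on any whitespace.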