Detect Microbial Contamination in Contigs by Kraken

Purpose in short: I want to detect (and remove) potential contaminants in the genome assembly, and Kraken is a tool designed for that.

Introduction

From its webpage:

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.

In its fastest mode of operation, for a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision.

Kraken is written in C++ and Perl, and is designed for use with the Linux operating system. We have also successfully compiled and run it under the Mac OS.

In practice

See Kraken manual for full instructions.

Install

Download the latest release.

1
2
3

$ unzip kraken-1.1.zip && cd kraken-1.1

$ ./install_kraken.sh $TOOLDIR/kraken.1.1

If you want to build your own dabase, jellyfish version 1 should be in your PATH.

Build standard Kraken database

The standard Kraken database includes bacterial, archeal, and viral genomes in Refseq at the time of the build.

#!/bin/sh

path2kraken=$TOOLDIR/kraken.1.1

DBNAME=Kraken_DB

$path2kraken/kraken-build --standard --threads 16 --db $DBNAME

If any step (including the initial downloads) fails, the build process will abort. However, kraken-build will produce checkpoints throughout the installation process, and will restart the build at the last incomplete step if you attempt to run the same command again on a partially-built database.

Build custom Kraken database

Because the standard Kraken database doesen’t include fungi and protozoa, which I want to include in my analysis.

I found refseq2kraken, which facilitates the downloading and preparation of Kraken database.

path2kraken=$TOOLDIR/kraken.1.1

DBNAME=Kraken_DB_201806

# install refseq2kraken
git clone https://github.com/sschmeier/refseq2kraken.git refseq2kraken
cd refseq2kraken

# Download refseq => here only "Complete Genome"
# assemblies, e.g. the default
python getRefseqGenomic.py -p 8

# convert to kraken format => again only "Complete Genome"
# assemblies here, e.g. the default
python getKrakenFna.py -p 8 $DBNAME

# build a new minikraken database
# download taxonomy
kraken-build --download-taxonomy --db $DBNAME

# for each branch, add all fna in the directory to the database
for dir in bacteria viral archaea fungi protozoa; do
    find $DBNAME/$dir/ -name '*.fna' -print0 | xargs -0 -I{} -n1 -P8 kraken-build --add-to-library {} --db $DBNAME;
done

# build the actual database
kraken-build --build --db $DBNAME

# remove intermediate files
kraken-build --clean --db $DBNAME

In the following part of this note, this library will be refered as to ‘Non-Masked library’.

This post Download refseq-genomic data and prepare it for Kraken is also helpful.

Mask low-complexity regions

I noticed a pull request of kraken: Fixed human genome downloading and added auto-masking feature using dustmasker, and realized that I should mask the library! So I re-generated the kraken library (I don’t know why the author of kraken didn’t mention that in their docs.)

Here is the new script:

#!/bin/sh

path2kraken=$TOOLDIR/kraken.1.1
path2refseq2kraken=$TOOLDIR/refseq2kraken
path2dustmasker=$TOOLDIR/ncbi-blast-2.7.1+/bin/dustmasker

DBNAME=Kraken_DB_abfpv_1806

# Download refseq => here only "Complete Genome"
python $path2refseq2kraken/getRefseqGenomic.py -p 8

# convert to kraken format => again only "Complete Genome"
python $path2refseq2kraken/getKrakenFna.py -p 8 $DBNAME

# build a new minikraken database
$path2kraken/kraken-build --download-taxonomy --db $DBNAME

# for each branch
# filter with Dustmasker and convert low complexity regions to N's with Sed (skipping headers)
# and add all fna to the database
for i in `find $DBNAME \( -name '*.fna' -o -name '*.ffn' \)`
do
    $path2dustmasker -in $i -infmt fasta -outfmt fasta | sed -e '/>/!s/a\|c\|g\|t/N/g' > tempfile
    $path2kraken/kraken-build --add-to-library tempfile --db $DBNAME
done

# qsub jobs below to the computer nodes
# build the actual database
$path2kraken/kraken-build --build --threads 20 --db $DBNAME

# remove intermediate files
$path2kraken/kraken-build --clean --db $DBNAME

In the following part, this library will be refered as to ‘Masked library’.

Classify contigs

Non-Masked library

# classify
$path2kraken/kraken --db $Kraken_DB_201806 --threads $PPN --fasta-input $WROKDIR/flye/run2/contigs.fasta --classified-out $WORKDIR/kraken/run2/flye.run2.classified --unclassified-out $WORKDIR/kraken/run2/flye.run2.unclassified > $WORKDIR/kraken/run2/flye.run2.kraken

# report
$path2kraken/kraken-report --db $Kraken_DB_201806 $WORKDIR/kraken/run2/flye.run2.kraken > $WORKDIR/kraken/run2/flye.run2.report

The ‘classified’ sequences will be saved in ‘classified.fa’, and the ‘unclassified.fa’ will be the ‘clean’ one, which can be used for downstream analysis.

But when I looked over the ‘report’, the result quite disappointed me.

$ head -5 flye.run2.report
 31.30	4647	4647	U	0	unclassified
 68.70	10199	435	-	1	root
 51.55	7653	166	-	131567	  cellular organisms
 47.44	7043	471	D	2	    Bacteria
 24.67	3662	19	P	1224	      Proteobacteria

This means about 70% contigs were contaminated!

I was curious about the percentage of contaminant lengths of each read. Then I used a simple Python script to count that.

Then I used R to visualize the relationship between contig lengths and the contaminated lengths. There were 14846 contigs, among which 12004 (80.86%) contigs, 2788 (18.78%) contigs and 54 (0.36%) contigs has a contaminated rate of blow 1%, between 1% and 10%, and more than 10% respectively.

Masked library

# classify
$path2kraken/kraken --db Kraken_DB_abfpv_1806 --threads $PPN --fasta-input $WROKDIR/flye/flye/run2/contigs.fasta --classified-out $WROKDIR/flye/run3/flye.run2.classified --unclassified-out $WROKDIR/flye/run3/flye.run2.unclassified > $WROKDIR/flye/run3/flye.run2.kraken

# report
$path2kraken/kraken-report --db Kraken_DB_abfpv_1806 $WROKDIR/flye/run3/flye.run2.kraken > $WROKDIR/flye/run3/flye.run2.report

And about 17% of contigs were classified as ‘contaminant’.

$ head -5 flye.run2.report
 83.49	12395	12395	U	0	unclassified
 16.51	2451	15	-	1	root
 13.53	2009	0	D	10239	  Viruses
 13.46	1999	0	-	35237	    dsDNA viruses, no RNA stage
 13.44	1996	0	F	10482	      Polydnaviridae

There were 14846 contigs, among which 14766 (99.46%) contigs, 61 (0.41%) contigs and 19 (0.13%) contigs has a contaminated rate of blow 1%, between 1% and 10%, and more than 10% respectively.

Replace contaminant regions with N

My collaborator asked me to give him an assembly draft for meta-genomic study, which a ‘clean’ one was needed. Removing all the contaminated sequences was not feasible, since there would be few contigs left.

Then I checked the output of Kraken, which contained the detailed information of each contig:

$ head -3  flye.run2.kraken
C	contig_1102	118110	395511	0:71251 28874:1 1:73 35237:1 1241371:1 1:1 694430:4 131567:16 1:15 0:376 158:2 0:1 1:33 2:1 1783272:1 1769:1 0:5975 633697:1 1783272:1 273035:1 1:19 1307:1 0:18402 2:1 1:39 0:14 85655:2 0:152 78219:2 1:18 0:8591 679926:1 131567:2 1:12 1118964:3 0:2539 523841:1 1:31 0:3050 1:27 0:3370 1:26 1267001:1 0:1341 523841:1 1:5 131567:1 1783272:1 2100:2 0:75636 1:38 0:2 864702:2 0:6975 1428:4 131567:1 1:114 1783272:1 203124:1 0:10293 1353243:1 0:1 1:32 2:1 1313292:1 0:48537 1783272:1 273035:1 1:11 2:1 1313292:1 0:10283 1161:1 1507806:1 1:91 131567:1 0:611 1:13 131567:16 694430:4 1:1 1241371:1 35237:1 1:49 35237:3 10401:3 0:628 1605721:1 0:8 1:6 0:20 1605721:8 1:25 0:15 1:83 0:16 1605721:1 0:3 1605721:8 1:47 10239:2 12315:8 0:9 1605721:8 1:100 0:15 1:47 0:15 1:53 0:7 1:22 0:1 1:53 0:7 1:36 35237:1 0:3604 523841:1 1:51 2:2 444612:1 0:667 1654582:4 1:3 118110:9 0:3597 46170:1 1783272:1 1:7 0:25607 118110:7 131567:2 2:1 1898474:2 1783272:2 1:22 2:3 0:732 273035:2 1:35 0:16896 1:23 273035:1 0:5012 28890:2 2207:2 1:30 28890:1 2209:3 0:11916 118110:6 1:3 118110:1 1:13 320432:1 0:13148 39152:3 1:15 0:669 1105113:1 320432:1 1:29 1654582:1 0:6006 10371:3 2:1 1:41 0:5940 85655:2 0:6 864702:2 0:2 1:34 0:14 85655:2 0:301 1307:2 1783272:1 1:2 35237:1 1:2 1654582:1 0:5603 1428:1 1:23 0:4788 456320:1 1:23 0:169 2:2 131567:1 1:10 131567:1 1783272:1 0:1735 10335:1 1:28 118110:1 1:3 118110:4 0:11831 118110:3 131567:2 2:1 1898474:2 1783272:2 1:19 0:1542 118110:3 1:3 118110:1 1:19 216946:1 0:452 444612:1 2:2 1:15 0:1 66266:1 0:1790 118110:8 131567:2 2:1 1898474:2 1783272:2 1:11 0:1 66266:1 0:319 375175:2 0:328 1:17 131567:1 1769:1 0:2798
U	contig_11021	0	4721	0:4691
C	contig_11022	118110	43806	0:12538 1:19 131567:1 444612:1 0:5305 118110:9 131567:2 2:1 1898474:1 0:149 1:17 39640:1 0:25732

Because Kraken uses a k-mer mapping strategy to locate potential contaminants, there would be many contigs within which only a small part could be aligned to bacteria/virul sequences. So I retrieved the precise k-mers that mapped to contamination, and masked them with ‘N’. I discussed this with another guy: Questions about de novo genome assembly from mixed DNA samples, and he also thought it’s a vivid approach.

Finally

But I began to think how the contaminanted sequences would affect the assembly process, and maybe removing contaminants from raw data is a good way. See discussions here: Filtering for contamination when assembling a genome, before or after assemby? and Question: How to remove contamination from the transcriptome assembly.

Useful links

Change log

20180424: create the note.
20180620: add the “refseq2kraken” part.
20180808: add “Mask low-complexity regions” and respective parts.