Remove Microbial Contamination in Reads

Purpose in short: I’ve got both illumina (PE and MPE) and PacBio reads of an insect for de novo geome assembly. Since whole bodies were used for DNA extraction, I thought there was some contamination in the raw sequencing reads, so I want to remove them before assembling.

The main tool for illumina reads I used was BBDuk, which belongs to the BBTools suite. Seal and BBSplit can also do this, but BBDuk is the most suitable for my purpose. See discussions here: Question: How to remove contamination from the transcriptome assembly. BBTools is quite versatile, fast, and convenient! Though now I don’t fully understand every kit in it, the experience was very good from the tools I’ve tried. The author Brian Bushnell is very active and responsive. Here is a handy summary of BBMap: Yes … BBMap can do that!.

But for PacBio long reads, I didn’t find any tools specialized for that. I tried several tools including mapPacBio.sh (based on bbmap, and is tuned for the error profile of long reads.), blasr, MashMap, and minimap2. At last, I found minimap2 best met my needs.

Prepare the contamination library

Download all the sequences of bacteria, viral, fungi, protozoa, and archaea using refseq2kraken. See note Detect Microbial Contamination in Contigs by Kraken for details. We can also use ncbi-genome-download to download them.
1
2
3
# folder structures
$ ls genomes/refseq/
archaea bacteria fungi protozoa viral
Merge all the sequences into one single Fasta.
Since I’ve got the mitochondrion DNA sequences of the target organism, I added them in the contaminant library to remove reads from mtDNA too. This library is refered as to $contaminants below.
After discussing with the author of MashMap, I included several insects’ genomes of the same order of my target insect to improve the specificity of aligners. This library is refered as to $contaminants2.

BBDuk for short reads

BBDuk is a member of the BBTools package. “Duk” stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools.

WORKDIR=/parastor300/niuyw/Project/beetle_genome_171231/data/second

ANNODIR=/home/niuyw/RefData
TOOLDIR=/home/niuyw/software
path2java=$TOOLDIR/jre1.8.0_111/bin/java

DATADIR=/home/niuyw/Project/beetle_genome_171231/data/second

PPN=24

path2bbduk=$TOOLDIR/bbmap/bbduk.sh
contaminants=$ANNODIR/NCBI_contaminants/contaminants_4_bettle_genome.fa

for sample in 270B 500B 800B 3k_1 5k-1 5k-2 10k
do
$path2bbduk ref=${contaminants} threads=${PPN} ordered=t k=31 in=${DATADIR}/TrimGalore/${sample}_R1_val_1.fq.gz in2=${DATADIR}/TrimGalore/${sample}_R2_val_2.fq.gz out=${DATADIR}/bbduk/${sample}.R1.fq out2=${DATADIR}/bbduk/${sample}.R2.fq outm=${DATADIR}/bbduk/${sample}.bad.R1.fq outm2=${DATADIR}/bbduk/${sample}.bad.R2.fq
done

The ‘good’ reads will be in ${sample}.R*.fq, and the ‘bad’ ones will be in ${sample}.bad.R*.fq.

Long reads

mapPacBio

mapPacBio.sh is based on bbmap, and is tuned for the error profile of long reads.

WORKDIR=/parastor300/niuyw/Project/beetle_genome_171231

ANNODIR=/home/niuyw/RefData

TOOLDIR=/home/niuyw/software
path2java=$TOOLDIR/jre1.8.0_111/bin/java

DATADIR=/home/niuyw/Project/beetle_genome_171231/data/third

PPN=24

path2bbmap=$TOOLDIR/bbmap
contaminants=$ANNODIR/NCBI_contaminants/contaminants_4_bettle_genome.fa

$path2bbmap/mapPacBio.sh threads=${PPN} ref=${contaminants} in=third_all.fasta maxlen=6000 out=bbmap.sam
$path2bbmap/reformat.sh unmappedonly in=bbmap.sam out=good.fa

The output of bbmap or mapPacBio.sh is SAM, and both mapped and unmapped reads are saved in one file. reformat.sh was used to extract mapped reads and transfrom it to FASTA. See this: Question: bbmap command to extract mapped and unmapped pair end reads.

But mapPacBio was very very very slow, even I used 24 threads. After about 25 days, the ouput was about 4.6G (my input was about 60G!), though it was still increasing, I killed the job.

blasr

blasr may be the one of the first long-read aligner ^[1]. So I gave it a shot.

1	$TOOLDIR/anaconda2/bin/blasr third_all.fasta $contaminants --nproc $PPN --out blasr.bad --unaligned blasr.good

Related: output unmapped reads.

But it went wrong:

1 2	[INFO] 2018-07-02T18:56:18 [blasr] started. ERROR! Reading fasta files greater than 4Gbytes is not supported.

What?! My fasta was about 60G. If I wanted use blasr, I had to split the input into small ones, but I didn’t want to.

MashMap

MashMap is a fast approximate aligner for long DNA sequences ^[2]. It’s very fast and is easy to use.

1 2	PPN=8 $TOOLDIR/mashmap-Linux64-v2.0/mashmap -r $contaminants -q third_all.fasta -o mashmap.out

-s is the minimum query length (default is 5000), and --pi is the minimum identity to be reported (default is 85).

The output of MashMap is like this. Separated by space, it is query name, length, 0-based start, end, strand, target name, length, start, end and mapping nucleotide identity in turn.

1
2
3

m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104/24935_31494 6559 0 4999 - NC_037282.1 2038340 596479 601478 82.1711
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/357/31541_41029 9488 0 4999 - NC_004326.2 1343557 117176 122175 81.9933
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/562/0_12626 12626 7626 12625 + NC_001224.1 85779 50909 55908 82.2314

Here is also a thread talking about contamination: Decontamination of bacterial sequences in an assembly. The author suggested that “One more thing that may be helpful for you is to include the representative genome (corresponding to your assembly) in the database as well. This would help improve the specificity of the method for correct portions of your assembly.”

But, the default parameters MashMap use are a bit loose, I wanted to use more strict parameters. When I lowered the -s and --pi, the program needed huge RAM and couldn’t run on our machine.

I’ve reported this issue to the author: Decontamination of bacterial sequences in an assembly.

And, there were also some questions about its alignment: Questions about the alignment of MashMap.

I first ran MashMap with different parameters and got several outputs:

# run1, with default parameters
$path2mashmap -r $contaminants -q third_all.fasta -o mashmap.out

# run2, with -s 2500 --pi 80
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 2500 --pi 80 -o mashmap2.out

# run3, with -s 500 --pi 85
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 500 --pi 85 -o mashmap3.out

And the outputs from three runs varied:

# there are 6633142 sequences of input
$ grep -c '>' third_all.fasta
6633142

# run1
cut -f 1 -d ' ' mashmap.out |sort|uniq|wc -l
463569

# run2
cut -f 1 -d ' ' mashmap2.out |sort|uniq|wc -l
2821004

# run3
cut -f 1 -d ' ' mashmap3.out |sort|uniq|wc -l
6189307

As can be seen, nearly all the sequences were aligned to contaminant library. That really shocked me!

Then I checked the top 10 sequences with highest identity and top 10 ones with loweset identity from the first run using blastn. The highest ones were fine. There were some differences between hits reported by blastn and MashMap, but maybe it’s because they used different databases. But the loweset ones were problematic. Most of them were ‘No significant similarity found’ when default parameters of blastn were used. And when I unselected ‘Low complexity regions’, the alignments were unreliable. There maybe something with ‘low complexity regions’ or ‘repeat’ things.

And the author explained:

Mashmap identity is an estimate based on Jaccard similarity- not the precise identity; unfortunately the Jaccard-similarity based metric delivers poor specificity in cases when the source of reads is absent from database. See if you can include an insect reference genome (s) in the reference list to avoid this.

Then I added several insects’ genomes of the same order of the target insect into the contaminant library and ran MapshMap again:

1	$path2mashmap -t 20 -r $contaminants2 -q third_all.fasta -s 500 --pi 80 -o mashmap4.out

For each read, it will be one of two states: mapped or unmapped, and for the mapped, it will be one of three states: mapped to insects (good), mapped to contaminants (bad) and mapped to both (ambivalent). So I counted reads in each categories:

No. of total reads: 6633142
  No. of reads in the mashmap.output: 6214500
    No. of good: 853515
    No. of bad: 138799
    No. of ambivalent: 5222186
  No. of reads not in the mashmap.output: 418642

As can be seen, majority of reads had been mapped ambivalently, and here is a example of such reads:

m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3500 4015 + NW_017852934.1 2683736 1681820 1682319 79.4204
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1000 1999 + NW_019280650.1 1003565 813077 813577 78.0766
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 2500 2999 - LJIG01019880.1 38067 34865 35364 79.3626
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 500 999 + NC_007418.3 31381287 24981081 24981580 79.5573
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3000 3499 - kraken:taxid|76857|NZ_CP022123.1 2521394 1537365 1537864 81.3159
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 0 499 - kraken:taxid|1202539|NC_018417.1 157543 40864 41363 79.4397
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1500 2499 + kraken:taxid|1936081|NZ_CP019389.1 3752836 1813582 1814081 76.7726

So it’s very hard to extract good ones.

After all this, I got two points:

I should add some insects’ genomes into the library to reduce the false positive hits.
Alignment from MashMap maybe not so reliable ().

minimap2

minimap2 is a versatile pairwise aligner for genomic and spliced nucleotide sequences created by Heng Li ^[3]. It’s easy to use and runs very fast.

mimimap2

PPN=24
$TOOLDIR/minimap2-2.8_x64-linux/minimap2 -x map-pb $contaminants third_all.fasta -t $PPN -a -Q > minimap.sam
# -a: output the SAM format
# -Q:not output base quality in SAM

I had about 60G input, and minimap2 was so fast, finished after 77 CPU hours (4 real hours, 24 threads).

The output was very huge (~685G!), and it’s like this (truncated, and the sequences were replaced by ‘seq’.):

m54174_171023_074758/9962478/22561_26454  4 * 0 0 * * 0 0 seq *
m54174_171023_074758/9962478/22561_26454  4 * 0 0 * * 0 0 seq *
m54174_171023_074758/9962478/22561_26454  16  kraken:taxid|1094466|NC_017025.1  1304571 1 3661S5M2D6M4I8M1D4M1D13M1D11M1I16M1D13M1D13M1D27M1D4M1D4M2I4M2D12M2D9M1D2M1D13M1D12M1D14M1D13M1D6M2I7M1D7M  * 0 0 seq * NM:i:43 ms:i:220  AS:i:220  nn:i:0  tp:A:P  cm:i:3  s1:i:40 s2:i:0  dv:f:0.0200
m54174_171023_074758/9962478/22561_26454  4 * 0 0 * * 0 0 seq *
m54174_171023_074758/9962478/22561_26454  16  kraken:taxid|1323664|NZ_CP012748.1  2198444 1 903S12M1I12M1I12M1I4M1I6M1I2M1I12M1I39M1I9M1I6M5I4M4I7M1I2M1I5M1D6M1I12M1I12M1I12M1I13M1I11M2I5M1D8M2I22M1I13M1I23M1I12M1I12M1I10M1I3M2I10M3I5M1I6M1I3M1I16M1D6M1I12M2I6M1D3M1I2M1I7M1D4M1I5M1I7M1I12M1I6M7I3M1I3M1D6M1I11M1I12M1I12M2464S  * 0 0 seq * NM:i:95 ms:i:432  AS:i:432  nn:i:0  tp:A:P  cm:i:4  s1:i:49 s2:i:77 dv:f:0.0701 SA:Z:kraken:taxid|1323664|NZ_CP012748.1,2198449,-,1461S464M47I1921S,15,95;kraken:taxid|1323664|NZ_CP012748.1,2198443,-,1853S470M47I1523S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198444,-,683S469M51I2690S,4,106;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,2305S464M17I1107S,16,105;kraken:taxid|1323664|NZ_CP012748.1,2198456,-,2607S450M11I825S,22,93;kraken:taxid|1323664|NZ_CP012748.1,2198451,-,430S460M63I2940S,3,105;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,13S464M94I3322S,11,133;
m54174_171023_074758/9962478/22561_26454  2064  kraken:taxid|1323664|NZ_CP012748.1  2198449 15  1461H7M1I12M1I7M1D4M1I12M1I4M1D10M1I10M1D24M1I18M2I9M1I8M1I12M2I9M1I2M6I7M1I13M5I5M1I7M1D4M1I10M1I2M1I12M1I12M1I12M2I13M1I11M1I11M1I5M1I7M1I5M1I8M1I12M1I5M1D4M1I2M1I4M2I2M1I7M5D5M1D7M2I21M1I24M4I8M1I12M1I12M1I4M1I20M1I12M1921H  * 0 0 seq * NM:i:95 ms:i:420  AS:i:420  nn:i:0  tp:A:P  cm:i:6  s1:i:80 s2:i:0  dv:f:0.0526 SA:Z:kraken:taxid|1323664|NZ_CP012748.1,2198444,-,903S469M57I2464S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198443,-,1853S470M47I1523S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198444,-,683S469M51I2690S,4,106;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,2305S464M17I1107S,16,105;kraken:taxid|1323664|NZ_CP012748.1,2198456,-,2607S450M11I825S,22,93;kraken:taxid|1323664|NZ_CP012748.1,2198451,-,430S460M63I2940S,3,105;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,13S464M94I3322S,11,133;
m54174_171023_074758/9962478/22561_26454  272 kraken:taxid|1323664|NZ_CP012748.1  2198445 0 1010S9M1I2M1I4M1I8M1I12M1I5M1D6M1I12M1I39M1I11M2I3M1I10M1I22M1I13M1I23M1I12M1I12M1I10M1I3M2I11M1I1M2I3M1I6M1I3M1I16M1D6M1I13M4D7M1D5M2I7M1D4M1I5M1I6M1I13M1I6M1I9M1I2M1D7M1I11M1I10M1I15M1I10M1I13M1I12M1I12M1I7M1D4M1I13M2386S * 0 0 * *   NM:i:86 ms:i:418  AS:i:418  nn:i:0  tp:A:S  cm:i:6  s1:i:77 dv:f:0.0589
m54174_171023_074758/9962478/22561_26454  2064  kraken:taxid|1323664|NZ_CP012748.1  2198443 1 1853H19M1I22M4I8M1I12M1I38M1I18M2I9M1I3M1I5M1I12M1I24M1I12M2I24M2I3M1I3M1D5M3I12M1I13M2I3M1I8M1I11M2I8M1D3M1I2M1I11M1I12M1I12M1I5M1D3M1D2M1I12M1I12M1I13M2I1M3I4M1I10M1D14M2I2M1I4M1D4M1I4M1I8M1I5M1D4M1I2M1I12M2I12M1I7M1D10M1523H * 0 0 seq * NM:i:95 ms:i:414  AS:i:414  nn:i:0  tp:A:P  cm:i:7  s1:i:60 s2:i:97 dv:f:0.0466 SA:Z:kraken:taxid|1323664|NZ_CP012748.1,2198444,-,903S469M57I2464S,1,95;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,1461S464M47I1921S,15,95;kraken:taxid|1323664|NZ_CP012748.1,2198444,-,683S469M51I2690S,4,106;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,2305S464M17I1107S,16,105;kraken:taxid|1323664|NZ_CP012748.1,2198456,-,2607S450M11I825S,22,93;kraken:taxid|1323664|NZ_CP012748.1,2198451,-,430S460M63I2940S,3,105;kraken:taxid|1323664|NZ_CP012748.1,2198449,-,13S464M94I3322S,11,133;

We can clearly found that the same sequence had been mapped to different reference sequences, and it’s been reported several times, with the exactly same line content.

This is a limitation of minimap2, as reported: Multiple empty hits? and How does using a multi part index affect the accuracy?. Specifically, when a huge reference is used, minimap2 will split it into multiple parts and align all queries against each part independently. For most parts, minimap2 will print unmapped records.

The good news is that Minimap2-2.12 (r827) had addressed this bug.

minimap2-arm

minimap2-arm is a solution provieded by Hasindu Gamaarachchi, and it merges the results from a multi-part index to achieve a considerably similar output from a single-part index. So I cloned this modified minimap2 to run the job anain.

# install
git clone https://github.com/hasindu2008/minimap2 minimap2-arm && cd minimap2-arm && git checkout multipart-merge-tmp && make

# run
$TOOLDIR/minimap2-arm/minimap2 -x map-pb -I 500G -t $PPN -a -Q --multi-prefix tmp $contaminants2 third_all.fasta > minimap2.sam
# --multi-prefix: enable mergine
# -I: split index for every ~500G input bases, this number is far more than the reference.

I counted reads in each categories like I did in MashMap part:

No. of total reads: 6633142
  No. of reads in the SAM: 6633142
    No. of mapped: 661646
      No. of good: 490150
      No. of bad: 125390
      No. of ambivalent: 46106
    No. of Unmapped: 5971496
  No. of reads not in the SAM: 0

Then I kept the ‘unmapped’, ‘good’ and ‘ambivalent’ reads for downstream analysis.

To parse the result and get the clean sequences, I used a simple python script to extract clean fasta and bad fasta ids. (appendix 1)

To validate the effciency, I checked several sequences by blastn manually.

Belows are two examples.

This is the alignment of ‘m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873’ by minimap2.

$ grep 'm161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873' minimap2.sam
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 16  kraken:taxid|66084|NC_012416.1  1207118 5 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D5M1I3M2D18M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I11M1I2M2D6M2D3M1D31M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I5M1D3M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I28M1D3M1D3M1I16M1D2M1D3M1I6M1I14M2I12M1D28M1D3M1D6M2I5M1I32M1D8M1D25M1D8M1D30M1I11M1I1M1I16M1D14M1D17M1D19M1I32M1D14M1D6M1D12M1I50M1D26M1I8M1D3M1D12M1D13M1I8M1D18M1D12M1I24M1D4M2D6M3I10M1D3M2D18M1D3M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D9M1I17M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D3M1I11M1D6M1I1M1I9M1I14M2I6M2I14M1D2M1D6M1D9M2D1M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S  * 0 0 seq * NM:i:424  ms:i:2078 AS:i:2078 nn:i:0  tp:A:P  cm:i:28 s1:i:477  s2:i:474  dv:f:0.0866 SA:Z:kraken:taxid|66084|NC_012416.1,1360016,-,616S1173M32I3052S,9,195;
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|1236909|NC_021089.1  1067958 0 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I11M1I2M2D6M2D3M1D31M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I5M1D3M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I28M1D3M1D3M1I16M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D14M1D6M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I35M1I9M1D6M1I22M1I11M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D9M1I17M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I4M1D2M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D13M2D4M1D6M1D7M2D5M2I11M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I9M1I14M2I6M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I12M1D1M1D8M1D8M1I3M1I1M1I5M1I18M1I17M9S * 0 0 *   NM:i:435  ms:i:2028 AS:i:2028 nn:i:0  tp:A:S  cm:i:28 s1:i:474  dv:f:0.0866
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|225364|NZ_LK055284.1 991010  0 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I3M1I3M1D4M1I2M2D6M3D34M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I6M1D2M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I24M1D7M1D4M1I15M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D12M1D8M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D10M1I16M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I3M1I14M2I3M1I4M1D4M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S *   NM:i:444  ms:i:1984 AS:i:1984 nn:i:0  tp:A:S  cm:i:24 s1:i:437  dv:f:0.0923
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|163164|NC_002978.6 1039834 0 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I3M1I3M1D4M1I2M2D6M3D34M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I6M1D2M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I24M1D7M1D4M1I15M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D12M1D8M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D10M1I16M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I3M1I14M2I3M1I4M1D4M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S * 0   NM:i:444  ms:i:1984 AS:i:1984 nn:i:0  tp:A:S  cm:i:25 s1:i:452  dv:f:0.0908
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|1633785|NZ_CP011148.1  1039878 0 2514S9M1D7M1D9M1I8M1D15M1I6M1I7M1D5M1D15M1D8M1I4M1D11M1D12M1I17M1D5M1D4M1D8M7D4M3D15M1I1M1I33M1I13M2I13M1D6M1I29M1I4M1D26M1D8M1I27M1D3M1D12M1I4M1I8M1I8M1I12M1D3M1D8M1I16M1I12M1I3M1I3M1D4M1I2M2D6M3D34M1D9M2I33M1D16M1I15M1D11M2D9M1D25M2I6M1D2M2I11M1D3M1D14M1I8M1D4M1D19M4I29M1D11M1D8M1D2M1D5M1I2M4I9M1D15M1I19M2I7M1I6M1I24M1D7M1D4M1I15M1D2M1D3M1I6M1I18M1I3M1I5M1D28M1D3M1D6M2I5M1I32M1D3M1D30M1D8M1D30M1I11M1I1M1I16M1D14M1D22M1D14M1I32M1D12M1D8M1D13M1I49M1D26M1I8M1D3M1D10M1D15M1I8M1D18M1D12M1I25M1D3M2D6M3I10M1D3M2D17M1D4M1I5M1D5M1I3M2D15M1I5M1I7M1I9M1D2M3I31M1I13M1D6M1I22M2D2I3M1I6M1I10M1I5M1D6M2D10M1I16M1I13M1I3M1I13M1I3M3I2M1I11M1I7M1D4M2I8M1D10M1I16M1I3M1I7M2I3M1I20M1D2M3I5M2I5M1I13M5I3M2I6M1I17M2I3M1I5M1D10M2I11M1D13M1D2M1I8M1D9M1D8M1I3M1I12M1I3M1D3M1D5M1I6M1I8M1I5M1I5M1I10M1I4M1I8M1I3M1I7M1I5M1D8M1D8M1I6M1D10M2I9M1I1M1I9M1D10M1D3M1D4M1D6M1D7M2D5M2I5M2I2M2D2M1I16M1I5M14I4M2D4M1I10M1D6M1I1M1I3M1I14M2I3M1I4M1D4M2I14M1D2M1D6M1D8M2D2M1D22M1D12M1D7M1I13M2D8M1I3M1I15M1I18M1I17M9S *   NM:i:446  ms:i:1972 AS:i:1972 nn:i:0  tp:A:S  cm:i:25 s1:i:452  dv:f:0.0908
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 2064  kraken:taxid|66084|NC_012416.1  1360016 9 616H6M1I3M1I9M1D12M1I3M1I4M1I1M1I6M3I5M1D10M1D2M1D4M1D4M1I28M2I7M4I3M2I10M1I1M1I4M1D4M6I7M1I6M1I9M1I2M1I13M1I1M1I8M1D4M1I8M1I15M1D3M1I15M1D5M1I1M3D4M1I10M1I8M2I7M1I9M1D4M1I7M1I12M2D11M1I27M1D21M1D5M2I23M5I3M1I9M1I6M3I14M1I6M1D3M1D43M1I2M1D4M1D8M1D21M2I16M1I7M1D5M1D5M1I11M1I21M1D6M1D2M1D7M2I3M1I15M3I2M1I1M2I6M2I4M1I6M1D9M1D62M1I10M1D3M1I11M2D7M1D13M1I4M1I10M1D4M1I8M1D15M1D12M1D10M1D10M1D10M1I12M1D2M2D1M1D14M1I8M1D6M1D11M2D15M1I13M1I14M1I8M1I32M1D4M1I18M1D30M2D5M1D10M1D5M1D11M1D13M1D6M1I7M1I4M1D13M1I1M2D18M2D8M1I8M1I3M1I7M3052H * 0 0 seq * NM:i:195  ms:i:1194 AS:i:1194 nn:i:0  tp:A:P  cm:i:21 s1:i:287  s2:i:284  dv:f:0.0502 SA:Z:kraken:taxid|66084|NC_012416.1,1207118,-,2514S2299M51I9S,5,424;
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/28/0_4873 272 kraken:taxid|1633785|NZ_CP011148.1  1051218 0 616S6M1I3M1I9M1D12M1I3M1I4M1I1M1I6M3I5M1D10M1D2M1D4M1D4M1I28M2I7M4I3M2I10M1I1M1I4M1D4M6I7M1I6M1I9M1I2M1I13M1I1M1I8M1D4M1I8M1I15M1D3M1I15M1D7M2D4M1I10M1I8M2I7M1I9M1D4M1I7M1I12M2D11M1I27M1D29M1I21M5I8M1I4M1I6M3I14M1I6M1D3M1D43M1I2M1D4M1D8M1D21M2I16M1I7M1D5M1D5M1I11M1I21M1D6M1D2M1D7M2I3M1I15M3I2M1I2M4I9M1I6M1D9M1D62M1I10M1D3M1I11M2D5M1D15M1I4M1I10M1D4M1I8M1D15M1D12M1D10M1D10M1D10M1I12M1D2M2D1M1D14M1I8M1D6M1D11M2D15M1I13M1I14M1I8M1I32M1D4M1I18M1D30M2D5M1D10M1D5M1D11M1D13M1D6M1I7M1I4M1D13M1I1M2D18M2D8M1I8M1I3M1I7M3052S * 0 0 * * NM:i:200  ms:i:1164 AS:i:1164 nn:i:0  tp:A:S  cm:i:22 s1:i:284  dv:f:0.0655

This is the results of blastn, and I marked them with taxonomy ID. The results agreed well with that of minimap2.

Here is another example.

1
2
3

$ grep 'm161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902' minimap2.sam
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902  16  kraken:taxid|573570|NZ_CP016796.1 218142  29  390S6M1D15M1D10M3I6M1D15M1I3M2I11M3I4M1I14M3I4M1I9M1D7M1D3M2I9M3I20M2I6M1D15M1I3M2I17M3I9M2I4M1I5M1D11M3I9M3I4M4I5M1I10M1I12M1I6M3D10M2D5M2D12M2I11M3177S * 0 0 seq * NM:i:85 ms:i:178  AS:i:178  nn:i:0  tp:A:P  cm:i:5  s1:i:93 s2:i:52 dv:f:0.0291 SA:Z:kraken:taxid|573570|NZ_CP016796.1,218159,-,296S287M9I3310S,22,82;
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902  2064  kraken:taxid|573570|NZ_CP016796.1 218159  22  296H9M3I6M4I7M1D11M3I3M1I3M1D11M3I8M1I10M3I6M1D11M3I13M1I5M3I6M1D15M1I3M2I11M4D4M5D7M5D11M3I4M1I9M1D7M1D3M2I6M3I23M2I6M1D17M5D4M2D5M2D13M3310H  * 0 0 seq * NM:i:82 ms:i:178  AS:i:178  nn:i:0  tp:A:P  cm:i:7  s1:i:166  s2:i:120  dv:f:0.0235 SA:Z:kraken:taxid|573570|NZ_CP016796.1,218142,-,390S304M31I3177S,29,85;

And, I got different results using blastn.

And the detailed alignment of the top 1.

But when I checked the raw sequence, there are lots of repeats. I don’t know what it is.

>m161109_080520_42256_c101052872550000001823247601061737_s1_p0/796/0_3902 RQ=0.860
CAGAAGATGAGATTTAAATTGCATCCCAATTGTCATATATAATTAATCATATAAATTATATATGTGTAAG
TTTTTTTTTTTTGTGAACATTATAGAAATAATAATAATATTCCATGGCCCTGACAAAGGGCCGAGTATTT
ACTCAATTTGGACAAAGAATAAAAAATTTAAGCCTTCCTTTCAGTTTCTTGGGCAACAGAACATATTTGG
TAAAACACCACCTTGTGCAATCGTACACCAAACAGAAGTTTAATTTATTCTTCGTCCACCATCATACGAA
TTGCAATTGCAAATGTCTATGGAATAATACGAGTTTTCTTGTTGTCCACAACTCTTATGAGCAGCATTAC
CGCGCCAATTCTAGTACTATCAAGCAGCTAAATATTACCCATGACAGCAGCCAAATCATAAACTGGTGCA
CCAGCTCCAACACGTTCAGCATAATTTCCTTTGAAAAAATTTAAAAATTTTGAAAAATTCAAAATTTTGA
AAAATTCAAAAATTTTGAAAAATTACAACAAATTTGAAAAAAATTCAAAAATTTGAAAAAATTCAATAAA
ATTTTGAAAAAATTCAAAAAAATTTTAAAAAATCTCAAAAATTTTGAAAATTCATAAATTTTGAAAAAAT
TCAAAACCTTTTGAAAACAAAATCAAAATTTGAAAAAAATTCAAAAATTTTGAAAAATTCAAAAATTTGA
AAAACCATTCAAAAATTTTGAAAAAAAATTCAAAAATTTTGAAAAAATTCAAAAATTTTGAAATAAAAAA
TTTCAAAAATTTTGAAAAAAATTCAAAAATTTTGAAAAATCTCAAAAAATTTTGAAAAAGTTCAATGAAA
TTTTGAAAAAGTTTAAAAGATTTTGAAAAAAATACCAAAAATTTTGAAAAAATCAAGAATTTTGAAAAAA
ATTCCAAAATTTTGAAAAAATTCAATTAAATTTTGAAAAATTCAAAATCTTTGAAAAAAATTCAACAAAT
TTTGAAAAAATTCACAAAAATTTTGAAAAAATTCAAAAATTTTGAAAAATTCAAAAATTTTTGAAAAAAC
AAACTCAAAATTTTGAAAAAATTCAAAAATTTGAAAAAATTCAAAAAATTTGAAAAAATTCAAAAACTTT
TGAAAAAAATTCAAAAATCTTTGAAAAAATTCAAAAATTTTTGAAAAAATTCAAAAATTTTGAAAAAAAT
TCAAAAATTTTGAAAAAATTCAAAAATTTTTGAAAAAATCAAAAATTTGAAAAAATTCCAAAAATTTTTG
AAAAAATTCAAAAAATTTTCGAAAAATTCAATTTTGAAAAATTCAAAAATTTACTGAAAAAATTTCAAAA
ATTTTGAAAAAATTCAATAAATTTTGAAAAAATTCAAACATTTTGAAAAATTCAAAAATTTTGAAAAAAT
CAACAAATTTTTGAAAAAATTTCAAAAATTTTTGAAAAATCAAAAAATTTTTTGAAAAAAATTCACGCCA
AATTTTGAAAAAATTTAAAAATTTTGAAAAAATTAAATAAATTTTAAAATTTGTATGATTTTTCAAATTT
TTGATAAGTTTTCATTTTGAAAAATTTAAATTTTTTCAAAATTTTTTGAATTTTTTCAAAATTTTGAATT
TTTTCACCCAATTTTTGAATTTTCTTTCAAAAAATTTTTAGAAATTAACAAATAACCTATTTCAAATTTT
TGAATTTTTCAAATTTTTGATTTTTTCAAAATTTTGAATTTTTTTTCAAAATTTTGAATTTTTAAATACA
AATTTTTGAATTTTTCAAAATTGTTTGAATTTTTATCAAAATTTTTGAATTTTTTTAAAAATTTTTGATT
TTTTCAAAGGAAATTATTGCTGAAGCCGTGTTGGAGCTGTGCACCAGTTTTATTTGGCTGCCTGTTCATG
GAACTTTTAGCTGCTGAGTACTAAGAATTGGCGGGTAATGCTGTCAGTGACAGAACAGAAAACTCGTATT
ATTTCCTAGACATTTGCAATTGGCAATTCAGTAAATGACGAACCAATTAAATAATACTTCTGTCTGGTGT
TTACGAATGCACAAGGTGAGTGTTTTACCAAATAACAAGCTGTTCTGATGCCTCAACGAAAACTGAAAGA
AGGCTTAAATTTTTTTATTCCTTTGTCCAAATTAGTAAACTACTTCGGCCCTTTTCAGGGCCATAATATT
CATTATCTATTTTCACTCAGATGTTCGCAAAAAAAAAACTTACACATATAATAATTTTATATATTAATTT
GCAAATTTGGATGCAATTTAAAATTGAATGATAAAGTGCAATGGTGTTGTCTAGTCTACAAAAATTCTAT
AAACGTACACAAAAAAAATATTCGCTAAATTGAATTGTTGATAAAAAAAATATTTTTACTTAGAAAATCT
TAAAAAAAAAAAACACAATAGATATGATTGTATATAATCAAAAAATGCTATTGAATGTAAATAATTTTTT
AACAATTTCAAATTTTTAAAAACGTTCTCGCTTAGTTATATCAAATTATTCGGATTTTGGTTTTTTTATT
ATTAATTATTATTATATTAATAATAAATTATTATTCTACTCAATTATTAAAATTTGAATAATTTTGTGGC
CTCAAAGGGCCATTTGTTTTATAATGACAATTTTATTGAAGGAATACCAAAAGAATTGAAATAAACATAC
GATTGATATATTAAAGTATTAGCGTGTTTTATTTCTTTCTTTTAGGTGAAAATGCTGCCTTCCTGGGCTT
TAGGTGAAGTATGGATGTTTTTGGCCAGTTTTATGGTTTATTTGGTGCCTTTGGTTTTTAGTTTGGACCT
TTTAGCTGATTTTTTGGCCTTTAATCGGTGAATTTTGCAGTTTGTAGTCGTACGAATGTAGTTTTAGGCA
GCAGATGGTGGTTTTTTCTTCTCGTCTTTTTTTCGGTTTAGCAGCTTTAATTGGTGATTTTTTTTGCTTT
GATTGATTTCTTACAGCCGGTTTTATGTTTTAATTGTCCAGTTTGCTAAAGTAGAATTCGTTTTACAGTT
GCAAGCTTTTTTGGGTTTTGTACCAGAAGCCCTTCTTTCTTTTTCTTTTAGAACTTTTTTTTCTTAGAAC
TTTTTTTTTTTAGAACATTTTTTTTTAGGAACTTTTTTTTTAGAACTTTTTTTAGAAACTTTTGTTATCT
TTTCTGTTTGTTCTTGTTATCTTTTTGTTTGTTTTGTTATCATTTTTTTGTTTCTTTTGGTTCTCTTTTT
GTTTGTTTTTGTTATCATTTTTTGTTTGTTTTTGGGTATCTCTTTTTTTGTTTGTTTTGTTATCTTTTTG
TTGTTTGGTTTATCTTTTCCTTTGTTTTGTTTATTTTTTTGTTTGTTTTGGTTATATTTTTGTTTGTTTG
TTATCTTTTTTGTTTTTTTTGTTATCTTTTTGTTTGTTTGTATCTTTTTTTGTTTGTTTTGTTATCTTTT
TTTGTTTGTTTTGTTATCTTTTTGTTTGTTTTTGTTATCTTTTTGTTTGTTTTGTTTATCTTTATTGTTT
GTTTTGTTATCTTTTTGTTTGTTTTGTTATCTTCTTTTGTTTGTTTTGTTATCTTTTTGCTTTGTTTTGT
TATCTTTTTGTTTTGTTTTGTTATCTTTTTGTTGTTCTTTATTCTGGTTAATCATTTTTTTGGTTGTTTT
GTTCATCTTTTTGTGTTTGTTTTTGTTATCTTTTTGTTTGTTTTGTTAATCTTTTTGTTTGCTTTGTTAT
CATTTTTTGTTTGCTTTGTTATTTTTTGTTTGCTTTGTTATATTTTTGTTGCTTTGTTATCATTTTTGTT
TGCTTTGGTTATCTTTGTTTGCTTTGTTATCTTTTTTTGTTGGCTTTGTTTATCGTTTTTGTATTTGCTT
TGTTATCGTTTTTGTTTGCTTTGTTATCTTTTTTGTTTTGCGTTTTTTTAGC

The differences between minimap2 and blastn are explainable, they used different algorithm and different databases. And generally, I think minimap2 is reliable.

In summary

In summary, I removed contaminants in reads by the following steps:

Prepare contamination library (bacteria, viral, fungi, protozoa, and archaea from Refseq and mtDNAs).
Use BBDuk to remove contaminants from illumina short reads.
Use minimap2 to remove contaminants from PacBio long reads.

But there are some concerns existing:

Some real sequences may also be removed along with the contaminants.
Many repeat sequences were removed, and I don’t know where they were from.
For the ‘ambivalent’ reads, I kept them for downstream analysis, but I didn’t know whether they should be removed, say, throw away reads which the primary aligment were contaminant.

There are some other long-reads mapper, and people also try to tune the parameters of short-read aligners to work with long-reads. There are some threads/posts talking about this:

I don’t really understand the “mapping” things now, but I expect that there will be several dominant tools for long-reads mapping, just as short-reads mapping.

Useful links

Change log

20180628: create the note.
20180725: complete the note.
20180807: update the ‘minimap-arm’ part, add the results of kraken.
20180815: add the part of ‘Minimap2-2.12 (r827)’

Chaisson MJ, Tesler G. 2012. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 13:238. doi:10.1186/1471-2105-13-238. ↩︎
Jain C, Dilthey A, Koren S, Aluru S, Phillippy AM. 2018 Apr 30. A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. Journal of Computational Biology. doi:10.1089/cmb.2018.0036. [accessed 2018 Jul 2]. https://www.liebertpub.com/doi/10.1089/cmb.2018.0036. ↩︎
Li H. 2017 Aug 4. Minimap2: versatile pairwise alignment for nucleotide sequences. arXiv:170801492 [q-bio]. [accessed 2018 Jan 10]. http://arxiv.org/abs/1708.01492. ↩︎