Genome Assembly Pipeline: wtdbg

Introduction

wtdbg has two git repos, wtdbg and wtdbg-1.2.8, and the author, Jue Ruan (who also developed SMARTdenovo), introduces them as:

wtdbg: A fuzzy Bruijn graph approach to long noisy reads assembly. wtdbg is designed to assemble huge genomes in very limited time; it requires a powerful computer with multiple cores and very large RAM (1 TB+). wtdbg can assemble a 100X human PacBio dataset within one day.

wtdbg-1.2.8: Important update of wtdbg

Jue Ruan himself prefers wtdbg-1.2.8:

Personally, I like wtdbg-1.2.8 more than SMARTdenovo and wtdbg-1.1.006.

This tool has not been published yet (as of 2018-03-07); I found it in an evaluation paper in Briefings in Bioinformatics:

Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Briefings in Bioinformatics. 2017 Nov 3:bbx147. doi:10.1093/bib/bbx147

My feelings:

  • very fast
  • easy to install
  • easy to use
  • docs and discussions about this tool are limited
  • aggressive
  • good N50 (at least in our two genome projects, an insect and a plant)
  • relatively bad completeness

General usage

Because wtdbg has two different versions and I didn’t know which one would suit my data better, I tried both.

wtdbg v1.1.006

Install

I ran into a problem when compiling the software. The issue was caused by the CPATH setting on our system, and it was eventually solved with the help of Jue Ruan (a generic workaround is sketched below the install commands).

git clone https://github.com/ruanjue/wtdbg.git && cd wtdbg
make
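
If you hit a similar compile failure, one generic thing to try (an assumption, not necessarily the exact fix we used) is to check whether CPATH points at incompatible headers and rebuild with it cleared:

echo "$CPATH"        # inspect what CPATH currently contains
env -u CPATH make    # retry the build with CPATH removed from the environment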

Examples in the doc

# assembly of contigs
wtdbg-1.1.006 -t 96 -i pb-reads.fa -o dbg -H -k 21 -S 1.02 -e 3 2>&1 | tee log.wtdbg
# -t: number of threads; type 'wtdbg-1.1.006 -h' to see the full option list
# -i: you can give more than one sequence file, e.g. -i 1.fa -i 2.fq -i 3.fa.gz -i 4.fq.gz
# -o: the prefix of the output files
# -S: 1.01 will use all kmers, 1.02 will use half of them by subsampling, 1.04 will use 1/4, and so on;
#     2.01 will use half by picking minimizers, but this is not fully tested
# -e: if coverage is too low (< 30X), try -e 2
# please note that dbg.ctg.fa still contains the errors of the raw reads

# first round of polishing
wtdbg-cns -t 96 -i dbg.ctg.lay -o dbg.ctg.lay.fa -k 15 2>&1 | tee log.cns.1
# dbg.ctg.lay.fa contains the polished contigs

# if possible, a further round of polishing
minimap -t 96 -L 100 dbg.ctg.lay.fa pb-reads.fa 2> >(tee log.minimap) | best_minimap_hit.pl | awk '{print $6"\t"$8"\t"$9"\t"$1"\t"$5"\t"$3"\t"$4}' >dbg.map
map2dbgcns dbg.ctg.lay.fa pb-reads.fa dbg.map >dbg.map.lay
wtdbg-cns -t 96 -i dbg.map.lay -o dbg.map.lay.fa -k 13 2>&1 | tee log.cns.2
# you need to concatenate all reads into one file for minimap and map2dbgcns
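#   e.g. (file names are just an example): cat part1.fa part2.fa > pb-reads.fa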
# dbg.map.lay.fa is the final contigs

wtdbg v1.2.8

Install

git clone https://github.com/ruanjue/wtdbg-1.2.8.git && cd wtdbg-1.2.8
make

For higher error rate long sequences

  • Decrease -p. Try -p 19 or -p 17
  • Decrease -S. Try -S 2 or -S 1

Both will increase computing time.
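
For example, a run for noisier reads might look like this (input file name and thread count are placeholders):

wtdbg-1.2.8 -t 64 -i pb-reads.fa.gz -o dbg -p 19 -S 2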

For very high coverage

  • Increase --edge-min. Try --edge-min 4, or higher (see the combined sketch below).

For low coverage

  • Decrease --edge-min. Try --edge-min 2 --rescue-low-cov-edges.
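
For example (input file name and thread count are placeholders):

# for very high coverage
wtdbg-1.2.8 -t 64 -i pb-reads.fa.gz -o dbg --edge-min 4
# for low coverage
wtdbg-1.2.8 -t 64 -i pb-reads.fa.gz -o dbg --edge-min 2 --rescue-low-cov-edges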

Filter reads

  • --tidy-reads 5000 will filter out sequences shorter than 5000 bp. If read names match \/\d+_\d+$ (PacBio subread naming), only the longest subread of each molecule is kept.
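
For example (placeholders as above; this flag can be combined with the coverage options, as in the runs further below):

wtdbg-1.2.8 -t 64 -i pb-reads.fa.gz -o dbg --tidy-reads 5000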

Consensus

wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa

The output file dbg.ctg.lay.fa is ready for further polishing with Pilon or Quiver.
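
A minimal Pilon sketch, assuming paired-end Illumina reads are available (file names, thread counts, memory, and the pilon.jar path are placeholders):

# map short reads to the consensus contigs
bwa index dbg.ctg.lay.fa
bwa mem -t 64 dbg.ctg.lay.fa reads_R1.fq.gz reads_R2.fq.gz | samtools sort -@ 8 -o aln.sorted.bam -
samtools index aln.sorted.bam
# polish with Pilon using the paired-end alignments
java -Xmx100G -jar pilon.jar --genome dbg.ctg.lay.fa --frags aln.sorted.bam --output dbg.pilon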

In practice

I’ve tried two versions of wtdbg and different parameter combinations in two genome assembly projects. The parameters and the resulting logs/stats are as follows:

An insect

  • The species: highly heterozygous, AT-rich, and highly repetitive.
  • Genome size: male ~790 Mb, female ~830 Mb.
  • Data used: about 70X PacBio long reads.
  • OS environment: CentOS 6.6 x86_64, glibc 2.12, qsub grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).

wtdbg v1.1.006

run1, with -H -k 21 -S 1.02 -e 3:
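
The corresponding command, reconstructed from the example above (input file name assumed):

wtdbg-1.1.006 -t 96 -i pb-reads.fa -o run1 -H -k 21 -S 1.02 -e 3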

stats:

total base: 607971510
%GC: 32.21
num: 4655
min: 2700
max: 6594103
avg: 130606
N50: 573208
N90: 46228

wtdbg v1.2.8

run1, with the default -k 0 -p 21 -S 4:

stats:

total base: 757804309
%GC: 32.37
num: 20960
min: 2247
max: 3846453
avg: 36154
N50: 103681
N90: 12128

run2, with --edge-min 2 --rescue-low-cov-edges --tidy-reads 5000 (because the median node depth was 6, below 20):

stats:

total base: 845834770
%GC: 32.51
num: 19555
min: 2030
max: 2025061
avg: 43254
N50: 158013
N90: 14248

run3, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000:

stats:

Size_includeN	795503989
Size_withoutN 795503989
Seq_Num 12557
Mean_Size 63351
Median_Size 15690
Longest_Seq 7257493
Shortest_Seq 2277
GC_Content 32.44
N50 308340
N90 21383

run4, with -k 0 -p 19 -S 2 --rescue-low-cov-edges --tidy-reads 5000:

stats:

Size_includeN	780618272
Size_withoutN 780618272
Seq_Num 11722
Mean_Size 66594
Median_Size 16335
Longest_Seq 8184393
Shortest_Seq 2547
GC_Content 32.4
N50 294217
N90 23008

run5, with --tidy-reads 5000 -k 21 -p 0 -S 2 --rescue-low-cov-edges:

stats:

Size_includeN	843085698
Size_withoutN 843085698
Seq_Num 26341
Mean_Size 32006
Median_Size 18982
Longest_Seq 491063
Shortest_Seq 2992
GC_Content 32.51
N50 54544
N90 13737

run6, with -k 0 -p 21 -S 4 --aln-noskip:

After a discussion with the author, he suggested that I use --aln-noskip.

stats:

Size_includeN	726925732
Size_withoutN 726925732
Seq_Num 15983
Mean_Size 45481
Median_Size 12714
Longest_Seq 2523944
Shortest_Seq 2290
GC_Content 32.21
N50 164635
N90 14464

run7, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000 --aln-noskip:

stats:

Size_includeN	762713695
Size_withoutN 762713695
Seq_Num 9803
Mean_Size 77804
Median_Size 15366
Longest_Seq 11163143
Shortest_Seq 2449
GC_Content 32.22
N50 488952
N90 25913

After all these experiments, I’m not sure what to do next (keep tuning or move on to other assemblers). As suggested by Jue Ruan, a contig N50 of ~500 kb is good enough for scaffolding and genomic analysis, so I should evaluate this assembly and try to improve it while testing other tools.
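
For the completeness side of the evaluation, BUSCO is a common choice. A minimal sketch with BUSCO v3 (the assembly file name, lineage set, and thread count here are assumptions; pick the lineage that matches your species):

run_BUSCO.py -i contigs.fa -o busco_contigs -l insecta_odb9 -m genome -c 16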

A plant

  • The species: highly heterozygous and highly repetitive.
  • Genome size: ~2.1 Gb.
  • Data used: more than 100X PacBio long reads.
  • OS environment: CentOS 6.6 x86_64, glibc 2.12, qsub grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).

wtdbg v1.1.006

commands:

# run1, version 1.1.006
$TOOLDIR/wtdbg/wtdbg -t $PPN -i $WORKDIR/data/Pacbio/all.fq.gz -o run1 -H -k 21 -S 1.02 -e 3 2>&1 | tee log.run1
$TOOLDIR/wtdbg/wtdbg-cns -t $PPN -i run1.ctg.lay -o run1.ctg.lay.fa -k 15 2>&1 | tee log.run1.cns.1

stats:

Size_includeN	2105945650
Size_withoutN 2105945650
Seq_Num 21871
Mean_Size 96289
Median_Size 48435
Longest_Seq 2968570
Shortest_Seq 2531
GC_Content 38.27
N50 194480
L50 2523
N90 40454
Gap 0.0

wtdbg v1.2.8

commands:

# run2, version 1.2.8
$TOOLDIR/wtdbg-1.2.8/wtdbg-1.2.8 -t $PPN -i $WORKDIR/data/Pacbio/all.fq.gz -o run2
$TOOLDIR/wtdbg-1.2.8/wtdbg-cns -t $PPN -i run2.ctg.lay -o run2.ctg.lay.fa

stats:

Size_includeN	1924031835
Size_withoutN 1924031835
Seq_Num 37933
Mean_Size 50721
Median_Size 14836
Longest_Seq 2424157
Shortest_Seq 2006
GC_Content 38.75
N50 184177
L50 2391
N90 17404
Gap 0.0

Where to go next?

I asked Jue Ruan whether it is necessary to run additional consensus tools on the results of wtdbg or SMARTdenovo, and he said:

The built-in consensus tool wtdbg-cns aims to provide a quick way to reduce sequencing errors. It is suggested to use Quiver and/or Pilon to polish the consensus sequences once you are happy with the assembly. Usually, wtdbg-cns can reduce the error rate to less than 1%, at which level short reads can be aligned well.

Change log

  • 20180307: created the note.
  • 20180630: added the ‘A plant’ part.