Introduction
wtdbg has two git repos: wtdbg and wtdbg-1.2.8, and the author Jue Ruan (who also developed SMARTdenovo) introduces them as:
- wtdbg: A fuzzy Bruijn graph approach to long noisy reads assembly. wtdbg is designed to assemble huge genomes in very limited time; it requires a PowerPC with multiple cores and very big RAM (1 TB+). wtdbg can assemble a 100X human PacBio dataset within one day.
- wtdbg-1.2.8: Important update of wtdbg.
Jue Ruan preferred wtdbg-1.2.8, saying: "In personal feeling, I like wtdbg-1.2.8 more than SMARTdenovo and wtdbg-1.1.006."
This tool had not been published as of this writing (2018-03-07); I found it in an evaluation paper from Briefings in Bioinformatics:
Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Briefings in Bioinformatics. 2017. doi:10.1093/bib/bbx147
My feelings:
- very fast
- easy to install
- easy to use
- docs and discussions about this tool are limited
- aggressive
- good N50 (at least in our two genome projects, an insect and a plant)
- relatively poor completeness
General usage
Because wtdbg has two different versions and I didn’t know which one is more suitable for me, I just tried both.
wtdbg v1.1.006
Install
I ran into a problem when compiling the software. The issue was caused by the CPATH setting of our OS, and was eventually solved with the help of Jue Ruan.
```
git clone https://github.com/ruanjue/wtdbg.git && cd wtdbg
```
Examples in the doc
```
assembly of contigs
```
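Only the "assembly of contigs" label of the doc's example survived in my notes. As a placeholder, here is a dry-run sketch of the two-step workflow (the commands are printed, not executed; the input name pb-reads.fa and the thread count are my own placeholders, with `-t`/`-i`/`-o` taken from the consensus example later in this note):

```shell
# Dry-run sketch: compose the commands as strings and print them,
# since wtdbg itself is likely not installed here.
ASSEMBLE="wtdbg -t 64 -i pb-reads.fa -o dbg"                  # assembly of contigs -> dbg.ctg.lay
CONSENSUS="wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa"  # consensus from the layout
echo "+ $ASSEMBLE"
echo "+ $CONSENSUS"
```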
wtdbg v1.2.8
Install
```
git clone https://github.com/ruanjue/wtdbg-1.2.8.git && cd wtdbg-1.2.8
```
For higher error rate long sequences
- Decrease -p. Try -p 19 or -p 17.
- Decrease -S. Try -S 2 or -S 1.
Both will increase computing time.
For very high coverage
- Increase --edge-min. Try --edge-min 4, or higher.
For low coverage
- Decrease --edge-min. Try --edge-min 2 --rescue-low-cov-edges.
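Putting the tuning advice together, a hedged sketch of how these flags attach to a wtdbg call (the base command and file names are my placeholders; only the tuning flags come from the notes above):

```shell
# Base command is a placeholder; the tuning flags are from the notes above.
BASE="wtdbg -t 64 -i pb-reads.fa -o dbg"
HIGH_ERROR="$BASE -p 17 -S 2"                        # noisier reads: lower -p and -S
HIGH_COV="$BASE --edge-min 4"                        # very high coverage: raise --edge-min
LOW_COV="$BASE --edge-min 2 --rescue-low-cov-edges"  # low coverage: lower --edge-min, rescue edges
printf '%s\n' "$HIGH_ERROR" "$HIGH_COV" "$LOW_COV"
```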
Filter reads
--tidy-reads 5000 filters out sequences shorter than 5,000 bp. If read names match the pattern /\d+_\d+$ (PacBio subread naming), only the longest subread of each molecule is selected.
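The length-filter part is easy to emulate. Here is a toy demonstration of the "drop short reads" idea with awk on a tiny FASTA (single-line sequences and a threshold of 5 bp instead of 5,000; this is my illustration, not wtdbg's actual implementation, and it skips the longest-subread selection):

```shell
# Toy FASTA (single-line sequences; lengths 8 and 4).
cat > toy.fa <<'EOF'
>read1
ACGTACGT
>read2
ACGT
EOF
# Keep records whose sequence is at least MIN bases long
# (MIN=5 here; --tidy-reads 5000 uses 5000 bp on real data).
awk -v MIN=5 '/^>/{name=$0; next} length($0)>=MIN{print name; print $0}' toy.fa > kept.fa
cat kept.fa   # read2 (4 bp) is dropped
```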
Consensus
```
wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa
```
The output file dbg.ctg.lay.fa is ready for further polishing with PILON or QUIVER.
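For the short-read polishing step, a dry-run sketch of a typical Pilon pass (the aligner choice, read file names, thread count, and memory setting are my assumptions, not from this note or the wtdbg docs):

```shell
# Dry-run: print the typical commands of one Pilon polishing round.
INDEX="bwa index dbg.ctg.lay.fa"
ALIGN="bwa mem -t 16 dbg.ctg.lay.fa reads_R1.fq reads_R2.fq | samtools sort -o short.bam"
BAMIDX="samtools index short.bam"
PILON="java -Xmx32g -jar pilon.jar --genome dbg.ctg.lay.fa --frags short.bam --output dbg.pilon"
printf '+ %s\n' "$INDEX" "$ALIGN" "$BAMIDX" "$PILON"
```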
In practice
I’ve tried both versions of wtdbg with different parameter combinations in two genome assembly projects. The parameters and the resulting logs/stats are as follows:
An insect
- The species: highly heterozygous, AT-rich, highly repetitive.
- Genome size: male 790M, female 830M.
- Data used: about 70X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12, QSUB grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
wtdbg v1.1.006
run1, with -H -k 21 -S 1.02 -e 3:
stats:
```
total base: 607971510
```
wtdbg v1.2.8
run1, with the default -k 0 -p 21 -S 4:
stats:
```
total base: 757804309
```
run2, with --edge-min 2 --rescue-low-cov-edges --tidy-reads 5000 (because the median node depth was 6, less than 20):
stats:
```
total base: 845834770
```
run3, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000
stats:
```
Size_includeN 795503989
```
run4, with -k 0 -p 19 -S 2 --rescue-low-cov-edges --tidy-reads 5000
stats:
```
Size_includeN 780618272
```
run5, with --tidy-reads 5000 -k 21 -p 0 -S 2 --rescue-low-cov-edges
stats:
```
Size_includeN 843085698
```
run6, with -k 0 -p 21 -S 4 --aln-noskip
After a discussion with the author, who suggested that I use --aln-noskip.
stats:
```
Size_includeN 726925732
```
run7, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000 --aln-noskip
stats:
```
Size_includeN 762713695
```
After all these experiments, I’m not sure what to do next (try more combinations or move on). As suggested by Jue Ruan, a contig N50 of ~500 kb is good enough for scaffolding and genomic analysis, so I should evaluate this assembly and try to improve it while trying other tools.
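Since evaluation is the next step, one quick metric is easy to script: N50, the contig length L such that contigs of length >= L cover at least half the total assembly. A self-contained toy computation (the lengths below are made up; real input would be one length per contig, extracted from the assembly FASTA):

```shell
# Toy contig lengths, one per line.
printf '%s\n' 500000 300000 200000 100000 > lens.txt
# Sort descending, then walk down until the running sum reaches half the total.
N50=$(sort -rn lens.txt | awk '{a[NR]=$1; t+=$1}
  END{h=t/2; r=0; for(i=1;i<=NR;i++){r+=a[i]; if(r>=h){print a[i]; exit}}}')
echo "N50: $N50"   # prints: N50: 300000
```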
A plant
- The species: highly heterozygous, highly repetitive.
- Genome size: 2.1G.
- Data used: more than 100X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12, QSUB grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
wtdbg v1.1.006
commands:
```
run1, version 1.1.006
```
stats:
```
Size_includeN 2105945650
```
wtdbg v1.2.8
commands:
```
run2, version 1.2.8
```
stats:
```
Size_includeN 1924031835
```
Where to go next?
I asked Jue Ruan whether it is necessary to run consensus tools on the results of wtdbg or SMARTdenovo, and he said:
The inside consensus tool wtdbg-cns aims to provide a quick way to reduce sequencing errors. It is suggested to use Quiver and/or Pilon to polish the consensus sequences after you feel happy with the assembly. Usually, wtdbg-cns can reduce error rate down to less than 1%, which can be well-aligned by short reads.
Useful links
- Discussions about “Optimisation of parameters”
- Discussion about whether it is necessary to run consensus tools on the results of wtdbg or SMARTdenovo
Change log
- 20180307: create the note.
- 20180630: add the ‘A plant’ part.