Introduction
wtdbg has two git repos, wtdbg and wtdbg-1.2.8, and the author Jue Ruan (who also developed SMARTdenovo) introduces them as:
wtdbg: A fuzzy Bruijn graph approach to long noisy reads assembly. wtdbg is designed to assemble huge genomes in very limited time; it requires a PowerPC with multiple cores and very big RAM (1 Tb+). wtdbg can assemble a 100X human PacBio dataset within one day.
wtdbg-1.2.8: Important update of wtdbg
Jue Ruan preferred wtdbg-1.2.8:
In personal feeling, I like wtdbg-1.2.8 more than SMARTdenovo and wtdbg-1.1.006.
This tool has not been published yet (as of 20180307); I found it in an evaluation paper from Briefings in Bioinformatics:
Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Briefings in Bioinformatics. 2017:bbx147. doi:10.1093/bib/bbx147
My feelings:
- very fast
- easy to install
- easy to use
- docs and discussions about this tool are limited
- aggressive
- good N50 (at least in our two genome projects, an insect and a plant)
- relatively bad completeness
General usage
Because wtdbg has two different versions and I didn’t know which one was more suitable for me, I just tried both.
wtdbg v1.1.006
Install
I ran into a problem when compiling the software. The issue was caused by the CPATH setting of our OS, and was eventually solved with the help of Jue Ruan.
```
git clone https://github.com/ruanjue/wtdbg.git && cd wtdbg
```
Examples in the doc
```
# assembly of contigs
```
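The example commands from the doc did not survive formatting here; only the comment line remains. As a rough placeholder, here is a hypothetical two-step run reconstructed from the flags used in the runs later in this note. The file names (`pacbio.fa`, `dbg`) and the thread count are my assumptions, not the doc's:

```shell
#!/bin/sh
# Hypothetical wtdbg v1.1.006 run, reconstructed from flags used later in this
# note; input/output names and -t are placeholders, not from the original doc.
cat > run_wtdbg.sh <<'EOF'
wtdbg -t 32 -i pacbio.fa -o dbg -H -k 21 -S 1.02 -e 3   # assembly of contigs
wtdbg-cns -t 32 -i dbg.ctg.lay -o dbg.ctg.lay.fa        # quick consensus
EOF
# Only syntax-check the sketch here, since running it needs the binaries:
sh -n run_wtdbg.sh && echo "sketch written to run_wtdbg.sh"
```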
wtdbg v1.2.8
Install
```
git clone https://github.com/ruanjue/wtdbg-1.2.8.git && cd wtdbg-1.2.8
```
For higher error rate long sequences
- Decrease `-p`. Try `-p 19` or `-p 17`.
- Decrease `-S`. Try `-S 2` or `-S 1`.

Both will increase computing time.
For very high coverage
- Increase `--edge-min`. Try `--edge-min 4` or higher.
For low coverage
- Decrease `--edge-min`. Try `--edge-min 2 --rescue-low-cov-edges`.
Filter reads
- `--tidy-reads 5000`. Filters out shorter sequences. If read names end in the format `\/\d+_\d+$`, the longest subread will be selected.
Consensus
```
wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa
```
The output file `dbg.ctg.lay.fa` is ready for further polishing by Pilon or Quiver.
In practice
I’ve tried both versions of wtdbg and different parameter combinations in two genome assembly projects. The parameters and the resulting logs/stats are as follows:
An insect
- The species: high heterozygosity, high AT content, high repeat content.
- Genome size: male 790M, female 830M.
- Data used: about 70X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12, QSUB grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
wtdbg v1.1.006
run1, with `-H -k 21 -S 1.02 -e 3`:
stats:
```
total base: 607971510
```
wtdbg v1.2.8
run1, with the default `-k 0 -p 21 -S 4`:
stats:
```
total base: 757804309
```
run2, with `--edge-min 2 --rescue-low-cov-edges --tidy-reads 5000` (because the median node depth was 6, less than 20):
stats:
```
total base: 845834770
```
run3, with `-k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000`:
stats:
```
Size_includeN 795503989
```
run4, with `-k 0 -p 19 -S 2 --rescue-low-cov-edges --tidy-reads 5000`:
stats:
```
Size_includeN 780618272
```
run5, with `--tidy-reads 5000 -k 21 -p 0 -S 2 --rescue-low-cov-edges`:
stats:
```
Size_includeN 843085698
```
run6, with `-k 0 -p 21 -S 4 --aln-noskip`:
After a discussion, the author suggested that I use `--aln-noskip`.
stats:
```
Size_includeN 726925732
```
run7, with `-k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000 --aln-noskip`:
stats:
```
Size_includeN 762713695
```
After all these experiments, I’m not sure what to do next (try more parameters or move on). As suggested by Jue Ruan, a contig N50 of ~500 kb is good enough for scaffolding and genomic analysis, so I should evaluate the assembly and improve it while trying other tools.
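Since contig N50 is the yardstick throughout these runs, here is a minimal sketch of how to compute it from a list of contig lengths (the lengths below are toy numbers, not from these assemblies):

```shell
#!/bin/sh
# Minimal N50 calculator: N50 is the length L such that contigs of length >= L
# together cover at least half of the total assembly size. Toy lengths below.
cat > ctg_lens.txt <<'EOF'
500000
400000
300000
200000
100000
EOF
n50=$(sort -rn ctg_lens.txt | awk '
    { len[NR] = $1; total += $1 }
    END {
        half = total / 2
        for (i = 1; i <= NR; i++) {
            sum += len[i]
            if (sum >= half) { print len[i]; exit }
        }
    }')
echo "N50: $n50"   # N50: 400000 for these toy lengths
```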
A plant
- The species: high heterozygosity, high repeat content.
- Genome size: 2.1G.
- Data used: more than 100X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12, QSUB grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
wtdbg v1.1.006
commands:
```
# run1, version 1.1.006
```
stats:
```
Size_includeN 2105945650
```
wtdbg v1.2.8
commands:
```
# run2, version 1.2.8
```
stats:
```
Size_includeN 1924031835
```
Where to go next?
I asked Jue Ruan whether it is necessary to run consensus tools on the results of wtdbg or smartdenovo, and he said:
The inside consensus tool wtdbg-cns aims to provide a quick way to reduce sequencing errors. It is suggested to use Quiver and/or Pilon to polish the consensus sequences after you feel happy with the assembly. Usually, wtdbg-cns can reduce error rate down to less than 1%, which can be well-aligned by short reads.
Useful links
- Discussions about “Optimisation of parameters”
- Whether it is necessary to run consensus tools on the results of wtdbg or smartdenovo
Change log
- 20180307: create the note.
- 20180630: add the ‘A plant’ part.