Introduction
wtdbg has two git repos: wtdbg and wtdbg-1.2.8, and the author Jue Ruan (who also developed SMARTdenovo) introduces them as:
- wtdbg: A fuzzy Bruijn graph approach to long noisy reads assembly. wtdbg is designed to assemble huge genomes in very limited time; it requires a PowerPC with multiple cores and very big RAM (1 TB+). wtdbg can assemble a 100X human PacBio dataset within one day.
- wtdbg-1.2.8: Important update of wtdbg.
Jue Ruan preferred wtdbg-1.2.8, saying: "In personal feeling, I like wtdbg-1.2.8 more than SMARTdenovo and wtdbg-1.1.006."
This tool had not been published as of this writing (2018-03-07); I found it in an evaluation paper from Briefings in Bioinformatics:
Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Briefings in Bioinformatics. 2017. doi:10.1093/bib/bbx147
My feelings:
- very fast
- easy to install
- easy to use
- docs and discussions about this tool are limited
- aggressive
- good N50 (at least in our two genome projects, an insect and a plant)
- relatively poor completeness
General usage
Because wtdbg has two different versions and I didn’t know which one is more suitable for me, I just tried both.
wtdbg v1.1.006
Install
I ran into a problem when compiling the software. The issue was caused by the CPATH setting of our OS, and was eventually solved with the help of Jue Ruan.
```
git clone https://github.com/ruanjue/wtdbg.git && cd wtdbg
```
Examples in the doc
```
assembly of contigs
```
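Only the "assembly of contigs" label of the doc's example survived in my notes. As a placeholder, here is a dry-run sketch of the two-step workflow (the commands are printed, not executed; the input name pb-reads.fa and the thread count are my own placeholders, with `-t`/`-i`/`-o` taken from the consensus example later in this note):

```shell
# Dry-run sketch: compose the commands as strings and print them,
# since wtdbg itself is likely not installed here.
ASSEMBLE="wtdbg -t 64 -i pb-reads.fa -o dbg"                  # assembly of contigs -> dbg.ctg.lay
CONSENSUS="wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa"  # consensus from the layout
echo "+ $ASSEMBLE"
echo "+ $CONSENSUS"
```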
wtdbg v1.2.8
Install
```
git clone https://github.com/ruanjue/wtdbg-1.2.8.git && cd wtdbg-1.2.8
```
For higher error rate long sequences
- Decrease -p. Try -p 19 or -p 17.
- Decrease -S. Try -S 2 or -S 1.
Both will increase computing time.
For very high coverage
- Increase --edge-min. Try --edge-min 4, or higher.
For low coverage
- Decrease --edge-min. Try --edge-min 2 --rescue-low-cov-edges.
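Putting the tuning advice together, a hedged sketch of how these flags attach to a wtdbg call (the base command and file names are my placeholders; only the tuning flags come from the notes above):

```shell
# Base command is a placeholder; the tuning flags are from the notes above.
BASE="wtdbg -t 64 -i pb-reads.fa -o dbg"
HIGH_ERROR="$BASE -p 17 -S 2"                        # noisier reads: lower -p and -S
HIGH_COV="$BASE --edge-min 4"                        # very high coverage: raise --edge-min
LOW_COV="$BASE --edge-min 2 --rescue-low-cov-edges"  # low coverage: lower --edge-min, rescue edges
printf '%s\n' "$HIGH_ERROR" "$HIGH_COV" "$LOW_COV"
```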
Filter reads
--tidy-reads 5000 filters out sequences shorter than 5,000 bp. If read names match the pattern /\d+_\d+$ (PacBio subread naming), only the longest subread of each molecule is selected.
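The length-filter part is easy to emulate. Here is a toy demonstration of the "drop short reads" idea with awk on a tiny FASTA (single-line sequences and a threshold of 5 bp instead of 5,000; this is my illustration, not wtdbg's actual implementation, and it skips the longest-subread selection):

```shell
# Toy FASTA (single-line sequences; lengths 8 and 4).
cat > toy.fa <<'EOF'
>read1
ACGTACGT
>read2
ACGT
EOF
# Keep records whose sequence is at least MIN bases long
# (MIN=5 here; --tidy-reads 5000 uses 5000 bp on real data).
awk -v MIN=5 '/^>/{name=$0; next} length($0)>=MIN{print name; print $0}' toy.fa > kept.fa
cat kept.fa   # read2 (4 bp) is dropped
```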
Consensus
```
wtdbg-cns -t 64 -i dbg.ctg.lay -o dbg.ctg.lay.fa
```
The output file dbg.ctg.lay.fa is ready for further polishing with PILON or QUIVER.
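For the short-read polishing step, a dry-run sketch of a typical Pilon pass (the aligner choice, read file names, thread count, and memory setting are my assumptions, not from this note or the wtdbg docs):

```shell
# Dry-run: print the typical commands of one Pilon polishing round.
INDEX="bwa index dbg.ctg.lay.fa"
ALIGN="bwa mem -t 16 dbg.ctg.lay.fa reads_R1.fq reads_R2.fq | samtools sort -o short.bam"
BAMIDX="samtools index short.bam"
PILON="java -Xmx32g -jar pilon.jar --genome dbg.ctg.lay.fa --frags short.bam --output dbg.pilon"
printf '+ %s\n' "$INDEX" "$ALIGN" "$BAMIDX" "$PILON"
```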
In practice
I’ve tried both versions of wtdbg with different parameter combinations in two genome assembly projects. The parameters and the resulting logs/stats are as follows:
An insect
- The species: highly heterozygous, AT-rich, highly repetitive.
- Genome size: male 790M, female 830M.
- Data used: about 70X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12, QSUB grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
wtdbg v1.1.006
run1, with -H -k 21 -S 1.02 -e 3:
stats:
```
total base: 607971510
```
wtdbg v1.2.8
run1, with the default -k 0 -p 21 -S 4:
stats:
```
total base: 757804309
```
run2, with --edge-min 2 --rescue-low-cov-edges --tidy-reads 5000 (because the median node depth was 6, less than 20):
stats:
```
total base: 845834770
```
run3, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000
stats:
```
Size_includeN 795503989
```
run4, with -k 0 -p 19 -S 2 --rescue-low-cov-edges --tidy-reads 5000
stats:
```
Size_includeN 780618272
```
run5, with --tidy-reads 5000 -k 21 -p 0 -S 2 --rescue-low-cov-edges
stats:
```
Size_includeN 843085698
```
run6, with -k 0 -p 21 -S 4 --aln-noskip
After a discussion with the author, who suggested that I use --aln-noskip.
stats:
```
Size_includeN 726925732
```
run7, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000 --aln-noskip
stats:
```
Size_includeN 762713695
```
After all these experiments, I’m not sure what to do next (try more combinations or move on). As suggested by Jue Ruan, a contig N50 of ~500 kb is good enough for scaffolding and genomic analysis, so I should evaluate this assembly and try to improve it while trying other tools.
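Since evaluation is the next step, one quick metric is easy to script: N50, the contig length L such that contigs of length >= L cover at least half the total assembly. A self-contained toy computation (the lengths below are made up; real input would be one length per contig, extracted from the assembly FASTA):

```shell
# Toy contig lengths, one per line.
printf '%s\n' 500000 300000 200000 100000 > lens.txt
# Sort descending, then walk down until the running sum reaches half the total.
N50=$(sort -rn lens.txt | awk '{a[NR]=$1; t+=$1}
  END{h=t/2; r=0; for(i=1;i<=NR;i++){r+=a[i]; if(r>=h){print a[i]; exit}}}')
echo "N50: $N50"   # prints: N50: 300000
```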
A plant
- The species: highly heterozygous, highly repetitive.
- Genome size: 2.1G.
- Data used: more than 100X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12, QSUB grid system; 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
wtdbg v1.1.006
commands:
```
run1, version 1.1.006
```
stats:
```
Size_includeN 2105945650
```
wtdbg v1.2.8
commands:
```
run2, version 1.2.8
```
stats:
```
Size_includeN 1924031835
```
Where to go next?
I asked Jue Ruan whether it is necessary to run consensus tools on the results of wtdbg or SMARTdenovo, and he said:
The inside consensus tool wtdbg-cns aims to provide a quick way to reduce sequencing errors. It is suggested to use Quiver and/or Pilon to polish the consensus sequences after you feel happy with the assembly. Usually, wtdbg-cns can reduce error rate down to less than 1%, which can be well-aligned by short reads.
Useful links
- Discussions about “Optimisation of parameters”
- Discussion about whether it is necessary to run consensus tools on the results of wtdbg or SMARTdenovo
Change log
- 20180307: create the note.
- 20180630: add the ‘A plant’ part.