Genome Assembly Pipeline: SMARTdenovo

Introduction

SMARTdenovo is a de novo assembler for PacBio and Oxford Nanopore (ONT) data. It produces an assembly from all-vs-all raw read alignments without an error correction stage. It also provides tools to generate accurate consensus sequences, though platform-dependent consensus polishing tools (e.g. Quiver for PacBio or Nanopolish for ONT) are still required for higher accuracy.

SMARTdenovo consists of several separate command line tools: wtzmo for read overlapping, wtgbo to rescue missing overlaps, wtclp for identifying low-quality regions and chimeras, and wtcns or wtmsa to produce a better unitig consensus. The smartdenovo.pl script provides a convenient interface to call these programs in one go.

This tool has not been published yet. (20180313)

My feelings:

easy to install/use
not as fast as wtdbg, but fast
comparatively good results (at least in my case)
docs and discussions about this tool are limited

General usage

# Download sample PacBio data from the PBcR website and convert FASTQ to FASTA
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
awk 'NR%4==1||NR%4==2' selfSampleData/pacbio_filtered.fastq | sed 's/^@/>/g' > reads.fa
# Install SMARTdenovo
git clone https://github.com/ruanjue/smartdenovo.git && (cd smartdenovo; make)
# Assemble (raw unitigs in wtasm.lay.utg; consensus unitigs: wtasm.cns)
smartdenovo/smartdenovo.pl -c 1 reads.fa > wtasm.mak
make -f wtasm.mak
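
The awk/sed line in the usage above converts FASTQ to FASTA. As a minimal sketch of what it does, the same pipeline can be run on a tiny made-up file (the toy reads and the temporary directory are assumptions for illustration, not part of the PBcR sample data):

```shell
# Build a two-record toy FASTQ file (made-up reads, for illustration only).
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n' > "$tmpdir/toy.fastq"

# Keep lines 1 and 2 of every 4-line record (header and sequence),
# then turn the leading '@' of each header into '>'.
# awk runs first, so quality lines that happen to start with '@'
# are already gone by the time sed sees the stream.
awk 'NR%4==1||NR%4==2' "$tmpdir/toy.fastq" | sed 's/^@/>/' > "$tmpdir/toy.fa"

cat "$tmpdir/toy.fa"
```

This only works when every FASTQ record is exactly four lines long; files with wrapped sequence lines would need a real parser.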

In practice

An insect

The species: high heterogeneity, high AT content, high repeat content.
Genome size: male 790M, female 830M.

commands:

# run1, default
$path2perl $TOOLDIR/smartdenovo/smartdenovo.pl -t $PPN -c 1 -p run1 $DATADIR/third/third_all.fasta > run1.mak
make -f run1.mak

stats:

Size_includeN	756816708
Size_withoutN	756816708
Seq_Num	6135
Mean_Size	123360
Median_Size	55901
Longest_Seq	5704487
Shortest_Seq	10769
GC_Content	31.72
N50	240010
N90	44546
Gap	0.0
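
The note does not say which tool produced these stats. As a rough sketch of how N50 and N90 can be computed from an assembly FASTA, the hypothetical helper below (n_stat is a name made up here) sorts the sequence lengths from longest to shortest and reports the length at which the running total first reaches the requested fraction of the assembly size. It is demonstrated on a toy file rather than on the real assembly:

```shell
# n_stat FASTA FRACTION
# Print the Nx length, e.g. FRACTION=0.5 for N50, 0.9 for N90.
# (Hypothetical helper, not part of SMARTdenovo.)
n_stat() {
    # Step 1: one length per sequence (handles multi-line FASTA records).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: longest first.
    sort -rn |
    # Step 3: walk down the list until the cumulative length
    # reaches FRACTION of the total.
    awk -v frac="$2" '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum >= total * frac) { print lens[i]; exit }
            }
        }'
}

# Toy assembly with contig lengths 10, 8, 6, 4 and 2 (total 30).
tmpdir=$(mktemp -d)
printf '>c1\nAAAAA\nAAAAA\n>c2\nCCCCCCCC\n>c3\nGGGGGG\n>c4\nTTTT\n>c5\nAC\n' \
    > "$tmpdir/toy_contigs.fa"

n_stat "$tmpdir/toy_contigs.fa" 0.5   # N50 of the toy file
n_stat "$tmpdir/toy_contigs.fa" 0.9   # N90 of the toy file
```

On the toy file the running totals are 10, 18, 24, 28, 30, so N50 is 8 (18 is the first total to reach 15) and N90 is 4 (28 is the first to reach 27). For a real run, the first argument would be the consensus FASTA written by make; its exact name depends on the prefix and overlapper, so it is not spelled out here.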

SMARTdenovo can also use the zmo overlapper. I tested this option as well, but it produced an assembly of about 17G, roughly 20 times the estimated genome size of about 850M!

A plant

The species: high heterogeneity, high repetition.
Genome size: 2.1G.

run1, with about 100X data

commands:

# run1, default
$path2perl $TOOLDIR/smartdenovo/smartdenovo.pl -t 24 -c 1 -p run1 $WORKDIR/data/Pacbio/all.fq.gz > run1.mak
make -f run1.mak

And the stats I got:

Size_includeN	2103140368
Size_withoutN	2103140368
Seq_Num	6164
Mean_Size	341197
Median_Size	163362
Longest_Seq	9288681
Shortest_Seq	12171
GC_Content	38.16
N50	703465
L50	809
N90	151138
Gap	0.0

run2, with about 50X data

commands:

# run2, 50X
$path2perl $TOOLDIR/smartdenovo/smartdenovo.pl -t $PPN -c 1 -p run2 $WORKDIR/data/Pacbio/Pacbio_50x.fasta > run2.mak
make -f run2.mak

And the stats I got:

Size_includeN	2028605527
Size_withoutN	2028605527
Seq_Num	5811
Mean_Size	349097
Median_Size	170070
Longest_Seq	10046321
Shortest_Seq	24367
GC_Content	38.18
N50	708215
L50	758
N90	147345
Gap	0.0

This was a very good N50! The assembled size (about 2.03G) was also close to the expected 2.1G.
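
To put "close to the expected size" into numbers, the small sketch below compares the Size_includeN values of both plant runs with the expected genome size. It assumes that "2.1G" means 2.1e9 bp; pct_of_expected is a helper name made up for this example.

```shell
# pct_of_expected ASSEMBLED_BP EXPECTED_BP
# Print the assembled size as a percentage of the expected size.
# (Hypothetical helper for illustration.)
pct_of_expected() {
    awk -v a="$1" -v e="$2" 'BEGIN { printf "%.1f\n", 100 * a / e }'
}

expected=2100000000   # assumption: "2.1G" read as 2.1e9 bp

# Size_includeN values copied from the two stats tables above.
run1_pct=$(pct_of_expected 2103140368 "$expected")
run2_pct=$(pct_of_expected 2028605527 "$expected")

echo "run1 (about 100X): ${run1_pct}% of expected size"
echo "run2 (about 50X):  ${run2_pct}% of expected size"
```

Under that assumption run1 comes out at about 100.1% and run2 at about 96.6% of the expected size, so halving the coverage cost only a few percent of assembled sequence in this case.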

Change notes

20180423: created the note.