Genome Assembly Pipeline: SMARTdenovo

Introduction

From its Git Repo:

SMARTdenovo is a de novo assembler for PacBio and Oxford Nanopore (ONT) data. It produces an assembly from all-vs-all raw read alignments without an error correction stage. It also provides tools to generate accurate consensus sequences, though a platform dependent consensus polish tools (e.g. Quiver for PacBio or Nanopolish for ONT) are still required for higher accuracy.

SMARTdenovo consists of several separate command line tools: wtzmo for read overlapping, wtgbo to rescue missing overlaps, wtclp for identifying low-quality regions and chimaera, and wtcns or wtmsa to produce better unitig consensus. The smartdenovo.pl script provides a convenient interface to call these programs in one go.

This tool has not been published yet. (20180313)

My impressions:

  • easy to install and use
  • not as fast as wtdbg, but still fast
  • comparatively good results (at least in my case)
  • docs and discussions about this tool are limited.

General usage

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
awk 'NR%4==1||NR%4==2' selfSampleData/pacbio_filtered.fastq | sed 's/^@/>/g' > reads.fa
# Install SMARTdenovo
git clone https://github.com/ruanjue/smartdenovo.git && (cd smartdenovo; make)
# Assemble (raw unitigs in wtasm.lay.utg; consensus unitigs: wtasm.cns)
smartdenovo/smartdenovo.pl -c 1 reads.fa > wtasm.mak
make -f wtasm.mak
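
The awk/sed line above is what turns the FASTQ reads into the FASTA file given to smartdenovo.pl. Here is a minimal sketch of what it does, run on a made-up two-read file. The file names toy.fastq and toy.fa and the function name fastq_to_fasta are only for illustration, and the approach assumes every FASTQ record spans exactly four lines.

```shell
# Build a tiny two-record FASTQ file (made-up reads, for illustration only).
# The quality string of read2 starts with '@' on purpose.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n@III\n' > toy.fastq

# Keep line 1 (header) and line 2 (sequence) of every 4-line record,
# then turn the leading '@' of each header into '>'.
# Quality lines are dropped by position, so a quality string that
# starts with '@' is never mistaken for a header.
fastq_to_fasta() {
    awk 'NR%4==1||NR%4==2' "$1" | sed 's/^@/>/'
}

fastq_to_fasta toy.fastq > toy.fa
cat toy.fa
# prints:
# >read1
# ACGT
# >read2
# GGCC
```

If the sequences in a FASTQ file are wrapped over several lines, this line-counting trick no longer works and a real parser is the safer choice.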

In practice

An insect

  • The species: high heterogeneity, high AT content, high repeat content.
  • Genome size: male 790M, female 830M.

commands:

# run1, default
$path2perl $TOOLDIR/smartdenovo/smartdenovo.pl -t $PPN -c 1 -p run1 $DATADIR/third/third_all.fasta > run1.mak
make -f run1.mak

stats:

Size_includeN 756816708
Size_withoutN 756816708
Seq_Num 6135
Mean_Size 123360
Median_Size 55901
Longest_Seq 5704487
Shortest_Seq 10769
GC_Content 31.72
N50 240010
N90 44546
Gap 0.0
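
For readers who want to check numbers like N50 and N90 themselves, here is a minimal sketch using the common definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly size. This is not necessarily how the stats tool used for this note computes them. The names toy_contigs.fa, seq_lengths and nx are made up for the example.

```shell
# Toy FASTA with contigs of 10, 8, 6, 4 and 2 bp (made-up data).
# Contig c2 is wrapped over two lines to show that wrapping is handled.
printf '>c1\nAAAAAAAAAA\n>c2\nCCCC\nCCCC\n>c3\nGGGGGG\n>c4\nTTTT\n>c5\nAA\n' > toy_contigs.fa

# Print one length per sequence in a FASTA file.
seq_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# nx FILE X: print "N<X> L<X>", i.e. the length of the contig at which
# the running total (longest first) reaches X% of the assembly, and how
# many contigs were needed to get there.
nx() {
    seq_lengths "$1" | sort -rn | awk -v x="$2" '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum * 100 >= total * x) { print len[i], i; exit }
            }
        }'
}

nx toy_contigs.fa 50   # prints: 8 2  (N50 = 8, L50 = 2)
nx toy_contigs.fa 90   # prints: 4 4  (N90 = 4, L90 = 4)
```

On a real assembly the same call would be made on the consensus FASTA, for example the consensus unitig file written by the run.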

SMARTdenovo can also use the zmo overlapper. I tested this option too, but it produced an assembly of about 17G! (The estimated genome size is about 850M.)

A plant

  • The species: high heterogeneity, high repeat content.
  • Genome size: 2.1G.

run1, with about 100X data

commands:

# run1, default
$path2perl $TOOLDIR/smartdenovo/smartdenovo.pl -t 24 -c 1 -p run1 $WORKDIR/data/Pacbio/all.fq.gz > run1.mak
make -f run1.mak

And the stats I got:

Size_includeN 2103140368
Size_withoutN 2103140368
Seq_Num 6164
Mean_Size 341197
Median_Size 163362
Longest_Seq 9288681
Shortest_Seq 12171
GC_Content 38.16
N50 703465
L50 809
N90 151138
Gap 0.0

run2, with about 50X data

commands:

# run2, 50X
$path2perl $TOOLDIR/smartdenovo/smartdenovo.pl -t $PPN -c 1 -p run2 $WORKDIR/data/Pacbio/Pacbio_50x.fasta > run2.mak
make -f run2.mak

And the stats I got:

Size_includeN 2028605527
Size_withoutN 2028605527
Seq_Num 5811
Mean_Size 349097
Median_Size 170070
Longest_Seq 10046321
Shortest_Seq 24367
GC_Content 38.18
N50 708215
L50 758
N90 147345
Gap 0.0

This is a very good N50 (about 700 kb in both runs)! The assembled size is also close to the expected 2.1G.
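
To put a number on "close to the expected size", the assembled sizes reported above can be divided by the expected genome size. This is a small sketch; it assumes "2.1G" means 2,100,000,000 bp, and the function name pct_of_expected is made up for the example.

```shell
# pct_of_expected ASSEMBLED_BP EXPECTED_BP
# Print the assembled size as a percentage of the expected genome size.
pct_of_expected() {
    awk -v a="$1" -v e="$2" 'BEGIN { printf "%.1f\n", 100 * a / e }'
}

# Assumption: the plant genome size "2.1G" is taken as 2100000000 bp.
pct_of_expected 2103140368 2100000000   # run1 (about 100X), prints: 100.1
pct_of_expected 2028605527 2100000000   # run2 (about 50X),  prints: 96.6
```

Under that assumption, halving the coverage cost only a few percent of the assembled size.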

Change notes

  • 20180423: created the note.