Genome Assembly Pipeline: MEGAHIT & SOAPdenovo-fusion

MEGAHIT can be used to assemble contigs, and SOAPdenovo-fusion can be used for scaffolding. Since they were developed by the same team, I just put them together.

This note is more about MEGAHIT and its performance, because you can choose not to use SOAPdevovo-fusion. SOAPdenovo-fusion had comparatively good performance in my case, so why not give it a try?

Intro

From MEGAHIT git repo

MEGAHIT is a single node assembler for large and complex metagenomics NGS reads, such as soil. It makes use of succinct de Bruijn graph (SdBG) to achieve low memory assembly. MEGAHIT can optionally utilize a CUDA-enabled GPU to accelerate its SdBG contstruction. The GPU-accelerated version of MEGAHIT has been tested on NVIDIA GTX680 (4G memory) and Tesla K40c (12G memory) with CUDA 5.5, 6.0 and 6.5. MEGAHIT v1.0 or greater also supports IBM Power PC and has been tested on IBM POWER8.

Its paper

Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–1676. doi:10.1093/bioinformatics/btv033

My feelings:

very easy to use
fast enough
better than SOAPdenovo2
no need to designate k-mer

General usage

See MEGAHIT wiki for full docs.

In practice - MEGAHIT

An insect
The species: high heterogeneity, high AT, high repetition.
Genome size: male 790M, female 830M.

data

The Illumina data I used:

Source	Insert size (bp)	Avg. read size (bp)	Raw bases (G)	Raw reads (M)	Sequencing depth
AV1, M	270	150	44.1	293.6	55.5
AV2, F	500	150	24.4	162.8	29.4
AV2, F	800	150	15.8	105.4	19.0
AV2, F	3k	114	10.4	91.8	12.5
AV2, F	5k	150	29.8	198.7	35.9
AV2, F	5k	114	11.5	101.2	13.8
AV2, F	10k	150	17.5	116.8	21.1
Total	-	-	153.5	1070.3	187.3

I’ve tried MEGAHIT with raw/trimmed data, with/without ins_270 library, with all (PE and MPE)/PE libraries, and here are the scripts I used and stats received. The reason why I tried with/without ins_270 library was because it’s from a male but other libraries were from females.

The reason why I tried with all (PE and MPE)/PE libraries was because I ran MEGAHIT with all data I had, and then the author recommended only to use PE libraries. See the discussions with the authors.

run1, all raw data

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 /DenovoSeq/raw_data/270B_R1.fastq,/DenovoSeq/raw_data/500B_R1.fastq,/DenovoSeq/raw_data/800B_R1.fastq,/DenovoSeq/raw_data/3k_1_R1.fastq,/DenovoSeq/raw_data/5k-1_R1.fastq,/DenovoSeq/raw_data/5k-2_R1.fastq,/DenovoSeq/raw_data/10k_R1.fastq -2 /DenovoSeq/raw_data/270B_R2.fastq,/DenovoSeq/raw_data/500B_R2.fastq,/DenovoSeq/raw_data/800B_R2.fastq,/DenovoSeq/raw_data/3k_1_R2.fastq,/DenovoSeq/raw_data/5k-1_R2.fastq,/DenovoSeq/raw_data/5k-2_R2.fastq,/DenovoSeq/raw_data/10k_R2.fastq -o megahit_out1

and the stats:

Size_includeN: 894816039
Size_withoutN: 894816039
Seq_Num: 1292253
Mean_Size: 692
Median_Size: 418
Longest_Seq: 42542
Shortest_Seq: 200
GC_Content: 32.9
N50: 834
N90: 333

run2, all trimmed data (by Trimmomatic)

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 38 --no-mercy -1 /DenovoSeq/trimmomatic/270B_R_1P.fastq,/DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq,/DenovoSeq/trimmomatic/3k_1_R_1P.fastq,/DenovoSeq/trimmomatic/5k-1_R_1P.fastq,/DenovoSeq/trimmomatic/5k-2_R_1P.fastq,/DenovoSeq/trimmomatic/10k_R_1P.fastq -2 /DenovoSeq/trimmomatic/270B_R_2P.fastq,/DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq,/DenovoSeq/trimmomatic/3k_1_R_2P.fastq,/DenovoSeq/trimmomatic/5k-1_R_2P.fastq,/DenovoSeq/trimmomatic/5k-2_R_2P.fastq,/DenovoSeq/trimmomatic/10k_R_2P.fastq

and the stats:

Size_includeN: 767662393
Size_withoutN: 767662393
Seq_Num: 989298
Mean_Size: 775
Median_Size: 428
Longest_Seq: 80889
Shortest_Seq: 200
GC_Content: 32.64
N50: 1115
N90: 340

run3, all raw data, without ins_270

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 /DenovoSeq/raw_data/500B_R1.fastq,/DenovoSeq/raw_data/800B_R1.fastq,/DenovoSeq/raw_data/3k_1_R1.fastq,/DenovoSeq/raw_data/5k-1_R1.fastq,/DenovoSeq/raw_data/5k-2_R1.fastq,/DenovoSeq/raw_data/10k_R1.fastq -2 /DenovoSeq/raw_data/500B_R2.fastq,/DenovoSeq/raw_data/800B_R2.fastq,/DenovoSeq/raw_data/3k_1_R2.fastq,/DenovoSeq/raw_data/5k-1_R2.fastq,/DenovoSeq/raw_data/5k-2_R2.fastq,/DenovoSeq/raw_data/10k_R2.fastq -o megahit_out.no2701

and the stats:

Size_includeN: 800012379
Size_withoutN: 800012379
Seq_Num: 1045419
Mean_Size: 765
Median_Size: 422
Longest_Seq: 52915
Shortest_Seq: 200
GC_Content: 32.64
N50: 1074
N90: 340

run4, all trimmed data, without ins_270

Size_includeN: 679262960
Size_withoutN: 679262960
Seq_Num: 782765
Mean_Size: 867
Median_Size: 429
Longest_Seq: 61533
Shortest_Seq: 200
GC_Content: 32.31
N50: 1529
N90: 348

run5, use only with all PE libraries

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 40 --no-mercy -1 /DenovoSeq/trimmomatic/270B_R_1P.fastq,/DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq -2 /DenovoSeq/trimmomatic/270B_R_2P.fastq,/DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq -o small_insert

and the stats:

Size_includeN: 672319141
Size_withoutN: 672319141
Seq_Num: 777747
Mean_Size: 864
Median_Size: 428
Longest_Seq: 71939
Shortest_Seq: 200
GC_Content: 31.93
N50: 1505
N90: 346

run6, use only with all PE libraries but ins_270

/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 40 --no-mercy -1 /DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq -2 /DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq -o small_insert.no270

and the stats:

Size_includeN: 582099585
Size_withoutN: 582099585
Seq_Num: 575808
Mean_Size: 1010
Median_Size: 433
Longest_Seq: 55064
Shortest_Seq: 200
GC_Content: 31.43
N50: 2158
N90: 358

conclusions

Though not been fully tested, I can draw some simple conclusions

trimmed data generates better results than raw data. (but the way trimming data will influce the results)
using only PE libraries generates better results than using all libraries (PE, MPE)

In practice - SOAPdenovo-fusion

I’ve asked the author that How to use SOAPdenovo-fusion scaffold the output of MEGAHIT?.

I first tried SOAPdenovo-fusion with/without ins_270 library, and found not using ins_270 library got better results (tested with k-mer=63). Then I tested different kmer: 37, 41, 43, 45, 55, 61, 63, 71, 75 and found that kmer=41 got best results. I’ve also tried with/without -F parameter, but I didn’t understand the diference completely.

I just put the config, scripts and stats here when using kmer = 41.

config

#maximal read length
max_rd_len=151
[LIB]
avg_ins=500
reverse_seq=0
asm_flags=2
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=/DenovoSeq/trimmomatic/500B_R_1P.fastq
q2=/DenovoSeq/trimmomatic/500B_R_2P.fastq
[LIB]
#average insert size
avg_ins=800
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=2
#in which order the reads are used while scaffolding
rank=3
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=/DenovoSeq/trimmomatic/800B_R_1P.fastq
q2=/DenovoSeq/trimmomatic/800B_R_2P.fastq
[LIB]
avg_ins=3000
reverse_seq=1
asm_flags=2
rank=3
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=4
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=/DenovoSeq/trimmomatic/3k_1_R_1P.fastq
q2=/DenovoSeq/trimmomatic/3k_1_R_2P.fastq
[LIB]
avg_ins=5000
reverse_seq=1
asm_flags=2
rank=4
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=/DenovoSeq/trimmomatic/5k-1_R_1P.fastq
q2=/DenovoSeq/trimmomatic/5k-1_R_2P.fastq
[LIB]
avg_ins=5000
reverse_seq=1
asm_flags=2
rank=4
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=/DenovoSeq/trimmomatic/5k-2_R_1P.fastq
q2=/DenovoSeq/trimmomatic/5k-2_R_2P.fastq
[LIB]
avg_ins=10000
reverse_seq=1
asm_flags=2
rank=5
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=/DenovoSeq/trimmomatic/10k_R_1P.fastq
q2=/DenovoSeq/trimmomatic/10k_R_2P.fastq

run1, without `-F`

1
2
3

/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 41 -g k41 -c ../final.contigs.fa
/software/SOAPdenovo2-r241/SOAPdenovo-127mer map -s config -p 40 -g k41
/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k41

stats:

Size_includeN: 765246417
Size_withoutN: 574675978
Seq_Num: 377325
Mean_Size: 2028
Median_Size: 393
Longest_Seq: 436254
Shortest_Seq: 200
GC_Content: 31.42
N50: 33478
N90: 439
Gap: 24.9

run2, with `-F`

1
2
3

/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 41 -g k41_1 -c ../final.contigs.fa
/software/SOAPdenovo2-r241/SOAPdenovo-127mer map -s config -p 40 -g k41_1
/software/SOAPdenovo2-r241/SOAPdenovo-127mer scaff -p 40 -g k41_1 -F

stats:

Size_includeN: 764869730
Size_withoutN: 579451220
Seq_Num: 377325
Mean_Size: 2027
Median_Size: 393
Longest_Seq: 436104
Shortest_Seq: 200
GC_Content: 31.45
N50: 33463
N90: 439
Gap: 24.24

It seems that -F parameter didn’t help much (the %gap).

This note can be a reference in case I will have to use it again.

Change log

20180308: create the note.

Intro

General usage

In practice - MEGAHIT

data

run1, all raw data

run2, all trimmed data (by Trimmomatic)

run3, all raw data, without ins_270

run4, all trimmed data, without ins_270

run5, use only with all PE libraries

run6, use only with all PE libraries but ins_270

conclusions

In practice - SOAPdenovo-fusion

run1, without -F

run2, with -F

Change log

run1, without `-F`

run2, with `-F`