MEGAHIT can be used to assemble contigs, and SOAPdenovo-fusion can be used for scaffolding. Since they were developed by the same team, I put them together in one note.
This note is mostly about MEGAHIT and its performance, because SOAPdenovo-fusion is optional. That said, SOAPdenovo-fusion performed comparatively well in my case, so why not give it a try?
Intro
From the MEGAHIT git repo:
MEGAHIT is a single node assembler for large and complex metagenomics NGS reads, such as soil. It makes use of succinct de Bruijn graph (SdBG) to achieve low memory assembly. MEGAHIT can optionally utilize a CUDA-enabled GPU to accelerate its SdBG construction. The GPU-accelerated version of MEGAHIT has been tested on NVIDIA GTX680 (4G memory) and Tesla K40c (12G memory) with CUDA 5.5, 6.0 and 6.5. MEGAHIT v1.0 or greater also supports IBM Power PC and has been tested on IBM POWER8.
Its paper
Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–1676. doi:10.1093/bioinformatics/btv033
My impressions:
- very easy to use
- fast enough
- better results than SOAPdenovo2 (in my case)
- no need to pick a single k-mer size by hand
General usage
See MEGAHIT wiki for full docs.
In practice - MEGAHIT
- An insect
- The species: high heterogeneity, high AT, high repetition.
- Genome size: male 790 Mb, female 830 Mb.
data
The Illumina data I used:
Source | Insert size (bp) | Avg. read size (bp) | Raw bases (G) | Raw reads (M) | Sequencing depth |
---|---|---|---|---|---|
AV1, M | 270 | 150 | 44.1 | 293.6 | 55.5 |
AV2, F | 500 | 150 | 24.4 | 162.8 | 29.4 |
AV2, F | 800 | 150 | 15.8 | 105.4 | 19.0 |
AV2, F | 3k | 114 | 10.4 | 91.8 | 12.5 |
AV2, F | 5k | 150 | 29.8 | 198.7 | 35.9 |
AV2, F | 5k | 114 | 11.5 | 101.2 | 13.8 |
AV2, F | 10k | 150 | 17.5 | 116.8 | 21.1 |
Total | - | - | 153.5 | 1070.3 | 187.3 |
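The depth column can be reproduced with simple arithmetic. Below is a minimal sketch, assuming depth = raw bases / genome size; this is consistent with the female rows (e.g. 24.4 Gb against the 830 Mb female genome). `depth_x` is a hypothetical helper name, not part of any tool mentioned here.

```shell
# depth_x <raw_bases_in_Gb> <genome_size_in_Mb>
# Prints the expected sequencing depth (fold coverage) with one decimal place.
# Assumption: depth = raw bases / genome size.
depth_x() {
  awk -v gb="$1" -v mb="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / mb }'
}

# The 500 bp female library against the 830 Mb female genome
depth_x 24.4 830
```

The male row in the table (55.5) is slightly lower than this formula gives with 790 Mb, so the table values were probably computed from unrounded base counts.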
I tried MEGAHIT with raw vs. trimmed data, with and without the ins_270 library, and with all libraries (PE and MPE) vs. PE libraries only. Below are the commands I used and the stats I got. I tested with and without the ins_270 library because it came from a male, whereas the other libraries came from females.
I compared all libraries against PE-only because I first ran MEGAHIT with all the data I had, and then the author recommended using only the PE libraries. See the discussions with the authors:
- Segmentation fault with scaff step when use different MEGAHIT’s output
- How to use SOAPdenovo-fusion scaffold the output of MEGAHIT?
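The runs below differ mainly in which libraries go into the comma-separated `-1`/`-2` lists. Here is a minimal sketch for generating such a command from a list of library names instead of typing the lists by hand. `megahit_cmd` is a hypothetical helper; it only prints the command (a dry run) and assumes the `<lib>_R1.fastq` / `<lib>_R2.fastq` naming of the raw data. The trimmed files use `_R_1P.fastq` / `_R_2P.fastq`, so the suffixes would need adjusting there.

```shell
# megahit_cmd <read_dir> <out_dir> <threads> <lib>...
# Prints (does not run) a MEGAHIT command for the given libraries.
# Hypothetical helper; assumes files are named <lib>_R1.fastq and <lib>_R2.fastq.
megahit_cmd() {
  dir=$1; out=$2; threads=$3
  shift 3
  r1=""; r2=""
  for lib in "$@"; do
    # ${var:+$var,} adds a comma only when the list is already non-empty
    r1="${r1:+$r1,}$dir/${lib}_R1.fastq"
    r2="${r2:+$r2,}$dir/${lib}_R2.fastq"
  done
  echo "megahit -t $threads --no-mercy -1 $r1 -2 $r2 -o $out"
}

# All raw libraries
megahit_cmd /DenovoSeq/raw_data megahit_out1 20 270B 500B 800B 3k_1 5k-1 5k-2 10k
```

Dropping a library (for example 270B) is then just a matter of leaving it out of the argument list.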
run1, all raw data
/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 /DenovoSeq/raw_data/270B_R1.fastq,/DenovoSeq/raw_data/500B_R1.fastq,/DenovoSeq/raw_data/800B_R1.fastq,/DenovoSeq/raw_data/3k_1_R1.fastq,/DenovoSeq/raw_data/5k-1_R1.fastq,/DenovoSeq/raw_data/5k-2_R1.fastq,/DenovoSeq/raw_data/10k_R1.fastq -2 /DenovoSeq/raw_data/270B_R2.fastq,/DenovoSeq/raw_data/500B_R2.fastq,/DenovoSeq/raw_data/800B_R2.fastq,/DenovoSeq/raw_data/3k_1_R2.fastq,/DenovoSeq/raw_data/5k-1_R2.fastq,/DenovoSeq/raw_data/5k-2_R2.fastq,/DenovoSeq/raw_data/10k_R2.fastq -o megahit_out1
and the stats:
Size_includeN: 894816039
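For reference, here is a minimal sketch of how a size figure like this can be computed from a contig FASTA file. I am assuming `Size_includeN` means the total assembly length with N bases counted; the other labels (`Size_withoutN`, `Seq_num`, `Gap_percent`) and the `fasta_size` name are my own and not the output of any particular stats tool.

```shell
# fasta_size <fasta_file>
# Prints total length with N bases, length without N, sequence count, and %N.
# Assumption: Size_includeN = total number of bases, N included.
fasta_size() {
  awk '
    /^>/ { n++; next }                      # header line: count the sequence
    {
      seq = toupper($0)
      gsub(/[ \t\r]/, "", seq)              # drop stray whitespace
      total += length(seq)
      ns += gsub(/N/, "", seq)              # gsub returns the number of N removed
    }
    END {
      printf "Size_includeN: %d\n", total
      printf "Size_withoutN: %d\n", total - ns
      printf "Seq_num: %d\n", n
      printf "Gap_percent: %.2f\n", (total ? 100 * ns / total : 0)
    }
  ' "$1"
}

# Usage (MEGAHIT writes its contigs to final.contigs.fa in the output directory):
# fasta_size megahit_out1/final.contigs.fa
```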
run2, all trimmed data (by Trimmomatic)
/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 38 --no-mercy -1 /DenovoSeq/trimmomatic/270B_R_1P.fastq,/DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq,/DenovoSeq/trimmomatic/3k_1_R_1P.fastq,/DenovoSeq/trimmomatic/5k-1_R_1P.fastq,/DenovoSeq/trimmomatic/5k-2_R_1P.fastq,/DenovoSeq/trimmomatic/10k_R_1P.fastq -2 /DenovoSeq/trimmomatic/270B_R_2P.fastq,/DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq,/DenovoSeq/trimmomatic/3k_1_R_2P.fastq,/DenovoSeq/trimmomatic/5k-1_R_2P.fastq,/DenovoSeq/trimmomatic/5k-2_R_2P.fastq,/DenovoSeq/trimmomatic/10k_R_2P.fastq
and the stats:
Size_includeN: 767662393
run3, all raw data, without ins_270
/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 20 --no-mercy -1 /DenovoSeq/raw_data/500B_R1.fastq,/DenovoSeq/raw_data/800B_R1.fastq,/DenovoSeq/raw_data/3k_1_R1.fastq,/DenovoSeq/raw_data/5k-1_R1.fastq,/DenovoSeq/raw_data/5k-2_R1.fastq,/DenovoSeq/raw_data/10k_R1.fastq -2 /DenovoSeq/raw_data/500B_R2.fastq,/DenovoSeq/raw_data/800B_R2.fastq,/DenovoSeq/raw_data/3k_1_R2.fastq,/DenovoSeq/raw_data/5k-1_R2.fastq,/DenovoSeq/raw_data/5k-2_R2.fastq,/DenovoSeq/raw_data/10k_R2.fastq -o megahit_out.no2701
and the stats:
Size_includeN: 800012379
run4, all trimmed data, without ins_270
Size_includeN: 679262960
run5, trimmed data, PE libraries only
/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 40 --no-mercy -1 /DenovoSeq/trimmomatic/270B_R_1P.fastq,/DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq -2 /DenovoSeq/trimmomatic/270B_R_2P.fastq,/DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq -o small_insert
and the stats:
Size_includeN: 672319141
run6, trimmed data, PE libraries only, without ins_270
/software/megahit_v1.1.1_LINUX_CPUONLY_x86_64-bin/megahit -t 40 --no-mercy -1 /DenovoSeq/trimmomatic/500B_R_1P.fastq,/DenovoSeq/trimmomatic/800B_R_1P.fastq -2 /DenovoSeq/trimmomatic/500B_R_2P.fastq,/DenovoSeq/trimmomatic/800B_R_2P.fastq -o small_insert.no270
and the stats:
Size_includeN: 582099585
conclusions
Although I have not tested this thoroughly, I can draw some simple conclusions:
- Trimmed data generates better results than raw data (though how the data is trimmed will influence the results).
- Using only PE libraries generates better results than using all libraries (PE and MPE).
In practice - SOAPdenovo-fusion
I asked the author about this in the discussion "How to use SOAPdenovo-fusion scaffold the output of MEGAHIT?".
I first tried SOAPdenovo-fusion with and without the ins_270 library, and found that leaving out ins_270 gave better results (tested with k-mer = 63). Then I tested different k-mer sizes (37, 41, 43, 45, 55, 61, 63, 71, 75) and found that k-mer = 41 gave the best results. I also tried with and without the -F parameter, but I didn't completely understand the difference.
Here are the config, commands and stats for k-mer = 41.
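A k-mer sweep like this is easy to script. Here is a minimal sketch that prints one SOAPdenovo-fusion command per k-mer size, using the same form as the k-mer = 41 command in this note. `kmer_sweep` is a hypothetical helper and is only a dry run; the `config` file name, `-p 40` and the `../final.contigs.fa` path are taken from my own commands and would differ elsewhere.

```shell
# kmer_sweep <k>...
# Prints (does not run) one SOAPdenovo-fusion command per k-mer size.
# Hypothetical helper; each run gets its own output prefix (-g k<k>).
kmer_sweep() {
  for k in "$@"; do
    echo "SOAPdenovo-fusion -D -s config -p 40 -K $k -g k$k -c ../final.contigs.fa"
  done
}

kmer_sweep 37 41 43 45 55 61 63 71 75
```

Giving each k its own `-g` prefix keeps the outputs of different runs from overwriting each other, which makes the stats easy to compare afterwards.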
config
#maximal read length
run1, without -F
/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 41 -g k41 -c ../final.contigs.fa
stats:
Size_includeN: 765246417
run2, with -F
/software/SOAPdenovo2-r241/SOAPdenovo-fusion -D -s config -p 40 -K 41 -g k41_1 -c ../final.contigs.fa
stats:
Size_includeN: 764869730
It seems that the -F parameter didn't help much (judging by the %gap).
This note can serve as a reference in case I have to use these tools again.
Change log
- 20180308: create the note.