Genome Assembly Pipeline: OPERA-LG

tags: bio-tools, genome assembly pipeline, hybrid genome assembly, scaffolding

category: genome assembly, hybrid pipeline

Intro

From the OPERA wiki:

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011).

Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.

The paper:

Gao S, Bertrand D, Chia BKH, Nagarajan N. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology. 2016;17:102. doi:10.1186/s13059-016-0951-y

My feelings:

  • too many dependencies
  • not so easy to use
  • has bugs
  • supports re-scaffolding
  • can’t use NGS reads and long reads simultaneously

In practice

See the OPERA wiki for full docs.

Scripts used:

#!/bin/sh
#PBS -N OPERA-LG
#PBS -j eo
#PBS -q Test
#PBS -l nodes=1:ppn=8
#PBS -d /DenovoSeq/OPERA-LG
#PBS -V

echo Start time is `date +%Y/%m/%d--%H:%M`

# scaffold with short reads
## preprocess reads
#for sample in 270B 500B 800B 3k_1 5k-1 5k-2 10k; do
#perl /software/OPERA-LG_v2.0.6/bin/preprocess_reads.pl --contig /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa --illumina-read1 /DenovoSeq/trimmomatic/${sample}_R_1P.fastq --illumina-read2 /DenovoSeq/trimmomatic/${sample}_R_2P.fastq --out ${sample}.map --tool-dir /software/bwa-0.7.15 --samtools-dir /software/samtools-0.1.19
#done

## with all libraries
/software/OPERA-LG_v2.0.6/bin/OPERA-LG /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa 270B.map,500B.map,800B.map,3k_1.map,5k-1.map,5k-2.map,10k.map ./opera /software/samtools-0.1.19

## without 270 library
/software/OPERA-LG_v2.0.6/bin/OPERA-LG /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa 500B.map,800B.map,3k_1.map,5k-1.map,5k-2.map,10k.map ./opera.no270 /software/samtools-0.1.19

# This is the first run of OPERA-LG, with 270 library, and megahit's contigs
perl /software/OPERA-LG_v2.0.6/bin/OPERA-long-read.pl --contig-file /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa --illumina-read1 10k_R1.fasta --illumina-read2 10k_R2.fasta --long-read-file av_20k.fasta --output-prefix 10k.lr --output-directory ./ --num-of-processors 40 --blasr /src/wgs-8.3rc2/Linux-amd64/bin --short-read-tooldir /software/bwa-0.7.15 --opera /software/OPERA-LG_v2.0.6/bin

# This is the second run of OPERA-LG, re-scaffold the results of SOAP-fusion. ins_270 library
perl /software/OPERA-LG_v2.0.6/bin/OPERA-long-read.pl --contig-file /DenovoSeq/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq --illumina-read1 /DenovoSeq/trimmomatic/270B_R_1P.fasta --illumina-read2 /DenovoSeq/trimmomatic/270B_R_2P.fasta --long-read-file /DenovoSeq/Third_rawData/av_20k.fasta --output-prefix 270B.lr --output-directory ./270B --num-of-processors 10 --blasr /software/src/wgs-8.3rc2/Linux-amd64/bin --short-read-tooldir /software/bwa-0.7.15 --opera /software/OPERA-LG_v2.0.6/bin --samtools-dir /software/samtools-0.1.19/

# This is the third run of OPERA-LG, re-scaffold the results of SOAP-fusion. ins_500 library
perl /software/OPERA-LG_v2.0.6/bin/OPERA-long-read.pl --contig-file /DenovoSeq/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq --illumina-read1 /DenovoSeq/trimmomatic/500B_R_1P.fasta --illumina-read2 /DenovoSeq/trimmomatic/500B_R_2P.fasta --long-read-file /DenovoSeq/Third_rawData/av_20k.fasta --output-prefix 500B.lr --output-directory ./500B --num-of-processors 10 --blasr /software/src/wgs-8.3rc2/Linux-amd64/bin --short-read-tooldir /software/bwa-0.7.15 --opera /software/OPERA-LG_v2.0.6/bin --samtools-dir /software/samtools-0.1.19/

echo Finish time is `date +%Y/%m/%d--%H:%M`
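
The OPERA-LG calls above take the mapping files as one comma-separated argument, so a missing or empty .map file is easy to overlook. A minimal sketch of a pre-flight check follows; `join_maps` is a hypothetical helper of my own, not part of OPERA-LG, and it assumes each library's mapping file is named `<library>.map` in the current directory, as in the script above.

```sh
#!/bin/sh
# Hypothetical helper (not part of OPERA-LG): print "a.map,b.map,..." for the
# given library names, and fail if any mapping file is missing or empty.
join_maps() {
    list=""
    for sample in "$@"; do
        map="${sample}.map"
        if [ ! -s "$map" ]; then
            echo "missing or empty mapping file: $map" >&2
            return 1
        fi
        # Append with a comma only when the list already has an entry.
        list="${list:+$list,}$map"
    done
    printf '%s\n' "$list"
}

# Possible usage, reusing the paths from the script above:
# maps=$(join_maps 500B 800B 3k_1 5k-1 5k-2 10k) || exit 1
# /software/OPERA-LG_v2.0.6/bin/OPERA-LG /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa "$maps" ./opera.no270 /software/samtools-0.1.19
```

Failing early like this would at least rule out an incomplete library list as the reason for a run that changes nothing.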

The stats I got:

OPERA

with all libraries

Size_includeN: 679262960
Size_withoutN: 679262960
Seq_Num: 782765
Mean_Size: 867
Median_Size: 429
Longest_Seq: 61533
Shortest_Seq: 200
GC_Content: 32.31
N50: 1529
N90: 348
Gap: 0.0

without ins_270 library:

Size_includeN: 679262960
Size_withoutN: 679262960
Seq_Num: 782765
Mean_Size: 867
Median_Size: 429
Longest_Seq: 61533
Shortest_Seq: 200
GC_Content: 32.31
N50: 1529
N90: 348
Gap: 0.0

OPERA-LG

First run with long-reads, with 270 library, and megahit’s contigs:

Size_includeN: 767662393
Size_withoutN: 767662393
Seq_Num: 989298
Mean_Size: 775
Median_Size: 428
Longest_Seq: 80889
Shortest_Seq: 200
GC_Content: 32.64
N50: 1115
N90: 340
Gap: 0.0

Second run, re-scaffold the results of SOAP-fusion. ins_270 library

Size_includeN: 765246417
Size_withoutN: 574675978
Seq_Num: 377325
Mean_Size: 2028
Median_Size: 393
Longest_Seq: 436254
Shortest_Seq: 200
GC_Content: 31.42
N50: 33478
N90: 439
Gap: 24.9
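
The Gap value looks like the percentage of N bases in the assembly, i.e. (Size_includeN - Size_withoutN) / Size_includeN. That is my assumption about how the stats tool defines it, but it does reproduce the number above. A quick check, with `gap_percent` as a hypothetical helper name:

```sh
#!/bin/sh
# Hypothetical helper: gap percentage from total size and size without Ns.
# Assumes Gap = (Size_includeN - Size_withoutN) / Size_includeN * 100.
gap_percent() {
    awk -v total="$1" -v nogap="$2" \
        'BEGIN { printf "%.1f\n", (total - nogap) / total * 100 }'
}

gap_percent 765246417 574675978   # prints 24.9
```

By the same formula, a Gap of 0.0 with Size_includeN equal to Size_withoutN means the output contains no N bases at all.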

Third run, re-scaffold the results of SOAP-fusion. ins_500 library

Size_includeN: 765246417
Size_withoutN: 574675978
Seq_Num: 377325
Mean_Size: 2028
Median_Size: 393
Longest_Seq: 436254
Shortest_Seq: 200
GC_Content: 31.42
N50: 33478
N90: 439
Gap: 24.9
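
To compare an input assembly with its re-scaffolded output, N50 can be recomputed directly from the FASTA files. The sketch below uses the usual definition (the length L such that sequences of length L or longer cover at least half of the total bases); I am assuming the stats tool I used follows the same definition. `fasta_lengths` and `n50` are hypothetical helper names.

```sh
#!/bin/sh
# Print the length of every sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# N50 under the usual definition: walk the lengths from longest to shortest
# and report the length at which the running sum reaches half of the total.
n50() {
    fasta_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            run = 0
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { print lens[i]; exit }
            }
        }'
}

# Possible usage:
# n50 /DenovoSeq/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq
```

Running it on both the input scaffolds and the OPERA output shows right away whether re-scaffolding changed anything.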

What did this software do? The scaffold N50 of the SOAPdenovo-fusion input was already 33478, so re-scaffolding did not improve it at all … What a waste of time!

This note can serve as a reference in case I have to use it again…
