Genome assembly pipeline: OPERA-LG
tags: bio-tools, genome assembly pipeline, hybrid genome assembly, scaffolding
category: genome assembly, hybrid pipeline
Intro
From The OPERA wiki
OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011).
Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.
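To make the idea of scaffolding from read pairs concrete, here is a toy sketch of the usual first step: counting how many read pairs link two different contigs and keeping only well-supported links. This is not OPERA's algorithm or its input format. The function name `link_counts`, the `(contig_a, contig_b)` tuple input, and the `min_links` threshold are all assumptions made for illustration.

```python
from collections import Counter


def link_counts(pair_mappings, min_links=2):
    """Count read pairs that connect two different contigs.

    pair_mappings: iterable of (contig_a, contig_b) tuples, one per read
    pair, naming the contig each mate mapped to (hypothetical input).
    min_links: hypothetical support threshold; weaker links are dropped.
    """
    counts = Counter()
    for contig_a, contig_b in pair_mappings:
        # Pairs with both mates on the same contig carry no scaffolding information.
        if contig_a != contig_b:
            # Sort the names so (c1, c2) and (c2, c1) count as the same link.
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {edge: n for edge, n in counts.items() if n >= min_links}


pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
print(link_counts(pairs))
```

A real scaffolder also has to use mate orientation and insert size to decide the order and orientation of the linked contigs, which this sketch ignores.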
Its paper
Gao S, Bertrand D, Chia BKH, Nagarajan N. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology. 2016;17:102. doi:10.1186/s13059-016-0951-y
My feelings:
- too many dependencies
- not so easy to use
- has bugs
- supports re-scaffolding
- can’t use NGS reads and long-reads simultaneously.
In practice
See the OPERA wiki for the full documentation.
Scripts used:
#!/bin/sh
The stats I got:
OPERA
With all libraries:
Size_includeN: 679262960
Without the ins_270 library:
Size_includeN: 679262960
OPERA-LG
First run: long reads, the ins_270 library, and MEGAHIT's contigs:
Size_includeN: 767662393
Second run: re-scaffolding the results of SOAP-fusion with the ins_270 library:
Size_includeN: 765246417
Third run: re-scaffolding the results of SOAP-fusion with the ins_500 library:
Size_includeN: 765246417
What did this software do? The scaffold N50 of SOAPdenovo-fusion is 33478 … What a waste of time!
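For reference, a minimal sketch of how the two figures I compare between runs (total size and scaffold N50) can be computed from a scaffold FASTA. It assumes that Size_includeN means the total length with N gap characters counted. The helper names `fasta_lengths` and `n50` are my own, not part of OPERA or of any stats tool.

```python
def fasta_lengths(lines):
    """Return the length of each record in FASTA-formatted lines (Ns included)."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total size (0 for empty input)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


records = [">s1", "ACGTNN", ">s2", "AC"]
sizes = fasta_lengths(records)
print(sum(sizes), n50(sizes))
```

To run it on a real assembly, pass an open file handle (for example `fasta_lengths(open(path))`, where `path` is a placeholder for the scaffold file) instead of the toy list.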
This note can serve as a reference in case I have to use it again…