Intro
From its git repo:
Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.
This tool is now on bioRxiv:
Kolmogorov M, Yuan J, Lin Y, Pevzner P. Assembly of Long Error-Prone Reads Using Repeat Graphs. bioRxiv. 2018 Jan 12:247148. doi:10.1101/247148
My feelings:
- easy to use
- comparatively good results, good N50, good completeness
- few parameters to tune
General usage
# install
See Flye manual for full usage.
In practice
An insect
- The species: high heterogeneity, high AT, high repetition.
- Genome size: male 790 Mb, female 830 Mb.
- Data used: about 70X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12. QSUB grid system. 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
version 2.3.2-gd46edb7
Flye version: 2.3.2-gd46edb7
I didn’t test all the parameters. Below are the results based on default settings.
command:
flye --pacbio-raw $DATADIR/third/third_all.fasta --out-dir run1 --genome-size 850m --threads 24
stats:
# contig
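For readers who want to reproduce contig statistics such as N50 on their own assembly, here is a minimal sketch. It is not the script used for the numbers in this note; the function names `n50` and `fasta_lengths` are hypothetical helpers introduced only for illustration.

```python
def fasta_lengths(lines):
    """Yield one sequence length per FASTA record from an iterable of lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    contain at least half of all bases. Returns 0 for empty input."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

To use it on a real assembly, pass an open file handle of the contig FASTA to `fasta_lengths` and feed the result to `n50`.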
Version 2.3.3-g47cdd0b
Flye 2.3.3 has two updates that appeal to me:
- Automatic selection of minimum overlap parameter based on read length
- Minimap2 updated
Because I have run Canu before, and Flye can start from either raw or corrected reads, I tested Flye on both.
From raw data
Commands:
TOOLDIR/Flye-2.3.3/bin/flye --pacbio-raw third_all.fasta --out-dir run2 --genome-size 830m --threads 40
Stats:
contigs
From data corrected by Canu (about 33X)
Commands:
TOOLDIR/Flye-2.3.3/bin/flye --pacbio-corr canu.correctedReads.fasta.gz --out-dir run3 --genome-size 830m --threads 40
Stats:
contigs
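The raw and corrected runs differ only in the read-type flag. A small sketch of a command builder makes that explicit. It uses only the options that appear in the commands of this note (`--pacbio-raw`, `--pacbio-corr`, `--out-dir`, `--genome-size`, `--threads`); the function name `flye_command` and its parameters are hypothetical, and nothing is executed.

```python
import shlex


def flye_command(reads, out_dir, genome_size, threads,
                 corrected=False, flye="flye"):
    """Build (but do not run) a Flye command line as a list of arguments.

    corrected=False uses --pacbio-raw, corrected=True uses --pacbio-corr.
    The 'flye' argument is the path to the executable (an assumption here).
    """
    read_flag = "--pacbio-corr" if corrected else "--pacbio-raw"
    return [
        flye, read_flag, reads,
        "--out-dir", out_dir,
        "--genome-size", genome_size,
        "--threads", str(threads),
    ]


def as_shell(args):
    """Render an argument list as a single shell-quoted string."""
    return " ".join(shlex.quote(a) for a in args)
```

The returned list can be handed to `subprocess.run` or printed with `as_shell` and pasted into a job script for the grid system.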
A plant
- The species: high heterogeneity, high repetition.
- Genome size: 2.1 Gb.
- Data used: more than 100X PacBio long reads.
- OS environment: CentOS 6.6 x86_64, glibc 2.12. QSUB grid system. 15 fat nodes (2 TB RAM, 40 CPUs) and 10 blade nodes (156 GB RAM, 24 CPUs).
run1, more than 100X data
commands:
path2flye --pacbio-raw $WORKDIR/data/Pacbio/all.fasta --out-dir run1 --genome-size 2g --threads 30
However, I ran into a memory issue in both 2.3.2 and 2.3.3: ERROR: Caught unhandled exception: std::bad_alloc. The author suggested downsampling the data.
I asked him what the difference is between using all the raw data (say 100X) and using downsampled data (say the longest 50X). He said: “You might have extra connectivity information in these 100x reads (you can resolve more repeats, for example). But some studies suggest (Canu paper, for example) that you don’t really need more than 40x in general (but it, of course, also depends on the genome complexity, ploidy etc…). Plus, extra coverage helps to get a good final consensus.”
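To put numbers on the coverage discussion, here is a rough sketch that converts a genome size written in the same style as the `--genome-size` values above (`850m`, `2g`) into bases and estimates coverage from the total number of sequenced bases. The helper names are hypothetical, and reading `k`, `m` and `g` as powers of ten is my assumption.

```python
def parse_genome_size(text):
    """Convert a size such as '850m' or '2.1g' to a number of bases.

    Assumption: k, m and g mean 10**3, 10**6 and 10**9.
    A plain integer string is returned unchanged as an int.
    """
    units = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if text and text[-1] in units:
        return round(float(text[:-1]) * units[text[-1]])
    return int(text)


def coverage(total_bases, genome_size):
    """Estimated coverage: total read bases divided by genome size."""
    return total_bases / parse_genome_size(genome_size)


def bases_for_coverage(target_coverage, genome_size):
    """Number of read bases needed to reach the target coverage."""
    return target_coverage * parse_genome_size(genome_size)
```

For a 2.1 Gb genome, for example, `bases_for_coverage(50, "2.1g")` gives the amount of sequence to keep when downsampling to about 50X.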
run2, with about 50X data
I used SelectLongestReads to downsample the data to about 50X and ran Flye again.
run1.1, extract 50X data
I removed the @ from the headers because I had encountered another problem: ERROR: parse error in 1-consensus/consensus.fasta on line 1: empty sequence. It seemed that Flye would ignore these headers.
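The two steps above (keeping the longest reads up to a coverage target, then cleaning the headers) can be sketched in Python as follows. This is not SelectLongestReads and not the command I actually ran; the function names are hypothetical, and I assume the stray `@` sits directly after the `>` of each FASTA header.

```python
def select_longest(reads, genome_size, target_coverage):
    """Keep the longest reads until their total length reaches
    target_coverage * genome_size bases.

    reads: iterable of (name, sequence) tuples.
    genome_size: genome size in bases (int).
    """
    budget = genome_size * target_coverage
    chosen, total = [], 0
    for name, seq in sorted(reads, key=lambda r: len(r[1]), reverse=True):
        if total >= budget:
            break
        chosen.append((name, seq))
        total += len(seq)
    return chosen


def clean_header(line):
    """Drop '@' characters that directly follow '>' in a FASTA header.

    Assumption: headers look like '>@readname'. Other lines are unchanged.
    """
    if line.startswith(">"):
        return ">" + line[1:].lstrip("@")
    return line
```

This version holds all reads in memory, which is fine for a toy example but not for a 100X dataset of a 2 Gb genome. For real data, one would first collect only the read lengths, derive a length cutoff, and then stream the file a second time.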
And the stats I got:
contigs
Not bad.
run3, with corrected data from Canu (about 37X)
The Canu version was 1.7.
commands:
run2, pacbio corrected by canu, default
stats:
contigs
Useful links
Change log
- 20180314: create the note.
- 20180428: test version 2.3.3, and run from corrected reads of Canu
- 20180630: add the part of ‘A plant’.