Genome Assembly Pipeline: Flye

Intro

Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

This tool is now on biRxiv:

Kolmogorov M, Yuan J, Lin Y, Pevzner P. Assembly of Long Error-Prone Reads Using Repeat Graphs. bioRxiv. 2018 Jan 12:247148. doi:10.1101/247148

My feelings:

easy to use
comparatively good results, good N50, good completeness
not too many parameters to be tested

General usage

# install
git clone https://github.com/fenderglass/Flye Flye-2.3.3
cd Flye-2.3.3
python setup.py build

# usage
$ ./bin/flye -h
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
       --nano-corr | --subassemblies) file1 [file_2 ...]
       --genome-size size --out-dir dir_path [--threads int]
       [--iterations int] [--min-overlap int] [--resume]
       [--debug] [--version] [--help]

Assembly of long and error-prone reads

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio raw reads
  --pacbio-corr path [path ...]
                        PacBio corrected reads
  --nano-raw path [path ...]
                        ONT raw reads
  --nano-corr path [path ...]
                        ONT corrected reads
  --subassemblies path [path ...]
                        high-quality contig-like input
  -g size, --genome-size size
                        estimated genome size (for example, 5m or 2.6g)
  -o path, --out-dir path
                        Output directory
  -t int, --threads int
                        number of parallel threads (default: 1)
  -i int, --iterations int
                        number of polishing iterations (default: 1)
  -m int, --min-overlap int
                        minimum overlap between reads (default: auto)
  --resume              resume from the last completed stage
  --resume-from stage_name
                        resume from a custom stage
  --debug               enable debug output
  -v, --version         show program's version number and exit

Input reads could be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currenlty, raw and corrected reads
from PacBio and ONT are supported. Additionally, --subassemblies
option does a consensus assembly of high-quality input contigs.
You may specify multiple fles with reads (separated by spaces).
Mixing different read types is not yet supported.

You must provide an estimate of the genome size as input,
which is used for solid k-mers selection. The estimate could
be rough (e.g. withing 0.5x-2x range) and does not affect
the other assembly stages. Standard size modificators are
supported (e.g. 5m or 2.6g)

# E. coli P6-C4 PacBio data
wget https://zenodo.org/record/1172816/files/E.coli_PacBio_40x.fasta
flye --pacbio-raw E.coli_PacBio_40x.fasta --out-dir out_pacbio --genome-size 5m --threads 4

# E. coli Oxford Nanopore Technologies data
wget https://zenodo.org/record/1172816/files/Loman_E.coli_MAP006-1_2D_50x.fasta
flye --nano-raw Loman_E.coli_MAP006-1_2D_50x.fasta --out-dir out_nano --genome-size 5m --threads 4

See Flye manual for full usage.

In practice

An insect

The species: high heterogeneity, high AT, high repetition.
Genome size: male 790M, female 830M.
Data used：about 70X PacBio long-reads.
OS environment: CentOS6.6 86_64 glibc-2.12. QSUB grid system. 15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).

version 2.3.2-gd46edb7

Flye version: 2.3.2-gd46edb7

I didn’t test all the parameters. Below is the results based on default settings.

command:

1	flye --pacbio-raw $DATADIR/third/third_all.fasta --out-dir run1 --genome-size 850m --threads 24

stats:

# contig
Size_includeN 724744485
Size_withoutN 724744485
Seq_Num 17602
Mean_Size 41173
Median_Size 17572
Longest_Seq 1648999
Shortest_Seq  55
GC_Content  31.55
N50 91066
N90 17456
Gap 0.0

# scaffold
Size_includeN 724753785
Size_withoutN 724744485
Seq_Num 17509
Mean_Size 41393
Median_Size 17532
Longest_Seq 1648999
Shortest_Seq  55
GC_Content  31.55
N50 92367
N90 17530
Gap 0.0

Version 2.3.3-g47cdd0b

Flye 2.3.3 have two updates appealing to me:

Automatic selection of minimum overlap parameter based on read length
Minimap2 updated

Because I’ve run Canu before, and Flye can start from raw data and corrected data, I’ll test Flye for both.

From raw data

Commands:

1	$TOOLDIR/Flye-2.3.3/bin/flye --pacbio-raw third_all.fasta --out-dir run2 --genome-size 830m --threads 40

Stats:

# contigs
Size_includeN 787629166
Size_withoutN 787629166
Seq_Num 14846
Mean_Size 53053
Median_Size 20542
Longest_Seq 1636300
Shortest_Seq  12
GC_Content  31.6
N50 121564
L50 1699
N90 25419
Gap 0.0

# scaffolds
Size_includeN 787632766
Size_withoutN 787629166
Seq_Num 14810
Mean_Size 53182
Median_Size 20465
Longest_Seq 1692734
Shortest_Seq  12
GC_Content  31.6
N50 122313
L50 1680
N90 25437
Gap 0.0

From corrected data from `Canu` (about 33X)

Commands:

1	$TOOLDIR/Flye-2.3.3/bin/flye --pacbio-corr canu.correctedReads.fasta.gz --out-dir run3 --genome-size 830m --threads 40

Stats:

# contigs
Size_includeN 833065987
Size_withoutN 833065987
Seq_Num 17536
Mean_Size 47506
Median_Size 26593
Longest_Seq 1145994
Shortest_Seq  518
GC_Content  31.47
N50 88680
L50 2594
N90 22129
Gap 0.0

# scaffolds
Size_includeN 833070387
Size_withoutN 833065987
Seq_Num 17492
Mean_Size 47625
Median_Size 26602
Longest_Seq 1145994
Shortest_Seq  518
GC_Content  31.47
N50 89165
L50 2581
N90 22242
Gap 0.0

A plant

The species: high heterogeneity, high repetition.
Genome size: 2.1G.
Data used：more than 100X PacBio long reads.
OS environment: CentOS6.6 86_64 glibc-2.12. QSUB grid system. 15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).

run1, more than 100X data

commands:

1	$path2flye --pacbio-raw $WORKDIR/data/Pacbio/all.fasta --out-dir run1 --genome-size 2g --threads 30

But I came across a memory issue: ERROR: Caught unhandled exception: std::bad_alloc in both 2.3.2 and 2.3.3. And the author suggested me to downsample the data.

And I asked him that what’s the difference: using all raw data (say 100X) and using downsampling data (say longest 50X)? He said “You might have extra connectivity information in these 100x reads (you can resolve more repeats, for example). But some studies suggest (Canu paper, for example) that you don’t really need more than 40x in general (but it, of course, also depends on the genome complexity, ploidy etc…). Plus, extra coverage helps to get a good final consensus.”

run2, with about 50X data

I used SelectLongestReads to downsample about 50X data and ran Flye again.

# run1.1, extract 50X data
SelectLongestReads sum 100000000000 longest 1 o Pacbio_50x.fasta f all.fasta

# remove @ from the fasta
replace_@_in_fasta_header.py Pacbio_50x.fasta > Pacbio_50x_no_@.fasta

$path2flye --pacbio-raw Pacbio_50x_no_@.fasta --out-dir run1.1 --genome-size 2g --threads 30 --resume

The reason why I removed the @ from the headers was because I encountered another problem: ERROR: parse error in 1-consensus/consensus.fasta on line 1: empty sequence. It seemed that Flye would ignore these headers.

And the stats I got:

# contigs
Size_includeN 1640872256
Size_withoutN 1640872256
Seq_Num 9843
Mean_Size 166704
Median_Size 72841
Longest_Seq 8808184
Shortest_Seq  139
GC_Content  37.78
N50 398108
L50 1119
N90 83615
Gap 0.0

# scaffolds
Size_includeN 1640874656
Size_withoutN 1640872256
Seq_Num 9819
Mean_Size 167112
Median_Size 72812
Longest_Seq 8808184
Shortest_Seq  139
GC_Content  37.78
N50 399898
L50 1114
N90 83809
Gap 0.0

Not bad.

run3, with corrected data from `Canu` (about 37X)

The Canu version was 1.7.