Genome Assembly Pipeline: miniasm & Racon

miniasm + Racon is a long-read de novo genome assembly pipeline.

miniasm + Racon assembly pipeline

There are two good examples:

and a paper based on miniasm, actually, it is a consensus tool called Racon ^[1].

The miniasm + Racon pipeline consists of the following steps:

using minimap/minimap2 ^[2] for fast all-vs-all overlap of raw reads (Minimap for overlap detection, Overlap)
using miniasm, this “simply concatenates pieces of read sequences to generate the final sequences. Thus the per-base error rate is similar to the raw input reads.” (Miniasm layout for generating raw contigs, Layout)
mapping the raw reads back to the assembly using minimap again (Minimap for mapping of raw reads to raw contigs, Consensus)
using racon (‘rapid consensus’) for consensus calling (Racon for generating high-quality consensus sequences, Consensus)

Compared with general pipelines, it achieves ‘similar or better quliaty’ while ‘being an order of magnitude faster’.

As described in the miniasm paper ^[3], published long-read assembly pipelines all include four stages:

all-vs-all raw read mapping
raw read error correction
assembly of error corrected reads (may involve all-vs-all read mapping again, but as the error rate is much reduced at this step, it is easier and faster than stage 1)
contig consensus polish

Intro

minimap

Minimap is an experimental tool to efficiently find multiple approximate mapping positions between two sets of long sequences, such as between reads and reference genomes, between genomes and between long noisy reads. By default, it is tuned to have high sensitivity to 2kb matches around 20% divergence but with low specificity. Minimap does not generate alignments as of now and because of this, it is usually tens of times faster than mainstream aligners. With four CPU cores, minimap can map 1.6Gbp PacBio reads to human in 2.5 minutes, 1Gbp PacBio E. coli reads to pre-indexed 9.6Gbp bacterial genomes in 3 minutes, to pre-indexed >100Gbp nt database in ~1 hour (of which ~20 minutes are spent on loading index from the network filesystem; peak RAM: 10GB), map 2800 bacteria to themselves in 1 hour, and map 1Gbp E. coli reads against themselves in a couple of minutes.

Minimap does not replace mainstream aligners, but it can be useful when you want to quickly identify long approximate matches at moderate divergence among a huge collection of sequences. For this task, it is much faster than most existing tools.

minimap2

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 preprint.

miniasm

miniasm was developed by Heng Li.

From its git repo:

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far miniasm is in early development stage. It has only been tested on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlapers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Algorithm Overview

Crude read selection. For each read, find the longest contiguous region covered by three good mappings. Get an approximate estimate of read coverage.

Fine read selection. Use the coverage information to find the good regions again but with more stringent thresholds. Discard contained reads.

Generate a string graph. Prune tips, drop weak overlaps and collapse short bubbles. These procedures are similar to those implemented in short-read assemblers.

Merge unambiguous overlaps to produce unitig sequences.

Limitations

Consensus base quality is similar to input reads (may be fixed with a consensus tool).

Only tested on a dozen of high-coverage PacBio/ONT data sets (more testing needed).

Prone to collapse repeats or segmental duplications longer than input reads (hard to fix without error correction).

Since miniasm is not a stand-alone genome assembly tool, it depends on minimap or minimap2. minimap had been archived by the author, and minimap2 now is the successor. But minimap is also worth a try.

In this note I only used minimap or minimap2 as a read overlapper for assembly. Go see the docs of minimap2 for full instructions.

Racon

Racon is a consensus module for raw de novo DNA assembly of long uncorrected reads.

Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.

Racon can be used as a polishing tool after the assembly with either Illumina data or data produced by third generation of sequencing. The type of data inputed is automatically detected.

Racon takes as input only three files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format. Output is a set of polished contigs in FASTA format printed to stdout. All input files can be compressed with gzip.

Racon can also be used as a read error-correction tool. In this scenario, the MHAP/PAF/SAM file needs to contain pairwise overlaps between reads with dual overlaps.

A wrapper script is also available to enable easier usage to the end-user for large datasets. It has the same interface as racon but adds two additional features from the outside. Sequences can be subsampled to decrease the total execution time (accuracy might be lower) while target sequences can be split into smaller chunks and run sequentially to decrease memory consumption. Both features can be run at the same time as well.

My feelings (about miniasm):

very fast
comparatively good results
The docs are not good enough
huge memory consumption (may not suitable for large genome)
bugs (at least for miniasm)

General usage

# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/minimap2 && (cd minimap2 && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)

# Install Racon (requiring gcc 4.8+ or clang 3.4+, and cmake 3.2+)
git clone --recursive https://github.com/isovic/racon.git racon
cd racon
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

# All-vs-all PacBio read Overlap with minimap
minimap/minimap -Sw5 -L100 -m0 -t 8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# or minimap2
minimap2/minimap2 -x ava-pb -t 8 reads.fq reads.fq | gzip -1 > reads.paf.gz

# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

# Consensus
## GFA to fasta
awk '$1 ~/S/ {print ">"$2"\n"$3}' reads.gfa > reads.fasta

## Correction 1
minimap/minimap2 -t 8 reads.fasta reads.fq > reads.gfa1.paf
racon -t 8 reads.fq reads.gfa1.paf reads.fasta reads.racon1.fasta

## Correction 2 (optional)
minimap/minimap2 -t 8 reads.racon1.fasta reads.fq > reads.gfa2.paf
racon -t 8 reads.fq reads.gfa2.paf reads.racon1.fasta reads.racon2.fasta

In practice

OS environment: CentOS6.6 86_64 glibc-2.12. QSUB grid system. 15 Fat nodes (2TB RAM, 40 CPU) and 10 Blade nodes (156G RAM, 24 CPU).
minimap2 version: 2.8-r672
miniasm version: 0.2-r168-dirty
Racon version: 0.5.0

run 1, with about 50X data

commands:

# run1, 171224 third data
$TOOLDIR/minimap2-2.8_x64-linux/minimap2 -t $PPN -x ava-pb $DATADIR/171224.fasta $DATADIR/171224.fasta | gzip -1 > reads.paf.gz
/home/zhangll/software/minimap/miniasm/miniasm -f $DATADIR/171224.fasta reads.paf.gz > reads.gfa

# racon
awk '$1 ~/S/ {print ">"$2"\n"$3}' reads.gfa > reads.fasta
$TOOLDIR/minimap2-2.8_x64-linux/minimap2 -t $PPN reads.fasta $DATADIR/171224.fasta | /home/zhangll/software/racon/bin/racon -t $PPN $DATADIR/171224.fastq - reads.gfa racon1.fasta

stats:

Size_includeN	1475310871
Size_withoutN	1475310871
Seq_Num	20186
Mean_Size	73085
Median_Size	52525
Longest_Seq	1183822
Shortest_Seq	689
GC_Content	31.66
N50	97955
L50	4346
N90	33406
Gap	0.0

run2, with about 70X data

Didn’t run Racon.

commands:

1
2
3

# run2, all thid data
$TOOLDIR/minimap2-2.8_x64-linux/minimap2 -t $PPN -x ava-pb $DATADIR/third_all.fasta $DATADIR/third_all.fasta | gzip -1 > reads.paf.gz
/home/zhangll/software/minimap/miniasm/miniasm -f $DATADIR/third_all.fasta reads.paf.gz > reads.gfa

stats:

Size_includeN	1592482491
Size_withoutN	1592482491
Seq_Num	23120
Mean_Size	68879
Median_Size	49376
Longest_Seq	1002424
Shortest_Seq	689
GC_Content	33.1
N50	92250
L50	5029
N90	31666
Gap	0.0

The assembling size were larger than the estimated genome size (~850M) in both runs. But this pipeline is very fast.

Vaser R, Sovic I, Nagarajan N, Sikic M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research. 2017 Jan 18:gr.214270.116. doi:10.1101/gr.214270.116 ↩︎
Li H. Minimap2: versatile pairwise alignment for nucleotide sequences. arXiv:1708.01492 [q-bio]. 2017 Aug 4 [accessed 2018 Jan 10]. http://arxiv.org/abs/1708.01492 ↩︎
Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–2110. doi:10.1093/bioinformatics/btw152 ↩︎