Genome Assembly Pipeline: LINKS

Intro

LINKS is a scalable genomics application for scaffolding or re-scaffolding genome assembly drafts with long reads, such as those produced by Oxford Nanopore Technologies Ltd and Pacific Biosciences. It provides a generic alignment-free framework for scaffolding and can work on any sequences. It is versatile and supports not only long sequences as a source of long-range information, but also MPET pairs and linked-reads, such as those from the 10X Genomics GemCode and Chromium platform, via ARCS (http://www.bcgsc.ca/platform/bioinfo/software/arcs). Fill gaps in LINKS-derived scaffolds using Sealer (http://www.bcgsc.ca/platform/bioinfo/software/sealer).

Its paper:

Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4:35. doi:10.1186/s13742-015-0076-3

My feelings:

easy to use
support scaffolding and re-scaffolding
fast
comparatively good results (at least in my case, compared to BESST and OPERA-LG)
huge RAM consumption

General usage

Install

Dowload the latest version from its release page.

# decompress
tar -zxvf links_v1-8-6.tar.gz; cd links_v1.8.6

# build BloomFilter PERL module
cd lib/bloomfilter/swig
swig -Wall -c++ -perl5 BloomFilter.i
g++ -c BloomFilter_wrap.cxx -I/usr/lib64/perl5/CORE -fPIC -Dbool=char -O3
g++ -Wall -shared BloomFilter_wrap.o -o BloomFilter.so -O3

Parameter setting

Here gives a general description of the way LINKS works.

Setting the parameters is crucial for scaffolding. Below I summarize some tips or practices when setting these parameters.

scaffolding control

-a: the maximum ratio between the best two contig pairs for a given seed/contig being extended (default: 0.3).
-l: the minimum number of links (read pairs) a valid contig pair MUST have to be considered (default: 5).
For example, contig A shares 4 links with B and 2 links with C, in this orientation. contig rA (reverse) also shares 3 links with D. When it’s time to extend contig A (with the options -l and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered. Since C (second-best) has 2 links and B (best) has 4 (2/4) = 0.5 below the maximum ratio of 0.7, A will be linked with B in the scaffold and C will be kept for another extension. If AC had 3 links the resulting ratio (0.75), above the user-defined maximum 0.7 would have caused the extension to terminate at A, with both B and C considered for a different scaffold. A maximum links ratio of 1 (not recommended) means that the best two candidate contig pairs have the same number of links – LINKS will accept the first one since both have a valid gap/overlap. When a scaffold extension is terminated on one side, the scaffold is extended on the “left”, by looking for contig pairs that involve the reverse of the seed (in this example, rAD). With AB and AC having 4 and 2 links, respectively and rAD being the only pair on the left, the final scaffolds outputted by LINKS would be: rD-A-B and C.

k-mer length: -k

-k: k-mer value (default 15).
LINKS is a k-mer scaffolder, and the -k parameter controls the k-mer length.
Exploration of vast kmer space is expected to yield better scaffolding results.
You may increase -k to 21 while working with pacbio reads.
I also recommend correcting ONT reads if you can, it will allow you to choose higher k values and increase the specificity.
The sweet spot will be somewhere k 15-19 for 2Gb genome (assuming raw ONT reads).

k-mer pairs extraction: -d, -e and -t. And -t and -d are important for memory usage.

-e: error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default: 0.1)
- In theory, the -e parameter will play an important role limiting linkages outside of the target range -d (+/-) -e %. This is especially true when using raw MPET for scaffolding, to limit spurious linkages by contaminating PETs.
-d: distance between the 5’-end of each pairs (default: 4000)
-t: sliding window when extracting k-mer pairs from long reads (default: 2)
Because you want want to start with a low -d for scaffolding, you have to estimate how many minimum links (-l) would fit in a -d window +/- error -e given sliding window -t. For instance, it may not make sense to use -t 200, -d 500 at low coverages BUT if you have at least 10-fold coverage it might since, in principle, you should be able to derive sufficient k-mer pairs within same locus if there’s no bias in genome sequencing.
On the data side of things, reducing the coverage (using less long reads), and limiting to only the highest quality reads would help decrease RAM usage.
WARNING: Specifying many distances will require large amount of RAM, especially with low -t values.
As -d increases, -t must decrease (otherwise you’ll end up with too few pairs for scaffolding over larger kmer distances).

An example

The power of LINKS is in scaffolding using various distance constraints, iteratively.

Running links multiple times (when working with large long read dataset / genomes are very big)

nohup ./runIterativeLINKS.sh 17 beluga.fa 5 0.3 &
./LINKS -f $2 -s ont.fof -b links1 -d 1000 -t 10 -k $1 -l $3 -a $4
./LINKS -f links1.scaffolds.fa -s ont.fof -b links2 -d 2500 -t 5 -k $1 -l $3 -a $4 -o 1 -r links1.bloom
./LINKS -f links2.scaffolds.fa -s ont.fof -b links3 -d 5000 -t 5 -k $1 -l $3 -a $4 -o 2 -r links1.bloom
./LINKS -f links3.scaffolds.fa -s ont.fof -b links4 -d 7500 -t 4 -k $1 -l $3 -a $4 -o 3 -r links1.bloom
./LINKS -f links4.scaffolds.fa -s ont.fof -b links5 -d 10000 -t 4 -k $1 -l $3 -a $4 -o 4 -r links1.bloom
./LINKS -f links5.scaffolds.fa -s ont.fof -b links6 -d 12500 -t 3 -k $1 -l $3 -a $4 -o 5 -r links1.bloom
./LINKS -f links6.scaffolds.fa -s ont.fof -b links7 -d 15000 -t 3 -k $1 -l $3 -a $4 -o 6 -r links1.bloom
./LINKS -f links7.scaffolds.fa -s ont.fof -b links8 -d 30000 -t 2 -k $1 -l $3 -a $4 -o 7 -r links1.bloom

Running links iteratively in a single command (works best when genomes are small/RAM not limiting)

1	./LINKS -f $2 -s ont.fof -b links1 -d 1000,5000,7500,10000,12500,15000,30000 -t 10,5,5,4,4,3,3,2 -k $1 -l $3 -a $4

the value of -t will determine the #kmer pairs extracted and incidentally the RAM used. this will vary from one dataset to the next.

The example was from: working with 2 Gb genome; stuck on bloom filter being built.

It seems LINKS needs huge RAM, and you may also want to try lrscaf, another long reads scaffolding tool.

In practice

background

An insect
The species: high heterogeneity, high AT, high repetition.
Genome size: male 790M, female 830M.

data

The Illumina data I used:

Source	Insert size (bp)	Avg. read size (bp)	Raw bases (G)	Raw reads (M)	Sequencing depth
AV1, M	270	150	44.1	293.6	55.5
AV2, F	500	150	24.4	162.8	29.4
AV2, F	800	150	15.8	105.4	19.0
AV2, F	3k	114	10.4	91.8	12.5
AV2, F	5k	150	29.8	198.7	35.9
AV2, F	5k	114	11.5	101.2	13.8
AV2, F	10k	150	17.5	116.8	21.1
Total	-	-	153.5	1070.3	187.3

And the PacBio data:

Source	Raw bases (G)	Raw reads (M)	Sequencing depth	Avg.read size (bp)	N50 (bp)	N90 (bp)	Note
AV3, F	15.2	20.1	18.31	7550	10046	4558	20170111
AV4, F	45.2	4.6	54.46	9798	17348	5702	20171224
Total	60.4	24.7	72.77	9115	14630	5310	-

Because we didn’t receive the data at the same time (first all Illumina data, then 20X PacBio data, finally another 50X PacBio data), we tried many different assembly strategies: Illumina-dominant, and PacBio-dominant. We found ‘PacBio-dominant pipelines’ produced significantly good results, so we gradually gave up other pipelines and focused on exploring PacBio assemblers.

This note is one of the attempts of ‘Illumina-dominant pipelines’, and I ran LINKS as a scaffoling tool, after runing MEGAHIT-SOAPdenovo successfully. I’ve tried so many with MEGAHIT, and I ran LINKS after every attempt. See Genome Assembly Pipeline: MEGAHIT & SOAPdenovo-fusion. I put two representative results here as a memo.

PS: I encountered a problem and solved with the help of the author: LINKS termineted with no error message when I used hybrid reads to scaffold.

The -t and -d are important for RAM consumption. I’ve tried -t with 5, 10, 20, and -t 20 works. -d didn’t matter in my case, so I just set lots of -d.

Scripts I used and the stats I got:


LINKS_HOME=/software/links_v1.8.5
WORK_DIR=/DenovoSeq

echo Start time is `date +%Y/%m/%d--%H:%M`

# re-scafflod the output of MEGAHIT small insert size and SOAP-fusion. use about 50X data
#$LINKS_HOME/LINKS -k 21 -f $WORK_DIR/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq -s Pacbio.fof -b reLINKS2 -t 20 -d 4000,5000,6000,8000,10000,15000,20000

# re-scaffold the output of MEGAHIT small insert size and SOAP-fusion, use about 70X data
$LINKS_HOME/LINKS -k 21 -f $WORK_DIR/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq -s Pacbio_171231.fof -b reLINKS_171231 -t 20 -d 4000,5000,6000,8000,10000,15000,20000

echo Finish time is `date +%Y/%m/%d--%H:%M`

And the input data:

## first batch PB data, about 20X
$ cat Pacbio.fof
/DenovoSeq/Third_rawData/av_20k.fasta

## all PB data, about 70X
$ cat Pacbio_171231.fof
/DenovoSeq/Third_rawData/third_all.fasta

The stats I got:

re-scaffloding the output of MEGAHIT small insert size and SOAP-fusion, use ~20X PB data:

	MEGAHIT	SOAP-fusion	LINKS
Size_includeN	582099585	765246417	776147223
Size_withoutN	582099585	574675978	574675978
Seq_Num	575808	377325	370033
Mean_Size	1010	2028	2097
Median_Size	433	393	389
Longest_Seq	55064	436254	621272
Shortest_Seq	200	200	200
GC_Content	31.43	31.42	31.42
N50	2158	33478	42519
N90	358	439	444
Gap	0	24.9	25.96

re-scaffolding the output of MEGAHIT small insert size and SOAP-fusion, use ~70X PB data.

	MEGAHIT	SOAP-fusion	LINKS
Size_includeN	582099585	765246417	798803587
Size_withoutN	582099585	574675978	574675978
Seq_Num	575808	377325	359653
Mean_Size	1010	2028	2221
Median_Size	433	393	384
Longest_Seq	55064	436254	732454
Shortest_Seq	200	200	200
GC_Content	31.43	31.42	31.42
N50	2158	33478	68195
N90	358	439	453
Gap	0	24.9	28.06

The last run took about 27 hours.

As can be seen, adding 50X PacBio data did not help a lot. After this, we started to use ‘PacBio-dominant pipelines’ and have tried many different assemblers. But, LINKS is a good scaffolding tool if you assemble the genome with illumina data and want to further scaffold with <20X long-reads.

This note can serve as a reference in case I will have to use it again.

Change log

20180307: createt the note.
20180709: update the ‘background information’.
20180817: change the ‘General usage’ part, add contents about ‘parameter setting’.