Genome Assembly Pipeline: LINKS

Intro

From its Git Repo

LINKS is a scalable genomics application for scaffolding or re-scaffolding genome assembly drafts with long reads, such as those produced by Oxford Nanopore Technologies Ltd and Pacific Biosciences. It provides a generic alignment-free framework for scaffolding and can work on any sequences. It is versatile and supports not only long sequences as a source of long-range information, but also MPET pairs and linked-reads, such as those from the 10X Genomics GemCode and Chromium platform, via ARCS (http://www.bcgsc.ca/platform/bioinfo/software/arcs). Fill gaps in LINKS-derived scaffolds using Sealer (http://www.bcgsc.ca/platform/bioinfo/software/sealer).

Its paper:

Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4:35. doi:10.1186/s13742-015-0076-3

My feelings:

  • easy to use
  • support scaffolding and re-scaffolding
  • fast
  • comparatively good results (at least in my case, compared to BESST and OPERA-LG)
  • huge RAM consumption

General usage

Install

Download the latest version from its release page.

# decompress
tar -zxvf links_v1-8-6.tar.gz; cd links_v1.8.6

# build BloomFilter PERL module
cd lib/bloomfilter/swig
swig -Wall -c++ -perl5 BloomFilter.i
g++ -c BloomFilter_wrap.cxx -I/usr/lib64/perl5/CORE -fPIC -Dbool=char -O3
g++ -Wall -shared BloomFilter_wrap.o -o BloomFilter.so -O3

Parameter setting

Here is a general description of how LINKS works.

Setting the parameters is crucial for scaffolding. Below I summarize some tips and practices for setting these parameters.

  1. scaffolding control
  • -a: the maximum ratio between the best two contig pairs for a given seed/contig being extended (default: 0.3).
  • -l: the minimum number of links (k-mer pairs) a valid contig pair MUST have to be considered (default: 5).
  • For example, suppose contig A shares 4 links with B and 2 links with C in this orientation, and contig rA (the reverse of A) shares 3 links with D. When it is time to extend contig A (with -l and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered. Since C (second best, 2 links) versus B (best, 4 links) gives a ratio of 2/4 = 0.5, below the maximum of 0.7, A is linked with B in the scaffold and C is kept for another extension. If AC had 3 links, the resulting ratio (0.75), above the user-defined maximum of 0.7, would have caused the extension to terminate at A, with both B and C considered for a different scaffold. A maximum links ratio of 1 (not recommended) means the best two candidate contig pairs may have the same number of links; LINKS will accept the first one as long as both have a valid gap/overlap. When a scaffold extension is terminated on one side, the scaffold is extended on the "left" by looking for contig pairs that involve the reverse of the seed (in this example, rAD). With AB and AC having 4 and 2 links, respectively, and rAD being the only pair on the left, the final scaffolds output by LINKS would be rD-A-B and C.

  2. k-mer length: -k
  • -k: k-mer value (default 15).
  • LINKS is a k-mer scaffolder, and the -k parameter controls the k-mer length.
  • Exploring a vast k-mer space is expected to yield better scaffolding results.
  • You may increase -k to 21 when working with PacBio reads.
  • I also recommend correcting ONT reads if you can; it will allow you to choose higher k values and increase specificity.
  • The sweet spot will be somewhere around k 15-19 for a 2 Gb genome (assuming raw ONT reads).
  3. k-mer pair extraction: -d, -e and -t. Note that -t and -d are important for memory usage.
  • -e: error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default: 0.1)
    • In theory, the -e parameter will play an important role limiting linkages outside of the target range -d (+/-) -e %. This is especially true when using raw MPET for scaffolding, to limit spurious linkages by contaminating PETs.
  • -d: distance between the 5' ends of each k-mer pair (default: 4000)
  • -t: sliding window when extracting k-mer pairs from long reads (default: 2)
  • Because you want to start with a low -d for scaffolding, you have to estimate how many minimum links (-l) would fit in a -d window (+/- error -e) given sliding window -t. For instance, it may not make sense to use -t 200 with -d 500 at low coverage, BUT if you have at least 10-fold coverage it might, since in principle you should be able to derive sufficient k-mer pairs within the same locus if there is no bias in genome sequencing.
  • On the data side of things, reducing the coverage (using fewer long reads) and limiting input to only the highest-quality reads will help decrease RAM usage.
  • WARNING: specifying many distances will require a large amount of RAM, especially with low -t values.
  • As -d increases, -t must decrease (otherwise you’ll end up with too few k-mer pairs for scaffolding over larger distances).
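The -l/-a rule can be sketched in a few lines of shell (my own illustration of the logic described in the A/B/C example above, not code from LINKS):

```shell
# decide() mimics the rule: drop pairs below -l, then compare the best two.
decide() {  # args: best_links second_links min_links(-l) max_ratio(-a)
  awk -v best="$1" -v second="$2" -v l="$3" -v a="$4" 'BEGIN {
    if (best < l)                print "no valid pair: terminate extension"
    else if (second < l)         print "extend with best pair"
    else if (second / best <= a) print "extend with best pair; keep second for later"
    else                         print "ambiguous: terminate extension"
  }'
}
decide 4 2 2 0.7   # A-B vs A-C: ratio 0.5 <= 0.7, so A is linked with B
decide 4 3 2 0.7   # if A-C had 3 links: ratio 0.75 > 0.7, extension stops at A
```

Raising -a toward 1 makes the third branch fire more often; raising -l makes the first two fire more often.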

An example

The power of LINKS is in scaffolding using various distance constraints, iteratively.

  1. Running LINKS multiple times (best when the long-read dataset is large or the genome is very big)
# usage: runIterativeLINKS.sh <k> <draft assembly fasta> <-l value> <-a value>
nohup ./runIterativeLINKS.sh 17 beluga.fa 5 0.3 &

# contents of runIterativeLINKS.sh: each round scaffolds the previous round's
# output with a larger -d (and smaller -t), reusing the Bloom filter built in
# round 1 (-r links1.bloom) and offsetting k-mer pair extraction with -o
./LINKS -f $2 -s ont.fof -b links1 -d 1000 -t 10 -k $1 -l $3 -a $4
./LINKS -f links1.scaffolds.fa -s ont.fof -b links2 -d 2500 -t 5 -k $1 -l $3 -a $4 -o 1 -r links1.bloom
./LINKS -f links2.scaffolds.fa -s ont.fof -b links3 -d 5000 -t 5 -k $1 -l $3 -a $4 -o 2 -r links1.bloom
./LINKS -f links3.scaffolds.fa -s ont.fof -b links4 -d 7500 -t 4 -k $1 -l $3 -a $4 -o 3 -r links1.bloom
./LINKS -f links4.scaffolds.fa -s ont.fof -b links5 -d 10000 -t 4 -k $1 -l $3 -a $4 -o 4 -r links1.bloom
./LINKS -f links5.scaffolds.fa -s ont.fof -b links6 -d 12500 -t 3 -k $1 -l $3 -a $4 -o 5 -r links1.bloom
./LINKS -f links6.scaffolds.fa -s ont.fof -b links7 -d 15000 -t 3 -k $1 -l $3 -a $4 -o 6 -r links1.bloom
./LINKS -f links7.scaffolds.fa -s ont.fof -b links8 -d 30000 -t 2 -k $1 -l $3 -a $4 -o 7 -r links1.bloom
  2. Running LINKS iteratively in a single command (works best when genomes are small / RAM is not limiting)
./LINKS -f $2 -s ont.fof -b links1 -d 1000,2500,5000,7500,10000,12500,15000,30000 -t 10,5,5,4,4,3,3,2 -k $1 -l $3 -a $4
  • The value of -t determines the number of k-mer pairs extracted and, incidentally, the RAM used; this will vary from one dataset to the next.
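As a rough aid for choosing -l, -d and -t together, here is a back-of-envelope estimate (my own arithmetic, not from the LINKS documentation) of how many k-mer pairs a single long read can contribute:

```shell
# positions where a k-mer pair spaced d apart still fits in the read,
# sampled every t bases
kmer_pairs() {  # args: read_length k d t
  awk -v readlen="$1" -v k="$2" -v d="$3" -v t="$4" 'BEGIN {
    span = readlen - d - k
    n = (span < 0) ? 0 : int(span / t) + 1
    print n
  }'
}
kmer_pairs 10000 15 500 200     # the -t 200 / -d 500 case: prints 48
kmer_pairs 10000 15 30000 2     # -d longer than the read: prints 0
```

Multiply by the read coverage of a locus to judge whether the -l threshold is realistically reachable at a given -d.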

The example is from the issue thread “working with 2 Gb genome; stuck on bloom filter being built”.

It seems LINKS needs a huge amount of RAM, so you may also want to try lrscaf, another long-read scaffolding tool.

In practice

background

  • An insect
  • The species: high heterozygosity, high AT content, highly repetitive.
  • Genome size: male 790 Mb, female 830 Mb.

data

The Illumina data I used:

| Source | Insert size (bp) | Avg. read size (bp) | Raw bases (G) | Raw reads (M) | Sequencing depth |
|--------|------------------|---------------------|---------------|---------------|------------------|
| AV1, M | 270 | 150 | 44.1 | 293.6 | 55.5 |
| AV2, F | 500 | 150 | 24.4 | 162.8 | 29.4 |
| AV2, F | 800 | 150 | 15.8 | 105.4 | 19.0 |
| AV2, F | 3k | 114 | 10.4 | 91.8 | 12.5 |
| AV2, F | 5k | 150 | 29.8 | 198.7 | 35.9 |
| AV2, F | 5k | 114 | 11.5 | 101.2 | 13.8 |
| AV2, F | 10k | 150 | 17.5 | 116.8 | 21.1 |
| Total | - | - | 153.5 | 1070.3 | 187.3 |

And the PacBio data:

| Source | Raw bases (G) | Raw reads (M) | Sequencing depth | Avg. read size (bp) | N50 (bp) | N90 (bp) | Note |
|--------|---------------|---------------|------------------|---------------------|----------|----------|------|
| AV3, F | 15.2 | 20.1 | 18.31 | 7550 | 10046 | 4558 | 20170111 |
| AV4, F | 45.2 | 4.6 | 54.46 | 9798 | 17348 | 5702 | 20171224 |
| Total | 60.4 | 24.7 | 72.77 | 9115 | 14630 | 5310 | - |
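As a sanity check, the depth column is simply raw bases divided by genome size (I assume the ~0.83 Gb female genome size from the background section):

```shell
# depth = raw bases (Gb) / genome size (Gb)
awk 'BEGIN {
  printf "%.2f\n", 15.2 / 0.83   # AV3 batch  -> 18.31
  printf "%.2f\n", 45.2 / 0.83   # AV4 batch  -> 54.46
  printf "%.2f\n", 60.4 / 0.83   # combined   -> 72.77
}'
```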

Because we didn’t receive the data at the same time (first all the Illumina data, then 20X PacBio data, and finally another 50X PacBio data), we tried many different assembly strategies, both Illumina-dominant and PacBio-dominant. We found that the PacBio-dominant pipelines produced noticeably better results, so we gradually gave up the other pipelines and focused on exploring PacBio assemblers.

This note is one of the attempts at an ‘Illumina-dominant pipeline’: I ran LINKS as a scaffolding tool after successfully running MEGAHIT and SOAPdenovo-fusion. I tried many MEGAHIT runs and ran LINKS after every attempt. See Genome Assembly Pipeline: MEGAHIT & SOAPdenovo-fusion. I put two representative results here as a memo.

PS: I encountered a problem and solved it with the author’s help: LINKS terminated with no error message when I used hybrid reads for scaffolding.

The -t and -d options are important for RAM consumption. I tried -t with 5, 10 and 20, and -t 20 worked. -d didn’t matter in my case, so I just set many -d values.

Scripts I used and the stats I got:


LINKS_HOME=/software/links_v1.8.5
WORK_DIR=/DenovoSeq

echo Start time is `date +%Y/%m/%d--%H:%M`

# re-scaffold the output of MEGAHIT (small insert size) and SOAP-fusion, using about 20X data
#$LINKS_HOME/LINKS -k 21 -f $WORK_DIR/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq -s Pacbio.fof -b reLINKS2 -t 20 -d 4000,5000,6000,8000,10000,15000,20000

# re-scaffold the output of MEGAHIT small insert size and SOAP-fusion, use about 70X data
$LINKS_HOME/LINKS -k 21 -f $WORK_DIR/MEGAHIT/small_insert.no270/SOAP-fusion/k41.scafSeq -s Pacbio_171231.fof -b reLINKS_171231 -t 20 -d 4000,5000,6000,8000,10000,15000,20000

echo Finish time is `date +%Y/%m/%d--%H:%M`

And the input data:

## first batch PB data, about 20X
$ cat Pacbio.fof
/DenovoSeq/Third_rawData/av_20k.fasta

## all PB data, about 70X
$ cat Pacbio_171231.fof
/DenovoSeq/Third_rawData/third_all.fasta

The stats I got:

Re-scaffolding the output of MEGAHIT (small insert size) and SOAP-fusion, using ~20X PB data:

|               | MEGAHIT   | SOAP-fusion | LINKS     |
|---------------|-----------|-------------|-----------|
| Size_includeN | 582099585 | 765246417   | 776147223 |
| Size_withoutN | 582099585 | 574675978   | 574675978 |
| Seq_Num       | 575808    | 377325      | 370033    |
| Mean_Size     | 1010      | 2028        | 2097      |
| Median_Size   | 433       | 393         | 389       |
| Longest_Seq   | 55064     | 436254      | 621272    |
| Shortest_Seq  | 200       | 200         | 200       |
| GC_Content    | 31.43     | 31.42       | 31.42     |
| N50           | 2158      | 33478       | 42519     |
| N90           | 358       | 439         | 444       |
| Gap (%)       | 0         | 24.9        | 25.96     |
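The Gap row appears to be the percentage of Ns; it can be recomputed from the two size rows (my reading of the table, not a documented definition):

```shell
# gap % = (Size_includeN - Size_withoutN) / Size_includeN * 100
awk 'BEGIN {
  printf "%.2f\n", (765246417 - 574675978) / 765246417 * 100   # SOAP-fusion -> 24.90
  printf "%.2f\n", (776147223 - 574675978) / 776147223 * 100   # LINKS       -> 25.96
}'
```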

Re-scaffolding the output of MEGAHIT (small insert size) and SOAP-fusion, using ~70X PB data:

|               | MEGAHIT   | SOAP-fusion | LINKS     |
|---------------|-----------|-------------|-----------|
| Size_includeN | 582099585 | 765246417   | 798803587 |
| Size_withoutN | 582099585 | 574675978   | 574675978 |
| Seq_Num       | 575808    | 377325      | 359653    |
| Mean_Size     | 1010      | 2028        | 2221      |
| Median_Size   | 433       | 393         | 384       |
| Longest_Seq   | 55064     | 436254      | 732454    |
| Shortest_Seq  | 200       | 200         | 200       |
| GC_Content    | 31.43     | 31.42       | 31.42     |
| N50           | 2158      | 33478       | 68195     |
| N90           | 358       | 439         | 453       |
| Gap (%)       | 0         | 24.9        | 28.06     |

The last run took about 27 hours.
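For reference, stats like the N50 values above can be computed from a scaffolds FASTA with plain awk and sort (a minimal sketch of my own, not the tool that produced the tables):

```shell
# toy FASTA with sequences of length 10, 5 and 3
printf '>s1\nAAAAAAAAAA\n>s2\nAAAAA\n>s3\nAAA\n' > toy.fa

# collect per-sequence lengths, sort descending, then report the length at
# which the cumulative sum first reaches half of the total assembly size
awk '/^>/ {if (len) print len; len = 0; next} {len += length($0)} END {if (len) print len}' toy.fa |
  sort -rn |
  awk '{a[NR] = $1; total += $1}
       END {
         half = total / 2; run = 0
         for (i = 1; i <= NR; i++) { run += a[i]; if (run >= half) { print "N50=" a[i]; exit } }
       }'
# prints N50=10 (total 18, half 9; the 10 bp sequence alone reaches it)
```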

As can be seen, adding another 50X of PacBio data did not help much. After this we switched to ‘PacBio-dominant pipelines’ and tried many different assemblers. Still, LINKS is a good scaffolding tool if you assemble a genome with Illumina data and want to scaffold it further with <20X long reads.

This note can serve as a reference in case I will have to use it again.

Change log

  • 20180307: created the note.
  • 20180709: updated the ‘background information’.
  • 20180817: changed the ‘General usage’ part; added content about ‘parameter setting’.