Intro
From its Git Repo
LINKS is a scalable genomics application for scaffolding or re-scaffolding genome assembly drafts with long reads, such as those produced by Oxford Nanopore Technologies Ltd and Pacific Biosciences. It provides a generic alignment-free framework for scaffolding and can work on any sequences. It is versatile and supports not only long sequences as a source of long-range information, but also MPET pairs and linked-reads, such as those from the 10X Genomics GemCode and Chromium platform, via ARCS (http://www.bcgsc.ca/platform/bioinfo/software/arcs). Fill gaps in LINKS-derived scaffolds using Sealer (http://www.bcgsc.ca/platform/bioinfo/software/sealer).
Its paper:
Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015;4:35. doi:10.1186/s13742-015-0076-3
My feelings:
- easy to use
- support scaffolding and re-scaffolding
- fast
- comparatively good results (at least in my case, compared to
BESST
andOPERA-LG
) - huge RAM consumption
General usage
Install
Dowload the latest version from its release page.
1 | decompress |
Parameter setting
Here gives a general description of the way LINKS
works.
Setting the parameters is crucial for scaffolding. Below I summarize some tips or practices when setting these parameters.
- scaffolding control
-a
: the maximum ratio between the best two contig pairs for a given seed/contig being extended (default: 0.3).-l
: the minimum number of links (read pairs) a valid contig pair MUST have to be considered (default: 5).-
For example, contig A shares 4 links with B and 2 links with C, in this orientation. contig rA (reverse) also shares 3 links with D. When it’s time to extend contig A (with the options -l and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered. Since C (second-best) has 2 links and B (best) has 4 (2/4) = 0.5 below the maximum ratio of 0.7, A will be linked with B in the scaffold and C will be kept for another extension. If AC had 3 links the resulting ratio (0.75), above the user-defined maximum 0.7 would have caused the extension to terminate at A, with both B and C considered for a different scaffold. A maximum links ratio of 1 (not recommended) means that the best two candidate contig pairs have the same number of links – LINKS will accept the first one since both have a valid gap/overlap. When a scaffold extension is terminated on one side, the scaffold is extended on the “left”, by looking for contig pairs that involve the reverse of the seed (in this example, rAD). With AB and AC having 4 and 2 links, respectively and rAD being the only pair on the left, the final scaffolds outputted by LINKS would be: rD-A-B and C.
- k-mer length:
-k
-k
: k-mer value (default 15).- LINKS is a k-mer scaffolder, and the
-k
parameter controls the k-mer length. - Exploration of vast kmer space is expected to yield better scaffolding results.
- You may increase
-k
to21
while working with pacbio reads. - I also recommend correcting ONT reads if you can, it will allow you to choose higher k values and increase the specificity.
- The sweet spot will be somewhere k 15-19 for 2Gb genome (assuming raw ONT reads).
- k-mer pairs extraction:
-d
,-e
and-t
. And-t
and-d
are important for memory usage.
-e
: error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default: 0.1)- In theory, the -e parameter will play an important role limiting linkages outside of the target range -d (+/-) -e %. This is especially true when using raw MPET for scaffolding, to limit spurious linkages by contaminating PETs.
-d
: distance between the 5’-end of each pairs (default: 4000)-t
: sliding window when extracting k-mer pairs from long reads (default: 2)- Because you want want to start with a low -d for scaffolding, you have to estimate how many minimum links (-l) would fit in a -d window +/- error -e given sliding window -t. For instance, it may not make sense to use -t 200, -d 500 at low coverages BUT if you have at least 10-fold coverage it might since, in principle, you should be able to derive sufficient k-mer pairs within same locus if there’s no bias in genome sequencing.
- On the data side of things, reducing the coverage (using less long reads), and limiting to only the highest quality reads would help decrease RAM usage.
- WARNING: Specifying many distances will require large amount of RAM, especially with low -t values.
- As
-d
increases,-t
must decrease (otherwise you’ll end up with too few pairs for scaffolding over larger kmer distances).
An example
The power of LINKS
is in scaffolding using various distance constraints, iteratively.
- Running links multiple times (when working with large long read dataset / genomes are very big)
1 | nohup ./runIterativeLINKS.sh 17 beluga.fa 5 0.3 & |
- Running links iteratively in a single command (works best when genomes are small/RAM not limiting)
1 | ./LINKS -f $2 -s ont.fof -b links1 -d 1000,5000,7500,10000,12500,15000,30000 -t 10,5,5,4,4,3,3,2 -k $1 -l $3 -a $4 |
- the value of
-t
will determine the #kmer pairs extracted and incidentally the RAM used. this will vary from one dataset to the next.
The example was from: working with 2 Gb genome; stuck on bloom filter being built.
It seems LINKS
needs huge RAM, and you may also want to try lrscaf, another long reads scaffolding tool.
In practice
background
- An insect
- The species: high heterogeneity, high AT, high repetition.
- Genome size: male 790M, female 830M.
data
The Illumina data I used:
Source | Insert size (bp) | Avg. read size (bp) | Raw bases (G) | Raw reads (M) | Sequencing depth |
---|---|---|---|---|---|
AV1, M | 270 | 150 | 44.1 | 293.6 | 55.5 |
AV2, F | 500 | 150 | 24.4 | 162.8 | 29.4 |
AV2, F | 800 | 150 | 15.8 | 105.4 | 19.0 |
AV2, F | 3k | 114 | 10.4 | 91.8 | 12.5 |
AV2, F | 5k | 150 | 29.8 | 198.7 | 35.9 |
AV2, F | 5k | 114 | 11.5 | 101.2 | 13.8 |
AV2, F | 10k | 150 | 17.5 | 116.8 | 21.1 |
Total | - | - | 153.5 | 1070.3 | 187.3 |
And the PacBio data:
Source | Raw bases (G) | Raw reads (M) | Sequencing depth | Avg.read size (bp) | N50 (bp) | N90 (bp) | Note |
---|---|---|---|---|---|---|---|
AV3, F | 15.2 | 20.1 | 18.31 | 7550 | 10046 | 4558 | 20170111 |
AV4, F | 45.2 | 4.6 | 54.46 | 9798 | 17348 | 5702 | 20171224 |
Total | 60.4 | 24.7 | 72.77 | 9115 | 14630 | 5310 | - |
Because we didn’t receive the data at the same time (first all Illumina data, then 20X PacBio data, finally another 50X PacBio data), we tried many different assembly strategies: Illumina-dominant, and PacBio-dominant. We found ‘PacBio-dominant pipelines’ produced significantly good results, so we gradually gave up other pipelines and focused on exploring PacBio assemblers.
This note is one of the attempts of ‘Illumina-dominant pipelines’, and I ran LINKS as a scaffoling tool, after runing MEGAHIT-SOAPdenovo successfully. I’ve tried so many with MEGAHIT, and I ran LINKS after every attempt. See Genome Assembly Pipeline: MEGAHIT & SOAPdenovo-fusion. I put two representative results here as a memo.
PS: I encountered a problem and solved with the help of the author: LINKS termineted with no error message when I used hybrid reads to scaffold.
The -t
and -d
are important for RAM consumption. I’ve tried -t
with 5, 10, 20
, and -t 20
works. -d
didn’t matter in my case, so I just set lots of -d
.
Scripts I used and the stats I got:
1 |
|
And the input data:
1 | # first batch PB data, about 20X |
The stats I got:
re-scaffloding the output of MEGAHIT small insert size and SOAP-fusion, use ~20X PB data:
MEGAHIT | SOAP-fusion | LINKS | |
---|---|---|---|
Size_includeN | 582099585 | 765246417 | 776147223 |
Size_withoutN | 582099585 | 574675978 | 574675978 |
Seq_Num | 575808 | 377325 | 370033 |
Mean_Size | 1010 | 2028 | 2097 |
Median_Size | 433 | 393 | 389 |
Longest_Seq | 55064 | 436254 | 621272 |
Shortest_Seq | 200 | 200 | 200 |
GC_Content | 31.43 | 31.42 | 31.42 |
N50 | 2158 | 33478 | 42519 |
N90 | 358 | 439 | 444 |
Gap | 0 | 24.9 | 25.96 |
re-scaffolding the output of MEGAHIT small insert size and SOAP-fusion, use ~70X PB data.
MEGAHIT | SOAP-fusion | LINKS | |
---|---|---|---|
Size_includeN | 582099585 | 765246417 | 798803587 |
Size_withoutN | 582099585 | 574675978 | 574675978 |
Seq_Num | 575808 | 377325 | 359653 |
Mean_Size | 1010 | 2028 | 2221 |
Median_Size | 433 | 393 | 384 |
Longest_Seq | 55064 | 436254 | 732454 |
Shortest_Seq | 200 | 200 | 200 |
GC_Content | 31.43 | 31.42 | 31.42 |
N50 | 2158 | 33478 | 68195 |
N90 | 358 | 439 | 453 |
Gap | 0 | 24.9 | 28.06 |
The last run took about 27 hours
.
As can be seen, adding 50X PacBio data did not help a lot. After this, we started to use ‘PacBio-dominant pipelines’ and have tried many different assemblers. But, LINKS is a good scaffolding tool if you assemble the genome with illumina data and want to further scaffold with <20X long-reads.
This note can serve as a reference in case I will have to use it again.
Change log
- 20180307: createt the note.
- 20180709: update the ‘background information’.
- 20180817: change the ‘General usage’ part, add contents about ‘parameter setting’.