Genome Assembly Pipeline: BESST

Intro

From the introduction in the BESST git repo:

BESST is a package for scaffolding genomic assemblies.

Its paper:

Sahlin K, Chikhi R, Arvestad L. Assembly scaffolding with PE-contaminated mate-pair libraries. Bioinformatics. 2016;32(13):1925–1932. doi:10.1093/bioinformatics/btw064

My impressions:

  • too many steps
  • awkward to use
  • only supports NGS reads
  • results were not very good (at least in my case)

In practice

The BESST git repo has full documentation.

Scripts used:

#!/bin/sh
#PBS -N BESST
#PBS -j eo
#PBS -q Test
#PBS -l nodes=1:ppn=20
#PBS -V

echo Start time is `date +%Y/%m/%d--%H:%M`

# align the PE/MP reads to contigs with BWA MEM
#for sample in 270B 500B 800B 5k-1 10k; do
#/software/bwa-0.7.15/bwa mem -t 40 /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa /DenovoSeq/raw_data/${sample}_R1.fastq /DenovoSeq/raw_data/${sample}_R2.fastq | samtools view -uS - | samtools sort -@ 8 -m 4G - -T sam_sort_tmp -o ./bwaout/${sample}.sorted.bam
#samtools index ./bwaout/${sample}.sorted.bam
#done

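# repair.sh (BBMap) re-pairs FASTQ files whose mates are out of order or missing.
# I no longer remember exactly why these two libraries needed it.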
for sample in 3k_1 5k-2; do
/software/bbmap/repair.sh in1=/DenovoSeq/raw_data/${sample}_R1.fastq in2=/DenovoSeq/raw_data/${sample}_R2.fastq out1=/DenovoSeq/raw_data/${sample}_R1.fixed.fastq out2=/DenovoSeq/raw_data/${sample}_R2.fixed.fastq
/software/bwa-0.7.15/bwa mem -t 40 /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa /DenovoSeq/raw_data/${sample}_R1.fixed.fastq /DenovoSeq/raw_data/${sample}_R2.fixed.fastq | samtools view -uS - | samtools sort -@ 8 -m 4G - -T sam_sort_tmp -o ./bwaout/${sample}.sorted.bam
samtools index ./bwaout/${sample}.sorted.bam
done

# scaffold the MEGAHIT contigs (the BAMs given to -f should be aligned to the contigs given to -c)
export PATH=/software/Python.2.7.13/bin:$PATH

/software/BESST/runBESST -plots -q -c /DenovoSeq/MEGAHIT/megahit_out/final.contigs.fa -f ./bwaout/270B.sorted.bam ./bwaout/500B.sorted.bam ./bwaout/800B.sorted.bam ./bwaout/3k_1.sorted.bam ./bwaout/5k-1.sorted.bam ./bwaout/5k-2.sorted.bam ./bwaout/10k.sorted.bam -orientation fr fr fr rf rf rf rf

/software/BESST/runBESST -plots -q -c /DenovoSeq/MEGAHIT/megahit_out.no270/final.contigs.fa -f ./bwaout/500B.sorted.bam ./bwaout/800B.sorted.bam ./bwaout/3k_1.sorted.bam ./bwaout/5k-1.sorted.bam ./bwaout/5k-2.sorted.bam ./bwaout/10k.sorted.bam -orientation fr fr rf rf rf rf -o ./no270

echo Finish time is `date +%Y/%m/%d--%H:%M`

I ran the scaffolding twice: once with and once without the ins_270 library (270B in the script above).
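In the two runBESST calls above, each BAM passed to -f has to line up with one entry in -orientation, and it is easy to drop a library from one list but not the other. Below is a minimal sketch of a helper that builds the command from "bam:orientation" pairs. The function name build_besst_cmd is hypothetical (it is not part of BESST), it only prints the command rather than running it, and it assumes fr and rf are the only orientations in use, as in the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of BESST): build a runBESST command line
# from "bam:orientation" pairs so the -f and -orientation lists stay in sync.
# Usage: build_besst_cmd CONTIGS OUTDIR BAM:ORIENT [BAM:ORIENT ...]
build_besst_cmd() {
    contigs=$1
    outdir=$2
    shift 2
    bams=""
    orients=""
    for pair in "$@"; do
        bam=${pair%:*}        # everything before the last colon
        orient=${pair##*:}    # everything after the last colon
        case $orient in
            fr|rf) ;;         # assumption: only these two values are used here
            *) echo "unknown orientation '$orient' for $bam" >&2; return 1 ;;
        esac
        bams="$bams $bam"
        orients="$orients $orient"
    done
    # Print the command instead of executing it, so it can be inspected first.
    echo "runBESST -c $contigs -f$bams -orientation$orients -o $outdir"
}
```

For example, build_besst_cmd final.contigs.fa ./no270 ./bwaout/500B.sorted.bam:fr ./bwaout/10k.sorted.bam:rf prints a command with two BAMs and two matching orientations. Dropping a library then means deleting a single argument.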

And the stats I got:

with ins_270 library:

Size_includeN: 778878802
Size_withoutN: 755401923
Seq_Num: 811939
Mean_Size: 959
Median_Size: 393
Longest_Seq: 561369
Shortest_Seq: 200
GC_Content: 32.65
N50: 2715
N90: 342
Gap: 3.01

without ins_270 library:

Size_includeN: 654690437
Size_withoutN: 626185453
Seq_Num: 618555
Mean_Size: 1058
Median_Size: 438
Longest_Seq: 327376
Shortest_Seq: 200
GC_Content: 32.31
N50: 2277
N90: 365
Gap: 4.35
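For reference, here is a rough sketch of how two of these numbers can be recomputed. It assumes that Gap is the percentage of N bases, i.e. (Size_includeN - Size_withoutN) / Size_includeN, and that N50 follows the usual definition (the length at which the sequences, sorted longest first, reach half of the total size). The function names n50 and gap_percent are my own and do not come from the stats tool used above.

```shell
#!/bin/sh
# n50: read one sequence length per line on stdin and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                # first length at which the running sum reaches half the total
                if (run * 2 >= total) { print len[i]; exit }
            }
        }'
}

# gap_percent SIZE_WITH_N SIZE_WITHOUT_N: percentage of N bases, two decimals.
gap_percent() {
    awk -v all="$1" -v non_n="$2" \
        'BEGIN { printf "%.2f\n", (all - non_n) / all * 100 }'
}
```

With the sizes listed above, gap_percent 778878802 755401923 gives 3.01 and gap_percent 654690437 626185453 gives 4.35, which matches the reported Gap values under that assumption. The lengths for n50 could come from the second column of a samtools faidx index, e.g. cut -f2 scaffolds.fa.fai | n50, where scaffolds.fa.fai is a placeholder file name.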

I did not test it exhaustively, but BESST gave worse results than SOAPdenovo on my data, so I gave up on this tool.

This note can serve as a reference in case I have to use it again…
