Calculate FRiP score

Ref: ENCODE - Terms and Definitions

Fraction of reads in peaks (FRiP) - Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)

In ENCODE - ATAC-seq Data Standards and Prototype Processing Pipeline, FRiP score is calculated using tagAlign with intersectBed, and they use bamtobed covert BAM files to tagAlign. However, intersectBed can also use BAM files to count the intersection. Also, someone argued that one should use featureCounts to get accurate results (https://www.biostars.org/p/337872/#337890).

I compared different ways to calculate FRiP scores below.

# 1. prepare
## convert BAM (BAM used to call peaks) to BED
$ bedtools bamtobed -i ${sample}.sorted.marked.filtered.shifted.bam | awk 'BEGIN{OFS="\t"}{$4="N";$5="1000";print $0}' > ${sample}.sorted.marked.filtered.shifted.PE2SE.tagAlign

# 2. total reads in BAM/BED
$ samtools view -c ${sample}.sorted.marked.filtered.shifted.bam
38439210

$ wc -l ${sample}.sorted.marked.filtered.shifted.PE2SE.tagAlign
38439210 ${sample}.sorted.marked.filtered.shifted.PE2SE.tagAlign

# 3. count reads in peak regions

## 3.1 tagAlign, intersectBed -a tagAlign -b bed
$ time bedtools sort -i ${sample}_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -u -a ${sample}.sorted.marked.filtered.shifted.PE2SE.tagAlign -b stdin | wc -l
428359

real    0m27.012s
user    0m25.726s
sys 0m1.357s

## 3.2 tagAlign, intersectBed -a bed -b tagAlign
$ time bedtools sort -i ${sample}_peaks.narrowPeak |bedtools merge -i stdin | bedtools intersect -c -a stdin -b ${sample}.sorted.marked.filtered.shifted.PE2SE.tagAlign | awk '{{ sum+=$4 }} END {{ print sum }}'
428359

real    0m51.089s
user    0m39.199s
sys 0m11.945s

## 3.3 BAM, intersectBed -a bam -b bed
$ time bedtools sort -i ${sample}_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -u -a ${sample}.sorted.marked.filtered.shifted.bam -b stdin -ubam | samtools view -c
428359

real    1m12.844s
user    1m11.979s
sys 0m0.951s

## 3.4 BAM, intersectBed -a bed -b bam
$ time bedtools sort -i ${sample}_peaks.narrowPeak | bedtools merge -i stdin | bedtools intersect -c -a stdin -b ${sample}.sorted.marked.filtered.shifted.bam | awk '{{ sum+=$4 }} END {{ print sum }}'
428359

real    1m49.981s
user    1m28.747s
sys 0m20.837s

## 3.5 featureCoutns

### covert BED (the peaks) to SAF
$ awk 'BEGIN{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $4, $1, $2+1, $3, "."}' ${sample}_peaks.narrowPeak > ${sample}_peaks.saf

### count
$ featureCounts -p -a ${sample}_peaks.saf -F SAF -o readCountInPeaks.txt ${sample}.sorted.marked.filtered.shifted.bam
//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P M0.A.005.sorted.marked.filtered.shifted.bam    ||
||                                                                            ||
||             Output file : readCountInPeaks.txt                             ||
||                 Summary : readCountInPeaks.txt.summary                     ||
||              Annotation : M0.A.005_peaks.saf (SAF)                         ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file M0.A.005_peaks.saf ...                                ||
||    Features : 5948                                                         ||
||    Meta-features : 5948                                                    ||
||    Chromosomes/contigs : 24                                                ||
||                                                                            ||
|| Process BAM file M0.A.005.sorted.marked.filtered.shifted.bam...            ||
||    Paired-end reads are included.                                          ||
||    Assign alignments (paired-end) to features...                           ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 19219605                                             ||
||    Successfully assigned alignments : 230826 (1.2%)                        ||
||    Running time : 0.49 minutes                                             ||
||                                                                            ||
||                                                                            ||
|| Summary of counting results can be found in file "readCountInPeaks.txt.su  ||
|| mmary"                                                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Equal read counts in peak regions were got either from BAM file or tagAlign file. Although counting from BAM consumes more time, one do not need to covert BAM to tagAlign.

And for featureCounts, the result is close to those from intersectBed, which is expected. featureCounts counts the number of fragments, while intersectBed counts the number of reads. But the number of reads is twice as big as the number of fragments (only considering properly mapped reads).

# from intersectBed
> 428359/38439210
[1] 0.0111438

# from featureCounts
> 230826/19219605
[1] 0.01200992

featureCounts may be more accurate when assigning reads spanning multiple features, but it may not be worthy.

So, in practice, I would like to use intersectBed to calculate the FRiP score from BAM files.

Change log

20190320: create the note.