Identify circRNAs and Fusions from RNA-seq Using STARChip

I want to find out some circRNAs from RNA-seq data (total RNA-seq, not poly-A enriched).

There are many tools for this mission. Here is a good review paper ^[1] talking about computational methods for analyzing circRNAs, both identification and downstream analysis. Also another review paper about identifying circRNAs ^[2]. There are also two evaluation papers for the identification tools ^[3]^[4].

From all the tools I know, CIRCexplorer2 ^[5] and CIRI ^[6] are well matained. But I want to try something new: STARChip ^[7].

STARChip is short for Star Chimeric Post, written by Dr. Nicholas Kipp Akers as part of his work in Bojan Losic’s group at the Icahn Institute of Genomics and Multiscale Biology at Mount Sinai School of Medicine

This software is designed to take the chimeric output from the STAR alignment tool and discover high confidence fusions and circular RNA in the data. Before running, you must have used a recent version of STAR with chimeric output turned on, to align your RNA-Seq data.

So, it can identify fusions and circRNAs at the same time. According to its paper, for circRNA detection, “STARChip achieves the best precision of all tools tested and nearly the best sensitivity. This does not appear to come at an increased resource cost. Both CIRI and CIRCexplorer had competitive precision and sensitivity values; STARChip required 43 and 179% of the runtimes of these packages, respectively, and ∼72% of the memory requirements.”; for fusions, “With STARChip, we have attempted to emphasize precision at the expense of sensitivity in these particular gold-standard studies, reasoning that such hyper-tuning inflates type I error in mining novel datasets.”

I’ve discussed with the author Kipp Akers about the precision: https://github.com/LosicLab/starchip/issues/9#issuecomment-381181507. He said:

To your final question, my goal with STARChip was to develop a tool that focused on precision. There are a dozen fusion finders out there that sacrifice everything to get the highest sensitivity. For my projects, this was not too helpful. However, STARChip’s read requirement settings can be set manually and because it runs so quickly, it’s easy to play with the settings to turn up sensitivity and turn down precision and see what you get. Feel free to do so, and let me know what you find!

I agree with the designing purpose of STARChip, so I decide to give it a shot.

There are two main modules in STARChip:

starchip-fusions is for fusion detection. It runs on individual samples.
- /path/to/starchip/starchip-fusions.pl output_seed Chimeric.out.junction Paramters.txt
starchip-circles is for circRNA detection. It runs on groups of samples.
- /path/to/starchip/starchip-circles.pl STARdirs.txt Parameters.txt
- /path/to/starchip/starchip-circles.pl fastq_files.txt parameters.txt

Notes below are more for my own convenience. See its git repo for full usage.

prepare

STARChip is written to be an extension of the STAR read aligner. It is optional for STARChip to run STAR on your samples. In most instances to run STARChip you must first run star on each of your samples. See the STAR documentation for installation, as well as building or downloading a STAR genome index. It is absolutely critical however, that you follow the STAR manual’s instructions and build a genome using all chromosomes plus unplaced contigs. Not doing so will strongly inflate your false positives rate, because reads that map perfectly to an unplaced contig will instead find the next best alignment, often a chimeric alignment. Run STAR with the following parameters required for chimeric output: –chimSegmentMin X –chimJunctionOverhangMin X (where X is an integer). Your project will have it’s own requirements, but a good starting point for your star alignments might look like:

STAR --genomeDir /path/to/starIndex/ --readFilesIn file1_1.fastq.gz file1_2.fastq.gz --runThreadN 11 --outReadsUnmapped Fastx --quantMode GeneCounts --chimSegmentMin 15 --chimJunctionOverhangMin 15 --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM Unsorted

reference/BED files

STARChip makes use of gtf files for annotating fusions and circRNA with gene names.

First, download the package and prepare annotation files:

1
2
3

$ git clone https://github.com/LosicLab/starchip.git && cd starchip

$ mkdir starchip_ref && ./setup.sh ~/RefData/Homo_sapiens/GENCODE_v27/gencode.v27.annotation.gtf ~/RefData/Homo_sapiens/GRCh38_no_alt/genome.fa ./starchip_ref

additional files for Fusions

starchip-fusions filters using the location of known repeats in bed format as well. Following the instructions in the picture to download repeats from UCSC genome browser.

Go to http://genome.ucsc.edu/cgi-bin/hgTables
Change ‘genome’ to your desired genome
Change the following settings:
1. group: Repeat
2. track: RepeatMasker
3. region: genome
4. output format: BED
5. output file: some reasonable name.bed
Click ‘get output’ to download your bed file.
On your local machine sort the bed file: sort -k1,1 -k2,2n repeats.bed > repeats.sorted.bed

If you’re working on hg19 or hg38, you don’t have to do the following things. The files needed are already included in the directory of STARChip.

starchip-fusions can also make use of known antibody parts, and copy number variants. These files come with starchip for human hg19 and hg38 in the reference directory. For other species you can create your own in the simple format: Chromosome StartPosition EndPosition

Finally, starchip-fusions uses known gene families and known/common false-positive pairs to filter out fusions which are likely mapping errors or PCR artifacts. Family data can be downloaded from ensembl biomart:

Go to http://www.ensembl.org/biomart/martview
Database: Ensembl Genes
Dataset: Your species
Click Attributes on the left hand side.
1. Under GENE dropdown, select only “Gene Name”
2. Under PROTEIN FAMILIES AND DOMAINS dropdown select Ensembl Protein Family ID.
Click Results at the top.
Export the file. It should have two columns, Family ID and Gene ID.

Known false positives are stored within data/pseudogenes.txt. In practice, we’ve found that pseudogenes and tissue specific highly expressed genes are commonly “fused” via PCR template switching errors. Feel free to put add any additional lines that result from your data to this file in the format: Gene1Name Gene2Name

run STARChip

Since my previous run of STAR didn’t use parameters --chimSegmentMin and --chimJunctionOverhangMin, I have to start with Fastq files.

starchip-circles can run from Fastq files, but starchip-fusions starts from Chimeric.out.junction. I’ll first run starchip-circles then run starchip-fusions.

First of all, I prepare dirs for STARChip under my WORKDIR like this:

STARChip/
├── STARChip-circRNA
│   ├── starchip-circles.fastqfiles        # the fastq files
│   └── starchip-circles.params            # starchip-circles parameters
└── STARChip-fusions
    └── starchip-fusions.param              # starchip-fusions parameters

run starchip-circles

The parameter file and Fastq file:

$ cat starchip-circles.params 
##Parameters for starchimp-circles
readsCutoff = 5
minSubjectLimit = 10
cpus = 20
do_splice = True
cpmCutoff = 0
subjectCPMcutoff = 0 
annotate = true

#Reference Files
refbed = /software/starchip/starchip_ref/gencode.v27.annotation.gtf.bed  #use setup.sh to create this from a gtf. 
refFasta = /RefData/Homo_sapiens/GRCh38_no_alt/genome.fa

#STAR Parameters
## Do you use a prefix for your STAR output?
starprefix =  
## Are you starting from fastq and need to run STAR alignment? 
runSTAR = True
STARgenome = /RefData/Homo_sapiens/GRCh38_no_alt/STARgenomes  #not necassary if runSTAR != True
STARreadcommand = zcat   #cat for fastq, zcat for fastq.gz etc. not necassary	if runSTAR != True
IDstepsback = 1 ## this is the position from the right of your path of the name of your files.  
		##for example: /path/to/sample1/star/2.4.2/output/Chimeric.out.junction 
		##sample1 is 4 steps back.
		##or /path/to/star/2.4.2/sample1/Chimeric.out.junction
		#sample1 is 1 step back.

$ head -3 starchip-circles.fastqfiles 
XJ-3-1-25_R1.fastq.gz  XJ-3-1-25_R2.fastq.gz
XJ-2-1-25_R1.fastq.gz  XJ-2-1-25_R2.fastq.gz
XJ-7-1-25_R1.fastq.gz  XJ-7-1-25_R2.fastq.gz

Then go into the $WORKDIR/STARChip/STARChip-circRNA and run starchip-circles to generate scripts:

$ $path2circles starchip-circles.fastqfiles starchip-circles.params
Using the following parameters: Circular RNA must have at least 5 reads in at least 10 subjects/output files.  Using 20 CPUs.
Rscript must be callable.  Requiring 0 subjects/outputs with 0 Counts per million circular reads to count a given circular RNA
Other requirements are bedtools (>= 2.24.0)
You have indicated you would like STARChip to perform STAR alignments.  starchip-circles.fastqfiles should contain a list of fastq files; 1 sample per line, multiple files separated by a comma, and paired end files separated by a space.
STARChip run scripts generated, please run ./Step1.sh through Step4.sh to detect and quantify circRNA

There will be four scripts:

Step1.sh: align
Step2.sh: discover circRNA
Step3.sh: re-align
Step4.sh: quantify/annotate

Step2.sh and Step3.sh use STAR in the system PATH, but I want to use another one:

1 2	sed -i '3,$s\|^\|/software/STAR-2.5.3a/bin/Linux_x86_64_static/\|g' Step1.sh sed -i '3,$s\|^\|/software/STAR-2.5.3a/bin/Linux_x86_64_static/\|g' Step3.sh

I’m working on a PBS grid system, then I create a script to submit these scripts:

#!/bin/bash
#PBS -V
#PBS -j eo
#PBS -N STARChip-circles
#PBS -q Blade
#PBS -l nodes=1:ppn=20

echo Start time is `date +%Y/%m/%d--%H:%M`

# work dir
WORKDIR=/STARChip/STARChip-circRNA

# starchip-circles
sh Step1.sh
sh Step2.sh
sh Step3.sh
sh Step4.sh

echo Finish time is `date +%Y/%m/%d--%H:%M`

In my samples, only four circRNAs were identified by STARChip-circles.

$ cat circRNA.5reads.10ind.countmatrix
8_4-10_R1.fastq.gz	8_4-11_R1.fastq.gz	8_4-3_R1.fastq.gz	8_4-4_R1.fastq.gz	8_4-5_R1.fastq.gz	8_4-6_R1.fastq.gz	8_4-7_R1.fastq.gz	8_4-8_R1.fastq.gz   8_4-9_R1.fastq.gz	XJ-10-1-25_R1.fastq.gz	XJ-11-1-25_R1.fastq.gz	XJ-1-1-25_R1.fastq.gz	XJ-12-1-25_R1.fastq.gz	XJ-13-1-25_R1.fastq.gz	XJ-2-1-25_R1.fastq.gz	XJ-3-1-25_R1.fastq.gXJ-4-1-25_R1.fastq.gz	XJ-5-1-25_R1.fastq.gz	XJ-6-1-25_R1.fastq.gz	XJ-7-1-25_R1.fastq.gz	XJ-8-1-25_R1.fastq.gz	XJ-9-1-25_R1.fastq.gz
chr1:117402186-117442325	0	0	0	2	0	0	4	0	6	3	8	0	0	4	3	2	2	2	0	0   3
chr15:101213315-101216678	0	0	38	20	8	0	0	0	0	0	0	26	0	0	58	60	18	12	25	23  0
chr15:90217439-90219891	0	4	0	1	0	0	0	0	1	0	1	0	5	2	0	1	0	2	0	1	9   4
chr9:111786793-111787947	0	0	33	186	23	0	0	0	0	1	0	10	0	0	5	29	42	14	15	15  0

run starchip-fusions

The parameter file:

### Parameters for fusions-from-star.pl

## Describing Your Data:

pairedend = TRUE  #TRUE means paired end data.  any other value means single end. $spancutoff should be 0 if data is single end.
consensus = TRUE  # anything but TRUE will make this skip the consensus sequence generation for each fusion.
## Filters, Cutoffs 

splitReads = auto # number of minimum read support at jxn.  Minimum 2.  Greatly impacts running time. Other options are: "auto" , "highsensitivity" , and "highprecision"   
uniqueReads = 2 # number of unique read support values (higher indicates more likely to be real. lower is more likely amplification artifact).
spancutoff = 1 #minimum number of non-split reads support.  If single end data, this must be 0 or auto.  Other options are: "auto" , "highsensitivity" , and "highprecision"
wiggle = 500  #number of base-pairs of 'wiggle-room' when determining the location of a fusion (for spanning read counts)
overlapLimit = 5 #wiggle room for joining very closely called fusion sites.
samechrom_wiggle = 20000 #this is the distance that fusions have to be from each other if on the same chromosome.  Set to 0 if you want no filtering of same-chromosome pr
lopsidedupper = 10 # (topsidereads + 0.1) / (bottomsidereads + 0.1) must be below this value. set very high to disable.  Reccomended setting 5
lopsidedlower = 0.1 # (topsidereads + 0.1) / (bottomsidereads + 0.1) must be above this value. set to 0 to disable. Reccomended setting 0.2
cnvwiggle = 1000 #we skip fusions that can be explained by known cnvs.  how close to the edges of the cnv must our fusion be?
circlesize = 100000 #we skip fusions that look more like circular rna/backsplices.  how big (bp) could a circle be?    

## Local Reference Files:
	#refbed is a bed format version of a gtf.  This should probably be derived from the same GTF that STAR aligned with using setup.sh.   
refbed=/software/starchip/starchip_ref/gencode.v27.annotation.gtf.bed
        # a bed format list of known repeats
repeatbed=/RefData/RepeatMasker/hg38.ucsc.180414.rmsk.sorted.bed
        # fasta reference.  should be indexed (run 'samtools faidx file.fa')
refFasta = /RefData/Homo_sapiens/GRCh38_no_alt/genome.fa
abparts = reference/hg38.abparts
cnvs = reference/conrad_hg38.cnvs
familyfile = reference/ensfams.txt
falsepositives = reference/knownFP.txt 


#Scoring Parameters (feel free to tweak).
splitscoremod = 10
spanscoremod = 20
skewpenalty = 4
repeatpenalty = 0.5  # score = score*(repeatpenalty^repeats)  --> a fusion can have 0,1,or 2 sites fall into repeat regions.

Based on the output of previous STAR running for starchip-circles, the script to run starchip-fusions contains:

#!/bin/bash
#PBS -V
#PBS -j eo
#PBS -N STARChip-fusions
#PBS -q Blade
#PBS -l nodes=1:ppn=2

echo Start time is `date +%Y/%m/%d--%H:%M`

# work dir
WORKDIR=/STARChip

# starchip-fusions
TOOLDIR=/software
path2fusions=$TOOLDIR/starchip/starchip-fusions.pl

for sample in for sample in XJ-3-1-25 XJ-2-1-25 XJ-7-1-25 XJ-4-1-25 XJ-5-1-25 XJ-6-1-25 XJ-1-1-25 XJ-10-1-25 XJ-9-1-25 XJ-13-1-25 XJ-11-1-25 XJ-12-1-25 XJ-8-1-25 8_4-3 8_4-4 8_4-5 8_4-6 8_4-7 8_4-8 8_4-9 8_4-10 8_4-11
do
	/usr/bin/perl $path2fusions $sample $WORKDIR/STARChip-circRNA/STARout/${sample}_R1.fastq.gz/Chimeric.out.junction starchip-fusions.param
done

echo Finish time is `date +%Y/%m/%d--%H:%M`

In my samples, no fusions were found by STARChip-fusions, and I don’t want to tweak parameters to improve sensitivity.

Change notes

20180413: create the note.

Gao Y, Zhao F. 2018 Jan 12. Computational Strategies for Exploring Circular RNAs. Trends in Genetics. doi:10.1016/j.tig.2017.12.016. [accessed 2018 Jan 15]. https://www.sciencedirect.com/science/article/pii/S0168952517302366. ↩︎
Szabo L, Salzman J. 2016. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet 17:679–692. doi:10.1038/nrg.2016.114. ↩︎
Hansen TB, Ven? MT, Damgaard CK, Kjems J. 2016. Comparison of circular RNA prediction tools. Nucleic Acids Research 44:e58–e58. doi:10.1093/nar/gkv1458. ↩︎
Zeng X, Lin W, Guo M, Zou Q. 2017. A comprehensive overview and evaluation of circular RNA detection tools. PLOS Computational Biology 13:e1005420. doi:10.1371/journal.pcbi.1005420. ↩︎
Zhang X-O, Dong R, Zhang Y, Zhang J-L, Luo Z, Zhang J, Chen L-L, Yang L. 2016. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res. 26:1277–1287. doi:10.1101/gr.202895.115. ↩︎
Gao Y, Wang J, Zhao F. 2015. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biology 16:4. doi:10.1186/s13059-014-0571-3. ↩︎
Akers NK, Schadt EE, Losic B. 2018 Feb 20. STAR Chimeric Post for rapid detection of circular RNA and fusion transcripts. Bioinformatics:bty091–bty091. doi:10.1093/bioinformatics/bty091. ↩︎