I want to identify circRNAs in RNA-seq data (total RNA-seq, not poly-A enriched).
There are many tools for this task. A good review paper [1] covers computational methods for analyzing circRNAs, both identification and downstream analysis, and another review [2] focuses on identifying circRNAs. There are also two evaluation papers for the identification tools [3][4].
Of the tools I know, CIRCexplorer2 [5] and CIRI [6] are well maintained. But I want to try something new: STARChip [7].
STARChip is short for Star Chimeric Post, written by Dr. Nicholas Kipp Akers as part of his work in Bojan Losic’s group at the Icahn Institute of Genomics and Multiscale Biology at Mount Sinai School of Medicine.
This software is designed to take the chimeric output from the STAR alignment tool and discover high confidence fusions and circular RNA in the data. Before running, you must have used a recent version of STAR with chimeric output turned on, to align your RNA-Seq data.
So, it can identify fusions and circRNAs at the same time. According to its paper, for circRNA detection, “STARChip achieves the best precision of all tools tested and nearly the best sensitivity. This does not appear to come at an increased resource cost. Both CIRI and CIRCexplorer had competitive precision and sensitivity values; STARChip required 43 and 179% of the runtimes of these packages, respectively, and ∼72% of the memory requirements.”; for fusions, “With STARChip, we have attempted to emphasize precision at the expense of sensitivity in these particular gold-standard studies, reasoning that such hyper-tuning inflates type I error in mining novel datasets.”
I discussed precision with the author, Kipp Akers: https://github.com/LosicLab/starchip/issues/9#issuecomment-381181507. He said:
To your final question, my goal with STARChip was to develop a tool that focused on precision. There are a dozen fusion finders out there that sacrifice everything to get the highest sensitivity. For my projects, this was not too helpful. However, STARChip’s read requirement settings can be set manually and because it runs so quickly, it’s easy to play with the settings to turn up sensitivity and turn down precision and see what you get. Feel free to do so, and let me know what you find!
I agree with the design goal of STARChip, so I decided to give it a shot.
There are two main modules in STARChip:
- starchip-fusions is for fusion detection. It runs on individual samples.
/path/to/starchip/starchip-fusions.pl output_seed Chimeric.out.junction Parameters.txt
- starchip-circles is for circRNA detection. It runs on groups of samples, starting from either STAR output directories or FASTQ files.
/path/to/starchip/starchip-circles.pl STARdirs.txt Parameters.txt
/path/to/starchip/starchip-circles.pl fastq_files.txt parameters.txt
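For the STARdirs.txt form, a list of STAR output directories can be collected with find. This is only a sketch: it assumes STARdirs.txt holds one STAR output directory per line (check the STARChip README for the exact format), and the sample directory names are made up for the demo.

```shell
# List every directory under a root that contains STAR chimeric junction output.
list_star_dirs() {
    # $1: root directory to search
    find "$1" -type f -name 'Chimeric.out.junction' -exec dirname {} \; | sort
}

# Toy layout standing in for real STAR output (hypothetical sample names).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/sampleA" "$demo_root/sampleB" "$demo_root/sampleC"
touch "$demo_root/sampleA/Chimeric.out.junction" "$demo_root/sampleB/Chimeric.out.junction"

# sampleC has no chimeric output, so it is left out of the list.
list_star_dirs "$demo_root" > "$demo_root/STARdirs.txt"
cat "$demo_root/STARdirs.txt"
```

Samples that were aligned without chimeric output simply do not show up in the list, which makes them easy to spot.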
The notes below are mostly for my own convenience. See the STARChip git repo for full usage.
prepare
STARChip is written to be an extension of the STAR read aligner. STARChip can optionally run STAR on your samples, but in most instances you must first run STAR on each of your samples yourself. See the STAR documentation for installation, as well as for building or downloading a STAR genome index. It is absolutely critical, however, that you follow the STAR manual’s instructions and build a genome using all chromosomes plus unplaced contigs. Not doing so will strongly inflate your false-positive rate, because reads that map perfectly to an unplaced contig will instead find the next best alignment, often a chimeric alignment. Run STAR with the following parameters, which are required for chimeric output: --chimSegmentMin X --chimJunctionOverhangMin X (where X is an integer). Your project will have its own requirements, but a good starting point for your STAR alignments might look like:
STAR --genomeDir /path/to/starIndex/ --readFilesIn file1_1.fastq.gz file1_2.fastq.gz --runThreadN 11 --outReadsUnmapped Fastx --quantMode GeneCounts --chimSegmentMin 15 --chimJunctionOverhangMin 15 --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM Unsorted
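Since the index has to include unplaced contigs, it is worth checking an existing index before aligning. The sketch below assumes the index directory contains chrName.txt with one sequence name per line, and that primary chromosomes follow human-style naming (1-22, X, Y, M or MT, with or without a chr prefix). The toy file stands in for /path/to/starIndex/chrName.txt.

```shell
# Count sequences in a STAR index that are not primary chromosomes.
count_non_primary() {
    # $1: path to chrName.txt (one sequence name per line)
    grep -Evc '^(chr)?([0-9]+|X|Y|M|MT)$' "$1"
}

# Toy example with made-up contents.
demo_dir=$(mktemp -d)
printf '%s\n' chr1 chr2 chrX chrM chrUn_gl000220 chr1_gl000191_random > "$demo_dir/chrName.txt"

n_extra=$(count_non_primary "$demo_dir/chrName.txt")
echo "non-primary sequences: $n_extra"
```

A count of zero on a real index would suggest it was built from primary chromosomes only and should be rebuilt.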
reference/BED files
STARChip makes use of gtf files for annotating fusions and circRNA with gene names.
First, download the package and prepare annotation files:
git clone https://github.com/LosicLab/starchip.git && cd starchip
additional files for Fusions
starchip-fusions also filters using the locations of known repeats in BED format. Follow the instructions below to download repeats from the UCSC Genome Browser.
- Go to http://genome.ucsc.edu/cgi-bin/hgTables
- Change ‘genome’ to your desired genome
- Change the following settings:
  - group: Repeat
  - track: RepeatMasker
  - region: genome
  - output format: BED
  - output file: some reasonable name.bed
- Click ‘get output’ to download your bed file.
- On your local machine sort the bed file:
sort -k1,1 -k2,2n repeats.bed > repeats.sorted.bed
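To confirm that a file really is in the order this sort command produces, sort -c checks it without rewriting anything. A small example with made-up repeat entries:

```shell
# Toy BED file, deliberately out of order (entries are invented for the demo).
demo=$(mktemp -d)
printf 'chr2\t500\t600\tAluY\nchr1\t2000\t2100\tL1\nchr1\t100\t300\tMIR\n' > "$demo/repeats.bed"

# Same sort as above: by chromosome name, then numerically by start position.
sort -k1,1 -k2,2n "$demo/repeats.bed" > "$demo/repeats.sorted.bed"

# sort -c exits non-zero if the input is not already sorted by the given keys.
if sort -c -k1,1 -k2,2n "$demo/repeats.sorted.bed"; then
    sorted_ok=yes
else
    sorted_ok=no
fi
first_line=$(head -n 1 "$demo/repeats.sorted.bed")
echo "sorted: $sorted_ok"
```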
If you’re working with hg19 or hg38, you can skip the steps below: the files needed are already included in the STARChip directory.
starchip-fusions can also make use of known antibody parts and copy number variants. These files come with STARChip for human hg19 and hg38 in the reference directory. For other species you can create your own in the simple format: Chromosome StartPosition EndPosition
Finally, starchip-fusions uses known gene families and known/common false-positive pairs to filter out fusions which are likely mapping errors or PCR artifacts. Family data can be downloaded from ensembl biomart:
- Go to http://www.ensembl.org/biomart/martview
- Database: Ensembl Genes
- Dataset: Your species
- Click Attributes on the left hand side.
- Under GENE dropdown, select only “Gene Name”
- Under PROTEIN FAMILIES AND DOMAINS dropdown select Ensembl Protein Family ID.
- Click Results at the top.
- Export the file. It should have two columns, Family ID and Gene ID.
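Before using the export, a quick look at how many rows actually carry both columns can catch a bad download. This sketch only inspects the file and does not change it. It assumes a tab-separated export with a header line; the family IDs and gene names below are invented.

```shell
# Report how many data rows have both columns filled and how many lack one.
summarize_families() {
    # $1: tab-separated export with a header line
    awk -F'\t' 'NR > 1 { if ($1 != "" && $2 != "") ok++; else bad++ }
                END { printf "complete=%d incomplete=%d\n", ok, bad }' "$1"
}

# Toy export: two complete rows and one gene without a family ID.
demo=$(mktemp -d)
printf 'Family ID\tGene name\nFAM0001\tGENEA\nFAM0001\tGENEB\n\tGENEC\n' > "$demo/families.txt"

family_summary=$(summarize_families "$demo/families.txt")
echo "$family_summary"
```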
Known false positives are stored within data/pseudogenes.txt. In practice, we’ve found that pseudogenes and tissue-specific highly expressed genes are commonly “fused” via PCR template switching errors. Feel free to add any additional lines that result from your data to this file in the format: Gene1Name Gene2Name
run STARChip
Since my previous STAR run didn’t use the --chimSegmentMin and --chimJunctionOverhangMin parameters, I have to start from FASTQ files.
starchip-circles can run from FASTQ files, but starchip-fusions starts from Chimeric.out.junction. So I’ll first run starchip-circles, then run starchip-fusions.
First of all, I prepare directories for STARChip under my WORKDIR like this:
STARChip/
run starchip-circles
The parameter file and the FASTQ file list:
cat starchip-circles.params
Then go into $WORKDIR/STARChip/STARChip-circRNA and run starchip-circles to generate the scripts:
$path2circles starchip-circles.fastqfiles starchip-circles.params
There will be four scripts:
- Step1.sh: align
- Step2.sh: discover circRNA
- Step3.sh: re-align
- Step4.sh: quantify/annotate
Step1.sh and Step3.sh (the align and re-align steps) use the STAR found in the system PATH, but I want to use a different build:
sed -i '3,$s|^|/software/STAR-2.5.3a/bin/Linux_x86_64_static/|g' Step1.sh
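The sed expression prepends the install path to every line from line 3 to the end of the file, so it only does the right thing if each of those lines starts with a bare STAR call. A toy demonstration follows. The script contents are invented stand-ins, not what STARChip actually generates, and the output goes to a new file instead of using -i so the effect is easy to compare.

```shell
# Toy stand-in for a generated script: two header lines, then STAR calls.
demo=$(mktemp -d)
cat > "$demo/Step1.sh" <<'EOF'
#!/bin/bash
# toy stand-in for a generated script
STAR --genomeDir /path/to/starIndex/ --readFilesIn a_1.fastq.gz a_2.fastq.gz
STAR --genomeDir /path/to/starIndex/ --readFilesIn b_1.fastq.gz b_2.fastq.gz
EOF

# '3,$' limits the substitution to line 3 onward; 's|^|PREFIX|' inserts at line start.
sed '3,$s|^|/software/STAR-2.5.3a/bin/Linux_x86_64_static/|' "$demo/Step1.sh" > "$demo/Step1.patched.sh"

cat "$demo/Step1.patched.sh"
```

If the generated scripts contain lines that do not start with STAR, putting the desired STAR directory at the front of PATH inside the job script is a less fragile alternative.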
I’m working on a PBS grid system, so I created a script to submit these scripts:
#!/bin/bash
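One possible way to submit the four steps so that each waits for the previous one is PBS job dependencies. This is a generic sketch, not the script used for this run: it assumes the steps must run strictly in order and that qsub prints the job ID on standard output. submit_chain and the QSUB override are hypothetical names introduced for illustration.

```shell
# QSUB can be overridden for a dry run; it defaults to the real qsub.
QSUB=${QSUB:-qsub}

# Submit scripts in order; each job starts only after the previous one succeeds.
submit_chain() {
    chain_prev=""
    for chain_script in "$@"; do
        if [ -z "$chain_prev" ]; then
            chain_job=$($QSUB "$chain_script")
        else
            # afterok: run only if the named job exited without error
            chain_job=$($QSUB -W "depend=afterok:$chain_prev" "$chain_script")
        fi
        echo "$chain_script -> $chain_job"
        chain_prev=$chain_job
    done
}

# Usage (on a machine with PBS):
# submit_chain Step1.sh Step2.sh Step3.sh Step4.sh
```

Resource requests (queue, cores, memory, walltime) are site-specific and would go either in #PBS header lines of each script or as extra qsub options.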
In my samples, only four circRNAs were identified by STARChip-circles.
$ cat circRNA.5reads.10ind.countmatrix
run starchip-fusions
The parameter file:
## Parameters for fusions-from-star.pl
Based on the output of the earlier STAR run for starchip-circles, the script to run starchip-fusions contains:
#!/bin/bash
In my samples, no fusions were found by STARChip-fusions, and I don’t want to tweak the parameters to improve sensitivity.
Change notes
- 20180413: create the note.
[1] Gao Y, Zhao F. 2018 Jan 12. Computational Strategies for Exploring Circular RNAs. Trends in Genetics. doi:10.1016/j.tig.2017.12.016. [accessed 2018 Jan 15]. https://www.sciencedirect.com/science/article/pii/S0168952517302366.
[2] Szabo L, Salzman J. 2016. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet 17:679–692. doi:10.1038/nrg.2016.114.
[3] Hansen TB, Venø MT, Damgaard CK, Kjems J. 2016. Comparison of circular RNA prediction tools. Nucleic Acids Research 44:e58–e58. doi:10.1093/nar/gkv1458.
[4] Zeng X, Lin W, Guo M, Zou Q. 2017. A comprehensive overview and evaluation of circular RNA detection tools. PLOS Computational Biology 13:e1005420. doi:10.1371/journal.pcbi.1005420.
[5] Zhang X-O, Dong R, Zhang Y, Zhang J-L, Luo Z, Zhang J, Chen L-L, Yang L. 2016. Diverse alternative back-splicing and alternative splicing landscape of circular RNAs. Genome Res. 26:1277–1287. doi:10.1101/gr.202895.115.
[6] Gao Y, Wang J, Zhao F. 2015. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biology 16:4. doi:10.1186/s13059-014-0571-3.
[7] Akers NK, Schadt EE, Losic B. 2018 Feb 20. STAR Chimeric Post for rapid detection of circular RNA and fusion transcripts. Bioinformatics:bty091–bty091. doi:10.1093/bioinformatics/bty091.