Recently I wanted to check viral expression from RNA-seq data.
I found two good examples:
Cao S, Strong MJ, Wang X, Moss WN, Concha M, Lin Z, O’Grady T, Baddoo M, Fewell C, Renne R, et al. 2015. High-Throughput RNA Sequencing-Based Virome Analysis of 50 Lymphoma Cell Lines from the Cancer Cell Line Encyclopedia Project. J. Virol. 89:713–729. doi:10.1128/JVI.02570-14.
Wang Zheng, Hao Y, Zhang C, Wang Zhiliang, Liu X, Li G, Sun L, Liang J, Luo J, Zhou D, et al. 2017. The Landscape of Viral Expression Reveals Clinically Relevant Viruses with Potential Capability of Promoting Malignancy in Lower-Grade Glioma. Clinical Cancer Research 23:2177–2185.
Also some useful discussions:
Alex (the author of STAR) suggested combining the human genome and the viral genomes into a single reference. But I had already mapped the FASTQ files to the human genome (hg38) and saved the unmapped reads to separate FASTQ files.
Step 1, download all viral genomes from the NCBI RefSeq Viral Release.
wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
Step 2, build the STAR index.
mkdir STARgenomes
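A minimal sketch of how the index-building command could be assembled, not the exact command used in this note. It assumes the downloaded FASTA has been decompressed to `viral.1.1.genomic.fna` (STAR reads uncompressed FASTA), and it applies the STAR manual's rule of thumb for small genomes, `--genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1)`. The helper names, the thread count, and the genome length passed in the example are hypothetical; use the real total length of your FASTA.

```python
import math
import shlex


def sa_index_nbases(genome_length: int) -> int:
    """Rule of thumb from the STAR manual for small genomes."""
    return min(14, int(math.log2(genome_length) / 2 - 1))


def star_index_command(fasta, genome_dir, genome_length, threads=4):
    """Build (but do not run) a STAR genomeGenerate command as an argument list."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--genomeSAindexNbases", str(sa_index_nbases(genome_length)),
        "--runThreadN", str(threads),
    ]


if __name__ == "__main__":
    # 250_000_000 is a placeholder length, not a measured value.
    cmd = star_index_command("viral.1.1.genomic.fna", "STARgenomes", 250_000_000)
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.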
Step 3, align unmapped reads to viral genomes.
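The alignment command for this step could look roughly like the sketch below. It is not the command used in this note: the helper name, the FASTQ file names, and the output prefix are hypothetical, and only standard STAR options are used (`--genomeDir`, `--readFilesIn`, `--readFilesCommand`, `--outSAMtype`, `--outFileNamePrefix`, `--runThreadN`).

```python
import shlex


def star_align_command(genome_dir, fastq_files, out_prefix, threads=4):
    """Build (but do not run) a STAR alignment command as an argument list."""
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]
    # Gzipped input needs a decompression command.
    if all(name.endswith(".gz") for name in fastq_files):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    # Hypothetical names for the unmapped reads saved after the hg38 alignment.
    reads = ["sample1.unmapped_1.fastq.gz", "sample1.unmapped_2.fastq.gz"]
    print(shlex.join(star_align_command("STARgenomes", reads, "viral/sample1.")))
```

Most reads in these files will not be viral, so a low mapping rate against the viral index is expected.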
Step 4, compute viral expression.
I wanted to use existing read-counting software to quantify the viruses, so I had to create a fake annotation (a fake GTF file).
The tiny script to create a GTF from the FASTA file looked like this:
#!/usr/bin/env python
And the output looked like this:
head -3 viral.refseq.180424.fake.gtf
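A minimal sketch of the FASTA-to-GTF idea, not the exact script from this note. It assumes one GTF line per FASTA record: a single "exon" feature spanning the whole sequence, with the sequence ID (the first word of the header) reused as `gene_id` and `transcript_id`. The function names, the `fake` source label, and the `+` strand are illustrative choices; the strand is irrelevant when counting is unstranded.

```python
import io


def fasta_lengths(handle):
    """Yield (sequence_id, length) for each record in a FASTA stream."""
    seq_id, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            # Keep only the first word of the header as the sequence ID.
            seq_id, length = line[1:].split()[0], 0
        elif line:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length


def fake_gtf_lines(handle, source="fake"):
    """Yield one whole-genome 'exon' GTF line per FASTA record."""
    for seq_id, length in fasta_lengths(handle):
        attrs = f'gene_id "{seq_id}"; transcript_id "{seq_id}";'
        fields = [seq_id, source, "exon", "1", str(length), ".", "+", ".", attrs]
        yield "\t".join(fields)


if __name__ == "__main__":
    # Toy input with made-up sequence IDs, for illustration only.
    demo = io.StringIO(">seqA toy record\nACGT\nAC\n>seqB another\nGGG\n")
    for gtf_line in fake_gtf_lines(demo):
        print(gtf_line)
```

Using "exon" as the feature type and `gene_id` as the attribute matches the defaults that featureCounts expects for GTF annotations, so no extra options are needed for the annotation itself.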
Then I used the featureCounts function from the Rsubread R package to count the reads per virus (unstranded, because the direction of transcription is not known), and the rpkm function from edgeR to normalize the raw counts to viral “FPKM”.
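For reference, the normalization itself is simple arithmetic: RPKM = count × 10^9 / (feature length in bp × library size). The sketch below illustrates that formula in Python; it is not edgeR's implementation, and the function and argument names are hypothetical. By default it uses the sum of the supplied counts as the library size. Passing the total number of sequenced or mapped reads instead is an alternative that makes values comparable across samples with different viral loads; which one was used in this note is not stated.

```python
def rpkm(counts, lengths, lib_size=None):
    """Reads per kilobase per million for each feature.

    counts   -- dict: feature ID -> raw read count
    lengths  -- dict: feature ID -> feature length in bp
    lib_size -- total reads used for scaling; defaults to sum of counts
    """
    if lib_size is None:
        lib_size = sum(counts.values())
    if lib_size <= 0:
        raise ValueError("library size must be positive")
    return {
        feature: count * 1e9 / (lengths[feature] * lib_size)
        for feature, count in counts.items()
    }


if __name__ == "__main__":
    # Toy numbers with made-up IDs, for illustration only.
    toy_counts = {"seqA": 10, "seqB": 30}
    toy_lengths = {"seqA": 1000, "seqB": 3000}
    print(rpkm(toy_counts, toy_lengths, lib_size=1_000_000))
```

Because each virus is treated as a single feature covering its whole genome, the value reflects read density over the genome rather than the expression of any individual viral gene.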
Note:
- The expression values are only estimates and may contain many errors. Interpret the results with care.
Change notes
- 20180424: created the note.