Viral Expression in RNA-seq data

Recently I wanted to check viral expression from RNA-seq data.

I found two good examples:

Cao S, Strong MJ, Wang X, Moss WN, Concha M, Lin Z, O’Grady T, Baddoo M, Fewell C, Renne R, et al. 2015. High-Throughput RNA Sequencing-Based Virome Analysis of 50 Lymphoma Cell Lines from the Cancer Cell Line Encyclopedia Project. J. Virol. 89:713–729. doi:10.1128/JVI.02570-14.

Wang Zheng, Hao Y, Zhang C, Wang Zhiliang, Liu X, Li G, Sun L, Liang J, Luo J, Zhou D, et al. 2017. The Landscape of Viral Expression Reveals Clinically Relevant Viruses with Potential Capability of Promoting Malignancy in Lower-Grade Glioma. Clinical Cancer Research 23:2177–2185.

Also some useful discussions:

Alex (the author of STAR) suggested to combine human genome and viruses. But I already mapped the FASTQ to human genome (hg38), and saved unmapped reads to seperated FASTQ files.

Step 1, download all virul genomes from NCBI Refseq Viral Release.

$ wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
$ wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.2.1.genomic.fna.gz

$ gzip -d *

$ cat viral.1.1.genomic.fna viral.2.1.genomic.fna > viral.refseq.180424.fa

Step 2, build STAR index.

1
2
3

$ mkdir STARgenomes

$ /software/STAR-2.5.3a/bin/Linux_x86_64_static/STAR --runThreadN 10 --genomeDir ./STARgenomes --runMode genomeGenerate --genomeFastaFiles viral.refseq.180424.fa

Step 3, align unmapped reads to viral genomes.


for sample in x1 x2 ...
do
    $TOOLDIR/STAR-2.5.3a/bin/Linux_x86_64_static/STAR --runMode alignReads --runThreadN 24 --genomeDir $STARindex --outSAMtype BAM SortedByCoordinate --outSAMattributes All --readFilesIn $WORKDIR/STAR_out/${sample}_Unmapped.out.mate1.gz $WORKDIR/STAR_out/${sample}_Unmapped.out.mate2.gz --readFilesCommand zcat --outFileNamePrefix $WORKDIR/Viral_expression/${sample}_
    $TOOLDIR/samtools.1.3.1/bin/samtools index $WORKDIR/Viral_expression/${sample}_Aligned.sortedByCoord.out.bam
done

Step 4, compute viral expression.

I wanted to use existing read-counting software to quantify the viruses, so I had to create a fake annotation (a fake GTF file).

The tiny script to create GTF from FASTA file was like this:

#!/usr/bin/evn python

'''
purpose: to calculate read count of virus, I want to make a fake GTF of virus fasta.

usage: python xxx.py virus.fa > virus.gtf

'''
import sys

# step 1: read the fasta and put it into a dict
seq_dict = {}
with open(sys.argv[1], 'r') as fin:
	for line in fin:
		line = line.strip()
		if line.startswith('>'):
			seqID = line.split()[0].replace('>', '')
			seqName = ' '.join(line.split(',')[0].split()[1:])
			seq_dict[seqID] = [seqName, '']
		else:
			seq_dict[seqID][1] += line

# step2: traverse the dict and generate GTF. Use the whole virus as a exon.
for key in seq_dict:
	seqLen = len(seq_dict[key][1])
	tmp_list = [key, 'Virus', 'exon', '1', str(seqLen), '.', '+', '.']
	print '\t'.join(tmp_list) + '\t' + 'gene_id "' + key + '"; gene_name "' + seq_dict[key][0] + '";'

And the output looked like this:

$ head -3 viral.refseq.180424.fake.gtf
NC_003747.2	Virus	exon	1	4212	.	+	.	gene_id "NC_003747.2"; gene_name "Ryegrass mottle virus isolate MAFF. No. 307043 from Japan";
NC_011500.2	Virus	exon	1	1614	.	+	.	gene_id "NC_011500.2"; gene_name "Rotavirus A segment 5";
NC_007737.1	Virus	exon	1	3055	.	+	.	gene_id "NC_007737.1"; gene_name "Liao ning virus segment 2";

Then I used featureCounts function from Rsubread R package to count the reads of viruses (non-strand specific, 'cause not knowing the transcription direction), and used rpkm function of edgeR to normalize the raw count to viral “FPKM”.

Note:

The expression is a estimation. There maybe lots of errors. Be careful to interpret the results.

Change notes

20180424: create the note.