Biological ID Conversion

ID mapping is tedious, but we have to deal with it very often. This note collects methods for handling it.

R (Bioconductor)

Bioconductor hosts many annotation packages, which contain far more kinds of annotation than any single task needs. Different series of annotation packages are designed for different purposes, and these differences should be kept in mind in practice.

For ID conversion, two main resources can be used: biomaRt, the R interface of BioMart, and various specialized annotation packages.

biomaRt

Ref: The biomaRt users guide

biomaRt is an R interface to BioMart databases. It is very powerful, and ID conversion is only one of its many applications.

The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, COSMIC, Uniprot, HGNC, Gramene, Wormbase and dbSNP mapped to Ensembl. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from gene annotation to database mining. via: http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html

library("biomaRt")

Display all available BioMart web services

listMarts()

  biomart              version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 93
2 ENSEMBL_MART_MOUSE   Mouse strains 93
3 ENSEMBL_MART_SNP     Ensembl Variation 93
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 93

Choose to query the Ensembl BioMart database.

ensembl=useMart("ensembl")

Use listDatasets() to see which datasets are available in the selected BioMart.

listDatasets(ensembl)[1:5, ]

  dataset                     description                        version
1 acarolinensis_gene_ensembl  Anole lizard genes (AnoCar2.0)     AnoCar2.0
2 amelanoleuca_gene_ensembl   Panda genes (ailMel1)              ailMel1
3 amexicanus_gene_ensembl     Cave fish genes (AstMex102)        AstMex102
4 anancymaae_gene_ensembl     Ma's night monkey genes (Anan_2.0) Anan_2.0
5 aplatyrhynchos_gene_ensembl Duck genes (BGI_duck_1.0)          BGI_duck_1.0

Update the Mart object using the function useDataset()

ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

Alternatively, if the dataset is known in advance, we can select a BioMart database and dataset in one step:

ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")

Show all available filters in the selected dataset.

filters = listFilters(ensembl)
filters[1:5,]

##   name            description
## 1 chromosome_name Chromosome/scaffold name
## 2 start           Start
## 3 end             End
## 4 band_start      Band Start
## 5 band_end        Band End

Display all available attributes in the selected dataset.

attributes = listAttributes(ensembl)
attributes[1:5,]

  name                          description                  page
1 ensembl_gene_id               Gene stable ID               feature_page
2 ensembl_gene_id_version       Gene stable ID version       feature_page
3 ensembl_transcript_id         Transcript stable ID         feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5 ensembl_peptide_id            Protein stable ID            feature_page

Annotate a set of Affymetrix identifiers with the Entrez Gene IDs of the corresponding genes.

# Note: in newer Ensembl releases (97 onward, as far as I know) this
# attribute is named 'entrezgene_id' instead of 'entrezgene'
affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'),
      filters = 'affy_hg_u133_plus_2',
      values = affyids,
      mart = ensembl)

  affy_hg_u133_plus_2 entrezgene
1 209310_s_at         837
2 207500_at           838
3 202763_at           836
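
Note that getBM() returned the rows in a different order from affyids. Here is a minimal sketch of restoring the input order with base R's match(). It assumes the result has been copied into a data frame called res, a hypothetical stand-in for the output above, so no BioMart connection is needed to try it:

```r
affyids <- c("202763_at", "209310_s_at", "207500_at")

# Hypothetical copy of the getBM() result shown above
res <- data.frame(
  affy_hg_u133_plus_2 = c("209310_s_at", "207500_at", "202763_at"),
  entrezgene          = c(837L, 838L, 836L),
  stringsAsFactors    = FALSE
)

# match() returns, for each input ID, the row index of its first hit in res
res_ordered <- res[match(affyids, res$affy_hg_u133_plus_2), ]
rownames(res_ordered) <- NULL
res_ordered
```

match() keeps only the first hit per ID, so for probes that map to several genes merge() is probably the safer choice.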

Retrieve all HUGO gene symbols of genes that are located on chromosomes 17, 20 or Y and are associated with specific GO terms.

go=c("GO:0051330","GO:0000080","GO:0000114","GO:0000082")
chrom=c(17,20,"Y")
getBM(attributes= "hgnc_symbol",
      filters=c("go","chromosome_name"),
      values=list(go, chrom), mart=ensembl)

  hgnc_symbol
1 RPS6KB1
2 CDC6
3 RPA1
4 CDK3
5 MCM8
6 CRLF3

Annotate a set of EntrezGene identifiers with GO annotation.

entrez=c("673","837")
goids = getBM(attributes = c('entrezgene', 'go_id'),
              filters = 'entrezgene',
              values = entrez,
              mart = ensembl)
head(goids)

  entrezgene go_id
1 673        GO:0000166
2 673        GO:0004672
3 673        GO:0004674
4 673        GO:0005524
5 673        GO:0006468
6 673

Convert Ensembl IDs to gene symbols and Entrez IDs.

library(biomaRt)
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

ensg = c('ENSG00000242268.2', 'ENSG00000158486.13')

# strip the version suffix to get the stable ID
ensg.no_version = sapply(strsplit(as.character(ensg),"\\."),"[[",1)

getBM(attributes = c('ensembl_gene_id', 'entrezgene', 'hgnc_symbol'),
      filters = 'ensembl_gene_id',
      values = ensg.no_version,
      mart = ensembl)

  ensembl_gene_id entrezgene hgnc_symbol
1 ENSG00000158486 55567      DNAH3
2 ENSG00000242268 NA         LINC02082

If you do not want NA values, use na.omit() to remove the genes that could not be converted.
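
A minimal sketch of that clean-up step, assuming the getBM() result shown above has been stored in a data frame. The res object below is a hand-made copy of that output, so this runs without a BioMart connection:

```r
# Hypothetical copy of the getBM() result shown above
res <- data.frame(
  ensembl_gene_id  = c("ENSG00000158486", "ENSG00000242268"),
  entrezgene       = c(55567L, NA),
  hgnc_symbol      = c("DNAH3", "LINC02082"),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that contains at least one NA
res_complete <- na.omit(res)

# Alternatively, drop rows only when one specific column is missing
res_entrez <- res[!is.na(res$entrezgene), ]
```

The second form is handy when only one of several target columns really matters for the downstream analysis.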

OrgDb packages + bitr

Ref: clusterProfiler - bitr

OrgDb packages are organism-level, gene-centric annotation packages, such as org.Hs.eg.db and org.Mmu.eg.db.

library(org.Hs.eg.db)

Printing the org.Hs.eg.db object shows all the resources used to build the package.

> org.Hs.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2018-Apr4
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 2018-Mar28
| GOEGSOURCEDATE: 2018-Apr4
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL:
| GPSOURCEDATE: 2018-Mar26
| ENSOURCEDATE: 2017-Dec04
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Mon Apr 9 20:58:54 2018

Please see: help('select') for usage information

Use keytypes() to list all supported key types.

keytypes(org.Hs.eg.db)

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"
 [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"
[17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"
[25] "UNIGENE"      "UNIPROT"

The key types supported can differ between packages.

keytypes(org.Ss.eg.db)

 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "ENZYME"      "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GO"          "GOALL"
[10] "ONTOLOGY"    "ONTOLOGYALL" "PATH"        "PMID"        "REFSEQ"      "SYMBOL"      "UNIGENE"     "UNIPROT"

Convert Ensembl IDs to Entrez IDs and gene symbols.

library(clusterProfiler)
library(org.Hs.eg.db)

ensg = c('ENSG00000242268.2', 'ENSG00000158486.13')

# remove version number
ensg.no_version = sapply(strsplit(as.character(ensg),"\\."),"[[",1)

bitr(ensg.no_version, fromType="ENSEMBL", toType=c("ENTREZID", "SYMBOL"), OrgDb="org.Hs.eg.db")

'select()' returned 1:1 mapping between keys and columns
  ENSEMBL         ENTREZID  SYMBOL
1 ENSG00000242268 100507661 LINC02082
2 ENSG00000158486 55567     DNAH3
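
The version suffix can also be removed with a single regular expression instead of strsplit(). This is just an alternative sketch in base R; ensg_stable is a name introduced here for illustration:

```r
ensg <- c("ENSG00000242268.2", "ENSG00000158486.13")

# Drop a trailing ".<digits>" version suffix; IDs without a version are left untouched
ensg_stable <- sub("\\.[0-9]+$", "", ensg)
ensg_stable
```

Because the pattern is anchored at the end of the string, it can be applied safely to a mix of versioned and unversioned IDs.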

Other Annotation Packages

Apart from the OrgDb packages, there are many other annotation packages, such as the TxDb and EnsDb packages, which provide various kinds of information. Most of them are built on AnnotationDb objects, so the standard select() function can be used to retrieve the information needed.
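
As a sketch of that shared interface, the same conversion done with bitr() earlier can be written with select() from AnnotationDbi. Here convert_ensembl is a hypothetical helper name, and the query only runs if org.Hs.eg.db is installed:

```r
# Hypothetical helper: map Ensembl gene IDs to Entrez IDs and symbols
# through the generic AnnotationDbi::select() interface
convert_ensembl <- function(ids, orgdb) {
  AnnotationDbi::select(orgdb,
                        keys    = ids,
                        keytype = "ENSEMBL",
                        columns = c("ENTREZID", "SYMBOL"))
}

# Only run the query when the annotation package is available
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(convert_ensembl(c("ENSG00000242268", "ENSG00000158486"),
                        org.Hs.eg.db::org.Hs.eg.db))
}
```

The same call should work on other AnnotationDb objects, as long as the requested keytype and columns appear in their keytypes() and columns() listings.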

NCBI gene DATA

Sometimes we want to keep all the information on local disk and use in-house scripts for the conversion. ftp://ftp.ncbi.nih.gov/gene/DATA provides the most up-to-date and comprehensive collection of gene-centric information.

By incorporating the data from LocusLink in an Entrez database with gene-specific data from other species, you now have a single point of lookup for gene-specific information for the taxa within the scope of the RefSeq project. You also have more immediate access to related data that was cumbersome to maintain independent of Entrez, and can harness the power of Entrez-based tools such as Entrez Programming Utilities (E-Utilities) and MyNCBI. via: https://www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html#files

The README in that directory describes all the files included. Here is a short summary.

Entrez Gene file name | Comments
DATA/ASN_BINARY | Files in this directory contain comprehensive extractions from Entrez Gene in ASN.1 format.
DATA/GENE_INFO | Extractions from Entrez Gene in the same format as the gene_info file. Each file contains a subset of data for the species or taxonomic group indicated by the file name.
DATA/expression | Reports of normalized RNA expression levels computed from RNA-seq data for human, mouse, and rat genes.
gene2accession | A comprehensive report of the accessions that are related to a GeneID. It includes sequences from the international sequence collaboration, Swiss-Prot, and RefSeq. The RefSeq subset of this file is also available as gene2refseq… If you want to convert any accessions into GeneIDs, this one file should suffice.
gene2ensembl | This file reports matches between NCBI and Ensembl annotation based on comparison of rna and protein features.
gene2vega | This file reports matches between NCBI and Vega annotation.
gene2go | GeneID/GO ID/Evidence Code. Consolidated summary based on gene_association files from the GO Consortium and Entrez Gene’s gene_info file.
gene2pubmed | gene2pubmed includes the identifier for the species of the GeneID (i.e. the Taxonomy ID).
gene2refseq | This file is the RefSeq subset of gene2accession. The file in Entrez Gene does not include information about secondary accessions. This function is now provided from the RefSeq ftp site, as documented in the current release notes: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release#.txt, where # is the value of the current release number.
gene2sts | GeneID/UniSTS marker ID relationship
gene2unigene | GeneID/UniGene cluster relationship
gene_group | Report of genes and their relationships to other genes
gene_orthologs | Report of orthologous genes
gene_history | Comprehensive information about GeneIDs that are no longer current
gene_info | GeneID, names, map locations, and database cross-reference.
gene_neighbors | Reports neighboring genes for all genes placed on a given genomic sequence.
gene_refseq_uniprotkb_collab | Report of the relationship between NCBI Reference Sequence protein accessions and UniProtKB protein accessions
mim2gene_medgen | Report of the relationship between MIM numbers (OMIM), GeneIDs, and Records in MedGen
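
As an example of such an in-house script, here is a minimal sketch that builds an Ensembl-to-Entrez lookup from gene2ensembl with base R. The seven-column layout is what I assume from the README, so check the header of your own download. The sample lines are illustrative: the gene pairs come from the bitr() output earlier in this note, and the accession columns are filled with "-" placeholders.

```r
# Illustrative lines in the assumed gene2ensembl layout (tab-separated)
sample_lines <- c(
  paste("#tax_id", "GeneID", "Ensembl_gene_identifier",
        "RNA_nucleotide_accession.version", "Ensembl_rna_identifier",
        "protein_accession.version", "Ensembl_protein_identifier", sep = "\t"),
  "9606\t55567\tENSG00000158486\t-\t-\t-\t-",
  "9606\t100507661\tENSG00000242268\t-\t-\t-\t-"
)

# Skip the "#tax_id ..." header line and supply our own column names
gene2ensembl <- read.delim(
  text       = paste(sample_lines, collapse = "\n"),
  header     = FALSE,
  skip       = 1,
  col.names  = c("tax_id", "gene_id", "ensembl_gene", "rna_acc",
                 "ensembl_rna", "protein_acc", "ensembl_protein"),
  colClasses = "character"
)

# Keep human records; a gene may appear once per transcript, so deduplicate
human      <- gene2ensembl[gene2ensembl$tax_id == "9606", ]
gene_pairs <- unique(human[, c("ensembl_gene", "gene_id")])

# Named vector: Ensembl gene ID -> Entrez GeneID
ensembl_to_entrez <- setNames(gene_pairs$gene_id, gene_pairs$ensembl_gene)
ensembl_to_entrez[c("ENSG00000242268", "ENSG00000158486")]
```

For the real file, replace the text argument with a file path such as a local copy of gene2ensembl.gz (read.delim can read gzip-compressed files directly).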

API

Many databases provide APIs for accessing their data, and some of these can be used for ID conversion. However, I do not recommend using these APIs directly unless you are willing to spend a fair amount of time on it: they can change over time, and users have to be familiar with the data structures they return. Many commonly used APIs already have client software or packages, so it is worth searching for one before calling an API directly.

Ensembl REST API

The Ensembl REST API provides many user-friendly endpoints for retrieving information. Three of them handle cross-references between biological IDs.

  • GET xrefs/symbol/:species/:symbol looks up an external symbol and returns all Ensembl objects linked to it.
  • GET xrefs/id/:id looks up an Ensembl identifier and retrieves its external references in other databases.
  • GET xrefs/name/:species/:name performs a lookup based on the primary accession or display label of an external reference and returns the information Ensembl holds about the entry.
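
A minimal sketch of calling the xrefs/id endpoint from R. ensembl_xref_url is a hypothetical helper, and I assume the public server at https://rest.ensembl.org and the content-type query parameter for requesting JSON. The actual request is only made in an interactive session with jsonlite installed:

```r
# Hypothetical helper: build the request URL for GET xrefs/id/:id
ensembl_xref_url <- function(id, server = "https://rest.ensembl.org") {
  paste0(server, "/xrefs/id/", utils::URLencode(id, reserved = TRUE),
         "?content-type=application/json")
}

url <- ensembl_xref_url("ENSG00000158486")
url

# Fetch and parse the response only when run interactively (needs network access)
if (interactive() && requireNamespace("jsonlite", quietly = TRUE)) {
  xrefs <- jsonlite::fromJSON(url)
  str(xrefs, max.level = 1)
}
```

Separating URL construction from the request makes it easy to swap in the other two endpoints or another HTTP client.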

I guess the aforementioned biomaRt is actually a well-encapsulated client that communicates with the databases through web APIs of this kind.

KEGG API

The KEGG API is a REST-style application programming interface to the KEGG database resource.

We can use this API through bitr_kegg() in the clusterProfiler package or through the KEGGREST package.

bitr_kegg

Ref: clusterProfiler - bitr_kegg

library(clusterProfiler)

hg = c("4597", "7111", "5266", "2175", "755", "23046")

bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')

  kegg  ncbi-proteinid
1 2175  NP_000126
2 23046 NP_001239029
3 4597  NP_002452
4 5266  NP_002629
5 7111  NP_001159588
6 755   NP_004919

The ID type (both fromType and toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. ‘kegg’ is the primary ID used in the KEGG database, whose gene data comes from NCBI. As a rule of thumb, the ‘kegg’ ID is the Entrez Gene ID for eukaryotic species and the locus ID for prokaryotes.

Many prokaryotic species don’t have Entrez Gene IDs available. For example, the gene entry for ece:Z5100 at http://www.genome.jp/dbget-bin/www_bget?ece:Z5100 has NCBI-ProteinID and UniProt links under Other DBs, but no NCBI-GeneID.

The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.

KEGGREST

KEGGREST provides a client interface to the KEGG REST server, and keggConv() can be used for converting identifiers.

library(KEGGREST)

Convert between KEGG identifiers and outside identifiers.

keggConv("ncbi-proteinid", c("hsa:10458", "ece:Z5100"))

                 hsa:10458                  ece:Z5100
"ncbi-proteinid:NP_059345"  "ncbi-proteinid:AAG58814"

…or get the mapping for an entire species:

head(keggConv("eco", "ncbi-geneid"))

ncbi-geneid:944742 ncbi-geneid:945803 ncbi-geneid:947498 ncbi-geneid:945198 ncbi-geneid:944747 ncbi-geneid:944749
       "eco:b0001"        "eco:b0002"        "eco:b0003"        "eco:b0004"        "eco:b0005"        "eco:b0006"

Reversing the arguments does the opposite mapping:

head(keggConv("ncbi-geneid", "eco"))

           eco:b0001            eco:b0002            eco:b0003            eco:b0004            eco:b0005            eco:b0006
"ncbi-geneid:944742" "ncbi-geneid:945803" "ncbi-geneid:947498" "ncbi-geneid:945198" "ncbi-geneid:944747" "ncbi-geneid:944749"

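Both the names and the values returned by keggConv() carry a database prefix such as eco: or ncbi-geneid:. A minimal sketch of turning such a result into a plain two-column table with base R. The conv vector is a hand-made copy of the first three entries shown above, and strip_prefix is a hypothetical helper:

```r
# Hand-made copy of part of the keggConv("ncbi-geneid", "eco") output above
conv <- c("eco:b0001" = "ncbi-geneid:944742",
          "eco:b0002" = "ncbi-geneid:945803",
          "eco:b0003" = "ncbi-geneid:947498")

# Hypothetical helper: remove everything up to and including the first colon
strip_prefix <- function(x) sub("^[^:]+:", "", x)

id_table <- data.frame(
  kegg             = strip_prefix(names(conv)),
  ncbi_geneid      = strip_prefix(unname(conv)),
  stringsAsFactors = FALSE
)
id_table
```

The resulting table can then be joined with other annotation tables by the bare IDs.
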
Web Server

Change log

  • 20180918: create the note.