ID mapping is annoying, but we have to deal with it very often. This note collects methods for handling it.
R (Bioconductor)
Bioconductor has many annotation packages, and they contain many kinds of annotation, some of which we need and some of which we don't. Different series of annotation packages are designed for different purposes, and these differences should be taken into account in practice.
For ID conversion, two main resources can be used: `biomaRt`, the R interface to BioMart, and various specialized annotation packages.
biomaRt
`biomaRt` is an R interface to BioMart databases. It is very powerful, and ID conversion is only one of its many applications.
The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, COSMIC, Uniprot, HGNC, Gramene, Wormbase and dbSNP mapped to Ensembl. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from gene annotation to database mining. via: http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html
```r
library("biomaRt")
```
Display all available BioMart web services
```r
listMarts()
## biomart version
```
Choose to query the Ensembl BioMart database.
```r
ensembl = useMart("ensembl")
```
Look at which datasets are available in the selected BioMart by using the function `listDatasets()`.
```r
listDatasets(ensembl)[1:5, ]
## dataset description version
```
Update the Mart object using the function useDataset()
```r
ensembl = useDataset("hsapiens_gene_ensembl", mart = ensembl)
```
Alternatively, if the dataset is known in advance, we can select a BioMart database and dataset in one step:
```r
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
```
Shows all available filters in the selected dataset
```r
filters = listFilters(ensembl)
## name description
```
Displays all available attributes in the selected dataset
```r
attributes = listAttributes(ensembl)
## name description page
```
Annotate a set of Affymetrix identifiers with HUGO symbol and chromosomal locations of corresponding genes
```r
affyids = c("202763_at", "209310_s_at", "207500_at")
## affy_hg_u133_plus_2 entrezgene
```
Retrieve all HUGO gene symbols of genes that are located on chromosomes 17, 20 or Y and are associated with specific GO terms.
```r
go = c("GO:0051330", "GO:0000080", "GO:0000114", "GO:0000082")
## hgnc_symbol
```
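The query itself is cut off in the listing above. A minimal sketch of how it might look with `getBM()`, assuming the filter names `go` and `chromosome_name` (filter names have changed between Ensembl releases, so check `listFilters(ensembl)` first). The wrapper name `symbols_by_go_and_chrom` is hypothetical.

```r
# Hypothetical helper: HUGO symbols of genes on the given chromosomes that
# are annotated with any of the given GO terms.
# ASSUMPTION: the filters "go" and "chromosome_name" exist in the selected
# dataset; verify with listFilters(mart).
symbols_by_go_and_chrom <- function(go, chrom, mart,
                                    filters = c("go", "chromosome_name")) {
  biomaRt::getBM(
    attributes = "hgnc_symbol",
    filters    = filters,
    # With several filters, `values` is a list with one element per filter,
    # in the same order as `filters`.
    values     = list(go, chrom),
    mart       = mart
  )
}

# Usage (needs the biomaRt package and network access):
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# go <- c("GO:0051330", "GO:0000080", "GO:0000114", "GO:0000082")
# symbols_by_go_and_chrom(go, c(17, 20, "Y"), ensembl)
```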
Annotate a set of EntrezGene identifiers with GO annotation.
```r
entrez = c("673", "837")
## entrezgene go_id
```
Ensembl id to gene symbol and entrez id.
```r
library(biomaRt)
## ensembl_gene_id entrezgene hgnc_symbol
```
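Only the `library()` call survives in the listing above. A minimal sketch of the conversion, assuming the attribute names shown in the output header. The helper name `ensembl_to_entrez_symbol` and the two example IDs are illustrative. Newer Ensembl releases appear to have renamed `entrezgene` to `entrezgene_id`, so the attribute name is a parameter here; check `listAttributes(ensembl)` for the name your release uses.

```r
# Hypothetical helper: map Ensembl gene IDs to Entrez IDs and HGNC symbols.
# ASSUMPTION: `entrez_attr` matches an attribute of the selected dataset
# ("entrezgene" in the output above, possibly "entrezgene_id" in newer releases).
ensembl_to_entrez_symbol <- function(ids, mart, entrez_attr = "entrezgene") {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", entrez_attr, "hgnc_symbol"),
    filters    = "ensembl_gene_id",  # the ID type we are converting from
    values     = ids,
    mart       = mart
  )
}

# Usage (needs the biomaRt package and network access):
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# ensembl_to_entrez_symbol(c("ENSG00000139618", "ENSG00000141510"), ensembl)
```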
If you do not want `NA` values, use `na.omit` to remove the genes that cannot be converted.
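A small base-R illustration of that clean-up step. The data frame is a toy stand-in for a conversion result (the middle ID is a placeholder, not a real gene). As far as I know, `biomaRt` can also return empty strings rather than `NA` for missing symbols, so the sketch converts those first.

```r
# Toy conversion result: the second gene has no Entrez ID and no symbol.
converted <- data.frame(
  ensembl_gene_id  = c("ENSG00000139618", "ENSG_PLACEHOLDER", "ENSG00000141510"),
  entrezgene       = c(675L, NA, 7157L),
  hgnc_symbol      = c("BRCA2", "", "TP53"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing so that na.omit() catches them too.
converted$hgnc_symbol[converted$hgnc_symbol %in% ""] <- NA

# Drop every row that contains at least one NA.
clean <- na.omit(converted)
print(clean)
```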
OrgDb packages + bitr
`OrgDb` packages are gene-centric annotation packages at the organism level, such as `org.Hs.eg.db` and `org.Mmu.eg.db`.
```r
library(org.Hs.eg.db)
```
Here is the `org.Hs.eg.db` package. Printing the object shows all the resources used to build it.
```r
org.Hs.eg.db
```
Use `keytypes()` to list all supported key types.
```r
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
```
The key types supported by different packages can differ.
```r
library(org.Ss.eg.db)  # pig (Sus scrofa) annotation package
keytypes(org.Ss.eg.db)
## [1] "ACCNUM" "ALIAS" "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL"
```
Convert Ensembl IDs to Entrez IDs and gene symbols.
```r
library(clusterProfiler)
## 'select()' returned 1:1 mapping between keys and columns
```
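The `bitr()` call itself is missing from the listing above. A minimal sketch, assuming the key types `ENSEMBL`, `ENTREZID` and `SYMBOL` are available in the chosen `OrgDb` (check with `keytypes()`). The wrapper name `ensembl_to_entrez_symbol_orgdb` and the example IDs are illustrative.

```r
# Hypothetical helper around clusterProfiler::bitr().
# ASSUMPTION: `orgdb` supports the key types "ENSEMBL", "ENTREZID" and "SYMBOL".
ensembl_to_entrez_symbol_orgdb <- function(ids, orgdb) {
  clusterProfiler::bitr(
    ids,
    fromType = "ENSEMBL",                # ID type of the input
    toType   = c("ENTREZID", "SYMBOL"),  # ID types to convert to
    OrgDb    = orgdb
  )
}

# Usage (needs clusterProfiler and org.Hs.eg.db):
# library(org.Hs.eg.db)
# ensembl_to_entrez_symbol_orgdb(c("ENSG00000139618", "ENSG00000141510"), org.Hs.eg.db)
```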
Other Annotation Packages
Apart from the `OrgDb` packages, there are many other annotation packages, such as the `TxDb` and `EnsDb` packages, which provide various kinds of information. Most of them are based on the `AnnotationDb` object, so the standard `select` function can be used to retrieve the information needed.
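A minimal sketch of that `select` interface. The wrapper name `lookup_ids` is hypothetical, and the default key type and columns are assumptions that fit `org.Hs.eg.db`; other packages use different ones, so check `keytypes(db)` and `columns(db)` first.

```r
# Hypothetical helper around AnnotationDbi::select().
# ASSUMPTION: `keytype` and `columns` are valid for `db`; the defaults here
# match org.Hs.eg.db and will not fit every AnnotationDb-based package.
lookup_ids <- function(db, keys,
                       keytype = "ENSEMBL",
                       columns = c("ENTREZID", "SYMBOL")) {
  AnnotationDbi::select(db, keys = keys, columns = columns, keytype = keytype)
}

# Usage (needs AnnotationDbi and an annotation package):
# library(org.Hs.eg.db)
# lookup_ids(org.Hs.eg.db, c("ENSG00000139618", "ENSG00000141510"))
```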
NCBI gene DATA
Sometimes we want to keep all the information on local disk and use in-house scripts to do the conversion. ftp://ftp.ncbi.nih.gov/gene/DATA provides the most up-to-date and comprehensive collection of gene-centric information.
By incorporating the data from LocusLink in an Entrez database with gene-specific data from other species, you now have a single point of lookup for gene-specific information for the taxa within the scope of the RefSeq project. You also have more immediate access to related data that was cumbersome to maintain independent of Entrez, and can harness the power of Entrez-based tools such as Entrez Programming Utilities (E-Utilities) and MyNCBI. via: https://www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html#files
This README describes all the files included. Here is a short summary.
Entrez Gene file name | Comments |
---|---|
DATA/ASN_BINARY | Files in this directory contain comprehensive extractions from Entrez Gene in ASN.1 format. |
DATA/GENE_INFO | extractions from Entrez Gene in the same format as the gene_info file. Each file contains a subset of data for the species or taxonomic group indicated by the file name. |
DATA/expression | reports of normalized RNA expression levels computed from RNA-seq data for human, mouse, and rat genes. |
gene2accession | a comprehensive report of the accessions that are related to a GeneID. It includes sequences from the international sequence collaboration, Swiss-Prot, and RefSeq. The RefSeq subset of this file is also available as gene2refseq… If you want to convert any accessions into GeneIDs, this one file should suffice. |
gene2ensembl | This file reports matches between NCBI and Ensembl annotation based on comparison of rna and protein features. |
gene2vega | This file reports matches between NCBI and Vega annotation. |
gene2go | GeneID/GO ID/Evidence Code. Consolidated summary based on gene_association files from the GO Consortium and Entrez Gene’s gene_info file. |
gene2pubmed | gene2pubmed includes the identifier for the species of the GeneID (i.e. the Taxonomy ID). |
gene2refseq | This file is the RefSeq subset of gene2accession. The file in Entrez Gene does not include information about secondary accessions. This function is now provided from the RefSeq ftp site, as documented in the current release notes: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release#.txt, where # is the value of the current release number. |
gene2sts | GeneID/UniSTS marker ID relationship |
gene2unigene | GeneID/UniGene cluster relationship |
gene_group | report of genes and their relationships to other genes |
gene_orthologs | report of orthologous genes |
gene_history | comprehensive information about GeneIDs that are no longer current |
gene_info | GeneID, names, map locations, and database cross-reference. |
gene_neighbors | reports neighboring genes for all genes placed on a given genomic sequence. |
gene_refseq_uniprotkb_collab | report of the relationship between NCBI Reference Sequence protein accessions and UniProtKB protein accessions |
mim2gene_medgen | report of the relationship between MIM numbers (OMIM), GeneIDs, and Records in MedGen |
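As an example of an in-house script, here is a minimal base-R sketch that builds an Ensembl-to-GeneID lookup from `gene2ensembl`. The inline text is a toy stand-in for the real file: the column layout is my assumption of the file's header, and the accession columns are filled with `-` placeholders. For the downloaded file, something like `read.delim(gzfile("gene2ensembl.gz"), ...)` should work in the same way.

```r
# Toy stand-in for gene2ensembl (tab-separated, header starts with "#").
# ASSUMPTION: the column names mirror the real file; the rows are illustrative.
sample_lines <- c(
  paste("#tax_id", "GeneID", "Ensembl_gene_identifier",
        "RNA_nucleotide_accession.version", "Ensembl_rna_identifier",
        "protein_accession.version", "Ensembl_protein_identifier", sep = "\t"),
  "9606\t675\tENSG00000139618\t-\t-\t-\t-",
  "9606\t7157\tENSG00000141510\t-\t-\t-\t-",
  "9606\t7157\tENSG00000141510\t-\t-\t-\t-",   # one row per transcript, so genes repeat
  "10090\t22059\tENSMUSG00000059552\t-\t-\t-\t-"
)

# Read everything as character so that IDs are never turned into numbers.
gene2ensembl <- read.delim(text = sample_lines, check.names = FALSE,
                           colClasses = "character")
names(gene2ensembl)[1] <- "tax_id"  # drop the leading "#"

# Keep human rows only and collapse the per-transcript duplicates.
human <- gene2ensembl[gene2ensembl$tax_id == "9606", ]
human <- unique(human[, c("GeneID", "Ensembl_gene_identifier")])

# Named vector: Ensembl gene ID -> NCBI GeneID.
ens2entrez <- setNames(human$GeneID, human$Ensembl_gene_identifier)
print(ens2entrez[c("ENSG00000141510", "ENSG00000139618")])
```

A named vector is enough for one-to-one lookups; if one Ensembl ID maps to several GeneIDs, keep the two-column data frame and `merge()` against it instead.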
API
Many databases provide APIs for accessing their data, and some of them can be used for ID conversion. However, I do not recommend using these APIs directly unless you are willing to spend time on it: they can change over time, and users have to be familiar with the data structures they return. Many commonly used APIs already have client software or packages, so it is worth searching for one before using an API directly.
Ensembl REST API
The Ensembl REST API provides many user-friendly endpoints for retrieving information. Three of them map between biological IDs:
- GET xrefs/symbol/:species/:symbol looks up an external symbol and returns all Ensembl objects linked to it.
- GET xrefs/id/:id looks up an Ensembl identifier and retrieves its external references in other databases.
- GET xrefs/name/:species/:name looks up the primary accession or display label of an external reference and returns the information Ensembl holds about the entry.
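To make the three endpoints concrete, here is a minimal base-R sketch that only builds the request URLs. The server address `https://rest.ensembl.org` and the `content-type` query parameter are taken from my reading of the Ensembl REST documentation, and the helper name `ensembl_xref_url` is hypothetical.

```r
# Hypothetical helper: build the URL for one of the xrefs endpoints.
# ASSUMPTION: the public server is https://rest.ensembl.org and JSON output
# is requested with ?content-type=application/json.
ensembl_xref_url <- function(endpoint, ..., server = "https://rest.ensembl.org") {
  # Percent-encode each path component (species, symbol, ID, ...).
  parts <- vapply(c(...), utils::URLencode, character(1), reserved = TRUE)
  paste0(server, "/xrefs/", endpoint, "/",
         paste(parts, collapse = "/"),
         "?content-type=application/json")
}

symbol_url <- ensembl_xref_url("symbol", "homo_sapiens", "BRCA2")
id_url     <- ensembl_xref_url("id", "ENSG00000139618")
name_url   <- ensembl_xref_url("name", "homo_sapiens", "BRCA2")
print(c(symbol_url, id_url, name_url))

# Fetching needs network access and a JSON parser, for example:
# jsonlite::fromJSON(symbol_url)
```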
I guess the aforementioned `biomaRt` is essentially a well-encapsulated client that communicates with the databases through APIs.
KEGG API
The KEGG API is a REST-style Application Programming Interface to the KEGG database resource.
We can use this API through `bitr_kegg` in the `clusterProfiler` package or through the `KEGGREST` package.
bitr_kegg
Ref: clusterProfiler - bitr_kegg
```r
library(clusterProfiler)
## kegg ncbi-proteinid
```
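The `bitr_kegg()` call is cut off in the listing above. A minimal sketch that would produce a table with those two columns, assuming human (`hsa`) as the organism and bare KEGG gene IDs without the `hsa:` prefix as input. The wrapper name `kegg_to_proteinid` is hypothetical.

```r
# Hypothetical helper around clusterProfiler::bitr_kegg().
# ASSUMPTION: `ids` are KEGG gene IDs without the organism prefix, and
# `organism` is a KEGG organism code such as "hsa".
kegg_to_proteinid <- function(ids, organism = "hsa") {
  clusterProfiler::bitr_kegg(
    ids,
    fromType = "kegg",
    toType   = "ncbi-proteinid",
    organism = organism
  )
}

# Usage (needs clusterProfiler and network access to the KEGG API):
# kegg_to_proteinid(c("10458", "673"))
```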
The ID type (both `fromType` & `toType`) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is `entrezgene` ID for eukaryote species and `Locus` ID for prokaryotes.
Many prokaryote species don’t have entrezgene ID available. For example we can check the gene information of `ece:Z5100` in http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which has `NCBI-ProteinID` and `UniProt` links in the `Other DBs` entry, but not `NCBI-GeneID`.
The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.
KEGGREST
`KEGGREST` provides a client interface to the KEGG REST server, and `keggConv()` can be used for converting identifiers.
```r
library(KEGGREST)
```
Convert between KEGG identifiers and outside identifiers.
```r
keggConv("ncbi-proteinid", c("hsa:10458", "ece:Z5100"))
## hsa:10458 ece:Z5100
```
…or get the mapping for an entire species:
```r
head(keggConv("eco", "ncbi-geneid"))
## ncbi-geneid:944742 ncbi-geneid:945803 ncbi-geneid:947498 ncbi-geneid:945198 ncbi-geneid:944747 ncbi-geneid:944749
```
Reversing the arguments does the opposite mapping:
```r
head(keggConv("ncbi-geneid", "eco"))
## eco:b0001 eco:b0002 eco:b0003 eco:b0004 eco:b0005 eco:b0006
```
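`keggConv()` returns a named character vector in which both the names and the values carry a database prefix. For downstream joins it is often handier to have a plain two-column table. A minimal base-R sketch, using a toy vector shaped like the `keggConv("eco", "ncbi-geneid")` result instead of a live query; the helper name `strip_prefix` is hypothetical.

```r
# Toy vector shaped like the result of keggConv("eco", "ncbi-geneid"):
# names are the source IDs, values are the KEGG IDs.
mapping <- c("ncbi-geneid:944742" = "eco:b0001",
             "ncbi-geneid:945803" = "eco:b0002")

# Hypothetical helper: drop everything up to and including the first colon.
strip_prefix <- function(x) sub("^[^:]+:", "", x)

id_table <- data.frame(
  ncbi_geneid      = strip_prefix(names(mapping)),
  kegg             = strip_prefix(unname(mapping)),
  stringsAsFactors = FALSE
)
print(id_table)
```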
Web Server
- DAVID - Gene ID Conversion Tool. Easy to use, but a bit old.
- BioMart - Ensembl. Up-to-date and powerful.
Change log
- 20180918: create the note.