Functional annotation and more
View the Project on GitHub kn3in/annotation
We consider various mappings and relationships between genes, SNPs, pathways and other biological annotations in order to gain biological insight, whatever that may mean in your particular experiment.
The easiest way to annotate a small list of genes, SNPs, microarray probes, etc. is to query the web-based interfaces often provided within genome browsers. They are also useful when you need a quick sanity check of your annotation scripts.
General interfaces
human transcriptome annotation; here for the sake of completeness, part of the ENCODE project. Available as a track in the UCSC browser
SNP
Gene Expression
Pathway Analysis
Biomart, UCSC, Ensembl and NCBI integrate a broad spectrum of data within a unified interface. On the other hand, there are plenty of annotation tools addressing particular needs, e.g. eQTL browsers and miRNA and lincRNA databases, that are not (yet) mentioned here.
from www.biomart.org
BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework.
Ensembl is one of the databases available through Biomart. There are numerous APIs for accessing Ensembl programmatically. We are going to use the R/Bioconductor interface, which, unlike the other APIs, allows access to the entire Biomart.
Ensembl/Biomart APIs:
does much more than just querying Ensembl; please look through the documentation
It is instructive to retrieve some data using both the highest-level web interface and the lowest, MySQL level to see that, after all, there is no magic. The rest of the APIs provide a nice in-between level in your preferred language, which hides the particularities of the db schema but lets you deal with the annotation programmatically.
Hopefully this is self-explanatory:
By the same token one can build a query via Biomart: result
Query MySQL backend (not recommended, plus no sequence retrieval; schema available here)
mysql --host=ensembldb.ensembl.org -P 3306 --user=anonymous \
-A -e "SELECT v.name, v.minor_allele, v.minor_allele_freq \
FROM variation v WHERE v.name IN ('rs7775397', 'rs4783244', 'rs6450176');" \
homo_sapiens_variation_73_37
name minor_allele minor_allele_freq
rs4783244 T 0.3393
rs6450176 A 0.3356
rs7775397 G 0.033
Before we go on to discover the many goodies in the Bioconductor project, note that Bioconductor conference and tutorial materials are available here. Each package in the project has a vignette, which is worth checking. Bioconductor also has a workflows section describing particular uses of the project, e.g. Using Bioconductor for annotation.
library(biomaRt)
library(knitr)  # provides kable(), used below to print tables
Contrast the Ensembl versions available through the web/MySQL interfaces with those available through biomaRt.
kable(head(listMarts()))
| biomart | version |
|---|---|
| ensembl | ENSEMBL GENES 75 (SANGER UK) |
| snp | ENSEMBL VARIATION 75 (SANGER UK) |
| functional_genomics | ENSEMBL REGULATION 75 (SANGER UK) |
| vega | VEGA 53 (SANGER UK) |
| fungi_mart_21 | ENSEMBL FUNGI 21 (EBI UK) |
| fungi_variations_21 | ENSEMBL FUNGI VARIATION 21 (EBI UK) |
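The release numbers can also be pulled out of these strings programmatically. The sketch below assumes the usual Ensembl database naming scheme (species, type, release, assembly) for the MySQL database queried earlier, and copies the version string of the snp mart from the table above. The names `db_release` and `mart_release` are illustrative only and not part of biomaRt.

```r
# MySQL database name used in the query above and the version string
# reported by listMarts() for the snp mart (both copied from this page).
db_name      <- "homo_sapiens_variation_73_37"
mart_version <- "ENSEMBL VARIATION 75 (SANGER UK)"

# Assumption: the database name ends in _<release>_<assembly>.
db_release <- as.integer(sub("^.*_([0-9]+)_[0-9]+[a-z]?$", "\\1", db_name))

# Assumption: the version string looks like "ENSEMBL <NAME> <release> (...)".
mart_release <- as.integer(sub("^ENSEMBL [A-Z]+ ([0-9]+) .*$", "\\1", mart_version))

# TRUE here: the two interfaces were pointing at different releases.
db_release != mart_release
```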
Select the snp mart and see which datasets are available for it.
mart <- useMart("snp")
kable(head(listDatasets(mart)))
| dataset | description | version |
|---|---|---|
| pabelii_snp | Pongo abelii Short Variation (SNPs and indels) (PPYG2) | PPYG2 |
| ecaballus_snp | Equus caballus Short Variation (SNPs and indels) (EquCab2) | EquCab2 |
| hsapiens_snp | Homo sapiens Short Variation (SNPs and indels) (GRCh37.p13) | GRCh37.p13 |
| hsapiens_structvar | Homo sapiens Structural Variation (GRCh37.p13) | GRCh37.p13 |
| oanatinus_snp | Ornithorhynchus anatinus Short Variation (SNPs and indels) (OANA5) | OANA5 |
| tnigroviridis_snp | Tetraodon nigroviridis Short Variation (SNPs and indels) (TETRAODON8.0) | TETRAODON8.0 |
Select the dataset with human SNPs, hsapiens_snp.
snpmart <- useDataset("hsapiens_snp", mart = mart)
Shortcut in case you know which mart and dataset you are after:
snpmart <- useMart("snp", dataset = "hsapiens_snp")
A dataset is queried on a set of fields, the Attributes, i.e. the desired output.
A query is narrowed down based on another set of fields, the Filters.
Here are the Attributes for the hsapiens_snp dataset of the snp mart.
kable(head(listAttributes(snpmart)))
| name | description |
|---|---|
| refsnp_id | Variation Name |
| refsnp_source | Variation source |
| refsnp_source_description | Variation source description |
| chr_name | Chromosome name |
| chrom_start | Position on Chromosome (bp) |
| chrom_strand | Strand |
Here are the Filters for the same dataset.
kable(head(listFilters(snpmart)))
| name | description |
|---|---|
| chr_name | Chromosome name |
| chrom_start | Start |
| chrom_end | End |
| band_start | Band Start |
| band_end | Band End |
| marker_end | Marker End |
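The full attribute and filter lists run to dozens of rows, so it is often easier to search them than to read them. Below is a minimal sketch, assuming a helper `find_fields()` (a hypothetical name, not part of biomaRt) that works on any data frame with `name` and `description` columns, such as those returned by `listAttributes()` and `listFilters()`. The demo rows are copied from the full `listAttributes(snpmart)` output on this page so that the example runs without a network connection.

```r
# Hypothetical helper: case-insensitive search over name and description.
find_fields <- function(fields, pattern) {
  hit <- grepl(pattern, fields$name, ignore.case = TRUE) |
         grepl(pattern, fields$description, ignore.case = TRUE)
  fields[hit, , drop = FALSE]
}

# A few rows copied from the listAttributes(snpmart) output on this page.
attrs <- data.frame(
  name = c("refsnp_id", "chr_name", "minor_allele",
           "minor_allele_freq", "minor_allele_count"),
  description = c("Variation Name", "Chromosome name", "Minor allele (ALL)",
                  "1000 genomes global MAF (ALL)",
                  "1000 genomes global MAC (ALL)"),
  stringsAsFactors = FALSE
)

# Which attributes mention a minor allele?
find_fields(attrs, "minor")

# With a live connection the same call would be:
# find_fields(listAttributes(snpmart), "minor")
```

Recent biomaRt releases also appear to ship their own search helpers (`searchAttributes()` and `searchFilters()`); check the documentation of your installed version before relying on them.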
Example query:
Suppose we have a list of rs SNP ids (the Filter: we want data only for those rs ids) and would like to find out where those SNPs are located, together with their minor alleles and MAFs (the Attributes: we need only those fields returned from Biomart). getBM is the main function for querying Biomart. Its first four arguments are the concepts we have already met: attributes, filters, values (of the filters) and mart.
?getBM
top_ids <- c("rs7775397", "rs4783244", "rs6450176")
# 'values' is the full argument name; 'value' only works via partial matching
snp_pos <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start",
                                "minor_allele", "minor_allele_freq"),
                 filters = c("snp_filter"),
                 values = top_ids,
                 mart = snpmart)
kable(snp_pos)
| refsnp_id | chr_name | chrom_start | minor_allele | minor_allele_freq |
|---|---|---|---|---|
| rs4783244 | 16 | 82662268 | T | 0.3393 |
| rs6450176 | 5 | 53298025 | A | 0.3361 |
| rs7775397 | 6 | 32261252 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_MANN | 32300691 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_COX | 32209826 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_QBL | 32219074 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_DBB | 32237102 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_SSTO | 32268189 | G | 0.0326 |
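Note that rs7775397 is returned six times: once on chromosome 6 and once for each alternative MHC haplotype (the HSCHR6_MHC_* rows). If only the primary assembly is of interest, the extra rows can be dropped after the query. The sketch below rebuilds `snp_pos` from the table above so that it runs offline. The set of primary chromosome names is an assumption that suits human data, and `primary_chr` and `snp_primary` are illustrative names.

```r
# snp_pos as returned above (values copied from the table on this page).
snp_pos <- data.frame(
  refsnp_id = c("rs4783244", "rs6450176", rep("rs7775397", 6)),
  chr_name = c("16", "5", "6", "HSCHR6_MHC_MANN", "HSCHR6_MHC_COX",
               "HSCHR6_MHC_QBL", "HSCHR6_MHC_DBB", "HSCHR6_MHC_SSTO"),
  chrom_start = c(82662268, 53298025, 32261252, 32300691, 32209826,
                  32219074, 32237102, 32268189),
  minor_allele = c("T", "A", rep("G", 6)),
  minor_allele_freq = c(0.3393, 0.3361, rep(0.0326, 6)),
  stringsAsFactors = FALSE
)

# Assumption: primary human chromosomes are named 1-22, X, Y and MT.
primary_chr <- c(as.character(1:22), "X", "Y", "MT")

# Keep only rows mapped to the primary assembly.
snp_primary <- snp_pos[snp_pos$chr_name %in% primary_chr, ]
snp_primary
```

The same restriction could presumably be applied on the server side by adding chr_name to the filters, in which case the values have to be passed as a list with one element per filter.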
library(biomaRt)
snpmart <- useMart("snp", dataset = "hsapiens_snp")
top_ids <- c("rs7775397", "rs4783244", "rs6450176")
snp_pos <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start",
                                "minor_allele", "minor_allele_freq"),
                 filters = c("snp_filter"),
                 values = top_ids,
                 mart = snpmart)
kable(listMarts())
| biomart | version |
|---|---|
| ensembl | ENSEMBL GENES 75 (SANGER UK) |
| snp | ENSEMBL VARIATION 75 (SANGER UK) |
| functional_genomics | ENSEMBL REGULATION 75 (SANGER UK) |
| vega | VEGA 53 (SANGER UK) |
| fungi_mart_21 | ENSEMBL FUNGI 21 (EBI UK) |
| fungi_variations_21 | ENSEMBL FUNGI VARIATION 21 (EBI UK) |
| metazoa_mart_21 | ENSEMBL METAZOA 21 (EBI UK) |
| metazoa_variations_21 | ENSEMBL METAZOA VARIATION 21 (EBI UK) |
| plants_mart_21 | ENSEMBL PLANTS 21 (EBI UK) |
| plants_variations_21 | ENSEMBL PLANTS VARIATION 21 (EBI UK) |
| protists_mart_21 | ENSEMBL PROTISTS 21 (EBI UK) |
| protists_variations_21 | ENSEMBL PROTISTS VARIATION 21 (EBI UK) |
| msd | MSD (EBI UK) |
| htgt | WTSI MOUSE GENETICS PROJECT (SANGER UK) |
| REACTOME | REACTOME (CSHL US) |
| WS220 | WORMBASE 220 (CSHL US) |
| biomart | MGI (JACKSON LABORATORY US) |
| pride | PRIDE (EBI UK) |
| prod-intermart_1 | INTERPRO (EBI UK) |
| unimart | UNIPROT (EBI UK) |
| biomartDB | PARAMECIUM GENOME (CNRS FRANCE) |
| biblioDB | PARAMECIUM BIBLIOGRAPHY (CNRS FRANCE) |
| Eurexpress Biomart | EUREXPRESS (MRC EDINBURGH UK) |
| phytozome_mart | PHYTOZOME (JGI/CIG US) |
| HapMap_rel27 | HAPMAP 27 (NCBI US) |
| CosmicMart | COSMIC (SANGER UK) |
| cildb_all_v2 | CILDB INPARANOID AND FILTERED BEST HIT (CNRS FRANCE) |
| cildb_inp_v2 | CILDB INPARANOID (CNRS FRANCE) |
| experiments | INTOGEN EXPERIMENTS |
| oncomodules | INTOGEN ONCOMODULES |
| gmap_japonica | RICE-MAP JAPONICA (PEKING UNIVESITY CHINA) |
| europhenomeannotations | EUROPHENOME |
| ikmc | IKMC GENES AND PRODUCTS (IKMC) |
| EMAGE gene expression | EMAGE GENE EXPRESSION |
| EMAP anatomy ontology | EMAP ANATOMY ONTOLOGY |
| EMAGE browse repository | EMAGE BROWSE REPOSITORY |
| GermOnline | GERMONLINE |
| Sigenae_Oligo_Annotation_Ensembl_61 | SIGENAE OLIGO ANNOTATION (ENSEMBL 61) |
| Sigenae Oligo Annotation (Ensembl 59) | SIGENAE OLIGO ANNOTATION (ENSEMBL 59) |
| Sigenae Oligo Annotation (Ensembl 56) | SIGENAE OLIGO ANNOTATION (ENSEMBL 56) |
| Breast_mart_69 | BCCTB Bioinformatics Portal (UK and Ireland) |
| K562_Gm12878 | Predictive models of gene regulation from processed high-throughput epigenomics data: K562 vs. Gm12878 |
| Hsmm_Hmec | Predictive models of gene regulation from processed high-throughput epigenomics data: Hsmm vs. Hmec |
| Pancreas63 | PANCREATIC EXPRESSION DATABASE (BARTS CANCER INSTITUTE UK) |
| Public_OBIOMART | Genetic maps (markers, Qtls), Polymorphisms (snps, genes), Genetic and Phenotype resources with Genes annotations |
| Public_VITIS | Grapevine 8x, stuctural annotation with Genetic maps (genetic markers..) |
| Public_VITIS_12x | Grapevine 12x, stuctural and functional annotation with Genetic maps (genetic markers..) |
| Prod_WHEAT | Wheat, stuctural annotation with Genetic maps (genetic markers..) and Polymorphisms (snps) |
| Public_TAIRV10 | Arabidopsis Thaliana TAIRV10, genes functional annotation |
| Public_MAIZE | Zea mays ZmB73, genes functional annotation |
| Prod_POPLAR | Populus trichocarpa, genes functional annotation |
| Prod_POPLAR_V2 | Populus trichocarpa, genes functional annotation V2.0 |
| Prod_BOTRYTISEDIT | Botrytis cinerea T4, genes functional annotation |
| Prod_ | Botrytis cinerea B0510, genes functional annotation |
| Prod_SCLEROEDIT | Sclerotinia sclerotiorum, genes functional annotation |
| Prod_LMACULANSEDIT | Leptosphaeria maculans, genes functional annotation |
| vb_mart_22 | VectorBase Genes |
| vb_snp_mart_22 | VectorBase Variation |
| expression | VectorBase Expression |
| ENSEMBL_MART_PLANT | GRAMENE 40 ENSEMBL GENES (CSHL/CORNELL US) |
| ENSEMBL_MART_PLANT_SNP | GRAMENE 40 VARIATION (CSHL/CORNELL US) |
kable(listDatasets(mart))
| dataset | description | version |
|---|---|---|
| pabelii_snp | Pongo abelii Short Variation (SNPs and indels) (PPYG2) | PPYG2 |
| ecaballus_snp | Equus caballus Short Variation (SNPs and indels) (EquCab2) | EquCab2 |
| hsapiens_snp | Homo sapiens Short Variation (SNPs and indels) (GRCh37.p13) | GRCh37.p13 |
| hsapiens_structvar | Homo sapiens Structural Variation (GRCh37.p13) | GRCh37.p13 |
| oanatinus_snp | Ornithorhynchus anatinus Short Variation (SNPs and indels) (OANA5) | OANA5 |
| tnigroviridis_snp | Tetraodon nigroviridis Short Variation (SNPs and indels) (TETRAODON8.0) | TETRAODON8.0 |
| ggallus_snp | Gallus gallus Short Variation (SNPs and indels) (Galgal4) | Galgal4 |
| oaries_snp | Ovis Aries Short Variation (SNPs and indels) (Oar_v3.1) | Oar_v3.1 |
| scerevisiae_snp | Saccharomyces cerevisiae Short Variation (SNPs and indels) (R64-1-1) | R64-1-1 |
| drerio_structvar | Danio rerio Structural Variation (Zv9) | Zv9 |
| mmulatta_structvar | Macaca mulatta Structural Variation (MMUL_1) | MMUL_1 |
| mmusculus_snp | Mus musculus Short Variation (SNPs and indels) (GRCm38.p2) | GRCm38.p2 |
| mmusculus_structvar | Mus musculus Structural Variation (GRCm38.p2) | GRCm38.p2 |
| drerio_snp | Danio rerio Short Variation (SNPs and indels) (Zv9) | Zv9 |
| mdomestica_snp | Monodelphis domestica Short Variation (SNPs and indels) (monDom5) | monDom5 |
| cfamiliaris_structvar | Canis familiaris Structural Variation (CanFam3.1) | CanFam3.1 |
| btaurus_structvar | Bos taurus Structural Variation (UMD3.1) | UMD3.1 |
| ptroglodytes_snp | Pan troglodytes Short Variation (SNPs and indels) (CHIMP2.1.4) | CHIMP2.1.4 |
| btaurus_snp | Bos taurus Short Variation (SNPs and indels) (UMD3.1) | UMD3.1 |
| mmulatta_snp | Macaca mulatta Short Variation (SNPs and indels) (MMUL_1) | MMUL_1 |
| nleucogenys_snp | Nomascus leucogenys Short Variation (SNPs and indels) (Nleu1.0) | Nleu1.0 |
| mgallopavo_snp | Meleagris gallopavo Short Variation (SNPs and indels) (UMD2) | UMD2 |
| sscrofa_structvar | Sus scrofa Structural Variation (Sscrofa10.2) | Sscrofa10.2 |
| ecaballus_structvar | Equus caballus Structural Variation (EquCab2) | EquCab2 |
| hsapiens_snp_som | Homo sapiens Somatic Short Variation (SNPs and indels) (GRCh37.p13) | GRCh37.p13 |
| tguttata_snp | Taeniopygia guttata Short Variation (SNPs and indels) (taeGut3.2.4) | taeGut3.2.4 |
| fcatus_snp | Felis catus Short Variation (SNPs and indels) (Felis_catus_6.2) | Felis_catus_6.2 |
| cfamiliaris_snp | Canis familiaris Short Variation (SNPs and indels) (CanFam3.1) | CanFam3.1 |
| hsapiens_structvar_som | Homo sapiens Somatic Structural Variation (GRCh37.p13) | GRCh37.p13 |
| sscrofa_snp | Sus scrofa Short Variation (SNPs and indels) (Sscrofa10.2) | Sscrofa10.2 |
| dmelanogaster_snp | Drosophila melanogaster Short Variation (SNPs and indels) (BDGP5) | BDGP5 |
| rnorvegicus_snp | Rattus norvegicus Short Variation (SNPs and indels) (Rnor_5.0) | Rnor_5.0 |
kable(listAttributes(snpmart))
| name | description |
|---|---|
| refsnp_id | Variation Name |
| refsnp_source | Variation source |
| refsnp_source_description | Variation source description |
| chr_name | Chromosome name |
| chrom_start | Position on Chromosome (bp) |
| chrom_strand | Strand |
| allele | Variant Alleles |
| mapweight | Mapweight |
| validated | Evidence status |
| allele_1 | Ancestral allele |
| minor_allele | Minor allele (ALL) |
| minor_allele_freq | 1000 genomes global MAF (ALL) |
| minor_allele_count | 1000 genomes global MAC (ALL) |
| clinical_significance | Clinical significance |
| synonym_name | Synonym name |
| synonym_source | Synonym source |
| synonym_source_description | Synonym source description |
| variation_names | Associated variation names |
| study_type | Study type |
| study_external_ref | Study External Reference |
| study_description | Study Description |
| source_name | Source name |
| associated_gene | Associated gene with phenotype |
| phenotype_description | Phenotype description |
| phenotype_significance | Phenotype significance [0 non significant, 1 significant] |
| associated_variant_risk_allele | Associated variant risk allele |
| p_value | P value |
| set_name | Variation Set Name |
| set_description | Variation Set Description |
| title_20137 | Title |
| authors_20137 | Authors |
| year_20137 | Year |
| pmid_20137 | PubMed ID |
| pmcid_20137 | PMC reference number (PMCID) |
| ucsc_id_20137 | UCSC ID |
| doi_20137 | Digital Object Identifier |
| ensembl_gene_stable_id | Ensembl Gene ID |
| ensembl_transcript_stable_id | Ensembl Transcript ID |
| ensembl_transcript_chrom_strand | Transcript strand |
| ensembl_type | Biotype |
| consequence_type_tv | Consequence to transcript |
| consequence_allele_string | Consequence specific allele |
| ensembl_peptide_allele | Protein allele |
| cdna_start | Variation start in cDNA (bp) |
| cdna_end | Variation end in cDNA (bp) |
| translation_start | Variation start in translation (aa) |
| translation_end | Variation end in translation (aa) |
| cds_start | Variation start in CDS (bp) |
| cds_end | Variation end in CDS (bp) |
| distance_to_transcript | Distance to transcript |
| polyphen_prediction | PolyPhen prediction |
| polyphen_score | PolyPhen score |
| sift_prediction | SIFT prediction |
| sift_score | SIFT score |
| feature_stable_id_20126 | Regulatory Feature Stable ID |
| allele_string_20126 | Regulatory Feature Allele String |
| consequence_types_20126 | Regulatory Feature Consequence Type |
| feature_stable_id_20125 | Motif Feature Stable ID |
| allele_string_20125 | Motif Feature Allele String |
| consequence_types_20125 | Motif Feature Consequence Type |
| in_informative_position_20125 | High Information Position |
| motif_score_delta_20125 | Motif Score Change |
| motif_name_20125 | Motif Name |
| motif_start_20125 | Motif Position |
| snp | Variation sequence |
| upstream_flank | upstream_flank |
| downstream_flank | downstream_flank |
| chr_name | Chromosome name |
| chrom_start | Position on Chromosome (bp) |
| chrom_strand | Strand |
| refsnp_id | Variation Name |
| refsnp_source | Variation source |
| allele | Variant Alleles |
| validated | Evidence status |
| mapweight | Mapweight |
| ensembl_peptide_allele | Protein allele |
kable(listFilters(snpmart))
| name | description |
|---|---|
| chr_name | Chromosome name |
| chrom_start | Start |
| chrom_end | End |
| band_start | Band Start |
| band_end | Band End |
| marker_end | Marker End |
| marker_start | Marker Start |
| chromosomal_region | Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1) |
| strand | Strand |
| variation_source | Variation source |
| snp_filter | Filter by Variation Name (e.g. rs123, CM000001) |
| variation_synonym_source | Variation Synonym source |
| study_type | Study type |
| phenotype_description | Phenotype description |
| phenotype_significance | Phenotype significance |
| variation_set_name | Variation Set Name |
| sift_prediction | SIFT Prediction |
| sift_score | SIFT score <= |
| polyphen_prediction | PolyPhen Prediction |
| polyphen_score | PolyPhen score >= |
| minor_allele_freq | Global minor allele frequency <= |
| minor_allele_freq_second | Global minor allele frequency >= |
| clinical_significance | Clinical_significance |
| with_validated | Variations that have been validated |
| with_variation_citation | Variations with citations |
| distance_to_transcript | Distance to transcript <= |
| ensembl_gene | Ensembl Gene ID(s) [Max 500] |
| so_parent_name | Parent term name |
| feature_stable_id | Filter by Regulatory Stable ID(s) (e.g. ENSR00001529861) [Max 500 ADVISED] |
| motif_name | Motif Name |
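Some of the filters above advertise a size limit, e.g. ensembl_gene is marked [Max 500]. Below is a minimal sketch of splitting a longer ID vector into batches before querying. Both `chunk_ids()` and the example IDs are hypothetical, and the commented-out lines only suggest how the batches might be fed to getBM, assuming a working `snpmart`.

```r
# Hypothetical helper: split a vector of IDs into batches of at most `size`.
chunk_ids <- function(ids, size = 500) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Made-up placeholder IDs, used only to demonstrate the batching.
ids <- sprintf("ENSG%011d", 1:1200)
batches <- chunk_ids(ids)
lengths(batches)  # 500 500 200

# With a live mart, each batch could then be queried and the results stacked:
# res <- do.call(rbind, lapply(batches, function(b)
#   getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id"),
#         filters = "ensembl_gene", values = b, mart = snpmart)))
```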