Annotation

Functional annotation and more

View the Project on GitHub kn3in/annotation

Intro

We consider various mappings/relationships between genes, SNPs, pathways, etc. and other biological annotations in order to gain biological insight, whatever that may mean in your particular experiment.


Genome Browsers and the like

The easiest way to annotate a small list of genes, SNPs, microarray probes, etc. is to query the web-based interfaces that genome browsers often provide. They are also useful when you need a quick sanity check of your annotation scripts.

General interfaces

SNP

Gene Expression

Pathway Analysis

Biomart, UCSC, Ensembl and NCBI integrate a broad spectrum of data within a unified interface. On the other hand, there are plenty of annotation tools addressing particular needs, e.g. eQTL browsers and miRNA and lincRNA databases, which are not (yet) mentioned here.


Biomart

from www.biomart.org

BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework.

Ensembl is one of the databases available through Biomart. There are numerous APIs to access Ensembl programmatically. We're going to use the R/Bioconductor interface, which, unlike the other APIs, allows access to the entire Biomart.

Ensembl/Biomart APIs:

It is instructive to retrieve some data using both the highest-level web interface and the lowest, MySQL, level to see that, after all, there is no magic. The rest of the APIs provide a convenient in-between level in your preferred language, which hides the particularities of the database schema but lets you deal with the annotation programmatically.

Web interface, a.k.a. building a query by selecting from drop-down menus.

Hopefully this is self-explanatory:

  1. Go to Mart view
  2. CHOOSE DATABASE: Ensembl Variation 73
  3. CHOOSE DATASET: Homo sapiens Short Variation
  4. Filters: GENERAL VARIATION FILTERS: Filter by Variation Name:
    • rs7775397
    • rs4783244
    • rs6450176
  5. Attributes: SEQUENCE VARIATION: Variation Name, Minor allele (ALL), 1000 genomes global MAF (ALL)
  6. Results

By the same token one can build a query via Biomart: result

Direct MySQL query

Query the MySQL backend directly (not recommended; in addition, sequences cannot be retrieved this way; the schema is available here)

mysql --host=ensembldb.ensembl.org -P 3306  --user=anonymous \
-A -e "SELECT v.name, v.minor_allele, v.minor_allele_freq \
FROM variation v WHERE v.name IN ('rs7775397', 'rs4783244', 'rs6450176');" \
homo_sapiens_variation_73_37
name    minor_allele    minor_allele_freq
rs4783244   T   0.3393
rs6450176   A   0.3356
rs7775397   G   0.033

Note

Before we go on to discover the many goodies in the Bioconductor project: Bioconductor conference and tutorial materials are available here. Each package in the project has a vignette, which is worth checking. Bioconductor also has a workflows section describing particular uses of the project, e.g. Using Bioconductor for annotation.

biomaRt

library(biomaRt)
library(knitr)  # provides kable(), used below to print tables

Available marts

Contrast the Ensembl versions available through the web/MySQL interfaces (73 in the examples above) with those available through biomaRt.

kable(head(listMarts()))
| biomart | version |
| --- | --- |
| ensembl | ENSEMBL GENES 75 (SANGER UK) |
| snp | ENSEMBL VARIATION 75 (SANGER UK) |
| functional_genomics | ENSEMBL REGULATION 75 (SANGER UK) |
| vega | VEGA 53 (SANGER UK) |
| fungi_mart_21 | ENSEMBL FUNGI 21 (EBI UK) |
| fungi_variations_21 | ENSEMBL FUNGI VARIATION 21 (EBI UK) |

Available datasets

Select the snp mart and see which datasets it provides.

mart <- useMart("snp")
kable(head(listDatasets(mart)))
| dataset | description | version |
| --- | --- | --- |
| pabelii_snp | Pongo abelii Short Variation (SNPs and indels) (PPYG2) | PPYG2 |
| ecaballus_snp | Equus caballus Short Variation (SNPs and indels) (EquCab2) | EquCab2 |
| hsapiens_snp | Homo sapiens Short Variation (SNPs and indels) (GRCh37.p13) | GRCh37.p13 |
| hsapiens_structvar | Homo sapiens Structural Variation (GRCh37.p13) | GRCh37.p13 |
| oanatinus_snp | Ornithorhynchus anatinus Short Variation (SNPs and indels) (OANA5) | OANA5 |
| tnigroviridis_snp | Tetraodon nigroviridis Short Variation (SNPs and indels) (TETRAODON8.0) | TETRAODON8.0 |

Select the dataset with human SNPs, hsapiens_snp.

snpmart <- useDataset("hsapiens_snp", mart = mart)

Shortcut in case you know which mart and dataset you are after:

snpmart <- useMart("snp", dataset = "hsapiens_snp")

Attributes and Filters

A dataset is queried on a set of fields, the Attributes, i.e. the desired output. A query is narrowed down based on another set of fields, the Filters.

Here are Attributes for the hsapiens_snp dataset of the snp mart.

kable(head(listAttributes(snpmart)))
| name | description |
| --- | --- |
| refsnp_id | Variation Name |
| refsnp_source | Variation source |
| refsnp_source_description | Variation source description |
| chr_name | Chromosome name |
| chrom_start | Position on Chromosome (bp) |
| chrom_strand | Strand |

Here are the Filters

kable(head(listFilters(snpmart)))
| name | description |
| --- | --- |
| chr_name | Chromosome name |
| chrom_start | Start |
| chrom_end | End |
| band_start | Band Start |
| band_end | Band End |
| marker_end | Marker End |
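
The complete attribute and filter tables are long, so it can help to search them by keyword instead of reading them. The snippet below is a minimal sketch of one way to do that. It assumes only that listAttributes() and listFilters() return a data frame with name and description columns, as in the tables above. find_fields is a hypothetical helper, not part of biomaRt, and the example data frame is a hand-copied subset of the hsapiens_snp attributes so that the snippet runs without a connection.

```r
# Hypothetical helper (not part of biomaRt): keep the rows of a
# listAttributes()/listFilters()-style table whose name or description
# matches a pattern, ignoring case.
find_fields <- function(fields, pattern) {
  hit <- grepl(pattern, fields$name, ignore.case = TRUE) |
         grepl(pattern, fields$description, ignore.case = TRUE)
  fields[hit, , drop = FALSE]
}

# Hand-copied subset of the hsapiens_snp attributes, for illustration only.
attrs <- data.frame(
  name = c("refsnp_id", "chr_name", "chrom_start",
           "minor_allele", "minor_allele_freq"),
  description = c("Variation Name", "Chromosome name",
                  "Position on Chromosome (bp)", "Minor allele (ALL)",
                  "1000 genomes global MAF (ALL)"),
  stringsAsFactors = FALSE
)

# Look for anything related to minor alleles or their frequencies.
maf_fields <- find_fields(attrs, "maf|minor allele")
print(maf_fields)
```

With a live connection, the same helper could be applied to the real tables, e.g. find_fields(listAttributes(snpmart), "maf").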

Example query: Suppose we have a list of rs SNP IDs (the Filter: we want data only for those rs IDs) and would like to find out where those SNPs are located, along with their minor alleles and MAFs (the Attributes: we need only those fields returned from Biomart). getBM is the main function for querying Biomart. The query below uses its first four arguments: attributes, filters, values (of the filters) and mart.

?getBM
top_ids <- c("rs7775397", "rs4783244", "rs6450176")
snp_pos <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start",
                                 "minor_allele", "minor_allele_freq"),
                    filters = c("snp_filter"),
                     values = top_ids,
                       mart = snpmart)
kable(snp_pos)
| refsnp_id | chr_name | chrom_start | minor_allele | minor_allele_freq |
| --- | --- | --- | --- | --- |
| rs4783244 | 16 | 82662268 | T | 0.3393 |
| rs6450176 | 5 | 53298025 | A | 0.3361 |
| rs7775397 | 6 | 32261252 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_MANN | 32300691 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_COX | 32209826 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_QBL | 32219074 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_DBB | 32237102 | G | 0.0326 |
| rs7775397 | HSCHR6_MHC_SSTO | 32268189 | G | 0.0326 |
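
Note that rs7775397 is returned six times: once on chromosome 6 and once for each alternative MHC haplotype sequence (the HSCHR6_MHC_* rows) on which it is placed. If only the primary assembly matters, such rows can be dropped after the query. The following is a minimal sketch, assuming that primary chromosomes are named 1 to 22, X, Y or MT. keep_primary is a hypothetical helper name, and the data frame is retyped from the output above so that the snippet runs offline.

```r
# Hypothetical helper: keep only rows located on primary chromosomes.
# The default set of names (1-22, X, Y, MT) is an assumption for human data.
keep_primary <- function(snps, primary = c(1:22, "X", "Y", "MT")) {
  snps[as.character(snps$chr_name) %in% primary, , drop = FALSE]
}

# Positions retyped from the query result shown above.
snp_pos_demo <- data.frame(
  refsnp_id = c("rs4783244", "rs6450176", rep("rs7775397", 6)),
  chr_name = c("16", "5", "6", "HSCHR6_MHC_MANN", "HSCHR6_MHC_COX",
               "HSCHR6_MHC_QBL", "HSCHR6_MHC_DBB", "HSCHR6_MHC_SSTO"),
  chrom_start = c(82662268, 53298025, 32261252, 32300691, 32209826,
                  32219074, 32237102, 32268189),
  stringsAsFactors = FALSE
)

primary_pos <- keep_primary(snp_pos_demo)
print(primary_pos)
```

Filtering after the query keeps the getBM call itself unchanged, so the same helper would apply to snp_pos directly.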

All together

library(biomaRt)
snpmart <- useMart("snp", dataset = "hsapiens_snp")
top_ids <- c("rs7775397", "rs4783244", "rs6450176")
snp_pos <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start",
                                 "minor_allele", "minor_allele_freq"),
                    filters = c("snp_filter"),
                     values = top_ids,
                       mart = snpmart)

Full tables for biomaRt

Marts

kable(listMarts())
| biomart | version |
| --- | --- |
| ensembl | ENSEMBL GENES 75 (SANGER UK) |
| snp | ENSEMBL VARIATION 75 (SANGER UK) |
| functional_genomics | ENSEMBL REGULATION 75 (SANGER UK) |
| vega | VEGA 53 (SANGER UK) |
| fungi_mart_21 | ENSEMBL FUNGI 21 (EBI UK) |
| fungi_variations_21 | ENSEMBL FUNGI VARIATION 21 (EBI UK) |
| metazoa_mart_21 | ENSEMBL METAZOA 21 (EBI UK) |
| metazoa_variations_21 | ENSEMBL METAZOA VARIATION 21 (EBI UK) |
| plants_mart_21 | ENSEMBL PLANTS 21 (EBI UK) |
| plants_variations_21 | ENSEMBL PLANTS VARIATION 21 (EBI UK) |
| protists_mart_21 | ENSEMBL PROTISTS 21 (EBI UK) |
| protists_variations_21 | ENSEMBL PROTISTS VARIATION 21 (EBI UK) |
| msd | MSD (EBI UK) |
| htgt | WTSI MOUSE GENETICS PROJECT (SANGER UK) |
| REACTOME | REACTOME (CSHL US) |
| WS220 | WORMBASE 220 (CSHL US) |
| biomart | MGI (JACKSON LABORATORY US) |
| pride | PRIDE (EBI UK) |
| prod-intermart_1 | INTERPRO (EBI UK) |
| unimart | UNIPROT (EBI UK) |
| biomartDB | PARAMECIUM GENOME (CNRS FRANCE) |
| biblioDB | PARAMECIUM BIBLIOGRAPHY (CNRS FRANCE) |
| Eurexpress Biomart | EUREXPRESS (MRC EDINBURGH UK) |
| phytozome_mart | PHYTOZOME (JGI/CIG US) |
| HapMap_rel27 | HAPMAP 27 (NCBI US) |
| CosmicMart | COSMIC (SANGER UK) |
| cildb_all_v2 | CILDB INPARANOID AND FILTERED BEST HIT (CNRS FRANCE) |
| cildb_inp_v2 | CILDB INPARANOID (CNRS FRANCE) |
| experiments | INTOGEN EXPERIMENTS |
| oncomodules | INTOGEN ONCOMODULES |
| gmap_japonica | RICE-MAP JAPONICA (PEKING UNIVESITY CHINA) |
| europhenomeannotations | EUROPHENOME |
| ikmc | IKMC GENES AND PRODUCTS (IKMC) |
| EMAGE gene expression | EMAGE GENE EXPRESSION |
| EMAP anatomy ontology | EMAP ANATOMY ONTOLOGY |
| EMAGE browse repository | EMAGE BROWSE REPOSITORY |
| GermOnline | GERMONLINE |
| Sigenae_Oligo_Annotation_Ensembl_61 | SIGENAE OLIGO ANNOTATION (ENSEMBL 61) |
| Sigenae Oligo Annotation (Ensembl 59) | SIGENAE OLIGO ANNOTATION (ENSEMBL 59) |
| Sigenae Oligo Annotation (Ensembl 56) | SIGENAE OLIGO ANNOTATION (ENSEMBL 56) |
| Breast_mart_69 | BCCTB Bioinformatics Portal (UK and Ireland) |
| K562_Gm12878 | Predictive models of gene regulation from processed high-throughput epigenomics data: K562 vs. Gm12878 |
| Hsmm_Hmec | Predictive models of gene regulation from processed high-throughput epigenomics data: Hsmm vs. Hmec |
| Pancreas63 | PANCREATIC EXPRESSION DATABASE (BARTS CANCER INSTITUTE UK) |
| Public_OBIOMART | Genetic maps (markers, Qtls), Polymorphisms (snps, genes), Genetic and Phenotype resources with Genes annotations |
| Public_VITIS | Grapevine 8x, stuctural annotation with Genetic maps (genetic markers..) |
| Public_VITIS_12x | Grapevine 12x, stuctural and functional annotation with Genetic maps (genetic markers..) |
| Prod_WHEAT | Wheat, stuctural annotation with Genetic maps (genetic markers..) and Polymorphisms (snps) |
| Public_TAIRV10 | Arabidopsis Thaliana TAIRV10, genes functional annotation |
| Public_MAIZE | Zea mays ZmB73, genes functional annotation |
| Prod_POPLAR | Populus trichocarpa, genes functional annotation |
| Prod_POPLAR_V2 | Populus trichocarpa, genes functional annotation V2.0 |
| Prod_BOTRYTISEDIT | Botrytis cinerea T4, genes functional annotation |
| Prod_ | Botrytis cinerea B0510, genes functional annotation |
| Prod_SCLEROEDIT | Sclerotinia sclerotiorum, genes functional annotation |
| Prod_LMACULANSEDIT | Leptosphaeria maculans, genes functional annotation |
| vb_mart_22 | VectorBase Genes |
| vb_snp_mart_22 | VectorBase Variation |
| expression | VectorBase Expression |
| ENSEMBL_MART_PLANT | GRAMENE 40 ENSEMBL GENES (CSHL/CORNELL US) |
| ENSEMBL_MART_PLANT_SNP | GRAMENE 40 VARIATION (CSHL/CORNELL US) |

Datasets available for the snp mart

kable(listDatasets(mart))
| dataset | description | version |
| --- | --- | --- |
| pabelii_snp | Pongo abelii Short Variation (SNPs and indels) (PPYG2) | PPYG2 |
| ecaballus_snp | Equus caballus Short Variation (SNPs and indels) (EquCab2) | EquCab2 |
| hsapiens_snp | Homo sapiens Short Variation (SNPs and indels) (GRCh37.p13) | GRCh37.p13 |
| hsapiens_structvar | Homo sapiens Structural Variation (GRCh37.p13) | GRCh37.p13 |
| oanatinus_snp | Ornithorhynchus anatinus Short Variation (SNPs and indels) (OANA5) | OANA5 |
| tnigroviridis_snp | Tetraodon nigroviridis Short Variation (SNPs and indels) (TETRAODON8.0) | TETRAODON8.0 |
| ggallus_snp | Gallus gallus Short Variation (SNPs and indels) (Galgal4) | Galgal4 |
| oaries_snp | Ovis Aries Short Variation (SNPs and indels) (Oar_v3.1) | Oar_v3.1 |
| scerevisiae_snp | Saccharomyces cerevisiae Short Variation (SNPs and indels) (R64-1-1) | R64-1-1 |
| drerio_structvar | Danio rerio Structural Variation (Zv9) | Zv9 |
| mmulatta_structvar | Macaca mulatta Structural Variation (MMUL_1) | MMUL_1 |
| mmusculus_snp | Mus musculus Short Variation (SNPs and indels) (GRCm38.p2) | GRCm38.p2 |
| mmusculus_structvar | Mus musculus Structural Variation (GRCm38.p2) | GRCm38.p2 |
| drerio_snp | Danio rerio Short Variation (SNPs and indels) (Zv9) | Zv9 |
| mdomestica_snp | Monodelphis domestica Short Variation (SNPs and indels) (monDom5) | monDom5 |
| cfamiliaris_structvar | Canis familiaris Structural Variation (CanFam3.1) | CanFam3.1 |
| btaurus_structvar | Bos taurus Structural Variation (UMD3.1) | UMD3.1 |
| ptroglodytes_snp | Pan troglodytes Short Variation (SNPs and indels) (CHIMP2.1.4) | CHIMP2.1.4 |
| btaurus_snp | Bos taurus Short Variation (SNPs and indels) (UMD3.1) | UMD3.1 |
| mmulatta_snp | Macaca mulatta Short Variation (SNPs and indels) (MMUL_1) | MMUL_1 |
| nleucogenys_snp | Nomascus leucogenys Short Variation (SNPs and indels) (Nleu1.0) | Nleu1.0 |
| mgallopavo_snp | Meleagris gallopavo Short Variation (SNPs and indels) (UMD2) | UMD2 |
| sscrofa_structvar | Sus scrofa Structural Variation (Sscrofa10.2) | Sscrofa10.2 |
| ecaballus_structvar | Equus caballus Structural Variation (EquCab2) | EquCab2 |
| hsapiens_snp_som | Homo sapiens Somatic Short Variation (SNPs and indels) (GRCh37.p13) | GRCh37.p13 |
| tguttata_snp | Taeniopygia guttata Short Variation (SNPs and indels) (taeGut3.2.4) | taeGut3.2.4 |
| fcatus_snp | Felis catus Short Variation (SNPs and indels) (Felis_catus_6.2) | Felis_catus_6.2 |
| cfamiliaris_snp | Canis familiaris Short Variation (SNPs and indels) (CanFam3.1) | CanFam3.1 |
| hsapiens_structvar_som | Homo sapiens Somatic Structural Variation (GRCh37.p13) | GRCh37.p13 |
| sscrofa_snp | Sus scrofa Short Variation (SNPs and indels) (Sscrofa10.2) | Sscrofa10.2 |
| dmelanogaster_snp | Drosophila melanogaster Short Variation (SNPs and indels) (BDGP5) | BDGP5 |
| rnorvegicus_snp | Rattus norvegicus Short Variation (SNPs and indels) (Rnor_5.0) | Rnor_5.0 |

Attributes for the snp mart, the hsapiens_snp dataset

kable(listAttributes(snpmart))
| name | description |
| --- | --- |
| refsnp_id | Variation Name |
| refsnp_source | Variation source |
| refsnp_source_description | Variation source description |
| chr_name | Chromosome name |
| chrom_start | Position on Chromosome (bp) |
| chrom_strand | Strand |
| allele | Variant Alleles |
| mapweight | Mapweight |
| validated | Evidence status |
| allele_1 | Ancestral allele |
| minor_allele | Minor allele (ALL) |
| minor_allele_freq | 1000 genomes global MAF (ALL) |
| minor_allele_count | 1000 genomes global MAC (ALL) |
| clinical_significance | Clinical significance |
| synonym_name | Synonym name |
| synonym_source | Synonym source |
| synonym_source_description | Synonym source description |
| variation_names | Associated variation names |
| study_type | Study type |
| study_external_ref | Study External Reference |
| study_description | Study Description |
| source_name | Source name |
| associated_gene | Associated gene with phenotype |
| phenotype_description | Phenotype description |
| phenotype_significance | Phenotype significance [0 non significant, 1 significant] |
| associated_variant_risk_allele | Associated variant risk allele |
| p_value | P value |
| set_name | Variation Set Name |
| set_description | Variation Set Description |
| title_20137 | Title |
| authors_20137 | Authors |
| year_20137 | Year |
| pmid_20137 | PubMed ID |
| pmcid_20137 | PMC reference number (PMCID) |
| ucsc_id_20137 | UCSC ID |
| doi_20137 | Digital Object Identifier |
| ensembl_gene_stable_id | Ensembl Gene ID |
| ensembl_transcript_stable_id | Ensembl Transcript ID |
| ensembl_transcript_chrom_strand | Transcript strand |
| ensembl_type | Biotype |
| consequence_type_tv | Consequence to transcript |
| consequence_allele_string | Consequence specific allele |
| ensembl_peptide_allele | Protein allele |
| cdna_start | Variation start in cDNA (bp) |
| cdna_end | Variation end in cDNA (bp) |
| translation_start | Variation start in translation (aa) |
| translation_end | Variation end in translation (aa) |
| cds_start | Variation start in CDS (bp) |
| cds_end | Variation end in CDS (bp) |
| distance_to_transcript | Distance to transcript |
| polyphen_prediction | PolyPhen prediction |
| polyphen_score | PolyPhen score |
| sift_prediction | SIFT prediction |
| sift_score | SIFT score |
| feature_stable_id_20126 | Regulatory Feature Stable ID |
| allele_string_20126 | Regulatory Feature Allele String |
| consequence_types_20126 | Regulatory Feature Consequence Type |
| feature_stable_id_20125 | Motif Feature Stable ID |
| allele_string_20125 | Motif Feature Allele String |
| consequence_types_20125 | Motif Feature Consequence Type |
| in_informative_position_20125 | High Information Position |
| motif_score_delta_20125 | Motif Score Change |
| motif_name_20125 | Motif Name |
| motif_start_20125 | Motif Position |
| snp | Variation sequence |
| upstream_flank | upstream_flank |
| downstream_flank | downstream_flank |
| chr_name | Chromosome name |
| chrom_start | Position on Chromosome (bp) |
| chrom_strand | Strand |
| refsnp_id | Variation Name |
| refsnp_source | Variation source |
| allele | Variant Alleles |
| validated | Evidence status |
| mapweight | Mapweight |
| ensembl_peptide_allele | Protein allele |

Filters for the snp mart, the hsapiens_snp dataset

kable(listFilters(snpmart))
| name | description |
| --- | --- |
| chr_name | Chromosome name |
| chrom_start | Start |
| chrom_end | End |
| band_start | Band Start |
| band_end | Band End |
| marker_end | Marker End |
| marker_start | Marker Start |
| chromosomal_region | Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1) |
| strand | Strand |
| variation_source | Variation source |
| snp_filter | Filter by Variation Name (e.g. rs123, CM000001) |
| variation_synonym_source | Variation Synonym source |
| study_type | Study type |
| phenotype_description | Phenotype description |
| phenotype_significance | Phenotype significance |
| variation_set_name | Variation Set Name |
| sift_prediction | SIFT Prediction |
| sift_score | SIFT score <= |
| polyphen_prediction | PolyPhen Prediction |
| polyphen_score | PolyPhen score >= |
| minor_allele_freq | Global minor allele frequency <= |
| minor_allele_freq_second | Global minor allele frequency >= |
| clinical_significance | Clinical_significance |
| with_validated | Variations that have been validated |
| with_variation_citation | Variations with citations |
| distance_to_transcript | Distance to transcript <= |
| ensembl_gene | Ensembl Gene ID(s) [Max 500] |
| so_parent_name | Parent term name |
| feature_stable_id | Filter by Regulatory Stable ID(s) (e.g. ENSR00001529861) [Max 500 ADVISED] |
| motif_name | Motif Name |
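
Several of these filters can be combined in a single query. According to the getBM documentation, values is then a list with one element per filter, in the same order as filters. The sketch below builds such arguments for a region query with the chr_name, chrom_start and chrom_end filters listed above. region_query is a hypothetical helper, the region around rs4783244 is chosen purely for illustration, and the assumption that chrom_start and chrom_end bound the region from below and above has not been checked against the server. The getBM call itself is only attempted when a snpmart object already exists in the session.

```r
# Hypothetical helper: assemble getBM() arguments for "all variants in a region".
region_query <- function(chr, start, end) {
  stopifnot(length(chr) == 1, start <= end)
  list(
    attributes = c("refsnp_id", "chr_name", "chrom_start", "minor_allele_freq"),
    filters    = c("chr_name", "chrom_start", "chrom_end"),
    # One list element per filter, in the same order as `filters`.
    values     = list(chr, start, end)
  )
}

# Illustrative region on chromosome 16 around rs4783244 (position 82662268).
query <- region_query("16", 82660000, 82665000)
str(query)

# Only query the server if a mart object from the steps above is available.
if (exists("snpmart") && requireNamespace("biomaRt", quietly = TRUE)) {
  region_snps <- do.call(biomaRt::getBM, c(query, list(mart = snpmart)))
  print(head(region_snps))
}
```

Keeping the argument construction separate from the call makes it easy to inspect the filters and values before anything is sent to Biomart.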