The biomaRt package provides an interface to databases implementing the BioMart software suite, e.g. Ensembl and Uniprot. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. The Ensembl database holds a lot of data on genome sequences and annotations. Take a look at http://www.ensembl.org/biomart/martview/ to get an idea of what can be downloaded from Ensembl.
Working with biomaRt typically consists of these 3 steps:
First, make sure that the biomaRt package is installed and loaded. In the code examples below, you will also need the dplyr package.
library(biomaRt)
library(dplyr)
listMarts()
# Get correct version of the database
ensembl = useMart("ENSEMBL_MART_ENSEMBL")
# List all datasets
head(listDatasets(ensembl))
listDatasets(ensembl)
# Select human dataset
ensembl_mart = useDataset("hsapiens_gene_ensembl",mart=ensembl)
getBM()
is the main function for querying datasets. These queries consist of attributes, filters and filter values.
Use listAttributes() to view all available attributes
attributes = listAttributes(ensembl_mart)
head(attributes)
Use listFilters() to view all filter options
filters = listFilters(ensembl_mart)
head(filters)
Some filters have a limited set of values. For example, boolean filters take TRUE or FALSE. For other filters the function filterOptions() can be used to get all possible values.
Find all valid options for the filter ‘biotype’.
filterOptions('biotype',ensembl_mart)
## [1] "[3prime_overlapping_ncRNA,antisense,bidirectional_promoter_lncRNA,IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_J_gene,IG_J_pseudogene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,lincRNA,macro_lncRNA,miRNA,misc_RNA,Mt_rRNA,Mt_tRNA,non_coding,polymorphic_pseudogene,processed_pseudogene,processed_transcript,protein_coding,pseudogene,ribozyme,rRNA,scaRNA,scRNA,sense_intronic,sense_overlapping,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vaultRNA]"
Now we can put all this together to build queries using the function getBM(attributes = “”, filters = “”, values = “”, mart,…).
Let’s first try to find the chromosomal position of the gene SRC.
getBM(
attributes = c("chromosome_name", "start_position", "end_position"),
filters = c("hgnc_symbol"),
values = c("SRC"),
mart = ensembl_mart
)
You may also use getBM() to annotate all genes in a list with for example gene name, location and description.
q_genes <- c("ENSG00000197122", "ENSG00000182866")
gene_annot <-
getBM(
attributes = c(
"ensembl_gene_id",
"hgnc_symbol",
"chromosome_name",
"band",
"strand",
"start_position",
"end_position",
"description",
"gene_biotype"
),
filters = "ensembl_gene_id",
values = q_genes,
mart = ensembl_mart
)
gene_annot
Next, we want to find all genes with the Gene Ontology (GO) annotation TOR complex (GO:0038201). First, we search for filters related to “go”:
grep("go", filters$name, value=TRUE)
## [1] "with_go" "with_goslim_goa" "go"
## [4] "goslim_goa" "go_parent_term" "go_parent_name"
## [7] "go_evidence_code" "with_ggorilla_homolog"
Let’s try to use the filter “go”:
tor_table <-
getBM(
attributes = c("ensembl_gene_id", "hgnc_symbol", "go_id"),
filters = "go",
values = "GO:0038201",
mart = ensembl_mart
)
tor_table
How many genes were found? Note that the “go” filter will only give the genes annoatated to exactly this Gene Ontology term. If we want to find all genes annotated to the given term or any of the child terms, we instead use “go_parent_term”. This query takes a bit longer to run.
tor_table2 <-
getBM(
attributes = c("ensembl_gene_id", "hgnc_symbol", "go_id"),
filters = "go_parent_term",
values = "GO:0038201",
mart = ensembl_mart
)
head(tor_table2)
We can also try to retreive the GO terms (id + name) associated with the genes in our list. Let’s say that we are only interested in the GO domain “Biological Process” (see attribute namespace_1003).
go_table <-
getBM(
attributes = c("external_gene_name", "go_id", "name_1006", "namespace_1003"),
filters = "ensembl_gene_id",
values = q_genes,
mart = ensembl_mart
) %>% filter(namespace_1003 == "biological_process")
head(go_table)
You can combine several filters in the same query. Note that the values should be a list of vectors. Search for all protein coding genes on chromosome Y that have an ortholog in fruit fly. It can be nice to sort the genes by start position.
orth_table <-
getBM(
attributes = c(
"hgnc_symbol",
"chromosome_name",
"start_position",
"end_position"
),
filters = c("chromosome_name", "biotype", "with_dmelanogaster_homolog"),
values = list("Y", "protein_coding", TRUE),
mart = ensembl_mart
) %>% arrange(start_position)
head(orth_table)
BiomaRt can also be used to access sequence data. To find the sequence options, look at the “sequences” page of listAttributes().
# If you don't know the name of the pages:
pages = attributePages(ensembl_mart)
pages
## [1] "feature_page" "structure" "homologs" "snp"
## [5] "snp_somatic" "sequences"
head(listAttributes(ensembl_mart, page = "sequences"))
We will first use getBM() to retrieve the cDNA sequences of the genes in q_genes.
seq <-
getBM(
attributes = c("ensembl_gene_id", "cdna"),
filters = "ensembl_gene_id",
values = q_genes,
mart = ensembl_mart
)
head(seq[,c("ensembl_gene_id", "cdna")])
Note that you get several sequences per gene. These represent the different transcript isoforms. Add the transcript ids to the output to see this.
It is also possible to use a wrapper function, getSequence(), to fetch the sequences.
Try to get the same sequences using this function. Valid seqType arguments can be found in the help for getSequence. We can also order them according to gene id.
seq <-
getSequence(
id = q_genes,
type = "ensembl_gene_id",
seqType = "cdna",
mart = ensembl_mart
) %>% arrange(ensembl_gene_id)
head(seq[,c("ensembl_gene_id", "cdna")])
Next, we want the 100 bp upstream promoter sequences of the q_genes.
seq <-
getSequence(
id = q_genes,
type = "ensembl_gene_id",
seqType = "coding_gene_flank",
upstream = 100,
mart = ensembl_mart
)
head(seq[,c("ensembl_gene_id", "coding_gene_flank")])
This function can also be used with chromosome positions. As an example, get the peptide sequences of Ensembl genes in: chr1:32251239-32286165. Note that you have to use type even if you filter on position.
seq <-
getSequence(
chromosome = gene_annot$chromosome_name[1],
start = gene_annot$start_position[1],
end = gene_annot$end_position[1],
type = "ensembl_gene_id",
seqType = "peptide",
mart = ensembl_mart
)
head(seq[,c("ensembl_gene_id", "peptide")])
If there is no sequence of this type it may be listed as “Sequence unavailable”.
Sequences can be exported to file in fasta format by exportFASTA(). Export the sequences from the previous exercise. It may be wise to exclude entries with “Sequence unavailable”.
exportFASTA(seq[seq$peptide != "Sequence unavailable",],"myFastaFile.fa")
Note that if the file already exists, this command will add new sequences to the existing ones.
A typical use of the package is when you get a list of differentially expressed genes that you want to annotate. To try this out, load the file called DE_table. This file contains an example of what the output from a differential expression analysis (using EdgeR) can look like. You can also use your own data if you have any.
de_tab <- read.table("DE_table.txt", sep="\t", header=TRUE, as.is=TRUE)
de_tab
Annotate the genes in this list and merge the annotations with the original data.
de_gene_annot <-
getBM(
attributes = c(
"ensembl_gene_id",
"hgnc_symbol",
"chromosome_name",
"strand",
"start_position",
"end_position",
"description",
"gene_biotype"
),
filters = "ensembl_gene_id",
values = de_tab$ensembl_gene_id,
mart = ensembl_mart
)
merge(de_tab, de_gene_annot, by = "ensembl_gene_id")
Now we will try a different database (“ENSEMBL_MART_SNP”) containing genetic variants. Select the mart and dataset, this can be done in two steps as before, or using a single command. Then use listFilters()
and listAttributes()
to see the filters and attributes available for this dataset.
ensembl_snp = useMart("ENSEMBL_MART_SNP")
snp_mart = useDataset("hsapiens_snp",mart=ensembl_snp)
snp_mart = useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp")
Retrieve all common (minor allele frequency >= 0.01) nonsynonymous SNPs in the genes above (q_genes) and find at least the variant names, positions and consequences.
snps <-
getBM(
attributes = c(
'refsnp_id',
'allele',
'chrom_start',
'chrom_strand',
'consequence_type_tv',
"minor_allele_freq",
"associated_gene"
),
filters = c('ensembl_gene', 'minor_allele_freq_second'),
values = list(q_genes, '0.01'),
mart = snp_mart
) %>%
filter(consequence_type_tv == "missense_variant")
snps
It is also possible to link information between different datasets, e.g. to find orthologs between species. To do this you access two datasets at once, called the primary and the linked datasets. The function uses attributes, filters and values to query each dataset.
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
getLDS(
attributes = c("hgnc_symbol", "chromosome_name", "start_position"),
filters = "hgnc_symbol",
values = "SRC",
mart = human,
attributesL = c("mgi_id", "chromosome_name", "start_position"),
martL = mouse
)