The focus of this exercise will be to introduce some R packages that are commonly used in analysis of RNA-sequence and Chip-sequence data. Even though the purpose and ideas behind these two types of projects are very different there is in the analysis of data many similarities as they are both using the number of short reads mapping to different parts of the genome as their main data source.
For both type of projects the first steps in the analysis eg. the mapping of short reads to a reference genome is often done outside R. For the RNA-seq part we will show how one can do this steps in R, but the majority of work on this workshop is put on later steps of the analysis as that is where R is really shining.
In this exercise we will make heavy use of different bioconductor packages. Since bioconductor were introduced yesterday you probably remember how to install packages, otherwise click to reveal the code.
# source("https://bioconductor.org/biocLite.R")
# biocLite(c("recount", "GenomicRanges", "limma", "edgeR", "DESeq2",
# "regionReport", "clusterProfiler", "org.Hs.eg.db", "gplots",
# "derfinder", "rtracklayer", "GenomicFeatures", "bumphunter",
# "derfinderPlot", "devtools", "Rsamtools", "csaw", "Glimma",
# "ShortRead"))
library(recount)
library(ShortRead)
library(limma)
library(Glimma)
library(edgeR)
library(ggplot2)
library(DESeq2)
You can go to www.bioconductor.org to learn more about the different packages, but here is an attempt at a single sentence summary of the different packages used
NB! The Rsubread package can not be easily installed on windows systems. If you are using a windows system there are some hints in the manual on how to proceed, but note that it is not heavily used in this exercise so there is no need to spend much time on it.
RNA-seq data are deposited in most cases made public via open sequence repositories such as https://www.ncbi.nlm.nih.gov/geo/ and https://www.ebi.ac.uk/arrayexpress/. Both of these can be queried via the web, but they also have bioconductor packages so that one can query and download (at least parts of the data) directly from R.
In addition to these there have been some major efforts in creating useful resources for the research community and make them publicly available. One such attempt is the https://jhubiostatistics.shinyapps.io/recount/. This project have collected more than 2000 human RNA-seq studies and re-analysed them in an effecient and consistent manner. In the examples here we will make use of data from the recount experiment, but to show a complete workflow from short reads to expression analysis we will first look at how to do all steps in a RNA-seq analysis from within R. For the first part of the exercise we will use some small example files that you can find in the rna_chip folder here.
The first step of the analysis in most projects using short read data is to have a look at the quality of the obtained data as that will determine whether we should filter the short reads prior to analysis as well as detect if there are any major issues with our data.
There are many non-R based tools that do this and often the sequence center will supply quality summaries together with the actual sequence files. The quality relating to short reads are encoded directly in the fastq file so the tools are often just summarising this and do some basic sequence content analysis.
Let us look at the information we find in a fastq file using the ShortRead package in R. Make sure you have downloaded the folder rna_chip to your computer before trying the commands. After loading the package we sample 10 reads at random from one of the supplied fastq files.
# sampler <- FastqSampler("~/data/SRR1424755_1.fastq.gz", n = 10)
# fq_ex <- yield(sampler)
download_study("SRP043368", type = "rse-gene")
## 2018-06-16 12:17:51 downloading file rse_gene.Rdata to SRP043368
load(file.path("SRP043368", "rse_gene.Rdata"))
rse_gene
## class: RangedSummarizedExperiment
## dim: 58037 26
## metadata(0):
## assays(1): counts
## rownames(58037): ENSG00000000003.14 ENSG00000000005.5 ...
## ENSG00000283698.1 ENSG00000283699.1
## rowData names(3): gene_id bp_length symbol
## colnames(26): SRR1424731 SRR1424732 ... SRR1424755 SRR1424756
## colData names(21): project sample ... title characteristics
colData(rse_gene)
## DataFrame with 26 rows and 21 columns
## project sample experiment run
## <character> <character> <character> <character>
## SRR1424731 SRP043368 SRS641338 SRX610324 SRR1424731
## SRR1424732 SRP043368 SRS641339 SRX610325 SRR1424732
## SRR1424733 SRP043368 SRS641340 SRX610326 SRR1424733
## SRR1424735 SRP043368 SRS641341 SRX610328 SRR1424735
## SRR1424734 SRP043368 SRS641342 SRX610327 SRR1424734
## ... ... ... ... ...
## SRR1424752 SRP043368 SRS641358 SRX610344 SRR1424752
## SRR1424753 SRP043368 SRS641359 SRX610345 SRR1424753
## SRR1424754 SRP043368 SRS641360 SRX610346 SRR1424754
## SRR1424755 SRP043368 SRS641361 SRX610347 SRR1424755
## SRR1424756 SRP043368 SRS641361 SRX610347 SRR1424756
## read_count_as_reported_by_sra reads_downloaded
## <integer> <integer>
## SRR1424731 29955322 29955322
## SRR1424732 31892332 31892332
## SRR1424733 50479438 50479438
## SRR1424735 59930974 59930974
## SRR1424734 38127460 38127460
## ... ... ...
## SRR1424752 56690642 56690642
## SRR1424753 53806682 53806682
## SRR1424754 62344990 62344990
## SRR1424755 17913512 17913512
## SRR1424756 13938278 13938278
## proportion_of_reads_reported_by_sra_downloaded paired_end
## <numeric> <logical>
## SRR1424731 1 TRUE
## SRR1424732 1 TRUE
## SRR1424733 1 TRUE
## SRR1424735 1 TRUE
## SRR1424734 1 TRUE
## ... ... ...
## SRR1424752 1 TRUE
## SRR1424753 1 TRUE
## SRR1424754 1 TRUE
## SRR1424755 1 TRUE
## SRR1424756 1 TRUE
## sra_misreported_paired_end mapped_read_count auc
## <logical> <integer> <numeric>
## SRR1424731 FALSE 29582886 2700597015
## SRR1424732 FALSE 31083419 2802567487
## SRR1424733 FALSE 50232620 4720081400
## SRR1424735 FALSE 59260728 5554985728
## SRR1424734 FALSE 37382762 3384497431
## ... ... ... ...
## SRR1424752 FALSE 47212788 3413173255
## SRR1424753 FALSE 52082887 4798991823
## SRR1424754 FALSE 60944552 5617825794
## SRR1424755 FALSE 17318697 1588459316
## SRR1424756 FALSE 13606853 1185130227
## sharq_beta_tissue sharq_beta_cell_type
## <character> <character>
## SRR1424731 muscle skeletal
## SRR1424732 muscle skeletal
## SRR1424733 muscle skeletal
## SRR1424735 muscle skeletal
## SRR1424734 muscle skeletal
## ... ... ...
## SRR1424752 muscle skeletal
## SRR1424753 muscle skeletal
## SRR1424754 muscle skeletal
## SRR1424755 muscle skeletal
## SRR1424756 muscle skeletal
## biosample_submission_date biosample_publication_date
## <character> <character>
## SRR1424731 2014-06-18T09:46:51.917 2014-07-16T01:09:43.077
## SRR1424732 2014-06-18T09:46:31.127 2014-07-16T01:09:48.273
## SRR1424733 2014-06-18T09:46:33.907 2014-07-16T01:09:54.377
## SRR1424735 2014-06-18T09:46:28.123 2014-07-16T01:10:05.983
## SRR1424734 2014-06-18T09:46:55.210 2014-07-16T01:10:00.340
## ... ... ...
## SRR1424752 2014-06-18T09:47:35.353 2014-07-16T01:23:32.167
## SRR1424753 2014-06-18T09:47:17.927 2014-07-16T01:23:48.860
## SRR1424754 2014-06-18T09:47:38.047 2014-07-16T01:24:02.310
## SRR1424755 2014-06-18T09:47:24.060 2014-07-16T01:08:32.747
## SRR1424756 2014-06-18T09:47:24.060 2014-07-16T01:08:32.747
## biosample_update_date avg_read_length geo_accession
## <character> <integer> <character>
## SRR1424731 2014-08-20T01:14:31.823 202 GSM1415126
## SRR1424732 2014-08-20T01:14:32.233 202 GSM1415127
## SRR1424733 2014-08-20T01:14:32.627 202 GSM1415128
## SRR1424735 2014-08-20T01:14:32.967 202 GSM1415130
## SRR1424734 2014-08-20T01:14:33.307 202 GSM1415129
## ... ... ... ...
## SRR1424752 2014-08-20T01:09:20.720 202 GSM1415146
## SRR1424753 2014-08-20T01:09:21.357 202 GSM1415147
## SRR1424754 2014-08-20T01:09:21.800 202 GSM1415148
## SRR1424755 2014-08-20T01:09:22.357 202 GSM1415149
## SRR1424756 2014-08-20T01:09:22.357 202 GSM1415149
## bigwig_file title
## <character> <character>
## SRR1424731 SRR1424731.bw 217
## SRR1424732 SRR1424732.bw 225
## SRR1424733 SRR1424733.bw 174
## SRR1424735 SRR1424735.bw 135
## SRR1424734 SRR1424734.bw 235
## ... ... ...
## SRR1424752 SRR1424752.bw 172
## SRR1424753 SRR1424753.bw 234
## SRR1424754 SRR1424754.bw 152
## SRR1424755 SRR1424755.bw 224
## SRR1424756 SRR1424756.bw 224
## characteristics
## <CharacterList>
## SRR1424731 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424732 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424733 tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424735 tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424734 tissue: skeletal muscle,gender: male,exercise status: untrained,...
## ... ...
## SRR1424752 tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424753 tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424754 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424755 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424756 tissue: skeletal muscle,gender: female,exercise status: untrained,...
Information about the experimental condition of the samples can in this experiment be found in the last column (named characteristics) of the S4Vectors DataFrame object. From this list we can extract needed (eg. information about conditions that we need to be able to compare samples for differential gene expression) sample information using the sapply function.
sampleInfo <- sapply(colData(rse_gene)$characteristics, "[", c(2,4))
Note that we get a matrix back where the first row contain the gender information and the second contains information on which leg is sampled. By some simple commands we can convert this to vectors of factors that can be used to set up the linear model in the analysis of differential gene expression.
# sampleInfo <- factor(as.vector(sampleInfo))
gender <- unlist(strsplit(sampleInfo[1,], split = ": "))[c(FALSE, TRUE)]
part <- unlist(strsplit(sampleInfo[2,], split = ": "))[c(FALSE, TRUE)]
gender <- factor(gender)
part <- factor(make.names(part))
We can now add this information to the RangedSummarizedExperiments object so that we have all necessary data associated with the experiment in the same object. We also rescale the expession data from recount object to make sure it is useful for gene expression analysis at gene level.
colData(rse_gene)$gender <- gender
colData(rse_gene)$part <- part
rse_gene_scaled <- scale_counts(rse_gene)
counts <- assays(rse_gene_scaled)$counts
filter <- rowMeans(counts) >= 10
head(counts)
## SRR1424731 SRR1424732 SRR1424733 SRR1424735 SRR1424734
## ENSG00000000003.14 107 147 121 147 107
## ENSG00000000005.5 3 3 6 1 5
## ENSG00000000419.12 569 538 669 867 539
## ENSG00000000457.13 347 233 306 310 248
## ENSG00000000460.16 127 77 107 96 87
## ENSG00000000938.12 21 45 102 40 64
## SRR1424736 SRR1424737 SRR1424738 SRR1424739 SRR1424740
## ENSG00000000003.14 117 154 169 126 77
## ENSG00000000005.5 2 2 3 2 0
## ENSG00000000419.12 787 671 434 463 508
## ENSG00000000457.13 308 263 214 252 275
## ENSG00000000460.16 91 74 65 76 95
## ENSG00000000938.12 44 43 71 25 27
## SRR1424741 SRR1424743 SRR1424742 SRR1424745 SRR1424744
## ENSG00000000003.14 140 87 98 108 106
## ENSG00000000005.5 2 0 0 0 2
## ENSG00000000419.12 623 809 558 479 707
## ENSG00000000457.13 232 280 225 191 264
## ENSG00000000460.16 57 95 62 75 69
## ENSG00000000938.12 41 23 29 28 26
## SRR1424746 SRR1424747 SRR1424748 SRR1424749 SRR1424750
## ENSG00000000003.14 124 145 95 196 140
## ENSG00000000005.5 0 12 12 20 5
## ENSG00000000419.12 656 653 520 756 314
## ENSG00000000457.13 240 192 191 329 218
## ENSG00000000460.16 74 89 63 100 58
## ENSG00000000938.12 39 26 17 49 62
## SRR1424751 SRR1424752 SRR1424753 SRR1424754 SRR1424755
## ENSG00000000003.14 74 73 132 135 158
## ENSG00000000005.5 12 2 6 2 27
## ENSG00000000419.12 469 465 661 742 580
## ENSG00000000457.13 212 215 299 331 329
## ENSG00000000460.16 65 78 87 97 73
## ENSG00000000938.12 18 194 40 33 20
## SRR1424756
## ENSG00000000003.14 120
## ENSG00000000005.5 25
## ENSG00000000419.12 315
## ENSG00000000457.13 239
## ENSG00000000460.16 84
## ENSG00000000938.12 6
head(filter)
## ENSG00000000003.14 ENSG00000000005.5 ENSG00000000419.12
## TRUE FALSE TRUE
## ENSG00000000457.13 ENSG00000000460.16 ENSG00000000938.12
## TRUE TRUE TRUE
We can now use the counts and the filter objects to create the input objects for the different differential expression packages.
## Build DGEList object
dge <- DGEList(counts = counts[filter, ])
dge$samples$part <- colData(rse_gene)$part
dge$samples$gender <- colData(rse_gene)$gender
## Calculate normalization factors
dge <- calcNormFactors(dge)
## Explore the data
mds_dge <- plotMDS(dge, labels = dge$samples$gender)
Here we can see that there is clear separation of samples corresponding to gender. We can create the same plot and look at whether there is any clustering by sampled leg.
plotMDS(dge, labels = dge$samples$part)
A more elegant solution would be to take the values from the plotMDS function and instead do the actual plotting with ggplot and add both factors in one go.
mdsDf <- data.frame(cbind(x = mds_dge$x, y = mds_dge$y))
mdsDf$part <- dge$samples$part
mdsDf$gender <- dge$samples$gender
ggplot(mdsDf, aes(x, y, color = gender, shape = part)) + geom_point() +
ggtitle("MDS plot for SRR1424755") +
xlab("Leading logFC, dim 1") + ylab("Leading logFC, dim 2")
EdgeR is a popular approach for testing differential gene expression from RNA-seq data. To run an edgeR analysis one need two things, a count matrix and a design matrix. The estimateDisp estimates common, trended and genewise dispersion for the data. The plot shows the biological coefficient of variation that tells about the expected underlying variation among replicates.
design <- model.matrix(~part + gender)
y <- estimateDisp(dge, design, robust=TRUE)
plotBCV(y)
We can now test for differential gene expression using the edgeR
fit.edgeR <- glmFit(y, design, robust=TRUE)
The limma package works with the dge object we created above, but prior to running limma we need to convert counts with the function voom, that returns log cpm valuse as well as precision weights. With this step the data is ready for linear modelling in limma in a framework assuming that values are normally distributed.
y.voom <- voom(dge, design, plot = TRUE)
fitLimma <- lmFit(y.voom, design)
fitLimma2 <- eBayes(fitLimma)
summary(decideTests(fitLimma2))
## (Intercept) partright.leg gendermale
## Down 3852 0 368
## NotSig 1649 19371 18528
## Up 13870 0 475
A third option for identifying differentially expressed genes in to use the DESeq2 package.
desD <- DESeqDataSet(rse_gene_scaled, design = ~ part + gender)
## converting counts to integer mode
keep <- rowSums(counts(desD)) >= 10
desD <- desD[keep,]
desD <- DESeq(desD)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 194 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
resD <- results(desD)
resD_df <- as.data.frame(resD)
head(resD_df)
ggplot(resD_df, aes(log2FoldChange, -log10(padj), fill = abs(log2FoldChange) > 1))+
geom_point(pch = 21) +
labs(fill = "Log 2-fold change > 1",
y = "-Log 10 of adjusted p-value")
## Warning: Removed 10888 rows containing missing values (geom_point).