0.1 Introduction

  • Why RNASeq?
    • Identify genes in the genomes
    • Investigate patterns of gene expression
    • Study isoforms and allelic expression
    • Learn about gene functions
    • Identify co-expressed gene networks
    • Identify SNPs in genes
  • From reads to expression
    • Sequence biological material and collect fastq files
    • Select reference genome (fasta file)
    • Identify a suitable annotation file (gtf, gff or bed)
    • Map reads to reference to generate a sam/bam file
    • Count reads in the sam/bam file that map to exonic/genic regions

0.2 Workshop

The focus of this exercise will be to introduce some R packages that are commonly used in analysis of RNA-sequence and Chip-sequence data. Even though the purpose and ideas behind these two types of projects are very different there is in the analysis of data many similarities as they are both using the number of short reads mapping to different parts of the genome as their main data source.

For both type of projects the first steps in the analysis eg. the mapping of short reads to a reference genome is often done outside R. For the RNA-seq part we will show how one can do this steps in R, but the majority of work on this workshop is put on later steps of the analysis as that is where R is really shining.

0.2.1 - R Packages

In this exercise we will make heavy use of different bioconductor packages. Since bioconductor were introduced yesterday you probably remember how to install packages, otherwise click to reveal the code.

# source("https://bioconductor.org/biocLite.R")
# biocLite(c("recount", "GenomicRanges", "limma", "edgeR", "DESeq2",
#     "regionReport", "clusterProfiler", "org.Hs.eg.db", "gplots",
#     "derfinder", "rtracklayer", "GenomicFeatures", "bumphunter",
#     "derfinderPlot", "devtools", "Rsamtools", "csaw", "Glimma",
#     "ShortRead"))
library(recount)
library(ShortRead)
library(limma)
library(Glimma)
library(edgeR)
library(ggplot2)
library(DESeq2)

You can go to www.bioconductor.org to learn more about the different packages, but here is an attempt at a single sentence summary of the different packages used

  • Shortread: Work with short read data (fastq files) in R
  • recount: Adds functions to interact with the recount2 data set (link)
  • GenomicRanges: The foundations of representing genomic coordinates in bioconductor
  • limma: Contain large number of functionality related to analysis of gene expression obtained from microarray and RNA-seq data
  • edgeR: Package devoloped to analyse count data from RNA-seq and Chip-seq like approaches
  • DESeq2: Provides methods to test for differential expression by use of negative binomial generalized linear models
  • org.Hs.eg.db: Annotation data for the human genome
  • EnsDb.Hsapiens.v86/EnsDb.Hsapiens.v75: Two different versions of the ensembl human genome annotation
  • rtracklayer: Interface to genome tracks and genome browsers (UCSC)
  • GenomicFeatures: Manages transcriptrelated information (UCSC, biomart)
  • devtools: To allow for easy install from github (and gitlab?) repositories
  • csaw: Package to analyse Chip-seq data in a windows-based fashion.
  • Rsubread: Adds functionality to map short read data to reference genomes and counting mapped reads on genomic regions.
  • Rsamtools: R interface to samtools that modify/read/write sam and bam files

NB! The Rsubread package can not be easily installed on windows systems. If you are using a windows system there are some hints in the manual on how to proceed, but note that it is not heavily used in this exercise so there is no need to spend much time on it.

0.2.2 - Access public RNASeq data

RNA-seq data are deposited in most cases made public via open sequence repositories such as https://www.ncbi.nlm.nih.gov/geo/ and https://www.ebi.ac.uk/arrayexpress/. Both of these can be queried via the web, but they also have bioconductor packages so that one can query and download (at least parts of the data) directly from R.

In addition to these there have been some major efforts in creating useful resources for the research community and make them publicly available. One such attempt is the https://jhubiostatistics.shinyapps.io/recount/. This project have collected more than 2000 human RNA-seq studies and re-analysed them in an effecient and consistent manner. In the examples here we will make use of data from the recount experiment, but to show a complete workflow from short reads to expression analysis we will first look at how to do all steps in a RNA-seq analysis from within R. For the first part of the exercise we will use some small example files that you can find in the rna_chip folder here.

0.2.3 - Quality control of short read data

The first step of the analysis in most projects using short read data is to have a look at the quality of the obtained data as that will determine whether we should filter the short reads prior to analysis as well as detect if there are any major issues with our data.

There are many non-R based tools that do this and often the sequence center will supply quality summaries together with the actual sequence files. The quality relating to short reads are encoded directly in the fastq file so the tools are often just summarising this and do some basic sequence content analysis.

Let us look at the information we find in a fastq file using the ShortRead package in R. Make sure you have downloaded the folder rna_chip to your computer before trying the commands. After loading the package we sample 10 reads at random from one of the supplied fastq files.

# sampler <- FastqSampler("~/data/SRR1424755_1.fastq.gz", n = 10)
# fq_ex <- yield(sampler)

0.2.4 - Download already prepared recount data

download_study("SRP043368", type = "rse-gene")
## 2018-06-16 12:17:51 downloading file rse_gene.Rdata to SRP043368
load(file.path("SRP043368", "rse_gene.Rdata"))
rse_gene
## class: RangedSummarizedExperiment 
## dim: 58037 26 
## metadata(0):
## assays(1): counts
## rownames(58037): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000283698.1 ENSG00000283699.1
## rowData names(3): gene_id bp_length symbol
## colnames(26): SRR1424731 SRR1424732 ... SRR1424755 SRR1424756
## colData names(21): project sample ... title characteristics
colData(rse_gene)
## DataFrame with 26 rows and 21 columns
##                project      sample  experiment         run
##            <character> <character> <character> <character>
## SRR1424731   SRP043368   SRS641338   SRX610324  SRR1424731
## SRR1424732   SRP043368   SRS641339   SRX610325  SRR1424732
## SRR1424733   SRP043368   SRS641340   SRX610326  SRR1424733
## SRR1424735   SRP043368   SRS641341   SRX610328  SRR1424735
## SRR1424734   SRP043368   SRS641342   SRX610327  SRR1424734
## ...                ...         ...         ...         ...
## SRR1424752   SRP043368   SRS641358   SRX610344  SRR1424752
## SRR1424753   SRP043368   SRS641359   SRX610345  SRR1424753
## SRR1424754   SRP043368   SRS641360   SRX610346  SRR1424754
## SRR1424755   SRP043368   SRS641361   SRX610347  SRR1424755
## SRR1424756   SRP043368   SRS641361   SRX610347  SRR1424756
##            read_count_as_reported_by_sra reads_downloaded
##                                <integer>        <integer>
## SRR1424731                      29955322         29955322
## SRR1424732                      31892332         31892332
## SRR1424733                      50479438         50479438
## SRR1424735                      59930974         59930974
## SRR1424734                      38127460         38127460
## ...                                  ...              ...
## SRR1424752                      56690642         56690642
## SRR1424753                      53806682         53806682
## SRR1424754                      62344990         62344990
## SRR1424755                      17913512         17913512
## SRR1424756                      13938278         13938278
##            proportion_of_reads_reported_by_sra_downloaded paired_end
##                                                 <numeric>  <logical>
## SRR1424731                                              1       TRUE
## SRR1424732                                              1       TRUE
## SRR1424733                                              1       TRUE
## SRR1424735                                              1       TRUE
## SRR1424734                                              1       TRUE
## ...                                                   ...        ...
## SRR1424752                                              1       TRUE
## SRR1424753                                              1       TRUE
## SRR1424754                                              1       TRUE
## SRR1424755                                              1       TRUE
## SRR1424756                                              1       TRUE
##            sra_misreported_paired_end mapped_read_count        auc
##                             <logical>         <integer>  <numeric>
## SRR1424731                      FALSE          29582886 2700597015
## SRR1424732                      FALSE          31083419 2802567487
## SRR1424733                      FALSE          50232620 4720081400
## SRR1424735                      FALSE          59260728 5554985728
## SRR1424734                      FALSE          37382762 3384497431
## ...                               ...               ...        ...
## SRR1424752                      FALSE          47212788 3413173255
## SRR1424753                      FALSE          52082887 4798991823
## SRR1424754                      FALSE          60944552 5617825794
## SRR1424755                      FALSE          17318697 1588459316
## SRR1424756                      FALSE          13606853 1185130227
##            sharq_beta_tissue sharq_beta_cell_type
##                  <character>          <character>
## SRR1424731            muscle             skeletal
## SRR1424732            muscle             skeletal
## SRR1424733            muscle             skeletal
## SRR1424735            muscle             skeletal
## SRR1424734            muscle             skeletal
## ...                      ...                  ...
## SRR1424752            muscle             skeletal
## SRR1424753            muscle             skeletal
## SRR1424754            muscle             skeletal
## SRR1424755            muscle             skeletal
## SRR1424756            muscle             skeletal
##            biosample_submission_date biosample_publication_date
##                          <character>                <character>
## SRR1424731   2014-06-18T09:46:51.917    2014-07-16T01:09:43.077
## SRR1424732   2014-06-18T09:46:31.127    2014-07-16T01:09:48.273
## SRR1424733   2014-06-18T09:46:33.907    2014-07-16T01:09:54.377
## SRR1424735   2014-06-18T09:46:28.123    2014-07-16T01:10:05.983
## SRR1424734   2014-06-18T09:46:55.210    2014-07-16T01:10:00.340
## ...                              ...                        ...
## SRR1424752   2014-06-18T09:47:35.353    2014-07-16T01:23:32.167
## SRR1424753   2014-06-18T09:47:17.927    2014-07-16T01:23:48.860
## SRR1424754   2014-06-18T09:47:38.047    2014-07-16T01:24:02.310
## SRR1424755   2014-06-18T09:47:24.060    2014-07-16T01:08:32.747
## SRR1424756   2014-06-18T09:47:24.060    2014-07-16T01:08:32.747
##              biosample_update_date avg_read_length geo_accession
##                        <character>       <integer>   <character>
## SRR1424731 2014-08-20T01:14:31.823             202    GSM1415126
## SRR1424732 2014-08-20T01:14:32.233             202    GSM1415127
## SRR1424733 2014-08-20T01:14:32.627             202    GSM1415128
## SRR1424735 2014-08-20T01:14:32.967             202    GSM1415130
## SRR1424734 2014-08-20T01:14:33.307             202    GSM1415129
## ...                            ...             ...           ...
## SRR1424752 2014-08-20T01:09:20.720             202    GSM1415146
## SRR1424753 2014-08-20T01:09:21.357             202    GSM1415147
## SRR1424754 2014-08-20T01:09:21.800             202    GSM1415148
## SRR1424755 2014-08-20T01:09:22.357             202    GSM1415149
## SRR1424756 2014-08-20T01:09:22.357             202    GSM1415149
##              bigwig_file       title
##              <character> <character>
## SRR1424731 SRR1424731.bw         217
## SRR1424732 SRR1424732.bw         225
## SRR1424733 SRR1424733.bw         174
## SRR1424735 SRR1424735.bw         135
## SRR1424734 SRR1424734.bw         235
## ...                  ...         ...
## SRR1424752 SRR1424752.bw         172
## SRR1424753 SRR1424753.bw         234
## SRR1424754 SRR1424754.bw         152
## SRR1424755 SRR1424755.bw         224
## SRR1424756 SRR1424756.bw         224
##                                                                  characteristics
##                                                                  <CharacterList>
## SRR1424731 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424732 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424733   tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424735   tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424734   tissue: skeletal muscle,gender: male,exercise status: untrained,...
## ...                                                                          ...
## SRR1424752   tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424753   tissue: skeletal muscle,gender: male,exercise status: untrained,...
## SRR1424754 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424755 tissue: skeletal muscle,gender: female,exercise status: untrained,...
## SRR1424756 tissue: skeletal muscle,gender: female,exercise status: untrained,...

Information about the experimental condition of the samples can in this experiment be found in the last column (named characteristics) of the S4Vectors DataFrame object. From this list we can extract needed (eg. information about conditions that we need to be able to compare samples for differential gene expression) sample information using the sapply function.

sampleInfo <- sapply(colData(rse_gene)$characteristics, "[", c(2,4))

Note that we get a matrix back where the first row contain the gender information and the second contains information on which leg is sampled. By some simple commands we can convert this to vectors of factors that can be used to set up the linear model in the analysis of differential gene expression.

# sampleInfo <- factor(as.vector(sampleInfo))
gender <- unlist(strsplit(sampleInfo[1,], split = ": "))[c(FALSE, TRUE)]
part <- unlist(strsplit(sampleInfo[2,], split = ": "))[c(FALSE, TRUE)]
gender <- factor(gender)
part <- factor(make.names(part))

We can now add this information to the RangedSummarizedExperiments object so that we have all necessary data associated with the experiment in the same object. We also rescale the expession data from recount object to make sure it is useful for gene expression analysis at gene level.

colData(rse_gene)$gender <- gender
colData(rse_gene)$part <- part
rse_gene_scaled <- scale_counts(rse_gene)

counts <- assays(rse_gene_scaled)$counts
filter <- rowMeans(counts) >= 10

head(counts)
##                    SRR1424731 SRR1424732 SRR1424733 SRR1424735 SRR1424734
## ENSG00000000003.14        107        147        121        147        107
## ENSG00000000005.5           3          3          6          1          5
## ENSG00000000419.12        569        538        669        867        539
## ENSG00000000457.13        347        233        306        310        248
## ENSG00000000460.16        127         77        107         96         87
## ENSG00000000938.12         21         45        102         40         64
##                    SRR1424736 SRR1424737 SRR1424738 SRR1424739 SRR1424740
## ENSG00000000003.14        117        154        169        126         77
## ENSG00000000005.5           2          2          3          2          0
## ENSG00000000419.12        787        671        434        463        508
## ENSG00000000457.13        308        263        214        252        275
## ENSG00000000460.16         91         74         65         76         95
## ENSG00000000938.12         44         43         71         25         27
##                    SRR1424741 SRR1424743 SRR1424742 SRR1424745 SRR1424744
## ENSG00000000003.14        140         87         98        108        106
## ENSG00000000005.5           2          0          0          0          2
## ENSG00000000419.12        623        809        558        479        707
## ENSG00000000457.13        232        280        225        191        264
## ENSG00000000460.16         57         95         62         75         69
## ENSG00000000938.12         41         23         29         28         26
##                    SRR1424746 SRR1424747 SRR1424748 SRR1424749 SRR1424750
## ENSG00000000003.14        124        145         95        196        140
## ENSG00000000005.5           0         12         12         20          5
## ENSG00000000419.12        656        653        520        756        314
## ENSG00000000457.13        240        192        191        329        218
## ENSG00000000460.16         74         89         63        100         58
## ENSG00000000938.12         39         26         17         49         62
##                    SRR1424751 SRR1424752 SRR1424753 SRR1424754 SRR1424755
## ENSG00000000003.14         74         73        132        135        158
## ENSG00000000005.5          12          2          6          2         27
## ENSG00000000419.12        469        465        661        742        580
## ENSG00000000457.13        212        215        299        331        329
## ENSG00000000460.16         65         78         87         97         73
## ENSG00000000938.12         18        194         40         33         20
##                    SRR1424756
## ENSG00000000003.14        120
## ENSG00000000005.5          25
## ENSG00000000419.12        315
## ENSG00000000457.13        239
## ENSG00000000460.16         84
## ENSG00000000938.12          6
head(filter)
## ENSG00000000003.14  ENSG00000000005.5 ENSG00000000419.12 
##               TRUE              FALSE               TRUE 
## ENSG00000000457.13 ENSG00000000460.16 ENSG00000000938.12 
##               TRUE               TRUE               TRUE

We can now use the counts and the filter objects to create the input objects for the different differential expression packages.

0.2.5 - Using count data for detection of differential gene expression

## Build DGEList object
dge <- DGEList(counts = counts[filter, ])
dge$samples$part <- colData(rse_gene)$part
dge$samples$gender <- colData(rse_gene)$gender


## Calculate normalization factors
dge <- calcNormFactors(dge)

## Explore the data
mds_dge <- plotMDS(dge, labels = dge$samples$gender)

Here we can see that there is clear separation of samples corresponding to gender. We can create the same plot and look at whether there is any clustering by sampled leg.

plotMDS(dge, labels = dge$samples$part) 

A more elegant solution would be to take the values from the plotMDS function and instead do the actual plotting with ggplot and add both factors in one go.

mdsDf <- data.frame(cbind(x = mds_dge$x, y = mds_dge$y))
mdsDf$part <- dge$samples$part
mdsDf$gender <- dge$samples$gender
ggplot(mdsDf, aes(x, y, color = gender, shape = part)) + geom_point() + 
  ggtitle("MDS plot for SRR1424755") +
  xlab("Leading logFC, dim 1") + ylab("Leading logFC, dim 2")

0.2.6 - Detection of differential expression using edgeR

EdgeR is a popular approach for testing differential gene expression from RNA-seq data. To run an edgeR analysis one need two things, a count matrix and a design matrix. The estimateDisp estimates common, trended and genewise dispersion for the data. The plot shows the biological coefficient of variation that tells about the expected underlying variation among replicates.

design <- model.matrix(~part + gender)
y <- estimateDisp(dge, design, robust=TRUE)
plotBCV(y)

We can now test for differential gene expression using the edgeR

fit.edgeR <- glmFit(y, design, robust=TRUE)

0.2.7 - Detection of differential expression using limma

The limma package works with the dge object we created above, but prior to running limma we need to convert counts with the function voom, that returns log cpm valuse as well as precision weights. With this step the data is ready for linear modelling in limma in a framework assuming that values are normally distributed.

y.voom <- voom(dge, design, plot = TRUE)

fitLimma <- lmFit(y.voom, design)
fitLimma2 <- eBayes(fitLimma)

summary(decideTests(fitLimma2))
##        (Intercept) partright.leg gendermale
## Down          3852             0        368
## NotSig        1649         19371      18528
## Up           13870             0        475

0.2.8 - Detection of differential expression using DESeq2

A third option for identifying differentially expressed genes in to use the DESeq2 package.

desD <- DESeqDataSet(rse_gene_scaled, design = ~ part + gender)
## converting counts to integer mode
keep <- rowSums(counts(desD)) >= 10
desD <- desD[keep,]
desD <- DESeq(desD)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 194 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
resD <- results(desD)

resD_df <- as.data.frame(resD)

head(resD_df)
ggplot(resD_df, aes(log2FoldChange, -log10(padj), fill = abs(log2FoldChange) > 1))+
  geom_point(pch = 21) +
  labs(fill = "Log 2-fold change > 1",
       y = "-Log 10 of adjusted p-value")
## Warning: Removed 10888 rows containing missing values (geom_point).