Problems compared to bulk RNAseq:
* Transcriptional bursting
* Amplification bias
* Background noise
There are multiple packages dedicated to scRNAseq analysis. If you want to try out more exercises, there are additional tutorials from our 3-day course at:
https://nbisweden.github.io/workshop-scRNAseq/
In this workshop, you have the chance to test two commonly used pipelines:
The Seurat package, which implements all steps of scRNAseq analysis, from QC to clustering and differential expression.
The Scater package together with SC3. Scater works with the SingleCellExperiment (SCE) class and is well suited for QC. Building on the same SCE class, SC3 can be used for clustering.
We suggest that you start with one of these pipelines and see how far you get.
In all of the exercises, you will be working on a dataset with Smart-seq2 data from human tonsil innate lymphoid cells (ILCs) from Björklund et al., 2016. Since this dataset has been index-sorted with FACS, we already have a good idea of what the main cell types should be, so it is a good example dataset for evaluating the clustering of our data.
All data can be accessed here.
The ILCs can be subdivided into 3 main subpopulations: ILC1, ILC2 and ILC3. Natural killer (NK) cells were also included in the sorting. Samples were taken from 3 individual donors and sorted and prepared in 3 different batches on multiple plates.
The initial plates were gated using known surface markers for the ILCs and NK cells. As ILC3 is the most abundant population among the ILCs, these plates mainly consist of ILC3s. To increase the representation of the other two subtypes, some plates were sorted with specific gating for ILC1s and ILC2s.
Below is a representation of the abundance of the 4 different celltypes on each plate.
In this dataset there is clearly a batch effect with regards to the different donors and sort occasions. There is also a clear plate effect, but since the cell-type composition differs between plates, there is not much we can do to distinguish the plate batch effect from the biological variation.
library(Seurat)
library(gridExtra)
For the tutorial below, we will use the human tonsil innate lymphoid cell (ILC) example dataset from Björklund et al., 2016, but you can also run it with your own data. Keep in mind that some functions assume the count data are UMIs, whereas here we run them on raw counts from Smart-seq2.
Load data
For running Seurat we need the metadata table, the count matrix and a file with gene name translations.
# read in the count matrix and the metadata table
C <- read.table("single_cell_ilc/ensembl_countvalues_ILC.csv",sep=",",header=T,row.names=1)
M <- read.table("single_cell_ilc/Metadata_ILC.csv",sep=",",header=T,row.names=1)
# in this case it may be wise to translate ensembl IDs to gene names
# to make plots with genes more understandable
# Read in gene name translations.
TR <- read.table("single_cell_ilc/gene_name_translation_biotype.tab",sep="\t")
# find the correct entries in TR and merge ensembl name and gene id.
m <- match(rownames(C),TR$ensembl_gene_id)
newnames <- apply(cbind(as.vector(TR$external_gene_name)[m],rownames(C)),1,paste,collapse=":")
rownames(C)<-newnames
head(C)
head(M)
head(TR)
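Not every Ensembl ID necessarily has an entry in the translation table; a quick way to check how many genes could not be translated (a small sketch reusing the m vector created above):
# number of Ensembl IDs without a matching gene name
# (these rows end up with "NA:" prefixes in the new rownames)
sum(is.na(m))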
Create Seurat object
Seurat will automatically filter out genes/cells that do not meet the criteria specified, to save space. The cutoffs are defined with min.cells and min.genes. In this case the cells are already filtered, but all genes that are not expressed with >1 count in at least 3 cells (min.cells) will be removed.
# in seurat we will not make use of the spike-ins, so remove them from the expression matrix before creating the Seurat object.
ercc <- grep("ERCC_",rownames(C))
# Create the Seurat object.
data <- CreateSeuratObject(raw.data = C[-ercc,],
min.cells = 3, min.genes = 200,
project = "ILC", is.expr=1, meta.data=M)
The Seurat object you have created now contains several slots with different information.
slotNames(data)
## [1] "raw.data" "data" "scale.data" "var.genes"
## [5] "is.expr" "ident" "meta.data" "project.name"
## [9] "dr" "assay" "hvg.info" "imputed"
## [13] "cell.names" "cluster.tree" "snn" "calc.params"
## [17] "kmeans" "spatial" "misc" "version"
Some of the most important ones to know are: raw.data (the counts the object was created from), data (normalized expression values), scale.data (scaled and centered expression values), var.genes (the selected variable genes), ident (the current grouping/identity of each cell), meta.data (per-cell metadata) and dr (dimensionality reduction results such as PCA and tSNE).
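As a quick check of what went into the object, you can inspect some of these slots directly (a minimal sketch; the slot names are those reported by slotNames() above):
# dimensions of the count matrix stored in the object (genes x cells)
dim(data@raw.data)
# per-cell metadata, including the columns from M plus nGene and nUMI
head(data@meta.data)
# the current identity (grouping) of the first few cells
head(data@ident)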
There are multiple functions in Seurat to visualize QC-data for your dataset. Below are some examples:
Violin plot for number of detected genes per cell.
# plot number of genes and nUMI (rpkms in this case) for each Donor
VlnPlot(object = data, features.plot = c("nGene", "nUMI"), nCol = 2, group.by = "Donor")
# same for celltype
VlnPlot(object = data, features.plot = c("nGene", "nUMI"), nCol = 2, group.by = "Celltype")
Scatterplot of detected genes vs counts (the axis will say nUMI, but since this is not UMI data it shows total counts).
GenePlot(object = data, gene1 = "nUMI", gene2 = "nGene")
The slot data@ident defines the grouping of cells, which is automatically set to Donor in this case. To instead color by celltype, data@ident needs to be changed.
data <- SetAllIdent(object = data, id = "Celltype")
GenePlot(object = data, gene1 = "nUMI", gene2 = "nGene")
# change ident back to Donor
data <- SetAllIdent(object = data, id = "Donor")
OBS! Each time you want to change the coloring in a gene plot, you need to change the identity class value in the Seurat object in the slot data@ident. Perhaps there is a better way, but I did not find a solution.
In many of the other Seurat plotting functions like TSNEPlot(), PCAPlot() and VlnPlot(), you can use group.by to define which meta data variable the cells should be coloured or grouped by.
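If you find yourself switching data@ident back and forth just for plotting, one option is a small helper that sets the identity on a local copy of the object before plotting. This is not part of Seurat; plot_by below is a hypothetical convenience wrapper around GenePlot():
# hypothetical helper: color a GenePlot by any metadata column.
# SetAllIdent only changes the local copy of the object inside the
# function, so data@ident in your workspace is left untouched.
plot_by <- function(object, id, gene1 = "nUMI", gene2 = "nGene") {
  object <- SetAllIdent(object = object, id = id)
  GenePlot(object = object, gene1 = gene1, gene2 = gene2)
}
# usage: color by Celltype without touching data@ident
plot_by(data, "Celltype")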
The next step is to normalize the data, detect variable genes and scale the data.
data <- NormalizeData(object = data, normalization.method = "LogNormalize")
You should look at the plot to find suitable cutoffs for your specific dataset and then rerun the command. You can define the lower/upper bound of mean expression with x.low.cutoff/x.high.cutoff and the limit of dispersion with y.cutoff.
data <- FindVariableGenes(object = data, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.5, x.high.cutoff = 10, y.cutoff = 0.5)
length(x = data@var.genes)
## [1] 838
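The statistics behind the selection are stored in the object, so you can check that the cutoffs behave as expected (a small sketch using the hvg.info and var.genes slots listed earlier):
# mean, dispersion and scaled dispersion per gene, ordered by dispersion
head(data@hvg.info)
# the first few genes that passed the cutoffs
head(data@var.genes)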
ScaleData() scales and centers genes in the dataset. The scaling step can also be used to regress out unwanted confounders.
It is quite common to regress out the number of detected genes (nGene), which quite often drives variation in your data due to library quality.
We will also run one version of scaling where we include the Donor batch information and compare.
data <- ScaleData(object = data, vars.to.regress = c("nGene"), display.progress=F)
# also with batch info + detected genes.
dataB <- ScaleData(object = data, vars.to.regress = c("nGene","Donor"), display.progress=F)
Run PCA using the variable genes. The PCA is stored in the slot dr.
# run PCA for both objects
data <- RunPCA(object = data, pc.genes = data@var.genes, do.print = FALSE)
dataB <- RunPCA(object = dataB, pc.genes = data@var.genes, do.print = FALSE)
VizPCA(object = data, pcs.use = 1:4)
Plot the PCA for both objects. To put multiple plots in the same window, do.return=T is used to return the ggplot2 objects, which are then plotted together with grid.arrange().
p1 <- PCAPlot(object = data, dim.1 = 1, dim.2 = 2, do.return=T, plot.title = "Uncorrected")
p2 <- PCAPlot(object = dataB, dim.1 = 1, dim.2 = 2, do.return=T, plot.title = "Batch corrected")
# and with both color by Celltype, here you can use group.by
p3 <- PCAPlot(object = data, dim.1 = 1, dim.2 = 2, do.return=T,group.by="Celltype", plot.title = "Uncorrected")
p4 <- PCAPlot(object = dataB, dim.1 = 1, dim.2 = 2, do.return=T,group.by="Celltype", plot.title = "Batch corrected")
# plot together
grid.arrange(p1,p2,p3,p4,ncol=2)
As you can see, the batch effect is not as strong in this PCA as it was in the PCAs we made in other labs, so the batch-corrected PCA plot looks quite similar to the uncorrected one.
This is mainly because we are only using the top variable genes, and the batch effect seems to be seen mainly among genes that are not highly variable.
Still, if you know you have a clear batch effect, it is a good idea to remove it with regression. So from now on, we will continue with the dataB object.
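If you want to check more directly how strongly donor is associated with the leading components, one option is to test each PC against the Donor label. This is not part of the tutorial; the sketch below assumes the PCA cell embeddings are stored in dataB@dr$pca@cell.embeddings, as in Seurat v2.
# per-PC R-squared of a linear model of the embeddings against Donor;
# a large value indicates a donor-driven component
pcs <- dataB@dr$pca@cell.embeddings[, 1:7]
donor <- factor(dataB@meta.data$Donor)
pc.r2 <- apply(pcs, 2, function(x) summary(lm(x ~ donor))$r.squared)
round(pc.r2, 3)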
OBS! The margins are too large to display well in RStudio, so save to a pdf instead.
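For example, to write the first heatmap below to a file instead of plotting it in the RStudio graphics device (a minimal sketch; the file name is just an example):
# open a pdf device, draw the heatmap, then close the device
pdf("pc1_heatmap.pdf", width = 7, height = 7)
PCHeatmap(object = dataB, pc.use = 1, do.balanced = TRUE, label.columns = FALSE)
dev.off()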
PCHeatmap(object = dataB, pc.use = 1, do.balanced = TRUE, label.columns = FALSE)
## Warning in heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col =
## col.use, : Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting
## row dendogram.
## Warning in heatmap.2(data.use, Rowv = NA, Colv = NA, trace = "none", col
## = col.use, : Discrepancy: Colv is FALSE, while dendrogram is `column'.
## Omitting column dendogram.
## Warning in plot.window(...): "dimTitle" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "dimTitle" is not a graphical parameter
## Warning in title(...): "dimTitle" is not a graphical parameter
PCHeatmap(object = dataB, pc.use = 1:5, do.balanced = TRUE, label.columns = FALSE)
Now we use the JackStraw function to check which of the principal components are significant. If the dataset is large, you can instead use PCElbowPlot().
As a default, JackStraw is only run on the first 20 PCs, if you want to include more PCs in your tSNE and clustering, run JackStraw with num.pc=30 or similar.
dataB <- JackStraw(object = dataB, num.replicate = 100,display.progress =F)
JackStrawPlot(object = dataB, PCs = 1:12)
## Warning: Removed 8096 rows containing missing values (geom_point).
## An object of class seurat in project ILC
## 24995 genes across 648 samples.
In this case, only PCs 1-7 are significant, so we will only use those in subsequent steps.
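As mentioned above, for a large dataset the elbow plot is a quicker (though more approximate) alternative for choosing the number of PCs; a minimal sketch:
# plot the standard deviation of each PC and pick the PCs before the "elbow"
PCElbowPlot(object = dataB, num.pc = 20)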
In this case, we use the PCs as suggested by the JackStrawPlot(). FindClusters() constructs a SNN-graph based on distances in PCA space using the defined principal components. This graph is split into clusters using modularity optimization techniques.
You can tweak the clustering with the resolution parameter to get more/less clusters and also with parameters k and k.scale for the construction of the graph.
OBS! Any function that depends on random start positions, like the KNN graph and tSNE, will not give identical results each time you run it. So it is advised to set the random seed with the set.seed() function before running the function.
use.pcs <- 1:7
set.seed(1)
dataB <- FindClusters(object = dataB, reduction.type = "pca", dims.use = use.pcs, resolution = 0.6, print.output = 0, save.SNN = TRUE)

PrintFindClustersParams(object = dataB)
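Since the cells were index-sorted, the FACS-based labels give a convenient sanity check of the clustering. A small sketch, assuming FindClusters above has been run so that dataB@ident now holds the cluster assignments:
# cross-tabulate the clusters against the sorted cell types
table(dataB@ident, dataB@meta.data$Celltype)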