0.1 What is mathematical statistics?

  • Classic statistics is not the only way to analyze your data
  • Different types of data analysis:
    • Bayesianism (many features but few samples, P >> N)
    • Frequentism (approximately equal numbers of features and samples, P ~ N)
    • Deep Learning (many more samples than features, P << N)
    • Which approach is appropriate depends on the number of observations

0.2 Frequentist statistics

  • Based on summary statistics (e.g. the sum of squared residuals); not recommended for genetic analysis
  • Parameters are fixed, data is random
  • Based on principle of maximum likelihood
    • Maximizing the probability of observing the data
    • Assumptions:
      • Large sample size
      • Gaussian distribution
      • Homoscedasticity
      • Uncorrelated errors
      • Convergence of covariance
  • Sensitive to outliers, which may make results non-significant (see the sketch after this list)
  • Use resampling (e.g. bootstrapping)
  • Always use non-parametric tests
  • Maximum likelihood cannot handle non-independent observations
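  • A minimal sketch (simulated data, not from the course material) of how a single outlier can weaken a parametric test, and how a rank-based test and bootstrap resampling behave more robustly:
set.seed(1)
x <- rnorm(30); y <- 0.5 * x + rnorm(30)
y_out <- y; y_out[30] <- y[30] + 15                # inject one extreme outlier
summary(lm(y ~ x))$coefficients["x", ]             # clear, significant slope
summary(lm(y_out ~ x))$coefficients["x", ]         # the outlier inflates the error and weakens the result
cor.test(x, y_out, method = "spearman")            # a rank-based (non-parametric) test is less affected
# Bootstrap resampling of the slope:
boot_slopes <- replicate(1000, {
  i <- sample(seq_along(x), replace = TRUE)
  coef(lm(y_out[i] ~ x[i]))[2]
})
quantile(boot_slopes, c(0.025, 0.975))             # empirical 95% interval for the slope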

0.3 Random effects and missing heritability

  • Allow individual-level slopes and intercepts
  • This is nothing other than Bayesian priors on the coefficients
  • R package: lme4, which contains the function lmer (see the formula sketch below)
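  • A minimal sketch of the lmer formula syntax, using the sleepstudy data that ships with lme4 (the same data as in the Tasks below); the term in parentheses specifies which coefficients are allowed to vary per individual:
library(lme4)
# Random intercept per Subject only:
fit_intercept <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
# Random intercept and random slope for Days per Subject:
fit_slope     <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)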

0.4 Bayesian Statistics

  • Parameters are random, data is fixed
  • One observation is enough to do Bayesian statistics
  • Bayesian statistics does not have problems with statistical power
  • Correction for multiple testing is done automatically
  • Forget about p-values in Bayesian statistics
  • Combine data and prior knowledge (see the sketch below)
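  • A minimal sketch (a toy example of my own) of combining prior knowledge with data: a conjugate Normal-Normal update of a mean with known standard deviations, using a single observation:
prior_mean <- 250; prior_sd <- 20     # prior knowledge, e.g. about a typical reaction time
obs <- 300; obs_sd <- 30              # a single new observation and its assumed noise level
post_precision <- 1 / prior_sd^2 + 1 / obs_sd^2
post_mean <- (prior_mean / prior_sd^2 + obs / obs_sd^2) / post_precision
post_sd <- sqrt(1 / post_precision)
c(post_mean = post_mean, post_sd = post_sd)   # posterior is pulled from the prior toward the data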

0.4.1 Bayesian linear models

  • R package: brms
    • Automatically makes assumptions about (default) priors
  • Bayesian statistics is computationally expensive
  • Lasso is Bayesian statistics (L1 regularization corresponds to a Laplace prior on the coefficients)
  • Dimensionality reduction:
    • PCA (Principal Component Analysis)
    • tSNE
    • To improve statistics: increase N or reduce P
    • Maximum likelihood cannot handle high dimensions (see the sketch below)
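  • A minimal sketch (simulated data) of why maximum likelihood breaks down when P >> N, and how dimensionality reduction with PCA makes the fit well-defined again:
set.seed(1)
N <- 20; P <- 100
X <- matrix(rnorm(N * P), nrow = N)
y <- X[, 1] - X[, 2] + rnorm(N)
fit_ml <- lm(y ~ X)
sum(is.na(coef(fit_ml)))                   # most coefficients are NA: the model is not identifiable
pcs <- prcomp(X, scale. = TRUE)$x[, 1:5]   # reduce P = 100 features to 5 principal components
summary(lm(y ~ pcs))                       # a well-defined model on the reduced data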

0.5 Tasks

library(lme4)
library(brms)
library(arm)
library(tidyverse)
  1. Ordinary least squares linear regression
  • Check how the Reaction (reaction time) of all individuals changed in response to sleep deprivation
head(sleepstudy, 20)
str(sleepstudy)
## 'data.frame':    180 obs. of  3 variables:
##  $ Reaction: num  250 259 251 321 357 ...
##  $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(lm(Reaction ~ Days, data = sleepstudy))
## 
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.848  -27.483    1.546   26.142  139.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  251.405      6.610  38.033  < 2e-16 ***
## Days          10.467      1.238   8.454 9.89e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.71 on 178 degrees of freedom
## Multiple R-squared:  0.2865, Adjusted R-squared:  0.2825 
## F-statistic: 71.46 on 1 and 178 DF,  p-value: 9.894e-15
ggplot(sleepstudy, aes(Days, Reaction))+
  geom_point()+
  geom_smooth(method = "lm")

  • We can see that there is an increasing trend, but with a lot of variation between days and individuals. Looking at the summary of the linear regression fit, we conclude that the slope is significantly different from zero, i.e. there is a statistically significant increasing relation between Reaction and Days.
  • To see how R calculates the confidence interval under the hood in ggplot:
plot(Reaction ~ Days, data = sleepstudy)
abline(lm(Reaction ~ Days, data = sleepstudy))
# 95% confidence interval of the fitted line, evaluated on a fine grid of Days
conf_interval <- predict(lm(Reaction ~ Days, data = sleepstudy),
                         newdata = data.frame(Days = seq(0, 9, by = 0.1)),
                         interval = "confidence", level = 0.95)
lines(seq(0, 9, by = 0.1), conf_interval[, 2], col = "blue", lty = 2)
lines(seq(0, 9, by = 0.1), conf_interval[, 3], col = "blue", lty = 2)
head(conf_interval)
  • Here “fit” is the predicted value at each Days point, while “lwr” and “upr” correspond to the lower and upper confidence interval boundaries.

  • We have a severe problem with the fitting above. Ordinary Least Squares Linear Regression assumes that all the observations (data points on the plot) are independent, which should result in uncorrelated, Gaussian distributed residuals. However, we know that the data points on the plot belong to 18 individuals, i.e. 10 points for each individual. In principle, we can fit a linear model for each individual separately:

ggplot(sleepstudy, aes(x = Days, y = Reaction)) +
    geom_smooth(method = "lm", level = 0.95) + geom_point() + facet_wrap(~Subject, nrow = 3, ncol = 6)

  • We can see that most of the individuals have an increasing Reaction profile, while some have a neutral or even decreasing profile. What does this mean and what can we do here? Did we capture all the variation in the data with our simple Ordinary Least Squares Linear Regression model?

  • When the observations (data points on the plot) are not independent, they should be modelled via a so-called Random Effects model (in terms of classical Frequentist statistics), which is nothing other than a prior distribution placed on the coefficients of the linear model within the Bayesian framework (we will come back to this later). Random Effects modelling is a part of so-called Mixed Models (Linear Mixed Models, Linear Mixed Effects Models).

  2. Linear mixed model
summary(lmer(Reaction ~ Days + (Days | Subject), sleepstudy))
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.09   24.740       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825  36.838
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138
  • Let us compare residual error between fixed effects (lm) and random effects (lmer) models:
sqrt(sum(residuals(lm(Reaction~Days,data=sleepstudy))^2)/178)
## [1] 47.71472
sqrt(sum(resid(lmer(Reaction ~ Days + (Days | Subject), sleepstudy))^2)/178)
## [1] 23.56936
  • The residual error decreased for the Random Effects model, meaning that we captured more phenotypic variation with the Random Effects model. Let us also compare AIC:
fit1<-lm(Reaction~Days,data=sleepstudy)
fit2<-lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML=FALSE)
anova(fit2,fit1)
  • Again we see a significant improvement of the modelling by introducing Random Effects. AIC and BIC are lower for the Random Effects Model, i.e. this model is more informative and explains more variation in the data by accounting for the grouping of the points into the 18 individuals. The information criteria can also be inspected directly, as sketched below.
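  • As a small addition, AIC() and BIC() are generic functions that accept both fits, so the information criteria can be compared directly (lower is better):
AIC(fit1, fit2)
BIC(fit1, fit2)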

  • Another strength of LMM is that it fits all individuals simultaneously but not independently, i.e. all fits “know” about each other. In this way, the slopes, intercepts and confidence intervals of the fits for each individual are influenced by their common statistics; this effect is called “shrinkage toward the mean”.
  • We see that LMM captures more variation in the data, but can we display it and see the shrinkage effect? Let us start with the overall/average fit:

lmerfit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
sims <- sim(lmerfit, n.sims = 10000)       # draw 10000 simulations of the model parameters (arm::sim)
fs <- fixef(sims)                          # matrix of simulated fixed effects, n.sims x 2
newavg <- data.frame(Days = 0:9)
Xmat <- model.matrix( ~ 1 + Days, data = newavg)
fitmat <- matrix(ncol = nrow(fs), nrow = nrow(newavg))
for (i in 1:nrow(fs)) { fitmat[,i] <- Xmat %*% as.matrix(fs)[i,] }   # average fitted line for each simulation
newavg$lower <- apply(fitmat, 1, quantile, prob=0.05)
newavg$median <- apply(fitmat, 1, quantile, prob=0.5)
newavg$upper <- apply(fitmat, 1, quantile, prob=0.95)
ggplot(sleepstudy, aes(x = Days, y = Reaction)) + geom_point() +
  geom_smooth(method="lm") + 
  geom_line(data = newavg, aes(y = median), size = 1) + 
  geom_line(data = newavg, aes(y = lower), lty = 2) + 
  geom_line(data = newavg, aes(y = upper), lty = 2)

  • We can see that the average/mean fit for the LMM/Random Effects Model (lmer, black line) is practically identical to the Fixed Effects Model (lm, blue line): the difference is hardly noticeable, they overlap pretty well. However, the confidence interval for the LMM (dashed lines) is wider than for the Fixed Effects fit (grey area). This difference is due to the fact that the Fixed Effects Model does not account for inter-individual variation, in contrast to the LMM which accounts for both population-wide and inter-individual variation.

  • What about slopes, intercepts and confidence intervals for each of the 18 individuals?

yhat <- fitted(sims, lmerfit)                                # simulated fitted values for every observation, including the Subject-level random effects
sleepstudy$lower <- apply(yhat, 1, quantile, prob=0.025)
sleepstudy$median <- apply(yhat, 1, quantile, prob=0.5)
sleepstudy$upper <- apply(yhat, 1, quantile, prob=0.975)
ggplot(sleepstudy, aes(x = Days, y = Reaction)) + 
  geom_smooth(method = "lm", level = 0.95) + geom_point() + 
  facet_wrap(~Subject, nrow = 3, ncol = 6) + 
  geom_line(data = sleepstudy, aes(y = median), size = 1) + 
  geom_line(data = sleepstudy, aes(y = lower), lty = 2) + 
  geom_line(data = sleepstudy, aes(y = upper), lty = 2)

  • Black solid and dashed lines correspond to the LMM fit while the blue solid line and the grey area depict the Fixed Effects Model. We can see that the individual LMM fits and their confidence intervals might be very different from the Fixed Effects (lm) Model. In other words, the individual fits are “shrunk” toward the common mean; all the fits help each other to stabilize the variance so that the model is not driven by extreme/outlying values. This leads to a more stable and correct fit. The shrinkage can also be seen numerically, as sketched below.
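  • A minimal sketch (my own addition) to quantify the shrinkage: compare the per-subject slopes from separate per-individual lm fits (lme4::lmList) with the subject-specific slopes extracted from the lmer fit above:
ind_fits <- lmList(Reaction ~ Days | Subject, data = sleepstudy)   # separate lm fit for each Subject
shrinkage <- data.frame(Subject     = rownames(coef(ind_fits)),
                        separate_lm = coef(ind_fits)[, "Days"],
                        lmm_shrunk  = coef(lmerfit)$Subject[, "Days"])
shrinkage   # the mixed-model slopes are pulled toward the common fixed-effect slope (10.47)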
  3. Maximum Likelihood vs. Bayesian Fitting
  • Before we move to the Bayesian Multilevel Models, let us briefly introduce the major differences between Frequentist and Bayesian Statistics.

  • Frequentist fitting used by LMM via lme4/lmer is based on the Maximum Likelihood principle:

\[\ y=\alpha+\beta x\] \[\ L(y) \sim e^{- \frac{(y - \alpha - \beta x)^2}{2 \sigma^2}} \]

\[\ \max_{\alpha, \beta, \sigma} L(y) \Rightarrow \hat{\alpha},\hat{\beta},\hat{\sigma} \]

  • Here, we maximize the likelihood L(y) of observing the data y, which is equivalent to minimizing the residuals of the model (the Ordinary Least Squares approach). Now ask yourself a rhetorical question: why should we maximize the probability of observing the data if we have already observed the data? The sketch below makes the maximization explicit.
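  • A minimal sketch (my own illustration, not part of the original material) of the Maximum Likelihood principle: minimize the negative Gaussian log-likelihood numerically with optim and compare the estimates with lm:
negloglik <- function(par, x, y) {
  alpha <- par[1]; beta <- par[2]; sigma <- exp(par[3])   # log-parameterization keeps sigma positive
  -sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))
}
start <- c(mean(sleepstudy$Reaction), 0, log(sd(sleepstudy$Reaction)))
opt <- optim(start, negloglik, x = sleepstudy$Days, y = sleepstudy$Reaction)
opt$par[1:2]                                  # alpha and beta from maximum likelihood
coef(lm(Reaction ~ Days, data = sleepstudy))  # essentially the same estimates from OLS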

  • Bayesian fitting is based on the Maximum Posterior Probability principle: we assume that the data is distributed with some (Normal in our case) likelihood L(y) and set prior assumptions on the parameters of the Linear Model.

\[\ y \sim N(\mu, \sigma) \hspace{0.4cm} \text{- Likelihood } L(y) \] \[\ \mu = \alpha + \beta x \] \[\ \alpha \sim N(\mu_{\alpha},\sigma_{\alpha}) \hspace{0.4cm} \text{- Prior on } \alpha \] \[\ \beta \sim N(\mu_{\beta},\sigma_{\beta}) \hspace{0.4cm} \text{- Prior on } \beta \]

\[\ P(\mu_{\alpha},\sigma_{\alpha},\mu_{\beta},\sigma_{\beta},\sigma) \sim L(y) * N(\mu_{\alpha},\sigma_{\alpha}) * N(\mu_{\beta},\sigma_{\beta}) \]

\[\ \max_{\mu_{\alpha}, \sigma_{\alpha}, \mu_{\beta}, \sigma_{\beta}, \sigma} \hspace{0.1cm} P(\mu_{\alpha}, \sigma_{\alpha}, \mu_{\beta}, \sigma_{\beta}, \sigma) \Rightarrow \hat{\mu_{\alpha}}, \hat{\sigma_{\alpha}}, \hat{\mu_{\beta}}, \hat{\sigma_{\beta}}, \hat{\sigma} \]

  • Here we calculate a probability distribution of the parameters (and not of the data) of the model, which automatically gives us uncertainties (Credible Intervals) on the parameters. The grid sketch below illustrates this for a single parameter.
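  • A minimal sketch (a toy example of my own) of the Bayesian idea for a single parameter: a grid approximation of the posterior for the slope beta under a Normal prior, keeping alpha and sigma fixed at roughly their OLS estimates for simplicity:
alpha_hat <- 251.4; sigma_hat <- 47.7          # taken from the lm() fit above
beta_grid <- seq(0, 20, length.out = 501)
loglik <- sapply(beta_grid, function(b)
  sum(dnorm(sleepstudy$Reaction, alpha_hat + b * sleepstudy$Days, sigma_hat, log = TRUE)))
logprior <- dnorm(beta_grid, mean = 0, sd = 5, log = TRUE)    # prior: beta ~ N(0, 5)
logpost <- loglik + logprior
post <- exp(logpost - max(logpost)); post <- post / sum(post) # normalized posterior on the grid
beta_grid[which.max(post)]                     # maximum a posteriori slope, pulled slightly toward the prior mean 0
sum(beta_grid * post)                          # posterior mean of the slope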
  4. Bayesian Multilevel Models
  • Linear Mixed Models (LMM) with Bayesian prior distributions applied to the parameters are called Bayesian Multilevel Models or Bayesian Hierarchical Models. To implement Bayesian fitting in R, here we will use the “brms” package, which has the same syntax as lme4/lmer. One important difference to remember is that fitting an LMM via lme4/lmer uses the Maximum Likelihood (ML) principle, i.e. it does not use prior assumptions about the parameters (or rather uses flat priors), while Bayesian Multilevel Models in brms set reasonable default priors which reflect the data. Another thing worth mentioning is that brms runs the probabilistic programming language Stan under the hood. Let us do Bayesian fitting with brms:
options(mc.cores = parallel::detectCores())   # run the MCMC chains in parallel
# Compiling the Stan model and sampling takes a while
brmfit <- brm(Reaction ~ Days + (Days | Subject), data = sleepstudy,
              family = gaussian, iter = 2000, chains = 4)
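  • Once sampling has finished, the posterior can be inspected with the standard brms accessors (a sketch of the usual next steps, assuming the brmfit object from the chunk above):
summary(brmfit)   # population-level and group-level estimates with 95% credible intervals
fixef(brmfit)     # posterior estimates of the Intercept and Days effects
plot(brmfit)      # posterior densities and MCMC trace plots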