In this tutorial, we will exemplify some model testing approaches using R. We will use nested multiple logistic regression models on the previously used GWAS data; since this use case borders on feature selection, we will also touch upon both frequentist and Bayesian approaches to feature selection.
library(lmtest)
library(covTest)
library(glmnet)
library(BAS)
We will re-use the data from the GWAS workshop, but for practice purposes analyze it using logistic regression.
We start by extracting the genotypes and phenotypes as data frames from the GenABEL data format. For simplicity we will assume a linear coding of genotypes and an additive association model; notice that this may or may not be the optimal choice. We save them in a joint data frame, mydata.
Several model comparison tests in R (lrtest, anova) require that the compared models are estimated on exactly the same data set; this means that variables with missing values, which are excluded during model fitting, will cause problems. For simplicity, we therefore exclude all rows with a missing value in any SNP.
We will make use of the 20 top-ranked SNPs from the results in the GWAS workshop. We save their names in the vector top_snps.
As this procedure may take some time, I have saved the resulting mydata and top_snps in the R-file mydata.R, which can be loaded into R directly. This file should be available in a download link on the course programme page. Download it, load it into R, and look at the first entries of mydata and top_snps.
myRfile="mydata.R" # change this to fit the location where you download the R-file
if(!file.exists(myRfile)){
#extract as data.frame
genot=as.data.frame(as.numeric(gtdata(genodata.qc0))) # genotypes, numerically coded 0/1/2
phenot=as.data.frame(genodata.qc0) # phenotype data (incl. pheno and sex)
top_snps=an@results[,"P1df", drop=FALSE] # single-df p-values from the GWAS scan
top_snps=data.frame(rownames(top_snps),top_snps)[order(top_snps$P1df),"rownames.top_snps."][1:20] # names of the 20 top-ranked SNPs
mydata=merge(genot[,names(genot) %in% top_snps], phenot, by="row.names")
row.names(mydata)=mydata$Row.names
mydata[,c("id","Row.names")]=NULL
# remove rows with missing values
mydata=mydata[complete.cases(mydata),]
save(mydata,top_snps, file=myRfile)
}else{
load(myRfile)
}
head(mydata)
head(top_snps)
## [1] rs12629868 rs6444131 rs4541401 rs10871764 rs11919115 rs2665749
## 727777 Levels: AFFX-SNP_10105880__rs3737108 ... SNP_A-2280133
We want to investigate how the top SNPs identified in the GWAS workshop can be combined into a multivariate model. Do we need all SNPs or just a subset? How can we evaluate which subset is best?
Logistic regression is an appropriate model when the outcome \(Y\) takes values that are either 0 or 1, and it is therefore typically used for case-control data, as in our case.
Formally, logistic regression is a generalized linear model (GLM) with a logit link function and a binomial error distribution. Here we do not model the outcome directly, but rather the log odds of the outcome:
\[\log \frac{Pr[y=1]}{1-Pr[y=1]}\sim\beta_0+ \beta_1 x_1 +\beta_2 x_2 + \ldots + \beta_n x_n\]
where \(\log \frac{Pr[y=1]}{1-Pr[y=1]}\) is the logit-function of \(Pr[y=1]\). More generally, we can write \[logit(Pr[Y]) \sim X\boldsymbol{\beta}\]
Even more formally, \(Y\) is assumed to be Binomially distributed around the success probability \(logit^{-1}(X\boldsymbol{\beta})\) (this is the error distribution).
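As a quick illustration of the link function, base R provides qlogis (the logit) and plogis (its inverse); the numbers in the sketch below are made up purely for illustration.

# logit and inverse logit in base R
p <- 0.75
qlogis(p)             # log(p / (1 - p)), the log odds
plogis(qlogis(p))     # back to the probability, 0.75
# a made-up linear predictor beta0 + beta1 * x mapped to Pr[y = 1]
beta0 <- -1; beta1 <- 0.5; x <- 2
plogis(beta0 + beta1 * x)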
R has a function glm that is the GLM analogue of the lm function for linear models. To get glm to fit a logistic regression, set the argument family=binomial(link=logit). Look up the help for glm and try to create a list of 20 GLMs that include an increasing number of SNPs from the top_snps list, with sex as a covariate. Print each GLM.
models = list()
for(n in 1:20){
models[[n]] = glm(as.formula(paste("pheno ~ sex + ", paste(top_snps[seq(1,n)], collapse=" + "))), data=mydata, family=binomial(link=logit))
print(paste0("M", n))
print(models[[n]]$formula)
}
## [1] "M1"
## pheno ~ sex + rs12629868
## [1] "M2"
## pheno ~ sex + rs12629868 + rs6444131
## [1] "M3"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401
## [1] "M4"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764
## [1] "M5"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115
## [1] "M6"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749
## [1] "M7"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834
## [1] "M8"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228
## [1] "M9"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513
## [1] "M10"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672
## [1] "M11"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686
## [1] "M12"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345
## [1] "M13"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133
## [1] "M14"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955
## [1] "M15"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955 +
## rs6909585
## [1] "M16"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955 +
## rs6909585 + rs2250795
## [1] "M17"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955 +
## rs6909585 + rs2250795 + rs9450749
## [1] "M18"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955 +
## rs6909585 + rs2250795 + rs9450749 + rs8031841
## [1] "M19"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955 +
## rs6909585 + rs2250795 + rs9450749 + rs8031841 + rs1279446
## [1] "M20"
## pheno ~ sex + rs12629868 + rs6444131 + rs4541401 + rs10871764 +
## rs11919115 + rs2665749 + rs2036834 + rs7912228 + rs10080513 +
## rs310672 + rs9450686 + rs6777345 + rs17768133 + rs13316955 +
## rs6909585 + rs2250795 + rs9450749 + rs8031841 + rs1279446 +
## rs7671868
We will start by comparing our models using a likelihood ratio test (LRT):
\[LR=\frac{mL[M_{i-1}|X,Y]}{mL[M_{i}|X,Y]}\]
where \(mL[M_i|X,Y]\) is the maximum likelihood of model \(M_i\). Recall that \(-2\log LR \sim \chi^2(\delta)\), where \(\delta\) is the difference in dimension (number of variables) between the models; this provides us with a significance test.
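To make the connection to the \(\chi^2\) distribution concrete, an LRT can also be computed by hand from two fitted models using logLik and pchisq; a minimal sketch, assuming the models list created above, which should agree with lrtest.

# manual LRT between M1 and M2: 2*(logL_big - logL_small) ~ chi^2(difference in df)
ll1 <- logLik(models[[1]])
ll2 <- logLik(models[[2]])
stat <- as.numeric(2 * (ll2 - ll1))          # test statistic
df   <- attr(ll2, "df") - attr(ll1, "df")    # difference in number of parameters
pchisq(stat, df = df, lower.tail = FALSE)    # p-value, should match lrtest(models[[1]], models[[2]])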
We will use the lrtest function in the lmtest package to perform likelihood ratio tests (LRTs) sequentially between models with an increasing number of variables. One way of doing this could be to loop through the different model comparisons, store the results in a data.frame, and then display them in a table. Try to implement this.
lrt=data.frame(models="1 variable", added_variable=top_snps[1],llprev=NA, llnew=NA, lr=NA, P = NA, significant=NA,stringsAsFactors=FALSE) #data.frame to store results in
for(i in seq(2,20)){
fit=lrtest(models[[i-1]], models[[i]]) # perform lrtest
p=fit$`Pr(>Chisq)`[2] # alias for p-value (for convenience of typing)
loglr=fit$LogLik[1]-fit$LogLik[2] # compute the log likelihood ratio
stars = ifelse(p > 0.05,"-",ifelse(p > 0.01,"*",ifelse(p > 0.005, "**","***"))) # significance level (named to avoid masking base signif())
# store result
lrt[i,] = list(paste0(i," variables"), top_snps[i], signif(fit$LogLik[1], 5), signif(fit$LogLik[2], 5), format(loglr, digits=4), format(p, digits=3, scientific=-1), stars)
}
lrt[order(as.numeric(lrt$P), na.last=T),]
However, we find that lrtest, in the spirit of R, can take more than two models as arguments, e.g., lrtest(models[[1]], models[[2]], models[[3]]). We can therefore perform exactly the sequential test we had in mind in one function call. How would you do that? (tip: look up the do.call function).
# Since we have fitted models in a list, we make use of the do.call function, which
# extracts the elements in the list and sends them as arguments to the lrtest call.
lrt2=do.call("lrtest", models)
lrt2
Moreover, there are other functions that can perform LRTs. The anova function, when presented with more than one model, performs an analysis of deviance test; since the difference in deviance equals the LRT statistic, this test is equivalent to an LRT. Try to perform an analysis of deviance using the anova function (tip: to get p-values, you need to pass the argument test="Chisq" to anova).
# If we want p-values, we need to pass the named argument ``test`` to anova; we therefore
# add this to our list of arguments (i.e., the list of models)
do.call("anova", c(models, list("test"="Chisq")))
The LRT only makes use of the relative likelihood when comparing models. Often the aim is also to find as small a model as possible, i.e., to balance the likelihood against having as few variables as possible. The Akaike Information Criterion (AIC) provides one such balance. The AIC for a model \(m\) is
\[AIC_m = 2\#(\boldsymbol{\beta}\neq 0)-2\log mL[\boldsymbol{\beta}|X,Y]\]
As we saw in the lecture, this is actually a penalized LRT. The aim is to select the model with minimum AIC.
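Since the AIC is just this penalty term minus twice the maximized log likelihood, we can verify the formula by hand for one of our fitted models; a small check, assuming the models list created above.

# AIC by hand for model M5: 2 * (number of estimated coefficients) - 2 * logLik
k  <- attr(logLik(models[[5]]), "df")   # number of estimated parameters
ll <- as.numeric(logLik(models[[5]]))
2 * k - 2 * ll                          # should equal AIC(models[[5]]) and models[[5]]$aic
AIC(models[[5]])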
We can compare the Akaike Information Criterion (AIC) between our models using the AIC function in the stats package. Try to look up the help for AIC and perform such an analysis (tip: for some unknown reason, do.call on AIC returns a garbled-up AIC return value, but for some other unknown reason, it suffices to set the row.names of this return value to NULL!?).
my.aic <- function(x) {
x <- do.call(AIC, x)
rownames(x) <- NULL
return(x)
}
aic=my.aic(models)
aic
If we look at the output of the glm function for each model, we see that glm actually computes the AIC, so we could just list these:
for(i in seq(1,20)){
print(paste0("M",i,": aic = ",models[[i]]$aic))
}
## [1] "M1: aic = 191.413826142946"
## [1] "M2: aic = 191.347340504348"
## [1] "M3: aic = 193.134436330627"
## [1] "M4: aic = 176.762372070646"
## [1] "M5: aic = 178.663456001373"
## [1] "M6: aic = 170.538222942918"
## [1] "M7: aic = 169.411055539535"
## [1] "M8: aic = 157.4572711016"
## [1] "M9: aic = 151.317844935567"
## [1] "M10: aic = 145.029287082904"
## [1] "M11: aic = 143.961602761146"
## [1] "M12: aic = 145.956326173936"
## [1] "M13: aic = 144.684882688072"
## [1] "M14: aic = 146.682762084756"
## [1] "M15: aic = 148.142792468267"
## [1] "M16: aic = 149.23617930652"
## [1] "M17: aic = 151.230782385327"
## [1] "M18: aic = 130.240122525795"
## [1] "M19: aic = 121.55817408892"
## [1] "M20: aic = 119.609696365098"
If we plot the log likelihoods (significant changes marked by *) and AICs, we see that the “best” model overall seems to be the one including all 20 SNPs, in that the addition of variable 20 improves likelihood significantly and this model has the overall minimal AIC.
However, we also notice that:
par(mfrow=c(1,2))
plot(lrt2$LogLik, pch=ifelse(lrt2$`Pr(>Chisq)`<0.05, 8,1))
legend("bottomright",legend=c('signif', 'not'), pch=c(8,1))
plot(aic$AIC)
The LASSO searches the model space and uses a regularization approach to select the best (nested) model. The optimization criterion can be written, e.g.:
\[\min_{\boldsymbol{\beta}}\left\{\frac{1}{N} ||Y-X\boldsymbol{\beta}||_2^2\right\} \textrm{ subject to } ||\boldsymbol{\beta}||_1\leq \lambda\]
i.e., least squares with a limit on the L1 norm of \(\boldsymbol{\beta}\) as an auxiliary criterion.
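For a binomial (logistic) model, the glmnet package used below solves the equivalent penalized (Lagrangian) form of this problem, with the negative binomial log likelihood in place of the least-squares term:

\[\min_{\boldsymbol{\beta}}\left\{-\frac{1}{N}\sum_{i=1}^N \log Pr[y_i|x_i,\boldsymbol{\beta}] + \lambda ||\boldsymbol{\beta}||_1\right\}\]

so each value of \(\lambda\) corresponds to one point along the constrained path above.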
Try to implement a LASSO analysis of our data. There are (at least) two packages that implement the LASSO, lars+covTest and glmnet. However, the lars and covTest implementation for GLMs appears to be unstable, so use glmnet instead. glmnet uses a cross-validation approach to select the optimal \(\lambda\) to use (\(\lambda_{min}\)). Look up the help page for glmnet and try to identify \(\lambda_{min}\) and the effect sizes \(\beta_i\) for the different variables \(i\) at \(\lambda_{min}\). Also plot the coefficient profile for \(\boldsymbol{\beta}\) along the lasso path of \(\lambda\), and the cross-validation curve used for estimating \(\lambda_{min}\) (tip: preferably use the deviance as the type.measure when estimating \(\lambda_{min}\), as some of the alternatives – class – give unstable results).
# glmnet wants input with predictors x and outcome y separated
# so let's extract them from my data.
x=as.matrix(mydata[,-which(names(mydata) == "pheno")], rownames=1)
y=mydata$pheno
# run the lasso analysis
glasso=glmnet(x,y,family="binomial")
# plot trace of effect sizes as variables are added with decreasing lambda
plot(glasso, xvar="norm",label=T)
# use cross validation to estimate optimal lambda
glasso_fit <- cv.glmnet(x, y, family="binomial", nfolds=50, type.measure="deviance")
plot(glasso_fit)
# extract the estimated coefficient under optimal lambda
coef=coef(glasso_fit, s = "lambda.min")
paste0("Optimal lambda = ", signif(glasso_fit$lambda.min, 5), " which selects a model with ",length(coef@i)," variables")
## [1] "Optimal lambda = 0.012919 which selects a model with 16 variables"
# tabulate the coefficients in decreasing order -- tricky extraction
# from the cv.glmnet class object!
effects=data.frame(snp=coef@Dimnames[[1]][coef@i+1], beta=coef@x)
effects[order(-abs(effects$beta)),]
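A common, slightly more conservative alternative to \(\lambda_{min}\) is the largest \(\lambda\) within one standard error of the minimum cross-validation error (lambda.1se), which typically selects fewer SNPs; a quick comparison, noting that the exact numbers will depend on the cross-validation folds.

# compare the model selected at lambda.min with the more parsimonious lambda.1se
coef_1se <- coef(glasso_fit, s = "lambda.1se")
paste0("lambda.1se = ", signif(glasso_fit$lambda.1se, 5),
       " which selects a model with ", length(coef_1se@i), " variables")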
Let’s also try a Bayesian approach to feature selection. The bas.glm function in the BAS package uses a Bayesian Adaptive Sampling algorithm (or MCMC) for variable selection in GLMs. It explores the landscape of possible models and assigns a posterior probability for each variable to be included in the model, given the data.
The posterior probability distribution over models \(M\) and their associated variable effect sizes \(\boldsymbol{\beta}\) is
\[Pr[M, \boldsymbol{\beta}|Y, X] = \frac{Pr[Y|M, \boldsymbol{\beta}, X]Pr[\boldsymbol{\beta}|M]Pr[M]}{Pr[Y|X]}\]
where \(Pr[\boldsymbol{\beta}|M]\) and \(Pr[M]\) are prior distributions over effect sizes and models, respectively.
The posterior inclusion probability for variable \(i\) is the probability that its effect size \(\beta_i \neq 0\) (remember that our nested models can be described in terms of a more general model by constraining some parameters, e.g., \(\beta_i\), to 0). This probability can be computed by averaging the event \(\beta_i \neq 0\) over the posterior distribution,
\[Pr[\beta_i \neq 0|Y,X] = \sum_{M}\int_{\boldsymbol{\beta}} I(\beta_i\neq 0)Pr[M, \boldsymbol{\beta}|Y, X] d\boldsymbol{\beta} dM\]
where the indicator function \(I(\beta_i\neq 0)=1\) if \(\beta_i\neq 0\) and 0 otherwise.
bas.glm provides a few choices of \(Pr[\boldsymbol{\beta}|M]\) (including AIC- and BIC-based priors) and \(Pr[M]\). Read the help page for bas.glm and try to perform a Bayesian variable selection using bas.glm (tip: as bas.glm takes some time searching through the model/variable space, it might be a good idea to reduce the number of possible variables in the GLM by, say, only using the top 5 SNPs in top_snps while trying out your code – once your code is stable, you can apply it to the full model).
nvars = 20 # change to a lower value (e.g. 5) when testing your code
f = as.formula(paste("pheno ~ sex + ", paste(top_snps[seq(1,nvars)], collapse=" + ")))

# this search may take a few minutes for the full 20-SNP model
bayes=bas.glm(f, data=mydata, family=binomial(link=logit), betaprior=aic.prior(), modelprior=uniform())
bayes
bayes$n.vars
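Once the sampler has finished, the posterior inclusion probabilities \(Pr[\beta_i \neq 0|Y,X]\) can be inspected directly; a sketch assuming the probne0 and namesx components of the fitted bas object (check str(bayes) if these differ in your version of BAS).

# posterior inclusion probabilities, highest first
# (probne0 is assumed to hold Pr[beta_i != 0 | Y, X]; namesx the variable names)
pip <- data.frame(variable = bayes$namesx, prob = bayes$probne0)
pip[order(-pip$prob), ]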
The following exercises highlight parts of the wide field of experimental design. Specifically, it will target power estimation and evaluation of statistical methods. The two parts overlap but can be completed independently of each other. There is also an open-ended challenge at the end for those who want to work more freely.
Ideally, there would be software packages that accurately estimate the required number of samples for your planned experiment. However, such methods are hard to generalize, and for a specific experiment one always has to take the characteristics of that specific data into account and look at what the established practice is within the field. If you have examples of packages that have helped you in your work, feel free to share them with the group.
set.seed(42)
Power estimation is a key concept in experimental design. In an ideal world, researchers would have a good grasp of the variability in their data, expected sample sizes and a clear understanding of their planned statistical analysis. Then estimating the number of samples required to capture the effect would be straightforward. For many classical statistical methods this can easily be done through R (or by hand), both via native methods and packages e.g. pwr. For more complex methods that e.g. use empirical Bayesian estimators that depend on all the data in a high-dimensional dataset, these calculations are not as simple and can primarily be done via simulation, more on that in the second task. For this first tutorial we will stick with the basics. For each task, try to figure out your own solution but if needed click the code button for a solution.
First a short reminder on the statistical terminology. For null-hypotheses testing one typically considers two kinds of errors, Type-I and Type-II, formulated as,
Type-I: Rejecting the null-hypothesis when the null-hypothesis is true
Type-II: Failing to reject the null-hypothesis when the null-hypothesis is false
These are also commonly referred to as false positives and false negatives. Statistical power is defined as the probability to reject the null-hypotheses when the null-hypotheses is false. That is, the probability to avoid a type-II error given that an effect exist. Estimating the power of a test requires making assumptions on what the data looks like (primarily the effect size and variance) and setting a rejection level for the test (often called α or significance cutoff). Thus, while we would like to “know” the power of our planned experiment, it will always be conditioned on these assumptions.
First consider the power for a standard two-sample t-test. Assume you are going to run an RNAseq experiment where one group of mice has received a candidate drug while the second group received placebo. Also assume that you have decided to use a t-test to test the difference for a specific gene (a single gene, to avoid multiple-testing considerations for now). You believe that the difference should be at least 2 units and you want the power to detect this difference to be 0.9. Calculate the number of samples needed, assuming a standard deviation of 1.3 and an alpha of 0.05 (hint: power.t.test()).
power.t.test(delta = 2, power = 0.9, sd = 1.3)
##
## Two-sample t test power calculation
##
## n = 9.945326
## delta = 2
## sd = 1.3
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
##
## NOTE: n is number in *each* group
At least 10 samples per group are needed to achieve the desired power
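The pwr package mentioned earlier gives the same answer if the effect is expressed as Cohen's d (the difference divided by the standard deviation); a quick cross-check, assuming pwr is installed.

# same calculation with the pwr package; Cohen's d = delta / sd
library(pwr)
pwr.t.test(d = 2 / 1.3, power = 0.9, sig.level = 0.05, type = "two.sample")
# n is again just under 10 per group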
Unfortunately, while you sequenced the required number of samples, other experiments have shown that the effect size could be as small as 0.25 units. What power would your study result in?
power.t.test(delta=0.25, n=10, sd=1.3,type="two.sample",strict = TRUE)
##
## Two-sample t test power calculation
##
## n = 10
## delta = 0.25
## sd = 1.3
## sig.level = 0.05
## power = 0.06923484
## alternative = two.sided
##
## NOTE: n is number in *each* group
Power is approximately 0.07
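As a sanity check, this power can also be approximated by hand with a simple normal approximation of the two-sample test; it ignores the heavier tails of the t distribution, so it is only approximate, but it lands close to the value above.

# normal approximation of two-sample power: the standardized true difference is
# delta / (sd * sqrt(2/n)); we reject when |Z| exceeds the alpha/2 quantile
approx_power <- function(n, delta, sd, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  ncp <- delta / (sd * sqrt(2 / n))
  pnorm(ncp - z) + pnorm(-ncp - z)
}
approx_power(n = 10, delta = 0.25, sd = 1.3)  # ca 0.07, in line with power.t.test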
Now, to get a deeper understanding of power, we will do a small simulation study. Begin by setting up the code necessary to simulate data with the same parameters as in task 2. Note that you will need to draw two vectors of random numbers from the normal distribution, one without an effect and one with.
Apply the two-sample t-test to repeated draws of your simulated data, does the number of significant outcomes match the estimated power?
nIterations <- 10000
result <- rep(FALSE,nIterations)
for (i in 1:nIterations) {
g1 <- rnorm(10,0,sd=1.3)
g2 <- rnorm(10,0.25,sd=1.3)
result[i] <- t.test(g1,g2)$p.value < 0.05 # run the test, extract the p-value and
# check if it is smaller than 0.05
}
mean(result)
## [1] 0.0651
In classical statistics one generally speaks of Type I and Type II errors. However, these are not the only kinds of errors possible. Now we will consider Type-M and Type-S errors. M stands for magnitude, and the error captures the overestimation of the effect size. S stands for sign, and captures the error of getting an effect of the wrong sign. Note that these definitions are not as widespread, but they give useful insight into the errors that are likely to occur during testing.
Now, extend the simulation in task 3 to estimate the Type-M and Type-S error. That is, for each set of data that results in a significant outcome, save the estimated difference between the groups. Then calculate how often the estimated difference had the wrong sign, and how many times larger the estimated effect was than the real effect.
nIterations <- 10000
result <- c() # As we have an unknown number of significant results we cannot
counter <- 0 # pre-allocate the whole vector and need a counter
for (i in 1:nIterations) {
g1 <- rnorm(10,0,sd=1.3)
g2 <- rnorm(10,0.25,sd=1.3)
outcome <- t.test(g1,g2)
if (outcome$p.value < 0.05) { # extract the p-value and check that it is
# smaller than 0.05
counter <- counter + 1
result[counter] <- mean(outcome$conf.int) # quick way of getting the estimated
# effect size, could also be done by
# simply taking the difference in
# means between the groups.
}
}
# Note that type-M and type-S errors are not strictly defined, but rather
# useful concepts to keep in mind.
typeM <- mean(result[result <0])/-0.25 # The average effect for the significant
# results is ca 5 times as large as the
# true effect.
typeS <- sum(result >0)/length(result) # ca 13% of the significant tests have
# effects with the wrong sign.
paste0("Type M error = ", typeM)
## [1] "Type M error = 5.38553682958715"
paste0("Type S error = ", typeS)
## [1] "Type S error = 0.125588697017268"
Do these results seem surprising to you? The setup for this exercise is a bit extreme; a power of 0.06 is really quite low. However, with noisy data (as most omics data are) combined with small sample sizes, it is still a result to be concerned about.
This final exercise was inspired by the work of Andrew Gelman, specifically this blog post (Bonus point to anyone who figures out why his theoretical results do not match my simulation example). All of his blog is highly recommended for an insight into recent discussions on sound statistics.