RaukR 2018 Workshop - Spatial statistics

0.1 Spatial data

Types of spatial data:

Areal data (percentages over an area)
Geostatistical data (usually over a larger area)
Point patterns (locations)

Disease risk predictions are based on the observed disease cases, number of individuals at risk, and risk factor information such as demographic and environmental factors

0.1.1 Areal data

Disease risk is often estimated by the standardized mortality ratio:

\[\ SMR = \frac{E}{Y}\]

Y = number of observed cases
E = number of expected cases if the study population had the same disease rate as the standard population
If SMR is > 1: more cases observed than expected
SMR may be misleading in cases with small populations. A model is therefore used
Expected cases is calculated by using indirect standardization:

\[ E = \sum_{j = 1}^m r_j^{(s)}n_j \] $r_j^{(s)}$ = number of events / number of individuals at risk $n_j$ = population in stratum j of observed population

Model to estimate disease risk $ _i$ in areas $i = 1, ..., n$

\[\ Y_i | P(x_i) \sim Po(E_i \times \theta_i)\] \[log(\theta_i) = z^{'}_i\beta + u_i + v_i\]

0.2 Geostatistical data

\[\ Y_i | P(x_i) \sim Binom(N_i, P(x_i))\] \[log(P(x_i)) = z^{'}_i\beta + S(x_i) + v_i\] Coordinate reference system

Unprojected/geographic (Lat/Lon)
Projected: Easting/Northing for referencing location on 2-dim representation of earth (UTM projection)

0.3 Workshop

Packages:

#install.packages(c("leaflet", "geoR", "rgdal", "raster", "sp", "spdep", "SpatialEpi", "SpatialEpiApp"))

#install.packages("INLA", repos = "https://inla.r-inla-download.org/R/stable", dep = TRUE)

library(tidyverse)
library(SpatialEpi)
library(sp)
library(leaflet)
library(spdep)
library(INLA)
library(geoR)
library(raster)

data(gambia)
data(pennLC)

0.3.1 Areal data

0.3.1.1 Data preparation

class(pennLC)

## [1] "list"

names(pennLC)

## [1] "geo"             "data"            "smoking"         "spatial.polygon"

We see that pennLC is a list object with the following elements: * geo: a table of county ids, and the longitude and latitude of the centroid of each county, * data: a table of county ids, number of cases, population and strata information, * smoking: a table of county ids and proportion of smokers, * spatial.polygon: a SpatialPolygons object.

We inspect the first rows of pennLC$data and pennLC$smoking. pennLC$data contains the number of lung cancer cases and the population at county level, stratified on race (white and non-white), gender (female and male) and age (under 40, 40-59, 60-69 and 70+).

head(pennLC$data)

head(pennLC$smoking)

The map of Pennsylvania counties is given by the SpatialPolygons object called pennLC$spatial.polygon. We can plot the map as follows:

map <- pennLC$spatial.polygon
plot(map)

We now create a data frame called d with columns containing the counties ids, the observed and expected number of cases, the smoking proportions and the SMRs. Specifically, d will contain the following columns: * county: id of each county, * Y: observed number of cases in each county, * E: expected number of cases in each county, * smoking: smoking proportion in each county, * SMR: SMR of each county.

d <- group_by(pennLC$data, county) %>% summarize(Y = sum(cases))
head(d)

Now we calculate the expected number of cases in each county using indirect standardization. The expected counts in each county represent the total number of disease cases one would expect if the population in the county behaved the way the Pennsylvania population behaves. We can do this by using the expected() function of the SpatialEpi package. This function has three arguments, namely, * population: a vector of population counts for each strata in each area, * cases: a vector with the number of cases for each strata in each area, * n.strata: number of strata considered. Vectors population and cases have to be sorted by area first and then, within each area, the counts for all strata need to be listed in the same order. All strata need to be included in the vectors, including strata with 0 cases.

In order to obtain the expected counts we first sort the data using the order() function where we specify the order as county, race, gender and finally age.

pennLC$data <- pennLC$data[order(pennLC$data$county, pennLC$data$race, pennLC$data$gender, pennLC$data$age), ]

Then, we obtain the expected counts E in each county by calling the expected() function where we set population equal to pennLC$data$population and cases equal to pennLC$data$cases. There are 2 races, 2 genders and 4 age groups for each county, so number of strata is set to 2 x 2 x 4 = 16.

E <- expected(population = pennLC$data$population, cases = pennLC$data$cases, n.strata = 16)

Now we add the vector E to the data frame d which contains the counties ids (county) and the observed counts (Y) making sure the E elements correspond to the counties in d$county in the same order. To do that, we use match() to calculate the vector of the positions that match d$county in unique(pennLC$data$county) which are the corresponding counties of E. Then we rearrange E using that vector.

d$E <- E[match(d$county, unique(pennLC$data$county))]
head(d)

We also add to d the variable smoking which represents the proportion of smokers in each county. We add this variable using the merge() function where we specify county as the column for merging.

d <- merge(d, pennLC$smoking, by = "county")

Finally, we compute the vector of SMRs as the ratio of the observed to the expected counts, and add it to the data frame d.

d$SMR <- d$Y/d$E

The final dataset is this:

head(d)

We have constructed a data frame d containing the observed and expected disease counts, the smokers proportions and the SMR for each of the counties. The map of Pennsylvania counties is given by the SpatialPolygons object called map. Using this object and the data frame d we can create a SpatialPolygonsDataFrame that will allow us to make maps of the variables in d. In order to do that, we first set the row names of the data frame d equal to d$county. Then we merge the SpatialPolygons map and the data frame d matching the SpatialPolygons member Polygons ID slot values with the data frame row names (match.ID = TRUE).

rownames(d) <- d$county
map <- SpatialPolygonsDataFrame(map, d, match.ID = TRUE)
head(map@data)

0.3.1.2 Mapping SMR

We can visualize the observed and expected disease counts, the SMRs, as well as the smokers proportions in an interactive chropleth map. We construct this map using the leaflet package.

We create the map by first calling leaflet() and adding the default OpenStreetMap map tiles to the map with addTiles(). Then we add the Pennsylvania counties with addPolygons() where we specify the areas boundaries color (color) and the stroke width (weight). We fill the areas with the colours given by the color palette function generated with colorNumeric(), and set fillOpacity to a value less than 1 to be able to see the background map. We use colorNumeric() to create a color palette function that maps data values to colors according to a given palette. We create the function using the two following parameters: * palette: color function that values will be mapped to, and * domain: possible values that can be mapped.

Finally, we add the legend by specifying the color palette function (pal) and the values used to generate colors from the palette function (values). We set opacity to the same value as the opacity in the areas, and specify a title and a position for the legend.

l <- leaflet(map) %>% addTiles()

pal <- colorNumeric(palette = "YlOrRd", domain = map$SMR)

l %>% addPolygons(color = "grey", weight = 1, fillColor = ~pal(SMR), fillOpacity = 0.5) %>%
  addLegend(pal = pal, values = ~SMR, opacity = 0.5, title = "SMR", position = "bottomright")

We can improve the map by highlighting the counties when the mouse hovers over them, and showing information about the observed and expected counts, SMRs, and smoking proportions. We do this by adding the arguments highlightOptions, label and labelOptions to addPolygons(). We choose to highlight the areas using a bigger stroke width (highlightOptions(weight = 4)). We create the labels using HTML syntax. First, we create the text to be shown using the function sprintf() which returns a character vector containing a formatted combination of text and variable values and then applying htmltools::HTML() which marks the text as HTML. In labelOptions we specify the labels style, textsize, and direction. Possible values for direction are left, right and auto and this specifies the direction the label displays in relation to the marker. We choose auto so the optimal direction will be chosen depending on the position of the marker.

labels <- sprintf("<strong>%s</strong><br/>Observed: %s <br/>Expected: %s <br/>Smokers proportion: %s <br/>SMR: %s",
  map$county, map$Y,  round(map$E, 2), map$smoking, round(map$SMR, 2)) %>%
  lapply(htmltools::HTML)

l %>% addPolygons(color = "grey", weight = 1, fillColor = ~pal(SMR), fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px", direction = "auto")) %>%
    addLegend(pal = pal, values = ~SMR, opacity = 0.5, title = "SMR", position = "bottomright")

Now we can examine the map and see which areas have * SMR = 1 which indicates observed cases are the same as expected, * SMR > 1 which indicates observed cases are higher than expected, * SMR < 1 which indicates observed cases are lower than expected.

This map gives a sense of the disease risk across Pennsylvania. However, SMRs may be misleading and insufficiently reliable in counties with small populations. In contrast, model-based approaches enable to incorporate covariates and borrow information from neighboring counties to improve local estimates, resulting in the smoothing of extreme values based on small sample sizes. In the next section we will show how to obtain disease risk estimates using a model using INLA.

0.3.1.3 Modeling

In this Section we specify the model for the observed data, and detail the required steps to fit the model and obtain the disease risk estimates using INLA.

Let Yi and Ei be the observed and expected number of disesase cases, respectively, and let θi be the relative risk for county i=1,…,n. The model is specified as follows:

Yi|θi∼Poisson(Ei×θi), i=1,…,n,

log(θi)=β0+β1×smokingi+ui+vi.

β0 is the intercept, β1 is the coefficient of the smokers proportion covariate, ui is a structured spatial effect, ui|u−i∼N(u¯δi,1τunδi) (Conditionally autoregressive model (CAR)), vi is an unstructured spatial effect, vi∼N(0,1/τv).

We create the neighbourhood matrix needed to define the spatial random effect using the poly2nb() and the nb2INLA() functions of the spdep package. First, we use poly2nb() to create a neighbours list based on areas with contiguous boundaries. Each element of the list nb represents one area and contains the indices of its neighbours. For example, nb[[2]] contains the neighbours of area 2.

Then, we use nb2INLA() to convert this list into a file with the representation of the neighbourhood matrix as required by INLA. Then we read the file using the inla.read.graph() function of INLA, and store it in the object g which we will later use for specifying the spatial disease model with INLA.

nb <- poly2nb(map)
head(nb)

## [[1]]
## [1] 21 28 67
## 
## [[2]]
## [1]  3  4 10 63 65
## 
## [[3]]
## [1]  2 10 16 32 33 65
## 
## [[4]]
## [1]  2 10 37 63
## 
## [[5]]
## [1]  7 11 29 31 56
## 
## [[6]]
## [1] 15 36 38 39 46 54

nb2INLA("map.adj", nb)
g <- inla.read.graph(filename = "map.adj")

The model includes two random effects, namely, $\ u_i$ for modelling spatial residual variation, and $\ v_i$ for modelling unstructured noise. We need to include two vectors in the data that denote the indices of these random effects. We call re_u the indices vector for $\ u_i$, and re_v the indices vector for $\ v_i$. We set both re_u and re_v equal to $\ 1,…,n$, where $\ n$ is the number of counties. In our example, $\ n=67$ and this can be obtained with the number of rows in the data (nrow(map@data)).

map$re_u <- 1:nrow(map@data)
map$re_v <- 1:nrow(map@data)

We specify the model formula by including the response in the left-hand side, and the fixed and random effects in the right-hand side. Random effects are set using f() with parameters equal to the name of the variable and the chosen model. For $\ u_i$, we use model = "besag" with neighbourhood matrix given by g. For $\ v_i$ we choose model = "iid".

formula <- Y ~ smoking + f(re_u, model = "besag", graph = g) + f(re_v, model = "iid")

We fit the model by calling the inla() function and using the default priors in INLA. We specify the formula, family, data, and the expected counts, and set control.predictor equal to list(compute = TRUE) to compute the posterior means of the predictors.

res <- inla(formula, family = "poisson", data = map@data, E = E, control.predictor = list(compute = TRUE))

We can inspect the res object res using summary().

summary(res)

## 
## Call:
## c("inla(formula = formula, family = \"poisson\", data = map@data, ",  "    E = E, control.predictor = list(compute = TRUE))")
## 
## Time used:
##  Pre-processing    Running inla Post-processing           Total 
##          3.8173          1.4350          0.3287          5.5810 
## 
## Fixed effects:
##                mean     sd 0.025quant 0.5quant 0.975quant    mode kld
## (Intercept) -0.3236 0.1503    -0.6212  -0.3234    -0.0279 -0.3231   0
## smoking      1.1567 0.6247    -0.0809   1.1582     2.3852  1.1619   0
## 
## Random effects:
## Name   Model
##  re_u   Besags ICAR model 
## re_v   IID model 
## 
## Model hyperparameters:
##                        mean       sd 0.025quant 0.5quant 0.975quant
## Precision for re_u    93.05    49.95      30.95    81.84     220.42
## Precision for re_v 17956.92 18118.49    1155.86 12549.06   65967.91
##                       mode
## Precision for re_u   63.63
## Precision for re_v 3103.11
## 
## Expected number of effective parameters(std dev): 18.61(4.365)
## Number of equivalent replicates : 3.60 
## 
## Marginal log-Likelihood:  -320.53 
## Posterior marginals for linear predictor and fitted values computed

We see the intercept β^0= -0.3236 with a 95% credible interval equal to (-0.6212, -0.0279), and the coefficient of smoking is β^1= 1.1567 with a 95% credible interval equal to (-0.0810, 2.3853). We can plot the posterior distribution of the smoking coefficient. We do this by calculating a smoothing of the marginal distribution of the coefficient with inla.smarginal() and then plot it with ggplot() of the ggplot2 package.

marginal <- inla.smarginal(res$marginals.fixed$smoking)
marginal <- data.frame(marginal)
ggplot(marginal, aes(x = x, y = y)) + geom_line() + labs(x = expression(beta[1]), y = "Density") +
  geom_vline(xintercept = 0, col = "blue") + theme_bw()

The disease risk estimates and uncertainty for each of the counties are given by the mean posterior and the 95% credible intervals of θi, i=1,…,n which are in res$summary.fitted.values. Column mean is the mean posterior and 0.025quant and 0.975quant are the 2.5 and 97.5 percentiles, respectively. We add these data to map to be able to make maps of these variables. We assign mean to the estimate of the relative risk, and 0.025quant and 0.975quant to the lower and upper limits of 95% credible intervals of the risks.

head(res$summary.fitted.values)

map$RR <- res$summary.fitted.values[, "mean"]
map$LL <- res$summary.fitted.values[, "0.025quant"]
map$UL <- res$summary.fitted.values[, "0.975quant"]

We show the estimated disease risk in an interactive map using leaflet. In the map, we add labels that appear when mouse hovers over the counties showing information about observed and expected counts, SMRs, smokers proportions, RRs, and lower and upper limits of 95% credible intervals.

We observe counties with greater disease risk are located in the west and south east of Pennsylvania, and counties with lower risk are located in the center. The 95% credible intervals indicate the uncertainty in the risk estimates.

pal <- colorNumeric(palette = "YlOrRd", domain = map$RR)

labels <- sprintf("<strong> %s </strong> <br/> Observed: %s <br/> Expected: %s <br/>
                  Smokers proportion: %s <br/>SMR: %s <br/>RR: %s (%s, %s)",
                  map$county, map$Y,  round(map$E, 2),  map$smoking, round(map$SMR, 2),
                  round(map$RR, 2), round(map$LL, 2), round(map$UL, 2)) %>%
  lapply(htmltools::HTML)

leaflet(map) %>% addTiles() %>%
    addPolygons(color = "grey", weight = 1, fillColor = ~pal(RR),  fillOpacity = 0.5,
    highlightOptions = highlightOptions(weight = 4),
    label = labels,
    labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px", direction = "auto")) %>%
    addLegend(pal = pal, values = ~RR, opacity = 0.5, title = "RR", position = "bottomright")

0.3.2 Geospatial data

In this tutorial we predict malaria prevalence in The Gambia, Africa. We use data of malaria prevalence in children obtained at 65 villages in The Gambia which are contained in the geoR package, and environmental covariates downloaded with the raster package. We show how to obtain malaria prevalence predicitons using the INLA and SPDE approaches, and how to make interactive maps of the prevalence and risk factors using leaflet.

0.3.2.1 Data preparation

dim(gambia)

## [1] 2035    8

head(gambia)

we inspect the data and see it is a data frame with 2035 observations and the following 8 variables:

x: x-coordinate of the village (UTM),
y: y-coordinate of the village (UTM),
pos: presence (1) or absence (0) of malaria in a blood sample taken from the child,
age: age of the child in days,
netuse: indicator variable denoting whether (1) or not (0) the child regularly sleeps under a bed-net,
treated: indicator variable denoting whether (1) or not (0) the bed-net is treated (coded 0 if netuse = 0),
green: satellite-derived measure of the greenness of vegetation in the immediate vicinity of the village (arbitrary units),
phc: indicator variable denoting the presence (1) or absence (0) of a health center in the village.

Data in gambia are given at individual level. Here, we do the analysis at the village level by aggregating the malaria tests by village. We create a data frame called d with columns containing, for each village, the longitude and latitude, the number of malaria tests, the number of positive tests, the prevalence, and the altitude.

We aggregate the number of tests in each village. We look at the dimension of the unique coordinates and see that the 2035 tests are conducted in 65 unique locations.

dim(unique(gambia[, c("x", "y")]))

## [1] 65  2

We create a data frame called d containing the village coordinates (x, y), the total number of tests performed (total), the number of positive tests (positive), and the malaria prevalence (prev) in each village. Positive tests results have pos equal to 1 and negative test results have pos equal to 0. Therefore, we can calculate the number of positive tests in each village adding up the elements in gambia$pos. We calculate the prevalence in each village by calculating the proportion of positive tests (number of positive results divided by the total number of tests in each village).

d <- group_by(gambia, x, y) %>% 
  summarize(total = n(),
            positive = sum(pos),
            prev = positive/total)
head(d)

Now, we can plot the malaria prevalence in a leaflet map. leaflet expects data to be specified in geographic coordinates (latitude/longitude). However, the coordinates in the data are in UTM format (Easting/Northing). We transform the UTM coordinates in the data to geographic coordinates using the spTransform() function of the sp package.

First, we create a SpatialPoints object called sps using the sp package where we specify the projection of The Gambia, that is, UTM zone 28. Then, we transform the UTM coordinates in sps to geographic coordinates using spTransform() where we set CRS to CRS(“+proj=longlat +datum=WGS84”).

# Transform UT to lon/lat
sps  <- SpatialPoints(d[, c("x", "y")], proj4string = CRS("+proj=utm +zone=28")) # Timezone of Gambia is 28
spst <- spTransform(sps, CRS("+proj=longlat +datum=WGS84"))

Finally, we add the longitude and latitude variables to the data frame d.

d[, c("long", "lat")] <- coordinates(spst)
head(d)

Now we use leaflet() to construct a map with the locations of the villages and the malaria prevalence. We use addCircles() to put circles on the map, and colour circles according to the value of the prevalences. We choose a palette function with colours from viridis and four equal intervals from values 0 to 1. We use the base map given by providers$CartoDB.Positron so point colours can be distinguised from the background map. We also add a scale bar using addScaleBar().

pal <- colorBin("viridis", bins = c(0, 0.25, 0.5, 0.75, 1))
leaflet(d) %>%  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircles(lng = ~long, lat = ~lat, color = ~pal(prev)) %>%
  addLegend("bottomright", pal = pal, values = ~prev, title = "Prevalence") %>%
  addScaleBar(position = c("bottomleft"))

To model malaria prevalence we will use a covariate that indicates the elevation in The Gambia. This covariate can be obtained with the getData() function of the raster package which can be used to obtain geographic data from anywhere in the world. In order to get the elevation values in The Gambia, we need to call getData() with the three following arguments:

name of the data set to 'alt',
country set to the 3 letters of the International Organization for Standardization (ISO) code of The Gambia (GMB)
mask set to TRUE so the neighbouring countries are set to NA.

r <- getData(name = 'alt', country = 'GMB', mask = TRUE)

We make a map with the elevation raster using the addRasterImage() function of the leaflet package. First we create a palette function pal using the values of the raster (values(r)) and specifying that the NA values are transparent.

pal <- colorNumeric("viridis", values(r), na.color = "transparent")

leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addRasterImage(r, colors = pal, opacity = 0.5) %>%
  addLegend("bottomright", pal = pal, values = values(r), title = "Altitude") %>%
  addScaleBar(position = c("bottomleft"))

We also add the elevation variable to the data frame d because we will use it as a covariate in the model. To do that we get the altitude values at the villages locations using the extract() function of the raster package. The first argument of this function is the elevation raster (r). The second argument is a two column matrix with the coordinates where we want to know the values, that is, the coordinates of the villages given by d[ ,c(“long”, “lat”)]. We assign the altitude vector to the column alt of the data frame d.

d$alt <- extract(r, d[, c("long", "lat")])

The final dataframe:

head(d)

0.3.2.2 Modelling

In this Section we specify the model to predict malaria prevalence in The Gambia, and detail the steps to fit the model using the INLA and SPDE approaches.

Conditional on the true prevalence $P(x_i)$ at location $x_i, i=1,…,n$ the number of positive results $Y_i$ out of $N_i$ people sampled at $x_i$ follows a binomial distribution,

\[Y_i|P(x_i) \sim Binomial(N_i, P(x_i))\]

\[logit(P(x_i)) = \beta_0 + \beta_1 \times \mbox{altitude} + S(x_i)\] Here $\beta_0$ denotes the intercept, $\beta_1$ is the coefficient of altitude, and $S(x_i)$ is a spatially structured random effect which follows a zero-mean Gaussian process with Matern covariance function:

\[\mbox{Cov}(S(x_i),S(x_j))=\frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)}(\kappa ||x_i - x_j ||)^\nu K_\nu(\kappa ||x_i - x_j ||)\] Here, $K_ν$ is the modified Bessel function of second kind and order $ν>0$. $ν$ is the smoothness parameter, $\sigma_2$ denotes the variance, and $κ>0$ is related to the practical range $\rho = \sqrt{8 \nu}/\kappa$, the distance at which the spatial correlation is close to 0.1.

Then, we need to build a triangulation mesh that covers The Gambia over which to make the random field discretization. For building the mesh, we use the inla.mesh.2d() function passing the following parameters:

loc: locations coordinates that will be used as initial mesh vertices
max.edge: values denoting the maximum allowed triangle edge lengths in the region and in the extension
cutoff: minimum allowed distance between points.

Here, we call inla.mesh.2d() setting loc equal to the matrix with the coordinates coo. We set max.edge = c(0.1, 5) to use small triangles within the region, and larger triangles in the extension. cutoff = 0.01 to avoid building many small triangles where we have some very close points.

coo <- cbind(d$long, d$lat)
mesh <- inla.mesh.2d(loc = coo, max.edge = c(0.1, 5), cutoff = 0.01)

The number of vertices is given by mesh$n and we can plot the mesh with plot(mesh).

mesh$n

## [1] 669

plot(mesh)

points(coo, col = "red")

Then, we use the inla.spde2.matern() function to build the SPDE model.

spde <- inla.spde2.matern(mesh = mesh, alpha = 2)

Here, alpha is a parameter related to the smoothness parameter of the process: $\alpha=\nu+d/2$. Here, we set the smoothness parameter $ν$ equal to 1, and in the spatial case $d=2$. Therefore alpha $=1+2/2=2$.

Now we generate the index set for the SPDE model. We do this with the function inla.spde.make.index() where we specify the name of the effect (s) and the number of vertices in the SPDE model (spde$n.spde). This creates a list with vector s equal to 1:spde$n.spde, and vectors s.group and s.repl that have all elements equal to 1s and size given by the number of mesh vertices.

indexs <- inla.spde.make.index("s", spde$n.spde)
lengths(indexs)

##       s s.group  s.repl 
##     669     669     669

We need to build a projector matrix A that projects the spatially continuous Gaussian random field at the mesh nodes. The projector matrix A can be built with the inla.spde.make.A() function passing the mesh and the coordinates.

A <- inla.spde.make.A(mesh = mesh, loc = coo)

Here we specify the locations where we want to predict the prevalence. We set the prediction locations to the locations of the the covariate raster data. We can get the coordinates of the raster r with the function rasterToPoints() of the raster package. This function returns a matrix with the coordinates and values of the raster that do not have NA values. We see there are 12964 points.

dp <- rasterToPoints(r)
dim(dp)

## [1] 12964     3

In this example, we want to use fewer prediction points. We can lower the resolution of the raster by using aggregate() of the raster package. The arguments of the function are

x: the raster object,
fact: aggregation factor expressed as number of cells in each direction (horizontally and vertically),
fun: function used to aggregate values.

We specify fact = 5 aggregating 5 cells in each direction, and fun = mean to compute the mean of the cell values. We call coop to the matrix of coordinates with the prediction locations.

ra <- aggregate(r, fact = 5, fun = mean)

dp <- rasterToPoints(ra)
dim(dp)

## [1] 627   3

coop <- dp[, c("x", "y")]

We also construct the matrix that projects the locations where we will do the predictions.

Ap <- inla.spde.make.A(mesh = mesh, loc = coop)

Now we use the inla.stack() function to organize data, effects, and projector matrices. We use the following arguments:

tag: string for identifying the data,
data: list of data vectors,
A: list of projector matrices,
effects: list with fixed and random effects.

We construct a stack called stk.e with data for estimation and we tag it with the string “est”. The fixed effects are the intercept (b0) and a covariate (cov). The random effects is the spatial Gaussian Random Field (s). Therefore, in effects we pass a list with a data.frame with the fixed effects, and a list s containing the indices of the SPDE object (indexs). A is set to a list where the second element is A, the projector matrix for the random effects, and the first element is 1 to indicate the fixed effects are mapped one-to-one to the response. In data we specify the response vector and the number of trials. We also construct a stack for prediction that called stk.p. This stack has tag “pred”, the response vector is set to NA, and the data is specified at the prediction locations. Finally, we put stk.e and stk.p together in a full stack stk.full.

#stack for estimation stk.e
stk.e <- inla.stack(tag = "est",
data = list(y = d$positive, numtrials = d$total),
A = list(1, A),
effects = list(data.frame(b0 = 1, cov = d$alt), s = indexs))

## Warning: 'cBind' is deprecated.
##  Since R version 3.2.0, base's cbind() should work fine with S4 objects

#stack for prediction stk.p
stk.p <- inla.stack(tag = "pred",
data = list(y = NA, numtrials = NA),
A = list(1, Ap),
effects = list(data.frame(b0 = 1, cov = dp[, 3]), s = indexs))

#stk.full has stk.e and stk.p
stk.full <- inla.stack(stk.e, stk.p)

We specify the model formula by including the response in the left-hand side, and the fixed and random effects in the right-hand side. In the formula we remove the intercept (adding 0) and add it as a covariate term (adding b0), so all the covariate terms can be captured in the projector matrix.

formula <- y ~ 0 + b0 + cov + f(s, model = spde)

We fit the model by calling inla() and using the default priors in INLA. We specify the formula, family, data, and options. In control.predictor we set compute = TRUE to compute the posterior means of the predictors. We set link=1 to compute the fitted values (res$summary.fitted.values and res$marginals.fitted.values) with the same link function as the the family specified in the model.

res <- inla(formula, family = "binomial", Ntrials = numtrials,
control.family = list(link = "logit"),
data = inla.stack.data(stk.full),
control.predictor = list(compute = TRUE, link = 1, A = inla.stack.A(stk.full)))

## Warning: 'rBind' is deprecated.
##  Since R version 3.2.0, base's rbind() should work fine with S4 objects

summary(res)

## 
## Call:
## c("inla(formula = formula, family = \"binomial\", data = inla.stack.data(stk.full), ",  "    Ntrials = numtrials, control.predictor = list(compute = TRUE, ",  "        link = 1, A = inla.stack.A(stk.full)), control.family = list(link = \"logit\"))" )
## 
## Time used:
##  Pre-processing    Running inla Post-processing           Total 
##         13.0683         16.8821          1.0611         31.0115 
## 
## Fixed effects:
##        mean     sd 0.025quant 0.5quant 0.975quant    mode kld
## b0  -0.1728 0.5311    -1.0699  -0.2220     1.0350 -0.2881   0
## cov -0.0117 0.0106    -0.0322  -0.0118     0.0095 -0.0120   0
## 
## Random effects:
## Name   Model
##  s   SPDE2 model 
## 
## Model hyperparameters:
##                mean     sd 0.025quant 0.5quant 0.975quant   mode
## Theta1 for s -4.008 0.2925     -4.578   -4.010     -3.427 -4.018
## Theta2 for s  2.648 0.3734      1.905    2.652      3.374  2.666
## 
## Expected number of effective parameters(std dev): 43.01(3.945)
## Number of equivalent replicates : 1.511 
## 
## Marginal log-Likelihood:  -216.26 
## Posterior marginals for linear predictor and fitted values computed

Now we map the malaria prevalence predictions in a leaflet map. We can obtain the mean prevalence and lower and upper limits of the 95% credible intervals from res$summary.fitted.values where we need to specify the rows corresponding to the indices of the predictions and the columns mean, 0.025quant and 0.975quant. The indices of the stack stk.full that correspond to the predictions are the ones tagged with tag = "pred". We can obtain them by using inla.stack.index() and specifying tag = "pred".

index <- inla.stack.index(stack = stk.full, tag = "pred")$data

prev_mean <- res$summary.fitted.values[index, "mean"]
prev_ll <- res$summary.fitted.values[index, "0.025quant"]
prev_ul <- res$summary.fitted.values[index, "0.975quant"]

The predicted values correspond to a set of points. We can create a raster with the predicted values using rasterize() of the raster package. This function is useful for transferring the values of a vector at some locations to raster cells. We transfer the predicted values prev_mean from the locations coop to the raster ra that we used to get the locations for prediction. We use the following arguments:

x = coop, the coordinates where we made the predictions,
y = ra, the raster where we transfer the values,
field = prev_mean, the values to be transferred (prevalence estimates in locations coop),
fun = mean, this will assign the mean of the values to cells that have more than one point.

r_prev_mean <- rasterize(x = coop, y = ra, field = prev_mean, fun = mean)

Now we plot the raster with the prevalence values in a leaflet map. We use a palette function created with colorNumeric(). We could set the domain to the values we are going to plot (values(r_prev_mean)), but we decide to use the interval [0, 1] because we will use the same palette to plot the mean prevalence and the lower and upper limits of the 95% credible interval.

pal <- colorNumeric("viridis", c(0, 1), na.color = "transparent")

leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addRasterImage(r_prev_mean, colors = pal, opacity = 0.5) %>%
  addLegend("bottomright", pal = pal, values = values(r_prev_mean), title = "Prevalence") %>%
  addScaleBar(position = c("bottomleft"))

We can follow the same approach to create maps with the lower and upper limits of the prevalece estimates. First we create rasters with the lower and upper limits of the prevalences. Then we make the map with the same palette function we used to plot the mean prevalence.

r_prev_ll <- rasterize(x = coop, y = ra, field = prev_ll, fun = mean)

leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addRasterImage(r_prev_ll, colors = pal, opacity = 0.5) %>%
  addLegend("bottomright", pal = pal, values = values(r_prev_ll), title = "LL") %>%
  addScaleBar(position = c("bottomleft"))

r_prev_ul <- rasterize(x = coop, y = ra, field = prev_ul, fun = mean)

leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addRasterImage(r_prev_ul, colors = pal, opacity = 0.5) %>%
  addLegend("bottomright", pal = pal, values = values(r_prev_ul), title = "UL") %>%
  addScaleBar(position = c("bottomleft"))