Chapter 13 Zero-Inflation
13.1 Exercise 1
For this exercise, you will consider a data set containing publication records associated with 915 biochemistry graduate students. The data can be accessed using:
Consider the following response variable:
art: count of articles produced during last 3 years of Ph.D.
And, the available predictor variables:
- fem: factor indicating gender of student, with levels Men and Women7.
- mar: factor indicating marital status of student, with levels Single and Married
- kid5: number of children aged 5 or younger
- phd: prestige of Ph.D. department
- ment: count of articles produced by Ph.D. mentor during last 3 years
Formulate and fit a model for
art
that includes one or more predictors for both the zero-inflation and count parts of the model using thezeroinfl
function in thepscl
library.Describe the model you fit using text and equations and match the parameters from the R output to those in your model description. Interpret the parameters in the model in the context of the problem.
13.2 Exercise 2
For this exercise, you will consider the nuts
data set in the COUNT
package (Hilbe, 2016). These data were originally reported by Flaherty et al (2012). Researchers recorded information about squirrel behavior and
forest attributes across various plots in Scotland’s Abernathy Forest. The study focused on the following
variables:
- cones = number of cones stripped by red squirrels per plot
- sntrees = standardized number of trees per plot
- sheight = standardized mean tree height per plot
- scover = standardized percentage of canopy cover per plot
You can access the data using:
Fit a Poisson model and a negative binomial model relating
cones
to the other 3 predictor variables. In addition, fit zero-inflated versions of these models using thezeroinfl
function in thepscl
package (Zeileis, Kleiber, & Jackman, 2008; Jackman, 2020). Compare the four models using AIC. For the zero-inflation models, assume a constant level of zero-inflation usingzeroinfl(y ~ x | 1, data=)
.Conduct a goodness-of-fit test the poisson regression model without zero-inflation. Use the number of zeros as the test statistic. You may conduct the test using predictive simulation or via a Bayesian goodness-of-fit test. Interpret the result of this test.
Describe the “AIC-best” model using text and equations and match the parameters from the R output to those in your model description.
Optional: fit the AIC-best model using JAGS.
13.3 Exercise 3
For this exercise, you will use simulations to gain insights into the performance of negative binomial models with zero-inflation. We will make use of simulation code from Kéry (2010), Section 14.2 (you will need to adapt it to simulate multiple data sets). Assume we survey 40 sites for snowshoe hare (Lepus americanus), 20 classified as grassland
and 20 classified as arable
land. Some sites will be suitable for hare and others not. We will model suitability using a Bernoulli distribution and the number of hare, given the site is suitable, using a negative binomial distribution.
Let \(W_i = 1\) if the site is suitable for hare and 0 otherwise. Further, let \(C_i\) be the count of hare at a site. We can write the model as:
\[W_i \sim Bernoulli(\Psi), \text{ with } \Psi = 0.8\]
\[C_i | W_i \sim \text{Negative Binomial}(\lambda_iW_i, \theta), \text{ with } \theta = 0.5\]
\[\lambda_i = \exp(\beta_0 + \beta_1X_i) \text{ with } \beta_0 =0.69 \text{ and } \beta_1 =0.92\]
One data set can be generated using the following code:
# This code can be included *outside* of a data generation loop
psi <- 0.8 # probability a site is suitable
n.site <- 20 # number of sites
x <- gl(n = 2, k = n.site, labels = c("grassland", "arable")) # site indicator
lambda <- exp(0.69 +(0.92*(as.numeric(x)-1))) # Mean count | suitable
theta<-0.5 # overdispersion parameter of negative binomial
# This code should be included *inside* a data generation loop
w <- rbinom(n = 2*n.site, size = 1, prob = psi) # Suitability indicator
C <- MASS::rnegbin(n = 2*n.site, mu = w *lambda, theta=theta) # Count of hare
simdat<-data.frame(x=x, w=w, C=C)
m.negb <- zeroinfl(C ~ x | 1, dist = "negbin", data=simdat) #znb model
coef(m.negb) # coefficients of the znb model
m.negb$theta # overdisperion parameter znb model
Make sure to load any necessary packages (e.g.,
pscl
for fitting zero-inflated models,MASS
for thernegbin
function). Simulate 1000 data sets, and for each data set, fit a zero-inflated negative binomial model using thezeroinfl
function - i.e., usingzeroinfl(C∼x| 1, dist="negbin")
. Save the fitted coefficients (you can access them using thecoef
function). Also, save the values of \(\hat{\theta}\). To access the estimates of \(\theta\), you can usemod$theta
wheremod
is the name of the fitted model.Plot the sampling distributions of (\(\Psi, \beta_0, \beta_1, \log(\theta)\)). Overlay the true values of the parameters that were used to generate the data. Note:
zeroinfl
estimates 1 - logit(\(\Psi\)). You can get estimates of \(\Psi\) using1 - plogis(coef(m.negb)[3])
.Plot estimates of the zero-inflation parameter, \(\Psi\), versus \(\log(\theta)\).
Comment on your findings. How well are we able to estimate the zero-inflation parameter, \(\Psi\)? What do you think is going on?
13.4 Exercise 4
In Exercise 12.1, we fit a logistic regression model to presence-absence (really detection, non-detection) data of willow tits (Poecile montanus) during the Swiss Survey of Common Breeding Birds. In this exercise, we will fit an occupancy model to these same data. This model will be similar in flavor to the zero-inflated models considered in this section.
The data can be accessed using:
The data set contains counts c.i
on each of 3 survey occasions (\(i\) = 1, 2, or 3), which have also been reduced to 0’s and 1’s (y.i
; i=1, 2, or 3), with y.i
= 1 if at least 1 bird was seen on survey \(i\) (and 0 otherwise). Some sites were not surveyed on all 3 occasions and will have NA
for y.i
when not sampled. In exercise 12.1, we will fit logistic regression models to a composite indicator variable, y
, equal to 1 if at least 1 bird was seen on any survey and 0 otherwise. In this exercise, we will consider the observations from each survey occasion, y.i
.
- Begin by fitting a logistic regression model using
y
as the response variable and include forest cover and elevation as predictor variables. Allow the effect of elevation to be non-linear by including scaled and centered version ofelev
created using:
library(dplyr)
willowtits <- willowtits %>% mutate(selev = as.numeric(scale(elev)))
willowtits <- willowtits %>% mutate(selev2 = selev^2)
- Next, we will fit the following occupancy model the survey data:
\[\begin{gather} Z_j \sim Bernoulli(\Psi_j) \\ \text{logit}(\Psi_j) = \beta_0 + \beta_1 Forest_j + \beta_2 selev_j + \beta_3selev^2_j \\ Y_{i,j} | Z_i \sim Bernoulli(p_{i,j}Z_j) \end{gather}\]
We assume that sites are either occupied or unoccupied during the full survey period. The probability that a site is occupied is given by \(\Psi_j\) and is modeled as a function of forest cover and elevation. Further, \(p_{i,j}\) represents the probability that a willow tit is detected during survey \(i\) at site \(j\), given that site \(j\) is occupied. We assume there are no false positives and thus, the probability of detecting willow tits at an unoccupied site is 0. Lastly, we assume that this detection probability is constant, but we could also model detection probabilities as a function of survey duration or Julian date.
Some skeleton code is provided below for fitting the model in JAGS. You will need to:
- formulate appropriate prior distributions
- pass all of the data JAGS needs to fit the model, determine parameters to save, number of iterations, etc.
occupancymodel <- function(){
# Likelihood
for(j in 1:nobs) {
# probabilty site is occupied
logit(psi[j]) <- beta0 + beta1*forest[j] + beta2*selev[j] + beta3*scelev2[j]
Z[j] ~ dbern(psi[j]) # Occupancy status
y.1[j] ~dbern(Z[j]*p) # Detection period 1
y.2[j] ~dbern(Z[j]*p) # Detection period 2
y.3[j] ~dbern(Z[j]*p) # Detection period 3
}
# Priors
}
You will need to specify appropriate prior distributions for each of the parameters. In addition, you may run into errors if you do not also pass JAGS initial guesses for \(Z_j\) or treat \(Z_j\) as “partially observed” (i.e., when we have at least 1 detection, we know \(Z_j\) = 1 but otherwise we do not know if \(Z_j\) should be 0 or 1). So, when passing data, you should include \(Z\) below:
- Compare estimates of the model parameters between the occupancy and logistic regression models. Do the results surprise you?
References
Importantly, a binary classification is not inclusive and leaves out a significant proportion of the population. Although we cannot change past data collection efforts, we can use more inclusive language and additional options when designing our own surveys!↩︎