Chapter 13 Zero-Inflation

13.1 Exercise 1

For this exercise, you will consider a data set containing publication records associated with 915 biochemistry graduate students. The data can be accessed using:

library(pscl)
data(bioChemists)

Consider the following response variable:

art: count of articles produced during last 3 years of Ph.D.

And, the available predictor variables:

  • fem: factor indicating gender of student, with levels Men and Women7.
  • mar: factor indicating marital status of student, with levels Single and Married
  • kid5: number of children aged 5 or younger
  • phd: prestige of Ph.D. department
  • ment: count of articles produced by Ph.D. mentor during last 3 years
  1. Formulate and fit a model for art that includes one or more predictors for both the zero-inflation and count parts of the model using the zeroinfl function in the pscl library.

  2. Describe the model you fit using text and equations and match the parameters from the R output to those in your model description. Interpret the parameters in the model in the context of the problem.

13.2 Exercise 2

For this exercise, you will consider the nuts data set in the COUNT package (Hilbe, 2016). These data were originally reported by Flaherty et al (2012). Researchers recorded information about squirrel behavior and forest attributes across various plots in Scotland’s Abernathy Forest. The study focused on the following variables:

  • cones = number of cones stripped by red squirrels per plot
  • sntrees = standardized number of trees per plot
  • sheight = standardized mean tree height per plot
  • scover = standardized percentage of canopy cover per plot

You can access the data using:

library(COUNT)
data(nuts)
  1. Fit a Poisson model and a negative binomial model relating cones to the other 3 predictor variables. In addition, fit zero-inflated versions of these models using the zeroinfl function in the pscl package (Zeileis, Kleiber, & Jackman, 2008; Jackman, 2020). Compare the four models using AIC. For the zero-inflation models, assume a constant level of zero-inflation using zeroinfl(y ~ x | 1, data=).

  2. Conduct a goodness-of-fit test the poisson regression model without zero-inflation. Use the number of zeros as the test statistic. You may conduct the test using predictive simulation or via a Bayesian goodness-of-fit test. Interpret the result of this test.

  3. Describe the “AIC-best” model using text and equations and match the parameters from the R output to those in your model description.

  4. Optional: fit the AIC-best model using JAGS.

13.3 Exercise 3

For this exercise, you will use simulations to gain insights into the performance of negative binomial models with zero-inflation. We will make use of simulation code from Kéry (2010), Section 14.2 (you will need to adapt it to simulate multiple data sets). Assume we survey 40 sites for snowshoe hare (Lepus americanus), 20 classified as grassland and 20 classified as arable land. Some sites will be suitable for hare and others not. We will model suitability using a Bernoulli distribution and the number of hare, given the site is suitable, using a negative binomial distribution.

Let \(W_i = 1\) if the site is suitable for hare and 0 otherwise. Further, let \(C_i\) be the count of hare at a site. We can write the model as:

\[W_i \sim Bernoulli(\Psi), \text{ with } \Psi = 0.8\]

\[C_i | W_i \sim \text{Negative Binomial}(\lambda_iW_i, \theta), \text{ with } \theta = 0.5\]

\[\lambda_i = \exp(\beta_0 + \beta_1X_i) \text{ with } \beta_0 =0.69 \text{ and } \beta_1 =0.92\]

One data set can be generated using the following code:

# This code can be included *outside* of a data generation loop
psi <- 0.8 # probability a site is suitable
n.site <- 20 # number of sites
x <- gl(n = 2, k = n.site, labels = c("grassland", "arable")) # site indicator
lambda <- exp(0.69 +(0.92*(as.numeric(x)-1))) # Mean count | suitable
theta<-0.5 # overdispersion parameter of negative binomial

# This code should be included *inside* a data generation loop
w <- rbinom(n = 2*n.site, size = 1, prob = psi) # Suitability indicator
C <- MASS::rnegbin(n = 2*n.site, mu = w *lambda, theta=theta) # Count of hare
simdat<-data.frame(x=x, w=w, C=C)
m.negb <- zeroinfl(C ~ x | 1, dist = "negbin", data=simdat) #znb model
coef(m.negb) # coefficients of the znb model
m.negb$theta # overdisperion parameter znb model
  1. Make sure to load any necessary packages (e.g., pscl for fitting zero-inflated models, MASS for the rnegbin function). Simulate 1000 data sets, and for each data set, fit a zero-inflated negative binomial model using the zeroinfl function - i.e., using zeroinfl(C∼x| 1, dist="negbin"). Save the fitted coefficients (you can access them using the coef function). Also, save the values of \(\hat{\theta}\). To access the estimates of \(\theta\), you can use mod$theta where mod is the name of the fitted model.

  2. Plot the sampling distributions of (\(\Psi, \beta_0, \beta_1, \log(\theta)\)). Overlay the true values of the parameters that were used to generate the data. Note: zeroinfl estimates 1 - logit(\(\Psi\)). You can get estimates of \(\Psi\) using 1 - plogis(coef(m.negb)[3]).

  3. Plot estimates of the zero-inflation parameter, \(\Psi\), versus \(\log(\theta)\).

  4. Comment on your findings. How well are we able to estimate the zero-inflation parameter, \(\Psi\)? What do you think is going on?

13.4 Exercise 4

In Exercise 12.1, we fit a logistic regression model to presence-absence (really detection, non-detection) data of willow tits (Poecile montanus) during the Swiss Survey of Common Breeding Birds. In this exercise, we will fit an occupancy model to these same data. This model will be similar in flavor to the zero-inflated models considered in this section.

The data can be accessed using:

library(Data4Ecologists)
data(willowtits)

The data set contains counts c.i on each of 3 survey occasions (\(i\) = 1, 2, or 3), which have also been reduced to 0’s and 1’s (y.i; i=1, 2, or 3), with y.i = 1 if at least 1 bird was seen on survey \(i\) (and 0 otherwise). Some sites were not surveyed on all 3 occasions and will have NA for y.i when not sampled. In exercise 12.1, we will fit logistic regression models to a composite indicator variable, y, equal to 1 if at least 1 bird was seen on any survey and 0 otherwise. In this exercise, we will consider the observations from each survey occasion, y.i.

  1. Begin by fitting a logistic regression model using y as the response variable and include forest cover and elevation as predictor variables. Allow the effect of elevation to be non-linear by including scaled and centered version ofelev created using:
library(dplyr)
willowtits <- willowtits %>% mutate(selev = as.numeric(scale(elev)))
willowtits <- willowtits %>% mutate(selev2 = selev^2)
  1. Next, we will fit the following occupancy model the survey data:

\[\begin{gather} Z_j \sim Bernoulli(\Psi_j) \\ \text{logit}(\Psi_j) = \beta_0 + \beta_1 Forest_j + \beta_2 selev_j + \beta_3selev^2_j \\ Y_{i,j} | Z_i \sim Bernoulli(p_{i,j}Z_j) \end{gather}\]

We assume that sites are either occupied or unoccupied during the full survey period. The probability that a site is occupied is given by \(\Psi_j\) and is modeled as a function of forest cover and elevation. Further, \(p_{i,j}\) represents the probability that a willow tit is detected during survey \(i\) at site \(j\), given that site \(j\) is occupied. We assume there are no false positives and thus, the probability of detecting willow tits at an unoccupied site is 0. Lastly, we assume that this detection probability is constant, but we could also model detection probabilities as a function of survey duration or Julian date.

Some skeleton code is provided below for fitting the model in JAGS. You will need to:

  • formulate appropriate prior distributions
  • pass all of the data JAGS needs to fit the model, determine parameters to save, number of iterations, etc.
 occupancymodel <- function(){
  # Likelihood
  for(j in 1:nobs) {
    # probabilty site is occupied
    logit(psi[j]) <- beta0 + beta1*forest[j] + beta2*selev[j] + beta3*scelev2[j]
    Z[j] ~ dbern(psi[j]) # Occupancy status
    y.1[j] ~dbern(Z[j]*p) # Detection period 1
    y.2[j] ~dbern(Z[j]*p) # Detection period 2
    y.3[j] ~dbern(Z[j]*p) # Detection period 3
  }
  # Priors
  
}

You will need to specify appropriate prior distributions for each of the parameters. In addition, you may run into errors if you do not also pass JAGS initial guesses for \(Z_j\) or treat \(Z_j\) as “partially observed” (i.e., when we have at least 1 detection, we know \(Z_j\) = 1 but otherwise we do not know if \(Z_j\) should be 0 or 1). So, when passing data, you should include \(Z\) below:

Z <- NA
Z[willowtits$y.1 == 1] <- 1
Z[willowtits$y.2 == 1] <- 1
Z[willowtits$y.3 == 1] <- 1
  1. Compare estimates of the model parameters between the occupancy and logistic regression models. Do the results surprise you?

References

Hilbe, J. M. (2016). COUNT: Functions, data and code for count data. Retrieved from https://CRAN.R-project.org/package=COUNT
Jackman, S. (2020). pscl: Classes and methods for R developed in the political science computational laboratory. Sydney, New South Wales, Australia: United States Studies Centre, University of Sydney. Retrieved from https://github.com/atahk/pscl/
Kéry, M. (2010). Introduction to WinBUGS for ecologists: Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.
Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal of Statistical Software, 27(8). Retrieved from http://www.jstatsoft.org/v27/i08/

  1. Importantly, a binary classification is not inclusive and leaves out a significant proportion of the population. Although we cannot change past data collection efforts, we can use more inclusive language and additional options when designing our own surveys!↩︎