Chapter 3 Multiple Regression and Design Matrices

3.1 Exercise 1

For this exercise, we will consider the penguins data set from the palmerpenguins package (Horst, Hill, & Gorman, 2020). The data set contains various measures of penguin size, species, and sex, along with the year and island when/where the observations were made. Begin by loading the data and dropping observations that have missing values for some of the predictors:

library(dplyr)
library(palmerpenguins)
data(penguins)
penguins <- penguins %>% filter(complete.cases(.))

First, consider a linear model in which you attempt to predict a penguin’s body mass (body_mass_g) from its flipper length (flipper_length_mm) and the sex (sex) of the penguin.

  1. Fit the model using effects coding. Write down the entries in the design matrix, \(X\) for the first 3 observations in the data set. Verify your answer using the model.matrix function. The first 3 observations are given by:
library(palmerpenguins)
data(penguins)
penguins[1:3,]
## # A tibble: 3 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## # ℹ 2 more variables: sex <fct>, year <int>
  1. Fit the same model using means coding. Again, write down the entries in the design matrix, \(X\) for the first 3 observations in the data set. Verify your answer using the model.matrix function.

  2. Now, consider adding species to the model in addition to flipper length (flipper_length-mm) and the sex (sex) of the penguin. Fit the model using effects coding. Write down the entries in the design matrix for the following observations:

penguins[c(1, 2, 200, 201, 300, 301),]
## # A tibble: 6 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Torgersen           39.1          18.7               181        3750
## 2 Adelie    Torgersen           39.5          17.4               186        3800
## 3 Gentoo    Biscoe              50.5          15.9               225        5400
## 4 Gentoo    Biscoe              44.9          13.3               213        5100
## 5 Chinstrap Dream               50.6          19.4               193        3800
## 6 Chinstrap Dream               46.7          17.9               195        3300
## # ℹ 2 more variables: sex <fct>, year <int>
  1. Lastly, let’s allow the effect of flipper length to be sex-specific. This can be accomplished by adding an interaction between sex and flipper_length_mm, sex:flipper_length_mm to the model from [3] above. Again, write down the entries in the design matrix for 6 observations selected just above.

3.2 Exercise 2

For this exercise, we will use the leaftemp data set in the DAAG package (Maindonald & Braun, 2020). The data set contains measurements of vapor pressure (vapPress) and differences between leaf and air temperatures (tempDiff) in an experiment conducted at three different levels of carbon dioxide (CO2level).

library(DAAG)
data(leaftemp)
  1. Fit the following three models to these data:
  • simple linear regression: lm(tempDiff ~ vapPress, data = leaftemp)
  • Analysis of covariance: lm(tempDiff ~ vapPress + CO2level, data= leaftemp)
  • Interaction model: lm(tempDiff ~ vapPress*CO2level, data= leaftemp)
  1. For the Analysis of covariance model, write down the equation for predicting the mean temperature difference as a function of vapor pressure for each of the 3 CO\(_2\) levels.

  2. For the Interaction model, write down the equation for predicting the mean temperature difference as a function of vapor pressure for each of the 3 CO\(_2\) levels.

  3. Plot the predicted mean temperature difference as a function of vapor pressure (and when appropriate, CO\(_2\) level) for each of the 3 models.

  4. Construct and fit a model in which each CO\(_2\) level has a different intercept, but the medium and high CO\(_2\) level treatments have the same slope for vapPress (which differs from the slope for the Low CO\(_2\) level treatment).

3.3 Exercise 3

This example uses data from the Maryland Biological Stream Survey to explore added-variable and component+residual plots. To access the data, type:

library(Data4Ecologists)
data("longnosedace")

Predictors in the data set include:

  • stream: name of each sampled stream (i.e., the name associated with each case).
  • longnosedace: number of longnose dace (Rhinichthys cataractae) per 75-meter section of stream.
  • acreage: area (in acres) drained by the stream.
  • do2: the dissolved oxygen (in mg/liter).
  • maxdepth: the maximum depth (in cm) of the 75-meter segment of stream.
  • no3: nitrate concentration (mg/liter).
  • so4: sulfate concentration (mg/liter).
  • temp: the water temperature on the sampling date (in degrees C).
  1. Begin by fitting a model containing all of the predictor variables in the data set (other than stream).

  2. Create residual diagnostic plots to evaluate the assumptions of the linear regression model. Comment on whether or not you think each assumption of linear regression is met?

  3. Explore the effect of each predictor (after accounting for all other predictors in the model) using added-variable plots constructed using the avPlots function in the car package. Just type avPlots(model-name) where model-name is the name you assign to the fitted model. When creating your report, you may want to adjust the size of the figure in the output using chunk options: fig.height = 12, fig.width=6, out.width = “80%”. You can also change the layout of the panels. For example, if you want a 2 x 3 set of panels (rather than the default 3 x 2), you can use avPlots(mod, layout = c(2, 3)).

  4. Construct partial residual or component + residual plots using using the crPlots function in the car package via crPlots(model-name). Again, you may want to adjust the size of the plot and the layout of the panels.

  5. Consider the added-variable and component + residual plots. Do these tell you anything more about potential assumption violations?

  6. Component+residual plots can also be constructed using the termplot function (in base stats package) with the argument partial=T. The termplot function allows one to generalize the component+residual plot to more complicated models that allow for non-linear relationships2. For example, we can construct a component+residual plot for a model with a quadratic term by replacing \(\hat{\beta}_iX_i\) with \(\hat{\beta}_{i1}X_i+\hat{\beta}_{i2}X_i^2\). To see how the component + residual plot might look when non-linear terms are included in the model, consider a model with a quadratic effect of acreage and a linear effect of no3. I.e., mod2 <- lm(longnosedace ~ poly(acreage, 2) + no3, data=dace). Then produce a component + residual plot using: termplot(mod2, se=T, partial=T).

3.4 Exercise 4

Consider the data set birdmalariaLFS in the Data4Ecologists library. These data are associated with the following paper: Asghar, M., Hasselquist, D., Hansson, B., Zehtindjiev, P., Westerdahl, H., & Bensch, S. (2015). Hidden costs of infection: chronic malaria accelerates telomere degradation and senescence in wild birds. Science, 347(6220), 436-438.

The authors were interested in whether chronic infection might impact the fitness (e.g., number of offspring) of wild great reed warblers. The data set contains the following variables:

  • LFS = lifetime reproductive success (number of offspring produced)
  • Sex of bird (0 if male and 1 if female)
  • year = year of birth
  • Tlifespan = total lifespan (in years)
  • infected = infected status (0 = uninfected thoughout life, 1 = infected at 1 year of life and remained infected throughout life)
  • BTL = telomere length shortly after birth (telomere length is sometimes used to predict lifespan, but is not without controversy; see e.g., (Whittemore, Vera, Martı́nez-Nevado, Sanpera, & Blasco, 2019)).

The data set can be accessed using:

library(Data4Ecologists)
data("birdmalariaLFS")

The authors fit the following model relating Tlifespan to the other predictors:

mod1 <- lm(Tlifespan ~ Sex + year + BTL + infected + BTL:infected, data=birdmalariaLFS)
summary(mod1)

Call:
lm(formula = Tlifespan ~ Sex + year + BTL + infected + BTL:infected, 
    data = birdmalariaLFS)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2416 -0.8060 -0.2603  0.5255  5.2395 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  104.66665   87.76436   1.193 0.236791    
Sex           -0.36932    0.32120  -1.150 0.253869    
year          -0.05247    0.04411  -1.190 0.237980    
BTL            3.00839    0.78981   3.809 0.000283 ***
infected       2.60804    1.15466   2.259 0.026809 *  
BTL:infected  -3.86458    1.27743  -3.025 0.003401 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.404 on 75 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.2408,    Adjusted R-squared:  0.1902 
F-statistic: 4.758 on 5 and 75 DF,  p-value: 0.0007873
  1. Write down a set of equations for the model and interpret the model parameters in the context of the problem.

  2. Write out the design matrix for the following 3 observations:

  • Observation 1: Sex = male, year = 1991, BTL = 0.80, infected = 1
  • Observation 2: Sex = female, year = 1991, BTL = 0.80, infected = 1
  • Observation 3: Sex = female, year = 1991, BTL = 0.80, infected = 0
  1. Evaluate whether or not you think the assumptions of the linear regression model are met.

3.5 Exercise 5

Consider the data set birdmalariaLFS in the Data4Ecologists library. These data are associated with the following paper: Asghar, M., Hasselquist, D., Hansson, B., Zehtindjiev, P., Westerdahl, H., & Bensch, S. (2015). Hidden costs of infection: chronic malaria accelerates telomere degradation and senescence in wild birds. Science, 347(6220), 436-438.

The authors were interested in whether chronic infection might the impact fitness of wild great reed warblers. The data set contains the following variables:

  • LFS = lifetime reproductive success (number of offspring produced)
  • Sex of bird (0 if male? 1female)
  • year = year of birth
  • Tlifespan = total lifespan (in years)
  • infected = infected status (0 = uninfected thoughout life, 1 = infected at 1 year of life and remained infected throughout life)

The data set can be accessed using:

library(Data4Ecologists)
data("birdmalariaLFS")
  1. Fit a linear model relating lifetime reproductive success to the sex of the bird, its birth year, total lifespan, and infected status. Allow the effect of total lifespan to differ for males and females. Also, assume the effect of birth year can be modeled using a continuous variable representing a trend over time.

  2. Describe the model using a set of equations and interpret the model parameters in the context of the problem. Do not forget about \(\sigma^2\)!

  3. Write out the design matrix for the following 3 observations:

  • Observation 1: Sex = male, year = 1991, Tlifespan = 2, infected = 1
  • Observation 2: Sex = female, year = 1991, Tlifespan = 2, infected = 1
  • Observation 3: Sex = female, year = 1991, Tlifespan = 2, infected = 0

References

Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). Palmerpenguins: Palmer archipelago (antarctica) penguin data. Retrieved from https://allisonhorst.github.io/palmerpenguins/
Maindonald, J. H., & Braun, W. J. (2020). DAAG: Data analysis and graphics data and functions. Retrieved from https://CRAN.R-project.org/package=DAAG
Whittemore, K., Vera, E., Martı́nez-Nevado, E., Sanpera, C., & Blasco, M. A. (2019). Telomere shortening rate predicts species life span. Proceedings of the National Academy of Sciences, 116(30), 15122–15127.

  1. Models cannot, however, include interaction terms.↩︎