Chapter 3 Multiple Regression and Design Matrices
3.1 Exercise 1
For this exercise, we will consider the penguins
data set from the palmerpenguins
package (Horst, Hill, & Gorman, 2020). The data set contains various measures of penguin size, species, and sex, along with the year and island when/where the observations were made. Begin by loading the data and dropping observations that have missing values for some of the predictors:
library(dplyr)
library(palmerpenguins)
data(penguins)
penguins <- penguins %>% filter(complete.cases(.))
First, consider a linear model in which you attempt to predict a penguin’s body mass (body_mass_g
) from its flipper length (flipper_length_mm
) and the sex (sex
) of the penguin.
- Fit the model using effects coding. Write down the entries in the design matrix, \(X\) for the first 3 observations in the data set. Verify your answer using the
model.matrix
function. The first 3 observations are given by:
## # A tibble: 3 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## # ℹ 2 more variables: sex <fct>, year <int>
Fit the same model using means coding. Again, write down the entries in the design matrix, \(X\) for the first 3 observations in the data set. Verify your answer using the
model.matrix
function.Now, consider adding
species
to the model in addition to flipper length (flipper_length-mm
) and the sex (sex
) of the penguin. Fit the model using effects coding. Write down the entries in the design matrix for the following observations:
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Gentoo Biscoe 50.5 15.9 225 5400
## 4 Gentoo Biscoe 44.9 13.3 213 5100
## 5 Chinstrap Dream 50.6 19.4 193 3800
## 6 Chinstrap Dream 46.7 17.9 195 3300
## # ℹ 2 more variables: sex <fct>, year <int>
- Lastly, let’s allow the effect of flipper length to be sex-specific. This can be accomplished by adding an interaction between
sex
andflipper_length_mm
,sex:flipper_length_mm
to the model from [3] above. Again, write down the entries in the design matrix for 6 observations selected just above.
3.2 Exercise 2
For this exercise, we will use the leaftemp
data set in the DAAG
package (Maindonald & Braun, 2020). The data set contains measurements of vapor pressure (vapPress
) and differences between leaf and air temperatures (tempDiff
) in an experiment conducted at three different levels of carbon dioxide (CO2level
).
- Fit the following three models to these data:
- simple linear regression:
lm(tempDiff ~ vapPress, data = leaftemp)
- Analysis of covariance:
lm(tempDiff ~ vapPress + CO2level, data= leaftemp)
- Interaction model:
lm(tempDiff ~ vapPress*CO2level, data= leaftemp)
For the Analysis of covariance model, write down the equation for predicting the mean temperature difference as a function of vapor pressure for each of the 3 CO\(_2\) levels.
For the Interaction model, write down the equation for predicting the mean temperature difference as a function of vapor pressure for each of the 3 CO\(_2\) levels.
Plot the predicted mean temperature difference as a function of vapor pressure (and when appropriate, CO\(_2\) level) for each of the 3 models.
Construct and fit a model in which each CO\(_2\) level has a different intercept, but the medium and high CO\(_2\) level treatments have the same slope for
vapPress
(which differs from the slope for the Low CO\(_2\) level treatment).
3.3 Exercise 3
This example uses data from the Maryland Biological Stream Survey to explore added-variable and component+residual plots. To access the data, type:
Predictors in the data set include:
- stream: name of each sampled stream (i.e., the name associated with each case).
- longnosedace: number of longnose dace (Rhinichthys cataractae) per 75-meter section of stream.
- acreage: area (in acres) drained by the stream.
- do2: the dissolved oxygen (in mg/liter).
- maxdepth: the maximum depth (in cm) of the 75-meter segment of stream.
- no3: nitrate concentration (mg/liter).
- so4: sulfate concentration (mg/liter).
- temp: the water temperature on the sampling date (in degrees C).
Begin by fitting a model containing all of the predictor variables in the data set (other than
stream
).Create residual diagnostic plots to evaluate the assumptions of the linear regression model. Comment on whether or not you think each assumption of linear regression is met?
Explore the effect of each predictor (after accounting for all other predictors in the model) using added-variable plots constructed using the
avPlots
function in thecar
package. Just typeavPlots(model-name)
wheremodel-name
is the name you assign to the fitted model. When creating your report, you may want to adjust the size of the figure in the output using chunk options: fig.height = 12, fig.width=6, out.width = “80%”. You can also change the layout of the panels. For example, if you want a 2 x 3 set of panels (rather than the default 3 x 2), you can useavPlots(mod, layout = c(2, 3))
.Construct partial residual or component + residual plots using using the
crPlots
function in thecar
package viacrPlots(model-name)
. Again, you may want to adjust the size of the plot and the layout of the panels.Consider the added-variable and component + residual plots. Do these tell you anything more about potential assumption violations?
Component+residual plots can also be constructed using the
termplot
function (in basestats
package) with the argumentpartial=T
. Thetermplot
function allows one to generalize the component+residual plot to more complicated models that allow for non-linear relationships2. For example, we can construct a component+residual plot for a model with a quadratic term by replacing \(\hat{\beta}_iX_i\) with \(\hat{\beta}_{i1}X_i+\hat{\beta}_{i2}X_i^2\). To see how the component + residual plot might look when non-linear terms are included in the model, consider a model with a quadratic effect ofacreage
and a linear effect ofno3
. I.e.,mod2 <- lm(longnosedace ~ poly(acreage, 2) + no3, data=dace)
. Then produce a component + residual plot using:termplot(mod2, se=T, partial=T)
.
3.4 Exercise 4
Consider the data set birdmalariaLFS
in the Data4Ecologists
library. These data are associated with the following paper:
Asghar, M., Hasselquist, D., Hansson, B., Zehtindjiev, P., Westerdahl, H., & Bensch, S. (2015). Hidden costs of infection: chronic malaria accelerates telomere degradation and senescence in wild birds. Science, 347(6220), 436-438.
The authors were interested in whether chronic infection might impact the fitness (e.g., number of offspring) of wild great reed warblers. The data set contains the following variables:
LFS
= lifetime reproductive success (number of offspring produced)Sex
of bird (0 if male and 1 if female)year
= year of birthTlifespan
= total lifespan (in years)infected
= infected status (0 = uninfected thoughout life, 1 = infected at 1 year of life and remained infected throughout life)BTL
= telomere length shortly after birth (telomere length is sometimes used to predict lifespan, but is not without controversy; see e.g., (Whittemore, Vera, Martı́nez-Nevado, Sanpera, & Blasco, 2019)).
The data set can be accessed using:
The authors fit the following model relating Tlifespan
to the other predictors:
mod1 <- lm(Tlifespan ~ Sex + year + BTL + infected + BTL:infected, data=birdmalariaLFS)
summary(mod1)
Call:
lm(formula = Tlifespan ~ Sex + year + BTL + infected + BTL:infected,
data = birdmalariaLFS)
Residuals:
Min 1Q Median 3Q Max
-2.2416 -0.8060 -0.2603 0.5255 5.2395
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 104.66665 87.76436 1.193 0.236791
Sex -0.36932 0.32120 -1.150 0.253869
year -0.05247 0.04411 -1.190 0.237980
BTL 3.00839 0.78981 3.809 0.000283 ***
infected 2.60804 1.15466 2.259 0.026809 *
BTL:infected -3.86458 1.27743 -3.025 0.003401 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.404 on 75 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.2408, Adjusted R-squared: 0.1902
F-statistic: 4.758 on 5 and 75 DF, p-value: 0.0007873
Write down a set of equations for the model and interpret the model parameters in the context of the problem.
Write out the design matrix for the following 3 observations:
- Observation 1: Sex = male, year = 1991, BTL = 0.80, infected = 1
- Observation 2: Sex = female, year = 1991, BTL = 0.80, infected = 1
- Observation 3: Sex = female, year = 1991, BTL = 0.80, infected = 0
- Evaluate whether or not you think the assumptions of the linear regression model are met.
3.5 Exercise 5
Consider the data set birdmalariaLFS
in the Data4Ecologists
library. These data are associated with the following paper:
Asghar, M., Hasselquist, D., Hansson, B., Zehtindjiev, P., Westerdahl, H., & Bensch, S. (2015). Hidden costs of infection: chronic malaria accelerates telomere degradation and senescence in wild birds. Science, 347(6220), 436-438.
The authors were interested in whether chronic infection might the impact fitness of wild great reed warblers. The data set contains the following variables:
LFS
= lifetime reproductive success (number of offspring produced)Sex
of bird (0 if male? 1female)year
= year of birthTlifespan
= total lifespan (in years)infected
= infected status (0 = uninfected thoughout life, 1 = infected at 1 year of life and remained infected throughout life)
The data set can be accessed using:
Fit a linear model relating lifetime reproductive success to the sex of the bird, its birth year, total lifespan, and infected status. Allow the effect of total lifespan to differ for males and females. Also, assume the effect of birth year can be modeled using a continuous variable representing a trend over time.
Describe the model using a set of equations and interpret the model parameters in the context of the problem. Do not forget about \(\sigma^2\)!
Write out the design matrix for the following 3 observations:
- Observation 1: Sex = male, year = 1991, Tlifespan = 2, infected = 1
- Observation 2: Sex = female, year = 1991, Tlifespan = 2, infected = 1
- Observation 3: Sex = female, year = 1991, Tlifespan = 2, infected = 0
References
Models cannot, however, include interaction terms.↩︎