Chapter 4 Non-linear regression
4.1 Exercise 1
The data for this exercise are contained in the `Data4Ecologists` package and can be accessed by loading the package and then typing `data(clutch)`.
- Read in the data set and look at the first 6 records using the `head` function. Note that clutch size is contained in a variable named `CLUTCH`. You can access the variable names with the `names` function. Change the variable names so that they are all lower case using:
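A sketch of these steps (assumes the `Data4Ecologists` package is installed):

```r
library(Data4Ecologists)                 # data package (must be installed)
data(clutch)
head(clutch)                             # first 6 records
names(clutch) <- tolower(names(clutch))  # e.g., CLUTCH -> clutch
```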
- Create a scatterplot to explore how clutch size (`clutch`) varies by `year` and nest initiation date (`date`). Note: `date` is measured as the number of days since Jan 1, so it really reflects Julian day. It will help to first tell R that `year` should be treated as a categorical variable. To do this, use:
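A minimal sketch (assumes the `clutch` data frame with lower-case names from the previous step):

```r
clutch$year <- as.factor(clutch$year)  # treat year as categorical
```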
One option for creating the plot is to use `ggplot` in the `ggplot2` library; another is the base R `plot` function.
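For instance, a `ggplot2` version might look like this (a sketch; the aesthetic choices are mine):

```r
library(ggplot2)
# clutch size vs. nest initiation date, colored by year
ggplot(clutch, aes(x = date, y = clutch, color = year)) +
  geom_point() +
  xlab("Nest initiation date (Julian day)") +
  ylab("Clutch size")
```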
- There are several large outliers representing nests that were parasitized. Let's drop all observations with clutch sizes > 15 (e.g., using the `subset` function or the `dplyr::filter` function in the `dplyr` package). If you are unfamiliar with these functions, explore their help pages by typing `?subset` or `?filter` into the R console. Ignore the fact that most of the nest structures were observed for multiple years (i.e., the observations are not necessarily independent). Fit a linear model relating `clutch` to `year` and `date`. Evaluate whether the assumptions of linear regression are reasonably met.
- Fit a model using `lm`, allowing for a quadratic relationship between `date` and `clutch`. Evaluate whether the assumptions of linear regression are reasonably met. Note: the `resid_panel` function does not seem to work with models that include terms created using `poly` or `ns`. You can use the default `plot` function or the `plot_model` function in the `sjPlot` package with `type = "diag"`. Or, try the `check_model` function in the `performance` package. Comment on whether you think the assumptions of linear regression are reasonably met.
- Now, fit a linear spline model with a single knot at `date` = 150. You will need to create a new predictor that is 0 if `date` \(\le 150\) and equal to `date` - 150 otherwise. Again, evaluate whether the assumptions of linear regression are reasonably met. Create a plot showing the fit of the model, overlaid on the original data. You can get fitted values from the model for a reasonable range of predictor values using:
```r
newdat <- data.frame(expand.grid(date = seq(90, 180, by = 1),
                                 year = c("1997", "1998", "1999")))
newdat$fitted <- predict(modelname, newdata = newdat)
```
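One way to build the linear spline predictor and model described above (the names `date150` and `fit.spline` are my own):

```r
# 0 if date <= 150, date - 150 otherwise
clutch$date150 <- pmax(clutch$date - 150, 0)
fit.spline <- lm(clutch ~ year + date + date150, data = clutch)
```

If you use this model with the `newdat` prediction code, remember to add the same `date150` column to `newdat` first.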
One option for creating the plot is to use the `plot` function and then add the fitted lines using the `lines` function (one call to `lines` for each year). Alternatively, you can use `ggplot`:
```r
ggplot(newdat, aes(date, fitted, color = year)) + geom_line() +
  geom_point(data = clutch, aes(date, clutch)) + xlab("Nest Initiation date") +
  ylab("Fitted values")
```
- Fit a model with natural cubic regression splines and 3 degrees of freedom allocated to `date`. You will need to load the `splines` library and then use `ns(date, 3)` as a predictor. Look at a residuals versus fitted values plot for these data. Create a plot showing the fit of the model, overlaid on the original data.
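A sketch of the natural spline fit, assuming the filtered `clutch` data (the model name is my choice):

```r
library(splines)
fit.ns <- lm(clutch ~ year + ns(date, 3), data = clutch)
plot(fitted(fit.ns), resid(fit.ns))  # residuals vs. fitted values
```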
4.2 Exercise 2
Consider the `ElephantsMF` data set in the `Stat2Data` library (Cannon et al., 2019), which contains data collected by Lee, Bussière, Webber, Poole, & Moss (2013). The data set has 3 variables collected from 288 African elephants that lived through droughts in the first two years of their life:
- Age (in years)
- Height (shoulder height in cm)
- Sex (F = female, M = male)
- Plot the data, showing the relationship between `Height` and `Age` for each sex.
- Fit a linear model to the data: `lm(Height ~ Age*Sex, data = ElephantsMF)`. Evaluate the assumptions of the model graphically. Be sure to comment on what you are looking for in each plot (e.g., the assumption you are looking to evaluate and what would constitute an assumption violation).
- Fit a model that allows the relationship between Height and Age to be non-linear and to also differ for males and females. Again, evaluate the assumptions of the model graphically.
- Plot the fit of your model from step 3 with the data overlaid.
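One possible specification for the non-linear model in step 3, using a natural spline for `Age` (the 3 degrees of freedom is my choice, not prescribed by the exercise):

```r
library(splines)
library(Stat2Data)
data(ElephantsMF)
# separate non-linear Height-Age curves for each sex
fit.el <- lm(Height ~ ns(Age, 3) * Sex, data = ElephantsMF)
```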
4.3 Exercise 3
This exercise will explore how polynomials, simple linear splines, and restricted cubic regression splines can be used to model non-linear relationships between a predictor, \(X\), and the mean response at a given value of \(X\), \(E[Y|X]\).
This exercise is largely constructed from a tutorial written by Patrick Heagerty, a professor in the Biostatistics Department at the University of Washington.
Linear model: We will purposely simulate data where a linear model will not be appropriate.
- Simulate a predictor variable that takes on values 1:24
- Simulate a response variable that is predicted by the variable, \(X\), but in a highly non-linear way:
```r
x <- 1:24                                                      # predictor values
mu <- 10 + 5 * sin( x * pi / 24 ) - 2 * cos( (x-6)*4/24 )      # true mean
set.seed(2010)
eps <- rnorm( length(mu) )                                     # iid N(0, 1) errors
y <- mu + eps
```
- Plot the data showing the relationship between \(X\) and \(Y\) along with the least-squares regression line.
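A sketch, assuming `x` and `y` from the simulation step above:

```r
fit.lm <- lm(y ~ x)   # least-squares line (ignores the non-linearity)
plot(x, y)
abline(fit.lm)
```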
Linear Spline model: We can fit a model where the slope of the line relating \(Y\) to \(X\) changes at different locations, called knots.
- Create a set of basis vectors to allow fitting a linear spline model with changes in the slope at \(X = 6, 12,\) and \(18\). The first basis vector, \((X-6)_+\) can be created using:
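One possible construction (the name `x6` is my choice):

```r
x <- 1:24
x6 <- (x - 6) * (x > 6)   # the basis vector (X - 6)_+
```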
This creates a predictor that can be included as a column in the design matrix. This predictor will be 0 when \(X \le 6\) and \(X-6\) when \(X > 6\). Create similar predictors to capture changes in slope for \(X > 12\) and \(X > 18\).
- Fit a model that includes these 3 basis functions: \(Y_i = \beta_0 + \beta_1X_i + \beta_2(X_i-6)_+ + \beta_3(X_i-12)_+ + \beta_4(X_i-18)_+ + \epsilon_i\). Overlay the fitted mean on a scatterplot of the data. You can use the `predict` function with your fitted model object to calculate the fitted mean:
\(\hat{Y}_i = \hat{E}[Y_i|X_i] = \hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2(X_i-6)_+ + \hat{\beta}_3(X_i-12)_+ + \hat{\beta}_4(X_i-18)_+\)
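A self-contained sketch of fitting this linear spline model (re-creating the simulated data and basis vectors; the object names are mine):

```r
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))
x6  <- (x - 6)  * (x > 6)
x12 <- (x - 12) * (x > 12)
x18 <- (x - 18) * (x > 18)
fit.ls <- lm(y ~ x + x6 + x12 + x18)
plot(x, y)
lines(x, predict(fit.ls))   # fitted mean: piecewise linear in x
```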
- Plot the columns of the design matrix from the linear spline model versus \(X\), i.e., plot:
- \(1\) versus \(X\)
- \(X\) versus \(X\)
- \((X-6)_+\) versus \(X\)
- \((X-12)_+\) versus \(X\)
- \((X-18)_+\) versus \(X\)
- Plot the weighted columns of the design matrix from the linear spline model versus \(X\), i.e., plot:
- \(\hat{\beta}_0 \cdot 1\) versus \(X\)
- \(\hat{\beta}_1X\) versus \(X\)
- \(\hat{\beta}_2(X-6)_+\) versus \(X\)
- \(\hat{\beta}_3(X-12)_+\) versus \(X\)
- \(\hat{\beta}_4(X-18)_+\) versus \(X\)
Note, if we add these contributions (i.e., the five terms plotted above) for any given value of \(X\), we get \(\hat{Y}_i = \hat{E}[Y_i|X_i] = \hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2(X_i-6)_+ + \hat{\beta}_3(X_i-12)_+ + \hat{\beta}_4(X_i-18)_+\).
- What is the slope of the fitted curve when \(X = 3\)? How about when \(X = 9, 14,\) and \(19\)?
Polynomial model: One disadvantage of the linear spline model is that it is not as “smooth” as we might desire (e.g., its slope changes abruptly at the knot locations, \(X = 6, 12, 18\), where the first derivative is not continuous). One way to create a model that is “smooth” is to use polynomial terms. This will result in a model that is continuous and whose first derivative is also continuous (so, there will be no quick changes in the direction of \(E[Y|X]\) as we increase or decrease \(X\)).
- Create a set of basis vectors, \(X\), \(X^2\), and \(X^3\), to fit a cubic polynomial model: \(Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \epsilon_i\). Fit this model and overlay the fitted mean on a scatterplot of the data. You can get the values of the fitted mean using the `predict` function along with your fitted model object.
- Plot the columns of the design matrix versus \(X\), i.e., plot:
- \(1\) versus \(X\)
- \(X_i\) versus \(X\)
- \(X_i^2\) versus \(X\)
- \(X_i^3\) versus \(X\)
- Plot the weighted columns of the design matrix versus \(X\), i.e., plot:
- \(\hat{\beta}_0 \cdot 1\) versus \(X\)
- \(\hat{\beta}_1X_i\) versus \(X\)
- \(\hat{\beta}_2X_i^2\) versus \(X\)
- \(\hat{\beta}_3X_i^3\) versus \(X\)
Again, if we add these 4 contributions up for any given value of \(X\), we get \(\hat{Y}_i = \hat{E}[Y_i|X_i] = \hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2X_i^2+ \hat{\beta}_3X_i^3\).
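The cubic polynomial fit above can be sketched as follows (assumes `x` and `y` from the simulation step; `I()` protects the powers inside the model formula):

```r
fit.poly <- lm(y ~ x + I(x^2) + I(x^3))
plot(x, y)
lines(x, predict(fit.poly))   # fitted cubic mean
```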
Cubic regression splines with truncated basis functions: Next, we will consider a model with a set of cubic spline basis functions that we create ourselves.
- Create a set of cubic regression spline basis vectors, \((X-6)^3_+\), \((X-12)^3_+\), and \((X-18)^3_+\), to go with the polynomial basis vectors from the previous section (these can be formed by simply raising the linear spline basis vectors to the third power). For example, we can create \((X-6)^3_+\) using:
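For example (a sketch; the name `x6.cubed` is mine):

```r
x <- 1:24
x6.cubed <- ((x - 6) * (x > 6))^3   # the basis vector (X - 6)^3_+
```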
These additional basis vectors will allow our fitted cubic function to change after each knot location. In essence, we will end up with a different cubic fit within each segment: \(X < 6, 6 \le X < 12, 12 \le X <18, X \ge 18\).
Fit the model with these additional terms:
\(Y = \beta_0 + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \beta_4(X-6)^3_+ + \beta_5(X-12)^3_+ + \beta_6(X-18)^3_+ +\epsilon_i\).
Overlay the fitted mean on a scatterplot of the data.
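A sketch of this fit, assuming the simulated `x` and `y` plus truncated cubic basis vectors `x6.cubed`, `x12.cubed`, and `x18.cubed` (names are mine, constructed as above):

```r
fit.tc <- lm(y ~ x + I(x^2) + I(x^3) + x6.cubed + x12.cubed + x18.cubed)
plot(x, y)
lines(x, predict(fit.tc))   # piecewise cubic fitted mean
```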
Natural cubic regression splines: Next, we fit a model with natural cubic regression spline basis functions. This again allows for a piecewise cubic fit within each segment formed by our interior knots, but forces the fit to be linear outside of the boundary knots. We can use the `splines` package to form our basis vectors. We will use the same set of interior knots (6, 12, 18) and set the boundary knots to the minimum and maximum values in the data set (\(X = 1\) and \(X = 24\), respectively).
- Load the `splines` package and fit the natural cubic regression spline model using:

```r
library(splines)
fit <- lm( y ~ ns(x, knots=c(6, 12, 18)))
```

- Use the `predict` function to overlay the fitted model on the data.
- Plot the basis vectors versus \(X\) or the weighted basis vectors versus \(X\). You can access the basis vectors directly by using the `model.matrix` function to pull off the different columns of the design matrix.
- Comment on the relative strengths and weaknesses of the different approaches considered (i.e., linear models, linear splines, polynomials, cubic regression splines).
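A self-contained sketch of pulling off and plotting the basis vectors (re-creating the simulated data; `ns` with 3 interior knots produces 4 basis columns plus the intercept):

```r
library(splines)
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))
fit <- lm(y ~ ns(x, knots = c(6, 12, 18)))
B <- model.matrix(fit)                     # design matrix: intercept + 4 basis vectors
matplot(x, B[, -1], type = "l", lty = 1,
        xlab = "x", ylab = "Basis value")  # basis vectors vs. x
```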
4.4 Exercise 4
The code below reads in data containing estimates of the number of moose in Minnesota between 2005 and 2020:
```r
nmoose <- data.frame(
  year = 2005:2020,
  estimate = c(8160, 8840, 6860, 7890, 7840, 5700, 4900, 4230,
               2760, 4350, 3450, 4020, 3710, 3030, 4180, 3150))
```
Fit a linear regression model to the data and evaluate whether the assumptions are reasonable. Make sure to comment on what you would expect to see in each plot if the assumptions of linear regression were met.
Fit a model that allows for a non-linear trend in moose numbers over time, using linear or cubic regression splines. Plot the data with the fitted curve, similar to the figure from the MN DNR report in which they analyze the trend in moose numbers over time.
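One possible version of the spline fit and plot, assuming `nmoose` from the code above (the natural spline with 3 df is my choice):

```r
library(splines)
fit.moose <- lm(estimate ~ ns(year, 3), data = nmoose)
plot(nmoose$year, nmoose$estimate,
     xlab = "Year", ylab = "Estimated number of moose")
lines(nmoose$year, predict(fit.moose))   # non-linear trend over time
```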