Chapter 4 Non-linear regression

4.1 Exercise 1

The data for this exercise are contained in the Data4Ecologists package and can be accessed by loading the package and then typing data(clutch).

  1. Read in the data set and look at the first 6 records using the head function.

  2. Note that clutch size is contained in a variable named CLUTCH. You can access the variable names with the names function. Change the variable names so that they are all lower case using:

names(clutch) <- tolower(names(clutch))
  3. Create a scatterplot to explore how clutch size (clutch) varies by year and nest initiation date (date). Note: date is measured as the number of days since Jan 1, so it reflects Julian day.

It will help to first tell R that year should be treated as a categorical variable. To do this, use:

clutch$year <- as.factor(clutch$year)

One option for creating the plot is to use ggplot in the ggplot2 library:

ggplot(clutch, aes(x = date, y = clutch, color = year)) + geom_point()

or

ggplot(clutch, aes(x = date, y = clutch)) + geom_point() + facet_wrap(~ year)
  4. There are several large outliers representing nests that were parasitized. Let's drop all observations with clutch sizes > 15 (e.g., using the subset function or the filter function in the dplyr package). If you are unfamiliar with these functions, explore their help pages by typing ?subset or ?dplyr::filter into the R console.

  5. Ignore the fact that most of the nest structures were observed for multiple years (i.e., the observations are not necessarily independent). Fit a linear model relating clutch to year and date. Evaluate whether the assumptions of linear regression are reasonably met.

  6. Fit a model using lm, allowing for a quadratic relationship between date and clutch. Note: the resid_panel function does not seem to work with models that include terms created using poly or ns. You can use the default plot function or the plot_model function in the sjPlot package with type = "diag". Or, try the check_model function in the performance package. Comment on whether you think the assumptions of linear regression are reasonably met.

  7. Now, fit a linear spline model with a single knot at date = 150. You will need to create a new predictor that equals 0 if date \(\le 150\) and date - 150 otherwise. Again, evaluate whether the assumptions of linear regression are reasonably met.
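As a hedged sketch, the spline term might be constructed as below. The toy date vector here just stands in for clutch$date so the construction is visible; the commented lines show how the same idea could be applied to the real data.

```r
# Toy stand-in for clutch$date, just to illustrate the construction
date <- 90:180
date150 <- ifelse(date <= 150, 0, date - 150)  # 0 up to the knot, then date - 150

# With the real data, the model could then be fit as (not run here):
# clutch$date150 <- ifelse(clutch$date <= 150, 0, clutch$date - 150)
# fit.spline <- lm(clutch ~ year + date + date150, data = clutch)
```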

  8. Create a plot showing the fit of the model, overlaid on the original data. You can get fitted values from the model for a reasonable range of predictor values using:

newdat <- data.frame(expand.grid(date = seq(90, 180, by = 1), year = c("1997", "1998", "1999")))
newdat$fitted <- predict(modelname, newdata = newdat)

One option for creating the plot is to use the plot function. Then, add the fitted lines using the lines function (one call to lines for each year).

Alternatively, you can use ggplot:

ggplot(newdat, aes(date, fitted, color = year)) + geom_line() + 
    geom_point(data = clutch, aes(date, clutch)) + xlab("Nest Initiation date") + 
     ylab("Fitted values")
  9. Fit a model with natural cubic regression splines and 3 degrees of freedom allocated to date. You will need to load the splines library and then use ns(date, 3) as a predictor. Look at a residuals versus fitted values plot for these data. Create a plot showing the fit of the model, overlaid on the original data.
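A sketch of the spline fit and residual plot, using simulated stand-in data shaped loosely like the clutch data (with the real data, replace x and y with date and clutch and include year in the model):

```r
library(splines)

# Hypothetical stand-in data; replace with the real clutch data in practice
set.seed(1)
x <- runif(100, 90, 180)
y <- 10 - 0.001 * (x - 135)^2 + rnorm(100)

fit.ns <- lm(y ~ ns(x, df = 3))   # natural cubic spline with 3 df
plot(fitted(fit.ns), resid(fit.ns),
     xlab = "Fitted values", ylab = "Residuals")  # residual diagnostic
```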

4.2 Exercise 2

Consider the ElephantsMF data set in the Stat2Data library (Cannon et al., 2019), which contains data collected by Lee, Bussière, Webber, Poole, & Moss (2013). The data set has 3 variables collected from 288 African elephants that lived through droughts in the first two years of their life:

  • Age (in years)
  • Height (shoulder height in cm)
  • Sex (F = female, M = male)
  1. Plot the data, showing the relationship between Height and Age for each sex.

  2. Fit a linear model to the data: lm(Height ~ Age*Sex, data = ElephantsMF). Evaluate the assumptions of the model graphically. Be sure to comment on what you are looking for in each plot (e.g., the assumption you are looking to evaluate and what would constitute an assumption violation).

  3. Fit a model that allows the relationship between Height and Age to be non-linear and to also differ for males and females. Again, evaluate the assumptions of the model graphically.
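One possibility is a natural spline for Age that interacts with Sex, sketched here with simulated stand-in data since the ElephantsMF data are not loaded in this document (with the real data, replace d with ElephantsMF):

```r
library(splines)

# Hypothetical stand-in data using the same variable names as ElephantsMF
set.seed(42)
d <- data.frame(Age = runif(288, 0, 40),
                Sex = sample(c("F", "M"), 288, replace = TRUE))
d$Height <- 120 + 60 * log(d$Age + 1) + 5 * (d$Sex == "M") + rnorm(288, sd = 10)

# Natural spline for Age; the interaction allows sex-specific curves
fit.el <- lm(Height ~ ns(Age, df = 3) * Sex, data = d)
plot(fitted(fit.el), resid(fit.el))  # residuals versus fitted values
```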

  4. Plot the fit of your model from step 3 with the data overlaid.

4.3 Exercise 3

This exercise will explore how polynomials, simple linear splines, and restricted cubic regression splines can be used to model non-linear relationships between a predictor, \(X\), and the mean response at a given value of \(X\), \(E[Y|X]\).

This exercise is largely constructed from a tutorial written by Patrick Heagerty, a professor in the Biostatistics Department at the University of Washington.

Linear model: We will purposely simulate data where a linear model will not be appropriate.

  1. Simulate a predictor variable that takes on values 1:24
x <- 1:24
  2. Simulate a response variable that is predicted by the variable, \(X\), but in a highly non-linear way:
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
eps <- rnorm(length(mu))
y <- mu + eps
  3. Plot the data showing the relationship between \(X\) and \(Y\) along with the least-squares regression line.
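A sketch of this step (the simulation code is repeated here so the block stands on its own):

```r
# Simulated data, as above
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))

# Scatterplot with the least-squares line overlaid
fit.lm <- lm(y ~ x)
plot(x, y)
abline(fit.lm)  # a straight line clearly misses the curvature
```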

Linear Spline model: We can fit a model where the slope of the line relating \(Y\) to \(X\) changes at different locations, called knots.

  4. Create a set of basis vectors to allow fitting a linear spline model with changes in the slope at \(X = 6, 12,\) and \(18\). The first basis vector, \((X-6)_+\), can be created using:
x6 <- ifelse(x < 6, 0, x - 6)

This creates a predictor that can be included as a column in the design matrix. This predictor will be 0 when \(X \le 6\) and \(X-6\) when \(X > 6\). Create similar predictors to capture changes in slope for \(X > 12\) and \(X > 18\).
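The remaining two basis vectors can be created the same way (x6 is repeated here so the block stands alone):

```r
x <- 1:24
x6  <- ifelse(x < 6, 0, x - 6)    # (X - 6)+
x12 <- ifelse(x < 12, 0, x - 12)  # (X - 12)+
x18 <- ifelse(x < 18, 0, x - 18)  # (X - 18)+
```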

  5. Fit a model that includes these 3 basis functions: \(Y_i = \beta_0 + \beta_1X_i + \beta_2(X_i-6)_+ + \beta_3(X_i-12)_+ + \beta_4(X_i-18)_+ + \epsilon_i\). Overlay the fitted mean on a scatterplot of the data. You can use the predict function with your fitted model object to calculate the fitted mean:

\(\hat{Y}_i = \hat{E}[Y_i|X_i] = \hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2(X_i-6)_+ + \hat{\beta}_3(X_i-12)_+ + \hat{\beta}_4(X_i-18)_+\)
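A self-contained sketch of the linear spline fit (simulation and basis vectors repeated so the block runs on its own):

```r
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))

# Truncated linear basis vectors with knots at 6, 12, and 18
x6  <- ifelse(x < 6, 0, x - 6)
x12 <- ifelse(x < 12, 0, x - 12)
x18 <- ifelse(x < 18, 0, x - 18)

fit.ls <- lm(y ~ x + x6 + x12 + x18)
plot(x, y)
lines(x, predict(fit.ls))  # fitted piecewise-linear mean
```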

  6. Plot the columns of the design matrix from the linear spline model versus \(X\), i.e., plot:
  a. \(1\) versus \(X\)
  b. \(X\) versus \(X\)
  c. \((X-6)_+\) versus \(X\)
  d. \((X-12)_+\) versus \(X\)
  e. \((X-18)_+\) versus \(X\)
  7. Plot the weighted columns of the design matrix from the linear spline model versus \(X\), i.e., plot:
  a. \(\hat{\beta}_0 \cdot 1\) versus \(X\)
  b. \(\hat{\beta}_1X\) versus \(X\)
  c. \(\hat{\beta}_2(X-6)_+\) versus \(X\)
  d. \(\hat{\beta}_3(X-12)_+\) versus \(X\)
  e. \(\hat{\beta}_4(X-18)_+\) versus \(X\)

Note, if we add these contributions (i.e., from a-e) for any given value of \(X\), we get \(\hat{Y}_i = \hat{E}[Y_i|X_i] = \hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2(X_i-6)_+ + \hat{\beta}_3(X_i-12)_+ + \hat{\beta}_4(X_i-18)_+\).

  8. What is the slope of the fitted curve when \(X = 3\)? How about when \(X = 9, 14,\) and \(19\)?
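One way to see the answer: within each segment the slope is the running sum of the coefficients of the basis vectors that are "active" there (\(\hat{\beta}_1\) for \(X < 6\), \(\hat{\beta}_1 + \hat{\beta}_2\) for \(6 \le X < 12\), and so on). A sketch, with the fit repeated here so the block stands alone:

```r
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))
x6  <- ifelse(x < 6, 0, x - 6)
x12 <- ifelse(x < 12, 0, x - 12)
x18 <- ifelse(x < 18, 0, x - 18)
fit.ls <- lm(y ~ x + x6 + x12 + x18)

# Slopes in the 4 segments: cumulative sums of the slope coefficients;
# these give the slopes at X = 3, 9, 14, and 19, respectively
slopes <- cumsum(coef(fit.ls)[c("x", "x6", "x12", "x18")])
slopes
```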

Polynomial model: One disadvantage of the linear spline model is that it is not as “smooth” as we might desire (e.g., its slope changes abruptly at the knot locations, \(X = 6, 12, 18\), where the first derivative is not continuous). One way to create a model that is “smooth” is to use polynomial terms. This will result in a model that is continuous and whose first derivative is also continuous (so, there will be no quick changes in the direction of \(E[Y|X]\) as we increase or decrease \(X\)).

  9. Create a set of basis vectors, \(X\), \(X^2\), and \(X^3\), to fit a cubic polynomial model: \(Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \epsilon_i\). Fit this model and overlay the fitted mean on a scatterplot of the data. You can get the values of the fitted mean using the predict function along with your fitted model object.
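A sketch of the polynomial fit (simulation repeated for self-containment); I() protects the powers inside the model formula:

```r
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))

fit.poly <- lm(y ~ x + I(x^2) + I(x^3))
plot(x, y)
lines(x, predict(fit.poly))  # fitted cubic mean
```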

  10. Plot the columns of the design matrix versus \(X\), i.e., plot:

  a. \(1\) versus \(X\)
  b. \(X\) versus \(X\)
  c. \(X^2\) versus \(X\)
  d. \(X^3\) versus \(X\)
  11. Plot the weighted columns of the design matrix versus \(X\), i.e., plot:
  a. \(\hat{\beta}_0 \cdot 1\) versus \(X\)
  b. \(\hat{\beta}_1X\) versus \(X\)
  c. \(\hat{\beta}_2X^2\) versus \(X\)
  d. \(\hat{\beta}_3X^3\) versus \(X\)

Again, if we add these 4 contributions up for any given value of \(X\), we get \(\hat{Y}_i = \hat{E}[Y_i|X_i] = \hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2X_i^2+ \hat{\beta}_3X_i^3\).
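One way to visualize all the weighted columns at once is to scale each column of the design matrix by its estimated coefficient; the row sums then reproduce the fitted values exactly (fit repeated here so the block stands alone):

```r
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))
fit.poly <- lm(y ~ x + I(x^2) + I(x^3))

X <- model.matrix(fit.poly)            # columns: 1, x, x^2, x^3
W <- sweep(X, 2, coef(fit.poly), `*`)  # each column scaled by its coefficient
matplot(x, W, type = "l", ylab = "Weighted basis")

# TRUE: the contributions sum to the fitted values
all.equal(unname(rowSums(W)), unname(fitted(fit.poly)))
```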

Cubic regression splines with truncated basis functions: Next, we will consider a model with a set of cubic spline basis functions that we create ourselves.

  12. Create a set of cubic regression spline basis vectors, \((X-6)^3_+\), \((X-12)^3_+\), and \((X-18)^3_+\), to go with the polynomial basis vectors from steps 9-11 (these can be formed by simply raising the basis vectors from step 4 to the third power). For example, we can create \((X-6)^3_+\) using:
x6 <- ifelse(x < 6, 0, x - 6)
x6cubed <- x6^3

These additional basis vectors will allow our fitted cubic function to change after each knot location. In essence, we will end up with a different cubic fit within each segment: \(X < 6\), \(6 \le X < 12\), \(12 \le X < 18\), \(X \ge 18\).

Fit the model with these additional terms:

\(Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2 + \beta_3X_i^3 + \beta_4(X_i-6)^3_+ + \beta_5(X_i-12)^3_+ + \beta_6(X_i-18)^3_+ + \epsilon_i\).

Overlay the fitted mean on a scatterplot of the data.
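A self-contained sketch of the truncated-power-basis cubic spline fit:

```r
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))
x6  <- ifelse(x < 6, 0, x - 6)
x12 <- ifelse(x < 12, 0, x - 12)
x18 <- ifelse(x < 18, 0, x - 18)

# Cubic polynomial plus cubed truncated terms at each knot
fit.cs <- lm(y ~ x + I(x^2) + I(x^3) + I(x6^3) + I(x12^3) + I(x18^3))
plot(x, y)
lines(x, predict(fit.cs))
```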

Natural cubic regression splines: Next, we fit a model with natural cubic regression spline basis functions. This will allow again for a piecewise cubic model within each segment formed by our interior knots but linear fits outside of the boundary knots. We can use the splines package to form our basis vectors. We will use the same set of interior knots (6, 12, 18) and set the boundary knots to the minimum and maximum values in the data set (\(X = 1\) and \(X=24\), respectively).

  13. Load the splines package and fit the natural cubic regression spline model using: fit <- lm(y ~ ns(x, knots = c(6, 12, 18))).

  14. Use the predict function to overlay the fitted model on the data.

  15. Plot the basis vectors versus \(X\) or the weighted basis vectors versus \(X\). You can access the basis vectors directly by using the model.matrix function to pull off the different columns of the design matrix.
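A sketch of extracting and plotting the natural spline basis from the fitted model:

```r
library(splines)
x <- 1:24
mu <- 10 + 5 * sin(x * pi / 24) - 2 * cos((x - 6) * 4 / 24)
set.seed(2010)
y <- mu + rnorm(length(mu))
fit.ns <- lm(y ~ ns(x, knots = c(6, 12, 18)))

B <- model.matrix(fit.ns)  # intercept plus 4 spline basis columns
matplot(x, B[, -1], type = "l", ylab = "Basis value")
# Weighted versions: sweep(B, 2, coef(fit.ns), `*`)
```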

  16. Comment on the relative strengths and weaknesses of the different approaches considered (i.e., linear models, linear splines, polynomials, cubic regression splines).

4.4 Exercise 4

The code, below, reads in data containing estimates of the number of Moose in Minnesota between 2005 and 2020:

nmoose<-data.frame(year=2005:2020, 
                   estimate = c(8160, 8840, 6860, 7890, 7840, 5700, 4900, 4230, 2760, 4350, 3450, 4020, 3710, 3030, 4180, 3150))
  1. Fit a linear regression model to the data and evaluate whether the assumptions are reasonable. Make sure to comment on what you would expect to see in each plot if the assumptions of linear regression were met.

  2. Fit a model that allows for a non-linear trend in moose numbers over time, using linear or cubic regression splines. Plot the data with the fitted curve, similar to the picture below from the MN DNR report in which they analyze the trend in moose numbers over time.
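A sketch of one possibility, a natural cubic spline for year (the nmoose data frame is repeated here so the block runs on its own; the choice of 3 df is an assumption, not prescribed by the exercise):

```r
library(splines)

# Moose estimates, as defined above
nmoose <- data.frame(year = 2005:2020,
                     estimate = c(8160, 8840, 6860, 7890, 7840, 5700, 4900, 4230,
                                  2760, 4350, 3450, 4020, 3710, 3030, 4180, 3150))

fit.trend <- lm(estimate ~ ns(year, df = 3), data = nmoose)
plot(estimate ~ year, data = nmoose, ylab = "Estimated number of moose")
lines(nmoose$year, predict(fit.trend))  # smooth trend over time
```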

Figure 4.1: Trend in the estimated number of moose in MN, from: https://files.dnr.state.mn.us/wildlife/moose/moosesurvey.pdf

References

Cannon, A., Cobb, G., Hartlaub, B., Legler, J., Lock, R., Moore, T., … Witmer, J. (2019). Stat2Data: Datasets for Stat2. Retrieved from https://CRAN.R-project.org/package=Stat2Data
Lee, P. C., Bussière, L. F., Webber, C. E., Poole, J. H., & Moss, C. J. (2013). Enduring consequences of early experiences: 40 year effects on survival and success among African elephants (Loxodonta africana). Biology Letters, 9(2), 20130011.