Chapter 4 Non-linear regression

4.1 Exercise 1

The data for this exercise are contained in the Data4Ecologists package and can be accessed by loading the packaging and then typing data(clutch).

  1. Read in the data set and look at the first 6 records using the head function.

  2. Note that clutch size is contained in a variable named CLUTCH. You can access the variable names with the names function. Change the variable names so that they are all lower case using:

names(clutch)<-tolower(names(clutch))
  1. Create a scatterplot to explore how clutch size (clutch) varies by year and nest initiation date (date). Note: date is measured as the number of days since Jan 1, so it really reflects Julian day.

It will help to first tell R that year should be treated as a categorical variable. To do this, use:

clutch$year<-as.factor(clutch$year)

One option for creating the plot is to use ggplot in the ggplot2 library:

ggplot(clutch, aes(x = date, y = clutch, color = year)) + geom_point()

or

ggplot(clutch, aes(x = date, y = clutch)) + geom_point()+facet_wrap(~ year)
  1. There are several large outliers representing nests that were parasitized. Lets drop all observations with clutch sizes > 15 (e.g., using the subset function or dplyr::filter function [in dplyr package]). If you are unfamiliar with these functions, explore their help pages by typing ?subset or ?filter into the R console.

  2. Ignore the fact that most of the nest structures were observed for multiple years (i.e., the observations are not necessarily independent). Fit a linear model relating clutch to year and date. Evaluate whether the assumptions of linear regression are reasonably met.

  3. Fit a model using lm, allowing for a quadratic relationship between date and clutch. Evaluate whether the assumptions of linear regression are reasonably met. Note: the resid_panel function does not seem to work with models that include terms created using poly or ns. You can use the default plot function or the plot_model function in the sjPlot package with type = "diag". Or, try the check_model function in the performance package. Comment on whether you think the assumptions of linear regression are reasonably met.

  4. Now, fit a linear spline model, with a single knot at date = 150. You will need to create a new predictor that is 0 if date <=150 and = date - 150 otherwise. Again, evaluate whether the assumptions of linear regression are reasonably met.

  5. Create a plot showing the fit of the model, overlaid on the original data. You can get fitted values from the model for a reasonable range of predictor values using:

newdat<-data.frame(expand.grid(date=seq(90, 180, by = 1), year=c("1997", "1998", "1999")))
newdat$fitted<-predict(modelname, newdata = newdat)

One option for creating the plot is to use the plot function. Then, add the fitted lines using the lines function (once call to lines for each year).

Alternatively, you can use ggplot:

ggplot(newdat, aes(date, fitted, color = year)) + geom_line() + 
    geom_point(data = clutch, aes(date, clutch)) + xlab("Nest Initiation date") + 
     ylab("Fitted values")
  1. Fit a model with natural cubic regression splines and 3 degrees of freedom allocated to date. You will need to load the splines library and then use ns(date,3) as a predictor. Look at a residuals versus fitted values plot for these data. Create a plot showing the fit of the model, overlaid on the original data.

4.2 Exercise 2

Consider the ElephantsMF data set in the Stat2Data library (Cannon et al., 2019), which contains data collected by Lee, Bussière, Webber, Poole, & Moss (2013). The data set has 3 variables collected from 288 African elephants that lived through droughts in the first two years of their life.:

  • Age (in years)
  • Height (shoulder height in cm)
  • Sex F = female, M = Male
  1. Plot the data, showing the relationship between Height and Age for each sex.

  2. Fit a linear model to data: lm(Height ~ Age*Sex, data = ElephantsMF). Evaluate the assumptions of the model graphically. Be sure to comment on what you are looking for in each plot (e.g., the assumption you are looking to evaluate and what would constitute an assumption violation).

  3. Fit a model that allows the relationship between Height and Age to be non-linear and to also differ for males and females. Again, evaluate the assumptions of the model graphically.

  4. Plot the fit of your model from step 3 with the data overlaid.

4.3 Exercise 3

This exercise will explore how polynomials, simple linear splines, and restricted cubic regression splines can be used to model non-linear relationships between a predictor, X, and the mean response at a given value of X, E[Y|X].

This exercise is largely constructed from a tutorial written by Patrick Heagerty, a professor in the Biostatistics Department at University of Washington.

Linear model: We will purposely simulate data where a linear model will not be appropriate.

  1. Simulate a predictor variable that takes on values 1:24
x <- c(1:24)
  1. Simulate a response variable that is predicted by the variable, X, but in a highly non-linear way:
mu <- 10 + 5 * sin( x * pi / 24 )  - 2 * cos( (x-6)*4/24 )
set.seed(2010)
eps <- rnorm( length(mu) )
y <- mu + eps
  1. Plot the data showing the relationship between X and Y along with the least-squares regression line.

Linear Spline model: We can fit a model where the slope of the line relating Y to X changes at different locations, called knots.

  1. Create a set of basis vectors to allow fitting a linear spline model with changes in the slope at X=6,12, and 18. The first basis vector, (X6)+ can be created using:
x6 <-ifelse(x <6, 0, x-6)

This creates a predictor that can be included as a column in the design matrix. This predictor will be 0 when X6 and X6 when X>6. Create similar predictors to capture changes in slope for X>12 and X>18.

  1. Fit a model that includes these the 3 basis functions: Y=β0+β1Xi+β2(Xi6)++β3(Xi12)++β4(Xi18)++ϵi. Overlay the fitted mean on a scatterplot of the data. You can use the predict function with your fitted model object to calculate the fitted mean:

Y^i=E^[Yi|Xi]=β^0+β^1Xi+β^2(Xi6)++β^3(Xi12)++β^4(Xi18)+

  1. Plot the columns of the design matrix from the linear spline model versus X, i.e., plot:
  1. 1 versus X
  2. X versus X
  3. (X6)+ versus X
  4. (X12)+ versus X
  5. (X18)+ versus X
  1. Plot the weighted columns of the design matrix from the linear spline model versus X, i.e., plot:
  1. β^01 versus X
  2. β^1X versus X
  3. β^2(X6)+ versus X
  4. β^3(X12)+ versus X
  5. β^4(X18)+ versus X

Note, if we add these contributions (i.e., from a-e) for any given value of X, we get Y^i=E^[Yi|Xi]=β^0+β^1Xi+β^2(Xi6)++β^3(Xi12)++β^4(Xi18)+.

  1. What is the slope of the fitted curve when X=3? How about when X=9,14, and 19?

Polynomial model: One disadvantage of the linear spline model is that it is not as “smooth” as we might desire (e.g., it is not continuous at the knot locations, X=6,12,18). One way to create a model that is “smooth” is to use polynomial terms. This will result in a model that is continuous and its first derivative will also be continuous (so, there will be no quick changes in the direction of E[Y|X] as we increase or decrease X).

  1. Create a set of basis vectors, X, X2, and X3, to fit a cubic polynomial model: Y=β0+β1Xi+β2Xi2+β3Xi3+ϵi. Fit this model and overlay the fitted mean on a scatterplot of the data. You can get the values of the fitted mean using the predict function along with your fitted model object.

  2. Plot the columns of the design matrix versus X, i.e., plot:

  1. 1 versus X
  2. Xi versus X
  3. Xi2 versus X
  4. Xi3 versus X
  1. Plot the weighted columns of the design matrix versus X, i.e., plot:
  1. β^01 versus X
  2. β^1Xi versus X
  3. β^2Xi2 versus X
  4. β^3Xi4 versus X

Again, if we add these 4 contributions up for any given value of X, we get Y^i=E^[Yi|Xi]=β^0+β^1Xi+β^2Xi2+β^3Xi3.

Cubic regression splines with truncated basis functions: Next, we will consider a model with a set of cubic sline basis functions that we create ourselves.

  1. Create a set of cubic regression spline basis vectors, (X6)+3, (X12)+3, and (X18)+3 to go with the polynomial basis vectors from steps 9-11 (these can be formed by simply raising the basis vectors from step 4 to the third power). For example, we can create (X6)+3 using:
x6 <-ifelse(x <6, 0, x-6)
x6cubed <- x6^3

These additional basis vectors will allow our fitted cubic function to change after each knot location. In essence, we will end up with a different cubic fit within each segment: X<6,6X<12,12X<18,X18.

Fit the model with these additional terms:

Y=β0+β1Xi+β2Xi2+β3Xi3+β4(X6)+3+β5(X12)+3+β6(X18)+3+ϵi.

Overlay the fitted mean on a scatterplot of the data.

Natural cubic regression splines: Next, we fit a model with natural cubic regression spline basis functions. This will allow again for a piecewise cubic model within each segment formed by our interior knots but linear fits outside of the boundary knots. We can use the splines package to form our basis vectors. We will use the same set of interior knots (6, 12, 18) and set the boundary knots to the minimum and maximum values in the data set (X=1 and X=24, respectively).

  1. Load the splines package and fit the cubic regression spline model using: fit <- lm( y ~ ns(x, knots=c(6, 12, 18))).

  2. Use the predict function to overlay the fitted model on the data.

  3. Plot the basis vectors versus X or the weighted basis vectors versus X. You can access the basis vectors directly by using the model.matrix function to pull off the different columns of the design matrix.

  4. Comment on the relative strengths and weaknesses of the different approaches considered (i.e., linear models, linear splines, polynomials, cubic regression splines).

4.4 Exercise 4

The code, below, reads in data containing estimates of the number of Moose in Minnesota between 2005 and 2020:

nmoose<-data.frame(year=2005:2020, 
                   estimate = c(8160, 8840, 6860, 7890, 7840, 5700, 4900, 4230, 2760, 4350, 3450, 4020, 3710, 3030, 4180, 3150))
  1. Fit a linear regression model to the data and evaluate whether the assumptions are reasonable. Make sure to comment on what you would expect to see in each plot if the assumptions of linear regression were met.

  2. Fit a model that allows for a non-linear trend in moose numbers over time, using linear or cubic regression splines. Plot the data with the fitted curve, similar to the picture below from the MN DNR report in which they analyze the trend in moose numbers over time.

Trend in the estimated number of moose in MN, from: https://files.dnr.state.mn.us/wildlife/moose/moosesurvey.pdf

Figure 4.1: Trend in the estimated number of moose in MN, from: https://files.dnr.state.mn.us/wildlife/moose/moosesurvey.pdf

4.5 Exercise 5

This exercise largely comes from Catherine Polik who took FW8051 in the Spring of 2023, and will use data from one of her published papers (Polik, Elling, & Pearson, 2018). I am going to do my best to set the stage, but I would encourage you to read her paper to learn more!

For this exercise, we are going to model the relationship between sea-surface temperature (SST) anomalies and a nitrogen isotropic signature (δ15N) measured in two different types of sediment:

  1. Sapropels, which are dark-colored sediments, rich in organic matter.
  2. Marl sediment, which is carbon-poor and next to the Sapropels.

The temperature anomalies (our response variable) represent discrepancies between two different methods for reconstructing sea-surface temperature from molecular signatures in marine sediment cores (TEX86 and U37K). The δ15N measurements are lower in Sapropel sediments and likely capture N-cycling dynamics during periods of high productivity. The accelerated growth of ammonia-oxidizing organisms (in the Archaea domain) is thought to impact SST measurements reconstructed using the TEX86 method, and thus, likely impact the temperature anomalies.

The data set can be accessed from the Data4Ecologists package as follows:

library(Data4Ecologists)
library(ggplot2)
data(PolikSST)

The data set contains the following variables:

  • Site = Location where the core was taken during Leg 160 of the Ocean Drilling Program
  • Sapropel = Three sapropels sampled: S3 (81 ka), S5 (124 ka), and i-282 (2.943 Ma)
  • mbsf = Meters below seafloor, i.e. the depth of the sample
  • d15N.TN = The nitrogen isotopic value of the bulk sediment, in permil
  • TEX86.SST = SST reconsctructed using the TEX86 method, in C
  • Uk37.SST = SST reconsctructed using the Uk37 method, in C
  • Temp.Anomaly The difference between TEX86 and Uk37 SST reconstructions
  • Sediment = Whether the sample comes from inside the sapropel or from the surrounding, carbon-poor marl sediment

Before starting the exercise, filter out any observations that are missing values for either Temp.Anomaly or d15N.TN using the following code:

library(dplyr)
PolikSST <- PolikSST %>% 
  filter(!is.na(Temp.Anomaly)) %>%
  filter(!is.na(d15N.TN))
  1. Use ggplot to create a scatterplot of Temp.Anomaly (y-axis) versus d15N.TN (x-axis), mapping Sediment to different colors.

  2. Fit an appropriate linear model allowing the relationship between d15N.TN and Temp.Anomaly to depend on the Sediment type.

  3. If we were to look at residual plots or consider the independence assumption, we might be reluctant to use this model. For now, however, we will proceed as though the assumptions are reasonably met. Use the fitted model to answer the following questions:

    • By how much do we expect Temp.Anomaly to increase, on average, for every 1 unit increase in d15N.TN when Sediment = Sapropel?
    • By how much do we expect Temp.Anomaly to increase, on average, for every 1 unit increase in d15N.TN when Sediment = Marl?
  4. Instead of fitting models with separate slopes for the two sediment types, fit a model where Temp.Anomaly depends only on d15N.TN, but use a quadratic polynomial to model this relationship. Make sure to use raw polynomials (poly(x, 2, raw = TRUE)) (this will be important for part 5 of the problem). Plot the fitted model on top of a scatterplot of the data.

  5. We can use derivatives and calculus to estimate the value of d15N.TN associated with the maximum temperature anomaly. Our fitted model implies:

E[Y|X]=β0+β1X+β2X2

Let’s take a derivative of this experession with respect to X, set it equal to 0, and solve for X:

dE[Y|X]dX=β1+2β2X=0.

This implies that the maximum occurs at X=β12β2.

Plug in the estimated coefficients to estimate the value of d15N.TN associated with the maximum expected temperature anomaly.

  1. Instead of a quadratic model, fit a linear spline model with a single knot located at the value of X given in your answer to part 5. Plot the fitted model over a scatterplot of the data.

References

Cannon, A., Cobb, G., Hartlaub, B., Legler, J., Lock, R., Moore, T., … Witmer, J. (2019). Stat2Data: Datasets for Stat2. Retrieved from https://CRAN.R-project.org/package=Stat2Data
Lee, P. C., Bussière, L. F., Webber, C. E., Poole, J. H., & Moss, C. J. (2013). Enduring consequences of early experiences: 40 year effects on survival and success among african elephants (loxodonta africana). Biology Letters, 9(2), 20130011.
Polik, C. A., Elling, F. J., & Pearson, A. (2018). Impacts of paleoecology on the TEX86 sea surface temperature proxy in the pliocene-pleistocene mediterranean sea. Paleoceanography and Paleoclimatology, 33(12), 1472–1489.