Chapter 8 Modeling Strategies
8.1 Exercise 1
In this exercise, you will explore modeling strategies using a publicly available data set with at least 100 observations and 5 or more predictor variables.
Determine an appropriate response variable in the data set. Also, select a predictor variable whose causal effect on this response variable you would be interested in estimating. Describe how you would approach estimating this causal effect.
Describe how you might approach constructing a model for predictive purposes.
Demonstrate these approaches using your chosen data set.
Describe your methods and resulting models as would be appropriate for a journal submission or project report.
To find an appropriate data set, you could begin by exploring data sets available in different R packages here:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
To quickly filter through the available data sets, note that Rows should be at least 100 and Cols should be at least 6 (1 response plus 5 predictor variables).
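If you prefer to narrow the list programmatically, the short sketch below filters the Rdatasets index. It assumes the index CSV is still published at the URL shown and contains columns named Rows and Cols (check the website if the column names differ).

```r
# Sketch only: filter the Rdatasets index for data sets with at least
# 100 rows and 6 columns. The index URL and the "Rows"/"Cols" column
# names are assumptions based on the website above.
idx <- read.csv("https://vincentarelbundock.github.io/Rdatasets/datasets.csv")
candidates <- subset(idx, Rows >= 100 & Cols >= 6)
head(candidates)
```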
Instructor Notes: In the past, I have had all students work independently with the same data set as part of a mid-term project. After they submit their individual project reports, I form small groups of 4 to 5 students. The group members discuss and critique the individual reports and then put forth a final, group-based approach to the analysis.
8.2 Exercise 2
This example uses data from the Maryland Biological Stream Survey to explore different modeling strategies. To access the data, type:
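A placeholder loading sketch is shown below; it assumes the Maryland Biological Stream Survey counts have been saved locally as a CSV file named longnosedace.csv (a hypothetical file name), so adjust the path and name to match your own copy of the data.

```r
# Placeholder only: the file name and location are hypothetical.
dace <- read.csv("longnosedace.csv")
str(dace)  # confirm the variables listed below are present
```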
Variables in the data set include:
- stream: name of each sampled stream (i.e., the name associated with each case).
- longnosedace: number of longnose dace (Rhinichthys cataractae) per 75-meter section of stream.
- acreage: area (in acres) drained by the stream.
- do2: the dissolved oxygen (in mg/liter).
- maxdepth: the maximum depth (in cm) of the 75-meter segment of stream.
- no3: nitrate concentration (mg/liter).
- so4: sulfate concentration (mg/liter).
- temp: the water temperature on the sampling date (in degrees C).
Suppose we are interested in exploring the relationship between abundance of longnose dace and chemical and biological characteristics of the streams in which they were counted.
Inference using a full model: Harrell Jr (2015) suggests that, to avoid overfitting, the number of regression coefficients, \(p\), considered in any given data analysis should be constrained such that \(p \le n/10\) or \(p \le n/20\). In this case, we have 68 observations, which would suggest including no more than 3 to 6 predictors. We have 6 possible variables to consider, so we might choose to fit a single model that includes all of them and then judge variable importance using the estimated coefficients and their SEs. Avoiding model selection altogether will preserve the interpretation of p-values and confidence intervals, provided the assumptions of the regression model are met.⁴
- Begin by fitting a model containing all of the predictor variables in the data set (other than stream). Using the estimated coefficients and their SEs, comment on: a) whether the variables appear to be statistically significant (i.e., do we have evidence to suggest their coefficients are non-zero?); and b) their potential biological significance (are the estimated effects large or potentially large?). Assume for now that the assumptions of the model are met (some of the assumptions are likely problematic). One thing that may help with judging biological significance is to center and scale all of the predictor variables first, so that each regression coefficient reflects the change in abundance associated with a change of 1 SD in the predictor variable (a minimal code sketch follows).
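A minimal sketch of this step, assuming the data are stored in a data frame named dace (as in the placeholder loading code above) with the column names listed earlier:

```r
# Sketch only: fit the full model after centering and scaling the
# predictors. The data frame name `dace` is an assumption.
preds <- c("acreage", "do2", "maxdepth", "no3", "so4", "temp")
dace_std <- dace
dace_std[preds] <- scale(dace_std[preds])  # mean 0, SD 1 for each predictor

full_model <- lm(longnosedace ~ acreage + do2 + maxdepth + no3 + so4 + temp,
                 data = dace_std)
summary(full_model)  # coefficients, SEs, and t-tests for the full model
```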
Model selection: We might decide that it would be beneficial to simplify the model by dropping predictors that do not appear to explain much of the variability in the abundance data. A simpler model could make it easier to update the model in the future (by reducing the need to collect some of the predictor data) and may also improve our predictions by reducing some of the noise associated with variables of little importance.
- Try to find the “best” model using stepAIC in the MASS library (Venables & Ripley, 2002). This function performs what is called backwards stepwise selection using AIC (remember, smaller values of AIC indicate “better” fit). At each “step”, the current “best” model is compared (using AIC) to all models that have 1 fewer variable included (for the first step, the full model is considered “best”). If all of the simpler models have larger values of AIC than the current “best” model, then the algorithm stops. If not, then we update the current “best” model (i.e., pick the one with the smallest AIC). We then repeat the process (considering models with an additional variable deleted). We continue until we can no longer improve model fit (as judged by AIC) or we run out of explanatory variables to drop. Interpret the output from running the stepAIC function. What model was selected as “best”?
- Fit the model chosen using stepAIC and inspect the coefficients and their presumed statistical significance using the summary function (a minimal sketch follows this list).
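A minimal sketch, assuming full_model is the full linear model fit in the earlier sketch:

```r
# Sketch only: backwards stepwise selection using AIC.
library(MASS)

best_model <- stepAIC(full_model, direction = "backward")
summary(best_model)  # coefficients for the selected ("best") model
```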
The reported t-tests and confidence intervals assume that our model was pre-specified - i.e., chosen before collecting and analyzing the data. When we fit several possible models, select the best according to some criteria (e.g., AIC), and then act as though our model was pre-specified, we are ignoring the uncertainty associated with choosing the best model. This leads to what Breiman (1992) labeled the “quiet scandal” in the statistical community: data-driven model selection leading to overly optimistic results, namely, confidence intervals that are too narrow, p-values that are too small, and models unlikely to fare well when applied to new data.
Model averaging: We can, in some cases, ameliorate the problems associated with model selection by considering more than one model. Furthermore, averaging among many “good” models can often result in smaller prediction errors than using a single “best” model. One way to average among models is to use weights based on their relative AIC values. This method gained a lot of popularity in the wildlife literature following the publication of Burnham and Anderson’s book on model selection and multi-model inference (Anderson & Burnham, 2004). There are at least two R packages that make model averaging “easy” (though I would caution that it is not always appropriate, as we discussed in the Section on Causal Inference). We will explore methods for averaging among models using functions in the MuMIn package (Barton, 2020); similar functionality is provided by the AICcmodavg package (Mazerolle, 2020).
- Use the dredge function to fit all possible models formed by considering the different combinations of predictor variables (a total of 64 models in this case). Save the output to an object named modset and inspect the results by typing modset in your R script. Is there a clear-cut “best” model (as judged by AIC)?
- Use model.avg(modset, subset = delta < 4) to average coefficients across all models that are within 4 AIC units of the best model.
- Use confset.95p <- get.models(modset, cumsum(weight) <= .95) to find the set of models such that the combined weight of the models is \(\le\) 0.95.
- Use avgmod.95p <- model.avg(confset.95p) to average the coefficients across models in the “confidence set”, and summary(avgmod.95p) to explore model-averaged parameter estimates. The coefficient table “(with shrinkage)” averages in a 0 whenever a predictor is not included in a model. Rough model-averaged confidence intervals can be obtained by taking \(\hat{\beta} \pm z^*\mathrm{SE}\) (although the development and testing of methods for quantifying uncertainty associated with model-averaged quantities is still a very active area of statistical research). A code sketch combining these steps follows this list.
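The sketch below strings these steps together using MuMIn; it assumes full_model is the full linear model from the earlier sketch and refits it with na.action = "na.fail", which dredge requires.

```r
# Sketch only: all-subsets fitting and model averaging with MuMIn.
library(MuMIn)

# dredge() requires the global model to be fit with na.action = "na.fail"
full_model <- update(full_model, na.action = "na.fail")

modset <- dredge(full_model)  # all 64 candidate models
modset                        # note: dredge ranks models by AICc by default

# Average coefficients across models within 4 AIC units of the best model
model.avg(modset, subset = delta < 4)

# 95% "confidence set" of models and model-averaged estimates
confset.95p <- get.models(modset, cumsum(weight) <= .95)
avgmod.95p <- model.avg(confset.95p)
summary(avgmod.95p)
```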
Bootstrap Stability Analysis: If we were to go out and collect another data set with all the same variables and use stepAIC to choose the best model, we might end up with a very different model. One way to think about the problem is that we have uncertainty regarding the most appropriate model for inference and also uncertainty associated with the coefficients in our fitted models.
For the next few steps, we will use Frank Harrell’s rms package (Harrell Jr, 2021) to explore how the bootstrap can be used to examine model-selection uncertainty and to calculate more honest measures of model fit.
- Refit the full model using the ols function in the rms package.
- Use the validate function, both with and without the argument bw = TRUE, to estimate the degree of optimism resulting from evaluating the fit of the model using the same set of data as was used to fit the model. This function, when applied with bw = TRUE, will also allow you to evaluate how the choice of variables in a final model might vary from data set to data set (i.e., model stability). Comment on the results of this exploration. Is there a clear-cut “best” model? How much do our estimates of model fit change when using the bootstrap versus using the same data to estimate parameters and evaluate fit? A minimal code sketch follows this list.
- Comment on the different approaches explored in this exercise (using a full model for inference, stepwise model selection, model averaging, and the bootstrap stability analysis). What are the advantages and disadvantages associated with each approach when it comes to exploring relationships between abundance of longnose dace and the chemical and biological characteristics of the streams in which they were counted?
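A minimal sketch of the rms workflow, assuming the standardized data frame dace_std from the earlier sketch; validate() requires the model to be fit with x = TRUE and y = TRUE.

```r
# Sketch only: bootstrap validation and stability analysis with rms.
library(rms)

full_ols <- ols(longnosedace ~ acreage + do2 + maxdepth + no3 + so4 + temp,
                data = dace_std, x = TRUE, y = TRUE)

# Optimism-corrected indexes of model fit (e.g., R^2, slope shrinkage)
validate(full_ols, B = 200)

# Repeat with backwards stepdown at each bootstrap replicate to see how
# often each variable is retained (model stability)
validate(full_ols, B = 200, bw = TRUE)
```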
References
⁴ The family-wise error rate will still be \(> \alpha\) due to considering multiple predictors in the same model.