Chapter 2 Bootstrapping
2.1 Exercise 1
In Chapter 2, we considered a linear regression model relating species richness to exposure of the sampling site in the RIKZdat
data set:
lmfit <- lm(Richness ~ exposure, data=RIKZdat)
Although the model is easy to fit in R, we discussed how nearly all assumptions of linear regression model are violated with these data. Yet, we could still potentially be interested in describing the relationship between the two variables using a least-squares regression line. In this exercise, we will consider using a cluster-level bootstrap to calculate a confidence interval for the intercept and slope of the regression line and compare it to the confidence intervals created using the confint
function.
- Fit the least-squares regression line using
lm
. Calculate 95% confidence intervals for the intercept and slope parameters using theconfint
function. - Use a cluster-level bootstrap to fit the linear model to several bootstrap data sets (at least 5000).
- Plot the bootstrap distributions for the intercept and slope coefficients (e.g., using the
hist
ordensity
functions orggplot
equivalents). - Use the
quantile
function to determine 95% bootstrap confidence intervals for both of the regression coefficients (intercept and slope). Compare the bootstrap intervals to the intervals you get from usingconfint
combined withlm
. - Comment on the results separately for the intercept and slope. Which approach gives a wider interval (or, are the intervals similar)? Do the results match what you expected?
Make sure to annotate your code and turn in a reproducible report that contains your code and the output from running your code.
Hint: You can use a loop, making use of the code at the end of Chapter 2 to generate bootstrap data sets.
2.2 Exercise 2
Pick a data set and variable, \(x\), that is of interest to you. One option is consider the mean salary of “youtubers” using the youtubewages
data set in the Data4Ecologists
package.
Estimate the population mean and a 95% confidence interval for the population mean via \(\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}\)1.
Generate 5000 bootstrap sample means.
Estimate the standard error of the bootstrap distribution using the
sd
function. Compare this estimate to \(s/\sqrt{n}\).Plot the distribution of bootstrap means. Comment on its shape (e.g., is it symmetric)?
Estimate a 95% confidence interval for \(\mu\) using quantiles of the bootstrap distribution.
Comment on any differences between the Normal (or t-based) and bootstrap confidence intervals.
2.3 Exercise 3
The data for this exercise comes from a Minnesota DNR research project I worked on with Dave Rave and Mike Zicus (a UMN alum) (see Rave, Zicus, Fieberg, Savoy, & Regan, 2014). Mike Zicus conducted a study in 1981, where he collected common goldeneye and hooded merganser eggs throughout the state and measured eggshell thickness and egg-contaminant levels (e.g., DDT, PCBs, mercury). DDT was used (quite effectively) to control insect pests after World War I, but was banned in the early 1970s due to adverse effects on wildlife populations. In particular, DDT has been tied to declines of bald eagles, brown pelicans, and peregrine falcons in North America (if interested, you can learn more http://www.fws.gov/contaminants/info/ddt.html).
In 2003-2004, we decided to collect more eggs from these two species to evaluate whether eggshell thickness had improved since 1981 and also to evaluate whether mercury (Hg) concentrations had increased since 1981. When analyzing Hg concentrations, it is common to work with log-transformed data since the original measurements tend to be highly skewed and therefore, not Normally distributed. One statistic that could be used to measure skewness is the ratio \(\frac{\text{sample mean}}{\text{sample median}}\). I created a function in R to calculate this statistic, below (you will need to include this code at the top of your .Rmd or .R file):
skewstat <- function(data, variable_name) {
# Check if the variable exists in the dataset
if (!(variable_name %in% names(data))) {
stop("Variable not found in the dataset.")
}
# Extract the specified variable
variable <- data[[variable_name]]
# Calculate the mean and median
mean_value <- mean(variable)
median_value <- median(variable)
# Avoid division by zero
if (median_value == 0) {
stop("Median is zero. Cannot calculate the ratio.")
}
# Calculate the ratio of mean to median
ratio <- mean_value / median_value
return(ratio)
}
Let’s load the mercury data from the Data4Ecologists
package using data(hgdat)
and then use this function to calculate the skewness of the mercury measurements.
## [1] 1.379158
What values of this statistic (small, large, close to 1) might indicate that the population is skewed to the right?
Create a bootstrap distribution for this ratio by sampling the original data with replacement. Plot the bootstrap distribution.
Using the bootstrap distribution, calculate a 95% confidence interval for this ratio (assuming the cases we have are representative of the population).
What value of this statistic would you expect to see for a Normal distribution? Use your confidence interval in step 4 to make a conclusion for the following hypothesis test (using an \(\alpha = 0.05\)): \(H_0:\) the data are Normally distributed vs. \(H_A:\) the data are not normally distributed.
2.4 Exercise 4
For this exercise, we will examine data from an article published in Science:
Darimont, C. T., Fox, C. H., Bryan, H. M., & Reimchen, T. E. (2015). The unique ecology of human predators. Science, 349(6250), 858-860.
Darimont et al. (2015) compared patterns of predation by humans (i.e., hunters and fishers) with those of other predators that compete over shared prey. The predPom
data set in the Data4Ecologists
package contains estimates of human-caused and predator-caused mortality from 99 different studies and study systems. The data set has many variables (for a list, type ?predPom
); we will focus on the following ones:
Species
= species nameHumanPoM
= estimate of the mortality rate attributable to human causesCumPOM
= estimate of the mortality rate attributable to other predators
The authors report that, “the median proportion of mortality (an independent metric) caused by hunters (0.35) was 1.9 times that (0.19) caused by all other predators combined.
I wrote a function in R to calculate the ratio of median(\(y\)) to median(\(x\)), \(\hat{m}_r = \frac{\hat{m}_y}{\hat{m}_x}\), below.
Don’t worry about the code in the function, but note that we can replicate the ratio of medians = 1.89 as follows:
## [1] 1.886792
The authors do not provide a confidence interval to go with this estimate, and as far as I know, there is no formula that we can use to estimate the SE of \(\hat{m}_r\). However, we can use a bootstrap to get a confidence interval.
Use the
medratio
function that I provided to you to calculate a 95% confidence interval for the ratio of medians.Plot the bootstrap distribution and report a 95% confidence interval. Interpret the interval in the context of the problem
If mortality rates from humans and other predators were equal, we would expect \(m_r\) to be 1. Using an \(\alpha = 0.05\) and the confidence interval from question 2, consider the following hypothesis test:
\(H_0: m_r = \frac{m_y}{m_x} = 1\) vs. \(H_A: m_r = \frac{m_y}{m_x} \ne 1\)
State a conclusion regarding the test in the context of the problem.
References
Alternatively, you might consider a t-based inteval if your sample size is small.↩︎