Chapter 7 Causal Inference

7.1 Exercise 1

During the surge in COVID-19 cases associated with the Omicron variant, it was observed that many, if not most, of the patients hospitalized for COVID-19 were vaccinated. Vaccination rates were also significantly higher among older individuals. Draw a Directed Acyclic Graph (DAG) representing the causal relationships among age, vaccination status, and hospitalization rates. Explain why we need to account for age when quantifying the protective effect of vaccines on the risk of disease. You may also choose to demonstrate this issue using a simple simulation.
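
One possible way to set up the optional simulation is sketched below in R. All of the numbers (the share of older people, the vaccination probabilities, the baseline risks, and the assumed fourfold risk reduction from vaccination) are made-up values chosen for illustration, not estimates from real data.

```r
set.seed(1)
n <- 100000

# Hypothetical population: about half "old", half "young" (illustrative only)
old <- rbinom(n, 1, 0.5)

# Age -> vaccination: older people are assumed more likely to be vaccinated
vaccinated <- rbinom(n, 1, ifelse(old == 1, 0.95, 0.30))

# Age -> hospitalization and vaccination -> hospitalization:
# older people have a higher baseline risk, and vaccination is assumed to
# cut everyone's risk to one quarter of its baseline value
baseline_risk <- ifelse(old == 1, 0.20, 0.01)
hospitalized <- rbinom(n, 1, baseline_risk * ifelse(vaccinated == 1, 0.25, 1))

# Proportion hospitalized within a subset of the population
risk <- function(keep) mean(hospitalized[keep])

# Share of hospitalized patients who are vaccinated
share_vaccinated <- mean(vaccinated[hospitalized == 1])

# Risk ratio (vaccinated / unvaccinated), ignoring age
crude_rr <- risk(vaccinated == 1) / risk(vaccinated == 0)

# Risk ratios within each age group
rr_old <- risk(vaccinated == 1 & old == 1) / risk(vaccinated == 0 & old == 1)
rr_young <- risk(vaccinated == 1 & old == 0) / risk(vaccinated == 0 & old == 0)

round(c(share_vaccinated = share_vaccinated, crude_rr = crude_rr,
        rr_old = rr_old, rr_young = rr_young), 2)
```

Under these assumptions, most hospitalized patients are vaccinated and the crude risk ratio exceeds 1, even though the risk ratio within each age group stays close to the assumed value of 0.25.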

7.2 Exercise 2

Create a causal diagram (i.e., a DAG) to describe your study system or to explore a question of interest to you (see footnote 1). Make sure to include at least 3 or 4 state variables. Simulate data from your causal diagram and explore how estimates of regression coefficients change with the inclusion or exclusion of different variables. Explain the behavior of the regression models using concepts from this chapter (e.g., mediator variables, colliders, confounders).
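
As a rough template for the simulation step, the sketch below generates data from a small, entirely hypothetical DAG (z -> x, z -> y, x -> y, and x -> k <- y) and compares the coefficient for x across three regression models. The variable names and effect sizes are assumptions made for illustration; your own diagram will differ.

```r
set.seed(2)
n <- 10000

# Hypothetical DAG: z -> x, z -> y, x -> y, x -> k <- y
z <- rnorm(n)                       # confounder of x and y
x <- 1.0 * z + rnorm(n)             # exposure
y <- 2.0 * x + 1.5 * z + rnorm(n)   # outcome; the simulated effect of x is 2
k <- x + y + rnorm(n)               # collider (caused by both x and y)

# Extract the estimated coefficient for x from a fitted linear model
coef_x <- function(f) unname(coef(lm(f))["x"])

b_naive    <- coef_x(y ~ x)          # omits the confounder
b_adjusted <- coef_x(y ~ x + z)      # adjusts for the confounder
b_collider <- coef_x(y ~ x + z + k)  # also conditions on the collider

round(c(naive = b_naive, adjusted = b_adjusted, collider = b_collider), 2)
```

With these particular coefficients, omitting z should inflate the estimate (to roughly 2.75), adjusting for z should recover a value near 2, and adding the collider k should pull the estimate well below the simulated effect (to roughly 0.5).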

7.3 Exercise 3

Create DAGs that may help explain the following results from observational data:

  1. An observational study found that students who were tutored performed worse than students who were not tutored. This study was highlighted in the NY Times as suggesting that parental investment in student outcomes is “overrated.”

  2. Admission data from six U.C. Berkeley majors, from 1973, showed that men were admitted at a higher rate than women: 44% of men were admitted compared to 30% of women (Bickel, Hammel, and O’Connell, Science, 1975). Yet, these differences in admission rates were negligible after considering the departments that men and women applied to.

  3. This question is motivated by a Skew The Script lesson on linear regression, published under a CC BY-NC-SA license. Low-income students tend to have lower attendance rates and lower math test scores than their middle- and upper-income peers, causing many to consider whether increasing attendance might help close the achievement gap. In the past several years, superintendents have piloted large-scale (and sometimes quite expensive) initiatives to improve student attendance. These included:

  • Call programs for chronically absent students
  • Hiring attendance case managers and coordinators
  • Using Uber/Lyft for students with transportation issues

Yet, the results have not been as impressive as hoped. Create and use a DAG to explain why this may be the case.
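
If you would like to encode a DAG in R rather than draw it by hand, the sketch below shows one way to do so, assuming the dagitty package is installed. The diagram itself is only one hypothetical possibility (family income affecting both attendance and test scores), and the variable names are chosen for illustration; other diagrams may be equally or more plausible.

```r
library(dagitty)

# One hypothetical DAG: income affects both attendance and test scores,
# and attendance may also have its own effect on test scores
attendance_dag <- dagitty("dag {
  Income -> Attendance
  Income -> TestScore
  Attendance -> TestScore
}")

# Variables that would need to be adjusted for to estimate the total effect
# of attendance on test scores, given this particular diagram
adj_sets <- adjustmentSets(attendance_dag,
                           exposure = "Attendance",
                           outcome = "TestScore")
print(adj_sets)

# plot(graphLayout(attendance_dag)) draws the diagram with an automatic layout
```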

7.4 Exercise 4

For this exercise, we will consider a data set from a randomized experiment that evaluated the effect of server posture (standing vs. squatting) on the size of the tip left by restaurant customers. The study was conducted by a server, who flipped a coin to randomly determine whether they would stand or squat when they first visited a table and introduced themselves to their customers. All subsequent interactions were performed from a standing position. The server also recorded additional information that might influence the size of the tip.

The data set is contained in the experimentr package. The code below will load the data and format some of the variables so that they are easier to understand:

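library(experimentr)  # contains the lynn data set
library(dplyr)        # mutate(), rename(), select(), and the %>% pipe

data(lynn)

# Recode the 0/1 indicator variables as descriptive labels
tipping <- lynn %>%
  mutate(Shift = ifelse(daytime == 0, "Day", "Evening"),
         MaleFemale = ifelse(female == 1, "Female", "Male"),
         Posture = ifelse(crouch == 1, "Crouch", "Standing"),
         Payment = ifelse(paid_by_credit_card == 1, "Credit Card", "Cash")) %>%
  # Capitalize the remaining variable names and keep only the columns listed
  rename(Groupsize = groupsize, Tip = tip, Bill = bill) %>%
  select(Groupsize, Bill, Tip, Shift, MaleFemale, Posture, Payment)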

Now, the tipping data set contains the following variables:

  • Groupsize = number of customers dining at table
  • Bill = total bill amount in dollars
  • Tip = tip amount in dollars
  • Shift = whether the customer or group of customers dined during a Day or Evening shift
  • MaleFemale = whether the bill was paid for by a Male or Female customer (note, other gender identifications were unfortunately not included in the experiment)
  • Posture = whether the server's posture was Crouch or Standing
  • Payment = whether the bill was paid for using a Credit Card or Cash

  1. Create a DAG that represents possible connections between the different variables in the data set. Remember, these data come from a randomized experiment. That should influence the connections between Posture and the other explanatory variables in the data set.

  2. Use linear regression to estimate the direct effect of posture on the amount that customers tip.

  3. Use linear regression to estimate the total effect of posture on the amount that customers tip.
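
The distinction between total and direct effects can be illustrated with simulated stand-in data, as in the sketch below. These are not the lynn data: the variable names, the effect sizes, and the assumption that the bill acts as a mediator between posture and tip are all hypothetical, and whether that structure applies to the real experiment is something your own DAG should address.

```r
set.seed(4)
n <- 20000

# Simulated stand-in data (NOT the lynn data); all effect sizes are made up
crouch <- rbinom(n, 1, 0.5)                        # randomized by coin flip
bill   <- 20 + 2 * crouch + rnorm(n, sd = 5)       # hypothetical mediator
tip    <- 1 + 0.5 * crouch + 0.15 * bill + rnorm(n)

# Total effect: because crouch is randomized, no adjustment is needed
total_effect <- unname(coef(lm(tip ~ crouch))["crouch"])

# Direct effect: hold the (assumed) mediator fixed by including it in the model
direct_effect <- unname(coef(lm(tip ~ crouch + bill))["crouch"])

round(c(total = total_effect, direct = direct_effect), 2)
```

In this simulation the direct effect is 0.5 by construction and the total effect is 0.5 + 0.15 * 2 = 0.8, so the two regressions should return estimates near those values.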


  1. For examples, you might look to Scott Cunningham’s Causal Inference book, which states: “In a messy world, causal inference is what helps establish the causes and effects of the actions being studied—for example, the impact (or lack thereof) of increases in the minimum wage on employment, the effects of early childhood education on incarceration later in life, or the influence on economic growth of introducing malaria nets in developing regions.”