Problem Set 4

Author

YOUR NAME HERE

Published

October 14, 2022

# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(wooldridge) # for wooldridge data set
library(car) # for vif command
library(dagitty) # for working with DAGs
library(ggdag) # for drawing DAGs

Theory and Concepts

Question 1

In your own words, explain the fundamental problem of causal inference.

The fundamental problem of causal inference is that we never observe the counterfactuals for the observations that we actually see in the data. The causal effect of a treatment \((\delta_i)\) for individual \(i\) would be the difference between their outcome if they were treated \((Y^1_i)\) and their outcome if they were untreated \((Y^0_i)\), i.e. \(\delta_i=Y^1_i-Y^0_i\). Across many individuals, we could take the average outcome for each group (treated and untreated). However, in reality, for each individual in our data, we only ever see \(Y^1_i\) or \(Y^0_i\), never both. We only see a person’s outcome with treatment or without treatment, so we cannot simply take the difference.
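
As a quick illustration, here is a hypothetical simulation (invented numbers, not part of any assigned data) where we, as the simulators, know both potential outcomes, but the observed data only ever reveal one of them per person:

# hypothetical simulation: we know both potential outcomes here,
# but the "observed" data only ever contain one of them per person
set.seed(42)
sim <- tibble(
  y_0 = rnorm(1000, mean = 10, sd = 2),          # outcome if untreated
  y_1 = y_0 + 3,                                 # outcome if treated (true effect = 3)
  treated = rbinom(1000, size = 1, prob = 0.5),  # treatment status
  y_observed = ifelse(treated == 1, y_1, y_0)    # all we ever actually see
)
head(sim) # y_1 - y_0 is never computable from y_observed alone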

Question 2

In your own words, explain how properly conducting a randomized controlled trial helps to identify the causal effect of one variable on another.

Randomized controlled experiments are where a pool of subjects representative of a population are randomly assigned into a treatment group (or into one of a number of groups given different levels of a treatment) or into a control group. The treatment group(s) is(are) given the treatment(s), the control group is given nothing (perhaps a placebo, though this is not always necessary), and then the average results of the two groups are compared to measure the true average effect of treatment.

The key is that the assignment must be random, which controls for all factors that potentially determine the outcome (e.g. when measuring individual outcomes, their height, family background, income, race, age, etc). If subjects are randomly assigned, then knowing anything about the individual (e.g. age, height, etc) tells us nothing about whether or not they got the treatment(s). The only thing that separates a member of the treatment group(s) from the control group is whether or not they were assigned to treatment. This ensures that the average person in the treatment group(s) looks like the average person in the control group, and that we are truly comparing apples to apples, rather than apples to oranges.
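
To see the logic in miniature, here is another hypothetical simulation (invented numbers): because treatment is assigned at random, the two groups have nearly the same average characteristics, and a simple difference in group means recovers the true effect (set to 3 here):

# hypothetical simulation of a randomized controlled trial
set.seed(7)
n <- 1000
confounder <- rnorm(n)                      # stands in for age, income, etc.
treated <- rbinom(n, size = 1, prob = 0.5)  # assigned at random
outcome <- 5 + 3 * treated + 2 * confounder + rnorm(n)

# treated and control groups have (nearly) the same average confounder
mean(confounder[treated == 1]) - mean(confounder[treated == 0])

# so the difference in mean outcomes is close to the true effect of 3
mean(outcome[treated == 1]) - mean(outcome[treated == 0])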

Question 3

In your own words, describe what omitted variable bias means. What are the two conditions for a variable to bias OLS estimates if omitted?

All variables that might influence the dependent variable \((Y)\) that we do not measure and include in our regression are part of the error term \((u)\). If we omit a variable \((Z)\), it will cause a bias if and only if it meets both of the following conditions:

  1. The variable must be a determinant of our dependent variable, \(corr(Z,Y)\neq 0\), and thus would appear in the error term, \(u\).

  2. The variable must be correlated with one of our independent variables of interest in our regression, \(corr(Z,X) \neq 0\).

If both conditions are met, then if we did not include the omitted variable \(Z\), our estimate of the causal effect of \(X\) on \(Y\) would be biased, because our estimate \((\hat{\beta_1})\) would pick up some of the effect of \(Z\). If we include \(Z\) as another independent variable, then the \(\hat{\beta_1}\) on \(X\) will give us the unbiased effect of \(X\rightarrow Y\) alone, holding \(Z\) constant.

dagify(Y ~ Z + X,
       X ~ Z,
       exposure = "X",
       outcome = "Y") %>% 
  ggdag_status()+
  theme_dag_blank()+
  theme(legend.position = "none")
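
To make the two conditions concrete, here is a hypothetical simulation (all names and numbers invented) where \(Z\) is a determinant of \(Y\) and is also correlated with \(X\); omitting \(Z\) biases the estimate on \(X\), while including it does not:

# hypothetical simulation of omitted variable bias
set.seed(123)
n <- 1000
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)            # condition 2: corr(X, Z) != 0
y <- 1 + 2 * x + 3 * z + rnorm(n)  # condition 1: Z determines Y; true effect of x is 2

lm(y ~ x) %>% tidy()        # omitting z: the estimate on x is biased (well above 2)
lm(y ~ x + z) %>% tidy()    # including z: the estimate on x is close to 2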

Question 4

In your own words, describe what multicollinearity means. What is the cause, and what are the consequences of multicollinearity? How can we measure multicollinearity and its effects? What happens if multicollinearity is perfect?

Multicollinearity just means that two regressors (e.g. \(X_1\) and \(X_2\)) are correlated with each other. This fact does *not* bias the OLS estimates of these regressors (e.g. \(\hat{\beta_1}\) and \(\hat{\beta_2}\)). In fact, the reason \(X_2\) is included in the regression is that omitting it would cause omitted variable bias, since \(corr(X_1,X_2)\neq 0\) and \(corr(Y, X_2)\neq 0\). However, multicollinearity does increase the variance of these OLS estimators: it is hard to get a precise estimate of \(X_1\rightarrow Y\) because \(X_2\) also affects \(Y\), and \(X_1\) tends to take certain values (large or small) whenever \(X_2\) does, so the data contain few of the counterfactual combinations (e.g. a large \(X_1\) paired with a small \(X_2\)) that we would need to separate their effects.

The strength of multicollinearity is simply given by the value of the correlation coefficient between \(X_1\) and \(X_2\), \(r_{X_1,X_2}\). We can measure the effect of multicollinearity on the variance of a regressor (\(X_j\))’s coefficient (\(\hat{\beta_j}\)) with the Variance Inflation Factor:

\[VIF=\frac{1}{1-R^2_j}\]

where \(R^2_j\) is the \(R^2\) from an auxiliary regression of \(X_j\) on all of the other regressors.

Multicollinearity is perfect when the correlation between \(X_1\) and \(X_2\) is 1 or -1. This happens when one regressor (e.g. \(X_1\)) is an exact linear function of another regressor(s) (e.g. \(X_1=\frac{X_2}{100}\)). A regression cannot be run including both variables, as it creates a logical contradiction. In this example, \(\hat{\beta_1}\) would be the marginal effect on \(Y\) of changing \(X_1\) holding \(X_2\) constant -- but \(X_2\) would naturally change as it is a function of \(X_1\)!
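
As a quick sketch with invented data, R's lm() cannot estimate both coefficients under perfect multicollinearity and simply drops the redundant regressor (its coefficient is reported as NA):

# hypothetical example of perfect multicollinearity
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 / 100                  # x2 is an exact linear function of x1
y  <- 2 + 3 * x1 + rnorm(100)

coef(lm(y ~ x1 + x2))           # the coefficient on x2 is NA: R drops it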

Question 5

Explain how we use Directed Acyclic Graphs (DAGs) to depict a causal model: what are the two criteria that must hold for identifying a causal effect of \(X\) on \(Y\)? When should we control for a variable, and when should we not?

A Directed Acyclic Graph (DAG) describes a causal model based on making our assumptions about relationships between variables explicit, and in many cases, testable.

Variables are represented as nodes, and causal effects represented as arrows from one node to another (in the direction of the causal effect). We think about the causal effect of \(X \rightarrow Y\) in counterfactual terms: if \(X\) had been different, \(Y\) would have been different as a response.

When considering the causal effect of \(X \rightarrow Y\), we must consider all pathways from \(X\) to \(Y\) (that do not loop, or go through a variable twice), regardless of the direction of the arrows. The paths will be of two types:

  • Causal (front-door) pathways where arrows go from \(X\) into \(Y\) (including through other mediator variables)

  • Non-causal (back-door) pathways where an arrow leads into \(X\) (implying \(X\) is partially caused by that variable)

Adjusting or controlling for (in a multivariate regression, this means including the variable in the regression) a variable along a pathway closes that pathway.

Variables should be adjusted (controlled for) such that:

1. Back-door criterion: no backdoor pathway between \(X\) and \(Y\) remains open

2. Front-door criterion: no frontdoor pathway is closed

The one exception is a collider variable, where a variable along a pathway has arrows pointing into it from both directions. This automatically blocks a path (whether front door or back door). Controlling for a collider variable opens the pathway it is on.

See R Practice on DAGs for examples.
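
As a sketch of the collider logic (an invented example, drawn the same way as the other DAGs in this document): on the path \(X \rightarrow M \leftarrow Y\), \(M\) is a collider, so that path is already blocked and the adjustment set for \(X \rightarrow Y\) is empty, meaning we should not control for \(M\):

# hypothetical collider example: M is caused by both X and Y
collider_dag <- dagify(Y ~ X,
                       M ~ X + Y,
                       exposure = "X",
                       outcome = "Y")

# the adjustment set is empty: the backdoor criterion is already satisfied,
# and controlling for the collider M would open a non-causal path
collider_dag %>%
  ggdag_adjustment_set()+
  theme_dag()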

Theory Problems

For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”

Question 6

A pharmaceutical company is interested in estimating the impact of a new drug on cholesterol levels. They enroll 200 people in a clinical trial. People are randomly assigned into the treatment group or into the control group. Half of the people are given the new drug and half the people are given a sugar pill with no active ingredient. To examine the impact of dosage on reductions in cholesterol levels, the authors of the study regress the following model:

\[\text{cholesterol level}_i = \beta_0+\beta_1 \text{dosage level}_i + u_i\]

For people in the control group, dosage level\(_i=0\) and for people in the treatment group, dosage level\(_i\) measures milligrams of the active ingredient. In this case, the authors find a large, negative, statistically significant estimate of \(\hat{\beta_1}\). Is this an unbiased estimate of the impact of dosage on change in cholesterol level? Why or why not? Do you expect the estimate to overstate or understate the true relationship between dosage and cholesterol level?

Consider the 4th assumption about the error term, \(u_i\). Does knowing whether (or how much) a person was treated convey any information about other characteristics that affect cholesterol level (in \(u_i\))? Again, we are asking if \(\mathbb{E}[u|X]=0\) or \(cor(X, u)=0\).

In this case, the answer is clearly no; knowing whether or not someone received treatment tells us nothing else about the person that might affect their cholesterol levels (i.e. age, height, diet, weight, family history, etc., all in \(u_i)\) because treatment is randomly assigned.

In this case, because treatment is exogenous, \(\mathbb{E}[\hat{\beta_1}]=\beta_1\), \(\hat{\beta_1}\) is unbiased.

dagify(chol ~ treat + u,
       exposure = "treat",
       outcome = "chol") %>%
  ggdag_status()+
  theme_dag()+
  theme(legend.position="none")

Question 7

Data were collected from a random sample of 220 home sales from a community in 2017.

\[\widehat{Price}=119.2+0.485 \, BDR+23.4 \, Bath+0.156 \, Hsize+0.002 \, Lsize+0.090 \, Age\]

Variable Description
\(Price\) selling price (in $1,000s)
\(BDR\) number of bedrooms
\(Bath\) number of bathrooms
\(Hsize\) size of the house (in ft\(^2)\)
\(Lsize\) lot size (in ft\(^2)\)
\(Age\) age of the house (in years)

Part A

Suppose that a homeowner converts part of an existing living space in her house to a new bathroom. What is the expected increase in the value of the house?

From \(\hat{\beta_2}\) (the coefficient on Bath): converting existing space adds a bathroom without changing the size of the house, so the expected increase in value is \(23.4 \times \$1,000=\$23,400\).

Part B

Suppose a homeowner adds a new bathroom to her house, which also increases the size of the house by 100 square feet. What is the expected increase in the value of the house?

In this case, \(\Delta Bath=1\) and \(\Delta Hsize=100\). The resulting expected increase in price is \(23.4(1)+0.156(100)=39.0\), or $39,000.

Part C

Suppose the \(R^2\) of this regression is 0.727. Calculate the adjusted \(\bar{R}^2\).

There are \(n=220\) observations and \(k=5\) regressors, so:

\[ \begin{align*} \bar{R}^2&=1-\frac{n-1}{n-k-1}(1-R^2)\\ &=1-\frac{220-1}{220-5-1}(1-0.727)\\ &=1-\frac{219}{214}(0.273)\\ &\approx 0.721\\ \end{align*} \]
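
A quick check of this arithmetic in R:

# verify the adjusted R-squared calculation
n <- 220
k <- 5
r_sq <- 0.727
1 - ((n - 1) / (n - k - 1)) * (1 - r_sq)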

Part D

Suppose the following auxiliary regression for \(BDR\) has an \(R^2\) of 0.841.

\[\widehat{BDR}=\delta_0+\delta_1Bath+\delta_2Hsize+\delta_3Lsize+\delta_4Age\]

Calculate the Variance Inflation Factor for \(BDR\) and explain what it means.

\[ \begin{align*} VIF&=\frac{1}{1-R^2_j}\\ &=\frac{1}{1-0.841}\\ &=\frac{1}{0.159}\\ &=6.29\\ \end{align*} \]

The variance of \(\hat{\beta_1}\) (the coefficient on BDR) is inflated by a factor of 6.29 due to multicollinearity between BDR and the other \(X\)-variables.

Question 8

A researcher wants to investigate the effect of education on average hourly wages. Wage, education, and experience in the dataset have the following correlations:

             Wage     Education  Experience
Wage        1.0000
Education   0.4059    1.0000
Experience  0.1129   -0.2995     1.0000

She runs a simple regression first, and gets the results:

\[\widehat{\text{Wage}} = -0.9049 + 0.5414 \, Education\]

She runs another regression, and gets the results:

\[\widehat{\text{Experience}} = 35.4615 - 1.4681 \, Education\]

Part A

If the true marginal effect of experience on wages (holding education constant) is 0.0701, calculate the omitted variable bias in the first regression caused by omitting experience. Does the estimate of \(\hat{\beta_1}\) in the first regression overstate or understate the effect of education on wages?

We know that the estimate in the biased regression (first one above) is a function of:

\[\hat{\alpha_1}=\hat{\beta_1}+\hat{\beta_2}\hat{\delta_1}\]

Where:

  • \(\hat{\alpha_1}\): the coefficient on education in the biased regression (0.5414)

  • \(\hat{\beta_1}\): the true effect of education on wages (??)

  • \(\hat{\beta_2}\): the true effect of experience on wages (0.0701)

  • \(\hat{\delta_1}\): the effect of education on experience (from an auxiliary regression) (-1.4681)

\[ \begin{align*} OVB&=\hat{\beta_2} \hat{\delta_1}\\ &=(0.0701)(-1.4681)\\ &=-0.1029\\ \end{align*} \]

Since the bias is negative, it understates the effect of education (due to experience). Note there are other factors that can bias the effect of education, but at least through experience, the bias is negative since the two are negatively related in the data (see the second regression).

Part B

Knowing this, what would be the true effect of education on wages, holding experience constant?

We know the biased estimate for education, 0.5414. Plugging in both this and the bias, we get the “true” effect:

\[ \begin{align*} \alpha_1&=\beta_1+\beta_2\delta_1\\ 0.5414&=\beta_1-0.1029\\ 0.6443&=\beta_1\\ \end{align*} \]
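
A quick check of this arithmetic in R:

# verify the bias and the implied "true" effect of education
alpha_1 <- 0.5414         # biased estimate on education (first regression)
beta_2  <- 0.0701         # true effect of experience on wages
delta_1 <- -1.4681        # effect of education on experience (auxiliary regression)

bias <- beta_2 * delta_1  # omitted variable bias
bias
alpha_1 - bias            # true effect of education, holding experience constant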

Part C

The \(R^2\) for the second regression is 0.0897. If she were to run a better regression including both education and experience, how much would the variance of the coefficients on education and experience increase? Why?

Here we need to calculate the Variance Inflation Factor (VIF) by using the \(R^2\) from the auxiliary regression.

\[ \begin{align*} VIF &=\frac{1}{1-R^2_j}\\ &=\frac{1}{1-(0.0897)}\\ &=\frac{1}{0.9103}\\ &=1.0985\\ \end{align*} \]

The variance is inflated by only a factor of 1.0985 (an increase of about 9.85%) due to the fairly weak multicollinearity between education and experience.

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

Question 9

Download and read in heightwages.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
heights <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/heightwages.csv")
New names:
Rows: 12686 Columns: 22
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(22): ...1, male, white, black, hispanic, mompro2, poppro2, siblings, no...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

This data is a part of a larger dataset from the National Longitudinal Survey of Youth (NLSY) 1979 cohort: a nationally representative sample of 12,686 men and women aged 14-22 years old when they were first surveyed in 1979. They were subsequently interviewed every year through 1994 and then every other year afterwards. There are many included variables, but for now we will just focus on:

Variable Description
wage96 Adult hourly wages ($/hr) reported in 1996
height85 Adult height (inches) reported in 1985
height81 Adolescent height (inches) reported in 1981

We want to figure out what is the effect of height on wages (e.g. do taller people earn more on average than shorter people?)

Part A

Create a quick scatterplot between height85 (as \(X\)) and wage96 (as \(Y\)).

ggplot(data = heights)+
  aes(x = height85, y = wage96)+
  geom_jitter(color = "blue")+
  geom_smooth(method = "lm", color = "red")+
  labs(x = "Adult Height in 1985 (inches)",
       y = "Hourly Wage in 1996 ($)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size = 20)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 5973 rows containing non-finite values (stat_smooth).
Warning: Removed 5973 rows containing missing values (geom_point).

Part B

Regress wages on adult height. Write the equation of the estimated OLS regression. Interpret the coefficient on height85.

# write your code here! 
reg1 <- lm(wage96 ~ height85, data = heights)
summary(reg1)

Call:
lm(formula = wage96 ~ height85, data = heights)

Residuals:
    Min      1Q  Median      3Q     Max 
 -17.26   -7.21   -3.38    1.95 1520.48 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.97678    5.55895  -1.255 0.209503    
height85     0.31475    0.08239   3.820 0.000135 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.68 on 6711 degrees of freedom
  (5973 observations deleted due to missingness)
Multiple R-squared:  0.00217,   Adjusted R-squared:  0.002021 
F-statistic: 14.59 on 1 and 6711 DF,  p-value: 0.0001346

\[\widehat{\text{Wage}_i} = -6.98+0.31 \, \text{Height}_i\]

For every additional inch in height as an adult, we can expect someone’s wage to be $0.31 higher, on average.

Part C

How much would someone who is 5’10” (70 in) be predicted to earn per hour, according to the model?

For a 5’10’’ person (70 inches), they would earn:

\[ \begin{align*} \widehat{Wages}&=-6.98+0.31Height\\ &=-6.98+0.31(70)\\ &=-6.98+21.70\\ &=\$14.72\\ \end{align*} \]

# If you want to calculate it with R
reg1_tidy <- tidy(reg1)

# extract beta 0
beta_0 <- reg1_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

# extract beta 1
beta_1 <- reg1_tidy %>%
  filter(term == "height85") %>%
  pull(estimate)

beta_0 + beta_1*70 # some rounding error in my calculation above
[1] 15.05581

Part D

Would adolescent height cause an omitted variable bias if it were left out? Explain using both your intuition, and some statistical evidence with R.

heights %>%
  select(wage96, height81, height85) %>%
  cor(use = "pairwise.complete.obs")
             wage96   height81   height85
wage96   1.00000000 0.05387358 0.04658124
height81 0.05387358 1.00000000 0.93429066
height85 0.04658124 0.93429066 1.00000000

We see that adolescent height (height81) is only weakly correlated with wage96 (to be fair, so is adult height), but it is strongly correlated with adult height (height85). Since adolescent height plausibly affects wages and is strongly correlated with the regressor of interest, leaving it out could bias the estimate on height85.

Part E

Now add adolescent height to the regression, and write the new regression equation below, as before. Interpret the coefficient on height85.

reg2 <- lm(wage96 ~ height85 + height81, data = heights)
summary(reg2)

Call:
lm(formula = wage96 ~ height85 + height81, data = heights)

Residuals:
    Min      1Q  Median      3Q     Max 
 -17.86   -7.23   -3.40    1.99 1520.49 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -9.2483     5.7749  -1.601   0.1093  
height85     -0.1068     0.2425  -0.440   0.6598  
height81      0.4575     0.2459   1.860   0.0629 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.89 on 6591 degrees of freedom
  (6092 observations deleted due to missingness)
Multiple R-squared:  0.002685,  Adjusted R-squared:  0.002383 
F-statistic: 8.873 on 2 and 6591 DF,  p-value: 0.0001417

\[\widehat{\text{Wage}_i}=-9.25-0.11\, \text{Height85}_i+0.46\, \text{Height81}_i\]

For every additional inch in height as an adult, we can expect someone’s wages to be $0.11 lower, on average, holding their adolescent height constant.

Part F

How much would someone who is 5’10” in 1985 and 4’10” in 1981 be predicted to earn, according to the model?

For that 5’10” person (70 inches) in 1985 and 4’10” (58”) in 1981, they would earn:

\[ \begin{align*} \widehat{Wages}&=-9.25-0.11Height85+0.46Height81\\ &=-9.25-0.11(70)+0.46(58)\\ &=-9.25-7.70+26.68\\ &=\$9.73 \\ \end{align*} \]

# If you want to calculate it with R
reg2_tidy <- tidy(reg2)

# extract beta 0
multi_beta_0 <- reg2_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

# extract beta 1
multi_beta_1 <- reg2_tidy %>%
  filter(term == "height85") %>%
  pull(estimate)

# extract beta 2
multi_beta_2 <- reg2_tidy %>%
  filter(term == "height81") %>%
  pull(estimate)

multi_beta_0 + multi_beta_1*70 + multi_beta_2*58 # some rounding error in my calculation above
[1] 9.810844

Part G

What happened to the estimate on height85 and its standard error?

The estimate on height85 fell from 0.31 to -0.11, flipping sign, and is no longer statistically significant. Its standard error roughly tripled (from 0.08 to 0.24), likely due to multicollinearity between adult and adolescent height.

Part H

Is there multicollinearity between height85 and height81? Explore with a scatterplot. Hint: to avoid overplotting, use geom_jitter() instead of geom_point() to get a better view of the data.

ggplot(data = heights)+
  aes(x = height81, y = height85)+
  geom_jitter(color = "blue")+
  geom_smooth(method = "lm", color = "red")+
  labs(x = "Adult Height in 1985 (inches)",
       y = "Adolescent Height in 1981 (inches)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size = 20)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 2105 rows containing non-finite values (stat_smooth).
Warning: Removed 2105 rows containing missing values (geom_point).

Part I

Quantify how much multicollinearity affects the variance of the OLS estimates on both heights. Hint: You’ll need the vif command from the car package.

vif(reg2) # run vif 
height85 height81 
8.383416 8.383416 

The variance of each OLS estimate is inflated by a factor of 8.38 due to multicollinearity between the two height variables.

Part J

Reach the same number as in part I by running an auxiliary regression.

Hint: There’s some missing wage96 data that may give you a different answer, so take the data and filter(!is.na(wage96)) before running this regression — this will include only observations for wage96 that are not NA’s.

aux_reg <- heights %>%
  filter(!is.na(wage96)) %>% # use only for which we have wages
  lm(data = ., height85 ~ height81) # run regression

summary(aux_reg) # look for R-squared

Call:
lm(formula = height85 ~ height81, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.3510  -0.3994  -0.1574   0.6006  14.0198 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.448524   0.290169   11.88   <2e-16 ***
height81    0.951601   0.004313  220.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.416 on 6592 degrees of freedom
  (336 observations deleted due to missingness)
Multiple R-squared:  0.8807,    Adjusted R-squared:  0.8807 
F-statistic: 4.867e+04 on 1 and 6592 DF,  p-value: < 2.2e-16

We find that adult height is strongly predicted by adolescent height \((\hat{\beta_1}=0.95)\): each additional inch of adolescent height predicts about one more inch of adult height. The \(R^2\) on the auxiliary regression is 0.88.

\[ \begin{align*} VIF&=\frac{1}{1-R^2_1}\\ &=\frac{1}{1-(0.8807)}\\ &=\frac{1}{0.1193}\\ &=8.38\\ \end{align*} \]

# In R

# extract r.squared
aux_r_sq <- glance(aux_reg) %>%
  pull(r.squared)

# vif formula
1/(1-aux_r_sq)
[1] 8.383416

Note that if you fail to tell R to drop the observations with missing wage96, you will get a different \(R^2\), and hence a different VIF, than before. The auxiliary regression would then use additional observations that have data on height81 and height85 but not wage96, observations that were not used in the wage regression, and so a different \(n\) and \(R^2\).

Part K

Make a regression table from parts B and E using modelsummary.

modelsummary(models = list("Wage (1996)" = reg1,
                           "Wage (1996)" = reg2),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "height85" = "Adult Height (1985)",
                             "height81" = "Adolescent Height (1981)"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
                           Wage (1996)   Wage (1996)
Constant                   −6.98         −9.25
                           (5.56)        (5.77)
Adult Height (1985)         0.31***      −0.11
                           (0.08)        (0.24)
Adolescent Height (1981)                  0.46*
                                         (0.25)
n                           6713          6594
Adj. R²                     0.002         0.002
SER                        27.67         27.89

* p < 0.1, ** p < 0.05, *** p < 0.01

Question 10

Install and load the wooldridge package. This package contains datasets used in Jeffrey Wooldridge’s Introductory Econometrics: A Modern Approach (the textbook that I used in my econometrics classes years ago!).

We will use the bwght data from wooldridge, which comes from the 1988 National Health Interview Survey, used in J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79: 586-593.

Let’s just look at the following variables:

Variable Description
bwght Birth Weight (ounces)
cigs Cigarettes smoked per day while pregnant (1988)
motheduc Mother’s education (number of years)
cigprice Price of cigarette pack (1988)
faminc Family’s income in $1,000s (1988)

We want to explore how a mother smoking during pregnancy affects the baby’s birthweight (which may have strong effects on outcomes over the child’s life).

Just to be explicit about it, assign this as some dataframe (feel free to change the name), i.e.:

# run or edit this chunk
births <- bwght # feel free to rename whatever you want for the dataframe

Part A

Make a correlation table for our variables listed above.

Hint: select() these variables and then pipe this into cor(., use = "pairwise.complete.obs") to use only observations for which there are data on each variable (to avoid NA’s).

bwght %>%
  select(bwght, cigs, motheduc, cigprice, faminc) %>%
  cor(use = "pairwise.complete.obs")
               bwght        cigs    motheduc   cigprice      faminc
bwght     1.00000000 -0.15076180  0.06912704 0.04918790  0.10893684
cigs     -0.15076180  1.00000000 -0.21386510 0.00970419 -0.17304493
motheduc  0.06912704 -0.21386510  1.00000000 0.07086587  0.45592970
cigprice  0.04918790  0.00970419  0.07086587 1.00000000  0.09545581
faminc    0.10893684 -0.17304493  0.45592970 0.09545581  1.00000000

Part B

Consider the following causal model:

dagify(bwght ~ cigs + inc,
       cigs ~ price + educ + inc,
       inc ~ educ,
       exposure = "cigs",
       outcome = "bwght") %>%
  tidy_dagitty(seed = 256) %>%
  ggdag_status()+
  theme_dag_blank()+
  theme(legend.position = "none")

Note what we are hypothesizing:

  1. bwght is caused by cigs and inc
  2. cigs are caused by price, educ, and inc
  3. inc is caused by educ

See also how this is written into the notation in R to draw (plot) the DAG.

Create this model on dagitty.net. What does dagitty tell us are the testable implications of this causal model?

You can answer this using dagitty.net, and/or R.

See the middle box on the right on dagitty:

  1. \(bwght \perp price \, | \, cigs, inc\): birthweight is independent of price, controlling for cigarettes and income

  2. \(bwght \perp educ \, | \, cigs, inc\): birthweight is independent of education, controlling for cigarettes and income

  3. \(inc \perp price\): income is independent of price

  4. \(price \perp educ\): cigarette price is independent of education

# note you can do it in R with the dagitty package's command,
# impliedConditionalIndependencies()

# save the dag (the dagify() code from the question, before plotting)
dag <- dagify(bwght ~ cigs + inc,
       cigs ~ price + educ + inc,
       inc ~ educ,
       exposure = "cigs",
       outcome = "bwght")

# now pipe it into command
dag %>% impliedConditionalIndependencies()
bwgh _||_ educ | cigs, inc
bwgh _||_ pric | cigs, inc
educ _||_ pric
inc _||_ pric

Part C

Test each implication given to you by dagitty.

  • For independencies, e.g. \((x \perp y)\): run a regression of \(y\) on \(x\).
  • For conditional independencies, e.g. \((x \perp y | z, a)\): run a regression of \(y\) on \(x, z, a\).

For each, test against the null hypothesis that the relevant coefficient \((\beta_1) =0\) (i.e. \(x\) and \(y\) are indeed independent).

Does this causal model hold up well?

Implication 1

If we run a regression of bwght on cigprice, including cigs and faminc as controls, there should not be a statistically significant coefficient on cigprice (i.e. there is no relationship between cigprice and bwght holding cigs and faminc constant):

lm(bwght ~ cigprice +  cigs + faminc, data = bwght) %>% summary()

Call:
lm(formula = bwght ~ cigprice + cigs + faminc, data = bwght)

Residuals:
    Min      1Q  Median      3Q     Max 
-95.746 -11.516   0.784  13.175 149.956 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 106.02164    6.88671  15.395  < 2e-16 ***
cigprice      0.08499    0.05282   1.609   0.1078    
cigs         -0.46735    0.09156  -5.104 3.78e-07 ***
faminc        0.08811    0.02931   3.006   0.0027 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.05 on 1384 degrees of freedom
Multiple R-squared:  0.03162,   Adjusted R-squared:  0.02952 
F-statistic: 15.06 on 3 and 1384 DF,  p-value: 1.193e-09

The coefficient on cigprice is small and not statistically significant. This implication holds up well.

Implication 2

If we run a regression of bwght on motheduc, including cigs and faminc as controls, there should not be a statistically significant coefficient on motheduc (i.e. there is no relationship between motheduc and bwght holding cigs and faminc constant):

lm(bwght ~ motheduc + cigs + faminc, data = bwght) %>% summary()

Call:
lm(formula = bwght ~ motheduc + cigs + faminc, data = bwght)

Residuals:
    Min      1Q  Median      3Q     Max 
-96.064 -11.585   0.668  13.154 150.078 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 116.83485    3.13778  37.235  < 2e-16 ***
motheduc      0.01426    0.25799   0.055   0.9559    
cigs         -0.46335    0.09275  -4.996 6.61e-07 ***
faminc        0.09147    0.03246   2.818   0.0049 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.08 on 1383 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.02977,   Adjusted R-squared:  0.02767 
F-statistic: 14.15 on 3 and 1383 DF,  p-value: 4.385e-09

The coefficient on motheduc is nearly zero and not statistically significant, so this implication also holds up well.

Implication 3:

The model implies simply that there is no significant correlation between faminc and cigprice:

bwght %>%
  select(faminc, cigprice) %>%
  cor()
             faminc   cigprice
faminc   1.00000000 0.09545581
cigprice 0.09545581 1.00000000

There is a fairly weak correlation. This implication mostly holds up.

Implication 4:

The model implies simply that there is no significant correlation between cigprice and motheduc:

bwght %>%
  select(cigprice, motheduc) %>%
  cor(use = "pairwise.complete.obs")
           cigprice   motheduc
cigprice 1.00000000 0.07086587
motheduc 0.07086587 1.00000000

This is an even weaker correlation. This implication seems to hold up.
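
For a more formal check (a sketch following the question's own suggestion to test independencies with a regression; output not shown here), we could regress one variable on the other and test whether the slope is statistically distinguishable from zero:

# sketch: formal tests of implications 3 and 4 (run these to see the output)
lm(faminc ~ cigprice, data = bwght) %>% tidy()    # implication 3: faminc and cigprice
lm(motheduc ~ cigprice, data = bwght) %>% tidy()  # implication 4: motheduc and cigprice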

Part D

List all of the possible pathways from cigs to bwght. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?

You can answer this using dagitty.net, and/or R.

  1. \(cigs \rightarrow bwght\) (causal, front door)
  2. \(cigs \leftarrow faminc \rightarrow bwght\) (non-causal, back door)
  3. \(cigs \leftarrow motheduc \rightarrow faminc \rightarrow bwght\) (non-causal, back door)

There are no colliders on any path.

# using ggdag_paths()
dag %>%
  ggdag_paths()+
  theme_dag()

Part E

What is the minimal sufficient set of variables we need to control for in order to causally identify the effect of cigs on bwght?

You can answer this using dagitty.net, and/or R.

We simply need to control for faminc. This blocks the back door for both path 2 and path 3.

# we can also get this from R in ggdag
# using ggdag_adjustment_set()

dag %>%
  ggdag_adjustment_set(shadow = T)+
  theme_dag()

Part F

Estimate the causal effect by running the appropriate regression in R.

FYI, on dagitty.net, you can change a variable on the diagram to be “adjusted” (controlled for) by clicking it and then hitting the A key.

We need to control only for faminc, so we put it into the regression to estimate:

\[bwght_i=\beta_0+\beta_1 \, cigs_i+\beta_2 \,faminc_i\]

lm(bwght ~ cigs + faminc, data = bwght) %>% summary()

Call:
lm(formula = bwght ~ cigs + faminc, data = bwght)

Residuals:
    Min      1Q  Median      3Q     Max 
-96.061 -11.543   0.638  13.126 150.083 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 116.97413    1.04898 111.512  < 2e-16 ***
cigs         -0.46341    0.09158  -5.060 4.75e-07 ***
faminc        0.09276    0.02919   3.178  0.00151 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.06 on 1385 degrees of freedom
Multiple R-squared:  0.0298,    Adjusted R-squared:  0.0284 
F-statistic: 21.27 on 2 and 1385 DF,  p-value: 7.942e-10

Controlling for income, each additional cigarette smoked per day while pregnant causes birthweight to decrease by about 0.46 ounces, on average.

Part G

We saw some correlation between faminc and cigprice. Perhaps there are unobserved factors (such as the economy’s performance) that affect both. Add an unobserved factor u1 to your dagitty model.

FYI, on dagitty.net, you can make a variable be “unobserved” by double-clicking it and then hitting the U key.
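
If you also want to sketch this in R with ggdag (rather than only on dagitty.net), dagify() can mark u1 as a latent (unobserved) node; this is just an illustration of the syntax, assuming u1 affects both cigarette price and family income:

# sketch: add an unobserved factor u1 that affects both cigprice and faminc
dag_u1 <- dagify(bwght ~ cigs + inc,
                 cigs ~ price + educ + inc,
                 inc ~ educ + u1,
                 price ~ u1,
                 latent = "u1",        # mark u1 as unobserved
                 exposure = "cigs",
                 outcome = "bwght")

dag_u1 %>%
  ggdag_status()+
  theme_dag_blank()+
  theme(legend.position = "none")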

Part H

Perhaps our model is poorly specified. Maybe motheduc actually has a causal effect on bwght? Tweak your model on dagitty.net to add this potential relationship. What testable implications does this new model imply?

See the middle box on the right on dagitty:

  1. \(bwght \perp price \, | \, cigs, inc, educ\): birthweight is independent of price, controlling for cigarettes, income, and education

  2. \(price \perp educ\): cigarette price is independent of education

  3. \(inc \perp price\): income is independent of cigarette price

# we can also do this in R

# update the DAG

dag2 <- dagify(bwght ~ cigs + inc + educ, #<< add educ
       cigs ~ price + educ + inc,
       inc ~ educ,
       exposure = "cigs",
       outcome = "bwght")

# now pipe it into command
dag2 %>% impliedConditionalIndependencies()
bwgh _||_ pric | cigs, educ, inc
educ _||_ pric
inc _||_ pric

Part I

Test these implications appropriately in R, like you did in Part C. Does this model hold up well?

Implication 1:

If we run a regression of bwght on cigprice, including cigs, faminc, and motheduc as controls, there should not be a statistically significant coefficient on cigprice (i.e. there is no relationship between cigprice and bwght holding cigs, faminc, and motheduc constant):

lm(bwght ~ cigprice + cigs + faminc + motheduc, data = bwght) %>% summary()

Call:
lm(formula = bwght ~ cigprice + cigs + faminc + motheduc, data = bwght)

Residuals:
    Min      1Q  Median      3Q     Max 
-95.760 -11.519   0.803  13.175 149.954 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.061e+02  7.407e+00  14.321  < 2e-16 ***
cigprice     8.478e-02  5.288e-02   1.603  0.10912    
cigs        -4.681e-01  9.274e-02  -5.047 5.08e-07 ***
faminc       8.765e-02  3.253e-02   2.694  0.00714 ** 
motheduc    -4.448e-04  2.580e-01  -0.002  0.99862    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.06 on 1382 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.03158,   Adjusted R-squared:  0.02877 
F-statistic: 11.26 on 4 and 1382 DF,  p-value: 5.368e-09

The coefficient on cigprice is small and not statistically significant. This implication holds up well.

Implication 2:

This is the same as implication 3 from Part C. Again, this holds up reasonably well.

Implication 3:

This is the same as implication 4 from Part C. Again, this holds up reasonably well.

Part J

Under this new causal model, list all of the possible pathways from cigs to bwght. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?

You can answer this using dagitty.net, and/or R.

  1. \(cigs \rightarrow bwght\) (causal, front door)
  2. \(cigs \leftarrow faminc \rightarrow bwght\) (non-causal, back door)
  3. \(cigs \leftarrow cigprice \leftarrow u1 \rightarrow faminc \rightarrow bwght\) (non-causal, back door)
  4. \(cigs \leftarrow motheduc \rightarrow bwght\) (non-causal, back door)
  5. \(cigs \leftarrow motheduc \rightarrow faminc \rightarrow bwght\) (non-causal, back door)

There are no colliders on any of these paths.

# we can also get this from R in ggdag
# using ggdag_paths()

dag2 %>%
  ggdag_paths()+
  theme_dag()

Part K

Under this new causal model, what is the minimal sufficient set of variables we need to control in order to causally identify the effect of cigs on bwght?

You can answer this using dagitty.net, and/or R.

We need to control for faminc and motheduc. This blocks the back door for paths 2, 3, 4, and 5.

# we can also get this from R in ggdag
# using ggdag_adjustment_set()

dag2 %>%
  ggdag_adjustment_set(shadow = T)+
  theme_dag()

Part L

Estimate the causal effect in this new model by running the appropriate regression in R. Compare your answers to those in part F.

We need to control for faminc and motheduc, so we put them into the regression to estimate:

\[bwght_i=\beta_0+\beta_1 \, cigs_i+\beta_2 \,faminc_i+ \beta_3 \, motheduc_i\]

lm(bwght ~ cigs + faminc + motheduc, data = bwght) %>% summary()

Call:
lm(formula = bwght ~ cigs + faminc + motheduc, data = bwght)

Residuals:
    Min      1Q  Median      3Q     Max 
-96.064 -11.585   0.668  13.154 150.078 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 116.83485    3.13778  37.235  < 2e-16 ***
cigs         -0.46335    0.09275  -4.996 6.61e-07 ***
faminc        0.09147    0.03246   2.818   0.0049 ** 
motheduc      0.01426    0.25799   0.055   0.9559    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.08 on 1383 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.02977,   Adjusted R-squared:  0.02767 
F-statistic: 14.15 on 3 and 1383 DF,  p-value: 4.385e-09

Controlling for income and education, each additional cigarette smoked per day while pregnant causes birthweight to decrease by about 0.46 ounces, on average. It turns out there was no noticeable difference when we included education!

Part M

Try out drawing this model using the ggdag package in R. See my DAG in question 3 for an example.

dagify(bwght ~ cigs + inc + educ,
       cigs ~ price + educ + inc,
       inc ~ educ + u1,
       price ~ u1,
       exposure = "cigs",
       outcome = "bwght") %>% 
  ggdag_status()+
  theme_dag_blank()+
  theme(legend.position = "none")