```
# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(wooldridge) # for wooldridge data set
library(car) # for vif command
library(dagitty) # for working with DAGs
library(ggdag) # for drawing DAGs
```

# Problem Set 4

# Theory and Concepts

## Question 1

In your own words, explain the fundamental problem of causal inference.

The fundamental problem of causal inference is that we never know the counterfactuals for the observations that we actually see in the data. A causal effect of a treatment \((\delta_i)\) for individual \(i\) would be the difference between their outcomes if they were treated \((Y^1_i)\) and if they were not treated \((Y^0_i)\), i.e. \(\delta_i=Y^1_i-Y^0_i\). Across many individuals, we could take the average for each group (treated and untreated group). However, in reality, for each individual in our data, we only ever see \(Y^1_i\) *or* \(Y^0_i\), *never both*. We only see a person’s outcome with treatment, or without treatment, so we cannot simply take the difference.
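To make the missing-counterfactual point concrete, here is a minimal sketch in base R. All numbers are made up for illustration: the potential outcomes `y0` and `y1`, the fixed effect of 3, and the alternating treatment pattern are assumptions, not part of the problem set.

```r
set.seed(1)
n <- 8

# Hypothetical potential outcomes (never jointly observable in real data)
y0 <- round(rnorm(n, mean = 10, sd = 2), 1)  # outcome if untreated
y1 <- y0 + 3                                 # outcome if treated (effect fixed at 3 by assumption)
treated <- rep(c(1, 0), times = n / 2)       # the state each person actually ends up in

# What the researcher sees: exactly one potential outcome per person
observed <- data.frame(
  id      = 1:n,
  treated = treated,
  y1_seen = ifelse(treated == 1, y1, NA),
  y0_seen = ifelse(treated == 1, NA, y0)
)

# The individual effect needs both columns, so it is missing for everyone
observed$delta <- observed$y1_seen - observed$y0_seen
print(observed)
```

Every row has an `NA` in one of the two outcome columns, so `delta` cannot be computed for any individual even though the simulation "knows" it is 3.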

## Question 2

In your own words, explain how properly conducting a randomized controlled trial helps to identify the causal effect of one variable on another.

Randomized controlled experiments are where a pool of subjects representative of a population are randomly assigned into a treatment group (or into one of a number of groups given different levels of a treatment) or into a control group. The treatment group(s) is(are) given the treatment(s), the control group is given nothing (perhaps a placebo, though this is not always necessary), and then the average results of the two groups are compared to measure the true average effect of treatment.

The key is that the assignment must be random, which controls for all factors that potentially determine the outcome (e.g. when measuring individual outcomes, their height, family background, income, race, age, etc). If subjects are randomly assigned, then knowing anything about the individual (e.g. age, height, etc) tells us nothing about whether or not they got the treatment(s). The only thing that separates a member of the treatment group(s) from the control group is whether or not they were assigned to treatment. This ensures that the average person in the treatment group(s) looks like the average person in the control group, and that we are truly comparing apples to apples, rather than apples to oranges.
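The balancing argument can be checked with a small simulation. This is only a sketch under stated assumptions: `age` is a hypothetical covariate, the true treatment effect is set to 2, and the sample size is arbitrary.

```r
set.seed(42)
n <- 10000

age   <- rnorm(n, mean = 40, sd = 10)          # hypothetical covariate that also affects y
treat <- rbinom(n, size = 1, prob = 0.5)       # random assignment, independent of age
y     <- 5 + 0.2 * age + 2 * treat + rnorm(n)  # true treatment effect set to 2 by assumption

# Balance check: average age should be nearly identical across groups
balance_gap <- mean(age[treat == 1]) - mean(age[treat == 0])

# Simple difference in group means as the estimate of the average treatment effect
effect_hat <- mean(y[treat == 1]) - mean(y[treat == 0])

c(balance_gap = balance_gap, effect_hat = effect_hat)
```

Because assignment ignores `age`, the two groups have almost the same average age, and the raw difference in means lands close to the assumed effect without controlling for anything.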

## Question 3

In your own words, describe what omitted variable bias means. What are the two conditions for a variable to bias OLS estimates if omitted?

All variables that might influence the dependent variable (\(Y\)) that we do not measure and include in our regression are a part of the error term (\(u\)). If we omit a variable (\(Z\)), it will cause a bias if and only if it meets both of the following conditions:

1. The variable must be a determinant of our dependent variable, \(corr(Z,Y)\neq 0\), and thus would appear in the error term, \(u\).

2. The variable must be correlated with one of our independent variables of interest in our regression, \(corr(Z,X) \neq 0\).

If both conditions are met, then if we did not include the omitted variable \(Z\), our estimate of the causal effect of \(X\) on \(Y\) would be biased, because our estimate \((\hat{\beta_1})\) would pick up some of the effect of \(Z\). If we include \(Z\) as another independent variable, then the \(\hat{\beta_1}\) on \(X\) will tell us the precise effect of *only* \(X\rightarrow Y\), holding \(Z\) constant.

```
# Z causes both X and Y, so omitting Z leaves the back door X <- Z -> Y open
dagify(Y ~ Z + X,
       X ~ Z,
       exposure = "X",
       outcome = "Y") %>%
  ggdag_status()+
  theme_dag_blank()+
  theme(legend.position = "none")
```
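The two conditions can be seen at work in a short simulation. The sketch below uses invented coefficients (a true effect of 2 for `x` and 1.5 for `z`) and checks the standard in-sample identity that the coefficient from the regression omitting `z` equals the coefficient from the full regression plus the effect of `z` times the slope from regressing `z` on `x`.

```r
set.seed(123)
n <- 5000

z <- rnorm(n)
x <- 0.6 * z + rnorm(n)              # condition 2: z is correlated with x
y <- 1 + 2 * x + 1.5 * z + rnorm(n)  # condition 1: z affects y (true effect of x set to 2)

short <- lm(y ~ x)      # omits z
long  <- lm(y ~ x + z)  # includes z
aux   <- lm(z ~ x)      # auxiliary regression of the omitted variable on x

b_short <- coef(short)[["x"]]
b_long  <- coef(long)[["x"]]
b_z     <- coef(long)[["z"]]
d_x     <- coef(aux)[["x"]]

# The omitted variable bias is exactly b_z * d_x in this sample
c(short = b_short, long = b_long, bias = b_z * d_x)
all.equal(b_short, b_long + b_z * d_x)
```

Here both `b_z` and `d_x` are positive, so the regression that omits `z` overstates the effect of `x`.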

## Question 4

In your own words, describe what multicollinearity means. What is the cause, and what are the consequences of multicollinearity? How can we measure multicollinearity and its effects? What happens if multicollinearity is *perfect*?

Multicollinearity just means that two regressors (e.g. \(X_1\) and \(X_2\)) are correlated with each other. This fact does *not* bias the OLS estimates of these regressors (e.g. \(\hat{\beta_1}\) and \(\hat{\beta_2}\)). In fact, the reason \(X_2\) is included in the regression is because omitting it would cause omitted variable bias, since \(corr(X_1,X_2)\neq 0\) and \(corr(Y, X_2)\neq 0\). However, the variance of these OLS estimators is increased: it is hard to get a precise measure of \(X_1\rightarrow Y\) because \(X_2\rightarrow Y\) also, and \(X_1\) tends to take certain values (large or small) when \(X_2\) takes certain values (large or small). We therefore rarely observe the counterfactual cases, e.g. what \(Y\) would be if \(X_1\) were the *opposite* of what it tends to be when \(X_2\) is large or small.

The strength of multicollinearity is simply given by the value of the correlation coefficient between \(X_1\) and \(X_2\), \(r_{X_1,X_2}\). We can measure the *effect* of multicollinearity on the variance of a regressor (\(X_j\))’s coefficient (\(\hat{\beta_j}\)) with the **Variance Inflation Factor**:

\[VIF=\frac{1}{1-R^2_j}\]

where \(R^2_j\) is the \(R^2\) from an auxiliary regression of \(X_j\) on all of the other regressors.
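As a rough illustration of the formula, the VIF can be computed by hand from the auxiliary regression. The data below are simulated and the variable names `x1`, `x2`, `x3` are placeholders, not variables from the problem set.

```r
set.seed(7)
n <- 1000

x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # built to be strongly correlated with x1
x3 <- rnorm(n)                       # unrelated regressor

# Auxiliary regression of x1 on all of the other regressors
aux_r2 <- summary(lm(x1 ~ x2 + x3))$r.squared

# Variance inflation factor for the coefficient on x1
vif_x1 <- 1 / (1 - aux_r2)
c(aux_r2 = aux_r2, vif_x1 = vif_x1)
```

With the `car` package loaded, `vif()` applied to a fitted model containing these three regressors should report the same figure for `x1`.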

Multicollinearity is *perfect* when the correlation between \(X_1\) and \(X_2\) is 1 or -1. This happens when one regressor (e.g. \(X_1\)) is an exact linear function of another regressor(s) (e.g. \(X_1=\frac{X_2}{100}\)). A regression cannot be run including both variables, as it creates a logical contradiction. In this example, \(\hat{\beta_1}\) would be the marginal effect on \(Y\) of changing \(X_1\) holding \(X_2\) constant -- but \(X_2\) would naturally change as it is a function of \(X_1\)!
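In practice, `lm()` in R does not stop with an error in this situation; it drops one of the redundant regressors and reports `NA` for its coefficient. A minimal sketch with simulated data, mirroring the \(X_1=\frac{X_2}{100}\) example (the outcome equation is an assumption made up for illustration):

```r
set.seed(99)
n <- 200

x2 <- runif(n, min = 0, max = 1000)
x1 <- x2 / 100                    # exact linear function of x2
y  <- 3 + 0.5 * x2 + rnorm(n)     # hypothetical outcome

cor(x1, x2)                       # correlation is 1 (up to rounding)

fit <- lm(y ~ x2 + x1)
coef(fit)                         # the coefficient on x1 is NA: it cannot be estimated separately
```

The regressor listed later in the formula is the one that gets dropped, which is a reminder that the data alone cannot say which of the two variables "deserves" the effect.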

## Question 5

Explain how we use Directed Acyclic Graphs (DAGs) to depict a causal model: what are the two criteria that must hold for identifying a causal effect of \(X\) on \(Y\)? When should we control a variable, and when should we *not* control for a variable?

A Directed Acyclic Graph (DAG) describes a causal model based on making our assumptions about relationships between variables explicit, and in many cases, testable.

Variables are represented as nodes, and causal effects represented as arrows from one node to another (in the direction of the causal effect). We think about the causal effect of \(X \rightarrow Y\) in *counterfactual* terms: if \(X\) had been different, \(Y\) would have been different as a response.

When considering the causal effect of \(X \rightarrow Y\), we must consider all pathways from \(X\) to \(Y\) (that do not loop, or go through a variable twice), regardless of the direction of the arrows. The paths will be of two types:

1. **Causal (front-door) pathways** where arrows go from \(X\) into \(Y\) (including through other **mediator** variables)

2. **Non-causal (back-door) pathways** where an arrow leads into \(X\) (implying \(X\) is partially caused by that variable)

Adjusting or controlling for (in a multivariate regression, this means including the variable in the regression) a variable along a pathway closes that pathway.

Variables should be adjusted (controlled for) such that:

1. **Back-door criterion**: no backdoor pathway between \(X\) and \(Y\) remains open

2. **Front-door criterion**: no frontdoor pathway is closed

The one exception is a **collider** variable, where a variable along a pathway has arrows pointing into it from both directions. This *automatically blocks a path* (whether front door or back door). Controlling for a collider variable *opens* the pathway it is on.
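The collider rule is easy to check by simulation. In this sketch `x` and `y` are generated independently, so the true effect of `x` on `y` is zero by construction, and `coll` is a hypothetical collider caused by both.

```r
set.seed(2022)
n <- 10000

x    <- rnorm(n)
y    <- rnorm(n)           # independent of x: the true effect of x on y is 0
coll <- x + y + rnorm(n)   # collider: arrows from both x and y point into it

# Path x -> coll <- y is blocked when we leave the collider alone
no_control <- coef(lm(y ~ x))[["x"]]

# Controlling for the collider opens the path and creates a spurious association
with_control <- coef(lm(y ~ x + coll))[["x"]]

c(no_control = no_control, with_control = with_control)
```

Without the collider the estimate sits near zero; once `coll` is added as a control, the coefficient on `x` becomes clearly negative even though `x` has no effect on `y`.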

See R Practice on DAGs for examples.

# Theory Problems

For the following questions, please *show all work* and explain answers as necessary. You may lose points if you only write the correct answer. You may use `R` to *verify* your answers, but you are expected to reach the answers in this section “manually.”

## Question 6

A pharmaceutical company is interested in estimating the impact of a new drug on cholesterol levels. They enroll 200 people in a clinical trial. People are randomly assigned to the treatment group or to the control group. Half of the people are given the new drug and half the people are given a sugar pill with no active ingredient. To examine the impact of dosage on reductions in cholesterol levels, the authors of the study estimate the following model:

\[\text{cholesterol level}_i = \beta_0+\beta_1 \text{dosage level}_i + u_i\]

For people in the control group, dosage level\(_i=0\) and for people in the treatment group, dosage level\(_i\) measures milligrams of the active ingredient. In this case, the authors find a large, negative, statistically significant estimate of \(\hat{\beta_1}\). Is this an unbiased estimate of the impact of dosage on change in cholesterol level? Why or why not? Do you expect the estimate to overstate or understate the true relationship between dosage and cholesterol level?

Consider the 4th assumption about the error term, \(u_i\). Does knowing whether (or how much) a person was treated convey any information about other characteristics that affect cholesterol level (in \(u_i\))? Again, we are asking if \(\mathbb{E}[u|X]=0\) or \(corr(X, u)=0\).

In this case, the answer is clearly no; knowing whether or not someone received treatment tells us *nothing* else about the person that might affect their cholesterol levels (i.e. age, height, diet, weight, family history, etc., all in \(u_i)\) because treatment is *randomly* assigned.

In this case, because treatment is exogenous, \(\mathbb{E}[\hat{\beta_1}]=\beta_1\), \(\hat{\beta_1}\) is unbiased.

```
dagify(chol ~ treat + u,
exposure = "treat",
outcome = "chol") %>%
ggdag_status()+
theme_dag()+
theme(legend.position="none")
```

## Question 7

Data were collected from a random sample of 220 home sales from a community in 2017.

\[\widehat{Price}=119.2+0.485 \, BDR+23.4 \, Bath+0.156 \, Hsize+0.002 \, Lsize+0.090 \, Age\]

| Variable | Description |
|---|---|
| \(Price\) | selling price (in $1,000s) |
| \(BDR\) | number of bedrooms |
| \(Bath\) | number of bathrooms |
| \(Hsize\) | size of the house (in ft\(^2\)) |
| \(Lsize\) | lot size (in ft\(^2\)) |
| \(Age\) | age of the house (in years) |

### Part A

Suppose that a homeowner converts part of an existing living space in her house to a new bathroom. What is the expected increase in the value of the house?

From \(\hat{\beta_2}\), $23,400.

### Part B

Suppose a homeowner adds a new bathroom to her house, which also increases the size of the house by 100 square feet. What is the expected increase in the value of the house?

In this case, \(\Delta Bath=1\) and \(\Delta Hsize=100\). The resulting expected increase in price is \(23.4(1)+0.156(100)=39.0\), or $39,000.
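The arithmetic for Parts A and B can be verified in `R` by multiplying the estimated coefficients by the change in each regressor. The vector names below are just labels chosen for this sketch.

```r
# Estimated slope coefficients from the regression above (price in $1,000s)
b <- c(BDR = 0.485, Bath = 23.4, Hsize = 0.156, Lsize = 0.002, Age = 0.090)

# Part A: one more bathroom, nothing else changes
change_a <- c(BDR = 0, Bath = 1, Hsize = 0, Lsize = 0, Age = 0)

# Part B: one more bathroom and 100 more square feet of house
change_b <- c(BDR = 0, Bath = 1, Hsize = 100, Lsize = 0, Age = 0)

delta_a <- sum(b * change_a)  # 23.4, i.e. $23,400
delta_b <- sum(b * change_b)  # 39.0, i.e. $39,000
c(part_a = delta_a, part_b = delta_b)
```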

### Part C

Suppose the \(R^2\) of this regression is 0.727. Calculate the adjusted \(\bar{R}^2\).

There are \(n=220\) observations and \(k=5\) regressors (\(BDR\), \(Bath\), \(Hsize\), \(Lsize\), and \(Age\)), so:

\[ \begin{align*} \bar{R}^2&=1-\frac{n-1}{n-k-1}(1-R^2)\\ &=1-\frac{220-1}{220-5-1}(1-0.727)\\ &=1-\frac{219}{214}(0.273)\\ &=0.721\\ \end{align*} \]

### Part D

Suppose the following auxiliary regression for \(BDR\) has an \(R^2\) of 0.841.

\[\widehat{BDR}=\delta_0+\delta_1Bath+\delta_2Hsize+\delta_3Lsize+\delta_4Age\]

Calculate the Variance Inflation Factor for \(BDR\) and explain what it means.

\[ \begin{align*} VIF&=\frac{1}{1-R^2_j}\\ &=\frac{1}{1-0.841}\\ &=\frac{1}{0.159}\\ &=6.29\\ \end{align*} \]

The variance of \(\hat{\beta_1}\) (on \(BDR\)) is inflated by a factor of 6.29 (it is 629% of what it would be without multicollinearity) due to multicollinearity between \(BDR\) and the other \(X\)-variables.
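A quick check of this calculation in `R`, using a small helper function (`vif_from_r2` is a name made up for this sketch). Since the standard error is the square root of the variance, the square root of the VIF shows how much the standard error on the \(BDR\) coefficient is inflated.

```r
# Hypothetical helper: VIF from the R-squared of an auxiliary regression
vif_from_r2 <- function(r2) 1 / (1 - r2)

vif_bdr   <- vif_from_r2(0.841)  # variance inflation for the BDR coefficient
se_factor <- sqrt(vif_bdr)       # corresponding inflation of the standard error

round(vif_bdr, 2)    # 6.29
round(se_factor, 2)  # 2.51
```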

## Question 8

A researcher wants to investigate the effect of education on average hourly wages. Wage, education, and experience in the dataset have the following correlations:

| | Wage | Education | Experience |
|---|---|---|---|
| Wage | 1.0000 | | |
| Education | 0.4059 | 1.0000 | |
| Experience | 0.1129 | -0.2995 | 1.0000 |

She runs a simple regression first, and gets the results:

\[\widehat{\text{Wage}} = -0.9049 + 0.5414 \, Education\]

She runs another regression, and gets the results:

\[\widehat{\text{Experience}} = 35.4615 - 1.4681 \, Education\]

### Part A

If the true marginal effect of experience on wages (holding education constant) is 0.0701, calculate the omitted variable bias in the first regression caused by omitting experience. Does the estimate of \(\hat{\beta_1}\) in the first regression overstate or understate the effect of education on wages?

We know that the estimate in the biased regression (first one above) is a function of:

\[\hat{\alpha_1}=\hat{\beta_1}+\hat{\beta_2}\hat{\delta_1}\]

Where:

\(\hat{\alpha_1}\): the coefficient on education in the biased regression (0.5414)

\(\hat{\beta_1}\): the true effect of education on wages (??)

\(\hat{\beta_2}\): the true effect of experience on wages (0.0701)

\(\hat{\delta_1}\): the effect of education on experience (from an auxiliary regression) (-1.4681)

\[ \begin{align*} OVB&=\hat{\beta_2} \hat{\delta_1}\\ &=(0.0701)(-1.4681)\\ &=-0.1029\\ \end{align*} \]

Since the bias is negative, it understates the effect of education (due to experience). Note there are other factors that can bias the effect of education, but at least through experience, the bias is negative since the two are negatively related in the data (see the second regression).

### Part B

Knowing this, what would be the *true effect* of education on wages, holding experience constant?

We know the biased estimate for education, 0.5414. Plugging in both this and the bias, we get the “true” effect:

\[ \begin{align*} \alpha_1&=\beta_1+\beta_2\delta_1\\ 0.5414&=\beta_1-0.1029\\ 0.6443&=\beta_1\\ \end{align*} \]
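The bias and the implied "true" effect can be verified in `R` from the numbers given in the question. The object names simply mirror the symbols used above.

```r
alpha1 <- 0.5414   # coefficient on education in the simple (biased) regression
beta2  <- 0.0701   # true effect of experience on wages, as given
delta1 <- -1.4681  # slope from regressing experience on education

bias  <- beta2 * delta1  # omitted variable bias
beta1 <- alpha1 - bias   # effect of education holding experience constant

round(bias, 4)   # -0.1029
round(beta1, 4)  # 0.6443
```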

### Part C

The \(R^2\) for the second regression is 0.0897. If she were to run a better regression including both education and experience, how much would the variance of the coefficients on education and experience increase? Why?

Here we need to calculate the Variance Inflation Factor (VIF) by using the \(R^2\) from the auxiliary regression.

\[ \begin{align*} VIF &=\frac{1}{1-R^2_j}\\ &=\frac{1}{1-(0.0897)}\\ &=\frac{1}{0.9103}\\ &=1.0985\\ \end{align*} \]

The variance is inflated only by a factor of 1.0985 (a 9.85% increase) due to fairly weak multicollinearity between education and experience.
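With only two regressors, the \(R^2\) of the auxiliary regression is just the squared correlation between them, so the same VIF can be recovered directly from the correlation table. A short sketch:

```r
r_educ_exper <- -0.2995        # correlation between education and experience (from the table)

r2_aux <- r_educ_exper^2       # R-squared of the auxiliary regression
vif    <- 1 / (1 - r2_aux)     # variance inflation factor

round(r2_aux, 4)               # 0.0897
round(vif, 4)                  # 1.0985
round(100 * (vif - 1), 2)      # 9.85 (percent increase in variance)
```

This shortcut only works when there is a single other regressor; with more regressors the auxiliary \(R^2\) has to come from the full auxiliary regression.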

# R Questions

Answer the following questions using `R`. When necessary, please write answers in the same document (rendered to `html` or `pdf`, typed `.doc(x)`, or handwritten) as your answers to the above questions. Be sure to include (email or print an `.R` file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

## Question 9

Download and read in the `heightwages.csv` dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

```
# run or edit this chunk
heights <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/heightwages.csv")
```

```
New names:
Rows: 12686 Columns: 22
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(22): ...1, male, white, black, hispanic, mompro2, poppro2, siblings, no...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
```

This data is a part of a larger dataset from the National Longitudinal Survey of Youth (NLSY) 1979 cohort: a nationally representative sample of 12,686 men and women aged 14-22 years old when they were first surveyed in 1979. They were subsequently interviewed every year through 1994 and then every other year afterwards. There are many included variables, but for now we will just focus on:

| Variable | Description |
|---|---|
| `wage96` | Adult hourly wages ($/hr) reported in 1996 |
| `height85` | Adult height (inches) reported in 1985 |
| `height81` | Adolescent height (inches) reported in 1981 |

We want to figure out what is the effect of height on wages (e.g. do taller people earn more on average than shorter people?)

### Part A

Create a quick scatterplot between `height85` (as \(X\)) and `wage96` (as \(Y\)).

```
ggplot(data = heights)+
aes(x = height85, y = wage96)+
geom_jitter(color = "blue")+
geom_smooth(method = "lm", color = "red")+
labs(x = "Adult Height in 1985 (inches)",
y = "Hourly Wage in 1996 ($)")+
theme_classic(base_family = "Fira Sans Condensed",
base_size = 20)
```

`` `geom_smooth()` using formula 'y ~ x' ``

`Warning: Removed 5973 rows containing non-finite values (stat_smooth).`

`Warning: Removed 5973 rows containing missing values (geom_point).`