# Problem Set 6

Author

Published

Invalid Date

# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(fixest) # for fixed effects and 2SLS/IV regression
library(car) # for f-test
library(dagitty) # for dags
library(ggdag) # for plotting dags
library(ggraph) # for plotting dags

# Theory and Concepts

## Question 1

In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.

Fixed effects can be used for panel data, where each observation$$_{it}$$ belongs to a group $$i$$ and a time period $$t$$. Running a simple OLS model as usual, known as a pooled model would be biased due to systematic relationships within and between groups and time periods that cause $$X_{it}$$ to be endogenous:

$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+u_{it}$

Group-fixed effects $$(\alpha_i$$) pull out of the error term all of the factors that explain $$Y_{it}$$ that are stable over time within each individual group ($$i$$), but vary across groups. For example, if groups are U.S. states, state-fixed effects pull out all of the error term all of the idiosyncrasies of each state that do not change over time, such as cultural differences, geographical differences, historical differences, population differences, etc. Group fixed effects do not pull out factors that change over time, such as federal laws passed, recessions affecting all states, etc.

Time-fixed effects $$(\theta_i)$$ pull out of the error term all of the factors that explain $$Y_{it}$$ that change over time but do not vary across groups. For example, if groups are U.S. states, and time is in years, year-fixed effects pull out all of the error term all of the changes over the years that affect all U.S. states, such as recessions, inflation, national population growth, national immigration trends, or federal laws passed that affect all states. Time-fixed effects do not pull out factors that do not change over time.

$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+ \alpha_{i} + \theta_{t} + \epsilon_{it}$

Mechanically, OLS estimates a separate constant for each group (and/or each time period), giving the expected value of $$Y_{it}$$ for that group or time-period. This can be done by either de-meaning the data and calculating OLS coefficients by exploiting the variation within each group and/or time period (which is why fixed effects estimators are called “within” estimators), or by creating a dummy variable for each group and/or time period (and omitting one of each, to avoid the dummy variable trap).

## Question 2

In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?

$\hat{Y}_{it}=\beta_0+\beta_1 \, \text{Treated}_i +\beta_2 \, \text{After}_{t}+\beta_3 \, (\text{Treated}_i \times \text{After}_{t})+u_{it}$

A difference-in-difference model compares the difference before and after a treatment between a group that receives the treatment and a group that does not. By doing so, we can isolate the effect of the treatment. It is easiest to see in the following equation:

$\Delta \Delta Y = (\text{Treated}_{\text{after}}-\text{Treated}_{\text{before}})-(\text{Control}_{\text{after}}-\text{Control}_{\text{before}})$

In OLS regression, we can simply make two dummy variables for observations$$_it$$ depending on the group and the time period each observation is:

$\text{Treated}_i = \begin{cases} 1 & \text{if } i \text{ receives treatment} \\ 0 & \text{if } i \text{ does not receive treatment }\\ \end{cases}$

$\text{After}_t = \begin{cases} 1 & \text{if } t \text{is after treatment period} \\ 0 & \text{if } t \text{is before treatment period}\\ \end{cases}$

Lastly, we make an interaction term $$\text{Treated}_i \times \text{After}_t$$ which isolates the treatment effect, captured by the coefficient on this term, $$\beta_3$$.

Diff-in-diff models assume a counterfactual that if the treated group did not receive treatment, it would have experience the same change as the control group over the pre- to post-treatment period.

A classic example is if a state(s) passes a law and others do not. We want to compare the difference in the difference before and after the law was passed between states that passed the law and states that did not:

$\Delta_i \Delta_t Outcome = (\text{Passed}_{after}-\text{Passed}_{before})-(\text{Didn't}_{after}-\text{Didn't}_{before})$

“Treatment$$_i$$” and time-fixed effects for “After$$_t$$” to estimate the treatment effect (still $$\beta_3$$): $\hat{Y}_{it}=\alpha_i +\theta_{t}+\beta_3 \, (\text{Treated}_i \times \text{After}_{t})+\epsilon_{it}$

control<-tribble(
~x, ~y,
1, 1,
2, 3
)
treatment<-tribble(
~x, ~y,
1, 4,
2, 8
)

ggplot(data = tibble(x=1,3))+
aes(x = x)+
# control
geom_point(data = control, aes(x=x,y=y), size = 3, color = "red")+
geom_text(x=1,y=0.5, color = "red", label=expression(C))+
geom_text(x=2,y=3.5, color = "red", label=expression(C))+

# line
geom_segment(x=1, y=1, xend=2, yend=3, color="red", size = 2)+

# beta 0
geom_label(x=0.9,y=1, color = "black", label=expression(hat(beta)))+

# line slope
geom_segment(x=1, y=1, xend=2, yend=1, color="red", linetype = "dotted", size = 1)+
geom_segment(x=2, y=1, xend=2, yend=3, color="red", linetype = "dotted", size = 1)+
# beta 1+2
geom_label(x=2.2,y=3, color = "black", label=expression(hat(beta)+hat(beta)))+

# treatment
geom_point(data = treatment, aes(x=x,y=y), size = 3, color = "green")+
geom_text(x=1,y=4.5, color = "green", label=expression(T))+
geom_text(x=2,y=8.5, color = "green", label=expression(T))+

# line
geom_segment(x=1, y=4, xend=2, yend=8, color="green", size = 2)+

# beta 0 plus beta 1
geom_label(x=0.8,y=4, color = "black", label=expression(hat(beta)+hat(beta)))+

# line slope
geom_segment(x=1, y=4, xend=2, yend=4, color="green", linetype = "dotted", size = 1)+
geom_segment(x=2, y=4, xend=2, yend=6, color="red", linetype = "dotted", size = 1)+

# diag
geom_segment(x=1, y=4, xend=2, yend=6, color="red", linetype = "dashed", size = 1)+

geom_segment(x=2, y=6, xend=2, yend=8, color="blue", linetype = "dashed", size = 1)+

# beta 3
geom_label(x=2.25,y=8, color = "black",
label=expression(hat(beta)+hat(beta)+hat(beta)+hat(beta)))+

# beta 1 dif
annotate("segment", x = 1, xend = 1, y = 1, yend = 4, colour = "black", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
geom_label(x=1,y=2.5, size = 4, color = "black", label=expression(hat(beta)))+

# beta 2 dif
annotate("segment", x = 2, xend = 2, y = 1, yend = 3, colour = "black", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
geom_label(x=2,y=2, size = 4, color = "black", label=expression(hat(beta)))+

# beta 2 dif
annotate("segment", x = 2, xend = 2, y = 4, yend = 6, colour = "black", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
geom_label(x=2,y=5, size = 4, color = "black", label=expression(hat(beta)))+

# beta 2 dif
annotate("segment", x = 2, xend = 2, y = 6, yend = 8, colour = "blue", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
geom_label(x=2,y=7, size = 4, color = "blue", label=expression(hat(beta)))+

scale_x_continuous(breaks=c(1,2),
labels=c(expression(t[before]),expression(t[after])),
limits=c(0.5,2.5))+
scale_y_continuous(breaks=NULL,
limits=c(0,9))+
labs(x = "Time",
y = "Y")+
theme_classic(base_family = "Fira Sans Condensed",
base_size=14)
Warning: Using size aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use linewidth instead.
Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'
Control Treatment Group Diff $$(\Delta Y_i)$$
Before $$\beta_0$$ $$\beta_0+\beta_1$$ $$\beta_1$$
After $$\beta_0+\beta_2$$ $$\beta_0+\beta_1+\beta_2+\beta_3$$ $$\beta_1+\beta_3$$
Time Diff $$(\Delta Y_t)$$ $$\beta_2$$ $$\beta_2+\beta_3$$ $$\beta_3$$ Diff-in-diff $$(\Delta_i \Delta_t)$$

## Question 3

In your own words, describe the logic of an instrumental variables model: how does this approach identify the causal effect of $$X$$ on $$Y$$? What makes for a valid instrument? When using a Two-Stage Least Squares approach, what is estimated at each stage?

Suppose we have a simple linear regression model of

$Y_i = \beta_0 + \beta_1 \, X_i + u_i$

Instrumental variables techniques attempt to isolate and use only the variation in an endogenous $$X$$ variable (correlated with the error term, other factors that cause $$Y$$) that is exogenous, that is, not correlated with the error term. This is done by using an instrumental variable, $$I$$ that is correlated with $$X$$, but not a direct cause of $$Y$$; only indirectly, through $$X$$, does it cause $$Y$$. Thus, there are two conditions for a valid instrument:

1. Relevance (the “inclusion condition”): instrument $$I$$ must be correlated with $$X$$.
2. Exogeneity (the “exclusion condition”): instrument $$I$$ must not be correlated with the error term, $$u$$.

The first condition is testable with a “first stage” regression of $$X$$ on the instrument(s) $$I$$ (and any covariates, $$X_2, \cdots X_k$$):

$X_i = \gamma_0 + \gamma_1 I_i + w_i$ If $$\gamma_1$$ is statistically significant (i.e. we can reject $$H_0$$: $$\gamma_1 = 0$$), we have good evidence that the instrument is relevant.

The second condition is not testable statistically, and requires a plausible argument that $$I$$ does not cause $$Y$$ except through $$X$$.

We also run the “reduced form” regression of the instrument $$I$$ on outcome $$Y$$ (and any covariates):

$Y_i = \pi_0 + \pi_1 I_i + v_i$

The instrumental variable estimator for the original $$\beta_1$$, $$\beta_1^{IV} = \frac{pi_1}{\gamma_1}$$, that is, we scale the “direct” effect of $$I \rightarrow Y$$ ($$\pi_1$$) by the effect of $$I \rightarrow X$$ ($$\gamma_1$$), since it is only $$X$$ that directly affects $$Y$$, not $$I$$.

iv_colors <- tribble(
~name, ~Variable, ~Type,
"X", "X", "Endogenous",
"Y", "Y", "Outcome",
"I", "Z", "Exogenous",
"U", "Z", "Exogenous",
)

iv_dag <- dagify(Y ~ X + U,
X ~ Z + U,
outcome = "Y",
exposure = "X") %>%
tidy_dagitty(seed = 256)

# Add node types/colors to DAG data
iv_dag$data <- iv_dag$data %>%
left_join(iv_colors, by = "name")

iv_plot<-ggplot(data = iv_dag)+
aes(x = x,
y = y,
xend = xend,
yend = yend)+
geom_dag_point(size = 10,
aes(color = Type)) +
geom_dag_edges(start_cap = circle(4, "mm"),
end_cap = circle(4, "mm")) +
geom_dag_text(size = 3, family = "Fira Sans") +
scale_dag()+
theme_void()+
scale_shape_manual(values = c(16,17))+
theme(legend.position = "none")
Warning: 'scale_dag' is deprecated.
See help("Deprecated")
iv_plot We can generalize this, if we want to use multiple endogenous variables and/or multiple instruments, with Two Stage Least Squares (2SLS). We run the first stage regression of the endogenous variable $$X$$ on the instrument(s) $$I$$:

$X_i = \gamma_0 + \gamma_1 I_i + w_i$

Then we take the predicted (or fitted) values of $$\hat{X_i}$$ from the first stage, and substitute them in for our exogenous variable $$X$$ in the second stage regression:

$Y_i = \beta_0 + \beta_1 \, \hat{X}_i + u_i$ This gives us an unbiased estimator of $$\beta_1$$, so long as the instrument(s) is valid.

# R Questions

Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

## Question 4 (Fixed Effects)

Download and read in PeaceCorps.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
pc <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/PeaceCorps.csv")
Rows: 306 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, stateshort
dbl (4): year, unemployrate, appspc, stcode

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:

Variable Description
state U.S. State
year Year
appspc Applications to the Peace Corps (per capita) in State
unemployrate State unemployment rate

Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?

### Part A

Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.

If joining the Peace Corps is a substitute for private sector jobs, then as unemployment rises, so too should applications. Although, it could also be possible that people also are more willing to opt out of the workforce when the economy is strong, so we should examine this empirically to be sure.

### Part B

To get the hang of the data we’re working with, count (separately) the number of states, and the number of years. Get the number of n_distinct() states and years (Do these inside the summarize() command), as well as the distinct() values of each (don’t use the summarize() command for this part).

# count number of states
pc %>%
count(state)
# count number of years
pc %>%
count(year)
# get number of distinct states and years
pc %>%
summarize(Num_states = n_distinct(state),
Num_years = n_distinct(year))
# get distinct values of states
pc %>%
distinct(state)
# get distinct values of years
pc %>%
distinct(year)

### Part C

Create a scatterplot of appspc (Y) on unemployrate (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.

ggplot(data = pc)+
aes(x = unemployrate,
y = appspc)+
geom_point(aes(color = as.factor(state)))+
geom_smooth(method = "lm")+
scale_x_continuous(breaks=seq(0,20,2),
labels = function(x){paste(x,"%", sep="")})+
labs(x = "Unemployment Rate",
y = "Peace Corps Applications (per capita)")+
theme_bw(base_family = "Fira Sans Condensed",
base_size=16)+
theme(legend.position = "none") # remove legend for State colors
geom_smooth() using formula = 'y ~ x'