Problem Set 6

Author

Answer Key

Published

Invalid Date

# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(fixest) # for fixed effects and 2SLS/IV regression
library(car) # for f-test
library(dagitty) # for dags
library(ggdag) # for plotting dags
library(ggraph) # for plotting dags

Theory and Concepts

Question 1

In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.

Fixed effects can be used for panel data, where each observation\(_{it}\) belongs to a group \(i\) and a time period \(t\). Running a simple OLS model as usual, known as a pooled model would be biased due to systematic relationships within and between groups and time periods that cause \(X_{it}\) to be endogenous:

\[\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+u_{it}\]

Group-fixed effects \((\alpha_i\)) pull out of the error term all of the factors that explain \(Y_{it}\) that are stable over time within each individual group (\(i\)), but vary across groups. For example, if groups are U.S. states, state-fixed effects pull out all of the error term all of the idiosyncrasies of each state that do not change over time, such as cultural differences, geographical differences, historical differences, population differences, etc. Group fixed effects do not pull out factors that change over time, such as federal laws passed, recessions affecting all states, etc.

Time-fixed effects \((\theta_i)\) pull out of the error term all of the factors that explain \(Y_{it}\) that change over time but do not vary across groups. For example, if groups are U.S. states, and time is in years, year-fixed effects pull out all of the error term all of the changes over the years that affect all U.S. states, such as recessions, inflation, national population growth, national immigration trends, or federal laws passed that affect all states. Time-fixed effects do not pull out factors that do not change over time.

\[\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+ \alpha_{i} + \theta_{t} + \epsilon_{it}\]

Mechanically, OLS estimates a separate constant for each group (and/or each time period), giving the expected value of \(Y_{it}\) for that group or time-period. This can be done by either de-meaning the data and calculating OLS coefficients by exploiting the variation within each group and/or time period (which is why fixed effects estimators are called “within” estimators), or by creating a dummy variable for each group and/or time period (and omitting one of each, to avoid the dummy variable trap).

Question 2

In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?

\[\hat{Y}_{it}=\beta_0+\beta_1 \, \text{Treated}_i +\beta_2 \, \text{After}_{t}+\beta_3 \, (\text{Treated}_i \times \text{After}_{t})+u_{it}\]

A difference-in-difference model compares the difference before and after a treatment between a group that receives the treatment and a group that does not. By doing so, we can isolate the effect of the treatment. It is easiest to see in the following equation:

\[\Delta \Delta Y = (\text{Treated}_{\text{after}}-\text{Treated}_{\text{before}})-(\text{Control}_{\text{after}}-\text{Control}_{\text{before}})\]

In OLS regression, we can simply make two dummy variables for observations\(_it\) depending on the group and the time period each observation is:

\[\text{Treated}_i = \begin{cases} 1 & \text{if } i \text{ receives treatment} \\ 0 & \text{if } i \text{ does not receive treatment }\\ \end{cases}\]

\[\text{After}_t = \begin{cases} 1 & \text{if } t \text{is after treatment period} \\ 0 & \text{if } t \text{is before treatment period}\\ \end{cases}\]

Lastly, we make an interaction term \(\text{Treated}_i \times \text{After}_t\) which isolates the treatment effect, captured by the coefficient on this term, \(\beta_3\).

Diff-in-diff models assume a counterfactual that if the treated group did not receive treatment, it would have experience the same change as the control group over the pre- to post-treatment period.

A classic example is if a state(s) passes a law and others do not. We want to compare the difference in the difference before and after the law was passed between states that passed the law and states that did not:

\[\Delta_i \Delta_t Outcome = (\text{Passed}_{after}-\text{Passed}_{before})-(\text{Didn't}_{after}-\text{Didn't}_{before})\]

“Treatment\(_i\)” and time-fixed effects for “After\(_t\)” to estimate the treatment effect (still \(\beta_3\)): \[\hat{Y}_{it}=\alpha_i +\theta_{t}+\beta_3 \, (\text{Treated}_i \times \text{After}_{t})+\epsilon_{it}\]

control<-tribble(
  ~x, ~y,
  1, 1,
  2, 3
)
treatment<-tribble(
  ~x, ~y,
  1, 4,
  2, 8
)

ggplot(data = tibble(x=1,3))+
  aes(x = x)+
  # control
  geom_point(data = control, aes(x=x,y=y), size = 3, color = "red")+
  geom_text(x=1,y=0.5, color = "red", label=expression(C[1]))+
  geom_text(x=2,y=3.5, color = "red", label=expression(C[2]))+
  
  # line
  geom_segment(x=1, y=1, xend=2, yend=3, color="red", size = 2)+
  
  # beta 0 
  geom_label(x=0.9,y=1, color = "black", label=expression(hat(beta)[0]))+
  
  # line slope
  geom_segment(x=1, y=1, xend=2, yend=1, color="red", linetype = "dotted", size = 1)+
  geom_segment(x=2, y=1, xend=2, yend=3, color="red", linetype = "dotted", size = 1)+
  # beta 1+2
  geom_label(x=2.2,y=3, color = "black", label=expression(hat(beta)[0]+hat(beta)[2]))+

  # treatment
  geom_point(data = treatment, aes(x=x,y=y), size = 3, color = "green")+
  geom_text(x=1,y=4.5, color = "green", label=expression(T[1]))+
  geom_text(x=2,y=8.5, color = "green", label=expression(T[2]))+
  
    # line
  geom_segment(x=1, y=4, xend=2, yend=8, color="green", size = 2)+

  # beta 0 plus beta 1 
  geom_label(x=0.8,y=4, color = "black", label=expression(hat(beta)[0]+hat(beta)[1]))+
  

  # line slope
  geom_segment(x=1, y=4, xend=2, yend=4, color="green", linetype = "dotted", size = 1)+
  geom_segment(x=2, y=4, xend=2, yend=6, color="red", linetype = "dotted", size = 1)+
  
  # diag
  geom_segment(x=1, y=4, xend=2, yend=6, color="red", linetype = "dashed", size = 1)+

  geom_segment(x=2, y=6, xend=2, yend=8, color="blue", linetype = "dashed", size = 1)+
  
  # beta 3
  geom_label(x=2.25,y=8, color = "black",
             label=expression(hat(beta)[0]+hat(beta)[1]+hat(beta)[2]+hat(beta)[3]))+

  # beta 1 dif
  annotate("segment", x = 1, xend = 1, y = 1, yend = 4, colour = "black", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
  geom_label(x=1,y=2.5, size = 4, color = "black", label=expression(hat(beta)[1]))+
  
  # beta 2 dif
  annotate("segment", x = 2, xend = 2, y = 1, yend = 3, colour = "black", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
  geom_label(x=2,y=2, size = 4, color = "black", label=expression(hat(beta)[2]))+
  
  # beta 2 dif
  annotate("segment", x = 2, xend = 2, y = 4, yend = 6, colour = "black", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
  geom_label(x=2,y=5, size = 4, color = "black", label=expression(hat(beta)[2]))+
  
  # beta 2 dif
  annotate("segment", x = 2, xend = 2, y = 6, yend = 8, colour = "blue", size=1, alpha=1, arrow=arrow(length=unit(0.5,"cm"), ends="both", type="closed"))+
  geom_label(x=2,y=7, size = 4, color = "blue", label=expression(hat(beta)[3]))+

    scale_x_continuous(breaks=c(1,2),
                  labels=c(expression(t[before]),expression(t[after])),
                  limits=c(0.5,2.5))+
  scale_y_continuous(breaks=NULL,
                     limits=c(0,9))+
  labs(x = "Time",
       y = "Y")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=14)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Control Treatment Group Diff \((\Delta Y_i)\)
Before \(\beta_0\) \(\beta_0+\beta_1\) \(\beta_1\)
After \(\beta_0+\beta_2\) \(\beta_0+\beta_1+\beta_2+\beta_3\) \(\beta_1+\beta_3\)
Time Diff \((\Delta Y_t)\) \(\beta_2\) \(\beta_2+\beta_3\) \(\beta_3\) Diff-in-diff \((\Delta_i \Delta_t)\)

Question 3

In your own words, describe the logic of an instrumental variables model: how does this approach identify the causal effect of \(X\) on \(Y\)? What makes for a valid instrument? When using a Two-Stage Least Squares approach, what is estimated at each stage?

Suppose we have a simple linear regression model of

\[Y_i = \beta_0 + \beta_1 \, X_i + u_i\]

Instrumental variables techniques attempt to isolate and use only the variation in an endogenous \(X\) variable (correlated with the error term, other factors that cause \(Y\)) that is exogenous, that is, not correlated with the error term. This is done by using an instrumental variable, \(I\) that is correlated with \(X\), but not a direct cause of \(Y\); only indirectly, through \(X\), does it cause \(Y\). Thus, there are two conditions for a valid instrument:

  1. Relevance (the “inclusion condition”): instrument \(I\) must be correlated with \(X\).
  2. Exogeneity (the “exclusion condition”): instrument \(I\) must not be correlated with the error term, \(u\).

The first condition is testable with a “first stage” regression of \(X\) on the instrument(s) \(I\) (and any covariates, \(X_2, \cdots X_k\)):

\[X_i = \gamma_0 + \gamma_1 I_i + w_i\] If \(\gamma_1\) is statistically significant (i.e. we can reject \(H_0\): \(\gamma_1 = 0\)), we have good evidence that the instrument is relevant.

The second condition is not testable statistically, and requires a plausible argument that \(I\) does not cause \(Y\) except through \(X\).

We also run the “reduced form” regression of the instrument \(I\) on outcome \(Y\) (and any covariates):

\[Y_i = \pi_0 + \pi_1 I_i + v_i\]

The instrumental variable estimator for the original \(\beta_1\), \(\beta_1^{IV} = \frac{pi_1}{\gamma_1}\), that is, we scale the “direct” effect of \(I \rightarrow Y\) (\(\pi_1\)) by the effect of \(I \rightarrow X\) (\(\gamma_1\)), since it is only \(X\) that directly affects \(Y\), not \(I\).

iv_colors <- tribble(
  ~name, ~Variable, ~Type,
  "X", "X", "Endogenous",
  "Y", "Y", "Outcome",
  "I", "Z", "Exogenous",
  "U", "Z", "Exogenous",
)

iv_dag <- dagify(Y ~ X + U,
              X ~ Z + U,
              outcome = "Y",
              exposure = "X") %>% 
  tidy_dagitty(seed = 256) 

# Add node types/colors to DAG data
iv_dag$data <- iv_dag$data %>% 
  left_join(iv_colors, by = "name")

iv_plot<-ggplot(data = iv_dag)+
  aes(x = x,
      y = y,
      xend = xend,
      yend = yend)+
  geom_dag_point(size = 10,
                 aes(color = Type)) +
  geom_dag_edges(start_cap = circle(4, "mm"),
                 end_cap = circle(4, "mm")) +
  geom_dag_text(size = 3, family = "Fira Sans") +
  scale_dag()+
  theme_void()+
  scale_shape_manual(values = c(16,17))+
  theme(legend.position = "none")
Warning: 'scale_dag' is deprecated.
Use 'scale_adjusted' instead.
See help("Deprecated")
iv_plot

We can generalize this, if we want to use multiple endogenous variables and/or multiple instruments, with Two Stage Least Squares (2SLS). We run the first stage regression of the endogenous variable \(X\) on the instrument(s) \(I\):

\[X_i = \gamma_0 + \gamma_1 I_i + w_i\]

Then we take the predicted (or fitted) values of \(\hat{X_i}\) from the first stage, and substitute them in for our exogenous variable \(X\) in the second stage regression:

\[Y_i = \beta_0 + \beta_1 \, \hat{X}_i + u_i\] This gives us an unbiased estimator of \(\beta_1\), so long as the instrument(s) is valid.

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

Question 4 (Fixed Effects)

Download and read in PeaceCorps.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
pc <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/PeaceCorps.csv")
Rows: 306 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, stateshort
dbl (4): year, unemployrate, appspc, stcode

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:

Variable Description
state U.S. State
year Year
appspc Applications to the Peace Corps (per capita) in State
unemployrate State unemployment rate

Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?

Part A

Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.

If joining the Peace Corps is a substitute for private sector jobs, then as unemployment rises, so too should applications. Although, it could also be possible that people also are more willing to opt out of the workforce when the economy is strong, so we should examine this empirically to be sure.

Part B

To get the hang of the data we’re working with, count (separately) the number of states, and the number of years. Get the number of n_distinct() states and years (Do these inside the summarize() command), as well as the distinct() values of each (don’t use the summarize() command for this part).

# count number of states
pc %>%
  count(state)
# count number of years
pc %>%
  count(year)
# get number of distinct states and years
pc %>%
  summarize(Num_states = n_distinct(state),
            Num_years = n_distinct(year))
# get distinct values of states
pc %>%
  distinct(state)
# get distinct values of years
pc %>%
  distinct(year)

Part C

Create a scatterplot of appspc (Y) on unemployrate (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.

ggplot(data = pc)+
    aes(x = unemployrate,
        y = appspc)+
    geom_point(aes(color = as.factor(state)))+
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "none") # remove legend for State colors
`geom_smooth()` using formula = 'y ~ x'

We can see there are very clear outliers at the top. Let’s plot text instead of points, using the stateshort to see which observations are which states:

ggplot(data = pc)+
    aes(x = unemployrate,
        y = appspc)+
    geom_text(aes(color = as.factor(state), label = stateshort))+ #<<
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "none") # remove legend for State colors
`geom_smooth()` using formula = 'y ~ x'

Clearly DIS, which is District of Columbia, are the outliers. Let’s make a second scatterplot without them:

pc %>%
  filter(state != "DISTRICT OF COLUMBIA") %>%
ggplot(data = .)+
    aes(x = unemployrate,
        y = appspc)+
    geom_point(aes(color = as.factor(state)))+
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "none") # remove legend for State colors
`geom_smooth()` using formula = 'y ~ x'

Part D

Run two pooled regressions, one with the outliers, and one without them. Write out the estimated regression equation for each. Interpret the coefficient, and comment on how it changes between the two regressions.

pooled <- lm(appspc ~ unemployrate, data = pc)
summary(pooled)

Call:
lm(formula = appspc ~ unemployrate, data = pc)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.616 -20.641  -9.719   5.171 310.328 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   46.2255     7.2857   6.345 8.07e-10 ***
unemployrate   0.8353     1.0362   0.806    0.421    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.2 on 304 degrees of freedom
Multiple R-squared:  0.002133,  Adjusted R-squared:  -0.00115 
F-statistic: 0.6498 on 1 and 304 DF,  p-value: 0.4208

\[\widehat{\text{Apps per capita}_{it}}=46.23+0.84 \, \text{Unemployment Rate}_{it}\] For every 1 percentage point increase in unemployment, there are 0.46 more applications to the Peace Corps per capita.

pc_no_dc <- pc %>%
  filter(state != "DISTRICT OF COLUMBIA")

pooled_no_dc <- lm(appspc ~ unemployrate, data = pc_no_dc)
summary(pooled_no_dc)

Call:
lm(formula = appspc ~ unemployrate, data = pc_no_dc)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.380 -14.544  -4.190   9.278  97.997 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   48.5891     3.5955  13.514   <2e-16 ***
unemployrate  -0.3710     0.5133  -0.723     0.47    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.17 on 298 degrees of freedom
Multiple R-squared:  0.001751,  Adjusted R-squared:  -0.001599 
F-statistic: 0.5226 on 1 and 298 DF,  p-value: 0.4703

\[\widehat{\text{Apps per capita}_{it}}=48.59-0.37 \, \text{Unemployment Rate}_{it}\]

For every 1 percentage point increase in unemployment, there are 0.37 fewer applications to the Peace Corps per capita.

The coefficient changes signs and becomes a smaller magnitude by dropping the outliers (DC)!

Part E

Now run a regression with State fixed effects using the dummy variable method. (Ensure that state is a factor variable, and insert in the regression. You can either mutate() it into a factor beforehand, or just do factor(state) in the lm command.) Interpret the marginal effect of unemployrate on appspc. How did it change?

fe_reg <- lm(appspc ~ unemployrate + factor(state), data = pc)
summary(fe_reg)

Call:
lm(formula = appspc ~ unemployrate + factor(state), data = pc)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.652  -3.661  -0.393   3.492  33.262 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        12.6070     3.7663   3.347 0.000939 ***
unemployrate                        0.6565     0.2330   2.817 0.005225 ** 
factor(state)ALASKA                52.5875     4.8452  10.854  < 2e-16 ***
factor(state)ARIZONA               18.0456     4.8463   3.724 0.000242 ***
factor(state)ARKANSAS               4.9117     4.8447   1.014 0.311630    
factor(state)CALIFORNIA            22.1455     4.8692   4.548 8.39e-06 ***
factor(state)COLORADO              69.9201     4.8452  14.431  < 2e-16 ***
factor(state)CONNECTICUT           28.9349     4.8446   5.973 7.84e-09 ***
factor(state)DELAWARE              21.8317     4.8489   4.502 1.02e-05 ***
factor(state)DISTRICT OF COLUMBIA 311.7036     4.8533  64.225  < 2e-16 ***
factor(state)FLORIDA               13.0805     4.8492   2.697 0.007456 ** 
factor(state)GEORGIA               18.7073     4.8486   3.858 0.000145 ***
factor(state)HAWAII                35.7883     4.8617   7.361 2.53e-12 ***
factor(state)IDAHO                 34.4465     4.8480   7.105 1.21e-11 ***
factor(state)ILLINOIS              27.8559     4.8503   5.743 2.65e-08 ***
factor(state)INDIANA               21.5183     4.8478   4.439 1.35e-05 ***
factor(state)IOWA                  28.9038     4.8614   5.946 9.06e-09 ***
factor(state)KANSAS                28.9728     4.8507   5.973 7.83e-09 ***
factor(state)KENTUCKY               9.6535     4.8540   1.989 0.047800 *  
factor(state)LOUISIANA              2.5206     4.8517   0.520 0.603844    
factor(state)MAINE                 53.3394     4.8450  11.009  < 2e-16 ***
factor(state)MARYLAND              45.1903     4.8513   9.315  < 2e-16 ***
factor(state)MASSACHUSETTS         44.9452     4.8450   9.277  < 2e-16 ***
factor(state)MICHIGAN              26.8719     4.8970   5.487 9.86e-08 ***
factor(state)MINNESOTA             45.6153     4.8476   9.410  < 2e-16 ***
factor(state)MISSISSIPPI           -5.6364     4.8607  -1.160 0.247307    
factor(state)MISSOURI              19.9421     4.8458   4.115 5.23e-05 ***
factor(state)MONTANA               66.2285     4.8583  13.632  < 2e-16 ***
factor(state)NEBRASKA              26.6913     4.8904   5.458 1.14e-07 ***
factor(state)NEVADA                12.7169     4.8767   2.608 0.009656 ** 
factor(state)NEW HAMPSHIRE         54.0272     4.8658  11.103  < 2e-16 ***
factor(state)NEW JERSEY            13.4382     4.8452   2.774 0.005956 ** 
factor(state)NEW MEXICO            28.0890     4.8503   5.791 2.06e-08 ***
factor(state)NEW YORK              21.5156     4.8446   4.441 1.34e-05 ***
factor(state)NORTH CAROLINA        23.2097     4.8533   4.782 2.94e-06 ***
factor(state)NORTH DAKOTA          19.6067     4.9034   3.999 8.36e-05 ***
factor(state)OHIO                  25.2887     4.8500   5.214 3.83e-07 ***
factor(state)OKLAHOMA              10.4850     4.8560   2.159 0.031774 *  
factor(state)OREGON                72.7034     4.8545  14.976  < 2e-16 ***
factor(state)PENNSYLVANIA          21.7160     4.8449   4.482 1.12e-05 ***
factor(state)RHODE ISLAND          34.0396     4.8654   6.996 2.33e-11 ***
factor(state)SOUTH CAROLINA        12.0780     4.8651   2.483 0.013690 *  
factor(state)SOUTH DAKOTA          22.9175     4.8862   4.690 4.46e-06 ***
factor(state)TENNESSEE              9.1572     4.8498   1.888 0.060142 .  
factor(state)TEXAS                  9.2251     4.8455   1.904 0.058062 .  
factor(state)UTAH                  18.8667     4.8571   3.884 0.000131 ***
factor(state)VERMONT              104.7363     4.8580  21.560  < 2e-16 ***
factor(state)VIRGINIA              46.9318     4.8607   9.655  < 2e-16 ***
factor(state)WASHINGTON            63.8660     4.8460  13.179  < 2e-16 ***
factor(state)WEST VIRGINIA          3.9514     4.8461   0.815 0.415620    
factor(state)WISCONSIN             35.8930     4.8448   7.409 1.89e-12 ***
factor(state)WYOMING               35.3036     4.8665   7.254 4.88e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.391 on 254 degrees of freedom
Multiple R-squared:  0.9713,    Adjusted R-squared:  0.9655 
F-statistic: 168.4 on 51 and 254 DF,  p-value: < 2.2e-16

Part F

Find the coefficient for Maryland and interpret it. How many applications per capita does Maryland have?

This is just like any model with dummy variables representing different categories. The constant (12.61) represents the reference category not put in the regression (to avoid the dummy variable trap), i.e. ALABAMA.

The coefficient on Maryland is 45.19. This means that, on average, there are 45.19 more applications per capita in Maryland than Alabama over the time period in the data. To get the average for Maryland, we add \(12.61+45.19=57.80\).

# if we want to extract it using R
fe_reg_tidy <- tidy(fe_reg)

AL <- fe_reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

MD_AL_diff <- fe_reg_tidy %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)

AL + MD_AL_diff
[1] 57.79731

Part G

Now try using the feols() command (from the fixest package), which de-means the data, and make sure you get the same results as Part E. Do you get the same marginal effect of unemployrate on appspc?

fe_reg_alt <- feols(appspc ~ unemployrate | factor(state),
                    data = pc)
summary(fe_reg_alt)
OLS estimation, Dep. Var.: appspc
Observations: 306 
Fixed-effects: factor(state): 51
Standard-errors: Clustered (factor(state)) 
             Estimate Std. Error t value   Pr(>|t|)    
unemployrate 0.656509   0.177444 3.69982 0.00053739 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 7.64486     Adj. R2: 0.965505
                Within R2: 0.0303  

Part H

Now also include year fixed effects in your regression, using the dummy variable method. Interpret the marginal effect of unemployrate on appspc. How did it change?

TWFE_reg <- lm(appspc ~ unemployrate + factor(state) + factor(year), data = pc)
summary(TWFE_reg)

Call:
lm(formula = appspc ~ unemployrate + factor(state) + factor(year), 
    data = pc)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.7457  -3.1669   0.1384   2.7666  31.1644 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        14.48800    3.91879   3.697 0.000268 ***
unemployrate                        0.70525    0.54733   1.289 0.198760    
factor(state)ALASKA                52.57126    4.08265  12.877  < 2e-16 ***
factor(state)ARIZONA               18.01801    4.09035   4.405 1.57e-05 ***
factor(state)ARKANSAS               4.91897    4.07940   1.206 0.229036    
factor(state)CALIFORNIA            22.04317    4.23743   5.202 4.12e-07 ***
factor(state)COLORADO              69.93711    4.08307  17.129  < 2e-16 ***
factor(state)CONNECTICUT           28.93000    4.07894   7.093 1.36e-11 ***
factor(state)DELAWARE              21.87470    4.10713   5.326 2.24e-07 ***
factor(state)DISTRICT OF COLUMBIA 311.64268    4.13555  75.357  < 2e-16 ***
factor(state)FLORIDA               13.03582    4.10931   3.172 0.001702 ** 
factor(state)GEORGIA               18.66589    4.10502   4.547 8.50e-06 ***
factor(state)HAWAII                35.87361    4.18953   8.563 1.16e-15 ***
factor(state)IDAHO                 34.48472    4.10104   8.409 3.24e-15 ***
factor(state)ILLINOIS              27.80635    4.11635   6.755 1.00e-10 ***
factor(state)INDIANA               21.48091    4.10010   5.239 3.44e-07 ***
factor(state)IOWA                  28.98828    4.18745   6.923 3.75e-11 ***
factor(state)KANSAS                29.02399    4.11886   7.047 1.79e-11 ***
factor(state)KENTUCKY               9.59010    4.14017   2.316 0.021351 *  
factor(state)LOUISIANA              2.57585    4.12547   0.624 0.532951    
factor(state)MAINE                 53.35320    4.08152  13.072  < 2e-16 ***
factor(state)MARYLAND              45.24393    4.12277  10.974  < 2e-16 ***
factor(state)MASSACHUSETTS         44.95986    4.08187  11.015  < 2e-16 ***
factor(state)MICHIGAN              26.72242    4.41044   6.059 5.03e-09 ***
factor(state)MINNESOTA             45.65107    4.09827  11.139  < 2e-16 ***
factor(state)MISSISSIPPI           -5.71925    4.18336  -1.367 0.172813    
factor(state)MISSOURI              19.91937    4.08656   4.874 1.95e-06 ***
factor(state)MONTANA               66.30488    4.16773  15.909  < 2e-16 ***
factor(state)NEBRASKA              26.83099    4.36996   6.140 3.24e-09 ***
factor(state)NEVADA                12.59989    4.28489   2.941 0.003585 ** 
factor(state)NEW HAMPSHIRE         54.12221    4.21590  12.838  < 2e-16 ***
factor(state)NEW JERSEY            13.42198    4.08265   3.288 0.001156 ** 
factor(state)NEW MEXICO            28.13855    4.11635   6.836 6.25e-11 ***
factor(state)NEW YORK              21.52042    4.07894   5.276 2.87e-07 ***
factor(state)NORTH CAROLINA        23.14879    4.13555   5.598 5.73e-08 ***
factor(state)NORTH DAKOTA          19.76510    4.44960   4.442 1.34e-05 ***
factor(state)OHIO                  25.24078    4.11393   6.135 3.32e-09 ***
factor(state)OKLAHOMA              10.55484    4.15333   2.541 0.011652 *  
factor(state)OREGON                72.63838    4.14334  17.531  < 2e-16 ***
factor(state)PENNSYLVANIA          21.72903    4.08118   5.324 2.26e-07 ***
factor(state)RHODE ISLAND          33.94537    4.21360   8.056 3.28e-14 ***
factor(state)SOUTH CAROLINA        11.98462    4.21132   2.846 0.004799 ** 
factor(state)SOUTH DAKOTA          23.05074    4.34429   5.306 2.48e-07 ***
factor(state)TENNESSEE              9.11011    4.11274   2.215 0.027659 *  
factor(state)TEXAS                  9.24536    4.08494   2.263 0.024480 *  
factor(state)UTAH                  18.93985    4.16038   4.552 8.30e-06 ***
factor(state)VERMONT              104.81188    4.16587  25.160  < 2e-16 ***
factor(state)VIRGINIA              47.01469    4.18336  11.239  < 2e-16 ***
factor(state)WASHINGTON            63.84082    4.08836  15.615  < 2e-16 ***
factor(state)WEST VIRGINIA          3.97742    4.08900   0.973 0.331641    
factor(state)WISCONSIN             35.90359    4.08029   8.799 2.35e-16 ***
factor(state)WYOMING               35.40021    4.22056   8.388 3.73e-15 ***
factor(year)2007                   -5.36842    1.39953  -3.836 0.000159 ***
factor(year)2008                    0.02018    1.48030   0.014 0.989135    
factor(year)2009                    4.23877    2.61984   1.618 0.106939    
factor(year)2010                   -3.25521    2.76284  -1.178 0.239836    
factor(year)2011                   -8.88542    2.47990  -3.583 0.000409 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.064 on 249 degrees of freedom
Multiple R-squared:   0.98, Adjusted R-squared:  0.9756 
F-statistic: 218.3 on 56 and 249 DF,  p-value: < 2.2e-16

For every 1 percentage point increase in the unemployment rate, there are 0.71 applications per capita to the Peace Corps.

Part I

What would be the predicted number of applications in Maryland in 2011 at an unemployment rate of 5%?

\[\begin{align*} \widehat{\text{Apps per capita}_{it}}&=14.49 +0.71 \, \text{Unemployment Rate}_{it} + 45.24 \, \text{Maryland_i} -8.89 \text{2007}_t \\ \widehat{\text{Apps per capita}_{MD, 2011}}&=14.49 +0.71 \, (5) + 45.24 \, (1) -8.89 (1) \\ \widehat{\text{Apps per capita}_{MD, 2011}}& \approx 54.09 \\ \end{align*}\]

# if we want to extract it using R
TWFE_reg_tidy <- tidy(TWFE_reg)

constant <- TWFE_reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

unemploy_me <- TWFE_reg_tidy %>%
  filter(term == "unemployrate") %>%
  pull(estimate)

MD_dif <- TWFE_reg_tidy %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)

dif_2007 <- TWFE_reg_tidy %>%
  filter(term == "factor(year)2007") %>%
  pull(estimate)

constant + unemploy_me * 5 + MD_dif + dif_2007
[1] 57.88974
# some rounding error

Part J

Now try using the feols() command, which de-means the data, and make sure you get the same results as Part H. Do you get the same marginal effect of unemployrate on appspc?

TWFE_reg_alt <- feols(appspc ~ unemployrate | factor(state) + factor(year),
                    data = pc)
summary(TWFE_reg_alt)
OLS estimation, Dep. Var.: appspc
Observations: 306 
Fixed-effects: factor(state): 51,  factor(year): 6
Standard-errors: Clustered (factor(state)) 
             Estimate Std. Error t value Pr(>|t|) 
unemployrate 0.705245   0.521514  1.3523  0.18236 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 6.37247     Adj. R2: 0.975551
                Within R2: 0.006624

Part K

Can there still be endogeneity in this model? Give some examples.

Yes. There may be other time-varying variables that are omitted and correlated with unemployment, but not picked up in the time-fixed effect since they are not the same in each state. Examples could be a local market bubble some years in Texas or Nevada, but not in Maine, or bad weather in Florida one year, but not in Wyoming, etc.1

Part L

Create a nice regression table (using modelsummary) for comparison of the regressions in Parts D, G, and J.

modelsummary(models = list("OLS" = pooled,
                           "OLS (no DC)" = pooled_no_dc,
                           "State FE" = fe_reg_alt,
                           "TWFE" = TWFE_reg_alt),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "unemployrate" = "Unemployment Rate"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
OLS OLS (no DC) State FE TWFE
Constant 46.23*** 48.59***
(7.29) (3.60)
Unemployment Rate 0.84 −0.37 0.66*** 0.71
(1.04) (0.51) (0.18) (0.52)
n 306 300 306 306
Adj. R2 −0.001 −0.002 0.97 0.98
SER 45.06 22.10 7.64 6.37
* p < 0.1, ** p < 0.05, *** p < 0.01

Question 5 (Difference-in-Differences)

Download and read in TexasSchools.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
schools <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/TexasSchools.csv")
Rows: 7203 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): DistNumber, Year, OnCycle, LnAvgSalary, CycleSwitch

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Are teachers paid more when school board members are elected “off cycle” when there are not major national political elections (e.g. odd years) than “on cycle?” The argument is that during “off” years, without attention on state or national elections, voters will pay less attention to the election, and teachers can more effectively mobilize for higher pay, versus “on” years where voters are paying more attention. This data comes from Anzia, Sarah (2012), “The Election Timing Effect: Evidence from a Policy Intervention in Texas.” Quarterly Journal of Political Science 7(3): 277-297, and follows 1,020 Texas school board districts from 2003–2009.

From 2003–2006, all districts elected their school board members off-cycle. A change in Texas policy in 2006 led some, but not all, districts to elect their school board members on-cycle from 2007 onwards.

Variable Description
LnAvgSalary logged average salary of teachers in district
Year Year
OnCycle =1 if school boards elected on-cycle (e.g. same year as national and state elections), =0 if elected off-cycle
pol_freedom Political freedom index score (2018) from 1 (least) top 10 (most free)
CycleSwitch =1 if district switched from off- to on-cycle elections
AfterSwitch =1 if year is after 2006

Part A

Run a pooled regression model of LnAvgSalary on OnCycle. Write the estimated regression equation, and interpret the coefficient on OnCycle. Are there any sources of bias (consider in particular the argument in the question prompt)?

pooled_reg <- lm(LnAvgSalary ~ OnCycle, data = schools)
summary(pooled_reg)

Call:
lm(formula = LnAvgSalary ~ OnCycle, data = schools)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.28910 -0.05527 -0.00646  0.05196  0.65215 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 10.671363   0.001018 10478.815   <2e-16 ***
OnCycle     -0.030621   0.003766    -8.131    5e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08284 on 7137 degrees of freedom
  (64 observations deleted due to missingness)
Multiple R-squared:  0.009178,  Adjusted R-squared:  0.009039 
F-statistic: 66.11 on 1 and 7137 DF,  p-value: 4.995e-16

\[\widehat{\text{log(Average Salary})_it} = 10.67-0.03 \, \text{OnCycle}_{it}\]

Recognize this is a linear-log model and \(X\) is a dummy variable. School board elections that are held OnCycle \((=1)\) lead to \(\hat{\beta_1 \times 100\% = −0.03 \times 100\% = −3\%\) lower teacher salaries. This is highly statistically significant.

This is almost certainly biased. There are probably things correlated at the district level between both whether or not districts vote (or switch to voting) on cycle/off cycle and with teacher salaries. Perhaps some districts have strong/weak teachers unions and that determines whether they choose to vote on or off cycle (off cycle would be preferable, and perhaps districts with strong teachers unions kept the elections off cycle).

Also, there could be changes in the entire State over time-perhaps all Texas teacher salaries were increasing or decreasing over the years.

Part B

Some schools decided to switch to an on-cycle election after 2006. Consider this, CycleSwitch the “treatment.” Create a variable to indicate post-treatment years (i.e. years after 2006). Call it After. Create a second, interaction variable to capture the interaction effect between those districts that switched, and after the treatment.

# create a variable, that is a factor, using ifelse() function
#   if the year is greater than 2006, code "After" as 1,
#   if the year is NOT greater than 2006, code "After" as 0
schools <- schools %>%
  mutate(After = factor(ifelse(test = Year > 2006,
                        yes = 1,
                        no = 0)))
head(schools)

Part C

Now estimate a difference-in-difference model with your variables in Part B: CycleSwitch is the treatment variable, After is your post-treatment indicator, and add an interaction variable to capture the interaction effect between those districts that switched, and after the treatment. Write down the estimated regression equation (to four decimal places).

dnd <- lm(LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After, data = schools)
summary(dnd)

Call:
lm(formula = LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After, 
    data = schools)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.29811 -0.05595 -0.00734  0.05284  0.65245 

Coefficients:
                    Estimate Std. Error  t value Pr(>|t|)    
(Intercept)        10.671068   0.001433 7447.598  < 2e-16 ***
CycleSwitch        -0.023982   0.003397   -7.059 1.83e-12 ***
After1              0.009303   0.002188    4.251 2.16e-05 ***
CycleSwitch:After1 -0.008590   0.005189   -1.655   0.0979 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08334 on 7198 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.0183,    Adjusted R-squared:  0.01789 
F-statistic: 44.72 on 3 and 7198 DF,  p-value: < 2.2e-16

\[\widehat{\text{Log(Average Salary)}_{it}}=10.6711-0.0240 \, \text{Cycle Switch}_{i} + 0.0093 \, \text{After}_t - 0.0086 \, (\text{Cycle Switch}_i \times \text{After}_t)\]

Part D

Interpret what each coefficient means from Part C.

  • \(\hat{\beta_0}\): districts before 2007 that did not switch had ln(AvgSalary) of 10.6711
  • \(\hat{\beta_1}\): districts before 2007 that did switch had ln(AvgSalary) 0.0240 lower than districts that did not switch
  • \(\hat{\beta_2}\): all districts after the change had ln(AvgSalary) 0.0093 higher than before the change
  • \(\hat{\beta_3}\): districts that made a switch had a 0.0086 lower change in ln(AvgSalary) over time than districts that did not switch

Part E

Using your regression equation in Part C, calculate the expected logged average salary \((Y)\) for districts in Texas:

  1. Before the switch that did not switch
  2. After the switch that did not switch
  3. Before the switch that did switch
  4. After the switch that did switch
  1. Before change and did not switch: \(\hat{\beta_0} = 10.6711\)
  2. After change and did not switch: \(\hat{\beta_0}+\hat{\beta_2} = 10.6711 + 0.0093 = 10.6804\)
  3. Before change and did switch: \(\hat{\beta_0}+\hat{\beta_1} = 10.6711 − 0.0240 = 10.6471\)
  4. After change and did switch: \(\hat{\beta_0}+\hat{\beta_1}+\hat{\beta_2}+\hat{\beta_3} = 10.6711−0.0240+ 0.0093 − 0.0086 = 10.6478\)
Before After Time Diff
Did not switch \(\hat{\beta}_0\) \(\hat{\beta}_0+\hat{\beta}_2\) \(\hat{\beta}_2\)
Switched \(\hat{\beta}_0 + \hat{\beta}_1\) \(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3\) \(\hat{\beta}_2 + \hat{\beta}_3\)
Group Diff \(\hat{\beta}_1\) \(\hat{\beta}_1+\hat{\beta}_3\) \(\hat{\beta}_3\)
Before After Time Diff
Did not switch 10.6711 10.6804 0.00093
Switched 10.6471 10.6478 0.0007
Group Diff 0.0240 0.0226 -0.0086

Part F

Confirm your estimates in Part E by finding the mean logged average salary for each of those four groups in the data. (Hint, filter() properly then summarize().)

# avg log salary for districts BEFORE that did NOT switch
schools %>%
  filter(CycleSwitch == 0,
         After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
# avg log salary for districts AFTER that did NOT switch
schools %>%
  filter(CycleSwitch == 0,
         After == 1) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
# avg log salary for districts BEFORE that DID switch
schools %>%
  filter(CycleSwitch == 1,
         After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
# avg log salary for districts AFTER that DID switch
schools %>%
  filter(CycleSwitch == 1,
         After == 1) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's

Part G

Write out the difference-in-difference equation, and calculate the difference-in-difference. Make sure it matches your estimate from the regression.

\[\begin{align*} \Delta \Delta ln(AvgSalary)&=(Switched_{after}-Switched_{before})-(Didn't_{after}-Didn't_{before})\\ &=(10.6478-10.6471)-(10.6804-10.6711)\\ &=(0.0007)-(0.0093)\\ &=-0.0086\\ \end{align*}\]

This is \(\beta_3\).

Part H

Can we say anything about the types of districts that switched? Can we say anything about all salaries in the districts in the years after the switch?

We can see that the districts that switched paid 2.3% less to their teachers to begin with. \(\beta_2\) shows the difference, before treatment, between the treated (districts that switched) and control groups (districts that did not switch).

We can see that all districts saw a 0.93% increase in teacher salaries after the years after the switch. \(\beta_3\) shows the difference, for both the treatment and control, of salaries over time (before and after the switch).

Part I

Now let’s generalize the diff-in-diff model. Instead of the treatment and post-treatment dummies, use district-and year-fixed effects and the interaction term. Interpret the coefficient on the interaction term.

This is doable with the dummy variable method, but there will be a lot of dummies! I suggest using feols() from the fixest package.

dnd_twfe_reg <- feols(LnAvgSalary ~ CycleSwitch*After | factor(DistNumber) + factor(Year),
                    data = schools)
NOTE: 1 observation removed because of NA values (LHS: 1).
The variables 'CycleSwitch' and 'After1' have been removed because of collinearity (see $collin.var).
summary(dnd_twfe_reg)
OLS estimation, Dep. Var.: LnAvgSalary
Observations: 7,202 
Fixed-effects: factor(DistNumber): 1,029,  factor(Year): 7
Standard-errors: Clustered (factor(DistNumber)) 
                    Estimate Std. Error t value Pr(>|t|)    
CycleSwitch:After1 -0.008594   0.003936 -2.1835 0.029224 *  
... 2 variables were removed because of collinearity (CycleSwitch and After1)
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.029514     Adj. R2: 0.856129
                 Within R2: 0.003027

Part J

Create a nice regression table (using modelsummary) for comparison of the regressions in Parts A, C, and I.

modelsummary(models = list("OLS" = pooled_reg,
                           "DND" = dnd,
                           "TWFE" = dnd_twfe_reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "OnCycle" = "On Cycle",
                             "CycleSwitch" = "Switched",
                             "After1" = "After Switch",
                             "CycleSwitch:After1" = "Switched x After"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
OLS DND TWFE
Constant 10.67*** 10.67***
(0.001) (0.001)
On Cycle −0.03***
(0.004)
Switched −0.02***
(0.003)
After Switch 0.009***
(0.002)
Switched x After −0.009* −0.009**
(0.005) (0.004)
n 7139 7202 7202
Adj. R2 0.009 0.02 0.86
SER 0.08 0.08 0.03
* p < 0.1, ** p < 0.05, *** p < 0.01

Question 6 (Instrumental Variables)

Download and read in RainIV.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
ivrain <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/RainIV.csv")
Rows: 743 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country_name, country_code
dbl (19): year, GPCP, RainfallGrowth, LaggedRainfallGrowth, war_prio, Intern...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Does economic growth reduce the odds of civil conflict? Consider data on 41 African countries between 1981-1999. Below are key variables (among others in the data):

Variable Description
Internalconflict =1 if civil war (<25 deaths), else =0
LaggedGDPGrowth GDP Growth rate from previous year
LagggedRainfallGrowth Change (in mm) of rain from previous year
InitialGDP GDP per capita (1979)
Democracy Polity II score
country_name Country name
country_code Three letter ISO country code
year Year
pop Population
Mountains Percentage of territory that is mountainous terrain
EthnicFrac Ethnic fractionalization index (probability two random individuals are NOT from the same ethnic group)
ReligiousFrac Religious fractionalization index (probability two random individuals are NOT from the same religion)

Part A

Make a scatterplot between Internalconflict (as y) and LaggedGDPGrowth (as x), and add a regression line.

ggplot(data = ivrain)+
    aes(x = LaggedGDPGrowth,
        y = InternalConflict)+
    geom_jitter(aes(color = as.factor(country_name)), height=0.05)+
    geom_smooth(method = "lm")+
  scale_x_continuous(labels = function(x){paste(x,"%", sep="")})+
  scale_y_continuous(breaks = c(0,1))+
  labs(x = "Lagged GDP Growth Rate",
       y = "Internal Conflict")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "none") # remove legend for State colors
`geom_smooth()` using formula = 'y ~ x'

Part B

Run a regression of InternalconflictonLaggedGDPGrowth.` Note that since \(Y\) is a dummy variable, \(\beta_1\) is the probability of internal conflict in country \(i\) at time \(t\). Write out the estimated regression equation and interpret the marginal effect.

conflict_reg <- lm(InternalConflict ~ LaggedGDPGrowth, data = ivrain)
summary(conflict_reg)

Call:
lm(formula = InternalConflict ~ LaggedGDPGrowth, data = ivrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2999 -0.2689 -0.2660  0.7228  0.7876 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.26738    0.01631  16.389   <2e-16 ***
LaggedGDPGrowth -0.08206    0.22485  -0.365    0.715    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4434 on 741 degrees of freedom
Multiple R-squared:  0.0001797, Adjusted R-squared:  -0.00117 
F-statistic: 0.1332 on 1 and 741 DF,  p-value: 0.7152

\[\widehat{\text{Internal Conflict}_{it}} = 0.2674 - 0.0821 \, \text{Lagged GDP Growth}_{it}\]

For every additional 1% of GDP growth per capita from the previous year, the probability of internal conflict decreases by 0.08 (8%).

Part C

Run another regression, adding in LaggedGDPGrowth, InitialGDP, Democracy, Mountains EthnicFrac, and ReligiousFrac as controls. Write out the estimated regression equation. What happens to the marginal effect of LaggedGDPGrowth?

conflict_reg_controls <- lm(InternalConflict ~ LaggedGDPGrowth + InitialGDP + Democracy + Mountains + EthnicFrac + ReligiousFrac, data = ivrain)
summary(conflict_reg_controls)

Call:
lm(formula = InternalConflict ~ LaggedGDPGrowth + InitialGDP + 
    Democracy + Mountains + EthnicFrac + ReligiousFrac, data = ivrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5654 -0.2811 -0.2221  0.4570  0.9459 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.0703555  0.0731012   0.962  0.33614    
LaggedGDPGrowth -0.1087977  0.2200999  -0.494  0.62123    
InitialGDP      -0.0569091  0.0182258  -3.122  0.00186 ** 
Democracy        0.0012242  0.0028894   0.424  0.67193    
Mountains        0.0038654  0.0009527   4.057 5.49e-05 ***
EthnicFrac       0.3247931  0.0918181   3.537  0.00043 ***
ReligiousFrac    0.0105162  0.0958907   0.110  0.91270    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4334 on 736 degrees of freedom
Multiple R-squared:  0.05106,   Adjusted R-squared:  0.04332 
F-statistic:   6.6 on 6 and 736 DF,  p-value: 8.276e-07

\[\begin{align*} \widehat{\text{Internal Conflict}_{it}} = & \, 0.0704 - 0.1088 \, \text{Lagged GDP Growth}_{it} - 0.0569 \, \text{Initial GDP}_it + 0.0012 \, \text{Democracy}_{it} +\\ & 0.0039 \, \text{Mountains}_{it} + 0.3248 \, \text{Ethnic Fractionalization}_{it} + 0.0105 \, \text{Religious Fractionalization}_{it} \\ \end{align*}\]

The marginal effect of LaggedGDPGrowth increases from -0.0821 to -0.1088.

Part D

Now let’s consider LaggedRainfallGrowth as a potential instrument for LaggedGDPGrowth. First, check the correlations between InternalConflict, LaggedGDPGrowth and LaggedRainfallGrowth.

ivrain %>%
select(InternalConflict, LaggedGDPGrowth, LaggedRainfallGrowth) %>%
cor()
                     InternalConflict LaggedGDPGrowth LaggedRainfallGrowth
InternalConflict            1.0000000      -0.0134064           -0.0506118
LaggedGDPGrowth            -0.0134064       1.0000000            0.1263399
LaggedRainfallGrowth       -0.0506118       0.1263399            1.0000000

Part E

Now let’s consider if LaggedRainfallGrowth is a relevant instrument (the “inclusion condition”) by running the first stage regression of LaggedGDPGrowth on LaggedRainfallGrowth and the rest of the controls from part C. Is the OLS estimate on LaggedRainfallGrowth statistically significant?

first_stage <- lm(LaggedGDPGrowth ~ LaggedRainfallGrowth + InitialGDP + Democracy + Mountains + EthnicFrac + ReligiousFrac,
                 data = ivrain)
summary(first_stage)

Call:
lm(formula = LaggedGDPGrowth ~ LaggedRainfallGrowth + InitialGDP + 
    Democracy + Mountains + EthnicFrac + ReligiousFrac, data = ivrain)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46563 -0.03051  0.00459  0.03159  0.66704 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.0058041  0.0121558  -0.477 0.633168    
LaggedRainfallGrowth  0.0439770  0.0128127   3.432 0.000632 ***
InitialGDP           -0.0008056  0.0030287  -0.266 0.790327    
Democracy             0.0005373  0.0004798   1.120 0.263080    
Mountains             0.0001087  0.0001582   0.687 0.492435    
EthnicFrac            0.0031603  0.0152584   0.207 0.835976    
ReligiousFrac        -0.0017176  0.0159357  -0.108 0.914197    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07201 on 736 degrees of freedom
Multiple R-squared:  0.01844,   Adjusted R-squared:  0.01044 
F-statistic: 2.305 on 6 and 736 DF,  p-value: 0.0327

Yes, LaggedRainfallGrowth is statistically significant (since the p-value is much smaller than 0.05), implying the instrument is relevant. Each additional 1mm increase in rainfall from the previous year increases GDP growth rates by 0.04 percentage points.

Part F

There is no statistical test for whether LaggedRainfallGrowth is exogenous (the “exclusion condition”). Do you think it is plausible that LaggedRainfallGrowth is uncorrelated with the error term; that is, with any other factors that cause conflict; and that it only affects conflict through economic growth?

Rainfall and weather certainly affects crop yields, which may be a large part of the economy in these countries, so this would certainly be a channel by which it affects conflict. Can we concieve of rainfall affecting other things that affect whether a country is experiencing internal conflict?

Part G

Extract and save the fitted values of your first stage regression from part E. Now run the second stage regression by using the fitted values of LaggedGDPGrowth and the control variables from parts C and E. What is the marginal effect of LaggedGDPGrowth on InternalConflict?

# add fitted values from firststage regression to the ivrain data as a column called "laggdpgrowth_hat"
ivrain$laggdpgrowth_hat <- first_stage$fitted.values

second_stage <- lm(InternalConflict ~ laggdpgrowth_hat + InitialGDP + Democracy + Mountains + EthnicFrac + ReligiousFrac, data = ivrain)
summary(second_stage)

Call:
lm(formula = InternalConflict ~ laggdpgrowth_hat + InitialGDP + 
    Democracy + Mountains + EthnicFrac + ReligiousFrac, data = ivrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5653 -0.2830 -0.2237  0.4748  0.9952 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.0625061  0.0733775   0.852 0.394579    
laggdpgrowth_hat -2.0631528  1.7522115  -1.177 0.239394    
InitialGDP       -0.0580798  0.0182415  -3.184 0.001514 ** 
Democracy         0.0023613  0.0030592   0.772 0.440438    
Mountains         0.0040694  0.0009691   4.199 3.01e-05 ***
EthnicFrac        0.3288508  0.0918180   3.582 0.000364 ***
ReligiousFrac     0.0047242  0.0959548   0.049 0.960746    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4331 on 736 degrees of freedom
Multiple R-squared:  0.05253,   Adjusted R-squared:  0.04481 
F-statistic: 6.801 on 6 and 736 DF,  p-value: 4.937e-07

The effect is now a much larger -2.0631 (for every additional 1% of GDP growth per capita from the previous year, the probability of internal conflict decreases by 2.06 (206%). ), but it is no longer statistically significant.

Part H

Now run the same regression using feols() from the fixest package.

tsls <- feols(InternalConflict ~ InitialGDP + Democracy + Mountains + EthnicFrac + ReligiousFrac | LaggedGDPGrowth ~ LaggedRainfallGrowth,
           data = ivrain)
summary(tsls)
TSLS estimation, Dep. Var.: InternalConflict, Endo.: LaggedGDPGrowth, Instr.: LaggedRainfallGrowth
Second stage: Dep. Var.: InternalConflict
Observations: 743 
Standard-errors: IID 
                     Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)          0.062506   0.077268  0.808955 0.41880232    
fit_LaggedGDPGrowth -2.063153   1.845106 -1.118176 0.26385684    
InitialGDP          -0.058080   0.019209 -3.023646 0.00258426 ** 
Democracy            0.002361   0.003221  0.733011 0.46378490    
Mountains            0.004069   0.001020  3.987785 0.00007335 ***
EthnicFrac           0.328851   0.096686  3.401233 0.00070705 ***
ReligiousFrac        0.004724   0.101042  0.046755 0.96272101    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.453894   Adj. R2: -0.059159
F-test (1st stage), LaggedGDPGrowth: stat = 11.8    , p = 6.32e-4 , on 1 and 736 DoF.
                         Wu-Hausman: stat =  1.26244, p = 0.261556, on 1 and 735 DoF.

Part I

Make a nice regression table using modelsummary() using your results from Parts B, G, and H.

modelsummary(models = list("OLS" = conflict_reg,
                           "2SLS" = second_stage,
                           "2SLS (fixest)" = tsls),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "LaggedGDPGrowth" = "Lag GDP Growth Rate",
                             "laggdpgrowth_hat" = "Lag GDP Growth Rate",
                             "fit_LaggedGDPGrowth" = "Lag GDP Growth Rate"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
OLS 2SLS 2SLS (fixest)
Constant 0.27*** 0.06 0.06
(0.02) (0.07) (0.08)
Lag GDP Growth Rate −0.08 −2.06 −2.06
(0.22) (1.75) (1.85)
InitialGDP −0.06*** −0.06***
(0.02) (0.02)
Democracy 0.002 0.002
(0.003) (0.003)
Mountains 0.004*** 0.004***
(0.001) (0.001)
EthnicFrac 0.33*** 0.33***
(0.09) (0.10)
ReligiousFrac 0.005 0.005
(0.10) (0.10)
n 743 743 743
Adj. R2 −0.001 0.04 −0.06
SER 0.44 0.43 0.45
* p < 0.1, ** p < 0.05, *** p < 0.01

Footnotes

  1. It’s better to use the plm()-generated regressions so as to avoid the multitude of coefficients for the state and year dummies! You certainly could use the dummy variable ones and manually list all of the variables to suppress in the table inside omit_coefs()↩︎