Problem Set 6
Please read the instructions for completing and submitting homeworks.
Theory and Concepts
Question 1
In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.
Question 2
In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?
Question 3
In your own words, describe the logic of an instrumental variables model: how does this approach identify the causal effect of \(X\) on \(Y\)? What makes for a valid instrument? When using a Two-Stage Least Squares approach, what is estimated at each stage?
R Questions
Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.
Question 4 (Fixed Effects)
Download and read in the PeaceCorps.csv dataset.
How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:
Variable | Description |
---|---|
state | U.S. State |
year | Year |
appspc | Applications to the Peace Corps (per capita) in State |
unemployrate | State unemployment rate |
Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?
Part A
Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.
Part B
To get the hang of the data we’re working with, count (separately) the number of states, and the number of years. Get the number of n_distinct() states and years (do these inside the summarize() command), as well as the distinct() values of each (don’t use the summarize() command for this part).
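As a rough sketch of the two approaches, the snippet below uses a small made-up tibble rather than the real file. The object name `toy` and its values are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for the real data (values are made up for illustration)
toy <- tibble(
  state = rep(c("AL", "AK", "AZ"), each = 2),
  year  = rep(c(2006, 2007), times = 3)
)

# Number of distinct states and years, computed inside summarize()
counts <- toy %>%
  summarize(n_states = n_distinct(state),
            n_years  = n_distinct(year))
counts

# The distinct values themselves, without summarize()
toy %>% distinct(state)
toy %>% distinct(year)
```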
Part C
Create a scatterplot of appspc (Y) on unemployrate (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.
Part D
Run two pooled regressions, one with the outliers, and one without them. Write out the estimated regression equation for each. Interpret the coefficient, and comment on how it changes between the two regressions.
Part E
Now run a regression with State fixed effects using the dummy variable method. (Ensure that state is a factor variable, and insert it in the regression. You can either mutate() it into a factor beforehand, or just use factor(state) in the lm command.) Interpret the marginal effect of unemployrate on appspc. How did it change?
Part F
Find the coefficient for Maryland and interpret it. How many applications per capita does Maryland have?
Part G
Now try using the feols() command (from the fixest package), which de-means the data, and make sure you get the same results as Part E. Do you get the same marginal effect of unemployrate on appspc?
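To see why the de-meaned and dummy variable approaches should agree, here is a minimal base R sketch on simulated data. The groups, sample sizes, and coefficients are all made up, and the manual de-meaning only illustrates the idea; it is not a description of how feols() is implemented internally.

```r
set.seed(1)

# Simulated panel: 5 hypothetical groups observed over 6 periods
n_groups  <- 5
n_periods <- 6
group <- rep(letters[1:n_groups], each = n_periods)
alpha <- rep(rnorm(n_groups, sd = 2), each = n_periods)  # group-specific intercepts
x <- alpha + rnorm(n_groups * n_periods)                 # x is correlated with alpha
y <- 1 + 0.5 * x + alpha + rnorm(n_groups * n_periods)

# Dummy variable method: one dummy per group (minus a reference group)
b_dummy <- coef(lm(y ~ x + factor(group)))["x"]

# De-meaned method: subtract each group's mean from x and y
x_dm <- x - ave(x, group)
y_dm <- y - ave(y, group)
b_within <- coef(lm(y_dm ~ x_dm))["x_dm"]

# The two slope estimates should be identical up to floating point error
c(dummy = unname(b_dummy), within = unname(b_within))
```

Note that the standard errors from the manually de-meaned lm() are not adjusted for the group means that were estimated, so only the slope is directly comparable.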
Part H
Now also include year fixed effects in your regression, using the dummy variable method. Interpret the marginal effect of unemployrate on appspc. How did it change?
Part I
What would be the predicted number of applications in Maryland in 2011 at an unemployment rate of 5%?
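One way to approach a prediction like this is sketched below on made-up data. The column names mirror the dataset, but the values, the three states, and the years are assumptions for illustration. The prediction is the intercept, plus the relevant state and year dummy coefficients, plus the slope times the chosen unemployment rate.

```r
set.seed(2)

# Toy panel with made-up values: 3 states observed in 3 years
toy <- expand.grid(state = c("MD", "VA", "DC"), year = 2009:2011,
                   stringsAsFactors = FALSE)
toy$unemployrate <- runif(nrow(toy), 3, 10)
toy$appspc <- 50 + 2 * toy$unemployrate + rnorm(nrow(toy))

# State and year fixed effects via the dummy variable method
fit <- lm(appspc ~ unemployrate + factor(state) + factor(year), data = toy)

# Prediction with predict()
new_obs <- data.frame(state = "MD", year = 2011, unemployrate = 5)
pred <- predict(fit, newdata = new_obs)

# The same prediction assembled by hand from the coefficients
# (DC and 2009 are the omitted reference categories in this toy data)
manual <- sum(coef(fit)[c("(Intercept)", "factor(state)MD", "factor(year)2011")]) +
  5 * coef(fit)["unemployrate"]

c(predict = unname(pred), manual = unname(manual))
```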
Part J
Now try using the feols() command, which de-means the data, and make sure you get the same results as Part H. Do you get the same marginal effect of unemployrate on appspc?
Part K
Can there still be endogeneity in this model? Give some examples.
Part L
Create a nice regression table (using modelsummary) for comparison of the regressions in Parts D, G, and J.
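A minimal sketch of a side-by-side table follows, using two models fit on simulated data. The model labels and variable names are hypothetical; in practice the list would hold your own fitted models.

```r
library(modelsummary)

set.seed(3)
toy <- data.frame(x = rnorm(60), g = rep(letters[1:6], each = 10))
toy$y <- 1 + 0.5 * toy$x + rnorm(60)

# A named list gives each column of the table a header
models <- list(
  "Pooled"        = lm(y ~ x, data = toy),
  "Group dummies" = lm(y ~ x + factor(g), data = toy)
)

# coef_omit takes a regular expression; here it hides the group dummies
modelsummary(models, coef_omit = "factor", stars = TRUE)
```

Hiding the dummy coefficients keeps the table focused on the slope being compared across specifications.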
Question 5 (Difference-in-Differences)
Download and read in the TexasSchools.csv dataset.
Are teachers paid more when school board members are elected “off-cycle,” in years without major national political elections (e.g. odd years), than when they are elected “on-cycle”? The argument is that during “off” years, without state or national elections drawing attention, voters pay less attention to the school board election, and teachers can more effectively mobilize for higher pay, whereas in “on” years voters pay more attention. This data comes from Anzia, Sarah (2012), “The Election Timing Effect: Evidence from a Policy Intervention in Texas.” Quarterly Journal of Political Science 7(3): 277-297, and follows 1,020 Texas school board districts from 2003–2009.
From 2003–2006, all districts elected their school board members off-cycle. A change in Texas policy in 2006 led some, but not all, districts to elect their school board members on-cycle from 2007 onwards.
Variable | Description |
---|---|
LnAvgSalary | logged average salary of teachers in district |
Year | Year |
OnCycle | =1 if school boards elected on-cycle (e.g. same year as national and state elections), =0 if elected off-cycle |
pol_freedom | Political freedom index score (2018) from 1 (least) to 10 (most free) |
CycleSwitch | =1 if district switched from off- to on-cycle elections |
AfterSwitch | =1 if year is after 2006 |
Part A
Run a pooled regression model of LnAvgSalary on OnCycle. Write the estimated regression equation, and interpret the coefficient on OnCycle.
Are there any sources of bias (consider in particular the argument in the question prompt)?
Part B
Some school districts decided to switch to on-cycle elections after 2006. Consider this variable, CycleSwitch, the “treatment.” Create a variable to indicate post-treatment years (i.e. years after 2006). Call it After. Create a second, interaction variable to capture the interaction effect between those districts that switched, and after the treatment.
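A sketch of building both variables with mutate() is shown below on a small made-up tibble. The interaction name `SwitchXAfter` is a hypothetical choice, not something the assignment specifies.

```r
library(dplyr)

# Toy stand-in for the district panel (values are made up)
toy <- tibble(
  Year        = rep(2005:2008, times = 2),
  CycleSwitch = rep(c(0, 1), each = 4)
)

toy <- toy %>%
  mutate(After        = if_else(Year > 2006, 1, 0),  # post-treatment indicator
         SwitchXAfter = CycleSwitch * After)         # hypothetical interaction name
toy
```

The interaction equals 1 only for districts that switched, in years after 2006.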
Part C
Now estimate a difference-in-difference model with your variables in Part B: CycleSwitch is the treatment variable, After is your post-treatment indicator, and add an interaction variable to capture the interaction effect between those districts that switched, and after the treatment. Write down the estimated regression equation (to four decimal places).
Part D
Interpret what each coefficient means from Part C.
Part E
Using your regression equation in Part C, calculate the expected logged average salary \((Y)\) for districts in Texas:
- Before the switch that did not switch
- After the switch that did not switch
- Before the switch that did switch
- After the switch that did switch
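The four cases in this list map onto the coefficients of a difference-in-difference regression. The notation \(\hat{\beta}_0\) through \(\hat{\beta}_3\) is assumed here for illustration and is not taken from the assignment:

\[
\widehat{\text{LnAvgSalary}}_{it} = \hat{\beta}_0 + \hat{\beta}_1 \text{CycleSwitch}_i + \hat{\beta}_2 \text{After}_t + \hat{\beta}_3 (\text{CycleSwitch}_i \times \text{After}_t)
\]

\[
\begin{aligned}
\text{Did not switch, before:} &\quad \hat{\beta}_0 \\
\text{Did not switch, after:} &\quad \hat{\beta}_0 + \hat{\beta}_2 \\
\text{Switched, before:} &\quad \hat{\beta}_0 + \hat{\beta}_1 \\
\text{Switched, after:} &\quad \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3
\end{aligned}
\]

Subtracting the change for non-switchers \((\hat{\beta}_2)\) from the change for switchers \((\hat{\beta}_2 + \hat{\beta}_3)\) leaves \(\hat{\beta}_3\), the difference-in-difference estimate.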
Part F
Confirm your estimates in Part E by finding the mean logged average salary for each of those four groups in the data. (Hint: filter() properly, then summarize().)
Part G
Write out the difference-in-difference equation, and calculate the difference-in-difference. Make sure it matches your estimate from the regression.
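The match between group means and the regression estimate can be checked on simulated data, as in the base R sketch below. All values and variable names (`treated`, `after`) are made up for illustration. Because the model with both dummies and their interaction is saturated, the interaction coefficient equals the difference of differences in the four cell means.

```r
set.seed(4)
n <- 400
treated <- rbinom(n, 1, 0.5)   # hypothetical treatment group indicator
after   <- rbinom(n, 1, 0.5)   # hypothetical post-period indicator
y <- 10 + 0.3 * treated + 0.2 * after - 0.1 * treated * after +
  rnorm(n, sd = 0.1)

# Mean outcome in one of the four group-by-period cells
cell_mean <- function(tr, af) mean(y[treated == tr & after == af])

# (change for treated) minus (change for control)
did_means <- (cell_mean(1, 1) - cell_mean(1, 0)) -
  (cell_mean(0, 1) - cell_mean(0, 0))

# Interaction coefficient from the regression
did_reg <- coef(lm(y ~ treated * after))["treated:after"]

c(from_means = did_means, from_regression = unname(did_reg))
```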
Part H
Can we say anything about the types of districts that switched? Can we say anything about all salaries in the districts in the years after the switch?
Part I
Now let’s generalize the diff-in-diff model. Instead of the treatment and post-treatment dummies, use district and year fixed effects along with the interaction term. Interpret the coefficient on the interaction term.
This is doable with the dummy variable method, but there will be a lot of dummies! I suggest using feols() from the fixest package.
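As a sketch of the syntax, the snippet below fits a two-way fixed effects model on a simulated balanced panel and compares it with the dummy variable version. The data and the column names (`district`, `treat_post`, and so on) are assumptions for illustration, not the real dataset.

```r
library(fixest)

set.seed(5)
# Simulated balanced panel: 30 hypothetical districts over 2003-2009
toy <- expand.grid(district = 1:30, year = 2003:2009)
toy$switcher   <- as.numeric(toy$district <= 10)
toy$after      <- as.numeric(toy$year > 2006)
toy$treat_post <- toy$switcher * toy$after
toy$y <- 10 + 0.01 * toy$district + 0.02 * (toy$year - 2003) -
  0.05 * toy$treat_post + rnorm(nrow(toy), sd = 0.05)

# Fixed effects go after the | in the feols() formula
fe_fit <- feols(y ~ treat_post | district + year, data = toy)

# Equivalent dummy variable regression (36 extra dummy coefficients)
dummy_fit <- lm(y ~ treat_post + factor(district) + factor(year), data = toy)

c(feols = unname(coef(fe_fit)["treat_post"]),
  lm    = unname(coef(dummy_fit)["treat_post"]))
```

The treatment and post-treatment dummies drop out because they are absorbed by the district and year fixed effects, leaving only the interaction term.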
Part J
Create a nice regression table (using modelsummary) for comparison of the regressions in Parts A, C, and I.
Question 6 (Instrumental Variables)
Download and read in the RainIV.csv dataset.
Does economic growth reduce the odds of civil conflict? Consider data on 41 African countries between 1981 and 1999. Below are key variables (among others in the data):
Variable | Description |
---|---|
Internalconflict | =1 if civil war (>25 deaths), else =0 |
LaggedGDPGrowth | GDP Growth rate from previous year |
LaggedRainfallGrowth | Change (in mm) of rain from previous year |
InitialGDP | GDP per capita (1979) |
Democracy | Polity II score |
country_name | Country name |
country_code | Three letter ISO country code |
year | Year |
pop | Population |
Mountains | Percentage of territory that is mountainous terrain |
EthnicFrac | Ethnic fractionalization index (probability two random individuals are NOT from the same ethnic group) |
ReligiousFrac | Religious fractionalization index (probability two random individuals are NOT from the same religion) |
Part A
Make a scatterplot between Internalconflict (as y) and LaggedGDPGrowth (as x), and add a regression line.
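A minimal ggplot2 sketch of a scatterplot with a fitted line for a binary outcome is given below. The data are simulated, and the column names `growth` and `conflict` are hypothetical stand-ins for the real variables.

```r
library(ggplot2)

set.seed(6)
# Simulated data: a binary outcome that is less likely when growth is high
toy <- data.frame(growth = rnorm(200, sd = 0.05))
toy$conflict <- rbinom(200, 1, plogis(-1 - 5 * toy$growth))

p <- ggplot(toy, aes(x = growth, y = conflict)) +
  geom_point(alpha = 0.4) +                  # points sit at y = 0 or y = 1
  geom_smooth(method = "lm", se = FALSE)     # OLS (linear probability) line
p
```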
Part B
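Run a regression of Internalconflict on LaggedGDPGrowth. Note that since \(Y\) is a dummy variable, this is a linear probability model: the fitted value is the predicted probability of internal conflict in country \(i\) at time \(t\), and \(\beta_1\) is the change in that probability associated with a one-unit increase in LaggedGDPGrowth.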
Part C
Run another regression of Internalconflict on LaggedGDPGrowth, adding in InitialGDP, Democracy, Mountains, EthnicFrac, and ReligiousFrac as controls. What happens to the marginal effect of LaggedGDPGrowth?
Part D
Now let’s consider LaggedRainfallGrowth as a potential instrument for LaggedGDPGrowth. First, check the correlations between InternalConflict, LaggedGDPGrowth and LaggedRainfallGrowth.
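A correlation matrix for three variables can be obtained in base R as sketched here. The data are simulated and the short names `y`, `x`, and `z` are hypothetical stand-ins for the outcome, the endogenous regressor, and the instrument.

```r
set.seed(7)
toy <- data.frame(z = rnorm(100))           # stand-in for the instrument
toy$x <- 0.5 * toy$z + rnorm(100)           # stand-in for the regressor
toy$y <- -0.3 * toy$x + rnorm(100)          # stand-in for the outcome
toy$y[3] <- NA                              # real data often has missing values

# use = "pairwise.complete.obs" drops missing values pair by pair
cor_mat <- cor(toy[, c("y", "x", "z")], use = "pairwise.complete.obs")
round(cor_mat, 3)
```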
Part E
Now let’s consider if LaggedRainfallGrowth is a relevant instrument (the “inclusion condition”) by running the first stage regression of LaggedGDPGrowth on LaggedRainfallGrowth and the rest of the controls from Part C. Is the OLS estimate on LaggedRainfallGrowth statistically significant?
Part F
There is no statistical test for whether LaggedRainfallGrowth is exogenous (the “exclusion condition”). Do you think it is plausible that LaggedRainfallGrowth is uncorrelated with the error term; that is, with any other factors that cause conflict; and that it only affects conflict through economic growth?
Part G
Extract and save the fitted values of your first stage regression from Part E. Now run the second stage regression by using the fitted values of LaggedGDPGrowth and the control variables from Parts C and E. What is the marginal effect of LaggedGDPGrowth on InternalConflict?
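The two-step procedure can be sketched in base R on simulated data, where the true effect is known. Everything here is made up for illustration: `z` plays the role of the instrument, `x` the endogenous regressor, `w` a control, and `u` an unobserved confounder.

```r
set.seed(8)
n <- 2000
w <- rnorm(n)                              # exogenous control
z <- rnorm(n)                              # instrument
u <- rnorm(n)                              # unobserved confounder
x <- 0.8 * z + 0.5 * w + u + rnorm(n)      # endogenous regressor
y <- 1 - 0.5 * x + 0.3 * w + 2 * u + rnorm(n)   # true effect of x is -0.5

# First stage: regress x on the instrument and the control
first_stage <- lm(x ~ z + w)
x_hat <- fitted(first_stage)

# Second stage: replace x with its fitted values, keep the same control
second_stage <- lm(y ~ x_hat + w)

# Naive OLS is biased by u; the two-stage estimate uses only variation from z
c(ols  = unname(coef(lm(y ~ x + w))["x"]),
  tsls = unname(coef(second_stage)["x_hat"]))
```

The standard errors reported by the manual second stage are not correct, because lm() treats the fitted values as data rather than estimates. A dedicated IV routine handles this adjustment.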
Part H
Now run the same regression using feols() from the fixest package.
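In fixest, an instrumental variables regression is written by adding a second formula after a `|`, of the form endogenous ~ instrument. The sketch below uses simulated data with hypothetical names (`x` endogenous, `z` instrument, `w` control) and compares the result with a manual two-stage estimate.

```r
library(fixest)

set.seed(9)
n <- 2000
w <- rnorm(n)                              # exogenous control
z <- rnorm(n)                              # instrument
u <- rnorm(n)                              # unobserved confounder
x <- 0.8 * z + 0.5 * w + u + rnorm(n)      # endogenous regressor
y <- 1 - 0.5 * x + 0.3 * w + 2 * u + rnorm(n)
toy <- data.frame(y, x, z, w)

# Outcome and controls on the left of |, then endogenous ~ instrument
iv_fit <- feols(y ~ w | x ~ z, data = toy)

# Manual two-stage estimate for comparison
x_hat  <- fitted(lm(x ~ z + w, data = toy))
manual <- coef(lm(toy$y ~ x_hat + toy$w))["x_hat"]

# fixest labels the instrumented coefficient with a "fit_" prefix
c(feols = unname(coef(iv_fit)["fit_x"]), manual = unname(manual))
```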
Part I
Make a nice regression table with modelsummary(), using your results from Parts B, G, and H.