Problem Set 6

Warning

This assignment is not graded, and is only for practice for the final exam.

Please read the instructions for completing and submitting homeworks.

Theory and Concepts

Question 1

In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.

Question 2

In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?

Question 3

In your own words, describe the logic of an instrumental variables model: how does this approach identify the causal effect of $X$ on $Y$? What makes for a valid instrument? When using a Two-Stage Least Squares approach, what is estimated at each stage?

R Questions

Answer the following questions using R. Where a written answer is needed, put it in the same document as your answers to the questions above (rendered to html or pdf, typed .doc(x), or handwritten). Be sure to include your code and its output with the rest of your answers: email or print an .R file, or show them in your rendered Quarto document.

Question 4 (Fixed Effects)

PeaceCorps.csv

Download and read in the PeaceCorps.csv dataset.

How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:

Variable	Description
`state`	U.S. State
`year`	Year
`appspc`	Applications to the Peace Corps (per capita) in State
`unemployrate`	State unemployment rate

Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?

Part A

Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.

Part B

To get the hang of the data we’re working with, count (separately) the number of states and the number of years. Get the counts with n_distinct() inside the summarize() command, then list the unique values of each variable with distinct() (don’t use summarize() for this part).
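
As a minimal sketch of the two counting approaches, here they are on a small made-up tibble. The name `toy` and its values are assumptions for illustration only, not the PeaceCorps data.

```r
library(dplyr)

# Hypothetical toy panel: 3 states observed in 2 years (not the real data)
toy <- tibble(
  state = rep(c("MD", "VA", "DC"), each = 2),
  year  = rep(c(2010, 2011), times = 3)
)

# Count the unique values of each variable inside summarize()
counts <- toy %>%
  summarize(
    n_states = n_distinct(state),
    n_years  = n_distinct(year)
  )
counts

# List the unique values themselves, without summarize()
toy %>% distinct(state)
toy %>% distinct(year)
```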

Part C

Create a scatterplot of appspc (Y) on unemployrate (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.

Part D

Run two pooled regressions, one with the outlier State and one without it. Write out the estimated regression equation for each. Interpret the coefficient, and comment on how it changes between the two regressions.

Part E

Now run a regression with State fixed effects using the dummy variable method. (Ensure that state is a factor variable, and include it in the regression. You can either mutate() it into a factor beforehand, or just use factor(state) in the lm command.) Interpret the marginal effect of unemployrate on appspc. How did it change?
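
The dummy variable method can be sketched on simulated data as follows. Everything in `toy` (the group labels, the intercepts, and the true slope of 2) is an assumption made up for illustration, not the PeaceCorps data.

```r
set.seed(1)

# Hypothetical simulated panel: 4 groups, 10 periods each (not the real data)
n_periods <- 10
toy <- data.frame(
  state = rep(c("A", "B", "C", "D"), each = n_periods),
  x     = rnorm(4 * n_periods)
)

# Each group gets its own intercept; the true slope on x is 2
state_effect <- c(A = 0, B = 5, C = -3, D = 10)
toy$y <- unname(state_effect[toy$state]) + 2 * toy$x +
  rnorm(nrow(toy), sd = 0.5)

# Pooled regression ignores the group intercepts
pooled <- lm(y ~ x, data = toy)

# Dummy variable method: factor() expands state into one dummy per group,
# omitting the first level ("A") as the reference category
lsdv <- lm(y ~ x + factor(state), data = toy)

coef(pooled)["x"]
coef(lsdv)  # intercept is group A; the dummies are differences from A
```

Because one group is the omitted reference category, each dummy coefficient is read relative to that group, not as the group's own level.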

Part F

Find the coefficient for Maryland and interpret it. How many applications per capita does Maryland have?

Part G

Now try using the feols() command (from the fixest package), which de-means the data, and make sure you get the same results as Part E. Do you get the same marginal effect of unemployrate on appspc?
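
To see why de-meaning should match the dummy variable method, here is a sketch on simulated data using base R. The data frame `toy` and its values are assumptions for illustration. The feols() call at the end runs only if fixest is installed.

```r
set.seed(2)

# Hypothetical simulated panel (not the PeaceCorps data)
toy <- data.frame(
  state = rep(c("A", "B", "C", "D"), each = 10),
  x     = rnorm(40)
)
toy$y <- rep(c(0, 5, -3, 10), each = 10) + 2 * toy$x + rnorm(40, sd = 0.5)

# Dummy variable method, for reference
lsdv <- lm(y ~ x + factor(state), data = toy)

# De-mean y and x within each state: ave() returns the group mean per row
toy$y_dm <- toy$y - ave(toy$y, toy$state)
toy$x_dm <- toy$x - ave(toy$x, toy$state)

# Regress de-meaned y on de-meaned x; no intercept is needed
demeaned <- lm(y_dm ~ 0 + x_dm, data = toy)

# The two slopes agree up to floating-point error
all.equal(unname(coef(lsdv)["x"]), unname(coef(demeaned)["x_dm"]))

# With fixest, the fixed effects go after the "|" in the formula
if (requireNamespace("fixest", quietly = TRUE)) {
  fe <- fixest::feols(y ~ x | state, data = toy)
  print(coef(fe)["x"])
}
```

The slopes match, but the standard errors from the hand de-meaned lm() do not account for the degrees of freedom used up by the group means, so they should not be reported as-is.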

Part H

Now also include year fixed effects in your regression, using the dummy variable method. Interpret the marginal effect of unemployrate on appspc. How did it change?

Part I

What would be the predicted number of applications in Maryland in 2011 at an unemployment rate of 5%?

Part J

Now try using the feols() command, which de-means the data, and make sure you get the same results as Part H. Do you get the same marginal effect of unemployrate on appspc?

Part K

Can there still be endogeneity in this model? Give some examples.

Part L

Create a nice regression table (using modelsummary) for comparison of the regressions in Parts D, G, and J.
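
A minimal sketch of a side-by-side table, assuming the modelsummary package is installed. The models here are fit on simulated data, and the column labels are placeholders, not the regressions from this question.

```r
library(modelsummary)

set.seed(3)

# Hypothetical simulated panel (not the PeaceCorps data)
toy <- data.frame(
  state = rep(c("A", "B", "C", "D"), each = 10),
  x     = rnorm(40)
)
toy$y <- rep(c(0, 5, -3, 10), each = 10) + 2 * toy$x + rnorm(40, sd = 0.5)

# A named list sets the column headers of the table
models <- list(
  "Pooled"   = lm(y ~ x, data = toy),
  "State FE" = lm(y ~ x + factor(state), data = toy)
)

# coef_omit takes a regular expression; here it hides the state dummies.
# output = "data.frame" returns the table as a data frame for inspection.
tab <- modelsummary(models, coef_omit = "factor", output = "data.frame")
tab
```

Dropping the `output` argument lets modelsummary choose a format suited to the rendered document.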

Question 5 (Difference-in-Differences)

TexasSchools.csv

Download and read in the TexasSchools.csv dataset.

Are teachers paid more when school board members are elected “off cycle,” in years without major national political elections (e.g. odd years), than when they are elected “on cycle”? The argument is that in “off” years, with no state or national races drawing attention, voters pay less attention to the school board election and teachers can mobilize more effectively for higher pay, whereas in “on” years voters pay more attention. This data comes from Anzia, Sarah (2012), “The Election Timing Effect: Evidence from a Policy Intervention in Texas.” Quarterly Journal of Political Science 7(3): 277-297, and follows 1,020 Texas school board districts from 2003–2009.

From 2003–2006, all districts elected their school board members off-cycle. A change in Texas policy in 2006 led some, but not all, districts to elect their school board members on-cycle from 2007 onwards.

Variable	Description
`LnAvgSalary`	logged average salary of teachers in district
`Year`	Year
`OnCycle`	`=1` if school boards elected on-cycle (i.e. same year as national and state elections), `=0` if elected off-cycle
`pol_freedom`	Political freedom index score (2018) from 1 (least) to 10 (most free)
`CycleSwitch`	`=1` if district switched from off- to on-cycle elections
`AfterSwitch`	`=1` if year is after 2006

Part A

Run a pooled regression model of LnAvgSalary on OnCycle. Write the estimated regression equation, and interpret the coefficient on OnCycle. Are there any sources of bias (consider in particular the argument in the question prompt)?

Part B

Some districts decided to switch to on-cycle elections after 2006. Consider this variable, CycleSwitch, the “treatment.” Create a variable to indicate post-treatment years (i.e. years after 2006). Call it After. Create a second, interaction variable to capture the interaction effect between those districts that switched, and after the treatment.

Part C

Now estimate a difference-in-difference model with your variables in Part B: CycleSwitch is the treatment variable, After is your post-treatment indicator, and add an interaction variable to capture the interaction effect between those districts that switched, and after the treatment. Write down the estimated regression equation (to four decimal places).
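
In general notation, the difference-in-difference regression described above can be sketched as follows. The coefficient symbols $\beta_0$ to $\beta_3$ and the error term $u_{it}$ are notation assumed here, with $i$ indexing districts and $t$ indexing years. The estimated values come from your own regression output.

$$
\text{LnAvgSalary}_{it} = \beta_0 + \beta_1 \, \text{CycleSwitch}_{i} + \beta_2 \, \text{After}_{t} + \beta_3 \, (\text{CycleSwitch}_{i} \times \text{After}_{t}) + u_{it}
$$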

Part D

Interpret what each coefficient means from Part C.

Part E

Using your regression equation in Part C, calculate the expected logged average salary $(Y)$ for districts in Texas:

Before the switch that did not switch
After the switch that did not switch
Before the switch that did switch
After the switch that did switch

Part F

Confirm your estimates in Part E by finding the mean logged average salary for each of those four groups in the data. (Hint: filter() to each group, then summarize().)
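
The filter-then-summarize pattern, and why group means should agree with the regression, can be sketched on a simulated two-by-two design. The variables `treat`, `after`, and `y` and all values are assumptions for illustration, not the Texas data.

```r
library(dplyr)

set.seed(4)

# Hypothetical 2x2 design: 50 observations per cell (not the Texas data)
toy <- expand.grid(rep = 1:50, treat = c(0, 1), after = c(0, 1))
toy$y <- 10 + 0.3 * toy$treat + 0.1 * toy$after -
  0.2 * toy$treat * toy$after + rnorm(nrow(toy), sd = 0.05)

# Mean outcome for one cell: filter() to the group, then summarize()
treated_after <- toy %>%
  filter(treat == 1, after == 1) %>%
  summarize(mean_y = mean(y))

# Regression with both dummies and their interaction (treat * after
# expands to treat + after + treat:after)
did <- lm(y ~ treat * after, data = toy)
predicted <- predict(did, newdata = data.frame(treat = 1, after = 1))

# The model's prediction for that cell equals the cell mean
all.equal(treated_after$mean_y, unname(predicted))
```

With two dummies and their interaction the model has one parameter per cell, which is why its predictions reproduce the four group means exactly.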

Part G

Write out the difference-in-difference equation, and calculate the difference-in-difference. Make sure it matches your estimate from the regression.

Part H

Can we say anything about the types of districts that switched? Can we say anything about all salaries in the districts in the years after the switch?

Part I

Now let’s generalize the diff-in-diff model. Instead of the treatment and post-treatment dummies, use district and year fixed effects and the interaction term. Interpret the coefficient on the interaction term.

This is doable with the dummy variable method, but there will be a lot of dummies! I suggest using feols() from the fixest package.
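
One way to write the generalized model, as a sketch: the symbols $\alpha_i$ (district fixed effects), $\theta_t$ (year fixed effects), $\beta$, and $u_{it}$ are notation assumed here rather than taken from the assignment.

$$
\text{LnAvgSalary}_{it} = \beta \, (\text{CycleSwitch}_{i} \times \text{After}_{t}) + \alpha_i + \theta_t + u_{it}
$$

The separate CycleSwitch and After terms drop out because the district fixed effects absorb anything constant within a district and the year fixed effects absorb anything common to all districts in a year.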

Part J

Create a nice regression table (using modelsummary) for comparison of the regressions in Parts A, C, and I.

Question 6 (Instrumental Variables)

RainIV.csv

Download and read in the RainIV.csv dataset.

Does economic growth reduce the odds of civil conflict? Consider data on 41 African countries from 1981 to 1999. Below are key variables (among others in the data):

Variable	Description
`Internalconflict`	`=1` if civil war (>25 deaths), else `=0`
`LaggedGDPGrowth`	GDP Growth rate from previous year
`LaggedRainfallGrowth`	Change (in mm) of rain from previous year
`InitialGDP`	GDP per capita (1979)
`Democracy`	Polity II score
`country_name`	Country name
`country_code`	Three letter ISO country code
`year`	Year
`pop`	Population
`Mountains`	Percentage of territory that is mountainous terrain
`EthnicFrac`	Ethnic fractionalization index (probability two random individuals are NOT from the same ethnic group)
`ReligiousFrac`	Religious fractionalization index (probability two random individuals are NOT from the same religion)

Part A

Make a scatterplot between Internalconflict (as y) and LaggedGDPGrowth (as x), and add a regression line.

Part B

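Run a regression of Internalconflict on LaggedGDPGrowth. Note that since $Y$ is a dummy variable, this is a linear probability model: the fitted value $\hat{Y}$ is the predicted probability of internal conflict in country $i$ at time $t$, and $\beta_{1}$ is the change in that probability associated with a one-unit change in LaggedGDPGrowth.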

Part C

Run another regression, adding InitialGDP, Democracy, Mountains, EthnicFrac, and ReligiousFrac as controls. What happens to the marginal effect of LaggedGDPGrowth?

Part D

Now let’s consider LaggedRainfallGrowth as a potential instrument for LaggedGDPGrowth. First, check the correlations between InternalConflict, LaggedGDPGrowth and LaggedRainfallGrowth.

Part E

Now let’s consider if LaggedRainfallGrowth is a relevant instrument (the “inclusion condition”) by running the first stage regression of LaggedGDPGrowth on LaggedRainfallGrowth and the rest of the controls from part C. Is the OLS estimate on LaggedRainfallGrowth statistically significant?

Part F

There is no statistical test for whether LaggedRainfallGrowth is exogenous (the “exclusion condition”). Do you think it is plausible that LaggedRainfallGrowth is uncorrelated with the error term (that is, with any other factors that cause conflict) and that it only affects conflict through economic growth?

Part G

Extract and save the fitted values of your first stage regression from part E. Now run the second stage regression of InternalConflict on the fitted values of LaggedGDPGrowth and the control variables from parts C and E. What is the marginal effect of LaggedGDPGrowth on InternalConflict?
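
The two stages can be sketched by hand on simulated data. All variable names here (`z` for the instrument, `w` for a control, `e` for an unobserved confounder) and the true effect of 2 are assumptions for illustration, not RainIV.csv.

```r
set.seed(5)
n <- 2000

# Hypothetical simulated data (not RainIV.csv):
# z is the instrument, w an exogenous control, e an unobserved confounder
z <- rnorm(n)
w <- rnorm(n)
e <- rnorm(n)
x <- 0.8 * z + 0.5 * w + e + rnorm(n)        # endogenous: depends on e
y <- 1 + 2 * x + 0.5 * w - 3 * e + rnorm(n)  # true effect of x is 2
toy <- data.frame(y, x, z, w)

# OLS is biased because x is correlated with the omitted e
ols <- lm(y ~ x + w, data = toy)

# First stage: endogenous variable on the instrument and the controls
first_stage <- lm(x ~ z + w, data = toy)
toy$x_hat <- fitted(first_stage)

# Second stage: outcome on the fitted values and the same controls
second_stage <- lm(y ~ x_hat + w, data = toy)

coef(ols)["x"]
coef(second_stage)["x_hat"]
```

The point estimate from the hand-run second stage is the 2SLS estimate, but its reported standard errors treat the fitted values as if they were observed data, so they are not the correct 2SLS standard errors.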

Part H

Now run the same regression using feols() from the fixest package.
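
As a sketch of the formula layout feols() uses for instrumental variables, assuming the fixest package is installed. The simulated variables `y`, `x`, `z`, and `w` are made up for illustration, not RainIV.csv.

```r
library(fixest)

set.seed(5)
n <- 2000

# Hypothetical simulated data (not RainIV.csv):
# z is the instrument, w an exogenous control, e an unobserved confounder
z <- rnorm(n)
w <- rnorm(n)
e <- rnorm(n)
x <- 0.8 * z + 0.5 * w + e + rnorm(n)
y <- 1 + 2 * x + 0.5 * w - 3 * e + rnorm(n)  # true effect of x is 2
toy <- data.frame(y, x, z, w)

# Formula layout: outcome ~ exogenous controls | endogenous ~ instrument
iv <- feols(y ~ w | x ~ z, data = toy)

# The second-stage coefficient on the endogenous variable is named "fit_x"
coef(iv)["fit_x"]
```

Unlike a second stage run by hand with lm(), feols() estimates both stages together and reports standard errors appropriate for 2SLS.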

Part I

Make a nice regression table with modelsummary() from your results in Parts B, G, and H.