reg_robust <- lm_robust(y ~ x, data = the_data, # change y, x, and data names to yours
                        se_type = "stata") # we'll use this method to calculate
Problem Set 3
This assignment is due by Friday, October 14 on Blackboard Assignments.
Please read the instructions for completing and submitting homeworks.
The PDF is useful if you want to print out the problem set and write on it. The R Project is a .zip file which contains a .qmd file to write answers in, and the data, all in a logical working directory. (See this resource for help unzipping files.) You can also just write an .R file in the project if you don’t want to use markdown. If you use the cloud project, I have already installed tidyverse and tinytex (to produce PDFs).
Answers
Theory and Concepts
Question 1
In your own words, describe what exogeneity and endogeneity mean, and how they are related to bias in our regression. What things can we learn about the bias if we know
Question 2
In your own words, describe what
Question 3
In your own words, describe what the standard error of the regression (
Question 4
In your own words, describe what homoskedasticity and heteroskedasticity mean: both in ordinary English, and in terms of the graph of the OLS regression line.
Question 5
In your own words, describe what the variation in
Question 6
In your own words, describe what a p-value means, and how it is used to establish statistical significance.
Question 7
A researcher is interested in examining the impact of illegal music downloads on commercial music sales. The author collects data on commercial sales of the top 500 singles from 2017 (Y) and the number of downloads from a web site that allows ‘file sharing’ (X). The author estimates the following model:
The author finds a large, positive, and statistically significant estimate of
Theory Problems
For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”
Question 8
A researcher wants to estimate the relationship between average weekly earnings (
Part A
Interpret what
Part B
Interpret what
Part C
The
Part D
The
Part E
Suppose Maria is 20 years old. What is her predicted
Part F
Suppose the data shows her actual
Part G
What does the error term,
Part H
Do you think that
Question 9
Suppose a researcher is interested in estimating a simple linear regression model:
In a sample of 48 observations, she generates the following descriptive statistics:
Part A
What is the OLS estimate of
Part B
What is the OLS estimate of
Part C
Suppose the OLS estimate of
Part D
Calculate the
R Questions
Answer the following questions using R. When necessary, please write answers in the same document (rendered to HTML or PDF, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include your code and its output with the rest of your answers (email or print an .R file, or show both in your rendered Quarto document).
Question 10
Download the MLBattend dataset. This dataset contains attendance at major league baseball games for all 32 MLB teams from the 1970s-2000. We want to answer the following question:
“How big is home-field advantage in baseball? Does a team with higher attendance at home games over their season score more runs over their season?”
Part A
Clean up the data a bit by mutate()-ing a variable to measure home attendance in millions. This will make it easier to interpret your regression later on.
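As a rough illustration of this kind of rescaling, here is a minimal sketch on a made-up tibble rather than the MLBattend data. The object name `toy`, the column names `attendance` and `attendance_mil`, and the values are all assumptions for illustration; substitute the names from your own data.

```r
library(dplyr)

# Hypothetical toy data: raw attendance counts (not real MLBattend values)
toy <- tibble(
  team = c("A", "B", "C"),
  attendance = c(1500000, 2250000, 980000)
)

# Add a new column that rescales attendance to millions
toy <- toy %>%
  mutate(attendance_mil = attendance / 1e6)

toy
```

With the variable in millions, a regression slope reads as the change in the outcome per additional million attendees rather than per single attendee.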
Part B
Get the correlation between Runs Scored and Home Attendance.
Part C
Plot a scatterplot of Runs Scored (y) on Home Attendance (x). Add a regression line.
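The correlation and scatterplot steps might look like the following sketch. It uses R's built-in mtcars data as a stand-in (an assumption for illustration), with `mpg` playing the role of y and `wt` the role of x; swap in your own data frame and variable names.

```r
library(ggplot2)

# Correlation between two variables (stand-in data: built-in mtcars)
r <- cor(mtcars$mpg, mtcars$wt)
r

# Scatterplot of mpg (y) on wt (x) with an OLS regression line
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # se = FALSE hides the confidence band
p
```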
Part D
We want to estimate a regression of Runs Scored on Home Attendance:
Run this regression in R.
What are
Hint: make sure to save your regression model as an object, and get a summary() of it. This object will be needed later.
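A minimal sketch of saving a regression as an object and summarizing it, again using the built-in mtcars data as a stand-in (the object name `reg` and the variables are assumptions, not part of the assignment):

```r
# Fit a simple regression and save it as an object
reg <- lm(mpg ~ wt, data = mtcars)

# Full summary: coefficients, standard errors, R-squared, and more
summary(reg)

# Just the estimated intercept and slope
coef(reg)
```

Saving the model as an object matters because later diagnostic functions take the fitted model, not the formula, as their input.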
Part E
Write out the estimated regression equation.
Part F
Make a regression table of the output using modelsummary().
Part G
Check the goodness of fit statistics. What is the
Part H
Now let’s start running some diagnostics of the regression. Make a histogram of the residuals. Do they look roughly normal?
Hint: you will need to use the broom package’s augment() command on your saved regression object to add a column containing the residuals (.resid), and save this as a new object - to be your data source for the plot in this part and the next part.
Part I
Make a residual plot.
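The augment() workflow for the residual histogram and the residual plot can be sketched as follows. The data (built-in mtcars) and the object names `reg` and `reg_aug` are stand-ins chosen for illustration.

```r
library(broom)
library(ggplot2)

reg <- lm(mpg ~ wt, data = mtcars)

# augment() returns the model data plus fitted values (.fitted) and residuals (.resid)
reg_aug <- augment(reg)

# Histogram of the residuals
resid_hist <- ggplot(reg_aug, aes(x = .resid)) +
  geom_histogram(bins = 10, color = "white")
resid_hist

# Residual plot: residuals against x, with a reference line at zero
resid_plot <- ggplot(reg_aug, aes(x = wt, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
resid_plot
```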
Part J
Test the regression for heteroskedasticity. Are the errors homoskedastic or heteroskedastic?
Hint: use the lmtest package’s bptest() command on your saved regression object.
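A sketch of the Breusch-Pagan test on a stand-in model (built-in mtcars; the object names are assumptions). The null hypothesis of this test is homoskedasticity, so a small p-value is evidence of heteroskedasticity.

```r
library(lmtest)

reg <- lm(mpg ~ wt, data = mtcars)

# Breusch-Pagan test; H0: errors are homoskedastic
bp <- bptest(reg)
bp

# The p-value can be pulled out directly
bp$p.value
```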
Run another regression using robust standard errors. Hint: use the estimatr package’s lm_robust() command and save the output like the following:
Now make another regression output table with modelsummary, with one column using regular standard errors (just use your original saved regression object) and another using robust standard errors (use this new saved object).
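The side-by-side table could be built along these lines. This is a sketch on the built-in mtcars data; the column labels "OLS" and "Robust SEs" are arbitrary names chosen here. The point estimates are identical across the two columns, and only the standard errors differ.

```r
library(estimatr)
library(modelsummary)

# Regular OLS and the same model with heteroskedasticity-robust standard errors
reg <- lm(mpg ~ wt, data = mtcars)
reg_robust <- lm_robust(mpg ~ wt, data = mtcars, se_type = "stata")

# A named list gives one labelled column per model
modelsummary(list("OLS" = reg, "Robust SEs" = reg_robust))
```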
Part K
Test the data for outliers. If there are any, identify which team(s) and season(s) are outliers. Hint: use the car package’s outlierTest() command on your saved regression object.
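A sketch of the outlier test on a stand-in model (built-in mtcars; object names are assumptions). The output is labelled by the row name or row number of the observation, which you can then look up in your data frame to identify it.

```r
library(car)

reg <- lm(mpg ~ wt, data = mtcars)

# Bonferroni-adjusted test on the studentized residuals
out <- outlierTest(reg)
out
```

If no observation is significant after the Bonferroni adjustment, outlierTest() still reports the observation with the largest studentized residual, so check the adjusted p-value before calling it an outlier.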
Part L
Look back at your regression results. What is the marginal effect of home attendance on runs scored? Is this statistically significant? Why or why not?
Part M
Now we’ll try out the infer package to understand the
First, save the (value of) our sample
our_slope <- 123 # replace "123" with whatever number you found for the slope in part D
Then, using the infer package run the following simulation:
# save our simulations as an object (I called it "sims")
sims <- data %>% # "data" here is whatever you named your dataframe!
  specify(y ~ x) %>% # replacing y and x with your variable names
  hypothesize(null = "independence") %>% # H_0 is that slope is 0, x and y are independent
  generate(reps = 1000,
           type = "permute") %>% # make 1000 samples assuming H_0 is true
  calculate(stat = "slope") # estimate slope in each sample

# look at it
sims

# calculate p value
sims %>%
  get_p_value(obs_stat = our_slope,
              direction = "both") # a two-sided H_a: slope =/= 0
Compare to the p-value in your original regression output in previous parts of this question.
Part N
Make a histogram of the simulated slopes, and plot our sample slope on that histogram, shading the p-value.
You can pipe sims into visualize(obs_stat = our_slope), or use ggplot2 to plot a histogram in the normal way, using sims as the data source and add a geom_vline(xintercept = our_slope) to show our finding on the distribution.
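Both plotting approaches can be sketched as follows, on the built-in mtcars data with stand-in names (`sims_demo`, `obs_slope`). One assumption to flag: in recent versions of infer, shading is done by adding shade_p_value() to visualize() rather than passing obs_stat to visualize() directly, so that form is used here.

```r
library(infer)
library(ggplot2)

set.seed(42) # make the permutations reproducible

# Observed slope from the stand-in regression
obs_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Simulate slopes under the null hypothesis of independence
sims_demo <- mtcars %>%
  specify(mpg ~ wt) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")

# Option 1: infer's built-in plot, shading the two-sided p-value
sims_demo %>%
  visualize() +
  shade_p_value(obs_stat = obs_slope, direction = "both")

# Option 2: plain ggplot2 histogram with a vertical line at the observed slope
ggplot(sims_demo, aes(x = stat)) +
  geom_histogram(bins = 30, color = "white") +
  geom_vline(xintercept = obs_slope, color = "red")
```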