Problem Set 3

Author

Answer Key

Published

October 14, 2022

# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test
library(infer) # for simulating inference

Theory and Concepts

Question 1

In your own words, describe what exogeneity and endogeneity mean, and how they are related to bias in our regression. What things can we learn about the bias if we know \(X\) is endogenous?

The OLS estimators \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are unbiased estimates of the true population parameters \(\beta_0\) and \(\beta_1\) if and only if \(X\) is exogenous. That is to say, if \(cor(X,u)=0\) (i.e. there is no correlation between \(X\) and any unobserved variable that affects \(Y\)), then \(E[\hat{\beta_1}]=\beta_1\).

If \(X\) is correlated with the error term, then \(X\) is endogenous. The true expected value of the OLS estimator is \[E[\hat{\beta_1}]=\beta_1+cor(X,u)\frac{\sigma_u}{\sigma_X}\]

The bias is \(\big(E[\hat{\beta_1}]-\beta_1\big)\), i.e. the difference between the average estimated sample slope and the ‘true’ population slope. First, we can determine the size of the bias from how large \(cor(X,u)\) is: the stronger the correlation, the larger the bias.

Second, we can determine the direction of the bias depending on the sign of \(cor(X,u)\).

  • If \(X\) and \(u\) are positively correlated (move in the same direction), we know that we have overstated the true effect of \(\Delta X\) on \(\Delta Y\), since a change in \(Y\) is picking up both a change in \(X\) and a further change (in the same direction as \(X\)) in the unobserved \(u\).

  • If the correlation is negative (they move in opposite directions), we know that we have understated the true effect of \(\Delta X\) on \(\Delta Y\), since a change in \(Y\) is picking up a change in \(X\) that is dampened by a change in \(u\) in the opposite direction.
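The bias formula can be checked with a small simulation. The sketch below is illustrative only: the data are simulated, the true slope of 2 is an arbitrary choice, and the variable `z` is a hypothetical unobserved factor that makes \(X\) and \(u\) positively correlated.

```r
# Simulated illustration of endogeneity bias (all numbers are made up)
set.seed(42)
n <- 10000
z <- rnorm(n)        # hypothetical unobserved factor driving both X and u
x <- z + rnorm(n)    # X is endogenous: it shares z with the error term
u <- z + rnorm(n)    # error term, positively correlated with X
y <- 1 + 2 * x + u   # true model: beta_0 = 1, beta_1 = 2

fit <- lm(y ~ x)
slope_hat <- unname(coef(fit)["x"])

# Sample version of the bias term cor(X,u) * sigma_u / sigma_X
bias_term <- cor(x, u) * sd(u) / sd(x)

slope_hat       # estimated slope, above the true value of 2
2 + bias_term   # true slope plus the bias term, equal to slope_hat
```

Because the correlation is positive here, the estimated slope overstates the true slope. Replacing `u <- z + rnorm(n)` with `u <- -z + rnorm(n)` should flip the sign of the bias.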

Question 2

In your own words, describe what \(R^2\) means. How do we calculate it, what does it tell us, and how do we interpret it?

The \(R^2\) is a measure of how well the OLS regression line “fits” our observed data points. It is the proportion of the total variation in \(Y\) (TSS) that is explained by the variation from our model (ESS):

\[\begin{align*} R^2&=\frac{ESS}{TSS}= \frac{\sum (\hat{Y_i}-\bar{Y})^2}{\sum(Y_i-\bar{Y})^2} \end{align*}\]

Equivalently, it can be found by subtracting the proportion of unexplained variation in \(Y\) (SSE/TSS) from 1:

\[R^2=1-\frac{SSE}{TSS}=1-\frac{\sum \hat{u_i}^2}{\sum (Y_i-\bar{Y})^2}\]

This is because \(\frac{SSE+ESS}{TSS}=1\). Finally, in a regression with a single \(X\), \(R^2\) is the square of the correlation coefficient between \(X\) and \(Y\): \(R^2=(r_{X,Y})^2\).

The closer \(R^2\) is to 1, the better the fit; the closer to 0, the poorer the fit.
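As a quick check that the three routes to \(R^2\) agree, here is a minimal sketch using R's built-in `mtcars` data. The choice of `mpg` regressed on `wt` is only an illustration and is not part of the problem set.

```r
# Three equivalent ways to get R^2 in a simple regression
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
y_hat <- fitted(fit)   # predicted values
u_hat <- resid(fit)    # residuals

tss <- sum((y - mean(y))^2)       # total variation in Y
ess <- sum((y_hat - mean(y))^2)   # variation explained by the model
sse <- sum(u_hat^2)               # unexplained variation

r2_explained   <- ess / tss
r2_unexplained <- 1 - sse / tss
r2_correlation <- cor(mtcars$wt, mtcars$mpg)^2

# All three should match the value reported by summary()
c(r2_explained, r2_unexplained, r2_correlation, summary(fit)$r.squared)
```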

Question 3

In your own words, describe what the standard error of the regression (\(SER\)) means. How do we calculate it, what does it tell us, and how do we interpret it?

SER (\(\hat{\sigma_u}\)) is the average size of the error (or residual), \(\hat{u_i}\), that is, the average distance from the regression line to the actual data value for \(Y\) at a given \(X\). The goal of OLS is to minimize this (well, technically minimize the sum of squared errors!).

\[\begin{align*} SER&=\sqrt{\frac{1}{n-2}\sum \hat{u_i}^2}\\ SER &=\sqrt{\frac{SSE}{n-2}}\\ \end{align*}\]

We calculate it by squaring the residuals (to get a positive distance) and taking the mean of them by adding them all up and dividing by \(n-2\), and then taking the square root to return to normal (non-squared) units.

We divide by \(n-2\) rather than by \(n\) due to the degrees of freedom correction for calculating two prior statistics with our data already, \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
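The calculation can be reproduced by hand in R. This is a sketch on the built-in `mtcars` data (an arbitrary example, not the problem set's data), compared against `sigma()`, which reports the residual standard error of a fitted `lm` object.

```r
# SER by hand: square the residuals, sum, divide by n - 2, take the square root
fit <- lm(mpg ~ wt, data = mtcars)
u_hat <- resid(fit)
n <- nobs(fit)

sse <- sum(u_hat^2)
ser_manual <- sqrt(sse / (n - 2))

ser_manual   # in the units of Y (miles per gallon here)
sigma(fit)   # R's built-in value, which should agree
```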

Question 4

In your own words, describe what homoskedasticity and heteroskedasticity mean: both in ordinary English, and in terms of the graph of the OLS regression line.

Homoskedasticity means the errors are distributed with the same variance for all levels of X. Knowing anything about X will not tell us anything about the distribution of errors at that level of X.

Heteroskedasticity means the errors are distributed differently for different levels of X. So, at different levels of X, there will be much more or much less variation in the residuals.
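To see the difference, one option is to simulate both cases and compare the spread of the residuals at low and high values of X. Everything below is simulated; the coefficients and error standard deviations are arbitrary assumptions chosen for illustration.

```r
# Simulated homoskedastic vs. heteroskedastic errors (made-up parameters)
set.seed(1)
n <- 2000
x <- runif(n, min = 1, max = 10)

u_homo   <- rnorm(n, sd = 3)         # same spread at every level of X
u_hetero <- rnorm(n, sd = 0.6 * x)   # spread grows with X

y_homo   <- 5 + 2 * x + u_homo
y_hetero <- 5 + 2 * x + u_hetero

# Standard deviation of the residuals below and above the median of X
spread_by_half <- function(y, x) {
  res <- resid(lm(y ~ x))
  high <- x > median(x)
  c(low_x = sd(res[!high]), high_x = sd(res[high]))
}

spread_homo   <- spread_by_half(y_homo, x)     # roughly equal spreads
spread_hetero <- spread_by_half(y_hetero, x)   # much wider spread at high X
spread_homo
spread_hetero
```

On a scatterplot, the homoskedastic case looks like a band of roughly constant width around the regression line, while the heteroskedastic case fans out as X increases.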

Question 5

In your own words, describe what the variation in \(\hat{\beta_1}\) (either variance or standard error) means, or is measuring. What three things determine the variation, and in what way?

The variation of \(\hat{\beta_1}\) (either its variance or standard error) is a measure of how precise our estimate is. This idea comes from the sampling distribution of \(\hat{\beta_1}\), since it is a random variable: if we were to take other samples and calculate the slope of a regression line \(\hat{\beta_1}\) for each, the estimate would vary from sample to sample.

The standard error of \(\hat{\beta_1}\) (square this to get variance) is:

\[se(\hat{\beta_1})=\frac{\sigma_u}{\sqrt{n} \times sd(X)}\]

The three things that affect it are:

  1. Goodness of Fit of the Regression \((\sigma_u)\) or \(SER\). The worse the fit, the higher the \(SER\), and the worse the precision (higher standard error) of \(\hat{\beta_1}\).

  2. Sample size, \(n\): the more data, the better the precision (lower standard error) of \(\hat{\beta_1}\).

  3. Standard deviation of \(X\), \(sd(X)\): the more variation (spread) in \(X\)-values, the better the precision (lower standard error) of \(\hat{\beta_1}\).

See the graphs in slides 9-11 of class 2.5 for more.
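The three ingredients can be combined by hand and compared with the standard error that `lm()` reports. This is a sketch on the built-in `mtcars` data (an arbitrary example). Note that R's exact calculation divides the SER by \(\sqrt{\sum (X_i-\bar{X})^2}\), which equals \(\sqrt{n-1}\) times the sample standard deviation of \(X\), so the \(\sqrt{n}\) version is a close approximation that improves as \(n\) grows.

```r
# Standard error of the slope from its three ingredients
fit <- lm(mpg ~ wt, data = mtcars)
x <- mtcars$wt
n <- nobs(fit)
ser <- sigma(fit)   # goodness of fit: standard error of the regression

se_exact  <- ser / sqrt(sum((x - mean(x))^2))   # exact simple-regression formula
se_approx <- ser / (sqrt(n) * sd(x))            # sqrt(n) version, using the sample sd of X

se_reported <- summary(fit)$coefficients["wt", "Std. Error"]

c(exact = se_exact, approx = se_approx, reported = se_reported)
```

Raising the SER increases all three values, while a larger \(n\) or more spread in \(X\) shrinks them.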

Question 6

In your own words, describe what a p-value means, and how it is used to establish statistical significance.

The \(p\)-value is the probability, if the null hypothesis were true, of observing a test statistic at least as extreme as the one found in our sample. Specifically, if \(H_0: \beta_1=0\), it is the probability of getting a sample slope at least as extreme as the one in our sample, if the slope were truly 0.

\[Prob(\delta \geq \delta_i|H_0\text{ is true})\]

where \(\delta\) is a test-statistic and \(\delta_i\) is the test statistic we obtained from our sample.

Another way to interpret this is that the \(p\)-value is the probability we commit a Type I error: the probability that, if the null hypothesis were true, we falsely reject it from our sample evidence.

Be careful, the \(p\)-value is not the probability that our alternative hypothesis is true given our findings (commonly believed)! In fact it is basically the opposite, the probability of our findings being valid given the null hypothesis!
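One way to make the definition concrete is to simulate the null hypothesis directly. The sketch below uses a simple permutation approach in base R on the built-in `mtcars` data (an arbitrary example): shuffling \(Y\) breaks any link with \(X\), so the slopes from shuffled data approximate what we would see if \(\beta_1\) were truly 0. The number of repetitions is an arbitrary choice, and "extreme" is measured in both directions (a two-sided test).

```r
# Permutation sketch of a two-sided p-value for the slope
set.seed(123)
observed_slope <- unname(coef(lm(mpg ~ wt, data = mtcars))["wt"])

# Null distribution: slopes we get when mpg is randomly reassigned to cars
n_reps <- 2000
null_slopes <- replicate(n_reps, {
  shuffled_mpg <- sample(mtcars$mpg)
  unname(coef(lm(shuffled_mpg ~ mtcars$wt))[2])
})

# Share of simulated slopes at least as extreme as the observed one
p_value <- mean(abs(null_slopes) >= abs(observed_slope))

observed_slope
p_value   # small values mean the observed slope would be rare if H0 were true
```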

Question 7

A researcher is interested in examining the impact of illegal music downloads on commercial music sales. The author collects data on commercial sales of the top 500 singles from 2017 (Y) and the number of downloads from a web site that allows ‘file sharing’ (X). The author estimates the following model:

\[ \text{music sales}_i = \beta_0+\beta_1 \text{illegal downloads}_i + u_i \]

The author finds a large, positive, and statistically significant estimate of \(\hat{\beta_1}\). The author concludes these results demonstrate that illegal downloads actually boost music sales. Is this an unbiased estimate of the impact of illegal music on sales? Why or why not? Do you expect the estimate to overstate or understate the true relationship between illegal downloads and sales?

Does knowing the amount of illegal downloads an artist has convey any information about other variables that affect music sales? In other words, we are asking if \(E[u|X]=0\) (or more simply, \(cor(X,u)=0\)).

It is likely that the artists and songs that are most heavily pirated are the most popular ones, and are also likely to have very high music sales. Economists say piracy is like a tax on success: it happens more to those who are already successful and less to those who are still trying to make it big.

In any case, illegal downloads is probably endogenous. Since there is likely a positive correlation between illegal downloads and popularity (in the error term), and popularity is also positively correlated with music sales, it is likely that we are overstating the effect of illegal downloads on sales. In other words, \(\hat{\beta_1}\) is also picking up the positive effect of popular songs, and is too large. The true \(\beta_1\) is likely much lower than estimated.

Theory Problems

For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”

Question 8

A researcher wants to estimate the relationship between average weekly earnings (\(AWE\), measured in dollars) and \(Age\) (measured in years) using a simple OLS model. Using a random sample of college-educated full-time workers aged 25-65 yields the following:

\[ \widehat{AWE} = 696.70+9.60 \, Age \]

Part A

Interpret what \(\hat{\beta_0}\) means in this context.

\(\hat{\beta_0}\) is 696.70. This is the vertical intercept of the regression line. It means that a person who is 0 years old earns $696.70 per week on average. This is often nonsensical, so we don’t often care about the economic meaning of the intercept.

Part B

Interpret what \(\hat{\beta_1}\) means in this context.

\(\hat{\beta_1}\) is 9.60. This is the slope of the regression line. It means that for every year older a person is, they can expect their weekly earnings to increase by $9.60, on average. This is the marginal effect of Age on AWE (and the causal effect if this model were exogenous).

Part C

The \(R^2=0.023\) for this regression. What are the units of the \(R^2\), and what does this mean?

\(R^2\) has no units; it is the proportion of variation in \(AWE\) that is explained by our model, between 0 and 1. This model explains only 2.3% of the variation in \(AWE\), meaning this model is poor, and the line does not fit the data points well.

Part D

The \(SER, \, \hat{\sigma_u}=624.1\) for this regression. What are the units of the SER in this context, and what does it mean? Is the SER large in the context of this regression?

\(SER\) is measured in the same units as the dependent variable, \(AWE\), so it is measured in dollars. It is the average error or residual for an individual, the difference (in dollars) between OLS’ predicted \(\widehat{AWE}\) for that person, and their true \(AWE\) in the data. This SER is quite big, $624 in average weekly earnings.

Part E

Suppose Maria is 20 years old. What is her predicted \(\widehat{AWE}\)?

\[\begin{align*} \widehat{AWE}_{Maria}&=696.70+9.60(20)\\ &=888.70\\ \end{align*}\]

She is predicted to earn $888.70 per week, according to our model.

Part F

Suppose the data shows her actual \(AWE\) is $430. What is her residual? Is this a relatively good or a bad prediction? Hint: compare your answer here to your answer in Part D.

\[\begin{align*} \widehat{u}_{Maria}&=Y_{Maria}-\widehat{Y}_{Maria}\\ &=430-888.70\\ &=-458.70\\ \end{align*}\]

Her residual, i.e. the error in the prediction of her wages, is -$458.70 (she actually earns $458.70 less than her predicted wage).

While this sounds large, it is actually a relatively good prediction, as it is lower than the average prediction error (SER), which was $624.10.
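The arithmetic in Parts E and F can be verified in R. This is only a check of the hand calculation, using the coefficients and SER given in the question; the helper `predict_awe()` is a name made up for this sketch.

```r
# Coefficients and SER as given in Question 8
beta_0_hat <- 696.70
beta_1_hat <- 9.60
ser <- 624.1

# Hypothetical helper: predicted average weekly earnings at a given age
predict_awe <- function(age) beta_0_hat + beta_1_hat * age

awe_hat_maria  <- predict_awe(20)        # predicted AWE for Maria
residual_maria <- 430 - awe_hat_maria    # actual minus predicted

awe_hat_maria
residual_maria
abs(residual_maria) < ser   # is her error smaller than the average error?
```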

Part G

What does the error term, \(u_i\), represent in this case? Why might individuals have different values of \(\hat{u}_i\)?

The error term represents all factors other than age that affect an individual’s average weekly earnings. This could include things like experience, ability, job type, education level, conscientiousness, etc.

Part H

Do you think that \(Age\) is exogenous? Why or why not? Would we expect \(\hat{\beta_1}\) to be too large or too small?

It’s very unlikely that \(Age\) is exogenous. Knowing someone’s age likely gives us information about \(u\): we can guess about their experience or level of education (they are likely higher for older people), and most of these positively affect wages. Thus, we have probably overestimated the effect of age on earnings (i.e. \(\hat{\beta_1}\)), and the true \(\beta_1\) is likely smaller.

Question 9

Suppose a researcher is interested in estimating a simple linear regression model:

\[ Y_i=\beta_0+\beta_1X_i+u_i \]

In a sample of 48 observations, she generates the following descriptive statistics:

  • \(\bar{X}=30\)
  • \(\bar{Y}=63\)
  • \(\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2= 6900\)
  • \(\displaystyle\sum^n_{i=1}(Y_i-\bar{Y})^2= 29000\)
  • \(\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})=13800\)
  • \(\displaystyle\sum^n_{i=1}\hat{u}^2=1656\)

Part A

What is the OLS estimate of \(\hat{\beta_1}\)?

The formula for \(\hat{\beta_1}=\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2}=\frac{cov(X,Y)}{var(X)} = \frac{13800}{6900}=2\)

Part B

What is the OLS estimate of \(\hat{\beta_0}\)?

The formula for \(\hat{\beta_0}=\bar{Y}-\hat{\beta_1}\bar{X}=63-30(2)=3\)

Part C

Suppose the OLS estimate of \(\hat{\beta_1}\) has a standard error of \(0.072\). Could we probably reject a null hypothesis of \(H_0: \beta_1=0\) at the 5% level?

Yes, we could reject the null hypothesis as the estimate of \(\hat{\beta_1}=2\) is more than 2 times its standard error of 0.072. The test-statistic would actually be

\[\begin{align*} t&=\frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})}\\ t&=\frac{2-0}{0.072}\\ t&\approx 27.78\end{align*}\]

This is well beyond the critical value needed to reject \(H_0\), and the \(p\)-value would be basically 0.

Part D

Calculate the \(R^2\) for this model. How much variation in \(Y\) is explained by our model?

We know TSS (4th bullet point) and SSE (last bullet point).

\[\begin{align*} R^2&=1-\frac{SSE}{TSS}\\ &=1-\frac{1656}{29000}\\ &=1-0.057\\ &=0.943\\ \end{align*}\]

This model explains 94.3% of the variation in \(Y_i\).
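Although this section is meant to be done manually, the answers to Question 9 can be verified in R from the summary statistics alone. The sketch below uses only the numbers given in the question. The variable names are made up for illustration, and the line computing the standard error uses the standard simple-regression formula \(SER/\sqrt{\sum (X_i-\bar{X})^2}\), which the question does not ask for but which is consistent with the 0.072 given in Part C.

```r
# Summary statistics given in Question 9
n     <- 48
x_bar <- 30
y_bar <- 63
sxx   <- 6900    # sum of squared deviations of X
tss   <- 29000   # sum of squared deviations of Y
sxy   <- 13800   # sum of cross-products of deviations
sse   <- 1656    # sum of squared residuals

beta_1_hat <- sxy / sxx                    # Part A
beta_0_hat <- y_bar - beta_1_hat * x_bar   # Part B

ser       <- sqrt(sse / (n - 2))           # standard error of the regression
se_beta_1 <- ser / sqrt(sxx)               # implied standard error of the slope

t_stat <- (beta_1_hat - 0) / 0.072         # Part C, using the given standard error
t_crit <- qt(0.975, df = n - 2)            # two-sided 5% critical value

r_squared <- 1 - sse / tss                 # Part D

c(beta_1 = beta_1_hat, beta_0 = beta_0_hat, se = se_beta_1,
  t = t_stat, t_crit = t_crit, r_squared = r_squared)
```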

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

Question 10

Download the MLBattend dataset. This dataset contains attendance at major league baseball games for all 32 MLB teams from the 1970s-2000. We want to answer the following question:

“How big is home-field advantage in baseball? Does a team with higher attendance at home games over their season score more runs over their season?”

Part A

Clean up the data a bit by mutate()-ing a variable to measure home attendance in millions. This will make it easier to interpret your regression later on.

# import data, save as mlb
mlb <- read_csv("MLBattend.csv") # here in cloud project
Rows: 838 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): team, city, nickname, league, division
dbl (7): season, home_attend, runs_scored, runs_allowed, wins, losses, games...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# alternatively you could just read the URL in directly:
# mlb <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/mlbattend.csv")

# make home attendance variable in millions
mlb <- mlb %>%
  mutate(home_attend_mil = home_attend/1000000)

Part B

Get the correlation between Runs Scored and Home Attendance.

# summarize and get correlation
mlb %>%
  summarize(Correlation = cor(runs_scored, home_attend_mil))
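Rescaling attendance to millions does not change the correlation; it only changes the units of the regression slope. The sketch below shows this on a tiny made-up data frame. The numbers in `toy` are hypothetical and are not taken from the MLBattend data; only the column names mirror the real dataset.

```r
# Toy data with hypothetical values (NOT from MLBattend.csv)
toy <- data.frame(
  home_attend = c(1200000, 1850000, 2400000, 950000, 3100000, 2750000),
  runs_scored = c(640, 710, 755, 600, 820, 760)
)
toy$home_attend_mil <- toy$home_attend / 1000000

# Correlation is unit-free, so rescaling X leaves it unchanged
cor_raw <- cor(toy$runs_scored, toy$home_attend)
cor_mil <- cor(toy$runs_scored, toy$home_attend_mil)
all.equal(cor_raw, cor_mil)

# The slope, however, is now "runs per additional million fans"
slope_raw <- unname(coef(lm(runs_scored ~ home_attend, data = toy))[2])
slope_mil <- unname(coef(lm(runs_scored ~ home_attend_mil, data = toy))[2])
all.equal(slope_mil, slope_raw * 1000000)
```

This is why the rescaling in Part A makes the later regression easier to read: a slope per single fan would be a very small number.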

Part C

Plot a scatterplot of Runs Scored (y) on Home Attendance (x). Add a regression line.

# create scatterplot with regression line 
scatter <- ggplot(data = mlb)+
  aes(x = home_attend_mil,
      y = runs_scored)+
  geom_point(color = "#e64173")+
  geom_smooth(method = "lm")+
  theme_light()

# look at it
scatter
`geom_smooth()` using formula 'y ~ x'