Problem Set 3

Author

Answer Key

Published

October 14, 2022

# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test
library(infer) # for simulating inference

Theory and Concepts

Question 1

In your own words, describe what exogeneity and endogeneity mean, and how they are related to bias in our regression. What things can we learn about the bias if we know \(X\) is endogenous?

The OLS estimators \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are unbiased estimates of the true population parameters \(\beta_0\) and \(\beta_1\) if and only if \(X\) is exogenous. That is to say, if \(cor(X,u)=0\) (i.e. there is no correlation between \(X\) and any unobserved variable that affects \(Y\)), then \(E[\hat{\beta_1}]=\beta_1\).

If \(X\) is correlated with the error term, then \(X\) is endogenous. The true expected value of the OLS estimator is \[E[\hat{\beta_1}]=\beta_1+cor(X,u)\frac{\sigma_u}{\sigma_X}\]

The bias is \(\big(E[\hat{\beta_1}]-\beta_1\big)\), i.e. the difference between average estimated sample slope and the `true’ population slope, so we can determine first the size of the bias based on how large \(cor(X,u)\) is. The stronger the correlation, the larger the bias.

Second, we can determine the direction of the bias depending on the sign of \(cor(X,u)\).

  • If \(X\) and \(u\) are positively correlated (move in the same direction), we know that we have overstated the true effect of \(\Delta X\) on \(\Delta Y\), since a change in \(Y\) is picking up both a change in \(X\) and a further change (in the same direction as \(X\)) in the unobserved \(u\).

  • If the correlation is negative (move in opposite directions), we know that we have understated the true effect of \(\Delta X\) on \(\Delta Y\), since a change in \(Y\) is picking up both a change in \(X\) that is dampened by a change in the opposite direction of \(u\).

Question 2

In your own words, describe what \(R^2\) means. How do we calculate it, what does it tell us, and how do we interpret it?

The \(R^2\) is a measure of how well the OLS regression line “fits” our observed data points. It is the proportion of the total variation in \(Y\) (TSS) that is explained by the variation from our model (ESS):

\[\begin{align*} R^2&=\frac{ESS}{TSS}= \frac{\sum (\hat{Y_i}-\bar{Y})^2}{\sum(Y_i-\bar{Y})^2} \end{align*}\]

Equivalently, it can be found by subtracting the proportion of unexplained variation in \(Y\) (SSE/TSS) from 1:

\[R^2=1-\frac{SSE}{TSS}=1-\frac{\sum (u_i)^2}{\sum (Y_i-\bar{Y})^2}\]

This is because \(\frac{SSE+ESS}{TSS}=1\). Finally, \(R^2\) is the square of the correlation coefficient between \(X\) and \(Y\), \(R^2=(r_{X,Y})^2\)

The closer \(R^2\) is to 1, the better the fit, the closer to 0, the poorer the fit.

Question 3

In your own words, describe what the standard error of the regression (\(SER\)) means. How do we calculate it, what does it tell us, and how do we interpret it?

SER \((\hat{\sigma_u}\)) is the average size of the error (or residual), \(\hat{u_i}\), that is, the average distance from the regression line to the actual data value for \(Y\) at a given \(X\). The goal of OLS is to minimize this (well, technically minimize the sum of squared errors!).

\[\begin{align*} SER&=\sqrt{\frac{1}{n-2}\sum \hat{u_i}^2}\\ SER &=\sqrt{\frac{SSE}{n-2}}\\ \end{align*}\]

We calculate it by squaring the residuals (to get a positive distance) and taking the mean of them by adding them all up and dividing by \(n-2\), and then taking the square root to return to normal (non-squared) units.

We divide by \(n-2\) rather than by \(n\) due to the degrees of freedom correction for calculating two prior statistics with our data already, \(\hat{\beta_0}\) and \(\hat{\beta_1}\).

Question 4

In your own words, describe what homoskedasticity and heteroskedasticity mean: both in ordinary English, and in terms of the graph of the OLS regression line.

Homoskedasticity means the errors are distributed with the same variance for all levels of X. Knowing anything about X will not tell us anything about the distribution of errors at that level of X.

Heteroskedasticity means the errors are distributed differently for different levels of X. So, at different levels of X, there will be much more or much less variation in the residuals.

Question 5

In your own words, describe what the variation in \(\hat{\beta_1}\) (either variance or standard error) means, or is measuring. What three things determine the variation, and in what way?

The variation of \(\hat{\beta_1}\) (either it’s variance or standard error) is a measure of how precise our estimate is. This idea comes from the sampling distribution of \(\hat{\beta_1}\), since it is a random variable: if we were to take other samples and calculate the slope of a regression line \(\hat{\beta_1}\) for each, the estimate would vary from sample to sample.

The standard error of \(\hat{\beta_1}\) (square this to get variance) is:

\[se(\hat{\beta_1})=\frac{\sigma_u}{\sqrt{n} \times se(X)}\]

The three things that affect it are:

  1. Goodness of Fit of the Regression \((\sigma_u)\) or \(SER\). The worse the fit, the higher the \(SER\), and the worse the precision (higher standard error) of \(\hat{\beta_1}\).

  2. Sample size, \(n\): the more data, the better the precision (lower standard error) of \(\hat{\beta_1}\).

  3. Standard error of \(X\): the more variation (spread) in \(X\)-values, the better the precision (lower standard error) of \(\hat{\beta_1}\).

See the graphs in slides 9-11 of class 2.5 for more.

Question 6

In your own words, describe what a p-value means, and how it is used to establish statistical significance.

The \(p\)-value is the probability that, if the null hypothesis were true, of observing a test statistic at least as extreme as the one found in our sample. Specifically, if \(H_0: \beta_1\), it is the probability of getting a sample slope at least as extreme as the one in our sample, if the slope were truly 0.1

\[Prob(\delta \geq \delta_i|H_0\text{ is true})\]

where \(\delta\) is a test-statistic and \(\delta_i\) is the test statistic we obtained from our sample.

Another way to interpret this is that the \(p\)-value is the probability we commit a Type I error: the probability that, if the null hypothesis were true, we falsely reject it from our sample evidence.

Be careful, the \(p\)-value is not the probability that our alternative hypothesis is true given our findings (commonly believed)! In fact it is basically the opposite, the probability of our findings being valid given the null hypothesis!

Question 7

A researcher is interested in examining the impact of illegal music downloads on commercial music sales. The author collects data on commercial sales of the top 500 singles from 2017 (Y) and the number of downloads from a web site that allows ‘file sharing’ (X). The author estimates the following model:

\[ \text{music sales}_i = \beta_0+\beta_1 \text{illegal downloads}_i + u_i \]

The author finds a large, positive, and statistically significant estimate of \(\hat{\beta_1}\). The author concludes these results demonstrate that illegal downloads actually boost music sales. Is this an unbiased estimate of the impact of illegal music on sales? Why or why not? Do you expect the estimate to overstate or understate the true relationship between illegal downloads and sales?

Does knowing the amount of illegal downloads an artist has convey any information about other variables that affect music sales? In other words, we are asking if \(E[u|X]=0\) (or more simply, \(cor(X,u)=0)\).

It is likely that artists and songs that are the most heavily pirated are the most popular ones, and also are likely have very high music sales. Economists say piracy is like a tax on success–it happens more to those who are already successful and less to those who are still trying to make it big.

In any case, illegal downloads is probably endogenous. Since there is likely a positive correlation between music sales and popularity (in the error term), and popularity is also positively correlated with music sales, it is likely that we are overstating the effect of illegal downloads on sales. In other words, \(\hat{\beta_1}\) is also picking up the positive effect of popular songs, and is too large. The true estimate of \(\beta_1\) is likely much lower than measured.

Theory Problems

For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”

Question 8

A researcher wants to estimate the relationship between average weekly earnings (\(AWE\), measured in dollars) and \(Age\) (measured in years) using a simple OLS model. Using a random sample of college-educated full-time workers aged 25-65 yields the following:

\[ \widehat{AWE} = 696.70+9.60 \, Age \]

Part A

Interpret what \(\hat{\beta_0}\) means in this context.

\(\hat{\beta_0}\) is 696.70. This is the vertical intercept of the regression line. It means that a person that is 0 years old earns a $696.70 per week on average. This is often nonsensical, so we don’t often care about the economic meaning of the intercept.

Part B

Interpret what \(\hat{\beta_1}\) means in this context.

\(\hat{\beta_1}\) is 9.60 This is the slope of the regression line. It means that for every year older a person is, they can expect their wages to increase by $9.60, on average. This is the marginal effect of Age on AWE (and the causal effect if this model were exogenous).

Part C

The \(R^2=0.023\) for this regression. What are the units of the \(R^2\), and what does this mean?

\(R^2\) has no units, it is the proportion of variation in \(AWE\) that is explained by our model, between 0 and 1. This model explains only 2.3% of the variation in \(AWE\), meaning this model is poor, and the line does not fit the data points well.

Part D

The \(SER, \, \hat{\sigma_u}=624.1\) for this regression. What are the units of the SER in this context, and what does it mean? Is the SER large in the context of this regression?

\(SER\) is measured in the same units as the dependent variable, \(AWE\), so it is measured in dollars. It is the average error or residual for an individual, the difference (in dollars) between OLS’ predicted \(\widehat{AWE}\) for that person, and their true \(AWE\) in the data. This SER is quite big, $624 in average weekly earnings.

Part E

Suppose Maria is 20 years old. What is her predicted \(\widehat{AWE}\)?

\[\begin{align*} \widehat{AWE}_{Maria}&=696.70+9.60(20)\\ &=888.70\\ \end{align*}\]

She is predicted to earn $888.70 per week, according to our model.

Part F

Suppose the data shows her actual \(AWE\) is $430. What is her residual? Is this a relatively good or a bad prediction? Hint: compare your answer here to your answer in Part D.

\[\begin{align*} \widehat{u}_{Maria}&=Y_{Maria}-\widehat{Y}_{Maria}\\ &=430-888.70\\ &=-458.70\\ \end{align*}\]

Her residual, i.e. the error in the prediction of her wages, is -$488.70 (she actually earns $488.70 less than her predicted wage).

While this sounds large, it actually a relatively good prediction, as it is much lower than the average prediction error (SER), which was $624.10.

Part G

What does the error term, \(u_i\) represent in this case? What might individuals have different values of \(\hat{u}_i\)?

The error term represents all factors other than age that affects an individual’s average weekly earnings. This could include things like experience, ability, job type, education level, conscienciousness etc.

Part H

Do you think that \(Age\) is exogenous? Why or why not? Would we expect \(\hat{\beta_1}\) to be too large or too small?

It’s very unlikely that \(Age\) is exogenous. Knowing someone’s age likely gives us information about \(u\): we can guess about their experience or level of education (they are likely higher for older people), and most of these positively affect wages. Thus, we have probably overstimated the effect of age on earnings (i.e. \(\hat{\beta_1}\)), and the true \(\beta_1\) is likely smaller.

Question 9

Suppose a researcher is interested in estimating a simple linear regression model:

\[ Y_i=\beta_0+\beta_1X_i+u_i \]

In a sample of 48 observations, she generates the following descriptive statistics:

  • \(\bar{X}=30\)
  • \(\bar{Y}=63\)
  • \(\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2= 6900\)
  • \(\displaystyle\sum^n_{i=1}(Y_i-\bar{Y})^2= 29000\)
  • \(\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})=13800\)
  • \(\displaystyle\sum^n_{i=1}\hat{u}^2=1656\)

Part A

What is the OLS estimate of \(\hat{\beta_1}\)?

The formula for \(\hat{\beta_1}=\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2}=\frac{cov(X,Y)}{var(X)} = \frac{13800}{6900}=2\)

Part B

What is the OLS estimate of \(\hat{\beta_0}\)?

The formula for \(\hat{\beta_0}=\bar{Y}-\hat{\beta_1}\bar{X}=63-30(2)=3\)

Part C

Suppose the OLS estimate of \(\hat{\beta_1}\) has a standard error of \(0.072\). Could we probably reject a null hypothesis of \(H_0: \beta_1=0\) at the 5% level?

Yes, we could reject the null hypothesis as the estimate of \(\hat{\beta_1}=2\) is more than 2 times its standard error of 0.072. The test-statistic would actually be

\[\begin{align*} t&=\frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})}\\ t&=\frac{2-0}{0.072}\\ t&\approx 27.78\end{align*}\]

This is well beyond the critical value needed to reject \(H_0\), and the \(p\)-value would be basically 0.

Part D

Calculate the \(R^2\) for this model. How much variation in \(Y\) is explained by our model?

We know TSS (4th bullet point) and SSE (last bullet point).

\[\begin{align*} R^2&=1-\frac{SSE}{TSS}\\ &=1-\frac{1656}{29000}\\ &=1-0.057\\ &=0.943\\ \end{align*}\]

This model explains 94.3% of the variation in \(Y_i\).

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

Question 10

Download the MLBattend dataset. This data contains data on attendance at major league baseball games for all 32 MLB teams from the 1970s-2000. We want to answer the following question:

“How big is home-field advantage in baseball? Does a team with higher attendance at home games over their season have score more runs over their season?”

Part A

Clean up the data a bit by mutate()-ing a variable to measure home attendance in millions. This will make it easier to interpret your regression later on.

# import data, save as mlb
mlb <- read_csv("MLBattend.csv") # here in cloud project
Rows: 838 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): team, city, nickname, league, division
dbl (7): season, home_attend, runs_scored, runs_allowed, wins, losses, games...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# alternatively you could just read the URL in directly:
# mlb <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/mlbattend.csv")

# make home attendance variable in millions
mlb <- mlb %>%
  mutate(home_attend_mil = home_attend/1000000)

Part B

Get the correlation between Runs Scored and Home Attendance.

# summarize and get correlation
mlb %>%
  summarize(Correlation = cor(runs_scored, home_attend_mil))

Part C

Plot a scatterplot of Runs Scored (y) on Home Attendance (x). Add a regression line.

# create scatterplot with regression line 
scatter <- ggplot(data = mlb)+
  aes(x = home_attend_mil,
      y = runs_scored)+
  geom_point(color = "#e64173")+
  geom_smooth(method = "lm")+
  theme_light()

# look at it
scatter
`geom_smooth()` using formula 'y ~ x'

Part D

We want to estimate a regression of Runs Scored on Home Attendance:

\[ \text{runs scored}_i = \beta_0 + \beta_1 \, \text{home attendance}_i + u_i \]

Run this regression in R.

What are \(\hat{\beta_0}\) and \(\hat{\beta_1}\) for this model? Interpret them in the context of our question.

Hint: make sure to save your regression model as an object, and get a summary() of it. This object will be needed later.

# run regression, save as reg
reg <- lm(runs_scored ~ home_attend_mil, data = mlb)

# get summary of reg
summary(reg)

Call:
lm(formula = runs_scored ~ home_attend_mil, data = mlb)

Residuals:
     Min       1Q   Median       3Q      Max 
-295.566  -52.754    1.414   63.769  271.377 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      572.618      8.081   70.86   <2e-16 ***
home_attend_mil   68.798      4.183   16.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91.47 on 836 degrees of freedom
Multiple R-squared:  0.2445,    Adjusted R-squared:  0.2436 
F-statistic: 270.5 on 1 and 836 DF,  p-value: < 2.2e-16
# Here I'm going to save beta 0 hat and beta 1 hat
# as objects to call up in the text of the markdown document
# We'll need broom and to tidy() our regression first
reg_tidy <- tidy(reg)

reg_tidy
# extract and save beta 0 hat 

beta_0_hat <- reg_tidy %>%
  filter(term == "(Intercept)") %>% # look at intercept row
  pull(estimate) %>% # extract beta 0 hat
  round(., 3) # round to 3 decimal places

beta_0_hat # check 
[1] 572.618
# extract and save beta 1 hat 

beta_1_hat <- reg_tidy %>%
  filter(term == "home_attend_mil") %>% # look at X-variable row
  pull(estimate) %>% # extract beta 1 hat
  round(., 3) # round to 3 decimal places

beta_1_hat # check
[1] 68.798

Part E

Write out the estimated regression equation.

\[\widehat{\text{Runs scored}_i}=572.618-68.798 \text{ Home attendance (mil)}\]

Part F

Make a regression table of the output using modelsummary().

modelsummary(models = list("Runs Scored" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "home_attend_mil" = "Home Attendance (Millions)"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Runs Scored
Constant 572.62***
(8.08)
Home Attendance (Millions) 68.80***
(4.18)
n 838
R2 0.24
SER 91.36
* p < 0.1, ** p < 0.05, *** p < 0.01

Part G

Check the goodness of fit statistics. What is the \(R^2\) and the \(SER\) of this model? Interpret them both in the context of our question.

# Here I'm going to save these
# as objects to call up in the text of the markdown document
# Again we need broom and to glance() our regression first
r_sq <- glance(reg) %>%
  pull(r.squared) %>%
  round(., 3)

r_sq # check
[1] 0.244
ser <- glance(reg) %>%
  pull(sigma) %>%
  round (., 3)

ser # check
[1] 91.473

\(R^2\) is 0.244, meaning the model is able to explain about 24.4% of the variation in Runs Scored.

The SER, i.e. the average error is 91.473, meaning any team’s season (one observation) has on average 91.473 more/fewer runs than its predicted number of runs.

Part H

Now let’s start running some diagnostics of the regression. Make a histogram of the residuals. Do they look roughly normal?

Hint: you will need to use the broom package’s augment() command on your saved regression object to add containing the residuals (.resid), and save this as a new object - to be your data source for the plot in this part and the next part.

# augment the regression, save as reg_aug
reg_aug <- reg %>%
  augment()

# now we use this as the data in our histogram plot in ggplot, (x is .resid)
ggplot(data = reg_aug)+
  aes(x = .resid)+
  geom_histogram(color = "white", fill = "#e64173")+
  theme_light()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part I

Make a residual plot.

# this is another plot from reg_aug
# where x is .fitted and y is .resid

ggplot(data = reg_aug)+
  aes(x = .fitted,
      y = .resid)+
  geom_point(color = "#e64173")+
  geom_hline(yintercept = 0, color = "black")+
  theme_light()

Part J

Test the regression for heteroskedasticity. Are the errors homoskedastic or heteroskedastic?

Hint: use the lmtest package’s bptest() command on your saved regression object.

bptest(reg)

    studentized Breusch-Pagan test

data:  reg
BP = 0.22515, df = 1, p-value = 0.6351

The null hypothesis \(H_0\) is that the errors are homoskedastic. The \(p\)-value for this test is very large, so we cannot reject the null hypothesis.

This is good, it means the errors are homoskedastic, and our OLS estimators’ standard errors are accurate and do not need to be corrected for heteroskedasticity.

Run another regression using robust standard errors. Hint: use the estimatr package’s lm_robust() command and save the output like the following:

reg_robust <- lm_robust(y ~ x, data = the_data, # change y, x, and data names to yours
                              se_type = "stata") # we'll use this method to calculate
reg_robust <- lm_robust(runs_scored ~ home_attend_mil, data = mlb,
                              se_type = "stata")

Now make another regression output table with modelsummary, with one column using regular standard errors (just use your original saved regression object) and another using robust standard errors (use this new saved object)


modelsummary(models = list("Normal SEs" = reg,
                           "Robust SEs" = reg_robust),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "home_attend_mil" = "Home Attendance (Millions)"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Normal SEs Robust SEs
Constant 572.62*** 572.62***
(8.08) (8.70)
Home Attendance (Millions) 68.80*** 68.80***
(4.18) (4.60)
n 838 838
R2 0.24 0.24
SER 91.36 91.36
* p < 0.1, ** p < 0.05, *** p < 0.01

Observe how neither \(\hat{\beta_0}\) nor \(\hat{beta_1}\) changes due to using robust standard errors (again, heteroskedasticity does not bias our estimates!), but using robust standard errors (correcting for heteroskedasticity) does slightly increase the standard errors (making the test statistic slightly smaller and p-values slightly larger).

Part K

Test the data for outliers. If there are any, identify which team(s) and season(s) are outliers. Hint: use the car package’s outlierTest() command on your saved regression object.

# write your code here
outlierTest(reg)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
     rstudent unadjusted p-value Bonferroni p
816 -3.255201          0.0011787      0.98779

This test detected one outlier, which is observation (row) number 816. Let’s look it up:

mlb %>%
  slice(816)

The Toronto Blue Jays’ 1981 season is an outlier. Just for kicks, let’s point it out on the scatterplot.

outlier <- mlb %>%
  slice(816)

library(ggrepel)
scatter+ # our scatterplot saved from part C
  geom_point(data = outlier,
             aes(x = home_attend_mil,
                 y = runs_scored),
             color = "#314f4f")+
  geom_text_repel(data = outlier,
             aes(x = home_attend_mil,
                 y = runs_scored),
             label = "1981 Blue Jays",
             color = "#314f4f",
             box.padding = 0.75,
             seed = 20)
`geom_smooth()` using formula 'y ~ x'

Part L

Look back at your regression results. What is the marginal effect of home attendance on runs scored? Is this statistically significant? Why or why not?

This is an interpretation question, no need to calculate anything. The marginal effect is \(\hat{\beta_1}\): for every 1 additional million fans attending home games over a team’s season, the team scores 69 more runs.

Looking back at the regression output in part c, the \(t\)-score for the hypothesis test \(H_0: \, \beta_1=0\), \(H_1: \, \beta_1 \neq 0\) is 16.45, yielding a very very small \(p\)-value. We have sufficient evidence to reject \(H_0\) in favor of our alternative hypothesis, that there is a relationship between home attendance and runs scored over a season.

Part M

Now we’ll try out the infer package to understand the \(p\)-value for our observed slope in our regression model.

First, save the (value of) our sample \(\hat{\beta}_1\) from your regression in Part D as an object, I suggest:

our_slope <- 123 # replace "123" with whatever number you found for the slope in part D

Then, using the infer package run the following simulation:

# save our simulations as an object (I called it "sims")
sims <- data %>% # "data" here is whatever you named your dataframe!
  specify(y ~ x) %>% # replacing y and x with your variable names
  hypothesize(null = "independence") %>% # H_0 is that slope is 0, x and y are independent
  generate(reps = 1000,
           type = "permute") %>% # make 1000 samples assuming H_0 is true
  calculate(stat = "slope") # estimate slope in each sample

# look at it
sims

# calculate p value
sims %>%
  get_p_value(obs_stat = our_slope,
              direction = "both") # a two-sided H_a: slope =/= 0

Compare to the p-value in your original regression output in previous parts of this question.

# recall I already saved our slope as beta_1_hat:
beta_1_hat
[1] 68.798
sims <- mlb %>% # our data
  specify(runs_scored ~ home_attend_mil) %>% # our model
  hypothesize(null = "independence") %>% # H_0 is that slope is 0, x and y are independent
  generate(reps = 1000, 
           type = "permute") %>% # make 1000 samples assuming H_0 is true
  calculate(stat = "slope") # estimate slope in each sample

# look at it
sims
# calculate p value
sims %>%
  get_p_value(obs_stat = beta_1_hat, # our slope
              direction = "both") # a two-sided H_a: slope =/= 0
Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.

The \(p\)-value is basically 0. According to the regression output in part D, it is smaller than 0.00000000000000002.

Part N

Make a histogram of the simulated slopes, and plot our sample slope on that histogram, shading the p-value.

You can pipe sims into visualize(obs_stat = our_slope), or use ggplot2 to plot a histogram in the normal way, using sims as the data source and add a geom_vline(xintercept = our_slope) to show our finding on the distribution.

sims %>%
  visualize()+
  shade_p_value(obs_stat = beta_1_hat, # our slope
                direction = "both") # two-sided test, shade both sides

Again, since visualize() really is just using ggplot(), we can use sims as our data to make our own plot:

ggplot(data = sims)+
  aes(x = stat)+ # slope from simulations
  geom_histogram(color = "white",
                 fill = "#e64173")+
  # add "shading" for p-value's two sides
  # right side
  geom_rect(xmin=beta_1_hat,
            xmax=Inf,
            ymin=0,
            ymax=Inf,
            fill = "pink",
            alpha=0.4)+
  # left side
  geom_rect(xmin=-beta_1_hat,
            xmax=-Inf,
            ymin=0,
            ymax=Inf,
            fill = "pink",
            alpha=0.4)+
  # add vertical line for our sample slope
  geom_vline(xintercept = beta_1_hat,
             color = "#314f4f",
             size = 2)+
  # add label
  geom_label(x = beta_1_hat,
             y = 200,
             color = "#314f4f",
             label = expression(paste("Our ", hat(beta[1]))))+
  scale_x_continuous(breaks=c(-75,-50,-25,0,25,50,75),
                     limits=c(-75,75),
                     expand=c(0,0))+
  scale_y_continuous(limits=c(0,450),
                       expand=c(0,0))+
  labs(x = expression(paste("Distribution of ", hat(beta[1]), " under ", H[0], " that ", beta[1]==0)),
       y = "Samples")+
    theme_light(base_family = "Fira Sans Condensed",
           base_size=16)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'

Note this is the sampling distribution of \(\hat{\beta_1}\) under the null hypothesis (the true \(\beta_1=0)\). Values on the horizontal axis are values of \(\hat{\beta_1}\), not the number of standard deviations away from the null hypothesis.

What the test-statistic \(t\) does is standardize this distribution similar to how we would standardize a distribution to the standard normal distribution via calculating \(Z\)-scores.

Let me visualize what would happen if we tried to standardize our simulated null distribution:

sims %>%
  summarize(mean = mean(stat), # get the mean slope of the simulated distribution of slopes
            se = (sd(stat))) # get the standard error of the slope 
# the mean isn't exactly 0, but close

# get OUR standard error from our original regression
se_beta_1_hat<- reg_tidy %>% 
  filter(term == "home_attend_mil") %>%
  pull(std.error)

# standardize slopes to t-statistics (like Z-scores)
##  t = estimate - null hypothesis value / standard error of estimate

t_statistics <- sims %>%
  mutate(t_scores = ((stat - 0)/se_beta_1_hat))

our_t <- ((beta_1_hat - 0) / se_beta_1_hat)

our_t
[1] 16.44736

This is the t statistic that R calculates as part of the regression, and calculates the p-value for the above hypothesis test as the probability beyond this value in both tails of a theoretical \(t\)-distribution with \(n-k-1\) (in this case, 836) degrees of freedom:

We can visualize the t-distribution

ggplot(data = tibble(x=-4:4))+
  aes(x = x)+
  stat_function(fun = dt, args = list(df = 836), size=2, color="#e64173")+
      # add "shading" for p-value's two sides
  # right side
  geom_rect(xmin=our_t,
            xmax=Inf,
            ymin=0,
            ymax=Inf,
            fill = "pink",
            alpha=0.4)+
  # left side
  geom_rect(xmin=-1* our_t,
            xmax=-Inf,
            ymin=0,
            ymax=Inf,
            fill = "pink",
            alpha=0.4)+
  # add vertical line for our sample slope's t-statistic
  geom_vline(xintercept = our_t,
             color = "#314f4f",
             size = 2)+
  #add label
  geom_label(x = our_t,
             y = 0.35,
             color = "#314f4f",
             label = "Our t")+
  geom_vline(xintercept = -1 * our_t,
             color = "#314f4f",
             size = 2)+
  #add label
  geom_label(x = -1 * our_t,
             y = 0.35,
             color = "#314f4f",
             label = "Our t")+
  
  scale_x_continuous(breaks = seq(-20,20,5),
                     limits = c(-17.5,17.5),
                     expand = c(0,0))+
  scale_y_continuous(breaks = seq(0,0.4,0.05),
                     limits = c(0,0.425),
                     expand = c(0,0))+
  labs(x = "Test statistic, t",
       y = "Probability",
       caption = expression(paste("p-value is probability in the tail(s) beyond our test statistic, ", t[i], " of our sample slope ", hat(beta[1]))))+
  theme_light()

We can calculate this probability (as R does in the regression) by finding the probability in the tails of the \(t_{836}\)-distribution beyond \(\pm16.447\). The \(t\) distribution has 53,938 degrees of freedom — \((n-k-1)\) where \(n = 53940\) and \(k=1\).

2 * pt(16.447, # our t-statistic
       df = 836, # the df number
       lower.tail = F) # we'll use the right tail
[1] 7.160046e-53

If the null hypothesis were true \((\beta_1=0)\), the probability that we get a test-statistic at least as extreme as 16.447 (essentially, 16.447 standard deviations away!!) is virtually 0. # Knit and Submit!

When you are done, click the Render button. Based on the current yaml header format: html, this will currently produce an html webpage, which should automatically open for your review.

Notice in the Files pane in R Studio (by default, the lower right one), there should now be a document called 03-problem-set.html (or if you changed the filename) ending in .html. This is the webpage, so you can find this file on your computer (or download it from Rstudio.cloud with by clicking on the checkmark box in front of the file in the Files page and then going to More -> Export... to download the file to your computer) and send this file.

If you want to make a PDF, install the package “tinytex” and run the following code to install a LaTeX distribution:

Then delete the lines in the yaml header that say format: html: self-contained: TRUE, and add a simple line that says format: pdf . Clicking Render will now produce a PDF, show it, and save it as a new file in the Files pane.

Either way, send me your output file, html or pdf (or, if you like, word) so long as it shows the input and output code of every chunk. I have set it by default to do this, with echo: true in the yaml header.

Don’t forget to add your name to the author part of the header!

Footnotes

  1. Note in the classic sense, the \(p\)-value is actually measuring the probability of a test statistic \((t)\) being at least as extreme as ours. The test statistic essentially standardizes our sample statistic \((\hat{\beta_1})\) so that it measures standard deviations from the null-hypothesized value (i.e. 0), much like a \(Z\)-score.↩︎