2.7 — Hypothesis Testing for Regression — R Practice

Author

Answer Key

Published

October 12, 2022

Required Packages & Data

Load all the required packages we will use (note I have installed them already into the cloud project) by running (clicking the green play button) the chunk below:

library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(infer) # for simulating inference

What is the effect of a diamond’s carat size on its price?

We will use the diamonds example dataset that comes bundled with ggplot2, which loads automatically with tidyverse. To make it an explicit tibble in our workspace, run the code below.

diamonds <- diamonds

Note I am calling the tibble diamonds. Feel free to name it whatever you like, but I will assume in my code chunks below that it is called diamonds; change as appropriate.

Question 1

Just to see what we’re looking at, make a scatterplot using ggplot(), with x as carat and y as price, and add a regression line.

# type your code below in this chunk
ggplot(data = diamonds)+
  aes(x = carat,
      y = price)+
  geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

Regression

Question 2

Suppose we want to estimate the following relationship:

\[ \text{Price}_i = \beta_0+\beta_1 \, \text{carat}_i+u_i \]

Run a regression of price on carat using lm() and get a summary() of it. Be sure to save your regression model as an object; we’ll need it later.

# type your code below in this chunk
reg <- lm(price ~ carat, data = diamonds)

summary(reg)

Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18585.3   -804.8    -18.9    537.4  12731.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2256.36      13.06  -172.8   <2e-16 ***
carat        7756.43      14.07   551.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8493 
F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Question 3

Part A

Write out the estimated regression equation.

\[ \widehat{\text{Price}}_i = -2256.36 + 7756.43 \, \text{carat}_i \]

Part B

Make a regression table of the output (using the modelsummary package).

# type your code below in this chunk
modelsummary(models = list("Price" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "carat" = "Carat"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Price
Constant −2256.36***
(13.06)
Carat 7756.43***
(14.07)
n 53940
R2 0.85
SER 1548.53
* p < 0.1, ** p < 0.05, *** p < 0.01

Part C

What is the estimated \(\hat{\beta}_1\) for this model? What does it mean in terms of the context of our question?

\(\hat{\beta}_1\), the estimated slope, is 7756.43. For every additional carat of size, we expect the price of a diamond to increase by $7,756.43.
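To see this interpretation in action, we can plug two carat values into the estimated equation from Part A. A quick base-R check using the rounded coefficients reported above:

```r
# rounded coefficients from the summary() output above
b0 <- -2256.36
b1 <- 7756.43

# predicted price at 1 carat and at 2 carats
price_1 <- b0 + b1 * 1  # 5500.07
price_2 <- b0 + b1 * 2  # 13256.50

# the difference is exactly the slope: one more carat -> +$7,756.43
price_2 - price_1
```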


Part D

Use the broom package’s tidy() command on your regression object, and calculate confidence intervals for your estimates by setting conf.int = T inside tidy().

What is the 95% confidence interval for \(\hat{\beta}_1\), and what does it mean?

# type your code below in this chunk
reg_tidy <- reg %>% 
  tidy(conf.int = T)

reg_tidy

We are 95% confident that the true effect of carat on price (β1) lies between $7,728.86 and $7,784.00, in the sense that 95% of intervals constructed this way across similarly drawn samples would contain the true β1.
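As a sanity check, the interval tidy() reports can be reproduced by hand from the summary() output, using the critical value from a t-distribution with n - 2 = 53,938 degrees of freedom. A base-R sketch using the rounded estimate and standard error (so the last digit differs slightly from tidy()'s unrounded calculation):

```r
est <- 7756.43  # estimated slope from summary()
se  <- 14.07    # its standard error

t_crit <- qt(0.975, df = 53938)  # ~1.96 for a 95% interval

# 95% CI: estimate +/- critical value * standard error
c(lower = est - t_crit * se,
  upper = est + t_crit * se)
# approximately 7728.85 and 7784.01
```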

Save these endpoints as an object by running the following code (and changing it for your object names as needed):

# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  select(conf.low, conf.high)

# look at it
ci_endpoints

Part E

Save your estimated \(\hat{\beta}_1\) as an object similarly by running and changing the code below if needed. We’ll need it later.

# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  pull(estimate)

# look at it
our_slope
[1] 7756.426

Confidence Intervals

Question 4

Part A

Let’s generate a confidence interval for \(\hat{\beta}_1\) by simulating the sampling distribution of \(\hat{\beta}_1\). Run the following code, which will specify() the model relationship and generate() 1,000 repetitions using the bootstrap method of resampling data points randomly from our sample, with replacement.

# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot

Note this will take a few minutes; it’s doing a lot of calculations! What does it show you when you look at it?

This produces 1,000 bootstrapped samples, and calculates the estimated regression slope \(\hat{\beta}_1\) for each sample.
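Under the hood, each bootstrap repetition is just: resample the rows with replacement, refit the regression, and keep the slope. A minimal base-R sketch of one repetition on toy data (the names x, y, and dat are illustrative stand-ins, not the diamonds data):

```r
set.seed(42)

# toy data standing in for the diamonds sample
x <- runif(100, 0.2, 3)                         # "carat"
y <- -2000 + 7750 * x + rnorm(100, sd = 1500)   # "price"
dat <- data.frame(x, y)

# one bootstrap repetition:
rows <- sample(nrow(dat), replace = TRUE)       # resample rows WITH replacement
boot_slope <- coef(lm(y ~ x, data = dat[rows, ]))["x"]  # refit, keep slope
boot_slope  # one draw from the simulated sampling distribution
```

Repeating this 1,000 times (as generate() does) traces out the sampling distribution of \(\hat{\beta}_1\).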

NOTE: RStudio Cloud may crash when running this (1,000 resamples of 53,940 rows is nearly 54 million observations!). This is another good reason why, while the cloud is convenient, you really should run RStudio on your own computer rather than relying on it!

Part B

Continue by piping your object from Part A into get_confidence_interval(). Set level = 0.95, type = "se" and point_estimate equal to our estimated \(\hat{\beta}_1\) (saved) from above as our_slope.

boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se", 
                          point_estimate = our_slope) 

This gives us a confidence interval of $7,706.95 to $7,805.91. This is not exactly the same as, but quite close to, R’s automatic calculation using the theoretical t-distribution.
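The "se" method behind get_confidence_interval() is simple arithmetic: take the standard deviation of the 1,000 bootstrapped slopes as the standard error, and build a normal-theory interval around our point estimate. A hedged base-R sketch, using a made-up vector of simulated slopes in place of boot$stat:

```r
set.seed(1)
# stand-in for the 1,000 bootstrapped slopes in boot$stat (illustrative only)
sim_slopes <- rnorm(1000, mean = 7756, sd = 25)

point_estimate <- 7756.43  # our saved slope (our_slope) from earlier

se_boot <- sd(sim_slopes)  # bootstrap standard error
c(lower = point_estimate - qnorm(0.975) * se_boot,
  upper = point_estimate + qnorm(0.975) * se_boot)
```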

Part C

Now instead of get_confidence_interval(), pipe your object from Part A into visualize() to see the sampling distribution of \(\hat{\beta}_1\) we simulated. We can add + shade_ci(endpoints = ...), setting the argument equal to whatever object contains the confidence interval from Question 3 Part D (I have it named here as ci_endpoints).

boot %>%
  visualize()+
  shade_ci(endpoints = ci_endpoints) 

Compare your simulated confidence interval with the theoretically-constructed confidence interval from the output of tidy() from Question 3.

# type your code below in this chunk

Hypothesis Testing

Question 5

Now let’s test the following hypothesis:

\[ \begin{align*} H_0: & \, \beta_1 = 0 \\ H_a: & \, \beta_1 \neq 0 \\ \end{align*} \]

Part A

Is this a one-sided or a two-sided test?

This is a two-sided test, because we are considering the possibility that \(\beta_1\) may be above or below the null hypothesis value of 0.

Part B

Look back at the regression summary() output from question 2, and the output of tidy() from question 3. What does it tell you about this hypothesis test?

Reading the second row (carat) of the summary() output from Question 2 or the tidy() output from Question 3, we see that the estimate \(\hat{\beta}_1\) was 7756.43, with a standard error of 14.07. This yields a test statistic of 551.4 and a p-value of essentially 0, so we can reject the null hypothesis.

Part C

Let’s now run this hypothesis test with infer, which will simulate the sampling distribution under the null hypothesis that \(\beta_1=0\). Run the following code, which will specify() the model relationship, hypothesize() the null hypothesis that there is no relationship between X and Y (i.e. \(\beta_1=0\)), generate() 1,000 repetitions using the permute method (which centers the distribution at 0), and then calculate(stat = "slope").

# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
test_sims

Note this may also take a few minutes. What does it show you?

This produces 1,000 permuted samples, under the assumption that the true slope is 0 (the null hypothesis), and calculates the estimated regression slope \(\hat{\beta}_1\) for each sample.
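The permute method works by shuffling one variable relative to the other, which breaks any true relationship and so imposes the null \(\beta_1 = 0\). A minimal base-R sketch of one permuted repetition on toy data (the names x and y are illustrative, not the diamonds data):

```r
set.seed(7)

# toy data with a genuine positive slope
x <- runif(100, 0.2, 3)
y <- -2000 + 7750 * x + rnorm(100, sd = 1500)

# one permutation repetition: shuffle y against x, refit, keep slope
y_perm <- sample(y)                      # permute the outcome
perm_slope <- coef(lm(y_perm ~ x))["x"]  # should be near 0 under the null
perm_slope
```

Across 1,000 such repetitions, the slopes pile up around 0, giving the null distribution we test our observed slope against.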

Part D

Pipe your object from the previous part into the following code, which will get_p_value(). Inside this function, we set obs_stat equal to the \(\hat{\beta}_1\) we found (saved as our_slope in Question 3 Part E), and set direction = "both" to run a two-sided test, since our alternative hypothesis is two-sided, \(H_a: \beta_1 \neq 0\).

test_sims %>%
  get_p_value(obs_stat = our_slope,
              direction = "both")
Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.

Note the warning message that comes up!

Part E

Instead of get_p_value(), pipe your simulations into the following code, which will visualize() the null distribution, and (in the second command), place our finding on it and shade_p_value().

test_sims %>%
  visualize()

test_sims %>%
  visualize()+
  shade_p_value(obs_stat = our_slope,
                direction = "both") # for two-sided test

Part F

Compare your simulated p-value with the theoretically-constructed p-value from the output of summary() and/or tidy() from Questions 2 and 3.

Both summary() and tidy() also report the t-statistic (the t value column in summary(), statistic in tidy()) for this test on carat (by default, against \(H_0: \beta_1 = 0\)).

What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:

\[\text{Test statistic} = \frac{\text{estimate} - \text{null hypothesis}}{\text{standard error of the estimate}}\]

\[ \begin{align*} t &= \frac{7756.43-0}{14.07} \\ & \approx 551.3 \end{align*} \]

If the null hypothesis were true \(\beta_1=0\), the probability that we get a test-statistic at least as extreme as 551.3 (essentially, 551 standard deviations away!!) is virtually 0. You can see this when we visualize using infer, above.

The p-value is the probability of a t-statistic at least as extreme (in absolute value, for a two-sided test) as ours if the null hypothesis were true. R constructs a t-distribution with n-k-1 degrees of freedom (n is the number of observations, k the number of X-variables) and calculates the probability in the tails of the distribution beyond this t value. You can calculate it yourself (for a two-sided test) with:

2 * pt(551.3, # our t-statistic 
       df = 53938, # n - 2
       lower.tail = F) # we'll use the right tail
[1] 0