2.7 — Hypothesis Testing for Regression — R Practice

Author

Answer Key

Published

October 12, 2022

Required Packages & Data

Load all the required packages we will use (note I have installed them already into the cloud project) by running (clicking the green play button) the chunk below:

library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(infer) # for simulating inference

What is the effect of a diamond’s carat size on its price?

We will use the diamonds example dataset that comes bundled with ggplot, automatically loaded with tidyverse. To make it an explicit tibble in our workspace, run the code below.

diamonds <- diamonds

Note I am calling the tibble diamonds, feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it diamonds - change as appropriate.

Question 1

Just to see what we’re looking at, make a scatterplot using ggplot(), with x as carat and y as price, and add a regression line.

# type your code below in this chunk
ggplot(data = diamonds)+
  aes(x = carat,
      y = price)+
  geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

Regression

Question 2

Suppose we want to estimate the following relationship:

\[ \text{Price}_i = \beta_0+\beta_1 \, \text{carat}_i+u_i \]

Run a regression of price on carat using lm() and get a summary() of it. Be sure to save your regression model as an object, we’ll need it later.

# type your code below in this chunk
reg <- lm(price ~ carat, data = diamonds)

summary(reg)

Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18585.3   -804.8    -18.9    537.4  12731.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2256.36      13.06  -172.8   <2e-16 ***
carat        7756.43      14.07   551.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8493 
F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16

Question 3

Part A

Write out the estimated regression equation.

\[ \widehat{price}_i=−2256.36+7756.43 \, \text{carat}_i \]

Part B

Make a regression table of the output (using the modelsummary package).

# type your code below in this chunk
modelsummary(models = list("Price" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "carat" = "Carat"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Price
Constant −2256.36***
(13.06)
Carat 7756.43***
(14.07)
n 53940
R2 0.85
SER 1548.53
* p < 0.1, ** p < 0.05, *** p < 0.01

Part C

What is the estimated \(\hat{\beta}_1\) for this model? What does it mean in terms of the context of our question?

\(\hat{\beta}_1\), the estimated slope, is 7756.43. For every 1 carat, we expect the price of the diamond to increase by $7,756.43.

# type your code below in this chunk

Part D

Use the broom package’s tidy() command on your regression object, and calculate confidence intervals for your estimates by setting conf.int = T inside tidy().

What is the 95% confidence interval for \(\hat{\beta}_1\), and what does it mean?

# type your code below in this chunk
reg_tidy <- reg %>% 
  tidy(conf.int = T)

reg_tidy

We are 95% confident that in similarly constructed samples, the true effect of carat on price (β1) is between $7,728.86 and $7,784.00.

Save these endpoints as an object by running the following code (and changing it for your object names as needed):

# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  select(conf.low, conf.high)

# look at it
ci_endpoints

Part E

Save your estimated \(\hat{\beta}_1\) as an object similarly by running and changing the code below if needed. We’ll need it later.

# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  pull(estimate)

# look at it
our_slope
[1] 7756.426

Confidence Intervals

Question 4

Part A

Let’s generate a confidence interval for \(\hat{\beta}_1\) by simulating the sampling distribution of \(\hat{\beta}_1\). Run the following code, which will specify() the model relationship and generate() 1,000 repetitions using the bootstrap method of resampling data points randomly from our sample, with replacement.

# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot

Note this will take a few minutes, its doing a lot of calculations! What does it show you when you look at it?

This produces 1,000 bootstrapped samples, and calculates the estimated regression slope \(\hat{\beta}_1\) for each sample.

NOTE: R Studio Cloud may crash when running this (it’s 53,938,000 observations!) — another good reason why, while it is convenient, you really should use your own R Studio app on your computer, rather than relying on the cloud!

Part B

Continue by piping your object from Part A into get_confidence_interval(). Set level = 0.95, type = "se" and point_estimate equal to our estimated \(\hat{\beta}_1\) (saved) from above as our_slope.

boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se", 
                          point_estimate = our_slope) 

This gives us a confidence interval of $7,706.95 and $7,805.91. This not exactly, but quite close to R’s automatic calculation using the theoretical t-distribution.

Part C

Now instead of get_confidence_interval(), pipe your object from Part A into visualize() to see the sampling distribution of \(\hat{\beta}_1\) we simulated. We can add + shade_ci(endpoints = ...) setting the argument equal to whatever our object containing the confidence interval from Question 2 Part D (I have it named here as ci_endpoints).

boot %>%
  visualize()+
  shade_ci(endpoints = ci_endpoints)