---
title: "2.7 — Hypothesis Testing for Regression — R Practice"
author: "Answer Key"
date: "October 12, 2022"
format:
html:
self-contained: true # so we don't need other files (like plot images)
toc: true # show a table of contents
toc-location: left
theme: default
df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
echo: true # shows all code on rendered document
---
# Required Packages & Data
Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:
```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(infer) # for simulating inference
```
**What is the effect of a diamond's carat size on its price?**
We will use the `diamonds` example dataset that comes bundled with `ggplot`, automatically loaded with `tidyverse`. To make it an explicit tibble in our workspace, run the code below.
```{r}
diamonds <- diamonds
```
Note I am calling the tibble `diamonds`, feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it `diamonds` - change as appropriate.
## Question 1
Just to see what we're looking at, make a scatterplot using `ggplot()`, with `x` as `carat` and `y` as `price`, and add a regression line.
```{r}
# type your code below in this chunk
ggplot(data = diamonds)+
aes(x = carat,
y = price)+
geom_point()+
geom_smooth(method = "lm")
```
# Regression
## Question 2
Suppose we want to estimate the following relationship:
$$
\text{Price}_i = \beta_0+\beta_1 \, \text{carat}_i+u_i
$$
Run a regression of `price` on `carat` using `lm()` and get a `summary()` of it. Be sure to save your regression model as an object, we'll need it later.
```{r}
# type your code below in this chunk
reg <- lm(price ~ carat, data = diamonds)
summary(reg)
```
## Question 3
### Part A
Write out the estimated regression equation.
$$
\widehat{price}_i=−2256.36+7756.43 \, \text{carat}_i
$$
### Part B
Make a regression table of the output (using the `modelsummary` package).
```{r}
# type your code below in this chunk
modelsummary(models = list("Price" = reg),
fmt = 2, # round to 2 decimals
output = "html",
coef_rename = c("(Intercept)" = "Constant",
"carat" = "Carat"),
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
list("raw" = "r.squared", "clean" = "R^{2}", "fmt" = 2),
#list("raw" = "adj.r.squared", "clean" = "Adj. R^{2}", "fmt" = 2),
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
),
escape = FALSE,
stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
```
### Part C
What is the estimated $\hat{\beta}_1$ for this model? What does it mean in terms of the context of our question?
$\hat{\beta}_1$, the estimated slope, is 7756.43. For every 1 carat, we expect the price of the diamond to increase by \$7,756.43.
```{r}
# type your code below in this chunk
```
### Part D
Use the `broom` package's `tidy()` command on your regression object, and calculate confidence intervals for your estimates by setting `conf.int = T` inside `tidy()`.
What is the 95% confidence interval for $\hat{\beta}_1$, and what does it mean?
```{r}
# type your code below in this chunk
reg_tidy <- reg %>%
tidy(conf.int = T)
reg_tidy
```
**We are 95% confident that in similarly constructed samples, the true effect of carat on price** (β1) **is between \$7,728.86 and \$7,784.00.**
Save these endpoints as an object by running the following code (and changing it for your object names as needed):
```{r}
# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
filter(term == "carat") %>%
select(conf.low, conf.high)
# look at it
ci_endpoints
```
### Part E
Save your estimated $\hat{\beta}_1$ as an object similarly by running and changing the code below if needed. We'll need it later.
```{r}
# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
filter(term == "carat") %>%
pull(estimate)
# look at it
our_slope
```
## Confidence Intervals
## Question 4
### Part A
Let's generate a confidence interval for $\hat{\beta}_1$ by simulating the sampling distribution of $\hat{\beta}_1$. Run the following code, which will `specify()` the model relationship and `generate()` 1,000 repetitions using the `bootstrap` method of resampling data points randomly from our sample, with replacement.
```{r}
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
specify(price ~ carat) %>% # our regression model
generate(reps = 1000, # run 1000 simulations
type = "bootstrap") %>% # using bootstrap method
calculate(stat = "slope") # estimate slope in each simulation
# look at it
boot
```
Note this will take a few minutes, its doing a lot of calculations! What does it show you when you look at it?
**This produces 1,000 bootstrapped samples, and calculates the estimated regression slope** $\hat{\beta}_1$ **for each sample.**
**NOTE: R Studio Cloud may crash when running this (it's 53,938,000 observations!) --- another good reason why, while it is convenient, you really should use your own R Studio app on your computer, rather than relying on the cloud!**
### Part B
Continue by piping your object from Part A into `get_confidence_interval()`. Set `level = 0.95, type = "se"` and `point_estimate` equal to our estimated $\hat{\beta}_1$ (saved) from above as `our_slope`.
```{r}
boot %>%
get_confidence_interval(level = 0.95,
type = "se",
point_estimate = our_slope)
```
**This gives us a confidence interval of \$7,706.95 and \$7,805.91. This not exactly, but quite close to `R`'s automatic calculation using the theoretical t-distribution.**
### Part C
Now instead of `get_confidence_interval()`, pipe your object from Part A into `visualize()` to see the sampling distribution of $\hat{\beta}_1$ we simulated. We can add `+ shade_ci(endpoints = ...)` setting the argument equal to whatever our object containing the confidence interval from Question 2 Part D (I have it named here as `ci_endpoints`).
```{r}
boot %>%
visualize()+
shade_ci(endpoints = ci_endpoints)
```
Compare your *simulated* confidence interval with the *theoretically-constructed* confidence interval from the output of `tidy()` from Question 3.
```{r}
# type your code below in this chunk
```
# Hypothesis Testing
## Question 5
Now let's test the following hypothesis:
$$
\begin{align*}
H_0: & \, \beta_1 = 0 \\
H_a: & \, \beta_1 \neq 0 \\ \end{align*}
$$
### Part A
Is this a one-sided or a two-sided test?
**This is a two-sided test, because we are considering the possibility that** $\beta_1$ **may be above or below the null hypothesis value of 0.**
### Part B
Look back at the regression `summary()` output from question 2, and the output of `tidy()` from question 3. What does it tell you about this hypothesis test?
**Reading the second row of estimates from the `lm summary` or `tidy` output of question 2, we saw that the `estimate` on `carat`** $\hat{\beta}_1$ **was 7756.43, with a standard error of 14.07. This yields a test statistic value of 551.4, and a p-value of basically 0.**
### Part C
Let's now do run this hypothesis test with `infer`, which will simulate the sampling distribution under the null hypothesis that $\beta_1=$. Run the following code, which will `specify()` the model relationship and `hypothesize()` the null hypothesis that there is no relationship between X and Y (i.e. $\beta_1=0$), and `generate()` 1,000 repetitions using the `permute` method, which will center the distribution at 0, and then `calculate(stat = "slope")`.
```{r}
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
specify(price ~ carat) %>% # our regression model
hypothesize(null = "independence") %>% # H_0 is that slope is 0
generate(reps = 1000, # run 1000 simulations
type = "permute") %>% # using permute method, centering distr at 0
calculate(stat = "slope") # estimate slope in each simulation
# look at it
test_sims
```
Note this may also take a few minutes. What does it show you?
**This produces 1,000 permuted samples, under the assumption that the true slope is 0 (the null hypothesis), and calculates the estimated regression slope** $\hat{\beta}_1$ **for each sample.**
### Part D
Pipe your object from the previous part into the following code, which will `get_p_value()`. Inside this function, we are setting `obs_stat` equal to our β1\^ we found (from Question 2 part E), and set `direction = "both"` to run a *two*-sided test, since our alternative hypothesis is two-sided, $H_a: \beta_1 \neq 0$
```{r}
test_sims %>%
get_p_value(obs_stat = our_slope,
direction = "both")
```
Note the warning message that comes up!
### Part E
Instead of `get_p_value()`, pipe your simulations into the following code, which will `visualize()` the null distribution, and (in the second command), place our finding on it and `shade_p_value()`.
```{r}
test_sims %>%
visualize()
```
```{r}
test_sims %>%
visualize()+
shade_p_value(obs_stat = our_slope,
direction = "both") # for two-sided test
```
### Part F
Compare your *simulated* p-value with the *theoretically-constructed* p-value from the output of `summary`, and/or of `tidy()` from Question 2 and 3.
Both `summary` and `tidy()` also report the t-statistic (`t value` or `statistic`) on this test for `carat` (by default, that $H_0: \beta_1 = 0$).
What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:
$$\text{Test statistic} = \frac{\text{estimate — null hypothesis}}{\text{standard error of the estimate}}$$
$$
\begin{align*}
t &= \frac{7756.43-0}{14.07} \\
& \approx 551.3
\end{align*}
$$
**If the null hypothesis were true** $\beta_1=0$**, the probability that we get a test-statistic at least as extreme as 551.3 (essentially, 551 standard deviations away!!) is virtually 0. You can see this when we visualize using `infer`, above.**
The p-value is the probability of a t-statistic at least as large as ours if the null hypothesis were true. `R` constructs a t-distribution with `n-k-1` degrees of freedom (`n` is number of observations, `k` is number of X-variables) and calculates the probability in the tails of the distribution beyond this t value. You can calculate it yourself (for a two-sided test) with:
```{r}
2 * pt(551.3, # our t-statistic
df = 53938, # n - 2
lower.tail = F) # we'll use the right tail
```