---
title: "2.7 — Hypothesis Testing for Regression — R Practice"
author: "Your Name Here!"
date: "October 5, 2022"
format:
html:
self-contained: true # so we don't need other files (like plot images)
toc: true # show a table of contents
toc-location: left
theme: default
df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
echo: true # shows all code on rendered document
---
# Required Packages & Data
Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:
```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(fivethirtyeight) # for the fandango data
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test
```
**What is the effect of a diamond's carat size on its price?**
We will use the `diamonds` example dataset that comes bundled with `ggplot`, automatically loaded with `tidyverse`. To make it an explicit tibble in our workspace, run the code below.
```{r}
diamonds <- diamonds
```
Note I am calling the tibble `diamonds`, feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it `diamonds` - change as appropriate.
## Question 1
Just to see what we're looking at, make a scatterplot using `ggplot()`, with `x` as `carat` and `y` as `price`, and add a regression line.
```{r}
# type your code below in this chunk
```
# Regression
## Question 2
Suppose we want to estimate the following relationship:
$$
\text{Price}_i = \beta_0+\beta_1 \, \text{carat}_i+u_i
$$
Run a regression of `price` on `carat` using `lm()` and get a `summary()` of it. Be sure to save your regression model as an object, we'll need it later.
```{r}
# type your code below in this chunk
```
## Question 3
### Part A
Write out the estimated regression equation.
### Part B
Make a regression table of the output (using the `modelsummary` package).
```{r}
# type your code below in this chunk
```
### Part C
What is the estimated $\hat{\beta}_1$ for this model? What does it mean in terms of the context of our question?
```{r}
# type your code below in this chunk
```
### Part D
Use the `broom` package's `tidy()` command on your regression object, and calculate confidence intervals for your estimates by setting `conf.int = T` inside `tidy()`.
What is the 95% confidence interval for $\hat{\beta}_1$, and what does it mean?
```{r}
# type your code below in this chunk
```
Save these endpoints as an object by running the following code (and changing it for your object names as needed):
```{r}
# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
filter(term == "carat") %>%
select(conf.low, conf.high)
# look at it
ci_endpoints
```
### Part E
Save your estimated $\hat{\beta}_1$ as an object similarly by running and changing the code below if needed. We'll need it later.
```{r}
# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
filter(term == "carat") %>%
pull(estimate)
# look at it
our_slope
```
## Confidence Intervals
## Question 4
### Part A
Let's generate a confidence interval for $\hat{\beta}_1$ by simulating the sampling distribution of $\hat{\beta}_1$. Run the following code, which will `specify()` the model relationship and `generate()` 1,000 repetitions using the `bootstrap` method of resampling data points randomly from our sample, with replacement.
```{r}
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
specify(price ~ carat) %>% # our regression model
generate(reps = 1000, # run 1000 simulations
type = "bootstrap") %>% # using bootstrap method
calculate(stat = "slope") # estimate slope in each simulation
# look at it
boot
```
Note this will take a few minutes, its doing a lot of calculations! What does it show you when you look at it?
### Part B
Continue by piping your object from Part A into `get_confidence_interval()`. Set `level = 0.95, type = "se"` and `point_estimate` equal to our estimated $\hat{\beta}_1$ (saved) from above as `our_slope`.
```{r}
boot %>%
get_confidence_interval(level = 0.95,
type = "se",
point_estimate = our_slope)
```
### Part C
Now instead of `get_confidence_interval()`, pipe your object from Part A into `visualize()` to see the sampling distribution of $\hat{\beta}_1$ we simulated. We can add `+ shade_ci(endpoints = ...)` setting the argument equal to whatever our object containing the confidence interval from Question 2 Part D (I have it named here as `ci_endpoints`).
```{r}
boot %>%
visualize()+
shade_ci(endpoints = ci_endpoints)
```
Compare your *simulated* confidence interval with the *theoretically-constructed* confidence interval from the output of `tidy()` from Question 3.
```{r}
# type your code below in this chunk
```
# Hypothesis Testing
## Question 5
Now let's test the following hypothesis:
$$
\begin{align*}
H_0 &= 0 \\
H_a &\neq 0 \\ \end{align*}
$$
### Part A
Is this a one-sided or a two-sided test?
### Part B
Look back at the regression `summary()` output from question 2, and the output of `tidy()` from question 3. What does it tell you about this hypothesis test?
### Part C
Let's now do run this hypothesis test with `infer`, which will simulate the sampling distribution under the null hypothesis that $\beta_1=$. Run the following code, which will `specify()` the model relationship and `hypothesize()` the null hypothesis that there is no relationship between X and Y (i.e. $\beta_1=0$), and `generate()` 1,000 repetitions using the `permute` method, which will center the distribution at 0, and then `calculate(stat = "slope")`.
```{r}
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
specify(price ~ carat) %>% # our regression model
hypothesize(null = "independence") %>% # H_0 is that slope is 0
generate(reps = 1000, # run 1000 simulations
type = "permute") %>% # using permute method, centering distr at 0
calculate(stat = "slope") # estimate slope in each simulation
# look at it
test_sims
```
Note this may also take a few minutes. What does it show you?
### Part D
Pipe your object from the previous part into the following code, which will `get_p_value()`. Inside this function, we are setting `obs_stat` equal to our β1\^ we found (from Question 2 part E), and set `direction = "both"` to run a *two*-sided test, since our alternative hypothesis is two-sided, $H_a: \beta_1 \neq 0$
```{r}
test_sims %>%
get_p_value(obs_stat = our_slope,
direction = "both")
```
Note the warning message that comes up!
### Part E
Instead of `get_p_value()`, pipe your simulations into the following code, which will `visualize()` the null distribution, and (in the second command), place our finding on it and `shade_p_value()`.
```{r}
test_sims %>%
visualize()
```
```{r}
test_sims %>%
visualize()+
shade_p_value(obs_stat = our_slope,
direction = "both") # for two-sided test
```
### Part F
Compare your *simulated* p-value with the *theoretically-constructed* p-value from the output of `summary`, and/or of `tidy()` from Question 2 and 3.
Both `summary` and `tidy()` also report the t-statistic (`t value` or `statistic`) on this test for `carat` (by default, that $H_0: \beta_1 = 0$).
What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:
$$\text{Test statistic} = \frac{\text{estimate — null hypothesis}}{\text{standard error of the estimate}}$$
The p-value is the probability of a t-statistic at least as large as ours if the null hypothesis were true. `R` constructs a t-distribution with `n-k-1` degrees of freedom (`n` is number of observations, `k` is number of X-variables) and calculates the probability in the tails of the distribution beyond this t value. You can calculate it yourself (for a two-sided test) with:
```{r}
2 * pt(your_t, # put your t-statistic here!
df = your_df, # put the df number here!
lower.tail = F) # we'll use the right tail
```