---
title: "2.7 — Hypothesis Testing for Regression — R Practice"
author: "Your Name Here!"
date: "October 5, 2022"
format:
  html: 
    self-contained: true # so we don't need other files (like plot images)
    toc: true # show a table of contents
    toc-location: left
    theme: default
    df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
  echo: true # shows all code on rendered document
---

# Required Packages & Data

Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:

```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(fivethirtyeight) # for the fandango data
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test
```

**What is the effect of a diamond's carat size on its price?**

We will use the `diamonds` example dataset that comes bundled with `ggplot`, automatically loaded with `tidyverse`. To make it an explicit tibble in our workspace, run the code below.

```{r}
diamonds <- diamonds
```

Note I am calling the tibble `diamonds`, feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it `diamonds` - change as appropriate.

## Question 1

Just to see what we're looking at, make a scatterplot using `ggplot()`, with `x` as `carat` and `y` as `price`, and add a regression line.

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# type your code below in this chunk

```

# Regression

## Question 2

Suppose we want to estimate the following relationship:

$$
\text{Price}_i = \beta_0+\beta_1 \, \text{carat}_i+u_i
$$

Run a regression of `price` on `carat` using `lm()` and get a `summary()` of it. Be sure to save your regression model as an object, we'll need it later.

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# type your code below in this chunk

```

## Question 3

### Part A

Write out the estimated regression equation.

<!--WRITE YOUR ANSWERS BELOW -->

### Part B

Make a regression table of the output (using the `modelsummary` package).

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# type your code below in this chunk

```

### Part C

What is the estimated $\hat{\beta}_1$ for this model? What does it mean in terms of the context of our question?

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# type your code below in this chunk

```

### Part D

Use the `broom` package's `tidy()` command on your regression object, and calculate confidence intervals for your estimates by setting `conf.int = T` inside `tidy()`.

What is the 95% confidence interval for $\hat{\beta}_1$, and what does it mean?

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# type your code below in this chunk

```

Save these endpoints as an object by running the following code (and changing it for your object names as needed):

```{r}
# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  select(conf.low, conf.high)

# look at it
ci_endpoints
```

### Part E

Save your estimated $\hat{\beta}_1$ as an object similarly by running and changing the code below if needed. We'll need it later.

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  pull(estimate)

# look at it
our_slope
```

## Confidence Intervals

## Question 4

### Part A

Let's generate a confidence interval for $\hat{\beta}_1$ by simulating the sampling distribution of $\hat{\beta}_1$. Run the following code, which will `specify()` the model relationship and `generate()` 1,000 repetitions using the `bootstrap` method of resampling data points randomly from our sample, with replacement.

```{r}
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot
```

Note this will take a few minutes, its doing a lot of calculations! What does it show you when you look at it?

<!--WRITE YOUR ANSWERS BELOW -->

### Part B

Continue by piping your object from Part A into `get_confidence_interval()`. Set `level = 0.95, type = "se"` and `point_estimate` equal to our estimated $\hat{\beta}_1$ (saved) from above as `our_slope`.

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se", 
                          point_estimate = our_slope) 
```

### Part C

Now instead of `get_confidence_interval()`, pipe your object from Part A into `visualize()` to see the sampling distribution of $\hat{\beta}_1$ we simulated. We can add `+ shade_ci(endpoints = ...)` setting the argument equal to whatever our object containing the confidence interval from Question 2 Part D (I have it named here as `ci_endpoints`).

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
boot %>%
  visualize()+
  shade_ci(endpoints = ci_endpoints) 
```

Compare your *simulated* confidence interval with the *theoretically-constructed* confidence interval from the output of `tidy()` from Question 3.

<!--WRITE YOUR ANSWERS BELOW -->

```{r}
# type your code below in this chunk

```

# Hypothesis Testing

## Question 5

Now let's test the following hypothesis:

$$
\begin{align*}
H_0 &= 0 \\
H_a &\neq 0 \\ \end{align*}
$$

### Part A

Is this a one-sided or a two-sided test?

<!--WRITE YOUR ANSWERS BELOW -->

### Part B

Look back at the regression `summary()` output from question 2, and the output of `tidy()` from question 3. What does it tell you about this hypothesis test?

<!--WRITE YOUR ANSWERS BELOW -->

### Part C

Let's now do run this hypothesis test with `infer`, which will simulate the sampling distribution under the null hypothesis that $\beta_1=$. Run the following code, which will `specify()` the model relationship and `hypothesize()` the null hypothesis that there is no relationship between X and Y (i.e. $\beta_1=0$), and `generate()` 1,000 repetitions using the `permute` method, which will center the distribution at 0, and then `calculate(stat = "slope")`.

```{r}
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
test_sims
```

Note this may also take a few minutes. What does it show you?

<!--WRITE YOUR ANSWERS BELOW -->

### Part D

Pipe your object from the previous part into the following code, which will `get_p_value()`. Inside this function, we are setting `obs_stat` equal to our β1\^ we found (from Question 2 part E), and set `direction = "both"` to run a *two*-sided test, since our alternative hypothesis is two-sided, $H_a: \beta_1 \neq 0$

```{r}
test_sims %>%
  get_p_value(obs_stat = our_slope,
              direction = "both")
```

Note the warning message that comes up!

### Part E

Instead of `get_p_value()`, pipe your simulations into the following code, which will `visualize()` the null distribution, and (in the second command), place our finding on it and `shade_p_value()`.

```{r}
test_sims %>%
  visualize()
```

```{r}
test_sims %>%
  visualize()+
  shade_p_value(obs_stat = our_slope,
                direction = "both") # for two-sided test
```

### Part F

Compare your *simulated* p-value with the *theoretically-constructed* p-value from the output of `summary`, and/or of `tidy()` from Question 2 and 3.

Both `summary` and `tidy()` also report the t-statistic (`t value` or `statistic`) on this test for `carat` (by default, that $H_0: \beta_1 = 0$).

What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:

$$\text{Test statistic} = \frac{\text{estimate — null hypothesis}}{\text{standard error of the estimate}}$$

<!--WRITE YOUR ANSWERS BELOW -->

The p-value is the probability of a t-statistic at least as large as ours if the null hypothesis were true. `R` constructs a t-distribution with `n-k-1` degrees of freedom (`n` is number of observations, `k` is number of X-variables) and calculates the probability in the tails of the distribution beyond this t value. You can calculate it yourself (for a two-sided test) with:

```{r}
2 * pt(your_t, # put your t-statistic here!
       df = your_df, # put the df number here!
       lower.tail = F) # we'll use the right tail
```