---
title: "2.7 — Hypothesis Testing for Regression — R Practice"
author: "Your Name Here!"
date: "October 5, 2022"
format:
  html:
    self-contained: true # so we don't need other files (like plot images)
    toc: true # show a table of contents
    toc-location: left
    theme: default
    df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
  echo: true # shows all code on rendered document
---

# Required Packages & Data

Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:

```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(fivethirtyeight) # for the fandango data
library(broom) # for tidy regression
library(infer) # for simulation-based inference (bootstrap & permutation)
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test
```

**What is the effect of a diamond's carat size on its price?**

We will use the `diamonds` example dataset that comes bundled with `ggplot2`, automatically loaded with `tidyverse`. To make it an explicit tibble in our workspace, run the code below.

```{r}
diamonds <- diamonds
```

Note I am calling the tibble `diamonds`; feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it `diamonds` - change as appropriate.

## Question 1

Just to see what we're looking at, make a scatterplot using `ggplot()`, with `x` as `carat` and `y` as `price`, and add a regression line.

```{r}
# type your code below in this chunk

```

# Regression

## Question 2

Suppose we want to estimate the following relationship:

$$
\text{Price}_i = \beta_0 + \beta_1 \, \text{carat}_i + u_i
$$

Run a regression of `price` on `carat` using `lm()` and get a `summary()` of it. Be sure to save your regression model as an object; we'll need it later.
```{r}
# type your code below in this chunk

```

## Question 3

### Part A

Write out the estimated regression equation.

### Part B

Make a regression table of the output (using the `modelsummary` package).

```{r}
# type your code below in this chunk

```

### Part C

What is the estimated $\hat{\beta}_1$ for this model? What does it mean in terms of the context of our question?

```{r}
# type your code below in this chunk

```

### Part D

Use the `broom` package's `tidy()` command on your regression object, and calculate confidence intervals for your estimates by setting `conf.int = T` inside `tidy()`. What is the 95% confidence interval for $\hat{\beta}_1$, and what does it mean?

```{r}
# type your code below in this chunk

```

Save these endpoints as an object by running the following code (changing it to match your object names as needed):

```{r}
# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  select(conf.low, conf.high)

# look at it
ci_endpoints
```

### Part E

Save your estimated $\hat{\beta}_1$ as an object similarly by running (and changing, if needed) the code below. We'll need it later.

```{r}
# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  pull(estimate)

# look at it
our_slope
```

# Confidence Intervals

## Question 4

### Part A

Let's generate a confidence interval for $\hat{\beta}_1$ by simulating the sampling distribution of $\hat{\beta}_1$. Run the following code, which will `specify()` the model relationship and `generate()` 1,000 repetitions using the `bootstrap` method of resampling data points randomly from our sample, with replacement.

```{r}
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot
```

Note this will take a few minutes; it's doing a lot of calculations! What does it show you when you look at it?

### Part B

Continue by piping your object from Part A into `get_confidence_interval()`. Set `level = 0.95, type = "se"`, and `point_estimate` equal to our estimated $\hat{\beta}_1$ saved from above as `our_slope`.

```{r}
boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se",
                          point_estimate = our_slope)
```

### Part C

Now, instead of `get_confidence_interval()`, pipe your object from Part A into `visualize()` to see the sampling distribution of $\hat{\beta}_1$ we simulated. We can add `+ shade_ci(endpoints = ...)`, setting the argument equal to whatever object contains our confidence interval from Question 3 Part D (I have it named here as `ci_endpoints`).

```{r}
boot %>%
  visualize() +
  shade_ci(endpoints = ci_endpoints)
```

Compare your *simulated* confidence interval with the *theoretically-constructed* confidence interval from the output of `tidy()` from Question 3.

```{r}
# type your code below in this chunk

```

# Hypothesis Testing

## Question 5

Now let's test the following hypothesis:

$$
\begin{align*}
H_0: \beta_1 &= 0 \\
H_a: \beta_1 &\neq 0 \\
\end{align*}
$$

### Part A

Is this a one-sided or a two-sided test?

### Part B

Look back at the regression `summary()` output from Question 2, and the output of `tidy()` from Question 3. What do they tell you about this hypothesis test?

### Part C

Let's now run this hypothesis test with `infer`, which will simulate the sampling distribution under the null hypothesis that $\beta_1 = 0$. Run the following code, which will `specify()` the model relationship and `hypothesize()` the null hypothesis that there is no relationship between X and Y (i.e.
$\beta_1 = 0$), and `generate()` 1,000 repetitions using the `permute` method, which will center the distribution at 0, and then `calculate(stat = "slope")`.

```{r}
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
test_sims
```

Note this may also take a few minutes. What does it show you?

### Part D

Pipe your object from the previous part into the following code, which will `get_p_value()`. Inside this function, we set `obs_stat` equal to the $\hat{\beta}_1$ we found (saved as `our_slope` in Question 3 Part E), and set `direction = "both"` to run a *two*-sided test, since our alternative hypothesis is two-sided, $H_a: \beta_1 \neq 0$.

```{r}
test_sims %>%
  get_p_value(obs_stat = our_slope,
              direction = "both")
```

Note the warning message that comes up!

### Part E

Instead of `get_p_value()`, pipe your simulations into the following code, which will `visualize()` the null distribution, and (in the second command) place our finding on it and `shade_p_value()`.

```{r}
test_sims %>%
  visualize()
```

```{r}
test_sims %>%
  visualize() +
  shade_p_value(obs_stat = our_slope,
                direction = "both") # for two-sided test
```

### Part F

Compare your *simulated* p-value with the *theoretically-constructed* p-value from the output of `summary()` and/or `tidy()` from Questions 2 and 3.

Both `summary()` and `tidy()` also report the t-statistic (`t value` or `statistic`) on this test for `carat` (by default, that $H_0: \beta_1 = 0$). What is the estimated test statistic for this model, and what does this number mean?
Try to calculate it yourself with the formula:

$$
\text{Test statistic} = \frac{\text{estimate} - \text{null hypothesis value}}{\text{standard error of the estimate}}
$$

The p-value is the probability of a t-statistic at least as large (in absolute value) as ours if the null hypothesis were true. `R` constructs a t-distribution with `n-k-1` degrees of freedom (`n` is the number of observations, `k` is the number of X-variables) and calculates the probability in the tails of the distribution beyond this t value. You can calculate it yourself (for a two-sided test) with:

```{r}
#| eval: false
2 * pt(your_t, # put your t-statistic here!
       df = your_df, # put the df number here!
       lower.tail = F) # we'll use the right tail
```
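As a sketch of the whole calculation (this assumes your saved `lm()` object from Question 2 is named `reg` — change as appropriate):

```{r}
#| eval: false
# pull the carat row from the tidy regression output
carat_row <- tidy(reg) %>%
  filter(term == "carat")

# test statistic: (estimate - null hypothesis value) / standard error
t_by_hand <- (carat_row$estimate - 0) / carat_row$std.error

# degrees of freedom: n - k - 1, with k = 1 X-variable here
df_by_hand <- nobs(reg) - 1 - 1

# two-sided p-value: probability in both tails beyond our t
2 * pt(abs(t_by_hand),
       df = df_by_hand,
       lower.tail = FALSE)

t_by_hand # compare to the `statistic` column from tidy()
```

Your `t_by_hand` should match the `t value` from `summary()` (and the `statistic` from `tidy()`), and the p-value should match theirs — though with such a strong relationship it may print as essentially 0.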