---
title: "2.7 — Hypothesis Testing for Regression — R Practice"
author: "Answer Key"
date: "October 12, 2022"
format:
  html:
    self-contained: true # so we don't need other files (like plot images)
    toc: true # show a table of contents
    toc-location: left
    theme: default
    df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
  echo: true # shows all code on rendered document
---

# Required Packages & Data

Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:

```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(infer) # for simulating inference
```

**What is the effect of a diamond's carat size on its price?**

We will use the `diamonds` example dataset that comes bundled with `ggplot2`, automatically loaded with `tidyverse`. To make it an explicit tibble in our workspace, run the code below.

```{r}
diamonds <- diamonds
```

Note I am calling the tibble `diamonds`; feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it `diamonds` - change as appropriate.

## Question 1

Just to see what we're looking at, make a scatterplot using `ggplot()`, with `x` as `carat` and `y` as `price`, and add a regression line.

```{r}
# type your code below in this chunk
ggplot(data = diamonds) +
  aes(x = carat, y = price) +
  geom_point() +
  geom_smooth(method = "lm")
```

# Regression

## Question 2

Suppose we want to estimate the following relationship:

$$\text{Price}_i = \beta_0 + \beta_1 \, \text{carat}_i + u_i$$

Run a regression of `price` on `carat` using `lm()` and get a `summary()` of it. Be sure to save your regression model as an object; we'll need it later.
```{r}
# type your code below in this chunk
reg <- lm(price ~ carat, data = diamonds)
summary(reg)
```

## Question 3

### Part A

Write out the estimated regression equation.

$$\widehat{\text{price}}_i = -2256.36 + 7756.43 \, \text{carat}_i$$

### Part B

Make a regression table of the output (using the `modelsummary` package).

```{r}
# type your code below in this chunk
modelsummary(models = list("Price" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "carat" = "Carat"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R2", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R2", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
             )
```

### Part C

What is the estimated $\hat{\beta}_1$ for this model? What does it mean in terms of the context of our question?

$\hat{\beta}_1$, the estimated slope, is 7756.43. For every 1 carat, we expect the price of the diamond to increase by \$7,756.43.

```{r}
# type your code below in this chunk

```

### Part D

Use the `broom` package's `tidy()` command on your regression object, and calculate confidence intervals for your estimates by setting `conf.int = T` inside `tidy()`. What is the 95% confidence interval for $\hat{\beta}_1$, and what does it mean?
```{r}
# type your code below in this chunk
reg_tidy <- reg %>%
  tidy(conf.int = T)
reg_tidy
```

**We are 95% confident that in similarly constructed samples, the true effect of carat on price ($\beta_1$) is between \$7,728.86 and \$7,784.00.**

Save these endpoints as an object by running the following code (and changing it for your object names as needed):

```{r}
# save confidence interval
ci_endpoints <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  select(conf.low, conf.high)

# look at it
ci_endpoints
```

### Part E

Save your estimated $\hat{\beta}_1$ as an object similarly by running and changing the code below if needed. We'll need it later.

```{r}
# save our slope
our_slope <- reg_tidy %>% # use whatever your tidy regression object is named
  filter(term == "carat") %>%
  pull(estimate)

# look at it
our_slope
```

## Confidence Intervals

## Question 4

### Part A

Let's generate a confidence interval for $\hat{\beta}_1$ by simulating the sampling distribution of $\hat{\beta}_1$. Run the following code, which will `specify()` the model relationship and `generate()` 1,000 repetitions using the `bootstrap` method of resampling data points randomly from our sample, with replacement.

```{r}
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot
```

Note this will take a few minutes; it's doing a lot of calculations! What does it show you when you look at it?

**This produces 1,000 bootstrapped samples, and calculates the estimated regression slope $\hat{\beta}_1$ for each sample.**

**NOTE: R Studio Cloud may crash when running this (it's 53,940,000 observations!)
--- another good reason why, while it is convenient, you really should use your own R Studio app on your computer, rather than relying on the cloud!**

### Part B

Continue by piping your object from Part A into `get_confidence_interval()`. Set `level = 0.95`, `type = "se"`, and `point_estimate` equal to our estimated $\hat{\beta}_1$ (saved) from above as `our_slope`.

```{r}
boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se",
                          point_estimate = our_slope)
```

**This gives us a confidence interval of \$7,706.95 to \$7,805.91. This is not exactly equal to, but quite close to, R's automatic calculation using the theoretical t-distribution.**

### Part C

Now instead of `get_confidence_interval()`, pipe your object from Part A into `visualize()` to see the sampling distribution of $\hat{\beta}_1$ we simulated. We can add `+ shade_ci(endpoints = ...)`, setting the argument equal to whatever object contains the confidence interval from Question 3 Part D (I have it named here as `ci_endpoints`).

```{r}
boot %>%
  visualize() +
  shade_ci(endpoints = ci_endpoints)
```

Compare your *simulated* confidence interval with the *theoretically-constructed* confidence interval from the output of `tidy()` from Question 3.

```{r}
# type your code below in this chunk

```

# Hypothesis Testing

## Question 5

Now let's test the following hypothesis:

\begin{align*}
H_0: & \, \beta_1 = 0 \\
H_a: & \, \beta_1 \neq 0 \\
\end{align*}

### Part A

Is this a one-sided or a two-sided test?

**This is a two-sided test, because we are considering the possibility that $\beta_1$ may be above or below the null hypothesis value of 0.**

### Part B

Look back at the regression `summary()` output from Question 2, and the output of `tidy()` from Question 3. What does it tell you about this hypothesis test?

**Reading the second row of estimates from the lm summary or tidy output of Question 2, we saw that the estimate on carat ($\hat{\beta}_1$) was 7756.43, with a standard error of 14.07.
This yields a test statistic value of 551.4, and a p-value of basically 0.**

### Part C

Let's now run this hypothesis test with `infer`, which will simulate the sampling distribution under the null hypothesis that $\beta_1=0$. Run the following code, which will `specify()` the model relationship, `hypothesize()` the null hypothesis that there is no relationship between X and Y (i.e. $\beta_1=0$), `generate()` 1,000 repetitions using the `permute` method, which will center the distribution at 0, and then `calculate(stat = "slope")`.

```{r}
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
test_sims
```

Note this may also take a few minutes. What does it show you?

**This produces 1,000 permuted samples, under the assumption that the true slope is 0 (the null hypothesis), and calculates the estimated regression slope $\hat{\beta}_1$ for each sample.**

### Part D

Pipe your object from the previous part into the following code, which will `get_p_value()`. Inside this function, we are setting `obs_stat` equal to the $\hat{\beta}_1$ we found (from Question 3 Part E), and setting `direction = "both"` to run a *two*-sided test, since our alternative hypothesis is two-sided, $H_a: \beta_1 \neq 0$.

```{r}
test_sims %>%
  get_p_value(obs_stat = our_slope,
              direction = "both")
```

Note the warning message that comes up!

### Part E

Instead of `get_p_value()`, pipe your simulations into the following code, which will `visualize()` the null distribution, and (in the second command) place our finding on it and `shade_p_value()`.
```{r}
test_sims %>%
  visualize()
```

```{r}
test_sims %>%
  visualize() +
  shade_p_value(obs_stat = our_slope,
                direction = "both") # for two-sided test
```

### Part F

Compare your *simulated* p-value with the *theoretically-constructed* p-value from the output of `summary()`, and/or of `tidy()` from Questions 2 and 3.

Both `summary()` and `tidy()` also report the t-statistic (`t value` or `statistic`) on this test for carat (by default, that $H_0: \beta_1 = 0$). What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:

$$\text{Test statistic} = \frac{\text{estimate} - \text{null hypothesis}}{\text{standard error of the estimate}}$$

\begin{align*}
t &= \frac{7756.43-0}{14.07} \\
& \approx 551.3
\end{align*}

(This differs slightly from R's reported 551.4 only because we rounded the estimate and standard error before dividing.)

**If the null hypothesis were true ($\beta_1=0$), the probability that we get a test statistic at least as extreme as 551.3 (essentially, 551 standard deviations away!!) is virtually 0. You can see this when we visualize using `infer`, above.**

The p-value is the probability of a t-statistic at least as large as ours if the null hypothesis were true. R constructs a t-distribution with $n-k-1$ degrees of freedom ($n$ is the number of observations, $k$ is the number of X-variables) and calculates the probability in the tails of the distribution beyond this t value. You can calculate it yourself (for a two-sided test) with:

```{r}
2 * pt(551.3, # our t-statistic
       df = 53938, # n - 2
       lower.tail = F) # we'll use the right tail
```
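As a final check, here is a sketch of the same calculation done without rounding, pulling the estimate and standard error straight out of `tidy()`. (The chunk re-fits the regression so it is self-contained; in your own document you can reuse your saved `reg` object instead.)

```{r}
library(broom) # for tidy()

# re-fit the same regression so this chunk stands alone
reg <- lm(price ~ carat, data = ggplot2::diamonds)
carat_row <- subset(tidy(reg), term == "carat")

# t = (estimate - null hypothesis) / standard error, with no rounding
t_stat <- (carat_row$estimate - 0) / carat_row$std.error
t_stat # about 551.4, matching summary()'s t value

# two-sided p-value from a t-distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = nobs(reg) - 2, lower.tail = FALSE)
p_val # so small it is effectively 0
```

Note that using the unrounded standard error recovers R's reported t value of 551.4 rather than the 551.3 we got by hand above.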