---
title: "2.5 — Precision and Diagnostics — R Practice"
author: "Answer Key"
date: "September 28, 2022"
format:
  html:
    self-contained: true # so we don't need other files (like plot images)
    toc: true # show a table of contents
    toc-location: left
    theme: default
    df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
  echo: true # shows all code on rendered document
---
# Required Packages & Data
Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:
```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(fivethirtyeight) # for the fandango data
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test
```
**How do critics' evaluations of movies affect user ratings?**
Our data comes from fivethirtyeight's public data repository, and was used to write [this article about online movie ratings](https://fivethirtyeight.com/features/fandango-movies-ratings/). You can find this data on fivethirtyeight's [github repository](https://github.com/fivethirtyeight/data/tree/master/fandango) and, if you want some practice, download and load in the data yourself using `read_csv()`.\
\
However, conveniently, some statistics professors wrote an `R` package called `fivethirtyeight`, which provides an easier way to import and work with the (pre-tidied) data! We loaded this package in the chunk above, and we will be using the `fandango` dataset from `fivethirtyeight`. To make it an explicit tibble in our workspace, run the code below.
```{r}
# save fandango dataset as "movies" tibble
movies <- fandango
```
Note I am calling the tibble `movies`. Feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it `movies`, so change as appropriate.
# Exploratory Analysis
Let's first take a look at what the data looks like:
```{r}
movies
glimpse(movies)
```
How many variables are there? How many observations?
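One way to answer this directly in `R` (a quick sketch using base functions; we know from later in this document that there are 146 observations):
```{r}
# count observations (rows) and variables (columns)
nrow(movies) # number of observations (146 movies)
ncol(movies) # number of variables
dim(movies) # both at once: rows, then columns
```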
## Question 1
The dataset contains a lot of variables (many are from different ratings websites like IMDB, Fandango, and Rotten Tomatoes). Let's just use the Rotten Tomatoes data, and while we're at it, keep just the few variables we'll want to look at and rename them to make our lives easier going forward.
Put your data wrangling skills to the test again, and (1) rename the variable `rottentomatoes` to `critics` and the variable `rottentomatoes_user` to `audience`; and (2) keep only the variables `film`, `year`, `critics`, and `audience`.
**I am going to overwrite `movies` since it's all we're going to be using. Feel free to make a new tibble named something else.**
```{r}
# type your code below in this chunk
movies <- movies %>%
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user) %>%
  select(film, year, critics, audience)
```
## Question 2
Get the covariance and the correlation between `critics` and `audience`.
```{r}
# type your code below in this chunk
movies %>%
  summarize(covariance = cov(critics, audience),
            correlation = cor(critics, audience))
```
## Question 3
Make a scatterplot of `audience` (as `y`) on `critics` (as `x`) and add a regression line by adding an additional layer: `geom_smooth(method = "lm")` .
```{r}
# type your code below in this chunk
movies %>%
  ggplot()+
  aes(x = critics,
      y = audience)+
  geom_point()+
  geom_smooth(method = "lm")
```
# Regression
## Question 4
Let's try to estimate the effect of the critics' score on the audience's score. The population regression model we want to estimate is:
$$
\text{Audience Score}_i = \beta_0+\beta_1 \, \text{Critics Score}_i+u_i
$$
### Part A
Run your regression and save it as an object. Then get a `summary()` of your regression object.
```{r}
# type your code below in this chunk
reg <- lm(audience ~ critics, data = movies)
reg %>% summary()
```
### Part B
What is the estimated $\hat{\beta}_0$ for this model? What does it mean in terms of the graph, and in terms of the context of our question?
$$
\widehat{\text{Audience}_i} = 32.32 + 0.52 \, \text{Critics}_i
$$
$\hat{\beta}_0$ is 32.32. On the graph, this is where the estimated line crosses the vertical axis. We would predict that for a movie with a `critics` score of 0, the `audience` rating would be 32.32.
```{r}
# I just want to show you how to extract these
reg_tidy <- reg %>%
  tidy()

beta_0_hat <- reg_tidy %>% # save this as beta_0_hat
  filter(term == "(Intercept)") %>%
  pull(estimate) %>% # extract the value of the estimate
  round(2) # round to 2 decimal places

beta_0_hat # look at it
```
### Part C
What is the estimated $\hat{\beta}_1$ for this model? What does it mean in terms of the graph, and in terms of the context of our question?
$\hat{\beta}_1$ is 0.52. This is the estimated slope of the line, and the marginal effect of (a 1 unit change in) `critics` on `audience`: For every additional 1 point on `critics` score, we predict the `audience` rating would increase by 0.52 points.
```{r}
beta_1_hat <- reg_tidy %>% # save this as beta_1_hat
  filter(term == "critics") %>%
  pull(estimate) %>% # extract the value of the estimate
  round(2) # round to 2 decimal places

beta_1_hat # look at it
```
## Question 5
Let's make our regression output a bit tidier using the commands from the `broom` package (already loaded above).
### Part A
Use the `tidy()` function on your regression object saved from question 4. Save it as an object and look at it.
```{r}
# type your code below in this chunk
reg_tidy <- reg %>%
  tidy()

reg_tidy
```
### Part B
Run the `glance()` function on your original regression object. What does it show you? Find $R^2$ and the SER and confirm they are the same as in the Base R `lm` output.
```{r}
# type your code below in this chunk
reg %>%
  glance()
```
$R^2$ is the `r.squared` column (0.61). The SER is the `sigma` column (12.53).
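If you want to pull these values out programmatically rather than read them off the table, here is a sketch using `glance()` with `pull()`:
```{r}
# extract R^2 and the SER from the glance() output
reg %>%
  glance() %>%
  pull(r.squared) # R^2, about 0.61

reg %>%
  glance() %>%
  pull(sigma) # SER, about 12.53
```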
### Part C
Now run the `augment()` function on your original regression object, and save this as an object. Look at it. What does it show you?
```{r}
# type your code below in this chunk
reg_aug <- reg %>%
  augment()

reg_aug
```
It makes many new variables in a dataframe with our data $Y$ and $X$. What we'll need are:
- `.fitted`: the predicted values $\hat{Y}_i$
- `.resid`: the residual values $\hat{u}_i$
## Question 6
Let's present our findings in a nice regression output table using the `modelsummary` package (already loaded above). You can try the default settings which is just to include your regression object name as the only argument inside `modelsummary()`. If you want to customize it, you can copy and tweak my code from the [slides](https://metricsf22.classes.ryansafner.com/slides/2.5-slides.html#/using-modelsummary-iv).
```{r}
# type your code below in this chunk
# default settings
modelsummary(reg)
# my customizations
modelsummary(models = list("Audience Score" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "critics" = "Critics Score"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R^{2}", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R^{2}", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
             )
```
Note when you run this chunk, it might not show anything. The table will show properly when you render this document.
# Goodness of Fit
## Question 7
Let's use our results to interpret the goodness of fit statistics, and also calculate them ourselves to verify what they measure.
### Part A
Look both at your results from the regression `lm` object in Question 4, and the `glance()` results from `broom` in Question 5B. What is the $R^2$ for our estimated regression? What does it mean in the context of our question?
$R^2$ is 0.61. This is a measure of the goodness of fit, specifically, the fraction of variation in `audience` scores that can be explained by our model ($\widehat{\text{audience}}$).
### Part B
Use your results from Question 5C (the `augment()`ed data) to calculate $R^2$ as the fraction of total variation in $Y$ explained by variation in $\hat{Y}$:
$$
R^2 = \frac{var(\text{.fitted})}{var(\text{audience})}
$$
```{r}
# type your code below in this chunk
reg_aug %>%
  summarize(var_Y_hat = var(.fitted),
            var_Y = var(audience),
            r_sq = var(.fitted)/var(audience))
```
**Optional Challenge**: calculate $R^2$ as the fraction of variation explained using the sum of squares methods:
$$R^2 = \frac{SSM}{SST}=\frac{\displaystyle\sum^{n}_{i=1}(\hat{Y}_i-\bar{Y})^2}{\displaystyle\sum^{n}_{i=1}(Y_i-\bar{Y})^2}$$
$$
R^2=1-\left(\frac{SSR}{SST}\right)=1-\left(\frac{\displaystyle\sum^{n}_{i=1}\hat{u}_i^2}{\displaystyle\sum^{n}_{i=1}(Y_i-\bar{Y})^2}\right)
$$
You can do so by defining a function I'll call `sum_sq`, which takes the sum of the squared deviations of a variable from its mean:
```{r}
sum_sq <- function(x){sum((x - mean(x))^2)}
```
and then use this on `.fitted`, `.resid`, and `audience` to find SSM, SSR, and SST.
```{r}
# type your code below in this chunk
reg_aug %>%
  summarize(SSM = sum_sq(.fitted), # SSM
            SST = sum_sq(audience), # SST
            SSR = sum_sq(.resid), # SSR
            r_sq = SSM/SST, # fraction of explained variation in Y
            r_sq_alt = 1 - (SSR/SST)) # 1 - fraction of unexplained variation
```
And, of course, you can square the correlation coefficient from Question 2.
```{r}
# type your code below in this chunk
movies %>%
  summarize(correlation = cor(audience, critics),
            r_sq = correlation^2)
```
### Part C
Calculate the standard error of the regression (SER), $\hat{\sigma}_u$ using the `.resid` values.
$$
\hat{\sigma}_u=\sqrt{\frac{SSR}{n-2}}=\sqrt{\frac{\displaystyle\sum^{n}_{i=1}\hat{u}_i^2}{n-2}}
$$
```{r}
# type your code below in this chunk
reg_aug %>%
  summarize(SSR = sum(.resid^2),
            n = n(),
            SER = sqrt(SSR/(n-2)))
```
Also compare this to taking the standard deviation of `.resid`. How close is this to the SER, and why?
```{r}
# type your code below in this chunk
reg_aug %>%
  summarize(sd(.resid))
```
They are very close due to our fairly large sample size. The SER divides SSR by the degrees of freedom correction $n-2 = 144$, while `sd()` divides by $n-1 = 145$; with $n = 146$, these denominators are nearly identical, so the two calculations yield approximately the same answer.
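To see the two denominators side by side, here is a sketch; since the residuals have mean 0, `sd(.resid)` is just $\sqrt{SSR/(n-1)}$:
```{r}
reg_aug %>%
  summarize(SSR = sum(.resid^2),
            n = n(),
            SER = sqrt(SSR / (n - 2)), # degrees of freedom correction
            sd_resid = sqrt(SSR / (n - 1))) # what sd(.resid) computes
```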
# Predictions
## Question 8
Now let's use our model to make a prediction about what the `audience` score will be for a given `critics` score.
### Part A
Suppose we have a movie that has a critics score of 74. Use the `predict()` function, which requires us to make a tibble of new data to use. So first make a tibble of `critics` values to use:
```{r}
to_predict <- tibble(
  critics = c(74)
)
```
Then use the `predict()` function, which has two arguments: the saved regression object, and `newdata`, which we'll set equal to our tibble (which I've called `to_predict`).
```{r}
# type your code below in this chunk
predict(reg, newdata = to_predict)
```
### Part B
We of course can calculate a prediction using the OLS regression equation. First, write out the equation we've estimated, and then use it to predict an $\widehat{\text{audience}}_i$ score for a critics score of 74.
$$
\begin{align*}
\widehat{\text{Audience}_i}&=\hat{\beta}_0+\hat{\beta}_1 \, \text{critics}_i \\
\widehat{\text{Audience}_i}&=32.32+0.52 \, (74) \\
\widehat{\text{Audience}_i}&=70.80\\
\end{align*}
$$
We of course can also use `R` to extract the coefficients and calculate the prediction ourselves. Using the `tidy` regression object you saved in Question 5A, use `filter()` and `pull()` to save $\hat{\beta}_0$ and $\hat{\beta}_1$ as objects, and then use them to estimate:
$$
\widehat{\text{Audience}}_i=\hat{\beta}_0+\hat{\beta}_1 \, (74)
$$
```{r}
# type your code below in this chunk
b0 <- reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

b1 <- reg_tidy %>%
  filter(term == "critics") %>%
  pull(estimate)

b0 + b1 * 74
```
### Part C
One actual movie in the data that has a critics score of 74 is `"Avengers: Age of Ultron"`. Find it in the data with `filter()`.
```{r}
# type your code below in this chunk
movies %>%
filter(critics == 74)
```
Notice it has an *actual* `audience` score, since it's in the data. Use the actual value and the predicted value to calculate the residual value for *Avengers: Age of Ultron*.
$$
\begin{align*}
\hat{u}_{ultron}&=\text{audience}_{ultron} - \widehat{\text{audience}}_{ultron} \\
\hat{u}_{ultron}&=86-70.80\\
\hat{u}_{ultron}&=15.20\\
\end{align*}
$$
```{r}
# type your code below in this chunk
ultron_resid = 86 - (b0 + b1 * 74)
ultron_resid
```
Now look in the `augment`ed regression object (saved from Question 5C) to find this movie (you can `filter()` by the `critics` score to help you find it). What is the `.resid` value? It should match yours, with some rounding error.
```{r}
# type your code below in this chunk
reg_aug %>%
  filter(critics == 74)
```
The `.resid` value is 15.30.
**Optional Challenge**: Add another two layers to your scatterplot from Question 3 highlighting *Avengers: Age of Ultron* on the plot. `filter()` your data to only include this movie, and save it as a tibble to use as the `data` for these new layers (I'll pretend you called this `ultron` as an example): one layer should be `geom_point(data = ultron)` with a unique color. The second layer should be a `geom_text_repel(data = ultron, aes(label = film))` layer with that color again. These two layers will plot the point and then a label for it in a unique color, to stand out from the rest of your points.
```{r}
# type your code below in this chunk
library(ggrepel) # forgot to install and load this!

# subset data to only look at Ultron
ultron <- movies %>%
  filter(film == "Avengers: Age of Ultron")

ggplot(data = movies)+
  aes(x = critics,
      y = audience)+
  geom_point()+
  geom_smooth(method = "lm")+
  geom_point(data = ultron, color = "red")+
  geom_text_repel(data = ultron,
                  aes(label = film),
                  color = "red")
```
# Diagnostics
## Question 9
### Part A
Now let's start looking at the residuals of the regression. Take the `augment`ed regression object saved from Question 5C and use it as the source of your data to create a histogram with `ggplot()`, setting `aes(x = .resid)`. Does it look roughly normal?
```{r}
# type your code below in this chunk
ggplot(data = reg_aug)+
  aes(x = .resid)+
  geom_histogram(color = "white")+
  theme_classic()
```
While we're at it, let's check the mean of the residuals to make sure it is in fact 0, since $\mathbb{E}[\hat{u}]$ should be 0.
```{r}
# type your code below in this chunk
reg_aug %>%
  summarize(mean = mean(.resid),
            sd = sd(.resid))
```
### Part B
Now take the `augment`ed regression object and make a residual plot, which is a scatterplot where `x` is the `.fitted` values $(\hat{Y}_i)$ and `y` is the `.resid` $(\hat{u}_i)$. Feel free to add a horizontal line at 0 with `geom_hline(yintercept = 0)` as an additional layer.
```{r}
# type your code below in this chunk
ggplot(data = reg_aug)+
  aes(x = .fitted,
      y = .resid)+
  geom_point(color = "blue")+
  geom_hline(yintercept = 0, color = "red")+
  theme_classic()
```
## Question 10
Let's now check for heteroskedasticity of the residuals.
### Part A
Looking at the scatterplot and residual plot, does it appear that the errors are homoskedastic or heteroskedastic?
It's not very obvious, but it does seem that when critic scores are very high, there is a more tightly-packed cluster of points near the line, and thus a lower variance of the errors at these levels compared to lower critic scores (where there are much larger errors).
### Part B
Run the `bptest()` function (from the `lmtest` package, already loaded above) on your saved `lm` regression object. According to the test, is the data heteroskedastic or homoskedastic?
```{r}
# type your code below in this chunk
reg %>%
  bptest()
```
Since the p-value of 0.005 is smaller than 0.05, we can reject the null hypothesis (that the errors are homoskedastic), and conclude the errors are in fact heteroskedastic.
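For intuition, the (studentized) Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the regressor, then take $n \times R^2$ of that auxiliary regression. Here is a sketch, which should approximately match the default `bptest()` output above:
```{r}
# auxiliary regression: squared residuals on critics
aux_reg <- lm(I(.resid^2) ~ critics, data = reg_aug)

# test statistic: n * R^2 of the auxiliary regression,
# distributed chi-squared with 1 degree of freedom under the null
bp_stat <- nrow(reg_aug) * glance(aux_reg)$r.squared
bp_stat
pchisq(bp_stat, df = 1, lower.tail = FALSE) # p-value
```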
### Part C
Now let's make some heteroskedasticity-robust standard errors. Use the `lm_robust()` command (instead of the `lm()` command) from the `estimatr` package (already loaded above) to run a new regression (and save it). Make sure you add `se_type = "stata"` inside the command to calculate robust SEs (in the same way that the Stata software does...long story). Save this new regression object and get a `summary` of it. What has changed?
```{r}
# type your code below in this chunk
reg_rse <- lm_robust(audience ~ critics,
                     data = movies,
                     se_type = "stata") # heteroskedasticity-robust SEs

reg_rse %>%
  summary()
```
The *only* things that have changed are the `Std. Error` for $\hat{\beta}_0$ and $\hat{\beta}_1$, and even then, by very little.
### Part D
Now let's compare the normal standard errors and the robust standard errors in a new side-by-side regression table using `modelsummary()` . Use the same command you used for Question 6, but now we need to set the `models` argument equal to a `list()` of our original regression object and this new regression object from Part C (and optionally, you can name these two columns).
```{r}
# type your code below in this chunk
# default settings
modelsummary(models = list(reg,
                           reg_rse))

# my customizations
modelsummary(models = list("Normal SEs" = reg,
                           "Robust SEs" = reg_rse),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "critics" = "Critics Score"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R^{2}", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R^{2}", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
             )
```
## Question 11
Finally, let's test for outliers. Looking at the scatterplot and regression line, do there appear to be any?
**It's not obvious that there are any major outliers. Perhaps the two movies that have a low (\<25) critics score but a high (\>80) audience score.**
We can officially test for them using the `outlierTest()` command from the `car` package (already loaded above) on your saved regression object.
```{r}
# type your code below in this chunk
outlierTest(reg)
```
The test did not pick up any significant outliers.
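For reference, `outlierTest()` works off studentized residuals (each residual scaled by its standard error from a regression that leaves that observation out). Here is a sketch of inspecting the most extreme ones ourselves, assuming no missing values so the rows of `movies` line up with the model:
```{r}
# attach studentized residuals and look at the most extreme movies
movies %>%
  mutate(.rstudent = rstudent(reg)) %>%
  arrange(desc(abs(.rstudent))) %>%
  head(5)
```
As a rough rule of thumb, studentized residuals larger than about 2 or 3 in absolute value are worth a closer look, though the formal test applies a Bonferroni correction for testing all $n$ observations.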