2.5 — Precision and Diagnostics — R Practice

Author

Answer Key

Published

September 28, 2022

Required Packages & Data

Load all the required packages we will use (note that I have already installed them into the cloud project) by running (clicking the green play button) the chunk below:

library(tidyverse) # your friend and mine
library(fivethirtyeight) # for the fandango data
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test

How do critics’ evaluations of movies affect user ratings?

Our data comes from fivethirtyeight’s public data repository, and was used to write this article about online movie ratings. You can find this data on fivethirtyeight’s GitHub repository and, if you wanted some practice, download and load in the data yourself using read_csv().
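For example, here is a quick optional sketch of reading it straight from GitHub (the file path is my assumption based on fivethirtyeight’s data repository; check the repo if it has moved):

# optional: read the raw csv directly from fivethirtyeight's GitHub repo
# (file path assumed; verify it in the repository first)
movies_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv")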

However, conveniently, some statistics professors wrote an R package called fivethirtyeight, which provides an easier way to import and work with the (pre-tidied) data! We loaded this package in the chunk above, and we will be using the fandango dataset from fivethirtyeight. To make it an explicit tibble in our workspace, run the code below.
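If you want a codebook describing each variable, run ?fandango in the console; the fivethirtyeight package documents each of its datasets.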

# save fandango dataset as "movies" tibble
movies <- fandango

Note that I am calling the tibble movies; feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it movies, so change as appropriate.

Exploratory Analysis

Let’s first take a look at the data:

movies
glimpse(movies)
Rows: 146
Columns: 23
$ film                       <chr> "Avengers: Age of Ultron", "Cinderella", "A…
$ year                       <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ rottentomatoes             <int> 74, 85, 80, 18, 14, 63, 42, 86, 99, 89, 84,…
$ rottentomatoes_user        <int> 86, 80, 90, 84, 28, 62, 53, 64, 82, 87, 77,…
$ metacritic                 <int> 66, 67, 64, 22, 29, 50, 53, 81, 81, 80, 71,…
$ metacritic_user            <dbl> 7.1, 7.5, 8.1, 4.7, 3.4, 6.8, 7.6, 6.8, 8.8…
$ imdb                       <dbl> 7.8, 7.1, 7.8, 5.4, 5.1, 7.2, 6.9, 6.5, 7.4…
$ fandango_stars             <dbl> 5.0, 5.0, 5.0, 5.0, 3.5, 4.5, 4.0, 4.0, 4.5…
$ fandango_ratingvalue       <dbl> 4.5, 4.5, 4.5, 4.5, 3.0, 4.0, 3.5, 3.5, 4.0…
$ rt_norm                    <dbl> 3.70, 4.25, 4.00, 0.90, 0.70, 3.15, 2.10, 4…
$ rt_user_norm               <dbl> 4.30, 4.00, 4.50, 4.20, 1.40, 3.10, 2.65, 3…
$ metacritic_norm            <dbl> 3.30, 3.35, 3.20, 1.10, 1.45, 2.50, 2.65, 4…
$ metacritic_user_nom        <dbl> 3.55, 3.75, 4.05, 2.35, 1.70, 3.40, 3.80, 3…
$ imdb_norm                  <dbl> 3.90, 3.55, 3.90, 2.70, 2.55, 3.60, 3.45, 3…
$ rt_norm_round              <dbl> 3.5, 4.5, 4.0, 1.0, 0.5, 3.0, 2.0, 4.5, 5.0…
$ rt_user_norm_round         <dbl> 4.5, 4.0, 4.5, 4.0, 1.5, 3.0, 2.5, 3.0, 4.0…
$ metacritic_norm_round      <dbl> 3.5, 3.5, 3.0, 1.0, 1.5, 2.5, 2.5, 4.0, 4.0…
$ metacritic_user_norm_round <dbl> 3.5, 4.0, 4.0, 2.5, 1.5, 3.5, 4.0, 3.5, 4.5…
$ imdb_norm_round            <dbl> 4.0, 3.5, 4.0, 2.5, 2.5, 3.5, 3.5, 3.5, 3.5…
$ metacritic_user_vote_count <int> 1330, 249, 627, 31, 88, 34, 17, 124, 62, 54…
$ imdb_user_vote_count       <int> 271107, 65709, 103660, 3136, 19560, 39373, …
$ fandango_votes             <int> 14846, 12640, 12055, 1793, 1021, 397, 252, …
$ fandango_difference        <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5…

How many variables are there? How many observations?
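From the top of the glimpse() output above, there are 146 observations (rows) of 23 variables (columns). You can also get these counts directly:

# rows and columns of the data
dim(movies)
[1] 146  23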

Question 1

The dataset contains a lot of variables (many are from different ratings websites like IMDB, Fandango, and Rotten Tomatoes). Let’s just use the Rotten Tomatoes data, and while we’re at it, keep just the few variables we’ll want to look at and rename them to make our lives easier going forward.

Put your data wrangling skills to the test again, and (1) rename the variable rottentomatoes to critics and the variable rottentomatoes_user to audience; and (2) keep only the variables film, year, critics, and audience.

I am going to overwrite movies since it’s all we’re going to be using. Feel free to make a new tibble named something else.

# type your code below in this chunk
movies <- movies %>%
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user) %>%
  select(film, year, critics, audience)

Question 2

Get the covariance and the correlation between critics and audience.

# type your code below in this chunk
movies %>%
  summarize(covariance = cov(critics, audience),
            correlation = cor(critics, audience))

Question 3

Make a scatterplot of audience (as y) on critics (as x) and add a regression line by adding an additional layer: geom_smooth(method = "lm").

# type your code below in this chunk
movies %>%
  ggplot()+
  aes(x = critics,
      y = audience)+
  geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

Regression

Question 4

Let’s try to estimate the effect of the critics’ score on the audience’s score. The population regression model we want to estimate is:

\[ \text{Audience Score}_i = \beta_0+\beta_1 \, \text{Critics Score}_i+u_i \]

Part A

Run your regression and save it as an object. Then get a summary() of your regression object.

# type your code below in this chunk
reg <- lm(audience ~ critics, data = movies)

reg %>% summary()

Call:
lm(formula = audience ~ critics, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.324  -7.995  -0.532   8.820  42.348 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.31553    2.34251   13.79   <2e-16 ***
critics      0.51868    0.03451   15.03   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.54 on 144 degrees of freedom
Multiple R-squared:  0.6106,    Adjusted R-squared:  0.6079 
F-statistic: 225.8 on 1 and 144 DF,  p-value: < 2.2e-16

Part B

What is the estimated \(\hat{\beta}_0\) for this model? What does it mean in terms of the graph, and in terms of the context of our question?

\[ \widehat{\text{Audience}_i} = 32.32 + 0.52 \, \text{Critics}_i \]

\(\hat{\beta}_0\) is 32.32. On the graph, this is where the estimated line crosses the vertical axis. We would predict that for a movie with a critics score of 0, the audience rating would be 32.32.

# I just want to show you how to extract these
reg_tidy <- reg %>%
  tidy()

beta_0_hat <- reg_tidy %>% # save this as beta_0_hat
  filter(term == "(Intercept)") %>%
  pull(estimate) %>% # extract the value of the estimate
  round(2) # round to 2 decimal places

beta_0_hat # look at it
[1] 32.32

Part C

What is the estimated \(\hat{\beta}_1\) for this model? What does it mean in terms of the graph, and in terms of the context of our question?

\(\hat{\beta}_1\) is 0.52. This is the estimated slope of the line, and the marginal effect of (a 1 unit change in) critics on audience: For every additional 1 point on critics score, we predict the audience rating would increase by 0.52 points.

beta_1_hat <- reg_tidy %>% # save this as beta_1_hat
  filter(term == "critics") %>%
  pull(estimate) %>% # extract the value of the estimate
  round(2) # round to 2 decimal places

beta_1_hat # look at it
[1] 0.52

Question 5

Let’s make our regression output a bit tidier using the commands from the broom package (already loaded above).

Part A

Use the tidy() function on your regression object saved from question 4. Save it as an object and look at it.

# type your code below in this chunk
reg_tidy <- reg %>%
  tidy()
reg_tidy

Part B

Run the glance() function on your original regression object. What does it show you? Find \(R^2\) and the SER and confirm they are the same as in the Base R lm() output.

# type your code below in this chunk
reg %>%
  glance()

\(R^2\) is r.squared (0.61). SER is sigma (12.54).

Part C

Now run the augment() function on your original regression object, and save this as an object. Look at it. What does it show you?

# type your code below in this chunk
reg_aug <- reg %>%
  augment()
reg_aug

It creates many new variables in a dataframe alongside our data (\(Y\) and \(X\)). The ones we’ll need are:

  • .fitted: the predicted values \(\hat{Y}_i\)

  • .resid: the residual values \(\hat{u}_i\)

Question 6

Let’s present our findings in a nice regression output table using the modelsummary package (already loaded above). You can try the default settings, which just means including your regression object name as the only argument inside modelsummary(). If you want to customize it, you can copy and tweak my code from the slides.

# type your code below in this chunk

# default settings
modelsummary(reg)
              Model 1
(Intercept)    32.316
              (2.343)
critics         0.519
              (0.035)
Num.Obs.      146
R2              0.611
R2 Adj.         0.608
AIC          1156.7
BIC          1165.7
Log.Lik.     −575.360
F             225.845
RMSE           12.45
# my customizations
modelsummary(models = list("Audience Score" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "critics" = "Critics Score"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
                Audience Score
Constant            32.32***
                    (2.34)
Critics Score        0.52***
                    (0.03)
n                  146
R2                   0.61
SER                 12.45
* p < 0.1, ** p < 0.05, *** p < 0.01

Note that when you run this chunk, it might not show anything; the table will display properly when you render this document.

Goodness of Fit

Question 7

Let’s use our results to interpret the goodness of fit statistics, and also calculate them ourselves to verify what they measure.

Part A

Look both at your results from the regression lm object in Question 4, and the glance() results from broom in Question 5B. What is the \(R^2\) for our estimated regression? What does it mean in the context of our question?

\(R^2\) is 0.61. This is a measure of the goodness of fit, specifically, the fraction of variation in audience scores that can be explained by our model (\(\widehat{\text{audience}}\)).

Part B

Use your results from Question 5C (the augment()ed data) to calculate \(R^2\) as the fraction of total variation in \(Y\) explained by variation in \(\hat{Y}\):

\[ R^2 = \frac{var(\text{.fitted})}{var(\text{audience})} \]

# type your code below in this chunk
reg_aug %>%
  summarize(var_Y_hat = var(.fitted),
            var_Y = var(audience),
            r_sq = var(.fitted)/var(audience))

Optional Challenge: calculate \(R^2\) as the fraction of variation explained using the sum of squares methods:

\[R^2 = \frac{SSM}{SST}=\frac{\displaystyle\sum^{n}_{i=1}(\hat{Y}_i-\bar{Y})^2}{\displaystyle\sum^{n}_{i=1}(Y_i-\bar{Y})^2}\]

\[ R^2=1-\left(\frac{SSR}{SST}\right)=1-\left(\frac{\displaystyle\sum^{n}_{i=1}\hat{u}_i^2}{\displaystyle\sum^{n}_{i=1}(Y_i-\bar{Y})^2}\right) \]

You can do so by defining a function I’ll call sum_sq, which takes the sum of the squared deviations of a variable from its mean:

sum_sq <- function(x){sum((x - mean(x))^2)}

and then use this on .fitted, .resid, and audience to find SSM, SSR, and SST.

# type your code below in this chunk
reg_aug %>%
  summarize(SSM = sum_sq(.fitted), #SSM
            SST = sum_sq(audience), #SST
            SSR = sum_sq(.resid), #SSR
            r_sq = SSM/SST, # fraction of explained variation in Y
            r_sq_alt = 1 - (SSR/SST)) # 1 - frac of unexplained variation

And, of course, you can square the correlation coefficient from Question 2.

# type your code below in this chunk
movies %>%
  summarize(correlation = cor(audience, critics),
            r_sq = correlation^2)

Part C

Calculate the standard error of the regression (SER), \(\hat{\sigma}_u\) using the .resid values.

\[ \hat{\sigma}_u=\sqrt{\frac{SSR}{n-2}}=\sqrt{\frac{\displaystyle\sum^{n}_{i=1}\hat{u}_i^2}{n-2}} \]

# type your code below in this chunk
reg_aug %>%
  summarize(SSR = sum(.resid^2),
            n = n(),
            SER = sqrt(SSR/(n-2))
            )

Also compare this to taking the standard deviation of .resid. How close is this to the SER, and why?

# type your code below in this chunk
reg_aug %>%
  summarize(sd(.resid))

They are very close because the sample is fairly large. The SER divides SSR by \(n-2=144\), while sd() divides by \(n-1=145\); with \(n=146\), these denominators are nearly identical, so the two calculations give approximately the same answer.

Predictions

Question 8

Now let’s use our model to make a prediction about what the audience score will be for a given critics score.

Part A

Suppose we have a movie that has a critics score of 74. Use the predict() function, which requires us to make a tibble of new data. So first make a tibble of critics values to use:

to_predict <- tibble(
  critics = c(74)
)

Then use the predict() function, which has two arguments: the saved regression object, and newdata, which we’ll set equal to our tibble (that I’ve called to_predict).

# type your code below in this chunk
predict(reg, newdata = to_predict)
       1 
70.69768 

Part B

We of course can calculate a prediction using the OLS regression equation. First, write out the equation we’ve estimated, and then use it to predict an \(\widehat{\text{audience}}_i\) score for a critics score of 74.

\[ \begin{align*} \widehat{\text{Audience}_i}&=\hat{\beta}_0+\hat{\beta}_1 \, \text{critics}_i \\ \widehat{\text{Audience}_i}&=32.32+0.52 \, (74) \\ \widehat{\text{Audience}_i}&=70.80\\ \end{align*} \]

We of course can also use R to extract the coefficients and calculate the prediction ourselves. Using the tidy regression object you saved in Question 5A, use filter() and pull() to save \(\hat{\beta}_0\) and \(\hat{\beta}_1\) as objects, and then use them to estimate:

\[ \widehat{\text{Audience}}_i=\hat{\beta}_0+\hat{\beta}_1 \, (74) \]

# type your code below in this chunk
b0 <- reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

b1 <- reg_tidy %>%
  filter(term == "critics") %>%
  pull(estimate)

b0 + b1 * 74
[1] 70.69768

Part C

One actual movie in the data that has a critics score of 74 is "Avengers: Age of Ultron". Find it in the data with filter().

# type your code below in this chunk
movies %>%
  filter(critics == 74)

Notice it has an actual audience score, since it’s in the data. Use the actual value and the predicted value to calculate the residual value for Avengers: Age of Ultron.

\[ \begin{align*} \hat{u}_{ultron}&=\text{audience}_{ultron} - \widehat{\text{audience}}_{ultron} \\ \hat{u}_{ultron}&=86-70.80\\ \hat{u}_{ultron}&=15.20\\ \end{align*} \]

# type your code below in this chunk
ultron_resid = 86 - (b0 + b1 * 74) 
ultron_resid
[1] 15.30232

Now look in the augmented regression object (saved from Question 5C) to find this movie (you can filter() by the critics score to help you find it). What is the .resid value? It should match yours, with some rounding error.

# type your code below in this chunk
reg_aug %>%
  filter(critics == 74)

The .resid value is 15.30.

Optional Challenge: Add another two layers to your scatterplot from Question 3 highlighting Avengers: Age of Ultron on the plot. filter your data to only include this movie, and save it as a tibble to use as the data for these new layers (I’ll pretend you called this ultron as an example): one layer should be geom_point(data = ultron) with a unique color. The second layer should be a geom_text_repel(data = ultron, aes(label = film)) layer with that color again. These two layers will plot the point and then a label for it in a unique color, so it stands out from the rest of your points.

# type your code below in this chunk
library(ggrepel) # forgot to install and load this!

# subset data to only look at Ultron
ultron <- movies %>%
  filter(film == "Avengers: Age of Ultron")

ggplot(data = movies)+
    aes(x = critics,
      y = audience)+
  geom_point()+
  geom_smooth(method = "lm")+
  geom_point(data = ultron, color = "red")+
  geom_text_repel(data = ultron,
                  aes(label = film),
                  color = "red")
`geom_smooth()` using formula 'y ~ x'

Diagnostics

Question 9

Part A

Now let’s start looking at the residuals of the regression. Take the augmented regression object saved from Question 5C and use it as the source of your data to create a histogram with ggplot(), setting aes(x = .resid). Does it look roughly normal?

# type your code below in this chunk
ggplot(data = reg_aug)+
  aes(x = .resid)+
  geom_histogram(color = "white")+
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

While we’re at it, let’s check the mean of the residuals to make sure it is in fact 0, since \(\mathbb{E}[\hat{u}]\) should be 0.

# type your code below in this chunk
reg_aug %>%
  summarize(mean = mean(.resid), # should be (essentially) zero
            sd = sd(.resid)) # standard deviation of the residuals

Part B

Now take the augmented regression object and make a residual plot, which is a scatterplot where x is the .fitted values \((\hat{Y}_i)\) and y is the .resid \((\hat{u}_i)\). Feel free to add a horizontal line at 0 with geom_hline(yintercept = 0) as an additional layer.

# type your code below in this chunk
ggplot(data = reg_aug)+
  aes(x = .fitted,
      y = .resid)+
  geom_point(color = "blue")+
  geom_hline(yintercept = 0, color = "red")+
  theme_classic()

Question 10

Let’s now check for heteroskedasticity of the residuals.

Part A

Looking at the scatterplot and residual plot, does it appear that the errors are homoskedastic or heteroskedastic?

It’s not very obvious, but it does seem that where critic scores are very high, there is a more tightly packed cluster of points near the line, and thus a lower variance of the errors at those levels compared to lower critic scores (where the errors are much larger).
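One informal way to see this more clearly (a quick sketch, not required by the question) is to plot the absolute residuals against the fitted values using the augmented data from Question 5C; a trend in the smoothed line suggests the spread of the errors changes with \(\hat{Y}\):

# informal visual check: does the typical size of the residuals
# change with the fitted values?
ggplot(data = reg_aug)+
  aes(x = .fitted,
      y = abs(.resid))+
  geom_point()+
  geom_smooth(method = "lm")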

Part B

Run the bptest() function (from the lmtest package, already loaded above) on your saved lm regression object. According to the test, is the data heteroskedastic or homoskedastic?

# type your code below in this chunk
reg %>%
  bptest()

    studentized Breusch-Pagan test

data:  .
BP = 7.9927, df = 1, p-value = 0.004697

Since the p-value of 0.005 is smaller than 0.05, we can reject the null hypothesis (that the errors are homoskedastic), and conclude the errors are in fact heteroskedastic.
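For intuition, here is a minimal "by hand" sketch of what bptest() does (assuming the reg_aug object saved in Question 5C): regress the squared residuals on the regressor, and compare \(n \times R^2\) from that auxiliary regression against a \(\chi^2\) distribution with 1 degree of freedom.

# auxiliary regression: squared residuals on the regressor
bp_aux <- lm(I(.resid^2) ~ critics, data = reg_aug)
# the (studentized) BP statistic is n * R^2 of the auxiliary regression
bp_stat <- nobs(bp_aux) * summary(bp_aux)$r.squared
# p-value from a chi-squared distribution with 1 df
pchisq(bp_stat, df = 1, lower.tail = FALSE)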

Part C

Now let’s make some heteroskedasticity-robust standard errors. Use the lm_robust() command (instead of the lm() command) from the estimatr package (already loaded above) to run a new regression (and save it). Make sure you add se_type = "stata" inside the command to calculate robust SEs (in the same way that the Stata software does…long story). Save this new regression object and get a summary of it. What has changed?

# type your code below in this chunk
# note: I kept lm_robust()'s default se_type (HC2) here; adding
# se_type = "stata" (HC1) gives very similar standard errors
reg_rse <- lm_robust(audience ~ critics, data = movies)

reg_rse %>% 
  summary()

Call:
lm_robust(formula = audience ~ critics, data = movies)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)  32.3155    2.67207   12.09 1.120e-23  27.0340  37.5971 144
critics       0.5187    0.03472   14.94 4.556e-31   0.4501   0.5873 144

Multiple R-squared:  0.6106 ,   Adjusted R-squared:  0.6079 
F-statistic: 223.2 on 1 and 144 DF,  p-value: < 2.2e-16

The only things that have changed are the Std. Error for \(\hat{\beta}_0\) and \(\hat{\beta}_1\), and even then, by very little.

Part D

Now let’s compare the normal standard errors and the robust standard errors in a new side-by-side regression table using modelsummary() . Use the same command you used for Question 6, but now we need to set the models argument equal to a list() of our original regression object and this new regression object from Part C (and optionally, you can name these two columns).

# type your code below in this chunk
# default settings
modelsummary(models = list(reg,
                           reg_rse))
              Model 1     Model 2
(Intercept)    32.316      32.316
              (2.343)     (2.672)
critics         0.519       0.519
              (0.035)     (0.035)
Num.Obs.      146         146
R2              0.611       0.611
R2 Adj.         0.608       0.608
AIC          1156.7      1156.7
BIC          1165.7      1165.7
Log.Lik.     −575.360
F             225.845
RMSE           12.45       12.45
# my customizations
modelsummary(models = list("Normal SEs" = reg,
                           "Robust SEs" = reg_rse),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "critics" = "Critics Score"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
                Normal SEs   Robust SEs
Constant         32.32***     32.32***
                 (2.34)       (2.67)
Critics Score     0.52***      0.52***
                 (0.03)       (0.03)
n               146          146
R2                0.61         0.61
SER              12.45        12.45
* p < 0.1, ** p < 0.05, *** p < 0.01

Question 11

Finally, let’s test for outliers. Looking at the scatterplot and regression line, do there appear to be any?

It’s not obvious that there are any major outliers. Perhaps the two movies that have a low (<25) critics score but a high (>80) audience score.
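To see which movies those are (a quick filter, just out of curiosity):

# find the movies with low critics scores but high audience scores
movies %>%
  filter(critics < 25, audience > 80)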

We can officially test for them using the outlierTest() command from the car package (already loaded above) on your saved regression object.

# type your code below in this chunk
outlierTest(reg)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
  rstudent unadjusted p-value Bonferroni p
4 3.547831         0.00052581     0.076769

The test did not pick up any significant outliers.
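If you’re curious which movie observation 4 (the one with the largest studentized residual) is, you can look it up by row number, since the rows of movies here match the observations used in reg:

# look up observation 4 flagged by the outlier test
movies %>%
  slice(4)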