2.5 — Precision and Diagnostics — R Practice

Author

Answer Key

Published

September 28, 2022

Required Packages & Data

Load all the required packages we will use (note I have installed them already into the cloud project) by running (clicking the green play button) the chunk below:

library(tidyverse) # your friend and mine
library(fivethirtyeight) # for the fandango data
library(broom) # for tidy regression
library(lmtest) # for heteroskedasticity test
library(estimatr) # for robust SEs
library(modelsummary) # for nice regression tables
library(car) # for outlier test

How do critics evaluation of movies affect user ratings?

Our data comes from fivethirtyeight’s public data repository, and was used to write this article about online movie ratings. You could find this data on fivethirtyeight’s github repository and, if you wanted some practice, download and load in the data yourself using read_csv().

However, conveniently, some statistics professors wrote an R package called fivethirtyeight, which provides an easier way to import and work with the (pre-tidied) data! We loaded this package in the chunk above, and we will be using the fandango dataset from fivethirtyeight. To make it an explicit tibble in our workspace, run the code below.

# save fandango dataset as "movies" tibble
movies <- fandango

Note I am calling the tibble movies, feel free to name it whatever you like, but I will assume in my code chunks below that you are calling it movies - change as appropriate.

Exploratory Analysis

Let’s first take a look at what the data looks like:

movies
glimpse(movies)
Rows: 146
Columns: 23
$ film                       <chr> "Avengers: Age of Ultron", "Cinderella", "A…
$ year                       <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ rottentomatoes             <int> 74, 85, 80, 18, 14, 63, 42, 86, 99, 89, 84,…
$ rottentomatoes_user        <int> 86, 80, 90, 84, 28, 62, 53, 64, 82, 87, 77,…
$ metacritic                 <int> 66, 67, 64, 22, 29, 50, 53, 81, 81, 80, 71,…
$ metacritic_user            <dbl> 7.1, 7.5, 8.1, 4.7, 3.4, 6.8, 7.6, 6.8, 8.8…
$ imdb                       <dbl> 7.8, 7.1, 7.8, 5.4, 5.1, 7.2, 6.9, 6.5, 7.4…
$ fandango_stars             <dbl> 5.0, 5.0, 5.0, 5.0, 3.5, 4.5, 4.0, 4.0, 4.5…
$ fandango_ratingvalue       <dbl> 4.5, 4.5, 4.5, 4.5, 3.0, 4.0, 3.5, 3.5, 4.0…
$ rt_norm                    <dbl> 3.70, 4.25, 4.00, 0.90, 0.70, 3.15, 2.10, 4…
$ rt_user_norm               <dbl> 4.30, 4.00, 4.50, 4.20, 1.40, 3.10, 2.65, 3…
$ metacritic_norm            <dbl> 3.30, 3.35, 3.20, 1.10, 1.45, 2.50, 2.65, 4…
$ metacritic_user_nom        <dbl> 3.55, 3.75, 4.05, 2.35, 1.70, 3.40, 3.80, 3…
$ imdb_norm                  <dbl> 3.90, 3.55, 3.90, 2.70, 2.55, 3.60, 3.45, 3…
$ rt_norm_round              <dbl> 3.5, 4.5, 4.0, 1.0, 0.5, 3.0, 2.0, 4.5, 5.0…
$ rt_user_norm_round         <dbl> 4.5, 4.0, 4.5, 4.0, 1.5, 3.0, 2.5, 3.0, 4.0…
$ metacritic_norm_round      <dbl> 3.5, 3.5, 3.0, 1.0, 1.5, 2.5, 2.5, 4.0, 4.0…
$ metacritic_user_norm_round <dbl> 3.5, 4.0, 4.0, 2.5, 1.5, 3.5, 4.0, 3.5, 4.5…
$ imdb_norm_round            <dbl> 4.0, 3.5, 4.0, 2.5, 2.5, 3.5, 3.5, 3.5, 3.5…
$ metacritic_user_vote_count <int> 1330, 249, 627, 31, 88, 34, 17, 124, 62, 54…
$ imdb_user_vote_count       <int> 271107, 65709, 103660, 3136, 19560, 39373, …
$ fandango_votes             <int> 14846, 12640, 12055, 1793, 1021, 397, 252, …
$ fandango_difference        <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5…

How many variables are there? How many observations?

Question 1

The dataset contains a lot of variables (many are from different ratings websites like IMDB, Fandango, and Rotten Tomatoes. Let’s just use the rotten tomatoes data, and while we’re at it, keep just the few variables we’ll want to look at and rename them to make our lives easier going forward.

Put your data wrangling skills to the test again, and (1) rename the variable rottentomatoes to critics and the variable rottentomatoes_user to audience; and (2) keep only the variables film, year, critics, and audience.

I am going to overwrite movies since it’s all we’re going to be using. Feel free to make a new tibble named something else.

# type your code below in this chunk
movies <- movies %>%
  rename(critics = rottentomatoes,
         audience = rottentomatoes_user) %>%
  select(film, year, critics, audience)

Question 2

Get the covariance and the correlation between critics and audience.

# type your code below in this chunk
movies %>%
  summarize(covariance = cov(critics, audience),
            correlation = cor(critics, audience))

Question 3

Make a scatterplot of audience (as y) on critics (as x) and add a regression line by adding an additional layer: geom_smooth(method = "lm") .

# type your code below in this chunk
movies %>%
  ggplot()+
  aes(x = critics,
      y = audience)+
  geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

Regression

Question 4

Let’s try to estimate the effect of the critics’ score on the audience’s score. The population regression model we want to estimate is:

\[ \text{Audience Score}_i = \beta_0+\beta_1 \, \text{Critics Score}_i+u_i \]

Part A

Run your regression and save it as an object. Then get a summary() of your regression object.

# type your code below in this chunk
reg <- lm(audience ~ critics, data = movies)

reg %>% summary()

Call:
lm(formula = audience ~ critics, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.324  -7.995  -0.532   8.820  42.348 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.31553    2.34251   13.79   <2e-16 ***
critics      0.51868    0.03451   15.03   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.54 on 144 degrees of freedom
Multiple R-squared:  0.6106,    Adjusted R-squared:  0.6079 
F-statistic: 225.8 on 1 and 144 DF,  p-value: < 2.2e-16

Part B

What is the estimated \(\hat{\beta}_0\) for this model? What does it mean in terms of the graph, and in terms of the context of our question?

\[ \widehat{\text{Audience}_i} = 32.32 + 0.52 \, \text{Critics}_i \]

\(\hat{\beta}_0\) is 32.32. On the graph, this is where the estimated line crosses the vertical axis. We would predict that for a movie with a critics score of 0, the audience rating would be 32.32.

# I just want to show you how to extract these
reg_tidy <- reg %>%
  tidy()

beta_0_hat <- reg_tidy %>% # save this as beta_0_hat
  filter(term == "(Intercept)") %>%
  pull(estimate) %>% # extract the value of the estimate
  round(2) # round to 2 decimal places

beta_0_hat # look at it
[1] 32.32

Part C

What is the estimated \(\hat{\beta}_1\) for this model? What does it mean in terms of the graph, and in terms of the context of our question?

\(\hat{\beta}_1\) is 0.52. This is the estimated slope of the line, and the marginal effect of (a 1 unit change in) critics on audience: For every additional 1 point on critics score, we predict the audience rating would increase by 0.52 points.

beta_1_hat <- reg_tidy %>% # save this as beta_1_hat
  filter(term == "critics") %>% #
  pull(estimate) %>% # extract the value of the estimate
  round(2) # round to 2 decimal places

beta_1_hat # look at it
[1] 0.52

Question 5

Let’s make our regression output a bit tidier using the commands from the broom package (already loaded above).

Part A

Use the tidy() function on your regression object saved from question 4. Save it as an object and look at it.

# type your code below in this chunk
reg_tidy <- reg %>%
  tidy()
reg_tidy

Part B

Run the glance() function on your original regression object. What does it show you? Find \(R^2\) and the SER and confirm they are the same from the Base R lm output.

# type your code below in this chunk
reg %>%
  glance()

\(R^2\) is r.squared (0.61). SER is sigma (12.53).

Part C

Now run the augment() function on your original regression object, and save this as an object. Look at it. What does it show you?

# type your code below in this chunk
reg_aug <- reg %>%
  augment()
reg_aug

It makes many new variables in a dataframe with our data \(Y\) and \(X\). What we’ll need are:

  • .fitted: the predicted values \(\hat{Y}_i\)

  • .resid: the residual values \(\hat{u}_i\)

Question 6

Let’s present our findings in a nice regression output table using the modelsummary package (already loaded above). You can try the default settings which is just to include your regression object name as the only argument inside modelsummary(). If you want to customize it, you can copy and tweak my code from the slides.

# type your code below in this chunk

# default settings
modelsummary(reg)
Model 1
(Intercept) 32.316
(2.343)
critics 0.519
(0.035)
Num.Obs. 146
R2 0.611
R2 Adj. 0.608
AIC 1156.7
BIC 1165.7
Log.Lik. −575.360
F 225.845
RMSE 12.45
# my customizations
modelsummary(models = list("Audience Score" = reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "critics" = "Critics Score"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               #list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Audience Score
Constant 32.32***
(2.34)
Critics Score 0.52***
(0.03)
n 146
R2 0.61
SER 12.45
* p < 0.1, ** p < 0.05, *** p < 0.01

Note when you run this chunk, it might not show anything. The table will show properly when you render this document.

Goodness of Fit

Question 7

Let’s use our results to interpret the goodness of fit statistics, and also calculate them ourselves to verify what they measure.

Part A

Look both at your results from the regression lm object in Question 4, and the glance() results from broom in Question 5B. What is the \(R^2\) for our estimated regression? What does it mean in the context of our question?

\(R^2\) is 0.61. This is a measure of the goodness of fit, specifically, the fraction of variation in audience scores that can be explained by our model (\(\widehat{\text{audience}}\)).

Part B

Use your results from Question 5C (the augment()ed data) to calculate \(R^2\) as the fraction of total variation in \(Y\) explained by variation in \(\hat{Y}\):

\[ R^2 = \frac{var(\text{.fitted})}{var(\text{audience})} \]

# type your code below in this chunk
reg_aug %>%
  summarize(var_Y_hat = var(.fitted),
            var_Y = var(audience),
            r_sq = var(.fitted)/var(audience))

Optional Challenge: calculate \(R^2\) as the fraction of variation explained using the sum of squares methods:

\[R^2 = \frac{SSM}{SST}=\frac{\displaystyle\sum^{n}_{i=1}(\hat{Y}_i-\bar{Y})^2}{\displaystyle\sum^{n}_{i=1}(Y_i-\bar{Y})^2}\]

\[ R^2=1-\left(\frac{SSR}{SST}\right)=1-\left(\frac{\displaystyle\sum^{n}_{i=1}\hat{u}_i^2}{\displaystyle\sum^{n}_{i=1}(Y_i-\bar{Y})^2}\right) \]

You can do so by defining a function I’ll call sum_sq, which takes the sum of the squared deviations of a variable from its mean:

sum_sq <- function(x){sum((x - mean(x))^2)}

and then use this on .fitted, .resid, and audience to find SSM, SSR, and SST.

# type your code below in this chunk
reg_aug %>%
  summarize(SSM = sum_sq(.fitted), #SSM
            SST = sum_sq(audience), #SST
            SSR = sum_sq(.resid), #SSR
            r_sq = SSM/SST, # fraction of explained variation in Y
            r_sq_alt = 1 - (SSR/SST)) # 1 - frac of unexplained variation

And, of course, you can square the correlation coefficient from Question 2.

# type your code below in this chunk
movies %>%
  summarize(correlation = cor(audience, critics),
            r_sq = correlation^2)

Part C

Calculate the standard error of the regression (SER), \(\hat{\sigma}_u\) using the .resid values.

\[ \hat{\sigma}_u=\sqrt{\frac{SSR}{n-2}}=\sqrt{\frac{\displaystyle\sum^{n}_{i=1}\hat{u}_i^2}{n-2}} \]

# type your code below in this chunk
reg_aug %>%
  summarize(SSR = sum(.resid^2),
            n = n(),
            SER = sqrt(SSR/(n-2))
            )

Also compare this to taking the standard deviation of .resid. How close is this to the SER, and why?

# type your code below in this chunk
reg_aug %>%
  summarize(sd(.resid))

They are very close due to a sufficiently large sample size. This is because the degrees of freedom correction, \(n-2\) of 144 is pretty close to \(n\) of 146. So dividing SER by \(n\) vs. \(n-2\) yields approximately similar answers.

Predictions

Question 8

Now let’s use our model to make a prediction about what the audience score will be for a given critics score.

Part A

Suppose we have a movie that has a critics score of 74. Use the predict() function, which requires us to make a tibble of new data to use. So first make a tibble of critics values to use:

to_predict <- tibble(
  critics = c(74)
)

Then use the predict function, which has two arguments: the saved regression object, and newdata, which we’ll set equal to our tibble (I’ve) called to_predict.

# type your code below in this chunk
predict(reg, newdata = to_predict)
       1 
70.69768 

Part B

We of course can calculate a prediction using the OLS regression equation. First, write out the equation we’ve estimated, and then use it to predict an \(\widehat{\text{audience}}_i\) score for a critics score of 74.

\[ \begin{align*} \widehat{\text{Au