4.1 — Multivariate OLS Estimators — R Practice

Author

Answer Key

Published

October 31, 2022

Required Packages & Data

Load all the required packages we will use (note I have installed them already into the cloud project) by running (clicking the green play button) the chunk below:

library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(car) # for vif command

Download and read in (read_csv) the data below.

# run or edit this chunk (if you want to rename the data)

# read in data from url 
# or you could download and upload it to this project instead
speed <- read_csv("https://metricsf22.classes.ryansafner.com/files/data/speeding_tickets.csv")
Rows: 68357 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Black, Hispanic, Female, Amount, MPHover, Age, OutTown, OutState, S...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This data comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we'll look at are:

Variable   Description
Amount     Amount of fine (in dollars) assessed for speeding
Age        Age of speeding driver (in years)
MPHover    Miles per hour over the speed limit
We want to explore who gets fines, and how much. We’ll come back to the other variables (which are categorical) in this dataset in later lessons.
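
If you want a quick look at these three variables before we start plotting, glimpse() (loaded with the tidyverse) shows each column's type and its first few values:

# optional: a quick peek at the variables we'll use
speed %>%
  select(Amount, Age, MPHover) %>%
  glimpse()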

Question 1

How does the age of a driver affect the amount of the fine? Make a scatterplot of the Amount of the fine (y) and the driver’s Age (x) along with a regression line.

# type your code below in this chunk
ggplot(data = speed)+
  aes(x = Age,
      y = Amount)+
  geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 36683 rows containing non-finite values (stat_smooth).
Warning: Removed 36683 rows containing missing values (geom_point).

Question 2

Next, find the correlation between Amount and Age.

# run or edit this chunk
speed %>%
  select(Amount, Age) %>%
  cor()
       Amount Age
Amount      1  NA
Age        NA   1
# note this method produces a correlation table for all selected variables
# you could instead do:
# speed %>%
#   summarize(cor(Amount, Age))

Notice that it doesn't quite work: the correlation with Age comes back as NA. This is because there are a lot of NAs (missing data) for Amount. If you tried to get the mean() of Amount, for example, it would do the same thing.
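
For example (a quick optional check), mean() will return NA unless we explicitly tell it to drop the missing values:

# optional: mean() also returns NA when values are missing...
speed %>%
  summarize(mean(Amount))

# ...unless we tell it to drop the NAs first with na.rm = TRUE
speed %>%
  summarize(mean(Amount, na.rm = TRUE))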

You can verify the NAs with:

# run or edit this chunk
speed %>%
  select(Amount) %>%
  summary()
     Amount     
 Min.   :  3    
 1st Qu.: 75    
 Median :115    
 Mean   :122    
 3rd Qu.:155    
 Max.   :725    
 NA's   :36683  
# OR
# speed %>% count(Amount) # but this has a lot of rows!

How many NA’s are there?

There are 36,683 NA’s!

In order to run a correlation, we need to drop or ignore all of the NAs. You could filter() the data:

# run or edit this chunk
speed_complete <- speed %>%
  filter(!is.na(Amount)) # remove all NAs

speed_complete %>% 
  select(Amount) %>%
  summary()
     Amount   
 Min.   :  3  
 1st Qu.: 75  
 Median :115  
 Mean   :122  
 3rd Qu.:155  
 Max.   :725  
speed_complete %>%
  select(Amount, Age) %>%
  cor()
            Amount         Age
Amount  1.00000000 -0.06546571
Age    -0.06546571  1.00000000

Or, if you don’t want to change your data, the cor() command allows you to set use = "pairwise.complete.obs" as an argument.

# run or edit this chunk
speed %>%
  select(Amount, Age) %>%
  cor(use = "pairwise.complete.obs")
            Amount         Age
Amount  1.00000000 -0.06546571
Age    -0.06546571  1.00000000

Question 3

We want to estimate the following model:

\[\widehat{\text{Amount}_i}= \hat{\beta_0}+\hat{\beta_1}\text{Age}_i\]

Run a regression, and save it as an object. Then get a summary() of it.

# write your code below here
reg1 <- lm(Amount ~ Age, data = speed)
reg1 %>% summary()

Call:
lm(formula = Amount ~ Age, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-123.21  -46.58   -5.92   32.55  600.24 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 131.70665    0.88649  148.57   <2e-16 ***
Age          -0.28927    0.02478  -11.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.13 on 31672 degrees of freedom
  (36683 observations deleted due to missingness)
Multiple R-squared:  0.004286,  Adjusted R-squared:  0.004254 
F-statistic: 136.3 on 1 and 31672 DF,  p-value: < 2.2e-16

Part A

Write out the estimated regression equation.

\[ \widehat{\text{Amount}}_i=131.71-0.29 \, \text{Age}_i \]

Part B

What is \(\hat{\beta_0}\) for this model? What does it mean in the context of our question?

\(\hat{\beta_0}=131.71\). It means when Age is 0, the fine Amount is expected to be $131.71.

Part C

What is \(\hat{\beta_1}\) for this model? What does it mean in the context of our question?

\(\hat{\beta_1}=-0.29\). It means for every additional year old someone is, their expected Amount of the fine decreases by $0.29.

Part D

What is the marginal effect of Age on Amount?

The marginal effect is \(\hat{\beta_1}\), described in the previous part. Just checking to make sure you know it’s the marginal effect!

Question 4

Redo question 3 with the broom package. Try out tidy() and glance(). This is just to keep you versatile!

# type your code below in this chunk

# tidy() to get coefficients
reg1_tidy <- tidy(reg1)

reg1_tidy
# glance() at original reg for measures of fit
glance(reg1)

Question 5

How big would the difference in expected fine be for two drivers, one 18 years old and one 40 years old?

18-year-old driver:

\[\begin{align*} \widehat{Amount}_i &= \hat{\beta_0}+\hat{\beta_1}Age_i\\ &= 131.71-0.29(18)\\ &=126.49\\ \end{align*}\]

40-year-old driver:

\[\begin{align*} \widehat{Amount}_i &= \hat{\beta_0}+\hat{\beta_1}Age_i\\ &= 131.71-0.29(40)\\ &=120.11\\ \end{align*}\]

The difference is only \(126.49-120.11=6.38\)!

# we can use R as a calculator and use the regression coefficients

# extract beta0 and save it
beta_0 <- reg1_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

# extract beta1 and save it
beta_1 <- reg1_tidy %>%
  filter(term == "Age") %>%
  pull(estimate)

# predicted amount for 18 year old
amount_18yr <- beta_0 + beta_1 * 18

# predicted amount for 40 year old
amount_40yr <- beta_0 + beta_1 * 40

# difference
amount_18yr - amount_40yr
[1] 6.363966
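
Alternatively, here is a quick sketch using R's built-in predict() command with a small data frame of the two ages; it should give the same difference (up to rounding):

# an alternative sketch: predict the fine at each age, then take the difference
preds <- predict(reg1, newdata = tibble(Age = c(18, 40)))
preds[1] - preds[2] # the 18-year-old's predicted fine minus the 40-year-old's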

Question 6

Now run the regression again, controlling for speed (MPHover).

# type your code below in this chunk
reg2 <- lm(Amount ~ Age + MPHover, data = speed)
summary(reg2)

Call:
lm(formula = Amount ~ Age + MPHover, data = speed)

Residuals:
     Min       1Q   Median       3Q      Max 
-308.763  -19.783    7.682   25.757  226.295 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.49386    0.95428   3.661 0.000251 ***
Age          0.02496    0.01760   1.418 0.156059    
MPHover      6.89175    0.03869 178.130  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.67 on 31671 degrees of freedom
  (36683 observations deleted due to missingness)
Multiple R-squared:  0.5026,    Adjusted R-squared:  0.5026 
F-statistic: 1.6e+04 on 2 and 31671 DF,  p-value: < 2.2e-16

Part A

Write the new regression equation.

\[\widehat{\text{Amount}}_i=3.49+0.02 \, \text{Age}_i+6.89 \, \text{MPHover}_i\]

Part B

What is the marginal effect of Age on Amount? What happened to it, compared to Question 3D?

\(\hat{\beta_1}=0.02\). For every additional year old someone is, holding their speed constant, their expected fine increases by $0.02. The magnitude of the effect shrank and it switched from negative to positive!

Part C

What is the marginal effect of MPHover on Amount?

\(\hat{\beta_2}=6.89\). For every additional MPH someone drives above the speed limit, holding their Age constant, their expected fine increases by $6.89.

Part D

What is \(\hat{\beta_0}\) for our model, and what does it mean in the context of our question?

\(\hat{\beta_0}=3.49\). This is the expected fine for a driver of Age \(=0\) and MPHover \(=0\). Not very meaningful…

Part E

What is the adjusted \(\bar{R}^2\)? What does it mean?

It is \(0.5026\). Our model (Age and MPHover) can explain about 50% of the variation in Amount.
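
If you'd rather pull this number with code, broom's glance() reports it directly:

# pull the adjusted R-squared from the regression with glance()
glance(reg2) %>%
  pull(adj.r.squared)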

Question 7

Now suppose both the 18 year old and the 40 year old each went 10 MPH over the speed limit. How big would the difference in expected fine be for the two drivers?

18-year-old driver:

\[\begin{align*} \widehat{Amount}_i &= \hat{\beta_0}+\hat{\beta_1}Age_i+\hat{\beta_2}MPHover_i\\ &= 3.49+0.02(18)+6.89(10)\\ &=72.75\\ \end{align*}\]

40-year-old driver:

\[\begin{align*} \widehat{Amount}_i &= \hat{\beta_0}+\hat{\beta_1}Age_i+\hat{\beta_2}MPHover_i\\ &= 3.49+0.02(40)+6.89(10)\\ &=73.19\\ \end{align*}\]

The difference is only \(72.75-73.19=-0.44\)!

# we can use R as a calculator and use the regression coefficients

# first we need to tidy reg2
reg2_tidy <- tidy(reg2)

# extract beta0 and save it
multi_beta_0 <- reg2_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

# extract beta1 and save it
multi_beta_1 <- reg2_tidy %>%
  filter(term == "Age") %>%
  pull(estimate)

# extract beta2 and save it
multi_beta_2 <- reg2_tidy %>%
  filter(term == "MPHover") %>%
  pull(estimate)

# predicted amount for 18 year old going 10 MPH over
amount_18yr_10mph <- multi_beta_0 + multi_beta_1 * 18 + multi_beta_2 * 10

# predicted amount for 40 year old going 10 MPH over
amount_40yr_10mph <- multi_beta_0 + multi_beta_1 * 40 + multi_beta_2 * 10

# difference
amount_18yr_10mph - amount_40yr_10mph # close, we have some rounding error!
[1] -0.5492232
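
Equivalently, since both drivers are going the same 10 MPH over, the MPHover terms cancel, and the gap is just \(\hat{\beta_1}\) times the difference in ages; a quick check:

# since MPHover is the same for both drivers, its term cancels out:
# the gap is just beta1 times the age difference
multi_beta_1 * (18 - 40)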

Question 8

What is the difference in expected fine between two 18 year-olds, one who went 10 MPH over, and one who went 30 MPH over?

18-year-old driver, 10 MPH over: we saw above that this is $72.75.

18-year-old driver, 30 MPH over:

\[\begin{align*} \widehat{Amount}_i &= \hat{\beta_0}+\hat{\beta_1}Age_i+\hat{\beta_2}MPHover_i\\ &= 3.49+0.02(18)+6.89(30)\\ &=210.55\\ \end{align*}\]

The difference is now \(210.55-72.75=137.80\)!

# we can use R as a calculator and use the regression coefficients

# predicted amount for 40 year old going 30 MPH over
amount_18yr_30mph <- multi_beta_0 + multi_beta_1 * 18 + multi_beta_2 * 30

# difference
amount_18yr_30mph - amount_18yr_10mph
[1] 137.835

So clearly the speed plays a much bigger role than age does!
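
To see this directly, compare the sizes of the two gaps implied by the coefficients themselves (a quick sketch):

# compare the two effects: a 22-year age gap vs. a 20 MPH speed gap
abs(multi_beta_1) * 22 # effect of the age gap on the fine (tiny)
multi_beta_2 * 20      # effect of the speed gap on the fine (huge)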

Question 9

Use the modelsummary package’s modelsummary() command to make a regression table of your two regressions: the one from question 3, and the one from question 6.

# type your code below in this chunk
modelsummary(models = list("Simple OLS" = reg1,
                           "Multivariate OLS" = reg2),
             fmt = 3,
       coef_rename = c("(Intercept)" = "Constant",
                       "Age" = "Age",
                       "MPHover" = "MPH Over Speed Limit"),
       gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01))
                       Simple OLS    Multivariate OLS
Constant               131.707***    3.494***
                       (0.886)       (0.954)
Age                    −0.289***     0.025
                       (0.025)       (0.018)
MPH Over Speed Limit                 6.892***
                                     (0.039)
n                      31674         31674
R²                     0.004         0.50
Adj. R²                0.004         0.50
SER                    56.12         39.67
* p < 0.1, ** p < 0.05, *** p < 0.01

Question 10

Are our two independent variables multicollinear? Do younger people tend to drive faster?

Part A

Get the correlation between Age and MPHover.

# type your code below in this chunk

speed %>%
  select(Age, MPHover) %>%
  cor()
               Age    MPHover
Age      1.0000000 -0.1035001
MPHover -0.1035001  1.0000000
# fortunately no NA's here!

Part B

Make a scatterplot of MPHover (y) on Age (x).

# type your code below in this chunk
ggplot(data = speed)+
  aes(x = Age,
      y = MPHover)+
  geom_point()+
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'

Part C

Run an auxiliary regression of MPHover on Age.

# type your code below in this chunk
reg_aux <- lm(MPHover ~ Age, data = speed)
summary(reg_aux)

Call:
lm(formula = MPHover ~ Age, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.683  -4.020  -0.488   2.512  59.551 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.541094   0.054399  304.07   <2e-16 ***
Age         -0.039007   0.001434  -27.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.056 on 68355 degrees of freedom
Multiple R-squared:  0.01071,   Adjusted R-squared:  0.0107 
F-statistic: 740.2 on 1 and 68355 DF,  p-value: < 2.2e-16

Part D

Interpret the coefficient on Age from this regression.

In our notation, this is \(\hat{\delta_1}=-0.04\), which says for every 1 year older someone is, they drive 0.04 MPH less over the speed limit.

Part E

Look at your regression table in question 9. What happened to the standard error on Age? Why (consider the formula for the variance of \(\hat{\beta_1}\))?

Recall the formula: \(var[\hat{\beta_1}]=\frac{1}{1-R^2_1} \times \frac{SER^2}{n \times var[Age]}\)

The standard error of the regression (SER) went from 56.13 to 39.67! Adding the second variable created a much better fit, which lowered the variance of the estimates (and hence their standard errors, the square roots of the variances).

Variance might have slightly increased to more than it otherwise would be, due to multicollinearity between Age and MPHover, which we investigate next.
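
As a rough check of this formula (a sketch: the formula uses \(n \times var[Age]\) as an approximation to the exact sum of squared deviations, and the auxiliary \(R^2\) from Part C comes from the full sample rather than only the complete cases, so expect a result close to, but not exactly, the 0.018 reported for Age):

# a rough check of the variance formula for the coefficient on Age
r_sq_1 <- glance(reg_aux) %>% pull(r.squared) # R^2 from the auxiliary regression
ser    <- glance(reg2) %>% pull(sigma)        # SER of the multivariate regression
n_used <- glance(reg2) %>% pull(nobs)         # observations actually used

var_age <- speed %>%
  filter(!is.na(Amount)) %>% # the same complete cases the regression used
  summarize(var(Age)) %>%
  pull()

# implied standard error (square root of the variance formula)
sqrt((1 / (1 - r_sq_1)) * ser^2 / (n_used * var_age))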

Part F

Calculate the Variance Inflation Factor (VIF) using the car package’s vif() command. Run it on your regression object saved from Question 6.

# type your code below in this chunk
reg2 %>% vif()
     Age  MPHover 
1.010149 1.010149 

The VIF is 1.01 for \(\hat{\beta_1}\) and \(\hat{\beta_2}\). Variance increases only by 1% due to (weak) multicollinearity between Age and MPHover.

Part G

Calculate the VIF manually, using what you learned in this question.

\(VIF=\frac{1}{1-R^2_1}\), where \(R^2_1\) comes from the auxiliary regression of MPHover on Age (Part C). The \(R^2\) from that regression was \(0.011\), so:

\[\begin{align*} VIF&=\frac{1}{1-R^2_1}\\ &=\frac{1}{1-0.011}\\ &=\frac{1}{0.989}\\ & \approx 1.011\\ \end{align*}\]
# we can do this in R

# first we need to extract and save the R-squared from the aux reg
# use broom's glance()

aux_r_sq <- glance(reg_aux) %>%
  pull(r.squared) %>%
  round(.,3) # round to 3 decimals

vif <- 1/(1-aux_r_sq)
vif
[1] 1.011122

Again, we have some slight rounding error.

Question 11

Let’s now think about the omitted variable bias. Suppose the “true” model is the one we ran from Question 6.

Part A

Do you suppose that MPHover fits the two criteria for omitted variable bias?

  1. MPHover probably affects Amount
  2. \(cor(MPHover, Age) \neq 0\)

Possibly, though it is only weakly correlated with Age \((-0.10)\). Speed clearly has a big role to play in affecting and predicting fines, but probably does not add very much bias to the marginal effect of Age on Amount.

Part B

Look at the regression we ran in Question 3. Consider this the “omitted” regression, where we left out MPHover. Does our estimate of the marginal effect of Age on Amount overstate or understate the true marginal effect?

Since there is negative correlation between MPHover and Age, \(\hat{\beta_1}\) likely understates the effect of Age on Amount.

Part C

Use the “true” model (Question 6), the “omitted” regression (Question 3), and our “auxiliary” regression (Question 10) to identify each of the following parameters that describe our biased estimate of the marginal effect of Age on Amount: \[\alpha_1=\beta_1+\beta_2\delta_1\]

See the notation I used in class.

Using our notation from class, and the regressions from questions 3, 6, and 10, we have three regressions:

  • True Model: \(\widehat{\text{Amount}_i}=3.49+0.02 \, \text{Age}_i+6.89 \, \text{MPHover}_i\)
  • Omitted Reg: \(\widehat{\text{Amount}_i}=131.71-0.29 \, \text{Age}_i\)
  • Auxiliary Reg: \(\widehat{\text{MPHover}_i}=16.54-0.04 \, \text{Age}_i\)

\[\begin{align*} \hat{\alpha_1}&=\beta_1+\beta_2 \delta_1\\ -0.29 &= 0.02+6.89(-0.04)\\ \end{align*}\]

Where:

  • \(\beta_1=0.02\): the true marginal effect of Age on Amount (holding MPHover constant)
  • \(\beta_2=6.89\): the true marginal effect of MPHover on Amount (holding Age constant)
  • \(\delta_1=-0.04\): the (auxiliary) effect of Age on MPHover

We have some slight rounding error, but this is the relationship.

Part D

From your answer in part C, how large is the omitted variable bias from leaving out MPHover?

\[\begin{align*} \hat{\alpha_1}&=\beta_1+\underbrace{\beta_2 \delta_1}_{Bias}\\ -0.29 &= 0.02+\underbrace{6.89(-0.04)}_{bias}\\ Bias &=-0.2756\\ \end{align*}\]
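
We can also check this decomposition in R. Note that the identity \(\alpha_1=\beta_1+\beta_2\delta_1\) holds exactly only when all three regressions use the same observations; our auxiliary regression used the full sample while reg1 and reg2 dropped rows with a missing Amount, so part of the gap above is a sample difference rather than pure rounding. A sketch, re-running the auxiliary regression on the same complete cases:

# re-run the auxiliary regression on the complete cases (Amount not missing)
# so the decomposition holds exactly
reg_aux_cc <- lm(MPHover ~ Age, data = speed %>% filter(!is.na(Amount)))

delta_1 <- tidy(reg_aux_cc) %>%
  filter(term == "Age") %>%
  pull(estimate)

bias <- multi_beta_2 * delta_1 # beta2 * delta1, the omitted variable bias
bias

multi_beta_1 + bias # should reproduce the Age coefficient from the omitted regression (reg1)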

Question 12

Make a coefficient plot of your coefficients from the regression in Question 6. The modelsummary package (which we loaded above) has a great command modelplot() to do this on your regression object.

# type your code below in this chunk
modelplot(reg2)

# alternatively, we can make it ourselves

# first we have to make sure to tidy our reg with confidence intervals!
reg2_tidy <- tidy(reg2, conf.int=TRUE)

# look at it
reg2_tidy
# use this as the data for our plot
ggplot(data = reg2_tidy)+
  aes(x = estimate,
      y = term,
      color = term)+
    geom_point()+ # point for each OLS estimate
    # now make "error bars" using conf. int. for each OLS estimate
    geom_segment(aes(x = conf.low,
                     xend = conf.high,
                     y = term,
                     yend = term))+
  # add line at 0
  geom_vline(xintercept = 0,
             size = 1,
             color = "black",
             linetype = "dashed")+
  labs(x = expression(paste("Marginal effect, ", hat(beta[j]))),
       y = "Variable")+
  guides(color = "none")+ # hide legend
  theme_classic(base_family = "Fira Sans Condensed",
           base_size = 16)