4.4 — Nonlinearities & Variable Transformations — R Practice

Author

Answer Key

Published

November 16, 2022

Required Packages & Data

Load all the required packages we will use (note I have installed them already into the cloud project) by running (clicking the green play button) the chunk below:

library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(car) # for F-test

We are returning to the speeding tickets data that we began to explore in R Practice 4.1 on Multivariate Regression and R Practice 4.3 on Categorical Data and Interactions. Download and read in (read_csv) the data below.

# run or edit this chunk (if you want to rename the data)

# read in data from url 
# or you could download and upload it to this project instead
speed <- read_csv("https://metricsf22.classes.ryansafner.com/files/data/speeding_tickets.csv") %>%
  mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"), factor) %>%
  filter(Amount > 0)
Rows: 68357 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Black, Hispanic, Female, Amount, MPHover, Age, OutTown, OutState, S...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# this code cleans the data the same way from last class

This data comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we’ll look at are:

Variable Description
Amount Amount of fine (in dollars) assessed for speeding
Age Age of speeding driver (in years)
MPHover Miles per hour over the speed limit
Black Dummy \(=1\) if driver was black, \(=0\) if not
Hispanic Dummy \(=1\) if driver was Hispanic, \(=0\) if not
Female Dummy \(=1\) if driver was female, \(=0\) if not
OutTown Dummy \(=1\) if driver was not from local town, \(=0\) if not
OutState Dummy \(=1\) if driver was not from local state, \(=0\) if not
StatePol Dummy \(=1\) if driver was stopped by State Police, \(=0\) if stopped by other (local)

We want to explore who gets fines, and how much. We’ll come back to the other variables (which are categorical) in this dataset in later lessons.

Question 1

Run a regression of Amount on Age. Write out the estimated regression equation, and interpret the coefficient on Age.

reg_linear <- lm(Amount ~ Age, data = speed)
summary(reg_linear)

Call:
lm(formula = Amount ~ Age, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-123.21  -46.58   -5.92   32.55  600.24 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 131.70665    0.88649  148.57   <2e-16 ***
Age          -0.28927    0.02478  -11.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.13 on 31672 degrees of freedom
Multiple R-squared:  0.004286,  Adjusted R-squared:  0.004254 
F-statistic: 136.3 on 1 and 31672 DF,  p-value: < 2.2e-16

\(\widehat{\text{Amount}_i}=131.71-0.29 \, \text{Age}_i\)

For every year of age, expected fines decrease by $0.29.
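To make the interpretation concrete, we can plug ages into the estimated equation. A quick sketch using the rounded coefficients reported above (a hypothetical helper, not part of the assignment):

```r
# predicted fine from the rounded estimates above
predicted_fine <- function(age) {
  131.71 - 0.29 * age
}

# predicted fines at ages 20 and 40
predicted_fine(c(20, 40))
# [1] 125.91 120.11
```

A 40-year-old’s predicted fine is about $5.80 lower than a 20-year-old’s, consistent with the $0.29-per-year slope.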

Question 2

Is the effect of Age on Amount nonlinear? Let’s run a quadratic regression.

Part A

Create a new variable for \(Age^2\). Then run a quadratic regression:

\[\widehat{\text{Amount}}_i=\beta_0+\beta_1 \, \text{Age}_i+\beta_2 \, \text{Age}_i^2\]

# make Age_sq variable
speed <- speed %>%
  mutate(Age_sq = Age^2)

# view it
speed
# run quadratic regression
reg_quad <- lm(Amount ~ Age + Age_sq, data = speed) 
summary(reg_quad)

Call:
lm(formula = Amount ~ Age + Age_sq, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-125.54  -44.96   -5.25   33.25  599.88 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 146.746357   2.269451  64.662  < 2e-16 ***
Age          -1.173833   0.125360  -9.364  < 2e-16 ***
Age_sq        0.011357   0.001578   7.198 6.25e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.08 on 31671 degrees of freedom
Multiple R-squared:  0.005912,  Adjusted R-squared:  0.005849 
F-statistic: 94.18 on 2 and 31671 DF,  p-value: < 2.2e-16

Part B

Try running the same regression using the alternate notation: lm(Y ~ X + I(X^2)), replacing X and Y with our variables. This method allows you to run a quadratic regression without having to create a new variable first. Do you get the same results?

reg_quad_alt <- lm(Amount ~ Age + I(Age^2), data = speed) 
summary(reg_quad_alt) 

Call:
lm(formula = Amount ~ Age + I(Age^2), data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-125.54  -44.96   -5.25   33.25  599.88 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 146.746357   2.269451  64.662  < 2e-16 ***
Age          -1.173833   0.125360  -9.364  < 2e-16 ***
I(Age^2)      0.011357   0.001578   7.198 6.25e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.08 on 31671 degrees of freedom
Multiple R-squared:  0.005912,  Adjusted R-squared:  0.005849 
F-statistic: 94.18 on 2 and 31671 DF,  p-value: < 2.2e-16

This gives the same results.

Part C

Write out the estimated regression equation.

\[\widehat{\text{Amount}_i}=146.75-1.17 \, \text{Age}_i+0.01 \, \text{Age}_i^2\]

Part D

Is this model an improvement over the linear model? Compare \(\bar{R}^2\).

Yes, a slight improvement. \(\bar{R}^2\) went from 0.00425 in the linear model to 0.00585 in the quadratic model.
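We can also pull the adjusted \(R^2\)s directly with broom’s glance(), assuming the reg_linear and reg_quad objects estimated above:

```r
library(broom)

# adjusted R^2 for each model
glance(reg_linear)$adj.r.squared
glance(reg_quad)$adj.r.squared
```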

Part E

Is the coefficient on the quadratic term statistically significantly different from zero? i.e. could we reject \(H_0: \beta_2=0\)?

Yes, since \(p<0.05\), we have sufficient evidence to reject \(H_0\): the quadratic term belongs in the model.

Part F

Write an equation for the marginal effect of Age on Amount.

The marginal effect is measured by the first derivative of the regression equation with respect to Age. But you can just remember the resulting formula and plug in the parameters:

\[\begin{align*} \frac{d \, Y}{d \, X} &= \beta_1+2\beta_2 X\\ \frac{d \, Amount}{d \, Age} &= -1.17+2(0.01) \, Age\\ &=-1.17+0.02 \, Age\\ \end{align*}\]

Part G

Predict the marginal effect on Amount of being one year older when you are 18. How about when you are 40?

For 18 year olds:

\[\begin{align*} \frac{d \, Amount}{d \, Age} &=-1.17+0.02(18)\\ &=-1.17+0.36\\ &=-\$0.81\\ \end{align*}\]

For 40 year olds:

\[\begin{align*} \frac{d \, Amount}{d \, Age} &=-1.17+0.02(40)\\ &=-1.17+0.80\\ &=-\$0.37\\ \end{align*}\]

# Let's do this in R:

tidy_reg_quad <- tidy(reg_quad)

tidy_reg_quad
# save beta 1 
quad_beta_1 <- tidy_reg_quad %>%
  filter(term == "Age") %>%
  pull(estimate)

# save beta 2
quad_beta_2 <- tidy_reg_quad %>%
  filter(term == "Age_sq") %>%
  pull(estimate)

# create function to estimate marginal effects
marginal_effect <- function(x){
  return(quad_beta_1 + 2 * quad_beta_2 * x)
}

# run the function on the 18-year-old and the 40-year-old
marginal_effect(c(18,40))
[1] -0.7649738 -0.2652574
# close enough, we had some rounding error

Part H

Our quadratic function is a \(U\)-shape. According to the model, at what age is the amount of the fine minimized?

We can set the derivative equal to 0, or you can just remember the formula and plug in the parameters:

\[\begin{align*} \frac{d Y}{d X} &= \beta_1+2\beta_2 X\\ 0 &=\beta_1+2\beta_2 X\\ -\beta_1&=2\beta_2 X\\ -\frac{1}{2} \frac{\beta_1}{\beta_2}&=Age^*\\ -\frac{1}{2} \left(\frac{-1.17}{0.01}\right) &= Age^*\\ \frac{1}{2} (117) & \approx Age^*\\ 58.5 & \approx Age ^*\\ \end{align*}\]

# Let's do this in R:

-0.5*(quad_beta_1 / quad_beta_2)
[1] 51.67795
# again, some rounding error

Part I

Create a scatterplot between Amount (y) and Age (x). Add a layer with a linear regression (as usual, geom_smooth(method = "lm")), and an additional layer with the predicted quadratic regression curve. This additional layer is similar, but we need to specify the formula of the curve to be quadratic: geom_smooth(method = "lm", formula = "y ~ x + I(x^2)")

ggplot(data = speed)+
  aes(x = Age,
      y = Amount)+
  geom_point()+
  geom_smooth(method = "lm")+ # linear fit (default blue)
  geom_smooth(method = "lm",
              formula = "y ~ x + I(x^2)",
              color = "red") # quadratic fit

Part J

It’s quite hard to see the quadratic curve with all those data points. Redo another plot and this time, only keep the quadratic geom_smooth() layer and leave out the geom_point() layer. This will only plot the regression curve.

ggplot(data = speed)+
  aes(x = Age,
      y = Amount)+
  geom_smooth(method = "lm",
              formula = "y ~ x + I(x^2)",
              color = "red")

Question 3

Should we use a higher-order polynomial equation? Run the following cubic regression, and determine whether it is necessary.

\[\widehat{\text{Amount}}_i = \beta_0 + \beta_1 \, \text{Age}_i + \beta_2 \, \text{Age}_i^2 + \beta_3 \, \text{Age}_i^3\]

reg_cube <- lm(Amount ~ Age + I(Age^2) + I(Age^3), data = speed)
summary(reg_cube)

Call:
lm(formula = Amount ~ Age + I(Age^2) + I(Age^3), data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-125.83  -44.71   -5.53   33.21  600.01 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.519e+02  5.558e+00  27.337  < 2e-16 ***
Age         -1.612e+00  4.453e-01  -3.619 0.000296 ***
I(Age^2)     2.240e-02  1.089e-02   2.058 0.039646 *  
I(Age^3)    -8.457e-05  8.251e-05  -1.025 0.305363    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.08 on 31670 degrees of freedom
Multiple R-squared:  0.005945,  Adjusted R-squared:  0.005851 
F-statistic: 63.13 on 3 and 31670 DF,  p-value: < 2.2e-16

The \(t\)-statistic on Age\(^3\) is small (-1.03) and the \(p\)-value is 0.31, so the cubic term does not have a significant impact on Amount. We should not include it.

Just for fun, would the cubic model look any better?

ggplot(data = speed)+
  aes(x = Age,
      y = Amount)+
  geom_point()+
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", color = "red")+
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)", color = "orange")

ggplot(data = speed)+
  aes(x = Age,
      y = Amount)+
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)", color = "red")+
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)", color = "orange")

Question 4

Run an \(F\)-test to check if a nonlinear model is appropriate. Use the car package’s linearHypothesis() command, which looks like:

linearHypothesis(reg_name, # name of your saved regression object
                 c("var1", "var2")) # name of the variables you are testing

Your null hypothesis is \(H_0: \beta_2=\beta_3=0\) from the cubic regression in Question 3. The command is

linearHypothesis(reg_cube, c("I(Age^2)", "I(Age^3)")) # F-test 

We get a large \(F\) of 26.43, with a very small \(p\)-value. Therefore, we can reject the null hypothesis that the model is linear \((\beta_2=0, \beta_3=0)\): we should not use a linear model. Note this does not tell us whether the model should be quadratic or cubic (or even logarithmic of some sort), only that it is not linear. Remember, this was a joint hypothesis test of all the nonlinear terms \((\beta_2\) and \(\beta_3)\)!
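As a cross-check, the same joint test can be run as a nested-model comparison with base R’s anova(), assuming the reg_linear and reg_cube objects estimated above; comparing the restricted (linear) model to the unrestricted (cubic) model yields the identical \(F\)-statistic:

```r
# F-test of the restricted (linear) model against the unrestricted (cubic) model
anova(reg_linear, reg_cube)
```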

Question 5

Now let’s take a look at speed (MPHover the speed limit).

Part A

Creating new variables as necessary, run a linear-log model of Amount on MPHover. Write down the estimated regression equation, and interpret the coefficient on MPHover \((\hat{\beta_1})\). Make a scatterplot with the regression line. Hint: The simple geom_smooth(method = "lm") layer is sufficient, so long as you use the right variables on the plot!

# create log of MPHover
speed <- speed %>%
  mutate(log_mph = log(MPHover))

# Run linear-log regression
linear_log_reg <- lm(Amount ~ log_mph, data = speed)
summary(linear_log_reg)

Call:
lm(formula = Amount ~ log_mph, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-190.61  -16.44    8.56   20.52  425.33 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -200.0975     1.9401  -103.1   <2e-16 ***
log_mph      115.7544     0.6922   167.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.99 on 31672 degrees of freedom
Multiple R-squared:  0.4689,    Adjusted R-squared:  0.4689 
F-statistic: 2.796e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
# note we could have done this without creating the variable
# just take log() inside the regression:
linear_log_reg_alt <- lm(Amount ~ log(MPHover), data = speed)
summary(linear_log_reg_alt)

Call:
lm(formula = Amount ~ log(MPHover), data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-190.61  -16.44    8.56   20.52  425.33 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -200.0975     1.9401  -103.1   <2e-16 ***
log(MPHover)  115.7544     0.6922   167.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.99 on 31672 degrees of freedom
Multiple R-squared:  0.4689,    Adjusted R-squared:  0.4689 
F-statistic: 2.796e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
ggplot(data = speed)+
  aes(x = log_mph,
      y = Amount)+
  geom_point()+
  geom_smooth(method = "lm", color = "red")+
  labs(title = "Linear-Log Model")
`geom_smooth()` using formula = 'y ~ x'

\[\widehat{\text{Amount}_i}=-200.10+115.75\text{ln(MPHover}_i)\]

A 1% increase in speed (over the speed limit) increases the fine by \(\frac{115.75}{100}\approx \$1.16\)

Part B

Creating new variables as necessary, run a log-linear model of Amount on MPHover. Write down the estimated regression equation, and interpret the coefficient on MPHover \((\hat{\beta_1})\). Make a scatterplot with the regression line. Hint: The simple geom_smooth(method = "lm") is sufficient, so long as you use the right variables on the plot!

# create log of Amount
speed <- speed %>%
  mutate(log_Amount = log(Amount))

# Run log-linear regression
log_linear_reg <- lm(log_Amount ~ MPHover, data = speed)
summary(log_linear_reg)

Call:
lm(formula = log_Amount ~ MPHover, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2353 -0.1895  0.0929  0.2733  1.2970 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.8790452  0.0058716   660.6   <2e-16 ***
MPHover     0.0484962  0.0003256   148.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3355 on 31672 degrees of freedom
Multiple R-squared:  0.4119,    Adjusted R-squared:  0.4119 
F-statistic: 2.219e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
# again we could have done this without creating the variable
# just take log() inside the regression:
log_linear_reg_alt <- lm(log(Amount) ~ MPHover, data = speed)
summary(log_linear_reg_alt)

Call:
lm(formula = log(Amount) ~ MPHover, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2353 -0.1895  0.0929  0.2733  1.2970 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.8790452  0.0058716   660.6   <2e-16 ***
MPHover     0.0484962  0.0003256   148.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3355 on 31672 degrees of freedom
Multiple R-squared:  0.4119,    Adjusted R-squared:  0.4119 
F-statistic: 2.219e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
ggplot(data = speed)+
  aes(x = MPHover,
      y = log_Amount)+
  geom_point()+
  geom_smooth(method = "lm", color = "red")+
  labs(title = "Log-Linear Model")
`geom_smooth()` using formula = 'y ~ x'

\[\widehat{\text{ln(Amount}_i)}=3.87+0.05 \, \text{MPHover}_i\]

For every 1 MPH in speed (over the speed limit), expected fine increases by \(0.05 \times 100\%=5\%\)
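The \(\hat{\beta}_1 \times 100\%\) reading is an approximation that works well for small coefficients; the exact percentage change is \((e^{\hat{\beta}_1}-1)\times 100\%\). A quick sketch verifying they are close here:

```r
beta_1 <- 0.0484962 # coefficient on MPHover from the log-linear model above

# approximate vs. exact percentage change in Amount per 1 MPH over the limit
approx_pct <- beta_1 * 100
exact_pct <- (exp(beta_1) - 1) * 100

c(approx = approx_pct, exact = exact_pct)
# approx ~4.85%, exact ~4.97%
```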

Part C

Creating new variables as necessary, run a log-log model of Amount on MPHover. Write down the estimated regression equation, and interpret the coefficient on MPHover \((\hat{\beta_1})\). Make a scatterplot with the regression line. Hint: The simple geom_smooth(method = "lm") is sufficient, so long as you use the right variables on the plot!

# Run log-log regression
log_log_reg <- lm(log_Amount ~ log_mph, data = speed)
summary(log_log_reg)

Call:
lm(formula = log_Amount ~ log_mph, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1417 -0.1727  0.1034  0.2319  2.3669 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.30860    0.01564   147.6   <2e-16 ***
log_mph      0.86196    0.00558   154.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3304 on 31672 degrees of freedom
Multiple R-squared:  0.4297,    Adjusted R-squared:  0.4297 
F-statistic: 2.386e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
# again we could have done this just taking log()s inside the regression:
log_log_reg_alt <- lm(log(Amount) ~ log(MPHover), data = speed)
summary(log_log_reg_alt)

Call:
lm(formula = log(Amount) ~ log(MPHover), data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1417 -0.1727  0.1034  0.2319  2.3669 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.30860    0.01564   147.6   <2e-16 ***
log(MPHover)  0.86196    0.00558   154.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3304 on 31672 degrees of freedom
Multiple R-squared:  0.4297,    Adjusted R-squared:  0.4297 
F-statistic: 2.386e+04 on 1 and 31672 DF,  p-value: < 2.2e-16
ggplot(data = speed)+
  aes(x = log_mph,
      y = log_Amount)+
  geom_point()+
  geom_smooth(method = "lm", color = "red")+
  labs(title = "Log-Log Model")
`geom_smooth()` using formula = 'y ~ x'

\[\widehat{\text{ln(Amount}_i)}=2.31+0.86\text{ln(MPHover}_i)\]

For every 1% increase in speed (over the speed limit), expected fine increases by 0.86%.
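The elasticity reading is exact only for marginal changes; for a larger change, the log-log functional form implies \(\%\Delta Y = \left((1+\%\Delta X)^{\hat{\beta}_1}-1\right)\times 100\%\). A quick sketch using the estimated elasticity, for a hypothetical 10% increase in speed:

```r
beta_1 <- 0.86196 # estimated elasticity of Amount with respect to MPHover

# approximate vs. exact predicted % change in fine for a 10% increase in speed
approx_pct <- beta_1 * 10
exact_pct <- ((1.10)^beta_1 - 1) * 100

c(approx = approx_pct, exact = exact_pct)
# approx ~8.62%, exact ~8.56%
```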

Part D

Which of the three log models has the best fit? Hint: Check \(R^2\)

We can compare the \(R^2\)’s of the three models or compare scatterplots with the regression lines. I will make a table of the three regressions with modelsummary for easy comparison of fit:

modelsummary(models = list("Amount" = linear_log_reg,
                           "ln Amount" = log_linear_reg,
                           "ln Amount" = log_log_reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "MPHover" = "MPH Over Speed Limit",
                             "log_mph" = "ln MPH Over Speed Limit"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
                         Amount       ln Amount    ln Amount
Constant                 −200.10***   3.88***      2.31***
                         (1.94)       (0.006)      (0.02)
ln MPH Over Speed Limit  115.75***                 0.86***
                         (0.69)                    (0.006)
MPH Over Speed Limit                  0.05***
                                      (0.0003)
n                        31674        31674        31674
Adj. R2                  0.47         0.41         0.43
SER                      40.99        0.34         0.33
* p < 0.1, ** p < 0.05, *** p < 0.01

It appears the linear-log model has the best fit, with the highest adjusted \(R^2\) of the three, but not by very much. (One caveat: the two log-dependent-variable models are directly comparable with each other, but comparing either with the linear-log model is looser, since the latter explains variation in Amount rather than in ln Amount.)

Question 6

Return to the quadratic model from Question 2. Run a quadratic regression of Amount on Age, Age\(^2\), MPHover, and all of the race dummy variables. Test the null hypothesis: “the race of the driver has no effect on Amount”

full_reg <- lm(Amount ~ Age + Age_sq + MPHover + Black + Hispanic, data = speed)
summary(full_reg)

Call:
lm(formula = Amount ~ Age + Age_sq + MPHover + Black + Hispanic, 
    data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-308.22  -19.61    7.46   24.77  226.48 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.676048   1.782585   4.867 1.14e-06 ***
Age         -0.278472   0.088851  -3.134 0.001725 ** 
Age_sq       0.003914   0.001118   3.501 0.000464 ***
MPHover      6.887107   0.038754 177.716  < 2e-16 ***
Black1      -1.641981   1.018421  -1.612 0.106911    
Hispanic1    2.482807   1.057868   2.347 0.018932 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.66 on 31668 degrees of freedom
Multiple R-squared:  0.5029,    Adjusted R-squared:  0.5029 
F-statistic:  6408 on 5 and 31668 DF,  p-value: < 2.2e-16
# linearHypothesis() tests coefficients by name: with Black and Hispanic as factors,
# the coefficients are named Black1 and Hispanic1, so we could test c("Black1", "Hispanic1")
# alternatively, convert them to numeric and redo the regression, as below

speed <- speed %>%
  mutate_at(c("Black", "Hispanic"), as.numeric)

full_reg <- lm(Amount ~ Age + Age_sq + MPHover + Black + Hispanic, data = speed)
summary(full_reg)

Call:
lm(formula = Amount ~ Age + Age_sq + MPHover + Black + Hispanic, 
    data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-308.22  -19.61    7.46   24.77  226.48 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.835222   2.335859   3.354 0.000797 ***
Age         -0.278472   0.088851  -3.134 0.001725 ** 
Age_sq       0.003914   0.001118   3.501 0.000464 ***
MPHover      6.887107   0.038754 177.716  < 2e-16 ***
Black       -1.641981   1.018421  -1.612 0.106911    
Hispanic     2.482807   1.057868   2.347 0.018932 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.66 on 31668 degrees of freedom
Multiple R-squared:  0.5029,    Adjusted R-squared:  0.5029 
F-statistic:  6408 on 5 and 31668 DF,  p-value: < 2.2e-16
linearHypothesis(full_reg, c("Black", "Hispanic"))

Since \(p<0.05\), we can reject the null hypothesis in favor of the alternative hypothesis, which implies that the race of the driver has some effect on Amount.

Question 7

Now let’s try standardizing variables: run a regression of Amount on Age and MPHover, but with each variable standardized.

Part A

Create new standardized variables for Amount, Age, and MPHover:

# let's check the mean and sd of each variable
speed %>% 
  summarize_at(vars(Amount, Age, MPHover),
               list("avg" = mean,
                    "sd" = sd))
Variable Mean SD
Amount 122.03 56.25
Age 33.44 12.73
MPHover 17.08 5.79
# make standardized variables
speed <- speed %>%
  mutate(Amount_Z = scale(Amount),
         Age_Z = scale(Age),
         MPHover_Z = scale(MPHover))

We can verify that now all variables have been standardized to have mean 0 and sd 1:

speed %>% 
  summarize_at(vars(Amount_Z, Age_Z, MPHover_Z),
               list("avg" = mean,
                    "sd" = sd))
Variable Mean SD
Amount_Z 0 1
Age_Z 0 1
MPHover_Z 0 1
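scale() is just computing \(\frac{x-\bar{x}}{sd(x)}\) under the hood; a self-contained sketch on a toy vector (not from our dataset) confirms the equivalence:

```r
x <- c(2, 4, 6, 8, 10)

# standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix; coerce to a plain vector to compare
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
# [1] TRUE
```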

Part B

Run a regression of standardized Amount_Z on standardized Age_Z and MPHover_Z. Interpret \(\hat{\beta_1}\) and \(\hat{\beta_2}\). Which variable has a bigger effect on Amount?

std_reg <- lm(Amount_Z ~ Age_Z + MPHover_Z, data = speed)
summary(std_reg)

Call:
lm(formula = Amount_Z ~ Age_Z + MPHover_Z, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4896 -0.3517  0.1366  0.4579  4.0234 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.803e-15  3.963e-03   0.000    1.000    
Age_Z       5.650e-03  3.983e-03   1.418    0.156    
MPHover_Z   7.095e-01  3.983e-03 178.130   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7053 on 31671 degrees of freedom
Multiple R-squared:  0.5026,    Adjusted R-squared:  0.5026 
F-statistic: 1.6e+04 on 2 and 31671 DF,  p-value: < 2.2e-16

\(\hat{\beta_1}\): a 1 standard deviation increase in Age increases Amount by 0.0056 standard deviations (i.e. \(0.0056 \times 56.25 \approx \$0.32)\); note this coefficient is not statistically significant \((p=0.156)\).

\(\hat{\beta_2}\): a 1 standard deviation increase in MPHover causes a 0.7095 standard deviation increase in Amount. (i.e. \(0.7095 \times 56.25 = \$39.90)\)

MPHover has a much larger marginal effect on Amount.