4.4 — Nonlinearities & Variable Transformations — R Practice


Answer Key


November 16, 2022

Required Packages & Data

Load the packages we will use by running the chunk below (click the green play button). Note that I have already installed them in the cloud project.

library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(car) # for F-test

We are returning to the speeding tickets data that we began to explore in R Practice 4.1 on Multivariate Regression and R Practice 4.3 on Categorical Data and Interactions. Download and read in (read_csv) the data below.

# run or edit this chunk (if you want to rename the data)

# read in data from url 
# or you could download and upload it to this project instead
speed <- read_csv("https://metricsf22.classes.ryansafner.com/files/data/speeding_tickets.csv") %>%
  mutate(across(c(Black, Hispanic, Female, OutTown, OutState), factor)) %>% # convert dummies to factors
  filter(Amount > 0)
Rows: 68357 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Black, Hispanic, Female, Amount, MPHover, Age, OutTown, OutState, S...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# this code cleans the data the same way from last class

These data come from a paper by Makowsky and Stratmann (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. The dataset includes information on all traffic stops, but an amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we'll look at are:

| Variable | Description |
|----------|-------------|
| Amount | Amount of fine (in dollars) assessed for speeding |
| Age | Age of speeding driver (in years) |
| MPHover | Miles per hour over the speed limit |
| Black | Dummy \(=1\) if driver was black, \(=0\) if not |
| Hispanic | Dummy \(=1\) if driver was Hispanic, \(=0\) if not |
| Female | Dummy \(=1\) if driver was female, \(=0\) if not |
| OutTown | Dummy \(=1\) if driver was not from local town, \(=0\) if from local town |
| OutState | Dummy \(=1\) if driver was not from local state, \(=0\) if from local state |
| StatePol | Dummy \(=1\) if driver was stopped by State Police, \(=0\) if stopped by other (local) police |

We want to explore who gets fines, and how much. We’ll come back to the other variables (which are categorical) in this dataset in later lessons.
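Because an amount is recorded only when a fine was assessed, the `filter(Amount > 0)` step in the loading chunk does more than remove zeros. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, and it assumes unfined stops appear as `NA`) to show that `filter()` keeps only rows where the condition is `TRUE`, so rows with a missing `Amount` are dropped as well.

```r
library(dplyr)

# Hypothetical toy data: three stops, one with no fine recorded (NA)
toy <- tibble(
  MPHover = c(12, 8, 20),
  Amount  = c(115, NA, 0)
)

# filter() keeps only rows where the condition evaluates to TRUE,
# so both the NA row and the zero row are dropped
fined <- toy %>%
  filter(Amount > 0)

nrow(fined)
```

This is likely part of why the regressions below use far fewer observations than the number of rows originally read in.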

Question 1

Run a regression of Amount on Age. Write out the estimated regression equation, and interpret the coefficient on Age.

# regress Amount on Age, then print the full summary
reg_linear <- lm(Amount ~ Age, data = speed)
summary(reg_linear)

Call:
lm(formula = Amount ~ Age, data = speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-123.21  -46.58   -5.92   32.55  600.24 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 131.70665    0.88649  148.57   <2e-16 ***
Age          -0.28927    0.02478  -11.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.13 on 31672 degrees of freedom
Multiple R-squared:  0.004286,  Adjusted R-squared:  0.004254 
F-statistic: 136.3 on 1 and 31672 DF,  p-value: < 2.2e-16

\(\widehat{\text{Amount}}_i=131.71-0.29 \, \text{Age}_i\)

For every additional year of age, the expected fine decreases by about $0.29.
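To make the interpretation concrete, here is a minimal sketch that plugs the estimates from the output above into the fitted line by hand. The helper `predict_amount()` is a name made up for this illustration; it is not part of any package.

```r
# Coefficients copied from the regression output above
b0 <- 131.70665  # intercept
b1 <- -0.28927   # slope on Age

# Hypothetical helper: predicted fine (in dollars) for a driver of a given age
predict_amount <- function(age) {
  b0 + b1 * age
}

# Predicted fine for a 20-year-old driver (about $125.92)
predict_amount(20)

# Moving from age 20 to age 21 changes the prediction by exactly b1
one_year_change <- predict_amount(21) - predict_amount(20)
one_year_change
```

With the fitted model still in memory, `predict(reg_linear, newdata = data.frame(Age = c(20, 21)))` should give the same predictions without copying coefficients by hand.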

Question 2

Is the effect of Age on Amount nonlinear? Let’s run a quadratic regression.

Part A

Create a new variable for \(\text{Age}^2\). Then run a quadratic regression:

\[\widehat{\text{Amount}}_i=\beta_0+\beta_1 \, \text{Age}_i+\beta_2 \, \text{Age}_i^2\]
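A model of this form can be fit in two equivalent ways: by adding the squared term as its own column, or by squaring inside the formula with `I()`. The sketch below uses simulated data rather than the speeding tickets data; the data frame `sim` and the numbers used to generate it are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated (hypothetical) data, not the speeding tickets data
sim <- data.frame(Age = runif(500, min = 16, max = 80))
sim$Amount <- 150 - 1.5 * sim$Age + 0.015 * sim$Age^2 + rnorm(500, sd = 10)

# Approach 1: create the squared term as its own column
sim$Age_sq <- sim$Age^2
reg_a <- lm(Amount ~ Age + Age_sq, data = sim)

# Approach 2: square inside the formula; I() makes ^ mean arithmetic
# exponentiation instead of its special formula meaning
reg_b <- lm(Amount ~ Age + I(Age^2), data = sim)

# Both approaches produce the same coefficient estimates
all.equal(unname(coef(reg_a)), unname(coef(reg_b)))
```

Creating the column explicitly keeps the variable visible in the data, while `I(Age^2)` avoids modifying the data frame; either choice gives the same fit.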

# make Age_sq variable
speed <- speed %>%
  mutate(Age_sq = Age^2)

# view it