---
title: "4.4 — Nonlinearities & Variable Transformations — R Practice"
author: "Your Name Here!"
date: "November 16, 2022"
format:
  html:
    self-contained: true # so we don't need other files (like plot images)
    toc: true # show a table of contents
    toc-location: left
    theme: default
    df-print: paged # by default, show tables (tibbles) as paged tables
editor: visual
execute:
  echo: true # shows all code on rendered document
---
# Required Packages & Data
Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below:
```{r}
#| label: load-packages
#| warning: false
#| message: false
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(car) # for F-test
```
We are returning to the speeding tickets data that we began to explore in [R Practice 4.1 on Multivariate Regression](http://metricsf22.classes.ryansafner.com/r/4.1-r) and [R Practice 4.3 on Categorical Data and Interactions](https://metricsf22.classes.ryansafner.com/r/4.3-r). Download and read in (`read_csv`) the data below.
- [ `speeding_tickets.csv`](http://metricsf21.classes.ryansafner.com/data/speeding_tickets.csv)
```{r}
# run or edit this chunk (if you want to rename the data)
# read in data from url
# or you could download and upload it to this project instead
speed <- read_csv("https://metricsf22.classes.ryansafner.com/files/data/speeding_tickets.csv") %>%
  mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"), factor) %>%
  filter(Amount > 0)
# this code cleans the data the same way from last class
```
This data comes from a paper by Makowsky and Stratmann (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we'll look at are:
| Variable | Description |
|--------------|---------------------------------------------------------|
| `Amount` | Amount of fine (in dollars) assessed for speeding |
| `Age` | Age of speeding driver (in years) |
| `MPHover` | Miles per hour over the speed limit |
| `Black` | Dummy $=1$ if driver was black, $=0$ if not |
| `Hispanic` | Dummy $=1$ if driver was Hispanic, $=0$ if not |
| `Female` | Dummy $=1$ if driver was female, $=0$ if not |
| `OutTown` | Dummy $=1$ if driver was not from local town, $=0$ if not |
| `OutState` | Dummy $=1$ if driver was not from local state, $=0$ if not |
| `StatePol` | Dummy $=1$ if driver was stopped by State Police, $=0$ if stopped by other (local) |
We want to explore **who gets fines, and how much**. We'll come back to the other variables (which are categorical) in this dataset in later lessons.
## Question 1
Run a regression of `Amount` on `Age`. Write out the estimated regression equation, and interpret the coefficient on Age.
```{r}
# type your code below in this chunk
```
## Question 2
Is the effect of `Age` on `Amount` nonlinear? Let's run a quadratic regression.
### Part A
Create a new variable for $Age^2$. Then run a quadratic regression:
$$\widehat{\text{Amount}}_i=\beta_0+\beta_1 \, \text{Age}_i+\beta_2 \, \text{Age}_i^2$$
```{r}
# type your code below in this chunk
```
### Part B
Try running the same regression using the alternate notation: `lm(Y ~ X + I(X^2))`, replacing `X` and `Y` with our variables. This method allows you to run a quadratic regression without having to create a new variable first. Do you get the same results?
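As a hedged sketch of the two notations on made-up toy data (the data frame `toy` and its columns `x`, `x_sq`, and `y` are hypothetical, not part of the speeding tickets data), both approaches should give the same fitted coefficients:

```{r}
#| eval: false
# toy data, purely illustrative (not the speeding tickets data)
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x - 0.1 * toy$x^2 + sin(toy$x) # sin() stands in for noise

# approach 1: create the squared term as its own variable first
toy$x_sq <- toy$x^2
fit_var <- lm(y ~ x + x_sq, data = toy)

# approach 2: square inside the formula with I()
fit_I <- lm(y ~ x + I(x^2), data = toy)

# same estimates; only the coefficient names differ
all.equal(unname(coef(fit_var)), unname(coef(fit_I)))
```

The `I()` wrapper is needed because `^` has a special meaning inside an R formula; without it, `y ~ x + x^2` is treated the same as `y ~ x`.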
```{r}
# type your code below in this chunk
```
### Part C
Write out the estimated regression equation.
### Part D
Is this model an improvement from the linear model? Compare $\bar{R}^2$.
### Part E
Is the coefficient on the quadratic term statistically significantly different from zero? That is, could we reject $H_0: \beta_2 = 0$?
### Part F
Write an equation for the marginal effect of `Age` on `Amount`.
### Part G
Predict the marginal effect on `Amount` of being one year older when you are 18. How about when you are 40?
### Part H
Our quadratic function is a $U$-shape. According to the model, at what age is the amount of the fine minimized?
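For a quadratic model, the marginal effect is the derivative of the fitted curve, and the curve turns where that derivative equals zero. A minimal sketch on toy data built from a known curve (the data frame `toy` and the helper `marginal_effect()` are hypothetical names used only for illustration):

```{r}
#| eval: false
# toy data with a known U-shape: y = 10 - 4x + 0.5x^2 (no noise, purely illustrative)
toy <- data.frame(x = 0:10)
toy$y <- 10 - 4 * toy$x + 0.5 * toy$x^2

fit <- lm(y ~ x + I(x^2), data = toy)
b1 <- unname(coef(fit)["x"])
b2 <- unname(coef(fit)["I(x^2)"])

# marginal effect of x on y is the derivative: b1 + 2 * b2 * x
marginal_effect <- function(x) b1 + 2 * b2 * x
marginal_effect(2) # approximately -2: the curve is still falling at x = 2
marginal_effect(8) # approximately 4: the curve is rising at x = 8

# the curve turns where the marginal effect is zero: x* = -b1 / (2 * b2)
turning_point <- -b1 / (2 * b2)
turning_point # approximately 4
```

The same steps should carry over to the regression of `Amount` on `Age` once the estimated coefficients are substituted in.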
### Part I
Create a scatterplot between `Amount` (`y`) and `Age` (`x`). Add a layer with a linear regression (as usual, `geom_smooth(method = "lm")`), and an additional layer with the predicted quadratic regression curve. This additional layer is similar, but we need to specify the formula of the curve to be quadratic:
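One way this could look, sketched on made-up toy data (the data frame `toy`, its columns, and the colors are assumptions chosen for illustration):

```{r}
#| eval: false
library(ggplot2)

# toy data, purely illustrative (not the speeding tickets data)
toy <- data.frame(x = 1:30)
toy$y <- 50 - 3 * toy$x + 0.1 * toy$x^2 + 5 * sin(toy$x) # sin() stands in for noise

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # straight regression line
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
  # quadratic regression curve
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), color = "red")
p
```

Inside `geom_smooth()`, the formula is written in terms of the aesthetics `x` and `y`, not the names of the variables mapped to them.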
```{r}
# type your code below in this chunk
```
### Part J
It's quite hard to see the quadratic curve with all those data points. Make another plot, this time keeping only the quadratic `geom_smooth()` layer and leaving out the `geom_point()` layer. This will plot only the regression curve.
```{r}
# type your code below in this chunk
```
## Question 3
Should we use a higher-order polynomial equation? Run the following cubic regression, and determine whether it is necessary.
$$\widehat{\text{Amount}}_i = \beta_0 + \beta_1 \, \text{Age}_i + \beta_2 \, \text{Age}_i^2 + \beta_3 \, \text{Age}_i^3$$
## Question 4
Run an $F$-test to check if a nonlinear model is appropriate. Use the `car` package's `linearHypothesis()` command, which looks like:
```{r}
#| eval: false
# set eval to true above if you want to edit and run this chunk
linearHypothesis(reg_name, # name of your saved regression object
                 c("var1", "var2")) # names of the variables you are testing
```
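As a hedged illustration of that template on toy data (the data frame `toy`, its columns `x_sq` and `x_cu`, and the object names `cubic` and `f_test` are all hypothetical), a joint test on the higher-order terms might be sketched as:

```{r}
#| eval: false
library(car)

# toy data, purely illustrative; polynomial terms stored as named columns
toy <- data.frame(x = seq(1, 10, length.out = 40))
toy$x_sq <- toy$x^2
toy$x_cu <- toy$x^3
toy$y <- 5 + 2 * toy$x + 0.3 * toy$x_sq + cos(7 * toy$x) # cos() stands in for noise

cubic <- lm(y ~ x + x_sq + x_cu, data = toy)

# joint test of H0: the coefficients on x_sq and x_cu are both zero
f_test <- linearHypothesis(cubic, c("x_sq = 0", "x_cu = 0"))
f_test
```

The output compares the restricted model (without the tested terms) to the unrestricted one; the `F` and `Pr(>F)` columns report the test statistic and its p-value.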
Your null hypothesis is $H_0: \beta_2=\beta_3=0$ from the cubic regression in Question 3. Run the test in the chunk below:
```{r}
# type your code below in this chunk
```
## Question 5
Now let's take a look at speed (`MPHover` the speed limit).
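The three logarithmic specifications asked for in the parts below differ only in which variables are logged. A minimal sketch on toy data (the data frame `toy` and the column names `log_x` and `log_y` are hypothetical), with the usual interpretation of each slope noted in the comments:

```{r}
#| eval: false
# toy data following an exact power law, y = 2 * x^1.5 (purely illustrative)
toy <- data.frame(x = 1:25)
toy$y <- 2 * toy$x^1.5

# create the logged variables (natural logs)
toy$log_x <- log(toy$x)
toy$log_y <- log(toy$y)

# linear-log: a 1% change in x goes with a (b1 / 100) unit change in y
lin_log <- lm(y ~ log_x, data = toy)

# log-linear: a 1 unit change in x goes with a (100 * b1)% change in y
log_lin <- lm(log_y ~ x, data = toy)

# log-log: a 1% change in x goes with a b1% change in y (an elasticity)
log_log <- lm(log_y ~ log_x, data = toy)

coef(log_log)["log_x"] # approximately 1.5, the exponent used to build the toy data
```

When plotting, the variables on the axes should match the ones in the regression (for example, `log_x` on the horizontal axis for the linear-log model) so that the fitted line is straight.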
### Part A
Creating new variables as necessary, run a **linear-log** model of `Amount` on `MPHover`. Write down the estimated regression equation, and interpret the coefficient on `MPHover` $(\hat{\beta_1})$. Make a scatterplot with the regression line. Hint: The simple `geom_smooth(method = "lm")` layer is sufficient, so long as you use the right variables on the plot!
```{r}
# type your code below in this chunk
```
### Part B
Creating new variables as necessary, run a **log-linear** model of `Amount` on `MPHover`. Write down the estimated regression equation, and interpret the coefficient on `MPHover` $(\hat{\beta_1})$. Make a scatterplot with the regression line. Hint: The simple `geom_smooth(method = "lm")` is sufficient, so long as you use the right variables on the plot!
```{r}
# type your code below in this chunk
```
### Part C
Creating new variables as necessary, run a **log-log** model of `Amount` on `MPHover`. Write down the estimated regression equation, and interpret the coefficient on `MPHover` $(\hat{\beta_1})$. Make a scatterplot with the regression line. Hint: The simple `geom_smooth(method = "lm")` is sufficient, so long as you use the right variables on the plot!
```{r}
# type your code below in this chunk
```
### Part D
Which of the three log models has the best fit? Hint: Check $R^2$.
## Question 6
Return to the quadratic model from Question 2. Run a quadratic regression of `Amount` on `Age`, `Age`$^2$, `MPHover`, and all of the race dummy variables. Test the null hypothesis: *“the race of the driver has no effect on Amount.”*
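Because the dummy variables were converted to factors when the data was read in, their coefficient names in the regression output will likely carry the factor level as a suffix, and those are the names a joint test needs. A small sketch of that naming on toy data (the column `grp` is a hypothetical stand-in for a 0/1 dummy):

```{r}
#| eval: false
# toy data with a 0/1 dummy stored as a factor (purely illustrative)
toy <- data.frame(
  x   = 1:12,
  grp = factor(rep(c(0, 1), times = 6))
)
toy$y <- 1 + 0.5 * toy$x + 3 * (toy$grp == "1") + sin(toy$x) # sin() stands in for noise

fit <- lm(y ~ x + grp, data = toy)

# the dummy's coefficient is named variable + level, here "grp1"
names(coef(fit))
```

Checking `names(coef())` on your own saved regression shows the exact strings to pass to `linearHypothesis()`.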
```{r}
# type your code below in this chunk
```
## Question 7
Now let's try standardizing variables. Let's try running a regression of `Amount` on `Age` and `MPHover`, but standardizing each variable.
### Part A
Create new standardized variables for `Amount`, `Age`, and `MPHover`:
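Standardizing a variable means subtracting its mean and dividing by its standard deviation. A minimal sketch on toy data (the data frame `toy` and the column names `x_Z` and `x_Z2` are hypothetical), showing the manual calculation next to base R's `scale()`:

```{r}
#| eval: false
# toy data, purely illustrative (not the speeding tickets data)
toy <- data.frame(x = c(2, 4, 4, 5, 7, 9, 11, 14))

# z-score by hand: subtract the mean, divide by the standard deviation
toy$x_Z <- (toy$x - mean(toy$x)) / sd(toy$x)

# scale() does the same thing; as.numeric() drops the matrix attributes it returns
toy$x_Z2 <- as.numeric(scale(toy$x))

mean(toy$x_Z) # approximately 0
sd(toy$x_Z)   # 1
```

Either version can be used inside `mutate()` to create several standardized columns at once.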
```{r}
# type your code below in this chunk
```
### Part B
Run a regression of standardized `Amount_Z` on standardized `Age_Z` and `MPHover_Z`. Interpret $\hat{\beta_1}$ and $\hat{\beta_2}$. Which variable has a bigger effect on `Amount`?
```{r}
# type your code below in this chunk
```