4.3 — Categorical Data & Interactions — R Practice
Required Packages & Data
Load all the required packages we will use (note that I have already installed them into the cloud project) by running the chunk below (click the green play button):
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
We are returning to the speeding tickets data that we began to explore in R Practice 4.1 on Multivariate Regression. Download and read in (with read_csv()) the data below.
# run or edit this chunk (if you want to rename the data)
# read in data from url
# or you could download and upload it to this project instead
speed <- read_csv("https://metricsf22.classes.ryansafner.com/files/data/speeding_tickets.csv")
Rows: 68357 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Black, Hispanic, Female, Amount, MPHover, Age, OutTown, OutState, S...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
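The message above suggests setting show_col_types = FALSE if you would rather not see the column specification each time. Here is a minimal sketch of that, using a tiny made-up CSV string (the values are hypothetical, not rows from the real file) so that it runs without downloading anything. It assumes readr 2.0 or later, which is where I() for literal data and show_col_types are available.

```r
library(readr)

# Toy CSV text standing in for the real file (hypothetical values)
toy_csv <- I("Black,Female,Amount\n0,1,NA\n1,0,125\n")

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)

# The columns are still read in as numeric (double), just quietly
sapply(toy, class)
```

The same argument could be added to the read_csv() call for the speeding tickets data if you prefer a quieter import.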
This data comes from a paper by Makowsky and Stratmann (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the ones we’ll look at are:
Variable | Description |
---|---|
Amount | Amount of fine (in dollars) assessed for speeding |
Age | Age of speeding driver (in years) |
MPHover | Miles per hour over the speed limit |
Black | Dummy \(=1\) if driver was black, \(=0\) if not |
Hispanic | Dummy \(=1\) if driver was Hispanic, \(=0\) if not |
Female | Dummy \(=1\) if driver was female, \(=0\) if not |
OutTown | Dummy \(=1\) if driver was not from local town, \(=0\) if from local town |
OutState | Dummy \(=1\) if driver was not from local state, \(=0\) if from local state |
StatePol | Dummy \(=1\) if driver was stopped by State Police, \(=0\) if stopped by other (local) police |
We want to explore who gets fines, and how much. We’ll come back to the other variables (which are categorical) in this dataset in later lessons.
Question 1
We will have to do a little more cleaning to get some of the data into a more usable form.
Part A
Inspect the data with str() or head() or glimpse() to see what it looks like.
# type your code below in this chunk
str(speed)
spc_tbl_ [68,357 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Black : num [1:68357] 0 0 0 0 0 0 0 0 0 0 ...
$ Hispanic: num [1:68357] 0 0 0 0 0 0 0 0 0 0 ...
$ Female : num [1:68357] 1 1 1 0 0 0 1 0 1 0 ...
$ Amount : num [1:68357] NA NA NA NA NA NA NA NA NA NA ...
$ MPHover : num [1:68357] 14 15 15 13 12 17 15 15 15 15 ...
$ Age : num [1:68357] 22 43 32 24 54 30 18 53 51 33 ...
$ OutTown : num [1:68357] 1 1 0 1 1 1 0 0 1 1 ...
$ OutState: num [1:68357] 0 0 0 0 0 0 0 0 0 0 ...
$ StatePol: num [1:68357] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. Black = col_double(),
.. Hispanic = col_double(),
.. Female = col_double(),
.. Amount = col_double(),
.. MPHover = col_double(),
.. Age = col_double(),
.. OutTown = col_double(),
.. OutState = col_double(),
.. StatePol = col_double()
.. )
- attr(*, "problems")=<externalptr>
What class of variable are Black, Hispanic, Female, OutTown, and OutState?
They are all num (numeric) variables.
Part B
Notice that when importing the data from the .csv file, R interpreted these variables as numeric (num) or double (dbl), but we want them to be factor (fct) variables, to ensure R recognizes that there are two groups (categories), 0 and 1.
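To see what this change of class means in practice, here is a small sketch with a made-up 0/1 vector (the name female_num and its values are hypothetical, not taken from the dataset). It only uses base R.

```r
# Toy 0/1 dummy (hypothetical values, not taken from the dataset)
female_num <- c(1, 1, 0, 0, 1)
class(female_num)   # "numeric"

# Converting to a factor tells R these are categories, not quantities
female_fct <- factor(female_num)
class(female_fct)   # "factor"
levels(female_fct)  # "0" "1"

# Count how many observations fall in each category
table(female_fct)
```

As a numeric variable, R would treat the 0s and 1s as quantities; as a factor, it treats them as two labeled groups.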
You could convert the variables one at a time to factors using as.factor() inside a mutate() command. But there is a special variant of mutate(), called mutate_at(), that applies a transformation (like changing a variable’s class to factor) to several variables at once. Run the following chunk to execute it:
# run or edit this chunk
speed <- speed %>%
  mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"), factor)
speed
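For comparison, here is a sketch of the two other routes on a small made-up tibble (the object names toy, toy_one, and toy_all and their values are hypothetical). The first converts variables one at a time with as.factor(). The second uses across(), which the dplyr documentation recommends in place of scoped verbs like mutate_at() from version 1.0 onward; mutate_at() still works, so this is an alternative rather than a correction.

```r
library(dplyr)

# Toy data standing in for the speeding tickets data (hypothetical values)
toy <- tibble(
  Black   = c(0, 1, 0),
  Female  = c(1, 0, 1),
  MPHover = c(14, 15, 12)
)

# Route 1: convert one variable at a time inside mutate()
toy_one <- toy %>%
  mutate(Black  = as.factor(Black),
         Female = as.factor(Female))

# Route 2: convert several variables at once with across()
toy_all <- toy %>%
  mutate(across(c(Black, Female), factor))

# Both routes leave MPHover numeric and turn the dummies into factors
sapply(toy_all, class)
```

With only two or three dummies the one-at-a-time route is fine; the all-at-once route saves typing as the list of variables grows.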