--- title: "4.3 — Categorical Data & Interactions — R Practice" author: "Answer Key" date: "November 9, 2022" format: html: self-contained: true # so we don't need other files (like plot images) toc: true # show a table of contents toc-location: left theme: default df-print: paged # by default, show tables (tibbles) as paged tables editor: visual execute: echo: true # shows all code on rendered document --- # Required Packages & Data Load all the required packages we will use (**note I have installed them already into the cloud project**) by running (clicking the green play button) the chunk below: ```{r} #| label: load-packages #| warning: false #| message: false library(tidyverse) # your friend and mine library(broom) # for tidy regression library(modelsummary) # for nice regression tables ``` We are returning to the speeding tickets data that we began to explore in [R Practice 4.1 on Multivariate Regression](http://metricsf22.classes.ryansafner.com/r/4.1-r-practice). Download and read in (`read_csv`) the data below. - [ `speeding_tickets.csv`](http://metricsf21.classes.ryansafner.com/data/speeding_tickets.csv) ```{r} # run or edit this chunk (if you want to rename the data) # read in data from url # or you could download and upload it to this project instead speed <- read_csv("https://metricsf22.classes.ryansafner.com/files/data/speeding_tickets.csv") ``` This data comes from a paper by Makowsky and Strattman (2009) that we will examine later. Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one's we'll look at are: | Variable | Description | |--------------|---------------------------------------------------------| | `Amount` | Amount of fine (in dollars) assessed for speeding | | `Age` | Age of speeding driver (in years) | | `MPHover` | Miles per hour over the speed limit | | `Black` | Dummy $=1$ if driver was black, $=0$ if not | | `Hispanic` | Dummy $=1$ if driver was Hispanic, $=0$ if not | | `Female` | Dummy $=1$ if driver was female, $=0$ if not | | `OutTown` | Dummy $=1$ if driver was not from local town, $=0$ if not | | `OutState` | Dummy $=1$ if driver was not from local state, $=0$ if not | | `StatePol` | Dummy $=1$ if driver was stopped by State Police, $=0$ if stopped by other (local) | We want to explore **who gets fines, and how much**. We'll come back to the other variables (which are categorical) in this dataset in later lessons. ## Question 1 We will have to do a little more cleaning to get some of the data into a more usable form. ### Part A Inspect the data with `str()` or `head()` or `glimpse()` to see what it looks like. ```{r} # type your code below in this chunk str(speed) ``` What `class` of variable are `Black`, `Hispanic`, `Female`, `OutTown`, and `OutState`? They are all `num` - `numeric` variables. ### Part B Notice that when importing the data from the `.csv` file, `R` interpreted these variables as `numeric` (`num`) or `double` (`dbl`), but we want them to be `factor` (`fct`) variables, to ensure `R` recognizes that there are two groups (categories), 0 and 1. You could convert the variables one at a time to factors using `as.factor()` inside a `mutate()` command. But there is a special `mutate()` command that allows you to apply a transformation (like changing a variable's class to `factor`), which you can run the following chunk to execute: ```{r} # run or edit this chunk speed <- speed %>% mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"), factor) speed ``` Confirm that these are now `factor` (`fct`) variables. ### Part C Finally, recall from the last time we worked with this data that there are many `NA`s for `Amount` (these are people that were stopped but did not receive a fine). Let's `filter()` only those observations for which `Amount` is a positive number, and save this in your dataframe (assign and overwrite it, or make a new dataframe). ```{r} # run or edit this chunk # see the NAs speed %>% select(Amount) %>% summary() # overwrite data to keep only positive Amounts speed <- speed %>% filter(Amount > 0) # see there are no more NAs speed %>% select(Amount) %>% summary() ``` ## Question 2 Does the sex of the driver affect the fine? There's already a dummy variable `Female` in the dataset, so create a scatterplot between `Amount` (as `y`) and Female (as `x`). Hint: Use `geom_jitter()` instead of `geom_point()` to better see the points, and play around with `width` settings inside `geom_jitter()` ```{r} # type your code below in this chunk ggplot(data = speed)+ aes(x = Female, y = Amount, color = Female)+ geom_jitter(width = 0.25) ``` As an aside, if you had not made `Female` a `factor`, and kept it `numeric`, it might alter the plot. You can make `Female` a `factor` just for plotting purposes by setting `aes(x = as.factor(Female))` inside the `aes()` layer of your `ggplot()` code. ## Question 3 Now let's start looking at the distribution conditionally to find the different group means. ### Part A Find the mean and standard deviation of `Amount` for *male* drivers and again for *female* drivers. Hint: properly `filter()` the data and then use the `summarize()` command. ```{r} # Summarize for Men # I'll save as male_summary, since # we'll use this in next part male_summary <- speed %>% filter(Female == 0) %>% summarize(mean = mean(Amount), sd = sd(Amount)) # look at it male_summary # Summarize for Women # I'll save as female_summary, since # we'll use this in next part female_summary <- speed %>% filter(Female == 1) %>% summarize(mean = mean(Amount), sd = sd(Amount)) # look at it female_summary ``` ### Part B What is the difference between the average Amount for Males and Females? ```{r} # we can do this with R using our summary dataframes we made last part male_avg <- male_summary %>% pull(mean) # pull extracts the value male_avg #check female_avg <- female_summary %>% pull(mean) female_avg #check # difference male_avg - female_avg ``` Males get `r paste("$", round(male_avg, 2))` and Females get `r paste("$", round(female_avg, 2))`, so Males get `r paste("$", round(male_avg - female_avg, 2))` higher fines on average than Females. ### Part C We did not go over how to do this in class, but you can run a **t-test for the difference in group means** to see if the difference is statistically significant. The syntax is similar for a regression: ```{r} # run or edit this chunk t.test(Amount ~ Female, data = speed) ``` Is there a statistically significant difference between `Amount` for male and female drivers? Hint: this is like any hypothesis test. Here $H_0: \text{difference}=0$. A $t$-value needs to be large enough to be greater than a critical value of $t$. Check the $p$-value and see if it is less than our standard of $\alpha=0.05.$ Yes, there is a statistically significant difference in fine amounts between men and women. The p-value is very very small, certainly smaller than 0.05. So we have sufficient evidence to reject the null hypothesis (of no difference). ## Question 4 ### Part A Now run the following regression to ensure we get the same result as the t-test. $$\widehat{\text{Amount}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{Female}_i$$ ```{r} female_reg <- lm(Amount ~ Female, data = speed) summary(female_reg) # can also do broom's tidy() tidy(female_reg) ``` ### Part B Write out the estimated regression equation. $$\widehat{\text{Amount}}_i=124.67-7.94 \, \text{Female}_i$$ ### Part C Use the regression coefficients to find i. the average `Amount` for men ii. the average `Amount` for women iii. the difference in average `Amount` between men and women - Males get fined on average \$124.67 $(\hat{\beta_0})$ - Females get fined on average $124.67-7.94=116.73$ $(\hat{\beta_0}+\hat{\beta_1})$ - The difference is $-\$7.94$ $(\hat{\beta_3})$ ## Question 5 Let's recode the sex variable to `Male` instead of `Female`. ### Part A Make a new variable called `Male` and save it in your dataframe using the `ifelse()` command: ```{r} # run or edit this chunk speed <- speed %>% # overwrite or save as another dataframe mutate(Male = ifelse(test = Female == 0, # test indiv. to see if Female is 0 yes = 1, # if yes (a Male), code Male as 1 no = 0), # if no (a Female), code Male as 0 ) # Verify it worked speed %>% select(Female, Male) ``` ### Part B Run the same regression as in question 4, but use `Male` instead of `Female`. $$\widehat{\text{Amount}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{Male}_i$$ ```{r} # type your code below in this chunk male_reg <- lm(Amount ~ Male, data = speed) summary(male_reg) tidy(male_reg) ``` ### Part C Write out the estimated regression equation. $$\widehat{\text{Amount}}_i = 116.73+7.94 \, \text{Male}_i$$ ### Part D Use the regression coefficients to find i. the average `Amount` for men ii. the average `Amount` for women iii. the difference in average `Amount` between men and women - Females get fined on average \$116.73 $(\hat{\beta_0})$ - Males get fined on average $116.73+7.94=\$124.67$ $(\hat{\beta_0}+\hat{\beta_1})$ - The difference is $7.94$ $(\hat{\beta_3})$ ## Question 6 Run a regression of `Amount` on `Male` and `Female`. What happens, and why? $$\widehat{\text{Amount}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{Male}_i + \hat{\beta_2} \, \text{Female}_i$$ ```{r} dummy_variable_trap <- lm(Amount ~ Male + Female, data = speed) summary(dummy_variable_trap) ``` `Male` and `Female` are perfectly multicollinear, as for every person $i$, `Male`$_i$+`Female`$_i$=1. We can confirm this by seeing the correlation between `Male` and `Female` is exactly $-1$. To run a regression, we must exclude one of the dummies, and as we've seen, it makes no difference which one we exclude. ```{r} speed %>% select(Male, Female) %>% mutate_if(is.factor, as.numeric) %>% # make both factors into numeric cor() # correlation requires numeric data, can't have factors ``` ## Question 7 `Age` probably has a lot to do with differences in fines, perhaps also age affects fines differences between males and females. ### Part A Run a regression of `Amount` on `Age` and `Female.` How does the coefficient on `Female` change? ```{r} age_reg <- lm(Amount ~ Age + Female, data = speed) summary(age_reg) tidy(age_reg) ``` The coefficient on `Female` goes from $-7.94$ (question 4) to $-7.85$ when controlling for `Age`. ### Part B Now let's see if the difference in fine between men and women are different depending on the driver's age. Run the regression again, but add an **interaction term** between `Female` and `Age`, using `Female*Age` or `Female:Age`. $$\widehat{\text{Amount}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{Age}_i + \hat{\beta_2} \, \text{Female}_i + \hat{\beta_3} \, (\text{Age}_i \times \text{Female}_i)$$ ```{r} # note you can also do Age*Female for the interaction term interact_reg <- lm(Amount ~ Age + Female + Age:Female, data = speed) summary(interact_reg) tidy(interact_reg) ``` ### Part C Write out your estimated regression equation. $$\widehat{\text{Amount}}_i=135.55-12.02\, \text{Female}_i-0.33\, \text{Age}_i+0.12 \, (\text{Female}_i \times \text{Age}_i)$$ ### Part D Interpret the interaction effect. Is it statistically significant? The coefficient on the interaction term, $\hat{\beta_3}$ is 0.12. For every additional year of age, females can expect their fine to increase by $0.12 *more* than males gain for every additional year of age. $\hat{beta_3}$ has a standard error of 0.52, which means it has a $t$-statistic of 2.355 and $p$-value of 0.0185, so it is statistically significant (at the 95% level). ### Part E Plugging in 0 or 1 as necessary, rewrite (on your paper) this regression as *two separate* equations, one for Males and one for Females. For Males $(Female=0)$: $$\begin{align*} \widehat{Amount}&=135.55-0.33Age-12.02Female+0.12Female \times Age\\ &=135.55-0.33Age-12.02(0)+0.12(0)Age\\ &=135.55-0.33Age\\ \end{align*}$$ For Females $(Female=1)$: $$\begin{align*} \widehat{Amount}&=135.55-0.33Age-12.02Female+0.12Female \times Age\\ &=135.55-0.33Age-12.02(1)+0.12(1)Age\\ &=(135.55-12.02)+(-0.33+0.12)Age\\ &=123.53-0.21Age\\ \end{align*}$$ ```{r} # Another way to do this is to run conditional regressions: # subset data for only Males speed_males <- speed %>% filter(Female == 0) # subset data for only Feales speed_females <- speed %>% filter(Female == 1) # run a regression for each subset male_age_reg <- lm(Amount ~ Age, data = speed_males) tidy(male_age_reg) female_age_reg <- lm(Amount ~ Age, data = speed_females) tidy(female_age_reg) # compare regressions modelsummary(models = list("Males Only" = male_age_reg, "Females Only" = female_age_reg), fmt = 2, # round to 2 decimals output = "html", coef_rename = c("(Intercept)" = "Constant"), gof_map = list( list("raw" = "nobs", "clean" = "n", "fmt" = 0), list("raw" = "adj.r.squared", "clean" = "Adj. R²", "fmt" = 2), list("raw" = "rmse", "clean" = "SER", "fmt" = 2) ), escape = FALSE, stars = c('*' = .1, '**' = .05, '***' = 0.01) ) ``` ### Part F Let's try to visualize this. Make a scatterplot of `Age` (X) and `Amount` (Y) and include a regression line. Try adding `color = Female` inside your original (global) `aes()` layer. This will produce two sets of points and regression lines colored by `Female`. By the way, if it isn't a `factor` variable already, we can ensure that it is with `as.factor(Female)`. We shouldn't need to in *this* case because we already reset `Female` as a faction in question 1. ```{r} # if we don't color p1 <- ggplot(data = speed)+ aes(x = Age, y = Amount)+ geom_point()+ geom_smooth(method = "lm") p1 # if we color by Female p2 <- ggplot(data = speed)+ aes(x = Age, y = Amount, color = Female)+ geom_point()+ geom_smooth(method = "lm") p2 ``` ### Part G Add a final facet layer to the plot make two different sub-plots by sex with `facet_wrap( ~ Female)`. ```{r} p2+ facet_wrap(~Female) ``` ## Question 8 Now let's look at the possible interaction between sex (`Male` or `Female`) and whether a driver is from In-State or Out-of-State (`OutState`). ### Part A Use `R` to examine the data and find the average fine for: i. Males In-State ii. Males Out-of-State iii. Females In-State iv. Females Out-of-State ```{r} # Summarize for Men in State speed %>% filter(Female == 0, OutState == 0) %>% summarize(mean = mean(Amount)) # Summarize for Men Out of State speed %>% filter(Female == 0, OutState == 1) %>% summarize(mean = mean(Amount)) # Summarize for Women in State speed %>% filter(Female == 1, OutState == 0) %>% summarize(mean = mean(Amount)) # Summarize for Women Out of State speed %>% filter(Female == 1, OutState == 1) %>% summarize(mean = mean(Amount)) ``` ### Part B Now run a regression of the following model: $$\widehat{\text{Amount}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{Female}_i+\hat{\beta_2} \, \text{OutState}_i+\hat{\beta_3} \, (\text{Female}_i \times \text{OutState}_i)$$ ```{r} sex_state_reg <- lm(Amount ~ Female + Female:OutState, data = speed) summary(sex_state_reg) tidy(sex_state_reg) ``` ### Part C Write out the estimated regression equation. $$\widehat{\text{Amount}}_i=123.68-8.88 \, \text{Female}_i+4.29 \, \text{OutState}_i+5.17 \, (\text{Female}_i\times \text{OutState}_i)$$ ### Part D What does each coefficient mean? - $\hat{\beta_0}= \$123.68$; mean for in-state males - $\hat{\beta_1}=-\$8.88$: difference between in-state males and females - $\hat{\beta_2}=\$4.29$: difference between males in-state vs. out-of-state - $\hat{\beta_3}=\$5.17$: difference between effect of being in-state vs. out-of-state between males vs. females (or, equivalently, difference between effect of being male vs. female between in-state vs. out-of-state) ### Part E Using the regression equation, what are the average fine for i. Males In-State ii. Males Out-of-State iii. Females In-State iv. Females Out-of-State Compare to your answers in part A. - Males In-State: $\hat{\beta_0}=\$123.68$ - Males Out-of-State: $\hat{\beta_0}+\hat{\beta_2}=123.68+4.29=\$127.97$ - Females In-State: $\hat{\beta_0}+\hat{\beta_1}=123.68-8.88=\$114.80$ - Females Out-of-State: $\hat{\beta_0}+\hat{\beta_1}+\hat{\beta_2}+\hat{\beta_3}=123.68-8.88+4.29+5.17=\$124.26$ ## Question 9 Collect your regressions from questions 4, 5b, 7a, 7b, and 8b and output them in a regression table with `modelsummary()`. ```{r} modelsummary(models = list(female_reg, male_reg, age_reg, interact_reg, sex_state_reg), fmt = 2, # round to 2 decimals output = "html", coef_rename = c("(Intercept)" = "Constant"), gof_map = list( list("raw" = "nobs", "clean" = "n", "fmt" = 0), list("raw" = "adj.r.squared", "clean" = "Adj. R²", "fmt" = 2), list("raw" = "rmse", "clean" = "SER", "fmt" = 2) ), escape = FALSE, stars = c('*' = .1, '**' = .05, '***' = 0.01) ) ```