4.3 — Categorical Data

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner
Associate Professor of Economics

## Categorical Variables

• Categorical variables place an individual into one of several possible categories
• e.g. sex, season, political party
• may be responses to survey questions
• can be quantitative (e.g. age, zip code)
• In R: character or factor type data
• factor $\implies$ specific possible categories

# Working with factor Variables in R

## Factors in R I

• factor is a special type of character object class that indicates membership in a category (called a level)

• Suppose I have data on students:

• See that rank is a character (<chr>) variable, just a string of text
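Since the printed tibble did not survive the export, here is a minimal base-R sketch of what such a students data frame could look like (the names and ranks are made up for illustration):

```r
# Hypothetical students data; rank starts as plain character strings
students <- data.frame(
  student = c("Anna", "Ben", "Carlos", "Dmitri", "Emma"),
  rank    = c("Freshman", "Junior", "Senior", "Sophomore", "Freshman"),
  stringsAsFactors = FALSE  # keep rank as character for now
)

class(students$rank)  # "character": just text, no category information yet
```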

## Factors in R II

• We can make rank a factor variable, to indicate a student is a member of one of the possible categories: (freshman, sophomore, junior, senior)
students <- students %>%
mutate(rank = as.factor(rank)) # overwrite and change class of rank to factor

students %>% head(n = 5)

## Factors in R III

# what are the categories?
students %>%
group_by(rank) %>%
count()
# note the order is arbitrary! This is an "unordered" factor

## Ordered Factors in R I

• If there is a rank order you wish to preserve, you can make an ordered (factor) variable
• list the levels from 1st to last
students <- students %>%
mutate(rank = ordered(rank, # overwrite and change class of rank to ordered
# next, specify the levels, in order
levels = c("Freshman", "Sophomore", "Junior", "Senior")
))
students %>% head(n = 5)

## Ordered Factors in R II

students %>%
group_by(rank) %>%
count()

## Example Research Question with Categorical Data

Example

How much higher wages, on average, do men earn compared to women?

## A Difference in Group Means

• Basic statistics: we can test for a statistically significant difference in group means with a t-test, let:

• $\bar{Y}_M$: average earnings of a sample of $n_M$ men

• $\bar{Y}_W$: average earnings of a sample of $n_W$ women

• Difference in group averages: $d=\bar{Y}_M-\bar{Y}_W$

• The hypothesis test is:

• $H_0: d=0$
• $H_1: d \neq 0$
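As a sketch of this test in R, `t.test()` computes $d$, its standard error, and the $p$-value in one call. The earnings below are simulated for illustration; they are not the course's wages data:

```r
set.seed(42)
# Simulated earnings (hypothetical means and spreads)
men   <- rnorm(100, mean = 7, sd = 4)    # sample of n_M men
women <- rnorm(100, mean = 4.5, sd = 2)  # sample of n_W women

d    <- mean(men) - mean(women)  # difference in group averages
test <- t.test(men, women)       # H0: d = 0 vs. H1: d != 0
test$p.value                     # small p-value -> reject H0
```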

## Plotting factors in R

• Plotting wage vs. a factor variable, e.g. gender (which is either Male or Female) looks like this
ggplot(data = wages)+
aes(x = gender,
y = wage)+
geom_point(aes(color = gender))+
geom_smooth(method = "lm", color = "black")+
scale_y_continuous(labels = scales::dollar)+
scale_color_manual(values = c("Female" = "#e64173", "Male" = "#0047AB"))+
labs(x = "Gender",
y = "Wage")+
guides(color = "none")+ # hide legend
theme_bw(base_family = "Fira Sans Condensed",
base_size = 20)
• Effectively R treats values of a factor variable as integers (e.g. "Female" = 0, "Male" = 1)
• Let’s make this more explicit by making a dummy variable to stand in for gender

# Regression with Dummy Variables

## Comparing Groups with Regression

• In a regression, we can easily compare across groups via a dummy variable

• A dummy variable only takes the values $0$ or $1$, indicating whether a condition is TRUE or FALSE

• It signifies whether an observation belongs to a category or not

Example

$\widehat{Wage}_i=\hat{\beta_0}+\hat{\beta_1} \, Female_i \quad \quad \text{ where } Female_i = \begin{cases} 1 & \text{if individual } i \text{ is } Female \\ 0 & \text{if individual } i \text{ is } Male\\ \end{cases}$

• Again, $\hat{\beta_1}$ makes less sense as the “slope” of a line in this context

## Comparing Groups in Regression: Scatterplot

ggplot(data = wages)+
aes(x = as.factor(female),
y = wage)+
geom_point(aes(color = gender))+
geom_smooth(method = "lm", color = "black")+
scale_y_continuous(labels = scales::dollar)+
scale_color_manual(values = c("Female" = "#e64173", "Male" = "#0047AB"))+
labs(x = "Female",
y = "Wage")+
guides(color = "none")+ # hide legend
theme_bw(base_family = "Fira Sans Condensed",
base_size = 20)
• Hard to see relationships because of overplotting . . .

## Comparing Groups in Regression: Scatterplot

ggplot(data = wages)+
aes(x = as.factor(female),
y = wage)+
geom_jitter(aes(color = gender),
position = position_jitter(width = 0.05, seed = 2))+ # seed is set via position_jitter()
geom_smooth(method = "lm", color = "black")+
scale_y_continuous(labels = scales::dollar)+
scale_color_manual(values = c("Female" = "#e64173", "Male" = "#0047AB"))+
labs(x = "Female",
y = "Wage")+
guides(color = "none")+ # hide legend
theme_bw(base_family = "Fira Sans Condensed",
base_size = 20)
• Tip: use geom_jitter() instead of geom_point() to randomly nudge points!
• Only used for plotting, does not affect actual data, regression, etc.

## Dummy Variables as Group Means

$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1} D_i \quad \text{ where }D_i=\{\color{#6A5ACD}{0},\color{#e64173}{1}\}$

• When $D_i=0$ (“Control group”):
• $\hat{Y_i}=\hat{\beta_0}$
• $\color{#6A5ACD}{\mathbb{E}[Y_i|D_i=0]}=\hat{\beta_0}$ $\iff$ the mean of $Y$ when $D_i=0$
• When $D_i=1$ (“Treatment group”):
• $\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1} D_i$
• $\color{#e64173}{\mathbb{E}[Y_i|D_i=1]}=\hat{\beta_0}+\hat{\beta_1}$ $\iff$ the mean of $Y$ when $D_i=1$
• So the difference in group means:

\begin{align*} &=\color{#e64173}{\mathbb{E}[Y_i|D_i=1]}-\color{#6A5ACD}{\mathbb{E}[Y_i|D_i=0]}\\ &=(\hat{\beta_0}+\hat{\beta_1})-(\hat{\beta_0})\\ &=\hat{\beta_1}\\ \end{align*}
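This algebra can be verified numerically: regressing $Y$ on a single dummy reproduces the group means exactly. A base-R sketch with simulated data (the intercept and treatment effect below are made up):

```r
set.seed(7)
# Simulated data: D is a 0/1 dummy; Y differs by group
D <- rep(c(0, 1), each = 50)
Y <- 5 + 3 * D + rnorm(100)

reg <- lm(Y ~ D)

# beta0_hat equals the control-group mean; beta1_hat equals the
# difference in group means, exactly as the derivation shows
beta0_hat <- coef(reg)["(Intercept)"]
beta1_hat <- coef(reg)["D"]
mean_diff <- mean(Y[D == 1]) - mean(Y[D == 0])
```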

## Dummy Variables as Group Means: Our Example

Example

$\widehat{Wage}_i=\hat{\beta_0}+\hat{\beta_1} \, Female_i$

• Mean wage for men:

$\mathbb{E}[Wage|Female=0]=\hat{\beta_0}$

• Mean wage for women:

$\mathbb{E}[Wage|Female=1]=\hat{\beta_0}+\hat{\beta_1}$

• Difference in wage between men & women:

$\hat{\beta_1}$

## Comparing Groups in Regression: Scatterplot

$\widehat{Wage}_i=\hat{\beta_0}+\hat{\beta_1} \, Female_i$


## Conditional Group Means

# Summarize for Men

wages %>%
filter(gender=="Male") %>%
summarize(mean = mean(wage),
sd = sd(wage))
# Summarize for Women

wages %>%
filter(gender=="Female") %>%
summarize(mean = mean(wage),
sd = sd(wage))

## The Regression (factor variables)

reg <- lm(wage ~ gender, data = wages)
summary(reg)

Call:
lm(formula = wage ~ gender, data = wages)

Residuals:
Min      1Q  Median      3Q     Max
-5.5995 -1.8495 -0.9877  1.4260 17.8805

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5877     0.2190  20.950  < 2e-16 ***
genderMale    2.5118     0.3034   8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(reg)
• When we put the factor variable gender in the regression, R automatically creates a dummy for one of its categories, here genderMale (Female becomes the omitted reference category)
• genderMale $=1$ for Male, $=0$ for Female
• According to the data, men earn, on average, $2.51 more than women

## The Regression: Dummy Variables

• Let’s explicitly make gender into a dummy variable for female:
# add a female dummy variable
wages <- wages %>%
mutate(female = ifelse(test = gender == "Female",
yes = 1,
no = 0))
wages

## The Regression (Dummy Variables)

female_reg <- lm(wage ~ female, data = wages)
summary(female_reg)

Call:
lm(formula = wage ~ female, data = wages)

Residuals:
Min      1Q  Median      3Q     Max
-5.5995 -1.8495 -0.9877  1.4260 17.8805

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
female       -2.5118     0.3034  -8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(female_reg)

## Dummy Regression vs. Group Means

From tabulation of group means:

| Gender | Avg. Wage | Std. Dev. | $n$ |
|---|---|---|---|
| Female | $4.59$ | $2.33$ | $252$ |
| Male | $7.10$ | $4.16$ | $274$ |
| Difference | $2.51$ | $0.30$ | $-$ |

From $t$-test of difference in group means:

$\widehat{\text{Wages}_i}=7.10-2.51 \, \text{Female}_i$

# Recoding Dummy Variables

## Recoding Dummy Variables

Example

Suppose instead of female we had used:

$\widehat{Wage}_i=\hat{\beta_0}+\hat{\beta_1} \, Male_i \quad \quad \text{ where } Male_i = \begin{cases} 1 & \text{if person } i \text{ is } Male \\ 0 & \text{if person } i \text{ is } Female\\ \end{cases}$

## Recoding Dummies in the Data

wages <- wages %>%
mutate(male = ifelse(female == 0, # condition: is female equal to 0?
yes = 1, # if true: code as "1"
no = 0)) # if false: code as "0"

# verify it worked
wages %>%
select(wage, female, male) %>%
head(n = 5)

## Scatterplot with Male

## Dummy Variables as Group Means: With Male

Example

$\widehat{Wage}_i=\hat{\beta_0}+\hat{\beta_1} \, Male_i$

• Mean wage for men:

$\mathbb{E}[Wage|Male=1]=\hat{\beta_0}+\hat{\beta_1}$

• Mean wage for women:

$\mathbb{E}[Wage|Male=0]=\hat{\beta_0}$

• Difference in wage between men & women:

$\hat{\beta_1}$

## Scatterplot & Regression Line with Male

## The Regression with Male

male_reg <- lm(wage ~ male, data = wages)
summary(male_reg)

Call:
lm(formula = wage ~ male, data = wages)

Residuals:
Min      1Q  Median      3Q     Max
-5.5995 -1.8495 -0.9877  1.4260 17.8805

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5877     0.2190  20.950  < 2e-16 ***
male          2.5118     0.3034   8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(male_reg)

## The Dummy Regression: Male or Female

| | Wage | Wage |
|---|---|---|
| Constant | 7.10*** | 4.59*** |
| | (0.21) | (0.22) |
| female | −2.51*** | |
| | (0.30) | |
| male | | 2.51*** |
| | | (0.30) |
| n | 526 | 526 |
| Adj. R2 | 0.11 | 0.11 |
| SER | 3.47 | 3.47 |

* p < 0.1, ** p < 0.05, *** p < 0.01

• Note it doesn’t matter if we use male or female; the difference is always $2.51

• Compare the constant (average for the $D=0$ group)

• Should you use male AND female in a regression? We’ll come to that…

# Categorical Variables (More than 2 Categories)

## Categorical Variables with More than 2 Categories

• A categorical variable expresses membership in a category, where there is no ranking or hierarchy of the categories
• e.g. Male/Female, Spring/Summer/Fall/Winter, Democratic/Republican/Independent
• So far we’ve only looked at categorical variables with 2 categories
• A variable might instead be ordinal, expressing a rank or ordering of the data, but not necessarily their relative magnitude
• e.g. Order of finalists in a competition (1st, 2nd, 3rd)
• e.g. Highest education attained (1=elementary school, 2=high school, 3=bachelor’s degree, 4=graduate degree)
• in R, an ordered factor

## Using Categorical Variables in Regression I

Example

How do wages vary by region of the country? Let $Region_i=\{Northeast, \, Midwest, \, South, \, West\}$

• Can we run the following regression?

$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1} \, Region_i$

## Using Categorical Variables in Regression II

Example

How do wages vary by region of the country? Let $Region_i=\{Northeast, \, Midwest, \, South, \, West\}$

• Code region numerically:

$Region_i= \begin{cases}1 & \text{if } i \text{ is in }Northeast\\ 2 & \text{if } i \text{ is in } Midwest\\ 3 & \text{if } i \text{ is in } South \\ 4 & \text{if } i \text{ is in } West\\ \end{cases}$

• Can we run the following regression?

$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1} \, Region_i$

## Using Categorical Variables in Regression III

Example

How do wages vary by region of the country? Let $Region_i=\{Northeast, \, Midwest, \, South, \, West\}$

• Create a dummy variable for each region:
• $Northeast_i = 1$ if $i$ is in Northeast, otherwise $=0$
• $Midwest_i = 1$ if $i$ is in Midwest, otherwise $=0$
• $South_i = 1$ if $i$ is in South, otherwise $=0$
• $West_i = 1$ if $i$ is in West, otherwise $=0$
• Can we run the following regression?

$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i+\hat{\beta_4}West_i$

• For every $i: \, Northeast_i+Midwest_i+South_i+West_i=1$!

## The Dummy Variable Trap

Example

$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i+\hat{\beta_4}West_i$

• If we include all possible categories, they are perfectly multicollinear, an exact linear function of one another:

$Northeast_i+Midwest_i+South_i+West_i=1 \quad \forall i$

• This is known as the dummy variable trap, a common source of perfect multicollinearity
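The trap is easy to reproduce: build all four dummies and R's `lm()` cannot estimate all of them alongside the intercept. A base-R sketch with simulated data (region shares and wages are made up):

```r
set.seed(1)
# Simulated regions: every observation is in exactly one region
region <- sample(c("Northeast", "Midwest", "South", "West"), 200, replace = TRUE)
wage   <- 6 + rnorm(200)

northeast <- as.numeric(region == "Northeast")
midwest   <- as.numeric(region == "Midwest")
south     <- as.numeric(region == "South")
west      <- as.numeric(region == "West")

# The four dummies always sum to 1: perfectly collinear with the intercept
stopifnot(all(northeast + midwest + south + west == 1))

trap <- lm(wage ~ northeast + midwest + south + west)
coef(trap)  # one coefficient comes back NA: R drops it for us
```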

## The Reference Category

• To avoid the dummy variable trap, always omit one category from the regression, known as the “reference category”

• It does not matter which category we omit!

• The coefficient on each dummy variable measures the difference in means between that category and the reference category

## The Reference Category: Example

Example

$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i$

• $West_i$ is omitted (arbitrarily chosen)
• $\hat{\beta_0}$: average wage for $i$ in the West (the omitted reference category)
• $\hat{\beta_1}$: difference in average wage between the Northeast and the West
• $\hat{\beta_2}$: difference in average wage between the Midwest and the West
• $\hat{\beta_3}$: difference in average wage between the South and the West
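In R, `lm()` picks the reference category for a factor automatically (the first level, alphabetical by default); `relevel()` lets you choose a different one, e.g. omitting West as above. A base-R sketch with simulated region data (values are hypothetical):

```r
set.seed(2)
# Hypothetical region factor and wages
region <- factor(sample(c("Midwest", "Northeast", "South", "West"),
                        200, replace = TRUE))
wage   <- 6 + rnorm(200)

levels(region)  # "Midwest" is first alphabetically, so it is the default reference

# Make West the omitted reference category instead
region_w <- relevel(region, ref = "West")
reg_w <- lm(wage ~ region_w)
names(coef(reg_w))  # dummies for Midwest, Northeast, South; West omitted
```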

## Regression in R with Categorical Variable

lm(wage ~ region, data = wages) %>% summary()

Call:
lm(formula = wage ~ region, data = wages)

Residuals:
Min     1Q Median     3Q    Max
-6.083 -2.387 -1.097  1.157 18.610

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)       5.7105     0.3195  17.871   <2e-16 ***
regionNortheast   0.6593     0.4651   1.418   0.1569
regionSouth      -0.3236     0.4173  -0.775   0.4385
regionWest        0.9029     0.5035   1.793   0.0735 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.671 on 522 degrees of freedom
Multiple R-squared:  0.0175,    Adjusted R-squared:  0.01185
F-statistic: 3.099 on 3 and 522 DF,  p-value: 0.02646

## Regression in R with Dummies (& Dummy Variable Trap)

lm(wage ~ northeast + midwest + south + west, data = wages) %>% summary()

Call:
lm(formula = wage ~ northeast + midwest + south + west, data = wages)

Residuals:
Min     1Q Median     3Q    Max
-6.083 -2.387 -1.097  1.157 18.610

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.6134     0.3891  16.995  < 2e-16 ***
northeast    -0.2436     0.5154  -0.473  0.63664
midwest      -0.9029     0.5035  -1.793  0.07352 .
south        -1.2265     0.4728  -2.594  0.00974 **
west              NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.671 on 522 degrees of freedom
Multiple R-squared:  0.0175,    Adjusted R-squared:  0.01185
F-statistic: 3.099 on 3 and 522 DF,  p-value: 0.02646
• R automatically drops one category to avoid perfect multicollinearity

## Using Different Reference Categories in R

| | (1) No Northeast | (2) No Midwest | (3) No South | (4) No West |
|---|---|---|---|---|
| Constant | 6.37*** | 5.71*** | 5.39*** | 6.61*** |
| | (0.34) | (0.32) | (0.27) | (0.39) |
| midwest | −0.66 | | 0.32 | −0.90* |
| | (0.47) | | (0.42) | (0.50) |
| south | −0.98** | −0.32 | | −1.23*** |
| | (0.43) | (0.42) | | (0.47) |
| west | 0.24 | 0.90* | 1.23*** | |
| | (0.52) | (0.50) | (0.47) | |
| northeast | | 0.66 | 0.98** | −0.24 |
| | | (0.47) | (0.43) | (0.52) |
| n | 526 | 526 | 526 | 526 |
| R2 | 0.02 | 0.02 | 0.02 | 0.02 |
| Adj. R2 | 0.01 | 0.01 | 0.01 | 0.01 |
| SER | 3.66 | 3.66 | 3.66 | 3.66 |

* p < 0.1, ** p < 0.05, *** p < 0.01
• Constant is always the average wage for the reference (omitted) region

• Compare coefficients between Midwest in (1) and Northeast in (2)…

• Compare coefficients between West in (3) and South in (4)…

• Does not matter which region we omit!

• Same $R^2$ and SER; the coefficients give the same results

## Dummy Dependent (Y) Variables

• In many contexts, we will want to have our dependent $(Y)$ variable be a dummy variable

Example

$\widehat{\text{Admitted}}_i=\hat{\beta_0}+\hat{\beta_1} \, GPA_i \quad \text{ where } \text{Admitted}_i = \begin{cases} 1 & \text{if } i \text{ is Admitted} \\ 0 & \text{if } i \text{ is Not Admitted}\\ \end{cases}$

• A model where $Y$ is a dummy is called a linear probability model, as it measures the probability of $Y$ occurring given the $X$’s, i.e. $P(Y_i=1|X_1, \cdots, X_k)$
• e.g. the probability person $i$ is Admitted to a program with a given GPA
• Special models to properly interpret and extend this (logistic “logit”, probit, etc)

• Feel free to write papers with dummy $Y$ variables!
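A minimal sketch of a linear probability model, with simulated admissions data (the GPA-to-admission relationship below is invented for illustration):

```r
set.seed(3)
# Simulated data: higher GPA -> higher admission probability
gpa      <- runif(500, min = 2, max = 4)
admitted <- rbinom(500, size = 1, prob = pmin(pmax(0.5 * gpa - 1, 0), 1))

# Linear probability model: slope = change in P(admitted) per GPA point
lpm <- lm(admitted ~ gpa)
coef(lpm)["gpa"]

# A logit fits the same idea while keeping predictions inside [0, 1]
logit <- glm(admitted ~ gpa, family = binomial(link = "logit"))
```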

# Interaction Effects

## Sliders and Switches

• Marginal effect of dummy variable: effect on $Y$ of going from 0 to 1
• Marginal effect of continuous variable: effect on $Y$ of a 1 unit change in $X$

## Interaction Effects

• Sometimes one $X$ variable might interact with another in determining $Y$

Example

Consider the gender pay gap again.
• Gender affects wages
• Experience affects wages

• Does experience affect wages differently by gender?
• i.e. is there an interaction effect between gender and experience?
• Note this is NOT the same as just asking: “do men earn more than women with the same amount of experience?”

$\widehat{\text{wages}}_i=\beta_0+\beta_1 \, \text{Gender}_i + \beta_2 \, \text{Experience}_i$

## Three Types of Interactions

• Depending on the types of variables, there are 3 possible types of interaction effects

• We will look at each in turn

1. Interaction between a dummy and a continuous variable:

$Y_i=\beta_0+\beta_1X_i+\beta_2 D_i+\beta_3 \color{#e64173}{(X_i \times D_i)}$

2. Interaction between two dummy variables:

$Y_i=\beta_0+\beta_1D_{1i}+\beta_2 D_{2i}+\beta_3 \color{#e64173}{(D_{1i} \times D_{2i})}$

3. Interaction between two continuous variables:

$Y_i=\beta_0+\beta_1X_{1i}+\beta_2 X_{2i}+\beta_3 \color{#e64173}{(X_{1i} \times X_{2i})}$
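In R's formula syntax, each of these three models can be written with `:` for the interaction term alone, or `*` as shorthand for main effects plus the interaction. The variable names below are placeholders, not objects in the course data (R formulas don't evaluate their variables until a model is fit):

```r
# One formula per interaction type; a:b denotes the interaction term alone
f1 <- y ~ x + d + x:d        # 1. dummy x continuous
f2 <- y ~ d1 + d2 + d1:d2    # 2. dummy x dummy
f3 <- y ~ x1 + x2 + x1:x2    # 3. continuous x continuous

# Shorthand: a * b expands to a + b + a:b
f3_short <- y ~ x1 * x2
attr(terms(f3_short), "term.labels")  # "x1" "x2" "x1:x2"
```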

# Interactions Between a Dummy and Continuous Variable

## Interactions: A Dummy & Continuous Variable

• Does the marginal effect of the continuous variable on $Y$ change depending on whether the dummy is “on” or “off”?

## Interactions: A Dummy & Continuous Variable I

• We can model an interaction by introducing an interaction term capturing the interaction between two variables:

$Y_i=\beta_0+\beta_1X_i+\beta_2 D_i+\color{#e64173}{\beta_3(X_i \times D_i)} \quad \text{ where } D_i=\{0,1\}$

• $\color{#e64173}{\beta_3}$ estimates the interaction effect between $X_i$ and $D_i$ on $Y_i$
• What do the different coefficients $(\beta)$’s tell us?
• Again, think logically by examining each group $(D_i=0$ or $D_i=1)$

## Dummy-Continuous Interaction Effects as Two Regressions I

$Y_i=\beta_0+\beta_1X_i+\beta_2 D_i+\beta_3 X_i \times D_i$

• When $D_i=0$ (“Control group”):

\begin{align*} \hat{Y_i}&=\hat{\beta_0}+\hat{\beta_1}X_i+\hat{\beta_2}(\color{red}{0})+\hat{\beta_3}X_i \times (\color{red}{0})\\ \hat{Y_i}& =\hat{\beta_0}+\hat{\beta_1}X_i\\ \end{align*}

• When $D_i=1$ (“Treatment group”):

\begin{align*} \hat{Y_i}&=\hat{\beta_0}+\hat{\beta_1}X_i+\hat{\beta_2}(\color{blue}{1})+\hat{\beta_3}X_i \times (\color{blue}{1})\\ \hat{Y_i}&= (\hat{\beta_0}+\hat{\beta_2})+(\hat{\beta_1}+\hat{\beta_3})X_i\\ \end{align*}

• So what we really have is two regression lines!

## Dummy-Continuous Interaction Effects as Two Regressions II

• $D_i=0$ group:

$\color{#D7250E}{\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i}$

• $D_i=1$ group:

$\color{#0047AB}{\hat{Y_i}=(\hat{\beta_0}+\hat{\beta_2})+(\hat{\beta_1}+\hat{\beta_3})X_i}$

## Interpreting Coefficients I

$Y_i=\beta_0+\beta_1X_i+\beta_2 D_i+\beta_3 \color{#e64173}{(X_i \times D_i)}$

• To interpret the coefficients, compare cases after changing $X$ by $\color{#e64173}{\Delta X}$:

$Y_i+\color{#e64173}{\Delta Y_i}=\beta_0+\beta_1(X_i\color{#e64173}{+\Delta X_i})+\beta_2D_i+\beta_3\big((X_i\color{#e64173}{+\Delta X_i})D_i\big)$

• Subtracting these two equations, the difference is:

\begin{align*} \Delta Y_i &= \beta_1 \Delta X_i + \beta_3 D_i \Delta X_i\\ \color{#6A5ACD}{\frac{\Delta Y_i}{\Delta X_i}} &\color{#6A5ACD}{= \beta_1+\beta_3 D_i}\\ \end{align*}

• The effect of $X \rightarrow Y$ depends on the value of $D_i$!

• $\beta_3$: increment to the effect of $X \rightarrow Y$ when $D_i=1$ (vs. $D_i=0$)

## Interpreting Coefficients II

$Y_i=\beta_0+\beta_1X_i+\beta_2 D_i+\beta_3 \color{#e64173}{(X_i \times D_i)}$

• $\hat{\beta_0}$: $\mathbb{E}[Y_i]$ for $X_i=0$ and $D_i=0$
• $\hat{\beta_1}$: Marginal effect of $X_i \rightarrow Y_i$ for $D_i=0$
• $\hat{\beta_2}$: Marginal effect on $Y_i$ of the difference between $D_i=0$ and $D_i=1$ when $X_i=0$ (“intercepts”)
• $\hat{\beta_3}$: The difference in the marginal effect of $X_i \rightarrow Y_i$ between $D_i=0$ and $D_i=1$ (“slopes”)
• This is a bit awkward; it is easier to think about the two regression lines:

## Interpreting Coefficients III

$Y_i=\beta_0+\beta_1X_i+\beta_2 D_i+\beta_3 \color{#e64173}{(X_i \times D_i)}$

• For $D_i=0$ Group: $\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i$
• Intercept: $\hat{\beta_0}$
• Slope: $\hat{\beta_1}$
• For $D_i=1$ Group: $\hat{Y_i}=(\hat{\beta_0}+\hat{\beta_2})+(\hat{\beta_1}+\hat{\beta_3})X_i$
• Intercept: $\hat{\beta_0}+\hat{\beta_2}$
• Slope: $\hat{\beta_1}+\hat{\beta_3}$
• $\hat{\beta_2}$: difference in intercept between groups

• $\hat{\beta_3}$: difference in slope between groups

• How can we determine if the two lines have the same slope and/or intercept?
• Same intercept? $t$-test $H_0$: $\beta_2=0$
• Same slope? $t$-test $H_0$: $\beta_3=0$

## Interactions in Our Example

Example

$\widehat{wage}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{experience}_i+\hat{\beta_2} \, \text{female}_i+\hat{\beta_3} \, (\text{experience}_i \times \text{female}_i)$

• For men $(female=0)$:

$\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1} \, \text{experience}_i$

• For women $(female=1)$:

$\widehat{wage}_i=\underbrace{(\hat{\beta_0}+\hat{\beta_2})}_{\text{intercept}}+\underbrace{(\hat{\beta_1}+\hat{\beta_3})}_{\text{slope}} \, \text{experience}_i$

## Interactions in Our Example: Scatterplot

Code
interaction_plot <- ggplot(data = wages)+
aes(x = exper,
y = wage,
color = as.factor(gender))+ # make factor
geom_point(alpha = 0.5)+
scale_y_continuous(limits = c(0,26),
expand = c(0,0),
labels=scales::dollar)+
scale_x_continuous(limits = c(0,55),
expand = c(0,0))+
labs(x = "Experience (Years)",
y = "Wage",
color = "Gender")+
scale_color_manual(values = c("Female" = "#e64173",
"Male" = "#0047AB")
)+ # setting custom colors
theme_bw()+
theme(legend.position = "bottom")
interaction_plot

## Interactions in Our Example: Scatterplot

Code
interaction_plot +
geom_smooth(method = "lm")

## Interactions in Our Example: Scatterplot

Code
interaction_plot +
geom_smooth(method = "lm") +
facet_wrap(~ gender)

## Interactions in Our Example: Regression in R

• Syntax for adding an interaction term in R is easy: x1:x2
• Or could just do x1 * x2, which expands to x1 + x2 + x1:x2
# both are identical in R
interaction_reg <- lm(wage ~ exper * female, data = wages)
interaction_reg <- lm(wage ~ exper + female + exper:female, data = wages)
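A quick way to verify that the two specifications are identical is to fit both and compare coefficients. This sketch uses simulated data with made-up effect sizes, not the course's wages data:

```r
set.seed(6)
# Simulated data (all coefficients hypothetical, for illustration only)
exper  <- sample(0:40, 200, replace = TRUE)
female <- rbinom(200, 1, 0.5)
wage   <- 6 + 0.05 * exper - 1.5 * female - 0.06 * exper * female + rnorm(200)

r1 <- lm(wage ~ exper * female)                 # shorthand
r2 <- lm(wage ~ exper + female + exper:female)  # written out

# Both produce the same design matrix, hence identical estimates
all.equal(coef(r1), coef(r2))
```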

## Interactions in Our Example: Regression

Code
modelsummary(models = list("Wage" = interaction_reg),
fmt = 2, # round to 2 decimals
output = "html",
coef_rename = c("(Intercept)" = "Constant"),
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
#list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
),
escape = FALSE,
stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
| | Wage |
|---|---|
| Constant | 6.16*** |
| | (0.34) |
| exper | 0.05*** |
| | (0.02) |
| female | −1.55*** |
| | (0.48) |
| exper:female | −0.06** |
| | (0.02) |
| n | 526 |
| SER | 3.43 |

* p < 0.1, ** p < 0.05, *** p < 0.01

## Interactions in Our Example: Interpreting Coefficients

$\widehat{\text{wage}}_i=6.16+0.05 \, \text{experience}_i - 1.55 \, \text{female}_i - 0.06 \, (\text{experience}_i \times \text{female}_i)$

• $\hat{\beta_0}$: Men with 0 years of experience earn $6.16 on average
• $\hat{\beta_1}$: For every additional year of experience, men earn $0.05 more
• $\hat{\beta_2}$: Women with 0 years of experience earn $1.55 less than men
• $\hat{\beta_3}$: For every additional year of experience, women earn $0.06 less than men
• so for every additional year of experience, women earn $0.01 less on average $(0.05-0.06)$

## Interactions in Our Example: Hypothesis Testing

$\widehat{\text{wage}}_i=6.16+0.05 \, \text{experience}_i - 1.55 \, \text{female}_i - 0.06 \, (\text{experience}_i \times \text{female}_i)$

• Are the intercepts of the 2 regressions different? $\color{#6A5ACD}{H_0: \beta_2=0}$
• Difference between men vs. women with no experience?
• Is $\hat{\beta_2}$ significant?
• Yes (reject $\color{#6A5ACD}{H_0}$): $p$-value = 0.00
• Are the slopes of the 2 regressions different? $\color{#6A5ACD}{H_0: \beta_3=0}$
• Difference between men vs. women in the marginal effect of experience?
• Is $\hat{\beta_3}$ significant?
• Yes (reject $\color{#6A5ACD}{H_0}$): $p$-value = 0.01

# Interactions Between Two Dummy Variables

## Interactions Between Two Dummy Variables

• Does the marginal effect on $Y$ of one dummy going from “off” to “on” change depending on whether the other dummy is “off” or “on”?

## Interactions Between Two Dummy Variables

$Y_i=\beta_0+\beta_1D_{1i}+\beta_2 D_{2i}+\beta_3 \color{#e64173}{(D_{1i} \times D_{2i})}$

• $D_{1i}$ and $D_{2i}$ are dummy variables
• $\hat{\beta_1}$: effect on $Y$ of going from $D_{1}=0$ to $D_{1}=1$ when $D_{2}=0$
• $\hat{\beta_2}$: effect on $Y$ of going from $D_{2}=0$ to $D_{2}=1$ when $D_{1}=0$
• $\hat{\beta_3}$: effect on $Y$ of going from $D_{1}=0$ to $D_{1}=1$ when $D_{2}=1$ vs. $D_2=0$
• increment to the effect of $D_{1i}$ going from 0 to 1 when $D_{2i}=1$ (vs. 0)
• As always, it is best to think logically about the possibilities (when each dummy $=0$ or $=1$)

## 2 Dummy Interaction: Interpreting Coefficients

$Y_i=\beta_0+\beta_1D_{1i}+\beta_2 D_{2i}+\beta_3 \color{#e64173}{(D_{1i} \times D_{2i})}$

• To interpret coefficients, compare cases:
• Hold $D_{2}$ constant (set to some value $D_{2}=\mathbf{d_2}$)
• Let $D_1$ change from $\color{#FFA500}{0}$ to $\color{#44C1C4}{1}$:

\begin{align*} E(Y|D_{1}&=\color{#FFA500}{0}, D_{2}=\mathbf{d_2}) = \beta_0+\beta_2 \mathbf{d_2}\\ E(Y|D_{1}&=\color{#44C1C4}{1}, D_{2}=\mathbf{d_2}) = \beta_0+\beta_1(\color{#44C1C4}{1})+\beta_2 \mathbf{d_2}+\beta_3(\color{#44C1C4}{1})\mathbf{d_2}\\ \end{align*}

• Subtracting the two, the difference is:

$\color{#6A5ACD}{\beta_1+\beta_3 \mathbf{d_2}}$

• The marginal effect of $\color{#6A5ACD}{D_{1} \rightarrow Y}$ depends on the value of $\color{#6A5ACD}{D_{2}}$
• $\color{#e64173}{\hat{\beta_3}}$ is the increment to the effect of $D_1$ on $Y$ when $D_2$ goes from $0$ to $1$

## Interactions Between 2 Dummy Variables: Example

Example

Does the gender pay gap change if a person is married vs. single?

$\widehat{\text{wage}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{female}_i+\hat{\beta_2} \, \text{married}_i+\hat{\beta_3} \, (\text{female}_i \times \text{married}_i)$

• Logically, there are 4 possible combinations of $female_i = \{\color{#0047AB}{0},\color{#e64173}{1}\}$ and $married_i = \{\color{#FFA500}{0},\color{#44C1C4}{1}\}$

1. Unmarried men $(female_i=\color{#0047AB}{0}, \, married_i=\color{#FFA500}{0})$: $\widehat{wage_i}=\hat{\beta_0}$
2. Married men $(female_i=\color{#0047AB}{0}, \, married_i=\color{#44C1C4}{1})$: $\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_2}$
3. Unmarried women $(female_i=\color{#e64173}{1}, \, married_i=\color{#FFA500}{0})$: $\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1}$
4. Married women $(female_i=\color{#e64173}{1}, \, married_i=\color{#44C1C4}{1})$: $\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1}+\hat{\beta_2}+\hat{\beta_3}$

## Conditional Group Means in the Data

# get average wage for unmarried men
wages %>%
filter(female == 0, married == 0) %>%
summarize(mean = mean(wage))

# get average wage for married men
wages %>%
filter(female == 0, married == 1) %>%
summarize(mean = mean(wage))

# get average wage for unmarried women
wages %>%
filter(female == 1, married == 0) %>%
summarize(mean = mean(wage))

# get average wage for married women
wages %>%
filter(female == 1, married == 1) %>%
summarize(mean = mean(wage))

## Two Dummies Interaction: Group Means

$\widehat{\text{wage}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{female}_i+\hat{\beta_2} \, \text{married}_i+\hat{\beta_3} \, (\text{female}_i \times \text{married}_i)$

| | Men | Women |
|---|---|---|
| Unmarried | $5.17 | $4.61 |
| Married | $7.98 | $4.57 |

## Two Dummies Interaction: Regression in R I

reg_dummies <- lm(wage ~ female + married + female:married, data = wages)
reg_dummies %>% tidy()

## Two Dummies Interaction: Regression in R II

Code
modelsummary(models = list("Wage" = reg_dummies),
fmt = 2, # round to 2 decimals
output = "html",
coef_rename = c("(Intercept)" = "Constant"),
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
#list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
),
escape = FALSE,
stars = c('*' = .1, '**' = .05, '***' = 0.01)
)

| | Wage |
|---|---|
| Constant | 5.17*** |
| | (0.36) |
| female | −0.56 |
| | (0.47) |
| married | 2.82*** |
| | (0.44) |
| female:married | −2.86*** |
| | (0.61) |
| n | 526 |
| Adj. R2 | 0.18 |
| SER | 3.34 |

* p < 0.1, ** p < 0.05, *** p < 0.01

## Two Dummies Interaction: Interpreting Coefficients I

$\widehat{\text{wage}}_i=5.17-0.56 \, \text{female}_i+2.82 \, \text{married}_i-2.86 \, (\text{female}_i \times \text{married}_i)$

| | Men | Women |
|---|---|---|
| Unmarried | $5.17 | $4.61 |
| Married | $7.98 | $4.57 |

• Wage for unmarried men: $\hat{\beta_0}=5.17$
• Wage for married men: $\hat{\beta_0}+\hat{\beta_2}=5.17+2.82=7.98$
• Wage for unmarried women: $\hat{\beta_0}+\hat{\beta_1}=5.17-0.56=4.61$
• Wage for married women: $\hat{\beta_0}+\hat{\beta_1}+\hat{\beta_2}+\hat{\beta_3}=5.17-0.56+2.82-2.86=4.57$

## Two Dummies Interaction: Interpreting Coefficients II

$\widehat{\text{wage}}_i=5.17-0.56 \, \text{female}_i+2.82 \, \text{married}_i-2.86 \, (\text{female}_i \times \text{married}_i)$

| | Men | Women | Diff |
|---|---|---|---|
| Unmarried | $5.17 | $4.61 | $0.56 |
| Married | $7.98 | $4.57 | $3.41 |
| Diff | $2.81 | $0.04 | $2.85 |
• $\hat{\beta_0}$: Wage for unmarried men
• $\hat{\beta_1}$: Difference in wages between men and women who are unmarried
• $\hat{\beta_2}$: Difference in wages between married and unmarried men
• $\hat{\beta_3}$: Difference in:
• effect of Marriage on wages between men and women
• effect of Gender on wages between unmarried and married individuals
• “difference in differences”
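This "difference in differences" interpretation can be checked directly: with two dummies and their interaction, the regression is saturated, so the interaction coefficient exactly equals the diff-in-diff of the four group means. A base-R sketch with simulated data (the effect sizes below are made up, not the course estimates):

```r
set.seed(4)
# Simulated data: 50 observations per female x married cell
female  <- rep(c(0, 1), each = 100)
married <- rep(c(0, 1), times = 100)
wage <- 5 - 0.5 * female + 3 * married - 3 * female * married + rnorm(200)

reg <- lm(wage ~ female + married + female:married)

# Diff-in-diff of the 4 cell means
m <- function(f, mar) mean(wage[female == f & married == mar])
did <- (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))

beta3_hat <- coef(reg)["female:married"]
# beta3_hat and did are identical (up to floating-point error)
```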

# Interactions Between Two Continuous Variables

## Interactions Between Two Continuous Variables

• Does the marginal effect of $X_1$ on $Y$ depend on what $X_2$ is set to?

## Interactions Between Two Continuous Variables

$Y=\beta_0+\beta_1X_{1}+\beta_2 X_{2}+\beta_3 \color{#e64173}{(X_{1} \times X_{2})}$

• To interpret coefficients, compare changes after changing $\color{#e64173}{\Delta X_{1}}$ (holding $X_2$ constant):

$Y+\color{#e64173}{\Delta Y}=\beta_0+\beta_1(X_1+\color{#e64173}{\Delta X_{1}})+\beta_2X_{2}+\beta_3((X_{1}+\color{#e64173}{\Delta X_{1}}) \times X_{2})$

• Take the difference to get:

\begin{align*} \Delta Y &= \beta_1 \Delta X_{1}+ \beta_3 X_{2} \Delta X_{1}\\ \color{#6A5ACD}{\frac{\Delta Y_i}{\Delta X_{1}}} &= \color{#6A5ACD}{\beta_1+\beta_3 X_{2}}\\ \end{align*}

• The effect of $X_1 \rightarrow Y$ depends on the value of $X_2$
• $\color{#e64173}{\beta_3}$: increment to the effect of $X_1 \rightarrow Y$ for every 1 unit change in $X_2$
• Likewise: the effect of $X_2 \rightarrow Y$ depends on the value of $X_1$

$\frac{\Delta Y}{\Delta X_2} = \color{#6A5ACD}{\beta_2+\beta_3 X_{1}}$

• $\color{#e64173}{\beta_3}$: increment to the effect of $X_2 \rightarrow Y$ for every 1 unit change in $X_1$
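These marginal effects can be computed straight from the fitted coefficients. A base-R sketch with simulated education/experience data (the coefficient values used to generate the data are hypothetical):

```r
set.seed(5)
# Simulated data with a small positive interaction
educ  <- sample(8:18, 300, replace = TRUE)
exper <- sample(0:40, 300, replace = TRUE)
wage  <- -3 + 0.6 * educ + 0.05 * exper + 0.002 * educ * exper + rnorm(300)

reg <- lm(wage ~ educ + exper + educ:exper)
b <- coef(reg)

# Marginal effect of educ depends on exper: beta1_hat + beta3_hat * exper
me_educ <- function(x2) unname(b["educ"] + b["educ:exper"] * x2)
me_educ(c(5, 10, 15))  # one marginal effect per experience level
```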

## Continuous Variables Interaction: Example

Example

Do education and experience interact in their determination of wages?

$\widehat{\text{wage}}_i=\hat{\beta_0}+\hat{\beta_1} \, \text{education}_i+\hat{\beta_2} \, \text{experience}_i+\hat{\beta_3} \, (\text{education}_i \times \text{experience}_i)$

• Estimated effect of education on wages depends on the amount of experience (and vice versa)!

$\frac{\Delta \text{wage}}{\Delta \text{education}}=\hat{\beta_1}+\hat{\beta_3} \, \text{experience}_i$

$\frac{\Delta \text{wage}}{\Delta \text{experience}}=\hat{\beta_2}+\hat{\beta_3} \, \text{education}_i$

• This is a type of nonlinearity (we will examine nonlinearities next lesson)

## Continuous Variables Interaction: In R I

reg_cont <- lm(wage ~ educ + exper + educ:exper, data = wages)
reg_cont %>% tidy()

## Continuous Variables Interaction: In R II

Code
modelsummary(models = list("Wage" = reg_cont),
fmt = 2, # round to 2 decimals
output = "html",
coef_rename = c("(Intercept)" = "Constant"),
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
#list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
),
escape = FALSE,
stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
| | Wage |
|---|---|
| Constant | −2.86** |
| | (1.18) |
| educ | 0.60*** |
| | (0.09) |
| exper | 0.05 |
| | (0.04) |
| educ:exper | 0.00 |
| | (0.00) |
| n | 526 |
| SER | 3.25 |

* p < 0.1, ** p < 0.05, *** p < 0.01

## Continuous Variables Interaction: Marginal Effects

$\widehat{\text{wage}}_i=-2.860+0.602 \, \text{education}_i+0.047 \, \text{experience}_i+0.002\, (\text{education}_i \times \text{experience}_i)$

Marginal Effect of Education on Wages by Years of Experience:

| Experience | $\displaystyle\frac{\Delta \text{wage}}{\Delta \text{education}}=\hat{\beta_1}+\hat{\beta_3} \, \text{experience}$ |
|---|---|
| 5 years | $0.602+0.002(5)=0.612$ |
| 10 years | $0.602+0.002(10)=0.622$ |
| 15 years | $0.602+0.002(15)=0.632$ |
• Marginal effect of education $\rightarrow$ wages increases with more experience

## Continuous Variables Interaction: Marginal Effects

$\widehat{\text{wage}}_i=-2.860+0.602 \, \text{education}_i+0.047 \, \text{experience}_i+0.002\, (\text{education}_i \times \text{experience}_i)$

Marginal Effect of Experience on Wages by Years of Education:

| Education | $\displaystyle\frac{\Delta \text{wage}}{\Delta \text{experience}}=\hat{\beta_2}+\hat{\beta_3} \, \text{education}$ |
|---|---|
| 5 years | $0.047+0.002(5)=0.057$ |
| 10 years | $0.047+0.002(10)=0.067$ |
| 15 years | $0.047+0.002(15)=0.077$ |
• Marginal effect of experience $\rightarrow$ wages increases with more education

• If you want to estimate the marginal effects more precisely, and graph them, see the appendix to today’s class