4.3 — Categorical Data

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner
Associate Professor of Economics

safner@hood.edu
ryansafner/metricsF22
metricsF22.classes.ryansafner.com

Contents

Working with factor Variables in R

Regression with Dummy Variables

Recoding Dummy Variables

Categorical Variables (More than 2 Categories)

Interaction Effects

Interactions Between a Dummy and Continuous Variable

Interactions Between Two Dummy Variables

Interactions Between Two Continuous Variables

Categorical Variables

  • Categorical variables place an individual into one of several possible categories
    • e.g. sex, season, political party
    • may be responses to survey questions
    • can be quantitative (e.g. age, zip code)
  • In R: character or factor type data
    • factor ⟹ specific possible categories

Working with factor Variables in R

Factors in R I

  • factor is a special type of character object class that indicates membership in a category (called a level)

  • Suppose I have data on students:

id      rank        grade
<dbl>   <chr>       <dbl>
1       Freshman    76
2       Junior      82
3       Sophomore   73
4       Sophomore   95
5       Senior      74
  • See that rank is a character (<chr>) variable, just a string of text

Factors in R II

  • We can make rank a factor variable, to indicate a student is a member of one of the possible categories: (freshman, sophomore, junior, senior)
students <- students %>%
  mutate(rank = as.factor(rank)) # overwrite and change class of rank to factor

students %>% head(n = 5)
id      rank        grade
<dbl>   <fct>       <dbl>
1       Freshman    76
2       Junior      82
3       Sophomore   73
4       Sophomore   95
5       Senior      74

Factors in R III

# what are the categories?
students %>%
  group_by(rank) %>%
  count()
rank        n
<fct>       <int>
Freshman    4
Junior      1
Senior      3
Sophomore   2
# note the levels are sorted alphabetically, not by class year! This is an "unordered" factor

Ordered Factors in R I

  • If there is a rank order you wish to preserve, you can make an ordered (factor) variable
    • list the levels from 1st to last
students <- students %>%
  mutate(rank = ordered(rank, # overwrite and change class of rank to ordered
                        # next, specify the levels, in order
                        levels = c("Freshman", "Sophomore", "Junior", "Senior")
                        ))
students %>% head(n = 5)
id      rank        grade
<dbl>   <ord>       <dbl>
1       Freshman    76
2       Junior      82
3       Sophomore   73
4       Sophomore   95
5       Senior      74

Ordered Factors in R II

students %>%
  group_by(rank) %>%
  count()
rank        n
<ord>       <int>
Freshman    4
Sophomore   2
Junior      1
Senior      3
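
The difference between the two factor types can be sketched in base R. The vector below is a hypothetical five-student sample typed in by hand (not the full students data), so only the behavior of the levels carries over, not the counts.

```r
# Hypothetical five-student sample, typed in for illustration
rank_chr <- c("Freshman", "Junior", "Sophomore", "Sophomore", "Senior")

# Unordered factor: levels default to alphabetical order
rank_fct <- factor(rank_chr)
levels(rank_fct)

# Ordered factor: levels follow the order we list them in
rank_ord <- factor(rank_chr,
                   levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                   ordered = TRUE)
levels(rank_ord)

# Comparisons such as "below Junior" only work for the ordered version
below_junior <- rank_ord < "Junior"
below_junior
```

Only the ordered version knows that a Sophomore ranks below a Junior; the unordered factor treats the levels as labels with no ranking.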

Example Research Question with Categorical Data

Example

How much higher wages, on average, do men earn compared to women?

A Difference in Group Means

  • Basic statistics: can test for a statistically significant difference in group means with a t-test¹. Let:

  • \( \bar{Y}_M \): average earnings of a sample of \( n_M \) men

  • \( \bar{Y}_W \): average earnings of a sample of \( n_W \) women

  • Difference in group averages: \( d = \bar{Y}_M - \bar{Y}_W \)

  • The hypothesis test is:

    • \( H_0: d = 0 \)
    • \( H_1: d \neq 0 \)

  1. See today’s appendix for this example

Plotting factors in R

  • Plotting wage vs. a factor variable, e.g. gender (which is either Male or Female) looks like this

ggplot(data = wages)+
  aes(x = gender,
      y = wage)+
  geom_point(aes(color = gender))+
  geom_smooth(method = "lm", color = "black")+
  scale_y_continuous(labels = scales::dollar)+
  scale_color_manual(values = c("Female" = "#e64173", "Male" = "#0047AB"))+
  labs(x = "Gender",
       y = "Wage")+
  guides(color = "none")+ # hide legend
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)
  • Effectively R treats values of a factor variable as integers (e.g. "Female" = 0, "Male" = 1)
  • Let’s make this more explicit by making a dummy variable to stand in for gender

Regression with Dummy Variables

Comparing Groups with Regression

  • In a regression, we can easily compare across groups via a dummy variable¹

  • Dummy variable only =0 or =1, if a condition is TRUE vs. FALSE

  • Signifies whether an observation belongs to a category or not

Example

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Female}_i \quad \text{where } \text{Female}_i = \begin{cases} 1 & \text{if individual } i \text{ is Female} \\ 0 & \text{if individual } i \text{ is Male} \end{cases}$$

  • Again, \( \hat{\beta}_1 \) makes less sense as the “slope” of a line in this context
  1. Also called a binary variable or dichotomous variable since it only takes on 2 values.

Comparing Groups in Regression: Scatterplot


ggplot(data = wages)+
  aes(x = as.factor(female),
      y = wage)+
  geom_point(aes(color = gender))+
  geom_smooth(method = "lm", color = "black")+
  scale_y_continuous(labels = scales::dollar)+
  scale_color_manual(values = c("Female" = "#e64173", "Male" = "#0047AB"))+
  labs(x = "Female",
       y = "Wage")+
  guides(color = "none")+ # hide legend
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)
  • Hard to see relationships because of overplotting . . .

Comparing Groups in Regression: Scatterplot


ggplot(data = wages)+
  aes(x = as.factor(female),
      y = wage)+
  geom_jitter(aes(color = gender),
              # seed is an argument of position_jitter(), not geom_jitter()
              position = position_jitter(width = 0.05, seed = 2))+
  geom_smooth(method = "lm", color = "black")+
  scale_y_continuous(labels = scales::dollar)+
  scale_color_manual(values = c("Female" = "#e64173", "Male" = "#0047AB"))+
  labs(x = "Female",
       y = "Wage")+
  guides(color = "none")+ # hide legend
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)
  • Tip: use geom_jitter() instead of geom_point() to randomly nudge points!
    • Only used for plotting, does not affect actual data, regression, etc.

Dummy Variables as Group Means

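$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i \quad \text{where } D_i \in \{0, 1\}$$

  • When \( D_i = 0 \) (“Control group”):
    • \( \hat{Y}_i = \hat{\beta}_0 \)
    • \( E[Y_i | D_i = 0] = \hat{\beta}_0 \) ⟺ the mean of \( Y \) when \( D_i = 0 \)
  • When \( D_i = 1 \) (“Treatment group”):
    • \( \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \)
    • \( E[Y_i | D_i = 1] = \hat{\beta}_0 + \hat{\beta}_1 \) ⟺ the mean of \( Y \) when \( D_i = 1 \)
  • So the difference in group means:

$$E[Y_i | D_i = 1] - E[Y_i | D_i = 0] = (\hat{\beta}_0 + \hat{\beta}_1) - \hat{\beta}_0 = \hat{\beta}_1$$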
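
A minimal sketch of this result in base R, using a small hypothetical data set (the numbers are made up and are not the wage data): the intercept reproduces the mean of the D = 0 group, the coefficient on the dummy reproduces the difference in group means, and the p-value on the dummy matches a pooled-variance t-test of that difference.

```r
# Hypothetical toy data (not the wage data used in these slides)
toy <- data.frame(
  y = c(10, 12, 14, 7, 8, 9, 10),
  d = c(0, 0, 0, 1, 1, 1, 1)
)

# Group means computed directly
mean_d0 <- mean(toy$y[toy$d == 0])
mean_d1 <- mean(toy$y[toy$d == 1])

# Regression of y on the dummy
fit <- lm(y ~ d, data = toy)
b <- unname(coef(fit))  # b[1] = intercept, b[2] = coefficient on d

c(intercept = b[1], mean_d0 = mean_d0)              # same number twice
c(slope = b[2], diff_in_means = mean_d1 - mean_d0)  # same number twice

# Pooled-variance t-test of the difference in group means
tt <- t.test(y ~ d, data = toy, var.equal = TRUE)
p_ttest <- tt$p.value
p_reg   <- summary(fit)$coefficients["d", "Pr(>|t|)"]
c(t_test = p_ttest, regression = p_reg)
```

The match with the t-test relies on var.equal = TRUE, which imposes the same equal-variance assumption that the default OLS standard errors use.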

Dummy Variables as Group Means: Our Example

Example

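$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Female}_i$$

  • Mean wage for men:

$$E[\text{Wage} | \text{Female} = 0] = \hat{\beta}_0$$

  • Mean wage for women:

$$E[\text{Wage} | \text{Female} = 1] = \hat{\beta}_0 + \hat{\beta}_1$$

  • Difference in wage between men & women:

$$\hat{\beta}_1$$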

Comparing Groups in Regression: Scatterplot

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Female}_i$$

The Data

      wage    gender   educ    exper
      <dbl>   <fct>    <int>   <int>
1     3.10    Female   11      2
2     3.24    Female   12      22
3     3.00    Male     11      2
4     6.00    Male     8       44
5     5.30    Male     12      7
6     8.75    Male     16      9
7     11.25   Male     18      15
8     5.00    Female   12      5
9     3.60    Female   12      26
10    18.18   Male     17      22
Rows 1-10 of 526 | columns 1-5 of 25

Conditional Group Means

# Summarize for Men

wages %>%
  filter(gender=="Male") %>%
  summarize(mean = mean(wage),
            sd = sd(wage))
mean       sd
<dbl>      <dbl>
7.099489   4.160858
# Summarize for Women

wages %>%
  filter(gender=="Female") %>%
  summarize(mean = mean(wage),
            sd = sd(wage))
mean       sd
<dbl>      <dbl>
4.587659   2.529363

Visualize Differences

The Regression (factor variables)

reg <- lm(wage ~ gender, data = wages)
summary(reg)

Call:
lm(formula = wage ~ gender, data = wages)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5995 -1.8495 -0.9877  1.4260 17.8805 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.5877     0.2190  20.950  < 2e-16 ***
genderMale    2.5118     0.3034   8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114 
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(reg)
term          estimate   std.error   statistic   p.value
<chr>         <dbl>      <dbl>       <dbl>       <dbl>
(Intercept)   4.587659   0.2189834   20.949802   3.012371e-71
genderMale    2.511830   0.3034092   8.278688    1.041764e-15
  • Putting the factor variable gender in, R automatically creates a dummy for each level except the first one (alphabetical by default), which becomes the reference category; here Female is the reference
    • genderMale =1 for Male, =0 for Female
  • According to the data, men earn, on average, $2.51 more than women
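
As a sketch of how the reference level is chosen and changed, the base R example below uses a hypothetical six-person data set (made-up wages, not the wage data). relevel() is the base R function for picking a different reference level; the variable name gender_male_ref is an illustrative choice.

```r
# Hypothetical toy data; wages are made up for illustration
toy <- data.frame(
  wage   = c(4, 5, 6, 8, 9, 10),
  gender = factor(c("Female", "Female", "Female", "Male", "Male", "Male"))
)

levels(toy$gender)  # first level ("Female") is the reference category

fit_default <- lm(wage ~ gender, data = toy)
coef(fit_default)   # intercept = mean for women, genderMale = men minus women

# Make "Male" the reference category instead
toy$gender_male_ref <- relevel(toy$gender, ref = "Male")
fit_releveled <- lm(wage ~ gender_male_ref, data = toy)
coef(fit_releveled) # intercept = mean for men, coefficient flips sign
```

Changing the reference level changes which group the intercept describes and the sign of the dummy coefficient, but not the size of the estimated gap.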

The Regression: Dummy Variables

  • Let’s explicitly make gender into a dummy variable for female:
# add a female dummy variable 
wages <- wages %>%
  mutate(female = ifelse(test = gender == "Female",
                         yes = 1,
                         no = 0))
wages
      wage    female   educ    exper
      <dbl>   <dbl>    <int>   <int>
1     3.10    1        11      2
2     3.24    1        12      22
3     3.00    0        11      2
4     6.00    0        8       44
5     5.30    0        12      7
6     8.75    0        16      9
7     11.25   0        18      15
8     5.00    1        12      5
9     3.60    1        12      26
10    18.18   0        17      22
Rows 1-10 of 526 | columns 1-5 of 26

The Regression (Dummy variables)

female_reg <- lm(wage ~ female, data = wages)
summary(female_reg)

Call:
lm(formula = wage ~ female, data = wages)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5995 -1.8495 -0.9877  1.4260 17.8805 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
female       -2.5118     0.3034  -8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114 
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(female_reg)
term          estimate    std.error   statistic   p.value
<chr>         <dbl>       <dbl>       <dbl>       <dbl>
(Intercept)   7.099489    0.2100082   33.805777   8.971839e-134
female        -2.511830   0.3034092   -8.278688   1.041764e-15

Dummy Regression vs. Group Means

From tabulation of group means

Gender       Avg. Wage   Std. Dev.   n
Female       4.59        2.53        252
Male         7.10        4.16        274
Difference   2.51        0.30        −

From t-test of difference in group means

term          estimate    std.error   statistic   p.value
<chr>         <dbl>       <dbl>       <dbl>       <dbl>
(Intercept)   7.099489    0.2100082   33.805777   8.971839e-134
female        -2.511830   0.3034092   -8.278688   1.041764e-15

$$\widehat{\text{Wages}}_i = 7.10 - 2.51 \, \text{Female}_i$$

Recoding Dummy Variables

Recoding Dummy Variables

Example

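Suppose instead of female we had used:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Male}_i \quad \text{where } \text{Male}_i = \begin{cases} 1 & \text{if person } i \text{ is Male} \\ 0 & \text{if person } i \text{ is Female} \end{cases}$$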

Recoding Dummies in the Data

wages <- wages %>%
  mutate(male = ifelse(female == 0, # condition: is female equal to 0?
                       yes = 1, # if true: code as "1"
                       no = 0)) # if false: code as "0"

# verify it worked
wages %>% 
  select(wage, female, male) %>%
  head(n = 5)
     wage    female   male
     <dbl>   <dbl>    <dbl>
1    3.10    1        0
2    3.24    1        0
3    3.00    0        1
4    6.00    0        1
5    5.30    0        1

Scatterplot with Male

Dummy Variables as Group Means: With Male

Example

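$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Male}_i$$

  • Mean wage for men:

$$E[\text{Wage} | \text{Male} = 1] = \hat{\beta}_0 + \hat{\beta}_1$$

  • Mean wage for women:

$$E[\text{Wage} | \text{Male} = 0] = \hat{\beta}_0$$

  • Difference in wage between men & women:

$$\hat{\beta}_1$$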

Scatterplot & Regression Line with Male

The Regression with Male

male_reg <- lm(wage ~ male, data = wages)
summary(male_reg)

Call:
lm(formula = wage ~ male, data = wages)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5995 -1.8495 -0.9877  1.4260 17.8805 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.5877     0.2190  20.950  < 2e-16 ***
male          2.5118     0.3034   8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114 
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(male_reg)
term          estimate   std.error   statistic   p.value
<chr>         <dbl>      <dbl>       <dbl>       <dbl>
(Intercept)   4.587659   0.2189834   20.949802   3.012371e-71
male          2.511830   0.3034092   8.278688    1.041764e-15

The Dummy Regression: Male or Female

           Wage       Wage
Constant   7.10***    4.59***
           (0.21)     (0.22)
female     −2.51***
           (0.30)
male                  2.51***
                      (0.30)
n          526        526
Adj. R2    0.11       0.11
SER        3.47       3.47
* p < 0.1, ** p < 0.05, *** p < 0.01
  • Note it doesn’t matter if we use male or female, difference is always $2.51

  • Compare the constant (average for the D=0 group)

  • Should you use male AND female in a regression? We’ll come to that…
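
The link between the two columns can be checked algebraically. In this sample every person is coded as either male or female, so \( \text{Male}_i = 1 - \text{Female}_i \). Substituting this into the estimated female regression (using the rounded estimates from the table) recovers the male regression:

```latex
\begin{aligned}
\widehat{\text{Wage}}_i &= 7.10 - 2.51\,\text{Female}_i \\
                        &= 7.10 - 2.51\,(1 - \text{Male}_i) \\
                        &= (7.10 - 2.51) + 2.51\,\text{Male}_i \\
                        &= 4.59 + 2.51\,\text{Male}_i
\end{aligned}
```

The constant moves from the men's average to the women's average, and the coefficient on the dummy only changes sign.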

Categorical Variables (More than 2 Categories)

Categorical Variables with More than 2 Categories

  • A categorical variable expresses membership in a category, where there is no ranking or hierarchy of the categories
    • We’ve looked at categorical variables with 2 categories only
    • e.g. Male/Female, Spring/Summer/Fall/Winter, Democratic/Republican/Independent
  • A variable might instead be ordinal: it expresses a rank or an ordering of the data, but not necessarily their relative magnitude
    • e.g. Order of finalists in a competition (1st, 2nd, 3rd)
    • e.g. Highest education attained (1=elementary school, 2=high school, 3=bachelor’s degree, 4=graduate degree)
    • in R, an ordered factor

Using Categorical Variables in Regression I

Example

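How do wages vary by region of the country? Let \( \text{Region}_i \in \{\text{Northeast}, \text{Midwest}, \text{South}, \text{West}\} \)

  • Can we run the following regression?

$$\widehat{\text{Wages}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Region}_i$$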

Using Categorical Variables in Regression II

Example

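How do wages vary by region of the country? Let \( \text{Region}_i \in \{\text{Northeast}, \text{Midwest}, \text{South}, \text{West}\} \)

  • Code region numerically:

$$\text{Region}_i = \begin{cases} 1 & \text{if } i \text{ is in Northeast} \\ 2 & \text{if } i \text{ is in Midwest} \\ 3 & \text{if } i \text{ is in South} \\ 4 & \text{if } i \text{ is in West} \end{cases}$$

  • Can we run the following regression?

$$\widehat{\text{Wages}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Region}_i$$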

Using Categorical Variables in Regression III

Example

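How do wages vary by region of the country? Let \( \text{Region}_i \in \{\text{Northeast}, \text{Midwest}, \text{South}, \text{West}\} \)

  • Create a dummy variable for each region:
    • \( \text{Northeast}_i = 1 \) if \( i \) is in Northeast, otherwise \( = 0 \)
    • \( \text{Midwest}_i = 1 \) if \( i \) is in Midwest, otherwise \( = 0 \)
    • \( \text{South}_i = 1 \) if \( i \) is in South, otherwise \( = 0 \)
    • \( \text{West}_i = 1 \) if \( i \) is in West, otherwise \( = 0 \)
  • Can we run the following regression?

$$\widehat{\text{Wages}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Northeast}_i + \hat{\beta}_2 \text{Midwest}_i + \hat{\beta}_3 \text{South}_i + \hat{\beta}_4 \text{West}_i$$

  • For every \( i \): \( \text{Northeast}_i + \text{Midwest}_i + \text{South}_i + \text{West}_i = 1 \)!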

The Dummy Variable Trap

Example

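$$\widehat{\text{Wages}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Northeast}_i + \hat{\beta}_2 \text{Midwest}_i + \hat{\beta}_3 \text{South}_i + \hat{\beta}_4 \text{West}_i$$

  • If we include all possible categories, they are perfectly multicollinear, an exact linear function of one another:

$$\text{Northeast}_i + \text{Midwest}_i + \text{South}_i + \text{West}_i = 1 \quad \forall i$$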

  • This is known as the dummy variable trap, a common source of perfect multicollinearity
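
The trap can be seen directly in the design matrix. The sketch below is base R on a hypothetical eight-observation region variable (two observations per region, made up for illustration); it checks the rank of the matrix with and without a full set of dummies.

```r
# Hypothetical toy data: two observations per region
region <- factor(rep(c("Northeast", "Midwest", "South", "West"), each = 2))

# One 0/1 dummy column per region (no intercept yet)
dummies <- model.matrix(~ 0 + region)

# Add the constant: the "include every category" design matrix
X_trap <- cbind(intercept = 1, dummies)

rowSums(dummies)   # every row sums to 1, which duplicates the intercept column
ncol(X_trap)       # 5 columns ...
qr(X_trap)$rank    # ... but rank 4: the columns are perfectly multicollinear

# Dropping one category (what lm() does with a factor) restores full rank
X_ok <- model.matrix(~ region)
qr(X_ok)$rank == ncol(X_ok)
```

Because the rank is one short of the number of columns, OLS cannot separate the effect of the constant from the full set of dummies.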

The Reference Category

  • To avoid the dummy variable trap, always omit one category from the regression, known as the “reference category”

  • It does not matter which category we omit!

  • Coefficients on each dummy variable measure the difference between the reference category and each category dummy

The Reference Category: Example

Example

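$$\widehat{\text{Wages}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Northeast}_i + \hat{\beta}_2 \text{Midwest}_i + \hat{\beta}_3 \text{South}_i$$

  • \( \text{West}_i \) is omitted (arbitrarily chosen)
  • \( \hat{\beta}_0 \): average wage for \( i \) in the West (omitted reference category)
  • \( \hat{\beta}_1 \): difference in average wage between Northeast and West
  • \( \hat{\beta}_2 \): difference in average wage between Midwest and West
  • \( \hat{\beta}_3 \): difference in average wage between South and West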

Regression in R with Categorical Variable

lm(wage ~ region, data = wages) %>% summary()

Call:
lm(formula = wage ~ region, data = wages)

Residuals:
   Min     1Q Median     3Q    Max 
-6.083 -2.387 -1.097  1.157 18.610 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.7105     0.3195  17.871   <2e-16 ***
regionNortheast   0.6593     0.4651   1.418   0.1569    
regionSouth      -0.3236     0.4173  -0.775   0.4385    
regionWest        0.9029     0.5035   1.793   0.0735 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.671 on 522 degrees of freedom
Multiple R-squared:  0.0175,    Adjusted R-squared:  0.01185 
F-statistic: 3.099 on 3 and 522 DF,  p-value: 0.02646

Regression in R with Dummies (& Dummy Variable Trap)

lm(wage ~ northeast + midwest + south + west, data = wages) %>% summary()

Call:
lm(formula = wage ~ northeast + midwest + south + west, data = wages)

Residuals:
   Min     1Q Median     3Q    Max 
-6.083 -2.387 -1.097  1.157 18.610 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.6134     0.3891  16.995  < 2e-16 ***
northeast    -0.2436     0.5154  -0.473  0.63664    
midwest      -0.9029     0.5035  -1.793  0.07352 .  
south        -1.2265     0.4728  -2.594  0.00974 ** 
west              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.671 on 522 degrees of freedom
Multiple R-squared:  0.0175,    Adjusted R-squared:  0.01185 
F-statistic: 3.099 on 3 and 522 DF,  p-value: 0.02646
  • R automatically drops one category to avoid perfect multicollinearity

Using Different Reference Categories in R

            (1) No Northeast   (2) No Midwest   (3) No South   (4) No West
Constant    6.37***            5.71***          5.39***        6.61***
            (0.34)             (0.32)           (0.27)         (0.39)
midwest     −0.66                               0.32           −0.90*
            (0.47)                              (0.42)         (0.50)
south       −0.98**            −0.32                           −1.23***
            (0.43)             (0.42)                          (0.47)
west        0.24               0.90*            1.23***
            (0.52)             (0.50)           (0.47)
northeast                      0.66             0.98**         −0.24
                               (0.47)           (0.43)         (0.52)
n           526                526              526            526
R2          0.02               0.02             0.02           0.02
Adj. R2     0.01               0.01             0.01           0.01
SER         3.66               3.66             3.66           3.66
* p < 0.1, ** p < 0.05, *** p < 0.01
  • Constant is always the average wage for the reference (omitted) region

  • Compare coefficients between Midwest in (1) and Northeast in (2)…

  • Compare coefficients between West in (3) and South in (4)…

  • Does not matter which region we omit!

    • Same R2, SER, coefficients give same results
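
A minimal base R sketch of why the choice of reference category does not matter, on hypothetical data (three made-up wages per region, not the wage data). The column name region_west is an illustrative choice.

```r
# Hypothetical toy data: three made-up wages per region
toy <- data.frame(
  wage   = c(6, 7, 8, 5, 6, 4, 5, 6, 7, 9, 8, 7),
  region = factor(rep(c("Northeast", "Midwest", "South", "West"), each = 3))
)

# Default reference is the first level alphabetically: Midwest
fit_midwest <- lm(wage ~ region, data = toy)

# Same model with West as the reference category
toy$region_west <- relevel(toy$region, ref = "West")
fit_west <- lm(wage ~ region_west, data = toy)

coef(fit_midwest)  # intercept = Midwest mean; other terms = gaps vs. Midwest
coef(fit_west)     # intercept = West mean; other terms = gaps vs. West

# Predictions and fit are identical either way
same_fitted <- all.equal(unname(fitted(fit_midwest)), unname(fitted(fit_west)))
same_r2     <- all.equal(summary(fit_midwest)$r.squared,
                         summary(fit_west)$r.squared)
c(same_fitted, same_r2)
```

Only the labeling of the comparisons changes: the West coefficient in the first model and the Midwest coefficient in the second are the same gap with opposite signs.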

Dummy Dependent (Y) Variables

  • In many contexts, we will want to have our dependent (Y) variable be a dummy variable

Example

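$$\widehat{\text{Admitted}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{GPA}_i \quad \text{where } \text{Admitted}_i = \begin{cases} 1 & \text{if } i \text{ is Admitted} \\ 0 & \text{if } i \text{ is Not Admitted} \end{cases}$$

  • A model where \( Y \) is a dummy is called a linear probability model, as it measures the probability of \( Y \) occurring given the \( X \)’s, i.e. \( P(Y_i = 1 | X_1, \cdots, X_k) \)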
    • e.g. the probability person i is Admitted to a program with a given GPA
  • Special models to properly interpret and extend this (logistic “logit”, probit, etc)

  • Feel free to write papers with dummy Y variables!
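
A rough sketch of a linear probability model in base R, on ten hypothetical applicants (the GPA and admission values are invented for illustration). It is plain OLS with a 0/1 outcome, so the slope reads as a change in the estimated probability of admission.

```r
# Hypothetical applicants: GPA and whether each was admitted (1) or not (0)
toy <- data.frame(
  gpa      = c(2.0, 2.4, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 2.6),
  admitted = c(0,   0,   0,   1,   0,   1,   1,   1,   1,   0)
)

# Linear probability model: OLS with a dummy dependent variable
lpm <- lm(admitted ~ gpa, data = toy)

# Slope: estimated change in P(admitted = 1) for a 1-point higher GPA
slope_gpa <- unname(coef(lpm)["gpa"])
slope_gpa

# Estimated probability of admission for a hypothetical applicant with a 3.5 GPA
predict(lpm, newdata = data.frame(gpa = 3.5))

# Nothing constrains the fitted "probabilities" to lie between 0 and 1
range(fitted(lpm))
```

The last line is one reason the logit and probit models mentioned above exist: a straight line can predict values below 0 or above 1.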

Interaction Effects

Sliders and Switches

  • Marginal effect of dummy variable: effect on Y of going from 0 to 1
  • Marginal effect of continuous variable: effect on Y of a 1 unit change in X

Interaction Effects

  • Sometimes one X variable might interact with another in determining Y

Example

Consider the gender pay gap again.

  • Gender affects wages
  • Experience affects wages
  • Does experience affect wages differently by gender?
    • i.e. is there an interaction effect between gender and experience?
  • Note this is NOT the same as just asking: “do men earn more than women with the same amount of experience?”

$$\widehat{\text{wages}}_i = \beta_0 + \beta_1 \text{Gender}_i + \beta_2 \text{Experience}_i$$

Three Types of Interactions

  • Depending on the types of variables, there are 3 possible types of interaction effects

  • We will look at each in turn

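  1. Interaction between a dummy and a continuous variable:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i)$$

  2. Interaction between two dummy variables:

$$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i})$$

  3. Interaction between two continuous variables:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i})$$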

Interactions Between a Dummy and Continuous Variable

Interactions: A Dummy & Continuous Variable

  • Does the marginal effect of the continuous variable on Y change depending on whether the dummy is “on” or “off”?

Interactions: A Dummy & Continuous Variable I

  • We can model an interaction by introducing an interaction term, a variable that captures the interaction between two variables:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) \quad \text{where } D_i \in \{0, 1\}$$

  • \( \beta_3 \) estimates the interaction effect between \( X_i \) and \( D_i \) on \( Y_i \)
  • What do the different coefficients (β)’s tell us?
    • Again, think logically by examining each group (Di=0 or Di=1)

Dummy-Continuous Interaction Effects as Two Regressions I

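$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i)$$

  • When \( D_i = 0 \) (“Control group”):

$$\begin{aligned} \hat{Y}_i &= \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\beta}_2 (0) + \hat{\beta}_3 X_i \times (0) \\ \hat{Y}_i &= \hat{\beta}_0 + \hat{\beta}_1 X_i \end{aligned}$$

  • When \( D_i = 1 \) (“Treatment group”):

$$\begin{aligned} \hat{Y}_i &= \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\beta}_2 (1) + \hat{\beta}_3 X_i \times (1) \\ \hat{Y}_i &= (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3) X_i \end{aligned}$$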

  • So what we really have is two regression lines!

Dummy-Continuous Interaction Effects as Two Regressions II

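  • \( D_i = 0 \) group:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

  • \( D_i = 1 \) group:

$$\hat{Y}_i = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3) X_i$$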

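Interpreting Coefficients I

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i)$$

  • To interpret the coefficients, compare cases after changing \( X \) by \( \Delta X \):

$$Y_i + \Delta Y_i = \beta_0 + \beta_1 (X_i + \Delta X_i) + \beta_2 D_i + \beta_3 ((X_i + \Delta X_i) D_i)$$

  • Subtracting these two equations, the difference is:

$$\begin{aligned} \Delta Y_i &= \beta_1 \Delta X_i + \beta_3 D_i \Delta X_i \\ \frac{\Delta Y_i}{\Delta X_i} &= \beta_1 + \beta_3 D_i \end{aligned}$$

  • The effect of \( X \rightarrow Y \) depends on the value of \( D_i \)!

  • \( \beta_3 \): increment to the effect of \( X \rightarrow Y \) when \( D_i = 1 \) (vs. \( D_i = 0 \))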

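Interpreting Coefficients II

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i)$$

  • \( \hat{\beta}_0 \): \( E[Y_i] \) for \( X_i = 0 \) and \( D_i = 0 \)
  • \( \beta_1 \): Marginal effect of \( X_i \rightarrow Y_i \) for \( D_i = 0 \)
  • \( \beta_2 \): Marginal effect on \( Y_i \) of difference between \( D_i = 0 \) and \( D_i = 1 \) when \( X_i = 0 \) (“intercepts”)
  • \( \beta_3 \): The difference of the marginal effect of \( X_i \rightarrow Y_i \) between \( D_i = 0 \) and \( D_i = 1 \) (“slopes”)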
  • This is a bit awkward, easier to think about the two regression lines:

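Interpreting Coefficients III

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i)$$

  • For \( D_i = 0 \) Group: \( \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \)
    • Intercept: \( \hat{\beta}_0 \)
    • Slope: \( \hat{\beta}_1 \)
  • For \( D_i = 1 \) Group: \( \hat{Y}_i = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3) X_i \)
    • Intercept: \( \hat{\beta}_0 + \hat{\beta}_2 \)
    • Slope: \( \hat{\beta}_1 + \hat{\beta}_3 \)
  • \( \hat{\beta}_2 \): difference in intercept between groups

  • \( \hat{\beta}_3 \): difference in slope between groups

  • How can we determine if the two lines have the same slope and/or intercept?
    • Same intercept? t-test \( H_0: \beta_2 = 0 \)
    • Same slope? t-test \( H_0: \beta_3 = 0 \)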
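
The "two regression lines" reading can be verified numerically. The sketch below simulates hypothetical data in base R (the true coefficients 2, 0.5, 1 and −0.3 are arbitrary choices for illustration) and compares the interacted regression with separate regressions on each group.

```r
set.seed(1)  # hypothetical simulated data, for illustration only
n <- 200
toy <- data.frame(
  x = runif(n, 0, 10),
  d = rep(c(0, 1), each = n / 2)
)
toy$y <- 2 + 0.5 * toy$x + 1 * toy$d - 0.3 * toy$x * toy$d + rnorm(n)

# One regression with the interaction term
pooled <- lm(y ~ x * d, data = toy)
b <- unname(coef(pooled))  # order: intercept, x, d, x:d

# Two separate regressions, one per group
line_d0 <- unname(coef(lm(y ~ x, data = toy[toy$d == 0, ])))
line_d1 <- unname(coef(lm(y ~ x, data = toy[toy$d == 1, ])))

# D = 0 line: intercept b0, slope b1
all.equal(line_d0, c(b[1], b[2]))

# D = 1 line: intercept b0 + b2, slope b1 + b3
all.equal(line_d1, c(b[1] + b[3], b[2] + b[4]))
```

The point estimates agree exactly; what the single interacted regression adds is a direct t-test on the differences in intercept and slope.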

Interactions in Our Example

Example

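$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{experience}_i + \hat{\beta}_2 \text{female}_i + \hat{\beta}_3 (\text{experience}_i \times \text{female}_i)$$

  • For men (female = 0):

$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{experience}_i$$

  • For women (female = 1):

$$\widehat{\text{wage}}_i = \underbrace{(\hat{\beta}_0 + \hat{\beta}_2)}_{\text{intercept}} + \underbrace{(\hat{\beta}_1 + \hat{\beta}_3)}_{\text{slope}} \text{experience}_i$$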

Interactions in Our Example: Scatterplot

interaction_plot <- ggplot(data = wages)+
  aes(x = exper,
      y = wage,
      color = as.factor(gender))+ # make factor
  geom_point(alpha = 0.5)+
  scale_y_continuous(limits = c(0,26),
                     expand = c(0,0),
                     labels=scales::dollar)+
  scale_x_continuous(limits = c(0,55),
                     expand = c(0,0))+
  labs(x = "Experience (Years)",
       y = "Wage",
       color = "Gender")+
  scale_color_manual(values = c("Female" = "#e64173",
                                "Male" = "#0047AB")
                     )+ # setting custom colors
  theme_bw()+
  theme(legend.position = "bottom")
interaction_plot

Interactions in Our Example: Scatterplot

interaction_plot + 
  geom_smooth(method = "lm")

Interactions in Our Example: Scatterplot

interaction_plot + 
  geom_smooth(method = "lm") +
  facet_wrap(~ gender)

Interactions in Our Example: Regression in R

  • Syntax for adding an interaction term is easy¹ in R: x1:x2
    • Or could just do x1 * x2 (multiply)
# both are identical in R
interaction_reg <- lm(wage ~ exper * female, data = wages)
interaction_reg <- lm(wage ~ exper + female + exper:female, data = wages)
term           estimate      std.error    statistic   p.value
<chr>          <dbl>         <dbl>        <dbl>       <dbl>
(Intercept)    6.15827549    0.34167408   18.023830   7.998534e-57
exper          0.05360476    0.01543716   3.472450    5.585255e-04
female         -1.54654677   0.48186030   -3.209534   1.411253e-03
exper:female   -0.05506989   0.02217496   -2.483427   1.332533e-02
  1. There are several options here. (1) Using :, running y ~ x1:x2 will run \( Y = \beta_0 + \beta_3 (X_1 \times X_2) \) only (i.e. not including the x1 and x2 terms). You can of course add them in yourself by running y ~ x1 + x2 + x1:x2, as in my example above. (2) Using *, running y ~ x1*x2 will run the full \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) \) (i.e. including the x1 and x2 terms).
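
One way to see what each formula expands to, without fitting anything, is to inspect the column names of the design matrix. The sketch below uses base R's model.matrix() on a hypothetical four-row data frame; x1 and x2 are placeholder names.

```r
# Hypothetical placeholder data; only the column names matter here
toy <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(0, 1, 0, 1))

# x1 * x2 expands to both main effects plus the interaction
cols_star <- colnames(model.matrix(~ x1 * x2, data = toy))
cols_star

# Writing the terms out by hand gives the same design matrix columns
cols_long <- colnames(model.matrix(~ x1 + x2 + x1:x2, data = toy))
cols_long

# x1:x2 on its own gives only the constant and the interaction term
cols_colon <- colnames(model.matrix(~ x1:x2, data = toy))
cols_colon
```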

Interactions in Our Example: Regression

modelsummary(models = list("Wage" = interaction_reg),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
               Wage
Constant       6.16***
               (0.34)
exper          0.05***
               (0.02)
female         −1.55***
               (0.48)
exper:female   −0.06**
               (0.02)
n              526
Adj. R2        0.13
SER            3.43
* p < 0.1, ** p < 0.05, *** p < 0.01

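Interactions in Our Example: Interpreting Coefficients

$$\widehat{\text{wage}}_i = 6.16 + 0.05 \, \text{experience}_i - 1.55 \, \text{female}_i - 0.06 \, (\text{experience}_i \times \text{female}_i)$$

  • \( \hat{\beta}_0 \): Men with 0 years of experience earn $6.16 on average
  • \( \hat{\beta}_1 \): For every additional year of experience, men earn $0.05 more on average
  • \( \hat{\beta}_2 \): Women with 0 years of experience earn $1.55 less than men on average
  • \( \hat{\beta}_3 \): For every additional year of experience, women gain $0.06 less than men gain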

Interactions in Our Example: As Two Regressions I

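$$\widehat{\text{wage}}_i = 6.16 + 0.05 \, \text{experience}_i - 1.55 \, \text{female}_i - 0.06 \, (\text{experience}_i \times \text{female}_i)$$

Regression for men (female = 0)

$$\widehat{\text{wage}}_i = 6.16 + 0.05 \, \text{experience}_i$$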

  • Men with 0 years of experience earn $6.16 on average
  • For every additional year of experience, men earn $0.05 more on average

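Interactions in Our Example: As Two Regressions II

$$\widehat{\text{wage}}_i = 6.16 + 0.05 \, \text{experience}_i - 1.55 \, \text{female}_i - 0.06 \, (\text{experience}_i \times \text{female}_i)$$

Regression for women (female = 1)

$$\begin{aligned} \widehat{\text{wage}}_i &= 6.16 + 0.05 \, \text{experience}_i - 1.55 (1) - 0.06 \, \text{experience}_i \times (1) \\ &= (6.16 - 1.55) + (0.05 - 0.06) \, \text{experience}_i \\ &= 4.61 - 0.01 \, \text{experience}_i \end{aligned}$$

  • Women with 0 years of experience earn $4.61 on average
  • For every additional year of experience, women earn $0.01 less on average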
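
The two lines can also be rebuilt from the unrounded estimates. The sketch below copies the four coefficients from the tidy() output of this regression shown in these slides; the variable names are illustrative. With full precision the slope for women comes out to roughly −0.0015, which is closer to zero than the −0.01 obtained from the two-decimal coefficients.

```r
# Estimates copied from the tidy() output of the interaction regression
b0 <- 6.15827549   # (Intercept)
b1 <- 0.05360476   # exper
b2 <- -1.54654677  # female
b3 <- -0.05506989  # exper:female

# Implied regression line for each group
men   <- c(intercept = b0,      slope = b1)
women <- c(intercept = b0 + b2, slope = b1 + b3)

round(men, 4)
round(women, 4)
```

Rounding each coefficient before subtracting overstates how negative the slope for women is, so it is safer to combine the unrounded estimates and round only at the end.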

Interactions in Our Example: Hypothesis Testing

$$\widehat{\text{wage}}_i = 6.16 + 0.05 \, \text{experience}_i - 1.55 \, \text{female}_i - 0.06 \, (\text{experience}_i \times \text{female}_i)$$

term           estimate      std.error    statistic   p.value
<chr>          <dbl>         <dbl>        <dbl>       <dbl>
(Intercept)    6.15827549    0.34167408   18.023830   7.998534e-57
exper          0.05360476    0.01543716   3.472450    5.585255e-04
female         -1.54654677   0.48186030   -3.209534   1.411253e-03
exper:female   -0.05506989   0.02217496   -2.483427   1.332533e-02
  • Are intercepts of the 2 regressions different? H0:β2=0
    • Difference between men vs. women for no experience?
    • Is ^β2 significant?
    • Yes (reject) H0: p-value = 0.00
  • Are slopes of the 2 regressions different? H0:β3=0
    • Difference between men vs. women for marginal effect of experience?
    • Is ^β3 significant?
    • Yes (reject) H0: p-value = 0.01

Interactions Between Two Dummy Variables

Interactions Between Two Dummy Variables

  • Does the marginal effect on Y of one dummy going from “off” to “on” change depending on whether the other dummy is “off” or “on”?

Interactions Between Two Dummy Variables

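$$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i})$$

  • \( D_{1i} \) and \( D_{2i} \) are dummy variables
  • \( \hat{\beta}_1 \): effect on \( Y \) of going from \( D_1 = 0 \) to \( D_1 = 1 \) when \( D_2 = 0 \)
  • \( \hat{\beta}_2 \): effect on \( Y \) of going from \( D_2 = 0 \) to \( D_2 = 1 \) when \( D_1 = 0 \)
  • \( \hat{\beta}_3 \): effect on \( Y \) of going from \( D_1 = 0 \) to \( D_1 = 1 \) when \( D_2 = 1 \) vs. \( D_2 = 0 \)¹
    • increment to the effect of \( D_{1i} \) going from 0 to 1 when \( D_{2i} = 1 \) (vs. 0)
  • As always, best to think logically about possibilities (when each dummy =0 or =1)
  1. And the effect on \( Y \) of going from \( D_2 = 0 \) to \( D_2 = 1 \) when \( D_1 = 1 \) vs. \( D_1 = 0 \).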

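2 Dummy Interaction: Interpreting Coefficients

$$Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i})$$

  • To interpret coefficients, compare cases:
    • Hold \( D_2 \) constant (set to some value \( D_2 = d_2 \))
    • Let \( D_1 \) change from 0 to 1:

$$\begin{aligned} E(Y | D_1 = 0, D_2 = d_2) &= \beta_0 + \beta_2 d_2 \\ E(Y | D_1 = 1, D_2 = d_2) &= \beta_0 + \beta_1 (1) + \beta_2 d_2 + \beta_3 (1) d_2 \end{aligned}$$

  • Subtracting the two, the difference is:

$$\beta_1 + \beta_3 d_2$$

  • The marginal effect of \( D_1 \rightarrow Y \) depends on the value of \( D_2 \)
    • \( \hat{\beta}_3 \) is the increment to the effect of \( D_1 \) on \( Y \) when \( D_2 \) goes from 0 to 1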

Interactions Between 2 Dummy Variables: Example

Example

Does the gender pay gap change if a person is married vs. single?

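$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{female}_i + \hat{\beta}_2 \text{married}_i + \hat{\beta}_3 (\text{female}_i \times \text{married}_i)$$

  • Logically, there are 4 possible combinations of \( \text{female}_i \in \{0, 1\} \) and \( \text{married}_i \in \{0, 1\} \)
  1. Unmarried men (\( \text{female}_i = 0, \text{married}_i = 0 \))

$$\widehat{\text{wage}}_i = \hat{\beta}_0$$

  2. Married men (\( \text{female}_i = 0, \text{married}_i = 1 \))

$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_2$$

  3. Unmarried women (\( \text{female}_i = 1, \text{married}_i = 0 \))

$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1$$

  4. Married women (\( \text{female}_i = 1, \text{married}_i = 1 \))

$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3$$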

Conditional Group Means in the Data

# get average wage for unmarried men
wages %>%
  filter(female == 0,
         married == 0) %>%
  summarize(mean = mean(wage))
mean
<dbl>
5.168023
# get average wage for married men
wages %>%
  filter(female == 0,
         married == 1) %>%
  summarize(mean = mean(wage))
mean
<dbl>
7.983032
# get average wage for unmarried women
wages %>%
  filter(female == 1,
         married == 0) %>%
  summarize(mean = mean(wage))
mean
<dbl>
4.611583
# get average wage for married women
wages %>%
  filter(female == 1,
         married == 1) %>%
  summarize(mean = mean(wage))
mean
<dbl>
4.565909

Two Dummies Interaction: Group Means

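$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{female}_i + \hat{\beta}_2 \text{married}_i + \hat{\beta}_3 (\text{female}_i \times \text{married}_i)$$

            Men     Women
Unmarried   $5.17   $4.61
Married     $7.98   $4.57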

Two Dummies Interaction: Regression in R I

reg_dummies <- lm(wage ~ female + married + female:married, data = wages)
reg_dummies %>% tidy()
term             estimate     std.error   statistic   p.value
<chr>            <dbl>        <dbl>       <dbl>       <dbl>
(Intercept)      5.1680233    0.3614348   14.298631   2.255740e-39
female           -0.5564399   0.4735578   -1.175020   2.405224e-01
married          2.8150086    0.4363413   6.451391    2.531401e-10
female:married   -2.8606829   0.6075577   -4.708496   3.202330e-06

Two Dummies Interaction: Regression in R II

modelsummary(models = list("Wage" = reg_dummies),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
                 Wage
Constant         5.17***
                 (0.36)
female           −0.56
                 (0.47)
married          2.82***
                 (0.44)
female:married   −2.86***
                 (0.61)
n                526
Adj. R2          0.18
SER              3.34
* p < 0.1, ** p < 0.05, *** p < 0.01

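Two Dummies Interaction: Interpreting Coefficients I

$$\widehat{\text{wage}}_i = 5.17 - 0.56 \, \text{female}_i + 2.82 \, \text{married}_i - 2.86 \, (\text{female}_i \times \text{married}_i)$$

            Men     Women
Unmarried   $5.17   $4.61
Married     $7.98   $4.57
  • Wage for unmarried men: \( \hat{\beta}_0 = 5.17 \)
  • Wage for married men: \( \hat{\beta}_0 + \hat{\beta}_2 = 5.17 + 2.82 = 7.98 \)
  • Wage for unmarried women: \( \hat{\beta}_0 + \hat{\beta}_1 = 5.17 - 0.56 = 4.61 \)
  • Wage for married women: \( \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3 = 5.17 - 0.56 + 2.82 - 2.86 = 4.57 \)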

Two Dummies Interaction: Interpreting Coefficients II

$$\widehat{\text{wage}}_i = 5.17 - 0.56 \, \text{female}_i + 2.82 \, \text{married}_i - 2.86 \, (\text{female}_i \times \text{married}_i)$$

            Men     Women    Diff
Unmarried   $5.17   $4.61    $0.56
Married     $7.98   $4.57    $3.41
Diff        $2.81   −$0.04   $2.85
  • \( \hat{\beta}_0 \): Wage for unmarried men
  • \( \hat{\beta}_1 \): Difference in wages between men and women who are unmarried
  • \( \hat{\beta}_2 \): Difference in wages between married and unmarried men
  • \( \hat{\beta}_3 \): Difference in:
    • effect of Marriage on wages between men and women
    • effect of Gender on wages between unmarried and married individuals
    • “difference in differences”
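
The "difference in differences" reading can be checked by rebuilding the four group means from the unrounded estimates. The sketch below copies the coefficients from the tidy() output of this regression in these slides; the variable names are illustrative.

```r
# Estimates copied from the tidy() output of the two-dummy regression
b0 <- 5.1680233   # (Intercept)
b1 <- -0.5564399  # female
b2 <- 2.8150086   # married
b3 <- -2.8606829  # female:married

# The four group means implied by the regression
means <- c(
  unmarried_men   = b0,
  married_men     = b0 + b2,
  unmarried_women = b0 + b1,
  married_women   = b0 + b1 + b2 + b3
)
round(means, 2)

# Effect of marriage within each gender
marriage_gap_men   <- means[["married_men"]] - means[["unmarried_men"]]
marriage_gap_women <- means[["married_women"]] - means[["unmarried_women"]]

# Difference between those two effects: equals the interaction coefficient
diff_in_diff <- marriage_gap_women - marriage_gap_men
c(diff_in_diff = diff_in_diff, b3 = b3)
```

The four implied means match the conditional group means computed from the data earlier, which is expected because the model has one free parameter per group.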

Interactions Between Two Continuous Variables

Interactions Between Two Continuous Variables

  • Does the marginal effect of X1 on Y depend on what X2 is set to?

Interactions Between Two Continuous Variables

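$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2)$$

  • To interpret coefficients, compare cases after changing \( X_1 \) by \( \Delta X_1 \) (holding \( X_2 \) constant):

$$Y + \Delta Y = \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 + \beta_3 ((X_1 + \Delta X_1) \times X_2)$$

  • Take the difference to get:

$$\begin{aligned} \Delta Y &= \beta_1 \Delta X_1 + \beta_3 X_2 \Delta X_1 \\ \frac{\Delta Y}{\Delta X_1} &= \beta_1 + \beta_3 X_2 \end{aligned}$$

  • The effect of \( X_1 \rightarrow Y \) depends on the value of \( X_2 \)
    • \( \beta_3 \): increment to the effect of \( X_1 \rightarrow Y \) for every 1 unit change in \( X_2 \)
  • Likewise: the effect of \( X_2 \rightarrow Y \) depends on the value of \( X_1 \)

$$\frac{\Delta Y}{\Delta X_2} = \beta_2 + \beta_3 X_1$$

  • \( \beta_3 \): increment to the effect of \( X_2 \rightarrow Y \) for every 1 unit change in \( X_1 \)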

Continuous Variables Interaction: Example

Example

Do education and experience interact in their determination of wages?

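$$\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{education}_i + \hat{\beta}_2 \text{experience}_i + \hat{\beta}_3 (\text{education}_i \times \text{experience}_i)$$

  • Estimated effect of education on wages depends on the amount of experience (and vice versa)!

$$\frac{\Delta \text{wage}}{\Delta \text{education}} = \hat{\beta}_1 + \hat{\beta}_3 \, \text{experience}_i$$

$$\frac{\Delta \text{wage}}{\Delta \text{experience}} = \hat{\beta}_2 + \hat{\beta}_3 \, \text{education}_i$$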

  • This is a type of nonlinearity (we will examine nonlinearities next lesson)

Continuous Variables Interaction: In R I

reg_cont <- lm(wage ~ educ + exper + educ:exper, data = wages)
reg_cont %>% tidy()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -2.859915627 | 1.181079647 | -2.4214418 | 1.579891e-02 |
| educ | 0.601735470 | 0.089899977 | 6.6933885 | 5.640482e-11 |
| exper | 0.045768911 | 0.042613758 | 1.0740407 | 2.833007e-01 |
| educ:exper | 0.002062345 | 0.003490614 | 0.5908258 | 5.548929e-01 |
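In R's formula syntax, `educ * exper` is shorthand for `educ + exper + educ:exper`, so the interaction model can be written either way. The sketch below checks this on simulated stand-in data: the `sim` data frame and its coefficients are made up for illustration and are not the `wages` data used in these slides.

```r
set.seed(480)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data (NOT the wages data from the slides)
n <- 200
sim <- data.frame(educ  = sample(8:18, n, replace = TRUE),
                  exper = sample(0:40, n, replace = TRUE))
sim$wage <- 1 + 0.5 * sim$educ + 0.1 * sim$exper +
  0.01 * sim$educ * sim$exper + rnorm(n)

# Explicit main effects plus interaction term, as written on the slide
reg_long <- lm(wage ~ educ + exper + educ:exper, data = sim)

# Shorthand: educ * exper expands to educ + exper + educ:exper
reg_short <- lm(wage ~ educ * exper, data = sim)

# The two specifications estimate identical coefficients
all.equal(coef(reg_long), coef(reg_short))
```

Writing the terms out explicitly makes it obvious which coefficients are being estimated; the `*` shorthand is convenient but adds the main effects implicitly.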

Continuous Variables Interaction: In R II

modelsummary(models = list("Wage" = reg_cont),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
| | Wage |
|---|---|
| Constant | −2.86** |
| | (1.18) |
| educ | 0.60*** |
| | (0.09) |
| exper | 0.05 |
| | (0.04) |
| educ:exper | 0.00 |
| | (0.00) |
| n | 526 |
| Adj. R² | 0.22 |
| SER | 3.25 |

* p < 0.1, ** p < 0.05, *** p < 0.01

Continuous Variables Interaction: Marginal Effects

$$\widehat{wage}_i = -2.860 + 0.602 \, education_i + 0.046 \, experience_i + 0.002 \, (education_i \times experience_i)$$

Marginal Effect of Education on Wages by Years of Experience:

| Experience | $\frac{\Delta wage}{\Delta education} = \hat{\beta}_1 + \hat{\beta}_3 \, experience$ |
|---|---|
| 5 years | 0.602 + 0.002(5) = 0.612 |
| 10 years | 0.602 + 0.002(10) = 0.622 |
| 15 years | 0.602 + 0.002(15) = 0.632 |

  • Marginal effect of education → wages increases with more experience

Continuous Variables Interaction: Marginal Effects

$$\widehat{wage}_i = -2.860 + 0.602 \, education_i + 0.046 \, experience_i + 0.002 \, (education_i \times experience_i)$$

Marginal Effect of Experience on Wages by Years of Education:

| Education | $\frac{\Delta wage}{\Delta experience} = \hat{\beta}_2 + \hat{\beta}_3 \, education$ |
|---|---|
| 5 years | 0.046 + 0.002(5) = 0.056 |
| 10 years | 0.046 + 0.002(10) = 0.066 |
| 15 years | 0.046 + 0.002(15) = 0.076 |

  • Marginal effect of experience → wages increases with more education

  • If you want to estimate the marginal effects more precisely, and graph them, see today’s appendix
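The hand calculations in the two tables can be reproduced in a few lines of base R. This sketch plugs in the unrounded estimates copied from the `tidy()` output earlier in this section rather than refitting the model, so the third decimal can differ slightly from the hand calculations that use rounded coefficients. The object names are illustrative.

```r
# Unrounded estimates copied from the tidy() output for reg_cont
b_educ       <- 0.601735470
b_exper      <- 0.045768911
b_educ_exper <- 0.002062345

# Values at which to evaluate each marginal effect
years <- c(5, 10, 15)

# Marginal effect of education at 5, 10, and 15 years of experience
me_educ <- b_educ + b_educ_exper * years

# Marginal effect of experience at 5, 10, and 15 years of education
me_exper <- b_exper + b_educ_exper * years

round(data.frame(years, me_educ, me_exper), 3)
```

Because `years` is a vector, each marginal effect is computed for all three values at once; swapping in a finer grid such as `0:40` would give the inputs for a plot.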
