2.5 — Precision and Diagnostics

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner
Associate Professor of Economics

## Contents

Variation in $\hat{\beta}_1$

Presenting Regression Results

Heteroskedasticity

Outliers

## The Sampling Distribution of $\hat{\beta_1}$

$\hat{\beta_1} \sim N(\mathbb{E}[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$

## The Sampling Distribution of $\hat{\beta_1}$

$\hat{\beta_1} \sim N(\mathbb{E}[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$

1. Center1 of the distribution: $\mathbb{E}[\hat{\beta_1}]$ (last class)

## The Sampling Distribution of $\hat{\beta_1}$

$\hat{\beta_1} \sim N(\mathbb{E}[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$

1. Center1 of the distribution: $\mathbb{E}[\hat{\beta_1}]$ (last class)

2. Precision or uncertainty of the estimate (today)

• Variance $\sigma^2_{\hat{\beta}_1}$
• Standard error2 $\sigma_{\hat{\beta}_1} = \sqrt{var(\hat{\beta}_1)}$

## The Sampling Distribution of $\hat{\beta_1}$

$\hat{\beta_1} \sim N(\mathbb{E}[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$

1. Center1 of the distribution: $\mathbb{E}[\hat{\beta_1}]$ (last class)

2. Precision or uncertainty of the estimate (today)

• Variance $\sigma^2_{\hat{\beta}_1}$
• Standard error2 $\sigma_{\hat{\beta}_1} = \sqrt{var(\hat{\beta}_1)}$

# Variation in $\hat{\beta}_1$

## What Affects Variation in $\hat{\beta_1}$

$var(\hat{\beta_1})=\frac{(SER)^2}{n \times var(X)}$

$se(\hat{\beta_1})=\sqrt{var(\hat{\beta_1})} = \frac{SER}{\sqrt{n} \times sd(X)}$

• Variation in $\hat{\beta_1}$ is affected by 3 things:
1. Goodness of fit of the model (SER)1
• Larger $SER$ $\rightarrow$ larger $var(\hat{\beta_1})$
2. Sample size, n
• Larger $n$ $\rightarrow$ smaller $var(\hat{\beta_1})$
3. Variance of X
• Larger $var(X)$ $\rightarrow$ smaller $var(\hat{\beta_1})$

# Presenting Regression Results

## Our Class Size Regression

school_reg %>% summary()

Call:
lm(formula = testscr ~ str, data = ca_school)

Residuals:
Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
str          -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
• How can we present all of this information in a tidy way?

## Our Class Size Regression

library(broom)
school_reg %>% tidy()
term estimate std.error statistic p.value
(Intercept) 698.932952 9.4674914 73.824514 0.0e+00
str -2.279808 0.4798256 -4.751327 2.8e-06

school_reg %>% glance()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.0512401 0.0489703 18.58097 22.57511 2.8e-06 1 -1822.25 3650.499 3662.62 144315.5 418 420
• Better (?), but still not how you see regressions reported in reports…especially when you have many regression models!

## Regression Tables

• Professional journals and papers often have a regression table, including:
• Estimates of $\hat{\beta_0}$ and $\hat{\beta_1}$
• Standard errors of $\hat{\beta_0}$ and $\hat{\beta_1}$ (often below, in parentheses)
• Indications of statistical significance (often with asterisks)
• Measures of regression fit: $R^2$, $SER$, etc
• Later: multiple rows & columns for multiple variables & models
Test Score
Constant 698.93***
(9.47)
STR −2.28***
(0.48)
n 420
R2 0.05
SER 18.54
* p < 0.1, ** p < 0.05, *** p < 0.01

## Regression Output Tables

Test Score
Constant 698.93***
(9.47)
STR −2.28***
(0.48)
n 420
R2 0.05
SER 18.54
* p < 0.1, ** p < 0.05, *** p < 0.01

## Using modelsummary I

• You will need to first install.packages("modelsummary")

• Load with library(modelsummary)

• Command: modelsummary()

• Main argument is the name of your lm regression object

• Default output is fine, but often we want to customize a bit!

# install.packages("modelsummary") # install first!
library(modelsummary)

modelsummary(school_reg) # our regression
Model 1
(Intercept) 698.933
(9.467)
str −2.280
(0.480)
Num.Obs. 420
R2 0.051
AIC 3650.5
BIC 3662.6
F 22.575
RMSE 18.54

## Using modelsummary II

• Whole command is modelsummary(), everything will go in ()
1. models, a list() of models to use, can give a name to each model, will show up as column title in table
models = list("Test Score" = school_reg) # set name to "Test Score"
1. coef_rename if you want to rename any independent variables as something nicer than their names in the dataset
• "old name" = "new name" (yes annoying!)
coef_rename = list("(Intercept)" = "Constant",
"str" = "Student Teacher Ratio")

## Using modelsummary III

• Whole command is modelsummary(), everything will go in ()
1. gof_map: a list() of goodness of fit statistics, can customize what you want to include/exclude, what you want to label them in the table…a bit advanced, here’s what I like:
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
#list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2), # we'll want this later!
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
)
1. Other minor options (combine with commas):
fmt = 2, # round to 2 decimals
output = "html" # depending on type of document creating; pdf would be "latex"
escape = FALSE # allows formatting of things like <sup>2</sup>
stars = c('*' = .1, '**' = .05, '***' = 0.01) # show significance levels if set to true, I don't like the defaults so I set my own

## Using modelsummary IV

modelsummary(models = list("Test Score" = school_reg),
fmt = 2, # round to 2 decimals
output = "html",
coef_rename = c("(Intercept)" = "Constant",
"str" = "STR"),
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
),
escape = FALSE,
stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Test Score
Constant 698.93***
(9.47)
STR −2.28***
(0.48)
n 420
R2 0.05
SER 18.54
* p < 0.1, ** p < 0.05, *** p < 0.01

## modelplot() in modelsummary

Also nice about the modelsummary package is the command modelplot()

modelplot(school_reg)

## modelplot() in modelsummary

Also nice about the modelsummary package is the command modelplot()

modelplot(school_reg,
coef_omit = 'Intercept') # don't show intercept

## Though You Could Make It Yourself in ggplot

• Use the conf.low and conf.high (from a tidy regression) as xmin and xmax aesthetics inside geom_errorbarh().
school_reg %>%
tidy(conf.int = TRUE) %>%
filter(term == "str") %>%
ggplot()+
aes(x = estimate,
y = term)+
geom_point(size = 3)+
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
height = 0.1)+ # height of whiskers
theme_light()

## Diagnostics: Residuals I

• We often look at the residuals of a regression to get more insight about its goodness of fit and its bias

• Recall broom’s augment creates some useful new variables

• .fitted are fitted (predicted) values from model, i.e. $\hat{Y}_i$
• .resid are residuals (errors) from model, i.e. $\hat{u}_i$

## Diagnostics: Residuals II

• Often a good idea to store in a new object (so we can make some plots)
aug_reg <- augment(school_reg)

aug_reg %>% head()
testscr str .fitted .resid .hat .sigma .cooksd .std.resid
690.80 17.88991 658.1474 32.65260 0.0044244 18.53408 0.0068925 1.7612148
661.20 21.52466 649.8608 11.33917 0.0047485 18.59490 0.0008927 0.6117112
643.60 18.69723 656.3069 -12.70689 0.0029742 18.59279 0.0006996 -0.6848850
647.70 17.35714 659.3620 -11.66198 0.0058575 18.59441 0.0011673 -0.6294767
640.85 18.67133 656.3659 -15.51592 0.0030072 18.58766 0.0010548 -0.8363024
605.55 21.40625 650.1308 -44.58076 0.0044603 18.47411 0.0129531 -2.4046387

• We make 4 critical assumptions about $u$:
1. The expected value of the errors is 0

$\mathbb{E}[u]=0$

1. The variance of the errors over $X$ is constant:

$var(u|X)=\sigma^2_{u}$

1. Errors are not correlated across observations:

$cor(u_i,u_j)=0 \quad \forall i \neq j$

1. There is no correlation between $X$ and the error term:

$cor(X, u)=0 \text{ or } E[u|X]=0$

## Assumptions 1 and 2: Errors are i.i.d.

• Assumptions 1 and 2 assume that errors are coming from the same (normal) distribution

$u \sim N(0, \sigma_u)$

• Assumption 1: $E[u]=0$
• Assumption 2: $sd(u|X)=\sigma_u$
• virtually always unknown…
• We often can visually check by plotting a histogram of $u$

## Plotting a Histogram of Residuals

ggplot(data = aug_reg)+
aes(x = .resid)+
geom_histogram(color="white", fill = "pink")+
labs(x = expression(paste("Residual, ", hat(u))))+
scale_y_continuous(limits = c(20, 40),
expand = c(0,0))+
theme_light(base_family = "Fira Sans Condensed",
base_size=20)

## Checking the Distribution of Residuals

school_reg %>% summary()

Call:
lm(formula = testscr ~ str, data = ca_school)

Residuals:
Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
str          -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
aug_reg %>%
summarize(E_u = mean(.resid),
sd_u = sd(.resid))
E_u sd_u
0 18.55878

## Residual Plot

• We often plot a residual plot to see any odd patterns about residuals
• $x$-axis are $\hat{Y}_i$ values (.fitted)
• $y$-axis are $u_i$ values (.resid)
ggplot(data = aug_reg)+
aes(x = .fitted,
y = .resid)+
geom_point(color = "blue")+
geom_hline(aes(yintercept = 0), color = "red")+
labs(x = expression(paste("Predicted Test Score,", hat(y)[i])),
y = expression(paste("Residual, ", hat(u)[i])))+
theme_light(base_family = "Fira Sans Condensed",
base_size = 20)

# Heteroskedasticity

## Homoskedasticity

• Homoskedasticity:” variance of the residuals over $X$ is constant, written:

$var(u|X)=\sigma^2_{u}$

• Knowing the value of $X$ does not affect the variance (spread) of the errors

## Heteroskedasticity I

• Heteroskedasticity:” variance of the residuals over $X$ is NOT constant:

$var(u|X) \neq \sigma^2_{u}$

• This does not cause $\hat{\beta_1}$ to be biased, but it does cause the standard error of $\hat{\beta_1}$ to be incorrect

• This does cause a problem for inference!

• Specifically, it will make $se(\hat{\beta}_1)$ wrong (often too small)1

## Heteroskedasticity II

• Recall the formula for the standard error of $\hat{\beta_1}$:

$se(\hat{\beta_1})=\sqrt{var(\hat{\beta_1})} = \frac{SER}{\sqrt{n} \times sd(X)}$

• This assumes homoskedasticity (Assumption 2)

## Heteroskedasticity III

• A better formula for estimating standard errors that are robust to heteroskedasticity (called “robust standard errors”):

$se(\hat{\beta_1})=\sqrt{\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2\hat{u}^2}{\big[\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2\big]^2}}$

• Don’t learn formula, do learn what heteroskedasticity is and how it affects our model!

## Visualizing Heteroskedasticity I

• Our original scatterplot with regression line

• Does the spread of the errors change over different values of $str$?

• No: homoskedastic
• Yes: heteroskedastic

## Visualizing Heteroskedasticity

• Notice the distribution of $\hat{u}$, changes for different values of STR, and $\sigma_{\hat{u}}$ is not constant

## More Obvious Heteroskedasticity

• Visual cue: data is “fan-shaped”
• Data points are closer to line in some areas
• Data points are more spread from line in other areas

## What Might Cause Heteroskedastic Errors?

$wage_i=\beta_0+\beta_1educ_i+u_i$

## What Might Cause Heteroskedastic Errors?

$wage_i=\beta_0+\beta_1educ_i+u_i$

## What Might Cause Heteroskedastic Errors?

$wage_i=\beta_0+\beta_1educ_i+u_i$

Wage
Intercept −0.90
(0.68)
Years of Schooling 0.54***
(0.05)
n 526
R2 0.16
SER 3.37
* p < 0.1, ** p < 0.05, *** p < 0.01

## Detecting Heteroskedasticity I

• Several tests to check if data is heteroskedastic
• One common test is Breusch-Pagan test
• Can use the lmtest package’s function bptest()
• $H_0$: homoskedastic1
• If $p$-value < 0.05, reject $H_0\implies$ heteroskedastic
library("lmtest")
school_reg %>% bptest()

studentized Breusch-Pagan test

data:  .
BP = 5.7936, df = 1, p-value = 0.01608
• Since $p<0.05$, can reject $H_0$ that errors are homoskedastic and conclude they are heteroskedastic

## How About the Wages Regression?

wage_reg %>% bptest()

studentized Breusch-Pagan test

data:  .
BP = 15.306, df = 1, p-value = 9.144e-05

## Fixing Heteroskedasticity I

• Heteroskedasticity is easy to fix with software that can calculate robust standard errors (using the more complicated formula above)
• Easiest method is to use estimatr package
• lm_robust() command (instead of lm) to run regression
• set se_type = "stata" to calculate robust SEs using the formula above1
#install.packages("estimatr")
library(estimatr)

## Fixing Heteroskedasticity II

school_reg_robust <- lm_robust(testscr ~ str, data = ca_school,
se_type = "stata")

school_reg_robust
              Estimate Std. Error   t value      Pr(>|t|)   CI Lower   CI Upper
(Intercept) 698.932952 10.3643599 67.436191 9.486678e-227 678.560192 719.305713
str          -2.279808  0.5194892 -4.388557  1.446737e-05  -3.300945  -1.258671
DF
(Intercept) 418
str         418

school_reg_robust %>% summary()

Call:
lm_robust(formula = testscr ~ str, data = ca_school, se_type = "stata")

Standard error type:  HC1

Coefficients:
Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept)   698.93    10.3644  67.436 9.487e-227  678.560  719.306 418
str            -2.28     0.5195  -4.389  1.447e-05   -3.301   -1.259 418

Multiple R-squared:  0.05124 ,  Adjusted R-squared:  0.04897
F-statistic: 19.26 on 1 and 418 DF,  p-value: 1.447e-05

## Fixing Heteroskedasticity III

# can tidy, glance, augment, etc
school_reg_robust %>% tidy()
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) 698.932952 10.3643599 67.436191 0.00e+00 678.560192 719.305713 418 testscr
str -2.279808 0.5194892 -4.388557 1.45e-05 -3.300945 -1.258671 418 testscr

school_reg_robust %>% glance()
r.squared adj.r.squared statistic p.value df.residual nobs se_type
0.0512401 0.0489703 19.25943 1.45e-05 418 420 HC1

## Showing The Effect of Heteroskedasticity (on $se(\hat{\beta}_1)$)

modelsummary(models = list("Normal SE" = school_reg,
"Robust SE" = school_reg_robust),
fmt = 2, # round to 2 decimals
output = "html",
coef_rename = c("(Intercept)" = "Constant",
"str" = "STR"),
gof_map = list(
list("raw" = "nobs", "clean" = "n", "fmt" = 0),
list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
),
escape = FALSE,
stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
Normal SE Robust SE
Constant 698.93*** 698.93***
(9.47) (10.36)
STR −2.28*** −2.28***
(0.48) (0.52)
n 420 420
R2 0.05 0.05
SER 18.54 18.54
* p < 0.1, ** p < 0.05, *** p < 0.01
• What changed?

# Outliers

## Outliers Can Bias OLS! I

• Outliers can affect the slope (and intercept) of the line and add bias
• May be result of human error (measurement, transcribing, etc)
• May be meaningful and accurate
• In any case, compare how including/dropping outliers affects regression and always discuss outliers!

## Outliers Can Bias OLS! III

Original With Outliers
Constant 698.93*** 641.40***
(9.47) (11.21)
STR −2.28*** 0.71
(0.48) (0.57)
n 420 423
R2 0.05 0.00
SER 18.54 23.71
* p < 0.1, ** p < 0.05, *** p < 0.01

## Detecting Outliers

• The car package has an outlierTest command to run on the regression
# install.packages("car")
library("car")

# Use Bonferonni test
outlierTest(school_outlier_reg) # will point out which obs #s seem outliers
    rstudent unadjusted p-value Bonferroni p
422 8.822768         3.0261e-17   1.2800e-14
423 7.233470         2.2493e-12   9.5147e-10
421 6.232045         1.1209e-09   4.7414e-07
# find these observations
ca_school_outliers %>%
slice(c(422,423,421)) # find observations 422, 423, 421
observat district testscr str
422 Crazy District 2 850 28
423 Crazy District 3 820 29
421 Crazy District 1 800 30 