2.6 — Inference for Regression

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner
Associate Professor of Economics

Contents

Why Uncertainty Matters

Confidence Intervals

Confidence Intervals Using the infer Package

Confidence Intervals, Theory

Why Uncertainty Matters

Recall: Two Big Problems with Data

• We use econometrics to identify causal relationships & make inferences about them:
1. Problem for identification: endogeneity
• $X$ is exogenous if $cor(x, u) = 0$
• $X$ is endogenous if $cor(x, u) \neq 0$
1. Problem for inference: randomness
• Data is random due to natural sampling variation
• Taking one sample of a population will yield slightly different information than another sample of the same population

Distributions of the OLS Estimators

$Y_i = \beta_0+\beta_1 X_i+u_i$

• OLS estimators $(\hat{\beta_0}$ and $\hat{\beta_1})$ are computed from a finite (specific) sample of data

• Our OLS model contains 2 sources of randomness:

• Modeled randomness: population $u_i$ includes all factors affecting $Y$ other than $X$
• different samples will have different values of those other factors $(u_i)$
• Sampling randomness: different samples will generate different OLS estimators
• Thus, $\hat{\beta_0}, \hat{\beta_1}$ are also random variables, with their own sampling distribution

The Two Problems: Where We’re Heading…Ultimately

Sample $\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}$ Population $\color{#e64173}{\xrightarrow{\text{causal indentification}}}$ Unobserved Parameters

• We want to identify causal relationships between population variables
• Logically first thing to consider
• Endogeneity problem
• We’ll use sample statistics to infer something about population parameters
• In practice, we’ll only ever have a finite sample distribution of data
• We don’t know the population distribution of data
• Randomness problem

Why Sample vs. Population Matters

Population relationship
$Y_i = 698.93 + -2.28 X_i + u_i$

$Y_i = \beta_0 + \beta_1 X_i + u_i$

Why Sample vs. Population Matters

Sample 1: 50 random observations

Population relationship
$Y_i = 698.93 + -2.28 X_i + u_i$

Sample relationship
$\hat{Y}_i = 708.12 + -2.54 X_i$

Why Sample vs. Population Matters

Sample 2: 50 random individuals

Population relationship
$Y_i = 698.93 + -2.28 X_i + u_i$

Sample relationship
$\hat{Y}_i = 708.12 + -2.54 X_i$

Why Sample vs. Population Matters

Sample 3: 50 random individuals

Population relationship
$Y_i = 698.93 + -2.28 X_i + u_i$

Sample relationship
$\hat{Y}_i = 708.12 + -2.54 X_i$

Why Sample vs. Population Matters

• Let’s repeat this process 10,000 times!

• This exercise is called a (Monte Carlo) simulation

• I’ll show you how to do this next class with the infer package

Why Sample vs. Population Matters

• On average, estimated regression lines from (hypothetical) samples provide an unbiased estimate of true population regression line

$\mathbb{E}[\hat{\beta}_1] = \beta_1$

• But, any individual estimate can miss the mark

• This leads to uncertainty about our estimated regression line

• We only have 1 sample in reality!
• This is why we care about the standard error of our line: $se(\hat{\beta_1})$!

Confidence Intervals

Statistical Inference

Sample $\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}$ Population $\color{#e64173}{\xrightarrow{\text{causal indentification}}}$ Unobserved Parameters

• We want to start inferring what the true population regression model is, using our estimated regression model from our sample

$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X \color{#6A5ACD}{\xrightarrow{\text{🤞 hopefully 🤞}}} Y_i=\beta_0+\beta_1X+u_i$

• We can’t yet make causal inferences about whether/how $X$ causes $Y$
• coming after the midterm!

Estimation and Statistical Inference

• Our problem with uncertainty is we don’t know whether our sample estimate is close or far from the unknown population parameter

• But we can use our errors to learn how well our model statistics likely estimate the true parameters

• Use $\hat{\beta_1}$ and its standard error, $se(\hat{\beta_1})$ for statistical inference about true $\beta_1$

• We have two options…

Estimation and Statistical Inference

• Use our $\hat{\beta_1}$ & $se(\hat{\beta_1})$ to determine if statistically significant evidence to reject a hypothesized $\beta_1$
• Use our $\hat{\beta_1}$ & $se(\hat{\beta_1})$ to create a range of values that gives us a good chance of capturing the true $\beta_1$

Generating Confidence Intervals

• We can generate our confidence interval by generating a “bootstrap” sampling distribution:
• Take our sample data and resample it many times by selecting random observations and then replacing them
• This allows us to approximate the sampling distribution of $\hat{\beta_1}$ by simulation!

Confidence Intervals Using the infer Package

Confidence Intervals Using the infer Package I

• The infer package allows you to do statistical inference in a tidy way, following the philosophy of the tidyverse
# install.packages("infer")

library(infer)

Confidence Intervals Using the infer Package II

• infer allows you to run through these steps manually to understand the process:
1. specify() a model

2. generate() a bootstrap distribution

3. calculate() the confidence interval

4. visualize() with a histogram (optional)

Bootstrapping

Another “Sample”

👆 Bootstrapped from Our Sample

• Now we want to do this 1,000 times to simulate the (unknown) sampling distribution of $\hat{\beta_1}$

The infer Pipeline: specify()

Specify

data %>%

specify(y ~ x)

• Take our data and pipe it into the specify() function, which is essentially a lm() function for regression (for our purposes)
ca_school %>%
specify(testscr ~ str)

The infer Pipeline: generate()

Generate

%>% generate(reps = n, type = "bootstrap")

• Now the magic starts, as we run a number of simulated samples

• Set the number of reps and set type to "bootstrap"

ca_school %>%
specify(testscr ~ str) %>%
generate(reps = 1000, #<<
type = "bootstrap") #<<

The infer Pipeline: generate()

Generate

%>% generate(reps = n, type = "bootstrap")

• Now the magic starts, as we run a number of simulated samples

• Set the number of reps and set type to "bootstrap"

• replicate: the “sample” number (1-1000)

• creates x and y values (data points)

The infer Pipeline: calculate()

Calculate

%>% calculate(stat = "slope")

ca_school %>%
specify(testscr ~ str) %>%
generate(reps = 1000,
type = "bootstrap") %>%
calculate(stat = "slope") #<<
• For each of the 1,000 replicates, calculate slope in lm(testscr ~ str)

• Calls it the stat

The infer Pipeline: calculate()

Calculate

%>% calculate(stat = "slope")

The infer Pipeline: calculate()

Calculate

%>% calculate(stat = "slope")

boot <- ca_school %>%
specify(testscr ~ str) %>%
generate(reps = 1000,
type = "bootstrap") %>%
calculate(stat = "slope")
• boot is (our simulated) sampling distribution of $\hat{\beta_1}$!

• We can now use this to estimate the confidence interval from our $\hat{\beta_1}=-2.28$

• And visualize it

Confidence Interval

• A 95% confidence interval is the middle 95% of the sampling distribution
ci <- boot %>%
summarize(lower = quantile(stat, 0.025),
upper = quantile(stat, 0.975))
ci
sampling_dist <- ggplot(data = boot)+
aes(x = stat)+
geom_histogram(color="white", fill = "#e64173")+
labs(x = expression(hat(beta[1])))+
theme_pander(base_family = "Fira Sans Condensed",
base_size=20)

sampling_dist

Confidence Interval

• A 95% confidence interval is the middle 95% of the sampling distribution
ci <- boot %>%
summarize(lower = quantile(stat, 0.025),
upper = quantile(stat, 0.975))
ci
sampling_dist+
geom_vline(data = ci, aes(xintercept = lower), size = 1, linetype="dashed")+ #<<
geom_vline(data = ci, aes(xintercept = upper), size = 1, linetype="dashed") #<<

The infer Pipeline: get_confidence_interval()

Get Confidence Interval

%>% get_confidence_interval()

ca_school %>% #<< # save this
specify(testscr ~ str) %>%
generate(reps = 1000,
type = "bootstrap") %>%
calculate(stat = "slope") %>%
get_confidence_interval(level = 0.95, #<<
type = "se", #<<
point_estimate = -2.28) #<<

Broom Can Estimate a Confidence Interval

school_reg %>%
tidy(conf.int = T)

our_CI <- school_reg %>%
tidy(conf.int = T) %>%
filter(term == "str") %>%
select(conf.low, conf.high)

our_CI

The infer Pipeline: visualize()

Visualize

%>% visualize()

ca_school %>%
specify(testscr ~ str) %>%
generate(reps = 1000,
type = "bootstrap") %>%
calculate(stat = "slope") %>%
visualize() #<<

The infer Pipeline: visualize()

Visualize

%>% visualize()

• If we have our confidence levels saved (our_CI) we can shade_ci() in infer’s visualize() function
ca_school %>%
specify(testscr ~ str) %>%
generate(reps = 1000,
type = "bootstrap") %>%
calculate(stat = "slope") %>%
visualize()+
shade_ci(endpoints = our_CI) 

Confidence Intervals, Theory

Confidence Intervals, Theory

• In general, a confidence interval (CI) takes a point estimate and extrapolates it within some margin of error (MOE):

$\bigg( \big[$ estimate - MOE $\big]$, $\big[$ estimate + MOE $\big] \bigg)$

• The main question is, how confident do we want to be that our interval contains the true parameter?
• Larger confidence level, larger margin of error (and thus larger interval)

Confidence Intervals, Theory

• $\color{#6A5ACD}{(1- \alpha)}$ is the confidence level of our confidence interval
• $\color{#6A5ACD}{\alpha}$ is the “significance level” that we use in hypothesis testing
• $\color{#6A5ACD}{\alpha}=$ probability that the true parameter is not contained within our interval
• Typical levels: 90%, 95%, 99%
• 95% is especially common, $\alpha=0.05$

Confidence Levels

• Depending on our confidence level, we are essentially looking for the middle $(1-\alpha)$% of the sampling distribution

• This puts $\alpha$ in the tails; $\frac{\alpha}{2}$ in each tail

Confidence Levels and the Empirical Rule

• Recall the 68-95-99.7% empirical rule for (standard) normal distributions!1

• 95% of data falls within 2 standard deviations of the mean

• Thus, in 95% of samples, the true parameter is likely to fall within about 2 standard deviations of the sample estimate

Interpretting Confidence Intervals

• So our confidence interval for our slope is (-3.22, -1.33), what does this mean again?

95% of the time, the true effect of class size on test score will be between -3.22 and -1.33

We are 95% confident that a randomly selected school district will have an effect of class size on test score between -3.22 and -1.33

The effect of class size on test score is -2.28 95% of the time.

We are 95% confident that in similarly constructed samples, the true effect is between -3.22 and -1.33

Estimating in R

• base R doesn’t show confidence intervals in the lm summary() output, need the confint command
confint(school_reg)
                2.5 %     97.5 %
(Intercept) 680.32313 717.542779
str          -3.22298  -1.336637

Estimating with broom

• broom’s tidy() command can include confidence intervals
school_reg %>%
tidy(conf.int = TRUE)