2.6 — Inference for Regression
ECON 480 • Econometrics • Fall 2022
Dr. Ryan Safner
Associate Professor of Economics
safner@hood.edu
ryansafner/metricsF22
metricsF22.classes.ryansafner.com


\[Y_i = \beta_0+\beta_1 X_i+u_i\]
OLS estimators (\(\hat{\beta_0}\) and \(\hat{\beta_1}\)) are computed from a finite (specific) sample of data
Our OLS model contains 2 sources of randomness:
Sample \(\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}\) Population \(\color{#e64173}{\xrightarrow{\text{causal identification}}}\) Unobserved Parameters

Population relationship
\(Y_i = 698.93 - 2.28 X_i + u_i\)
\(Y_i = \beta_0 + \beta_1 X_i + u_i\)

Sample 1: 50 random observations
Population relationship
\(Y_i = 698.93 - 2.28 X_i + u_i\)
Sample relationship
\(\hat{Y}_i = 708.12 - 2.54 X_i\)

Sample 2: 50 random individuals
Population relationship
\(Y_i = 698.93 - 2.28 X_i + u_i\)
Sample relationship
\(\hat{Y}_i = 708.12 - 2.54 X_i\)

Sample 3: 50 random individuals
Population relationship
\(Y_i = 698.93 - 2.28 X_i + u_i\)
Sample relationship
\(\hat{Y}_i = 708.12 - 2.54 X_i\)
Let’s repeat this process 10,000 times!
This exercise is called a (Monte Carlo) simulation
We can do this in R with the infer package
On average, our estimates hit the true parameter: \[\mathbb{E}[\hat{\beta}_1] = \beta_1\]
But any individual estimate can miss the mark
This leads to uncertainty about our estimated regression line
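As a rough illustration, here is a minimal sketch of that simulation in R, assuming the population relationship above (\(\beta_0 = 698.93\), \(\beta_1 = -2.28\)); the distribution of x and the error variance are made up for illustration:

```r
# "true" population parameters, taken from the population relationship above
beta_0 <- 698.93
beta_1 <- -2.28

# draw one random sample of n observations and estimate the slope by OLS
one_sample <- function(n = 50) {
  x <- runif(n, 14, 26)                        # hypothetical class sizes
  y <- beta_0 + beta_1 * x + rnorm(n, 0, 20)   # hypothetical error term u_i
  coef(lm(y ~ x))[["x"]]                       # return the estimated slope
}

# repeat 10,000 times: the sampling distribution of beta_1-hat
slope_estimates <- replicate(10000, one_sample())

mean(slope_estimates)   # close to -2.28 on average (unbiasedness)
sd(slope_estimates)     # spread of estimates across samples
```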

Sample \(\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}\) Population \(\color{#e64173}{\xrightarrow{\text{causal identification}}}\) Unobserved Parameters
\[\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i \color{#6A5ACD}{\xrightarrow{\text{🤞 hopefully 🤞}}} Y_i=\beta_0+\beta_1X_i+u_i\]
The problem with uncertainty is that we don’t know whether our sample estimate is close to or far from the unknown population parameter
But we can use our errors to learn how well our model statistics likely estimate the true parameters
Use \(\hat{\beta_1}\) and its standard error, \(se(\hat{\beta_1})\), for statistical inference about the true \(\beta_1\)
We have two options…
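One option uses the standard error directly; here is a minimal sketch of a 95% interval built from the slope estimate and standard error reported in the regression table later in this section (−2.28 and 0.48). Note that the exact confint() output shown later uses a t critical value rather than 1.96:

```r
beta_1_hat <- -2.279808   # slope estimate (from the regression table below)
se_beta_1  <- 0.4798256   # its standard error

# approximate 95% confidence interval: estimate +/- critical value * SE
beta_1_hat + c(-1, 1) * qnorm(0.975) * se_beta_1
# roughly (-3.22, -1.34), close to the confint() output shown later
```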





The infer Package
infer allows you to run through these steps manually to understand the process (a full pipeline is sketched after this list):
specify() a model
generate() a bootstrap distribution
calculate() the confidence interval
visualize() with a histogram (optional)
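Putting those steps together, here is a minimal sketch of the pipeline; school_data, testscr, and str are hypothetical names for a data frame of test scores and student-teacher ratios:

```r
library(tidyverse)
library(infer)

# bootstrap the sampling distribution of the OLS slope
boot_slopes <- school_data %>%
  specify(testscr ~ str) %>%                     # specify() the model y ~ x
  generate(reps = 1000, type = "bootstrap") %>%  # generate() 1,000 bootstrap resamples
  calculate(stat = "slope")                      # calculate() the slope in each replicate

# get the 95% confidence interval from the middle 95% of bootstrapped slopes
ci <- boot_slopes %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```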
infer Package III

Our original sample:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 671.5164920 | 8.9597708 | 74.947954 | 1.853313e-244 |
| str | -0.9595986 | 0.4521103 | -2.122488 | 3.438415e-02 |

👆 Bootstrapped from Our Sample
infer Pipeline: specify()

data %>%
  specify(y ~ x)

The specify() function is essentially the lm() function for regression (for our purposes)

infer Pipeline: generate()

%>% generate(reps = n, type = "bootstrap")
Now the magic starts, as we run a number of simulated samples
Set the number of reps and set type to "bootstrap"
replicate: the “sample” number (1-1000)
creates resampled x and y values (data points) in each replicate; see the sketch below
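A sketch of what generate() hands back, continuing with the hypothetical school_data example: a long tibble where the replicate column indexes each simulated sample.

```r
boot_samples <- school_data %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap")

# each replicate (1, 2, ..., 1000) holds n rows resampled with replacement
# from our original sample
boot_samples %>%
  count(replicate)
```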
infer Pipeline: calculate()

%>% calculate(stat = "slope")
infer Pipeline: get_confidence_interval()

%>% get_confidence_interval()
infer Pipeline: visualize()

%>% visualize()
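Continuing the earlier sketch (boot_slopes and ci as defined above), visualize() returns a ggplot histogram of the bootstrap distribution, and shade_confidence_interval() marks the interval:

```r
boot_slopes %>%
  visualize() +                               # histogram of the 1,000 bootstrapped slopes
  shade_confidence_interval(endpoints = ci)   # shade the 95% confidence interval
```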
\(\big(\text{estimate} - \text{MOE}, \quad \text{estimate} + \text{MOE}\big)\), where MOE is the margin of error


Depending on our confidence level, we are essentially looking for the middle \((1-\alpha)\)% of the sampling distribution
This puts \(\alpha\) in the tails; \(\frac{\alpha}{2}\) in each tail
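As a quick sketch of where the critical values come from (under a normal approximation), put \(\frac{\alpha}{2}\) in each tail and ask R for the corresponding quantile:

```r
alpha <- 0.05            # 95% confidence level
qnorm(1 - alpha / 2)     # ~1.96: the middle 95% lies within ~1.96 SDs of the mean

# common confidence levels: 90%, 95%, 99%
qnorm(1 - c(0.10, 0.05, 0.01) / 2)   # ~1.64, ~1.96, ~2.58
```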

Recall the 68-95-99.7% empirical rule for (standard) normal distributions!
95% of data falls within 2 standard deviations of the mean
Thus, in 95% of samples, the true parameter is likely to fall within about 2 standard deviations of the sample estimate

❌ 95% of the time, the true effect of class size on test score will be between -3.22 and -1.33
❌ We are 95% confident that a randomly selected school district will have an effect of class size on test score between -3.22 and -1.33
❌ The effect of class size on test score is -2.28, 95% of the time
✅ We are 95% confident that in similarly constructed samples, the true effect is between -3.22 and -1.33
Base R doesn’t show confidence intervals in the lm summary() output; we need the confint() command:

                 2.5 %     97.5 %
(Intercept) 680.32313 717.542779
str          -3.22298  -1.336637
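A sketch of the call that produces output like this, assuming the fitted model is stored in a (hypothetical) object called school_reg:

```r
# school_reg <- lm(testscr ~ str, data = school_data)  # hypothetical names
confint(school_reg, level = 0.95)   # level = 0.95 is the default
```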
broom’s tidy() command can include confidence intervals:

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 | 680.32313 | 717.542779 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 | -3.22298 | -1.336637 |
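A sketch of the corresponding broom call, again with the hypothetical school_reg object:

```r
library(broom)

# conf.int = TRUE adds conf.low and conf.high columns (95% level by default)
tidy(school_reg, conf.int = TRUE, conf.level = 0.95)
```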

