2.3 — Simple Linear Regression
ECON 480 • Econometrics • Fall 2022
Dr. Ryan Safner
Associate Professor of Economics
safner@hood.edu
ryansafner/metricsF22
metricsF22.classes.ryansafner.com
We will begin with bivariate data for relationships between \(X\) and \(Y\)
Immediate aim is to explore associations between variables, quantified with correlation and linear regression
Later we want to develop more sophisticated tools to argue for causation
Rows: 112
Columns: 6
$ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ Country <chr> "Albania", "Algeria", "Angola", "Argentina", "Australia", "A…
$ ISO <chr> "ALB", "DZA", "AGO", "ARG", "AUS", "AUT", "BHR", "BGD", "BEL…
$ ef <dbl> 7.40, 5.15, 5.08, 4.81, 7.93, 7.56, 7.60, 6.35, 7.51, 6.22, …
$ gdp <dbl> 4543.0880, 4784.1943, 4153.1463, 10501.6603, 54688.4459, 476…
$ continent <chr> "Europe", "Africa", "Africa", "Americas", "Oceania", "Europe…
library(ggplot2)
library(ggthemes)  # for theme_pander()

# scatterplot of GDP per capita against economic freedom, colored by continent
ggplot(data = econfreedom)+
  aes(x = ef,
      y = gdp)+
  geom_point(aes(color = continent),
             size = 2)+
  labs(x = "Economic Freedom Index (2014)",
       y = "GDP per Capita (2014 USD)",
       color = "")+
  scale_y_continuous(labels = scales::dollar)+
  theme_pander(base_family = "Fira Sans Condensed",
               base_size = 20)+
  theme(legend.position = "bottom")
Direction: is the trend positive or negative?
Form: is the trend linear, quadratic, something else, or no pattern?
Strength: is the association strong or weak?
Outliers: do any observations deviate from the trends above?
\[s_{X,Y}=E\big[(X-\bar{X})(Y-\bar{Y}) \big]\]
The covariance here comes out to about 8,923... but 8,923 what, exactly? Covariance is measured in the units of \(X\) times the units of \(Y\), so its magnitude is hard to interpret on its own.
\[r_{X,Y}=\frac{s_{X,Y}}{s_X s_Y}=\frac{cov(X,Y)}{sd(X)sd(Y)}\]
\[\begin{align*} r&=\frac{1}{n-1}\sum^n_{i=1}\bigg(\frac{x_i-\bar{X}}{s_X}\bigg)\bigg(\frac{y_i-\bar{Y}}{s_Y}\bigg)\\ r&=\frac{1}{n-1}\sum^n_{i=1}Z_XZ_Y\\ \end{align*}\]
\[-1 \leq r \leq 1\]
Negative values \(\implies\) negative association
Positive values \(\implies\) positive association
Correlation of 0 \(\implies\) no association
As \(|r| \rightarrow 1 \implies\) the stronger the association
Correlation of \(|r|=1 \implies\) perfectly linear
Your Occasional Reminder: Correlation does not imply causation!
If \(X\) and \(Y\) are strongly correlated, \(X\) can still be endogenous!
See today’s appendix page for more on Covariance and Correlation
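A minimal sketch of computing these in R, using the econfreedom data plotted above (cov() and cor() are base R functions):

```r
# covariance: hard to interpret because of its units
cov(econfreedom$ef, econfreedom$gdp)

# correlation: unit-free, always between -1 and 1
cor(econfreedom$ef, econfreedom$gdp)
```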
\[Y = a + bX\]
A linear equation describing a line has two parameters: a vertical intercept \(a\) and a slope \(b\)
How do we choose the equation that best fits the data?
This process is called linear regression
Linear regression lets us estimate the slope of the population regression line between \(X\) and \(Y\) using sample data
We can make statistical inferences about what the true population slope coefficient is
\(\text{slope}=\frac{\Delta Y}{\Delta X}\): for a 1-unit change in \(X\), by how many units does \(Y\) change?
Data are student-teacher ratios and average test scores on the Stanford 9 Achievement Test for 5th grade students in 420 K-6 and K-8 school districts in California in 1999 (Stock and Watson, 2015: p. 141)
Rows: 420
Columns: 21
$ observat <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, 63834, 62331…
$ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno", "San…
$ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito Uni…
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08"…
$ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, 987…
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.00,…
$ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.9032,…
$ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.6237,…
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 31, …
$ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 606.75, 609.0…
$ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0.12808989, 0…
$ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988, 5580.147, 5…
$ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133, 21.40625, 1…
$ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333, 10.415000,…
$ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.857677, 12.408759…
$ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 608.9…
$ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 616.1…
$ aowijef <dbl> 35.77982, 43.04933, 37.39445, 34.71429, 37.34266, 42.81250, 3…
$ es_pct <dbl> 1.000000, 3.583333, 29.000002, 1.000000, 12.857677, 11.408759…
$ es_frac <dbl> 0.01000000, 0.03583334, 0.29000002, 0.01000000, 0.12857677, 0…
\[\beta = \frac{\text{change in test score}}{\text{change in class size}} = \frac{\Delta \text{test score}}{\Delta \text{class size}}\]
\[\Delta \text{test score} = \beta \times \Delta \text{class size}\]
\[\begin{align*} \Delta \text{test score} &= -2 \times \beta\\ \Delta \text{test score} &= -2 \times -0.6\\ \Delta \text{test score}&= 1.2 \\ \end{align*}\]
Test scores would improve by 1.2 points, on average.
\[\text{test score} = \beta_0 + \beta_{1} \times \text{class size}\]
The line relating class size and test scores has the above equation
\(\beta_0\) is the vertical intercept: the test score when class size is 0
\(\beta_{1}\) is the slope of the regression line
This relationship only holds on average for all districts in the population; individual districts are also affected by other factors
\[\text{test score} = \beta_0 + \beta_1 \text{class size}+\text{other factors}\]
For now, we will ignore these until Unit III
Thus, \(\beta_0 + \beta_1 \text{class size}\) gives the average effect of class sizes on scores
Later, we will want to estimate the marginal effect (causal effect) of each factor on an individual district’s test score, holding all other factors constant
\[Y = \beta_0 + \beta_1 X + u\]
How do we draw a line through the scatterplot? We do not know the “true” \(\beta_0\) or \(\beta_1\)
We do have data from a sample of class sizes and test scores
So the real question is, how can we estimate \(\beta_0\) and \(\beta_1\)?
\[\begin{align*} \color{#0047AB}{Y_i} &= \color{#047806}{\hat{Y}_i} + \color{#D7250E}{\hat{u}_i} \\ \color{#0047AB}{\text{Observed}_i} &= \color{#047806}{\text{Model}_i} + \color{#D7250E}{\text{Error}_i} \\ \end{align*}\]
Take the residuals \(\color{#D7250E}{\hat{u}_i}\) and square them (why? so that positive and negative errors do not cancel out, and larger errors are penalized more)
The regression line minimizes the sum of the squared residuals (SSR)
\[SSR = \sum^n_{i=1} \color{#D7250E}{\hat{u}_i}^2\]
\[\min_{\beta_0, \beta_1} \sum^n_{i=1}[\underbrace{Y_i-(\underbrace{\beta_0+\beta_1 X_i}_{\hat{Y_i}})}_{\hat{u_i}}]^2\]
\[\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i\]
\[\hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{X}\]
\[\hat{\beta}_1=\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2}=\frac{s_{XY}}{s^2_X}= \frac{cov(X,Y)}{var(X)}\]
\[\hat{\beta}_1=r\frac{s_Y}{s_X}\]
\[\begin{align*} \sum^n_{i=1} \hat{u}_i &= 0\\ \mathbb{E}[\hat{u}] &= 0 \\ \end{align*}\]
\[\text{test score}_i=\beta_0+\beta_1 str_i\]
Format for regression is lm(y ~ x, data = df)
y is the dependent variable (listed first!)
~ means “is modeled by” or “is explained by”
x is the independent variable
df is the name of the dataframe where the data is stored
This is base R (there’s no good tidyverse way to do this yet…ish)
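As a minimal sketch, the regression below is run and saved as an object (the data frame name ca_school matches the output that follows; the object name school_reg is just an illustrative choice):

```r
# regress average test scores on the student-teacher ratio
school_reg <- lm(testscr ~ str, data = ca_school)

# base R summary of the saved regression object
summary(school_reg)
```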
Call:
lm(formula = testscr ~ str, data = ca_school)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
str -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
Looking at the summary(), there’s a lot of information here!
These objects are cumbersome and come from a much older, pre-tidyverse era of base R
Luckily, we now have some more tidy ways of working with regression output!
broom
The broom package allows us to work with regression objects as tidier tibbles
Several useful commands:
| Command | Does |
|---|---|
| tidy() | Create tibble of regression coefficients & stats |
| glance() | Create tibble of regression fit statistics |
| augment() | Create tibble of data with regression-based variables |
broom: tidy()
The tidy() function creates a tidy tibble of regression output:
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 699. 9.47 73.8 6.57e-242
2 str -2.28 0.480 -4.75 2.78e- 6
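A sketch of the call that produces a tibble like the one above (assuming the saved school_reg object from before):

```r
library(broom)

# coefficients, standard errors, t-statistics, and p-values as a tibble
tidy(school_reg)
```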
broom: tidy()
The tidy() function creates a tidy tibble of regression output…with confidence intervals:
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 699. 9.47 73.8 6.57e-242 680. 718.
2 str -2.28 0.480 -4.75 2.78e- 6 -3.22 -1.34
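The intervals come from the conf.int argument, sketched here with the same assumed school_reg object:

```r
# add 95% confidence intervals for each coefficient
tidy(school_reg, conf.int = TRUE)
```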
broom: glance()
glance() shows us a lot of overall regression statistics and diagnostics:
# A tibble: 1 × 12
r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.0512 0.0490 18.6 22.6 2.78e-6 1 -1822. 3650. 3663. 144315. 418
# … with 1 more variable: nobs <int>, and abbreviated variable names
# ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
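A sketch of the corresponding call (same assumed school_reg object):

```r
# model-level statistics: R-squared, sigma, F-statistic, AIC, BIC, etc.
glance(school_reg)
```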
broom: augment()
augment() creates a new tibble with the data \((X,Y)\) and regression-based variables, including:
.fitted: fitted (predicted) values from the model, i.e. \(\hat{Y}_i\)
.resid: residuals (errors) from the model, i.e. \(\hat{u}_i\)
# A tibble: 420 × 8
testscr str .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 691. 17.9 658. 32.7 0.00442 18.5 0.00689 1.76
2 661. 21.5 650. 11.3 0.00475 18.6 0.000893 0.612
3 644. 18.7 656. -12.7 0.00297 18.6 0.000700 -0.685
4 648. 17.4 659. -11.7 0.00586 18.6 0.00117 -0.629
5 641. 18.7 656. -15.5 0.00301 18.6 0.00105 -0.836
6 606. 21.4 650. -44.6 0.00446 18.5 0.0130 -2.40
7 607. 19.5 654. -47.7 0.00239 18.5 0.00794 -2.57
8 609 20.9 651. -42.3 0.00343 18.5 0.00895 -2.28
9 612. 19.9 653. -41.0 0.00244 18.5 0.00597 -2.21
10 613. 20.8 652. -38.9 0.00329 18.5 0.00723 -2.09
# … with 410 more rows
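A sketch of the call that builds this tibble (again assuming school_reg):

```r
# original data plus fitted values, residuals, and other observation-level diagnostics
augment(school_reg)
```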
\[\widehat{\text{test score}_i}=698.93-2.28 \, str_i\]
\[\text{test score}_i = 698.93 - 2.28 \, str_i + \hat{u}_i\]
.resid = testscr - .fitted
\[\hat{u}_i = \text{test score}_i - \widehat{\text{test score}}_i\]
\[\hat{u}_i = \text{test score}_i - (698.93-2.28 \, str_i)\]
# A tibble: 420 × 4
testscr str .fitted .resid
<dbl> <dbl> <dbl> <dbl>
1 691. 17.9 658. 32.7
2 661. 21.5 650. 11.3
3 644. 18.7 656. -12.7
4 648. 17.4 659. -11.7
5 641. 18.7 656. -15.5
6 606. 21.4 650. -44.6
7 607. 19.5 654. -47.7
8 609 20.9 651. -42.3
9 612. 19.9 653. -41.0
10 613. 20.8 652. -38.9
# … with 410 more rows
testscr = .fitted + .resid
For example, a district (Richmond) with a class size of 22 and an actual test score of 672 has:
.fitted value: \[\widehat{\text{Test Score}}_{\text{Richmond}}=698-2.28(22) \approx 648\]
.resid value: \[\hat{u}_{\text{Richmond}}=672-648 \approx 24\]
\[\begin{align*} \widehat{\text{test score}_i} &= \hat{\beta_0}+\hat{\beta_1} \, \text{str}_i \\ &= 698.93 - 2.28 (18)\\ &= 657.89\\ \end{align*}\]
Predicting in R
We can predict values in R with the predict() function, which requires (at least) two inputs:
an lm object (saved regression)
newdata with \(X\) value(s) to predict \(\hat{Y}\) for, as a data.frame (or tibble)
# A tibble: 1 × 1
str
<dbl>
1 18
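A minimal sketch of that workflow (school_reg is the assumed saved regression; the one-row tibble holds the class size we want a prediction for):

```r
library(tibble)

# X value(s) we want predicted test scores for
new_data <- tibble(str = 18)

# predicted test score for a district with a student-teacher ratio of 18
predict(school_reg, newdata = new_data)
```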
In R, Manually
[1] -2.279808
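That value matches the str coefficient from lm(). A sketch of computing \(\hat{\beta}_1\) by hand from the formula \(\hat{\beta}_1 = \frac{cov(X,Y)}{var(X)}\), assuming the ca_school data frame from earlier:

```r
# OLS slope estimate computed manually as cov(X, Y) / var(X)
cov(ca_school$str, ca_school$testscr) / var(ca_school$str)
```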