2.3 — Simple Linear Regression

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner

Associate Professor of Economics

safner@hood.edu

ryansafner/metricsF22

metricsF22.classes.ryansafner.com

- We looked at single variables for descriptive statistics
- Most uses of statistics in economics and business investigate relationships *between* variables

We will begin with bivariate data for relationships between \(X\) and \(Y\)

Immediate aim is to explore associations between variables, quantified with correlation and linear regression

Later we want to develop more sophisticated tools to argue for causation

**Rows** are individual observations (countries); **columns** are variables on all individuals

```
Rows: 112
Columns: 6
$ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ Country <chr> "Albania", "Algeria", "Angola", "Argentina", "Australia", "A…
$ ISO <chr> "ALB", "DZA", "AGO", "ARG", "AUS", "AUT", "BHR", "BGD", "BEL…
$ ef <dbl> 7.40, 5.15, 5.08, 4.81, 7.93, 7.56, 7.60, 6.35, 7.51, 6.22, …
$ gdp <dbl> 4543.0880, 4784.1943, 4153.1463, 10501.6603, 54688.4459, 476…
$ continent <chr> "Europe", "Africa", "Africa", "Americas", "Oceania", "Europe…
```

```
# requires the tidyverse (for ggplot2) and ggthemes (for theme_pander)
ggplot(data = econfreedom) +
  aes(x = ef,
      y = gdp) +
  geom_point(aes(color = continent),
             size = 2) +
  labs(x = "Economic Freedom Index (2014)",
       y = "GDP per Capita (2014 USD)",
       color = "") +
  scale_y_continuous(labels = scales::dollar) +
  theme_pander(base_family = "Fira Sans Condensed",
               base_size = 20) +
  theme(legend.position = "bottom")
```

- Look for association between independent and dependent variables
  - **Direction**: is the trend positive or negative?
  - **Form**: is the trend linear, quadratic, something else, or no pattern?
  - **Strength**: is the association strong or weak?
  - **Outliers**: do any observations deviate from the trends above?

- For any two variables, we can measure their sample covariance, \(cov(X,Y)\) or \(s_{X,Y}\), to quantify how they vary *together*

\[s_{X,Y}=E\big[(X-\bar{X})(Y-\bar{Y}) \big]\]

- Intuition: if \(x_i\) is above the mean of \(X\), would we expect the associated \(y_i\):
  - to be **above** the mean of \(Y\) also (\(X\) and \(Y\) covary **positively**)
  - to be **below** the mean of \(Y\) (\(X\) and \(Y\) covary **negatively**)

- Covariance is a common measure, but its units are meaningless, so we rarely need to use it; **don't worry about learning the formula**
  - a covariance of 8923, for example: 8923 *what*, exactly?

- Better to *standardize* covariance into a more intuitive concept: correlation, \(r_{X,Y} \in [-1, 1]\)

\[r_{X,Y}=\frac{s_{X,Y}}{s_X s_Y}=\frac{cov(X,Y)}{sd(X)sd(Y)}\]

- Simply divide covariance by the product of the standard deviations of \(X\) and \(Y\)

- Alternatively, take the average of the products of the standardized values (\(Z\)-scores) for each \((x_i,y_i)\) pair:

\[\begin{align*} r&=\frac{1}{n-1}\sum^n_{i=1}\bigg(\frac{x_i-\bar{X}}{s_X}\bigg)\bigg(\frac{y_i-\bar{Y}}{s_Y}\bigg)\\ r&=\frac{1}{n-1}\sum^n_{i=1}Z_XZ_Y\\ \end{align*}\]

- Correlation is standardized to

\[-1 \leq r \leq 1\]

- Negative values \(\implies\) negative association
- Positive values \(\implies\) positive association
- Correlation of 0 \(\implies\) no association
- As \(|r| \rightarrow 1\), the stronger the association
- \(|r|=1 \implies\) a perfectly linear association
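
We can compute both in `R`. A minimal sketch, assuming the `econfreedom` dataframe glimpsed above with columns `ef` and `gdp`:

```
# sample covariance: units are (index points) x (dollars), hard to interpret
cov(econfreedom$ef, econfreedom$gdp)

# correlation: unit-free, always between -1 and 1
cor(econfreedom$ef, econfreedom$gdp)
```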

Your Occasional Reminder: Correlation does not imply causation!

- I’ll show you the difference in a few weeks (when we can actually talk about causation)

If \(X\) and \(Y\) are strongly correlated, \(X\) can still be endogenous!

See today’s appendix page for more on Covariance and Correlation


- If an association appears linear, we can estimate the equation of a line that would “fit” the data

\[Y = a + bX\]

A linear equation describing a line has two parameters:

- \(a\): vertical intercept
- \(b\): slope

How do we choose the equation that **best** fits the data? This process is called **linear regression**

- Linear regression lets us **estimate** the slope of the population regression line between \(X\) and \(Y\) using sample data
- We can make statistical inferences about what the true population slope coefficient is
  - eventually & hopefully: a *causal* inference
- \(\text{slope}=\frac{\Delta Y}{\Delta X}\): for a 1-unit change in \(X\), how many units will this *cause* \(Y\) to change?

Data are student-teacher ratios and average test scores on the Stanford 9 Achievement Test for 5th grade students, for 420 K-6 and K-8 school districts in California in 1999 (Stock and Watson, 2015: p. 141)

```
Rows: 420
Columns: 21
$ observat <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, 63834, 62331…
$ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno", "San…
$ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito Uni…
$ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08"…
$ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, 987…
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.00,…
$ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.9032,…
$ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.6237,…
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 31, …
$ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 606.75, 609.0…
$ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0.12808989, 0…
$ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988, 5580.147, 5…
$ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133, 21.40625, 1…
$ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333, 10.415000,…
$ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.857677, 12.408759…
$ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 608.9…
$ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 616.1…
$ aowijef <dbl> 35.77982, 43.04933, 37.39445, 34.71429, 37.34266, 42.81250, 3…
$ es_pct <dbl> 1.000000, 3.583333, 29.000002, 1.000000, 12.857677, 11.408759…
$ es_frac <dbl> 0.01000000, 0.03583334, 0.29000002, 0.01000000, 0.12857677, 0…
```

- If we *change* \((\Delta)\) the class size by an amount, what would we expect the *change* in test scores to be?

\[\beta = \frac{\text{change in test score}}{\text{change in class size}} = \frac{\Delta \text{test score}}{\Delta \text{class size}}\]

- If we knew \(\beta\), we could say that changing class size by 1 student will change test scores by \(\beta\)

- Rearranging:

\[\Delta \text{test score} = \beta \times \Delta \text{class size}\]


- Suppose \(\beta=-0.6\). If we shrank class size by 2 students, our model predicts:

\[\begin{align*} \Delta \text{test score} &= -2 \times \beta\\ \Delta \text{test score} &= -2 \times -0.6\\ \Delta \text{test score}&= 1.2 \\ \end{align*}\]

Test scores would improve by 1.2 points, *on average*.

\[\text{test score} = \beta_0 + \beta_{1} \times \text{class size}\]

The line relating class size and test scores has the above equation

\(\beta_0\) is the vertical intercept: the test score where class size is 0

\(\beta_{1}\) is the slope of the regression line

This relationship only holds on average for all districts in the population; *individual* districts are also affected by other factors

- To get an equation that holds for *each* district, we need to include other factors

\[\text{test score} = \beta_0 + \beta_1 \text{class size}+\text{other factors}\]

For now, we will ignore these until Unit III

Thus, \(\beta_0 + \beta_1 \text{class size}\) gives the average effect of class sizes on scores

Later, we will want to estimate the marginal effect (causal effect) of each factor on an individual district’s test score, holding all other factors constant

\[Y = \beta_0 + \beta_1 X + u\]

- \(Y\) is the dependent variable of interest
  - AKA "response variable," "regressand," "left-hand side (LHS) variable"

- \(X\) is the independent variable
  - AKA "explanatory variable," "regressor," "right-hand side (RHS) variable," "covariate"

- Our data consist of a spreadsheet of observed values of \((X_i, Y_i)\)

- To model, we "regress \(Y\) on \(X\)"

- \(\beta_0\) and \(\beta_1\) are parameters that describe the population relationships between the variables
  - unknown! to be estimated

- \(u\) is a random error term
  - **"u"nobservable**: we can't measure it, and must model it with assumptions about it

How do we draw a line through the scatterplot? We do not know the **"true"** \(\beta_0\) or \(\beta_1\). We do have data from a sample of class sizes and test scores. So the real question is: how can we estimate \(\beta_0\) and \(\beta_1\)?


- With a simple linear regression model, for each associated \(X\) value, we have:
  - The **observed** (or **actual**) **values** of \(\color{#0047AB}{Y_i}\)
  - The **predicted** (or **fitted**) **values**, \(\color{#047806}{\hat{Y}_i}\)
  - The **residual** (or **error**), \(\color{#D7250E}{\hat{u}_i}=\color{#0047AB}{Y_i}-\color{#047806}{\hat{Y}_i}\): the difference between observed and predicted values

\[\begin{align*} \color{#0047AB}{Y_i} &= \color{#047806}{\hat{Y}_i} + \color{#D7250E}{\hat{u}_i} \\ \color{#0047AB}{\text{Observed}_i} &= \color{#047806}{\text{Model}_i} + \color{#D7250E}{\text{Error}_i} \\ \end{align*}\]

- Take the residuals \(\color{#D7250E}{\hat{u}_i}\) and square them (why? so that positive and negative errors don't cancel out, and larger errors are penalized more)

- The regression line *minimizes* the sum of the squared residuals (SSR):

\[SSR = \sum^n_{i=1} \color{#D7250E}{\hat{u}_i}^2\]

- The Ordinary Least Squares (OLS) estimators of the unknown population parameters \(\beta_0\) and \(\beta_1\), solve the calculus problem:

\[\min_{\beta_0, \beta_1} \sum^n_{i=1}[\underbrace{Y_i-(\underbrace{\beta_0+\beta_1 X_i}_{\hat{Y_i}})}_{\hat{u_i}}]^2\]

- Intuitively, OLS estimators minimize the sum of the squared residuals (distance between the actual values \(Y_i\) and the predicted values \(\hat{Y_i}\)) along the estimated regression line
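
To see the objective concretely, here is a minimal sketch in `R` (the function name `ssr` and the candidate values are illustrative, not part of the original; `ca_school` is the dataframe used in the regression below):

```
# sum of squared residuals for any candidate intercept b0 and slope b1
ssr <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# try an arbitrary candidate line, e.g. b0 = 700, b1 = -2;
# OLS finds the (b0, b1) pair that makes this value as small as possible
ssr(b0 = 700, b1 = -2, x = ca_school$str, y = ca_school$testscr)
```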

- The OLS regression line or sample regression line is the linear function constructed using the OLS estimators:

\[\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i\]

- \(\hat{\beta_0}\) and \(\hat{\beta_1}\) (“beta 0 hat” & “beta 1 hat”) are the OLS estimators of population parameters \(\beta_0\) and \(\beta_1\) using sample data

- The predicted value of Y given X, based on the regression, is \(E(Y_i|X_i)=\hat{Y_i}\)

- The residual or prediction error for the \(i^{th}\) observation is the difference between observed \(Y_i\) and its predicted value, \(\hat{u_i}=Y_i-\hat{Y_i}\)

- The solution to the SSR minimization problem yields:

\[\hat{\beta}_0=\bar{Y}-\hat{\beta}_1\bar{X}\]

\[\hat{\beta}_1=\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2}=\frac{s_{XY}}{s^2_X}= \frac{cov(X,Y)}{var(X)}\]

- The regression line goes through the “center of mass” point \((\bar{X},\bar{Y})\)

- Again, \(\hat{\beta}_0= \bar{Y}-\hat{\beta}_1 \bar{X}\)

- The slope, \(\hat{\beta}_1\), has the same sign as the correlation coefficient \(r_{X,Y}\), and the two are related by:

\[\hat{\beta}_1=r\frac{s_Y}{s_X}\]

- The residuals sum and average to zero

\[\begin{align*} \sum^n_{i=1} \hat{u}_i &= 0\\ \mathbb{E}[\hat{u}] &= 0 \\ \end{align*}\]

- The residuals and \(X\) are uncorrelated

- There is some true (unknown) population relationship:

\[\text{test score}_i=\beta_0+\beta_1 str_i\]

- \(\beta_1=\frac{\Delta \text{test score}}{\Delta \text{str}}= ??\)

- Format for regression is `lm(y ~ x, data = df)`
  - `y` is the dependent variable (listed first!)
  - `~` means "is modeled by" or "is explained by"
  - `x` is the independent variable
  - `df` is the name of the dataframe where the data are stored
- This is `base R` (there's no good `tidyverse` way to do this yet…ish)
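
For our data, a sketch of the call (the object name `school_reg` is chosen here for illustration; `ca_school` is the dataframe named in the output below):

```
# regress test scores on student-teacher ratios, saving the lm object
school_reg <- lm(testscr ~ str, data = ca_school)

# base R's summary() prints the output shown below
summary(school_reg)
```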

```
Call:
lm(formula = testscr ~ str, data = ca_school)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
str -2.2798 0.4798 -4.751 2.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
```

Looking at the `summary`, there's a lot of information here! These objects are cumbersome and come from a much older, pre-`tidyverse` era of `base R`.

Luckily, we now have some more `tidy` ways of working with regression *output*!

`broom`

The `broom` package allows us to work with regression objects as tidier `tibbles`. Several useful commands:

| Command | Does |
|---|---|
| `tidy()` | Create tibble of regression coefficients & stats |
| `glance()` | Create tibble of regression fit statistics |
| `augment()` | Create tibble of data with regression-based variables |

`broom`: `tidy()`

- The `tidy()` function creates a *tidy* `tibble` of regression output
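
A sketch, assuming the saved `school_reg` object from above:

```
library(broom)

tidy(school_reg)
```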

```
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 699. 9.47 73.8 6.57e-242
2 str -2.28 0.480 -4.75 2.78e- 6
```

`broom`: `tidy()`

- The `tidy()` function creates a *tidy* `tibble` of regression output… **with confidence intervals**
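
A sketch, using `tidy()`'s `conf.int` argument:

```
tidy(school_reg, conf.int = TRUE)
```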

```
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 699. 9.47 73.8 6.57e-242 680. 718.
2 str -2.28 0.480 -4.75 2.78e- 6 -3.22 -1.34
```

`broom`: `glance()`

- `glance()` shows us a lot of overall regression statistics and diagnostics
- We'll interpret these in next class and beyond
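
A sketch, again with the saved `school_reg` object:

```
glance(school_reg)
```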

```
# A tibble: 1 × 12
r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.0512 0.0490 18.6 22.6 2.78e-6 1 -1822. 3650. 3663. 144315. 418
# … with 1 more variable: nobs <int>, and abbreviated variable names
# ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
```

`broom`: `augment()`

- `augment()` creates a new tibble with the data \((X,Y)\) and regression-based variables, including:
  - `.fitted`: fitted (predicted) values from the model, i.e. \(\hat{Y}_i\)
  - `.resid`: residuals (errors) from the model, i.e. \(\hat{u}_i\)
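
A sketch:

```
augment(school_reg)
```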

```
# A tibble: 420 × 8
testscr str .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 691. 17.9 658. 32.7 0.00442 18.5 0.00689 1.76
2 661. 21.5 650. 11.3 0.00475 18.6 0.000893 0.612
3 644. 18.7 656. -12.7 0.00297 18.6 0.000700 -0.685
4 648. 17.4 659. -11.7 0.00586 18.6 0.00117 -0.629
5 641. 18.7 656. -15.5 0.00301 18.6 0.00105 -0.836
6 606. 21.4 650. -44.6 0.00446 18.5 0.0130 -2.40
7 607. 19.5 654. -47.7 0.00239 18.5 0.00794 -2.57
8 609 20.9 651. -42.3 0.00343 18.5 0.00895 -2.28
9 612. 19.9 653. -41.0 0.00244 18.5 0.00597 -2.21
10 613. 20.8 652. -38.9 0.00329 18.5 0.00723 -2.09
# … with 410 more rows
```

- Using OLS, we find:

\[\widehat{\text{test score}_i}=698.93-2.28 \, str_i\]

- \(\hat{\beta_0} = 698.93\): the predicted test score for \(str=0\)

- \(\hat{\beta_1} = -2.28\): for every 1 unit change in \(str\), \(\widehat{\text{test score}}\) changes by \(-2.28\) points

\[\text{test score}_i = 698.93 - 2.28 \, str_i + \hat{u}_i\]

`.resid = testscr - .fitted`

\[\hat{u}_i = \text{test score}_i - \widehat{\text{test score}}_i\]

\[\hat{u}_i = \text{test score}_i - (698.93-2.28 \, str_i)\]
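
One way to produce the table below (a sketch, assuming `broom` and the `tidyverse` are loaded):

```
# keep just the observed values, fitted values, and residuals
augment(school_reg) %>%
  select(testscr, str, .fitted, .resid)
```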

```
# A tibble: 420 × 4
testscr str .fitted .resid
<dbl> <dbl> <dbl> <dbl>
1 691. 17.9 658. 32.7
2 661. 21.5 650. 11.3
3 644. 18.7 656. -12.7
4 648. 17.4 659. -11.7
5 641. 18.7 656. -15.5
6 606. 21.4 650. -44.6
7 607. 19.5 654. -47.7
8 609 20.9 651. -42.3
9 612. 19.9 653. -41.0
10 613. 20.8 652. -38.9
# … with 410 more rows
```

`testscr = .fitted + .resid`

- One district in our sample is Richmond Elementary (\(str\) of about 22, observed test score of about 672)

- Its `.fitted` value:

\[\widehat{\text{Test Score}}_{\text{Richmond}}=698-2.28(22) \approx 648\]

- Its `.resid` value:

\[\hat{u}_{\text{Richmond}}=672-648 \approx 24\]

- We can use the regression model to make a prediction for a particular \(x_i\)

\[\begin{align*} \widehat{\text{test score}_i} &= \hat{\beta_0}+\hat{\beta_1} \, \text{str}_i \\ &= 698.93 - 2.28 (18)\\ &= 657.89\\ \end{align*}\]

- We can do this in `R` with the `predict()` function, which requires (at least) two inputs:
  - an `lm` object (a saved regression)
  - `newdata` with the \(X\) value(s) to predict \(\hat{Y}\) for, as a `data.frame` (or `tibble`), like the one below:

```
# A tibble: 1 × 1
str
<dbl>
1 18
```
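
A sketch, assuming the saved `school_reg` object from above:

```
# build the newdata tibble shown above, then predict
new_data <- tibble(str = 18)

predict(school_reg, newdata = new_data) # approximately 657.89, as calculated above
```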

In `R`, manually

- Of course, we could also compute the OLS slope ourselves…
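
A minimal sketch, using the formula \(\hat{\beta}_1 = \frac{s_{XY}}{s^2_X}\) from above:

```
# the OLS slope by hand: sample covariance over sample variance of X
cov(ca_school$str, ca_school$testscr) / var(ca_school$str)
```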

```
[1] -2.279808
```