Midterm Concepts
OLS Regression
Bivariate data and associations between variables (e.g. \(X\) and \(Y\))
Apparent relationships are best seen in a scatterplot
Check whether the association is positive/negative, weak/strong, linear/nonlinear, etc.
\(Y\): dependent variable
\(X\): independent variable
Correlation coefficient (\(r\)) can quantify the strength of an association
\[ r_{X,Y}=\frac{1}{n-1} \sum_{i=1}^n \bigg(\frac{X_i-\bar{X}}{s_X}\bigg) \bigg(\frac{Y_i-\bar{Y}}{s_Y}\bigg) = \frac{\displaystyle \sum_{i=1}^n Z_{X_i} Z_{Y_i}}{n-1} \]
\(-1 \leq r \leq 1\) and \(r\) only measures linear associations
\(|r|\) closer to 1 implies a stronger correlation (points near a perfect straight line)
Correlation does not imply causation! There might be confounding or lurking variables (e.g. \(Z\)) affecting \(X\) and/or \(Y\)
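As a minimal sketch of the formula above, \(r\) can be computed directly from \(z\)-scores in Python; the data here are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical sample data, just for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(X)
# Sample standard deviations (ddof=1 gives the n-1 denominator)
z_x = (X - X.mean()) / X.std(ddof=1)
z_y = (Y - Y.mean()) / Y.std(ddof=1)

r = (z_x * z_y).sum() / (n - 1)   # matches np.corrcoef(X, Y)[0, 1]
print(r)
```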
Population regression model
\[ Y_i=\beta_0+\beta_1X_i+u_i \]
\(\beta_1\): \(\frac{\Delta Y}{\Delta X}\), the slope between \(X\) and \(Y\): the change in \(Y\) from a 1-unit change in \(X\)
\(\beta_0\) is the \(Y\)-intercept: literally, value of \(Y\) when \(X=0\)
\(u_i\) is the error term: the difference between the actual value of \(Y\) given \(X\) and the predicted value \(\hat{Y}\)
Ordinary Least Squares (OLS) regression model
\[ Y_i = \hat{\beta}_0+\hat{\beta}_1 X_i + \hat{u}_i \]
\[ \hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i \]
\[ \hat{u}_i = Y_i-\hat{Y}_i \]
OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) estimate the population regression line from sample data
Minimize the sum of squared residuals (SSR): \(\min \displaystyle \sum_{i=1}^n \hat{u}_i^2\)
OLS regression line
\[ \hat{\beta_1} = \frac{cov(X,Y)}{var(X)}=\frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sum (X_i-\bar{X})^2}=r_{X,Y}\frac{s_Y}{s_X} \]
\[ \hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X} \]
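A hand-rolled sketch of these estimator formulas on the same hypothetical data (not a library fit, just the algebra above):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# beta1_hat = cov(X, Y) / var(X); beta0_hat = Ybar - beta1_hat * Xbar
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

Y_hat = beta0_hat + beta1_hat * X    # fitted values
u_hat = Y - Y_hat                    # residuals
```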
Measures of Fit
\(R^2\): fraction of total variation in \(Y\) explained by variation in \(X\) according to the model
\[ \begin{align*} R^2 & = \frac{SSM}{SST} \\ R^2 & = 1 - \frac{SSR}{SST} \\ R^2 & = r_{X,Y}^2 \\ \end{align*} \]
Where
\(SSM = \sum (\hat{Y}_i - \bar{Y})^2\)
\(SST = \sum(Y_i - \bar{Y})^2\)
\(SSR = \sum \hat{u}_i^2\)
Standard error of the regression (or of the residuals), SER: the average size of \(\hat{u}_i\), i.e. the average distance between the data points and the regression line
\[ SER \, (\sigma_{\hat{u}}) = \sqrt{\frac{SSR}{n-2}} = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}} \]
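A sketch computing the measures of fit on the same hypothetical data, refitting the line first so the block runs on its own:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(Y)

# Fit OLS as in the previous sketch
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X
u_hat = Y - Y_hat

SSM = np.sum((Y_hat - Y.mean())**2)   # explained ("model") sum of squares
SST = np.sum((Y - Y.mean())**2)       # total sum of squares
SSR = np.sum(u_hat**2)                # sum of squared residuals

R2 = SSM / SST                 # equivalently 1 - SSR/SST, or r**2
SER = np.sqrt(SSR / (n - 2))   # standard error of the regression
```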
Sampling Distribution of \(\hat{\beta}_1\)
\(\hat{\beta_1}\) is a random variable, so it has its own sampling distribution with mean \(\mathbb{E}[\hat{\beta_1}]\) and standard error \(se[\hat{\beta_1}]\)
Mean of OLS estimator \(\hat{\beta_1}\) & Bias: Endogeneity & Exogeneity
\(X\) is exogenous if it is not correlated with the error term
\[ \begin{align*} cor(X,u) &=0 \\ \mathbb{E}[u|X] &=0 \\ \end{align*} \]
equivalently, knowing \(X\) should tell us nothing about \(u\) (zero conditional mean assumption)
if \(X\) is exogenous, OLS estimate of \(\beta_1\) is unbiased
\[ E[\hat{\beta}_1]=\beta_1 \]
\(X\) is endogenous if it is correlated with the error term
\[ cor(X,u) \neq 0 \]
If \(X\) is endogenous, OLS estimate of \(\beta_1\) is biased:
\[ \mathbb{E}[\hat{\beta}_1] = \beta_1 + \underbrace{cor(X,u)\frac{\sigma_u}{\sigma_X}}_{bias} \]
The sign and size of \(cor(X,u)\) give the direction (+ or -) and strength of the bias
Note that if \(cor(X,u)=0\), the bias term vanishes and \(\mathbb{E}[\hat{\beta}_1]=\beta_1\): the estimator is unbiased
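A small simulation (with assumed parameter values) illustrates the bias: here \(X\) and \(u\) share a common component \(z\), so \(cor(X,u)>0\) and the slope estimate is pushed upward on average:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 1.0, 2.0   # assumed "true" population parameters
n, reps = 200, 2000

estimates = np.empty(reps)
for i in range(reps):
    z = rng.normal(size=n)
    u = 0.8 * z + rng.normal(size=n)   # error shares a component with X ...
    X = z + rng.normal(size=n)         # ... so cor(X, u) > 0: X is endogenous
    Y = beta0 + beta1 * X + u
    estimates[i] = (np.sum((X - X.mean()) * (Y - Y.mean()))
                    / np.sum((X - X.mean())**2))

# Mean estimate exceeds beta1: positive cor(X, u) gives upward bias
print(estimates.mean())
```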
Assumptions about \(u\)
The mean of the errors is 0
\[ \mathbb{E}[u_i] = 0 \]
The variance of the errors is constant over all values of \(X\) (homoskedasticity)
\[ var[u_i|X_i]=\sigma_u^2 \]
Errors are not correlated across observations \(i\) and \(j\) (no autocorrelation)
\[ cor(u_i,u_j) = 0 \]
There is no correlation between \(X\) and \(u\), i.e. the model is exogenous
\[ \begin{align*} cor(X,u) &=0 \\ \mathbb{E}[u|X] &=0 \\ \end{align*} \]
Precision of the OLS estimator \(\hat{\beta}_1\): measures the uncertainty/variability of the estimate
\[ \begin{align*} var[\hat{\beta}_1]&=\frac{SER^2}{n\times var(X)}\\ se[\hat{\beta}_1]&=\sqrt{var[\hat{\beta}_1]} \\ \end{align*} \]
Affected by three factors:
Model fit (SER)
Sample size, \(n\)
Variation in \(X\)
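A sketch of the variance formula above on the hypothetical data; note that \(n \times var(X)\) with the population variance (ddof=0) equals \(\sum (X_i-\bar{X})^2\), the usual textbook denominator:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(Y)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)

SER = np.sqrt(np.sum(u_hat**2) / (n - 2))
var_b1 = SER**2 / (n * X.var())   # X.var() uses ddof=0 by default
se_b1 = np.sqrt(var_b1)
```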
Heteroskedasticity & Homoskedasticity
Homoskedastic errors (\(\hat{u}_i\)) have the same variance over all values of \(X\)
Heteroskedastic errors (\(\hat{u}_i\)) have different variance over values of \(X\)
Heteroskedasticity does not bias our estimates, but it makes the usual variance and standard errors incorrect (typically too small, inflating \(t\)-statistics and significance!)
Can correct for heteroskedasticity by using robust standard errors
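A minimal sketch of heteroskedasticity-robust (Eicker-Huber-White, HC1-style) standard errors for the bivariate slope; in practice a library would compute these, so treat this as an illustration of the idea, not a reference implementation:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(Y)

dX = X - X.mean()
b1 = np.sum(dX * (Y - Y.mean())) / np.sum(dX**2)
u_hat = Y - (Y.mean() - b1 * X.mean()) - b1 * X

# Robust variance: weight each squared residual by its squared deviation in X;
# the n/(n-2) factor is the HC1 small-sample adjustment
var_b1_robust = (n / (n - 2)) * np.sum(dX**2 * u_hat**2) / np.sum(dX**2)**2
se_b1_robust = np.sqrt(var_b1_robust)
```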
Hypothesis Testing of \(\beta_1\)
\(H_0: \beta_1=\beta_{1,0}\), often \(H_0: \beta_1=0\)
Two sided alternative \(H_1: \beta_1 \neq 0\)
One sided alternatives \(H_1: \beta_1 > 0\) or \(H_1: \beta_1 < 0\)
\(t\)-statistic
\[ t=\frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})} \]
Compare \(t\) against critical value \(t^*\), or compute \(p\)-value as usual
Confidence intervals (95%): \(\hat{\beta_1} \pm 1.96 \left(se(\hat{\beta_1})\right)\)
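Putting the pieces together, a sketch of the test and confidence interval on the same hypothetical data, using scipy for the two-sided \(p\)-value:

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(Y)

dX = X - X.mean()
b1 = np.sum(dX * (Y - Y.mean())) / np.sum(dX**2)
u_hat = Y - (Y.mean() - b1 * X.mean()) - b1 * X
se_b1 = np.sqrt(np.sum(u_hat**2) / (n - 2) / np.sum(dX**2))

# Test H0: beta1 = 0 against a two-sided alternative
t_stat = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% CI with the normal critical value 1.96, as in the note above
# (for small n, the t critical value with n-2 df would be more exact)
ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)
```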