OLS Regression

Bivariate data and associations between variables (e.g. $$X$$ and $$Y$$)

• Apparent relationships are best viewed by looking at a scatterplot

• Check whether associations are positive/negative, weak/strong, linear/nonlinear, etc.

• $$Y$$: dependent variable

• $$X$$: independent variable

• Correlation coefficient ($$r$$) can quantify the strength of an association

$r_{X,Y}=\frac{1}{n-1} \sum_{i=1}^n \bigg(\frac{X_i-\bar{X}}{s_X}\bigg) \bigg(\frac{Y_i-\bar{Y}}{s_Y}\bigg) = \frac{\displaystyle \sum_{i=1}^n Z_{X_i} Z_{Y_i}}{n-1}$

• $$-1 \leq r \leq 1$$ and $$r$$ only measures linear associations

• $$|r|$$ closer to 1 implies a stronger correlation (points closer to a perfect straight line)

• Correlation does not imply causation! There might be confounding or lurking variables (e.g. $$Z$$) affecting $$X$$ and/or $$Y$$
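As a quick check, the standardized-products formula above can be computed directly and compared against NumPy's built-in correlation (a sketch on hypothetical data):

```python
import numpy as np

# Hypothetical sample data, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

n = len(X)
# Standardize each variable using the sample standard deviation (ddof=1)
z_x = (X - X.mean()) / X.std(ddof=1)
z_y = (Y - Y.mean()) / Y.std(ddof=1)
r = (z_x * z_y).sum() / (n - 1)

# np.corrcoef uses the same definition, so the two should agree
assert abs(r - np.corrcoef(X, Y)[0, 1]) < 1e-12
```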

Population regression model

$Y_i=\beta_0+\beta_1X_i+u_i$

• $$\beta_1$$: $$\frac{\Delta Y}{\Delta X}$$, the slope between $$X$$ and $$Y$$: the number of units $$Y$$ changes from a 1 unit change in $$X$$

• $$\beta_0$$ is the $$Y$$-intercept: literally, value of $$Y$$ when $$X=0$$

• $$u_i$$ is the error: the difference between the actual value $$Y_i$$ and the value of $$Y$$ predicted by the model given $$X_i$$

Ordinary Least Squares (OLS) regression model

$Y_i = \hat{\beta}_0+\hat{\beta}_1 X_i + \hat{u}_i$

$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i$

$\hat{u}_i = Y_i-\hat{Y}_i$

• OLS estimators $$\hat{\beta_0}$$ and $$\hat{\beta_1}$$ estimate the population regression line from sample data

• Minimize the sum of squared residuals (SSR): $$\min \displaystyle \sum_{i=1}^n \hat{u}_i^2$$

• OLS regression line

$\hat{\beta_1} = \frac{cov(X,Y)}{var(X)}=\frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sum (X_i-\bar{X})^2}=r_{X,Y}\frac{s_Y}{s_X}$

$\hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X}$
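A minimal sketch of the slope and intercept formulas. The data are hypothetical and noise-free (generated as $$Y = 2 + 3X$$), so the fitted line should recover those coefficients exactly:

```python
import numpy as np

# Hypothetical data on an exact line Y = 2 + 3X (no error term)
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 + 3.0 * X

# Slope: sample covariance of X and Y over sample variance of X
beta1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
# Intercept: the OLS line passes through the point of means
beta0 = Y.mean() - beta1 * X.mean()

assert abs(beta1 - 3.0) < 1e-9 and abs(beta0 - 2.0) < 1e-9
```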

Measures of Fit

• $$R^2$$: fraction of total variation in $$Y$$ explained by variation in $$X$$ according to the model

\begin{align*} R^2 & = \frac{SSM}{SST} \\ R^2 & = 1 - \frac{SSR}{SST} \\ R^2 & = r_{X,Y}^2 \\ \end{align*}

• Where

• $$SSM = \sum (\hat{Y}_i - \bar{Y})^2$$

• $$SST = \sum(Y_i - \bar{Y})^2$$

• $$SSR = \sum \hat{u}_i^2$$

• Standard error of the regression (or residuals), SER: average size of $$\hat{u}_i$$, i.e. average distance between points and the regression line

$SER \, (\sigma_{\hat{u}})= \sqrt{\frac{SSR}{n-2}} = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}}$
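The equivalent definitions of $$R^2$$, and the SER, can be verified numerically. A sketch on hypothetical noisy data:

```python
import numpy as np

# Hypothetical noisy sample, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

n = len(X)
beta1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
beta0 = Y.mean() - beta1 * X.mean()
Y_hat = beta0 + beta1 * X
u_hat = Y - Y_hat

SST = ((Y - Y.mean()) ** 2).sum()   # total sum of squares
SSM = ((Y_hat - Y.mean()) ** 2).sum()  # model (explained) sum of squares
SSR = (u_hat ** 2).sum()            # residual sum of squares

R2 = SSM / SST
# All three definitions of R^2 agree, including the squared correlation
assert abs(R2 - (1 - SSR / SST)) < 1e-10
assert abs(R2 - np.corrcoef(X, Y)[0, 1] ** 2) < 1e-10

SER = np.sqrt(SSR / (n - 2))        # note the square root
```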

Sampling Distribution of $$\hat{\beta}_1$$

$$\hat{\beta_1}$$ is a random variable, so it has its own sampling distribution with mean $$\mathbb{E}[\hat{\beta_1}]$$ and standard error $$se[\hat{\beta_1}]$$

• Mean of OLS estimator $$\hat{\beta_1}$$ & Bias: Endogeneity & Exogeneity

• $$X$$ is exogenous if it is not correlated with the error term

\begin{align*} cor(X,u) &=0 \\ \mathbb{E}[u|X] &=0 \\ \end{align*}

• equivalently, knowing $$X$$ should tell us nothing about $$u$$ (zero conditional mean assumption)

• if $$X$$ is exogenous, OLS estimate of $$\beta_1$$ is unbiased

$\mathbb{E}[\hat{\beta}_1]=\beta_1$

• $$X$$ is endogenous if it is correlated with the error term

$cor(X,u) \neq 0$

• If $$X$$ is endogenous, OLS estimate of $$\beta_1$$ is biased:

$\mathbb{E}[\hat{\beta}_1] = \beta_1 + \underbrace{cor(X,u)\frac{\sigma_u}{\sigma_X}}_{bias}$

• Can measure strength and direction (+ or -) of bias

• Note if $$cor(X,u)=0$$, the bias term vanishes, so $$\mathbb{E}[\hat{\beta_1}]=\beta_1$$
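A small simulation can illustrate the bias formula. The data below are hypothetical, constructed so that $$cor(X,u) \neq 0$$ with $$cov(X,u)/var(X)=0.5$$; the true slope is 2, so OLS should center on roughly 2.5:

```python
import numpy as np

# Simulation sketch: X is endogenous because u shares the component v with X.
# True beta1 = 2, and cov(X, u)/var(X) = 0.5, so OLS centers on ~2.5.
rng = np.random.default_rng(0)
n = 100_000
v = rng.standard_normal(n)
e = rng.standard_normal(n)
X = v
u = 0.5 * v + e              # cor(X, u) != 0 -> endogeneity
Y = 1.0 + 2.0 * X + u

beta1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
# Biased: centered on beta1 + cov(X, u)/var(X) = 2.5, not 2
assert abs(beta1_hat - 2.5) < 0.05
```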

• Assumptions about u

1. The mean of the errors is 0

$\mathbb{E}[u_i] = 0$

2. The variance of the errors is constant over all values of $$X$$ (homoskedasticity)

$var[u_i|X_i]=\sigma_u^2$

3. Errors are not correlated across observations $$i$$ and $$j$$ (no autocorrelation)

$cor(u_i,u_j) = 0$

4. There is no correlation between $$X$$ and $$u$$, i.e. the model is exogenous

\begin{align*} cor(X,u) &=0 \\ \mathbb{E}[u|X] &=0 \\ \end{align*}

• Precision of OLS estimator $$\hat{\beta}_1$$ measures uncertainty/variability of estimate

\begin{align*} var[\hat{\beta}_1]&=\frac{SER^2}{n\times var(X)}\\ se[\hat{\beta}_1]&=\sqrt{var[\hat{\beta}_1]} \\ \end{align*}

• Affected by three factors:

• Model fit (SER)

• Sample size, $$n$$

• Variation in $$X$$
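A sketch of the variance formula on hypothetical data. Note that $$n \times var(X)$$, with the population variance (`ddof=0`), equals $$\sum (X_i-\bar{X})^2$$, so this matches the textbook form $$SER^2 / \sum (X_i-\bar{X})^2$$:

```python
import numpy as np

# Hypothetical noisy sample, for illustration only
rng = np.random.default_rng(1)
n = 50
X = rng.uniform(0, 10, n)
Y = 1.0 + 2.0 * X + rng.standard_normal(n)

beta1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
beta0 = Y.mean() - beta1 * X.mean()
u_hat = Y - (beta0 + beta1 * X)

SER2 = (u_hat ** 2).sum() / (n - 2)
# n * population variance of X equals the sum of squared deviations
var_beta1 = SER2 / (n * np.var(X))   # np.var defaults to ddof=0
se_beta1 = np.sqrt(var_beta1)
```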

• Heteroskedasticity & Homoskedasticity

• Homoskedastic errors ($$\hat{u}_i$$) have the same variance over all values of $$X$$

• Heteroskedastic errors ($$\hat{u}_i$$) have different variance over values of $$X$$

• Heteroskedasticity does not bias our estimates, but incorrectly lowers variance & standard errors (inflating $$t$$-statistics and significance!)

• Can correct for heteroskedasticity by using robust standard errors
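A minimal numpy sketch of a heteroskedasticity-robust (White/HC0) variance for the slope in a simple regression, on hypothetical data whose error variance grows with $$X$$ (packages like statsmodels offer refined HC1–HC3 variants):

```python
import numpy as np

# Hypothetical heteroskedastic data: error standard deviation grows with X
rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0, 10, n)
u = rng.standard_normal(n) * X       # variance of u depends on X
Y = 1.0 + 2.0 * X + u

beta1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
beta0 = Y.mean() - beta1 * X.mean()
u_hat = Y - (beta0 + beta1 * X)

d = X - X.mean()
# Conventional (homoskedastic) variance estimate for the slope
var_conv = (u_hat ** 2).sum() / (n - 2) / (d ** 2).sum()
# White/HC0 robust variance: weights each squared residual by (X_i - Xbar)^2
var_robust = (d ** 2 * u_hat ** 2).sum() / (d ** 2).sum() ** 2
se_conv, se_robust = np.sqrt(var_conv), np.sqrt(var_robust)
```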

Hypothesis Testing of $$\beta_1$$

• $$H_0: \beta_1=\beta_{1,0}$$, often $$H_0: \beta_1=0$$

• Two sided alternative $$H_1: \beta_1 \neq 0$$

• One sided alternatives $$H_1: \beta_1 > 0$$ or $$H_1: \beta_1 < 0$$

• $$t$$-statistic

$t=\frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})}$

• Compare $$t$$ against critical value $$t^*$$, or compute the $$p$$-value as usual

• Confidence intervals (95%): $$\hat{\beta_1} \pm 1.96 \left(se(\hat{\beta_1})\right)$$
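Putting the pieces together, a sketch of the $$t$$-statistic and 95% confidence interval for $$H_0: \beta_1 = 0$$ on hypothetical data (homoskedastic standard error, as above):

```python
import numpy as np

# Hypothetical data with true slope 2, for illustration only
rng = np.random.default_rng(3)
n = 100
X = rng.uniform(0, 10, n)
Y = 1.0 + 2.0 * X + rng.standard_normal(n)

beta1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
beta0 = Y.mean() - beta1 * X.mean()
u_hat = Y - (beta0 + beta1 * X)
d2 = ((X - X.mean()) ** 2).sum()
se_beta1 = np.sqrt((u_hat ** 2).sum() / (n - 2) / d2)

t = (beta1 - 0.0) / se_beta1         # test H0: beta1 = 0
ci = (beta1 - 1.96 * se_beta1, beta1 + 1.96 * se_beta1)
```

With a true slope of 2 and small errors, $$|t|$$ here is far above 1.96, so $$H_0$$ is rejected.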