# Midterm Concepts

# OLS Regression

Bivariate data and associations between variables (e.g. \(X\) and \(Y\))

Apparent relationships are best viewed by looking at a scatterplot

Check whether the association is positive/negative, weak/strong, linear/nonlinear, etc.

\(Y\): dependent variable

\(X\): independent variable

Correlation coefficient (\(r\)) can quantify the strength of an association

\[ r_{X,Y}=\frac{1}{n-1} \sum_{i=1}^n \bigg(\frac{X_i-\bar{X}}{s_X}\bigg) \bigg(\frac{Y_i-\bar{Y}}{s_Y}\bigg) = \frac{\displaystyle \sum_{i=1}^n Z_{X_i} Z_{Y_i}}{n-1} \]

\(-1 \leq r \leq 1\), and \(r\) only measures *linear* associations

\(|r|\) closer to 1 implies a stronger correlation (points near a perfect straight line)
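A minimal sketch of the z-score formula above, using a small made-up sample, checked against NumPy's built-in correlation:

```python
import numpy as np

# Hypothetical illustrative sample
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X)

# Standardize each variable (ddof=1 uses the n-1 sample standard deviation)
Z_X = (X - X.mean()) / X.std(ddof=1)
Z_Y = (Y - Y.mean()) / Y.std(ddof=1)

# r as the sum of z-score products, divided by n-1
r = np.sum(Z_X * Z_Y) / (n - 1)
```

This agrees with `np.corrcoef(X, Y)[0, 1]`, and is always between -1 and 1.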

Correlation does not imply causation! There might be confounding or lurking variables (e.g. \(Z\)) affecting \(X\) and/or \(Y\)

Population regression model

\[ Y_i=\beta_0+\beta_1X_i+u_i \]

\(\beta_1\): \(\frac{\Delta Y}{\Delta X}\): the slope between \(X\) and \(Y\), the number of units \(Y\) changes from a 1-unit change in \(X\)

\(\beta_0\) is the \(Y\)-intercept: literally, value of \(Y\) when \(X=0\)

\(u_i\) is the error: the difference between the actual value of \(Y\) given \(X\) and the predicted value \(\hat{Y}\)

Ordinary Least Squares (OLS) regression model

\[ Y_i = \hat{\beta}_0+\hat{\beta}_1 X_i + \hat{u}_i \]

\[ \hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i \]

\[ \hat{u}_i = Y_i-\hat{Y}_i \]

OLS estimators \(\hat{\beta_0}\) and \(\hat{\beta_1}\) estimate population regression line from sample data

Minimize the sum of squared residuals (SSR): \(\min \displaystyle \sum_{i=1}^n \hat{u}_i^2\)

OLS regression line

\[ \hat{\beta_1} = \frac{cov(X,Y)}{var(X)}=\frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sum (X_i-\bar{X})^2}=r_{X,Y}\frac{s_Y}{s_X} \]

\[ \hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X} \]
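The estimator formulas can be verified directly; a sketch with a hypothetical sample (the deviation-sum form of \(\hat{\beta}_1\), and the fact that the line passes through \((\bar{X}, \bar{Y})\)):

```python
import numpy as np

# Hypothetical sample
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Slope: cov(X, Y) / var(X), written as sums of deviations from the means
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# Intercept: forces the line through the point of means (X-bar, Y-bar)
beta0_hat = Y.mean() - beta1_hat * X.mean()

Y_hat = beta0_hat + beta1_hat * X  # fitted values
u_hat = Y - Y_hat                  # residuals (they sum to zero by construction)
```

The result matches `np.polyfit(X, Y, 1)`, which fits the same least-squares line.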

# Measures of Fit

- \(R^2\): fraction of total variation in \(Y\) explained by variation in \(X\) according to the model

\[ \begin{align*} R^2 & = \frac{SSM}{SST} \\ R^2 & = 1 - \frac{SSR}{SST} \\ R^2 & = r_{X,Y}^2 \\ \end{align*} \]

Where

\(SSM = \sum (\hat{Y}_i - \bar{Y})^2\) (model/explained sum of squares)

\(SST = \sum(Y_i - \bar{Y})^2\) (total sum of squares)

\(SSR = \sum \hat{u}_i^2\) (residual sum of squares)

Standard error of the regression (or of the residuals), SER: the average size of \(\hat{u}_i\), i.e. the average distance between the data points and the regression line

\[ SER \, (\sigma_{\hat{u}})= \sqrt{\frac{SSR}{n-2}} = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}} \]
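A self-contained sketch of the three sums of squares, \(R^2\), and SER, on the same kind of hypothetical sample:

```python
import numpy as np

# Hypothetical sample and OLS fit
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X)

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
Y_hat = beta0_hat + beta1_hat * X
u_hat = Y - Y_hat

SSM = np.sum((Y_hat - Y.mean()) ** 2)  # model (explained) sum of squares
SST = np.sum((Y - Y.mean()) ** 2)      # total sum of squares
SSR = np.sum(u_hat ** 2)               # residual sum of squares

R2 = SSM / SST                 # equals 1 - SSR/SST, and equals r^2
SER = np.sqrt(SSR / (n - 2))   # standard error of the regression
```

Note the decomposition \(SST = SSM + SSR\) holds exactly, which is why the two \(R^2\) formulas agree.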

# Sampling Distribution of \(\hat{\beta}_1\)

\(\hat{\beta_1}\) is a random variable, so it has its own sampling distribution with mean \(\mathbb{E}[\hat{\beta_1}]\) and standard error \(se[\hat{\beta_1}]\)

Mean of OLS estimator \(\hat{\beta_1}\) & Bias: Endogeneity & Exogeneity

\(X\) is **exogenous** if it is not correlated with the error term:

\[ \begin{align*} cor(X,u) &=0 \\ \mathbb{E}[u|X] &=0 \\ \end{align*} \]

Equivalently, knowing \(X\) should tell us nothing about \(u\) (the *zero conditional mean* assumption)

If \(X\) is exogenous, the OLS estimate of \(\beta_1\) is *unbiased*:

\[ \mathbb{E}[\hat{\beta}_1]=\beta_1 \]

\(X\) is **endogenous** if it is correlated with the error term:

\[ cor(X,u) \neq 0 \]

If \(X\) is endogenous, the OLS estimate of \(\beta_1\) is *biased*:

\[ \mathbb{E}[\hat{\beta}_1] = \beta_1 + \underbrace{cor(X,u)\frac{\sigma_u}{\sigma_X}}_{bias} \]

Can measure strength and direction (+ or -) of bias

Note: if \(X\) is exogenous, \(cor(X,u)=0\), so the bias term vanishes and \(\mathbb{E}[\hat{\beta}_1]=\beta_1\)
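The bias formula can be seen in a small Monte Carlo simulation; all the specifics below (true coefficients, sample sizes, building \(X\) partly out of \(u\) so that \(cor(X,u)>0\)) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 2.0   # hypothetical true parameters
n, reps = 500, 1000

estimates = np.empty(reps)
for i in range(reps):
    u = rng.normal(size=n)
    # Endogenous regressor: X is built partly from u, so cor(X, u) > 0
    X = 0.5 * u + rng.normal(size=n)
    Y = beta0 + beta1 * X + u
    estimates[i] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Positive correlation between X and u biases the estimate upward;
# here the theoretical bias is cov(X,u)/var(X) = 0.5/1.25 = 0.4
bias = estimates.mean() - beta1
```

The average estimate lands near 2.4 rather than the true 2.0, matching the sign and magnitude the bias formula predicts.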

Assumptions about \(u\)

The mean of the errors is 0

\[ \mathbb{E}[u_i] = 0 \]

The variance of the errors is constant over all values of \(X\) (homoskedasticity)

\[ var[u_i|X_i]=\sigma_u^2 \]

Errors are not correlated across observations \(i\) and \(j\) (no autocorrelation)

\[ cor(u_i,u_j) = 0 \]

There is no correlation between \(X\) and \(u\), i.e. the model is exogenous

\[ \begin{align*} cor(X,u) &=0 \\ \mathbb{E}[u|X] &=0 \\ \end{align*} \]

Precision of OLS estimator \(\hat{\beta}_1\) measures uncertainty/variability of estimate

\[ \begin{align*} var[\hat{\beta}_1]&=\frac{SER^2}{n\times var(X)}\\ se[\hat{\beta}_1]&=\sqrt{var[\hat{\beta}_1]} \\ \end{align*} \]
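Since \(n \times var(X)\) (with the \(1/n\) population-style variance) equals \(\sum (X_i-\bar{X})^2\), the variance formula can be checked on the earlier hypothetical sample:

```python
import numpy as np

# Same hypothetical sample as before
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X)

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
u_hat = Y - (beta0_hat + beta1_hat * X)

SER = np.sqrt(np.sum(u_hat ** 2) / (n - 2))

# np.var defaults to ddof=0, so n * np.var(X) = sum of squared deviations
var_b1 = SER ** 2 / (n * np.var(X))
se_b1 = np.sqrt(var_b1)
```

A better fit (smaller SER), more data (larger \(n\)), or more spread in \(X\) all shrink `se_b1`.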

Affected by three factors:

Model fit (SER)

Sample size, \(n\)

Variation in \(X\)

Heteroskedasticity & Homoskedasticity

Homoskedastic errors (\(\hat{u}_i\)) have the same variance over all values of \(X\)

Heteroskedastic errors (\(\hat{u}_i\)) have different variance over values of \(X\)

Heteroskedasticity does *not* bias our estimates, but it incorrectly lowers the variance & standard errors (inflating \(t\)-statistics and significance!)

Can correct for heteroskedasticity by using robust standard errors
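For intuition, classical and White (HC0) heteroskedasticity-robust standard errors can be compared on simulated data where the error spread grows with \(X\); the data-generating choices below are arbitrary, and HC0 is one common robust variant:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.uniform(1.0, 10.0, size=n)
u = rng.normal(scale=0.5 * X)    # heteroskedastic: error s.d. grows with X
Y = 1.0 + 2.0 * X + u

dx = X - X.mean()
beta1_hat = np.sum(dx * (Y - Y.mean())) / np.sum(dx ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
u_hat = Y - (beta0_hat + beta1_hat * X)

# Classical SE: assumes one common error variance for all observations
se_classical = np.sqrt(np.sum(u_hat ** 2) / (n - 2) / np.sum(dx ** 2))
# White (HC0) robust SE: lets each observation keep its own u_hat_i^2
se_robust = np.sqrt(np.sum(dx ** 2 * u_hat ** 2) / np.sum(dx ** 2) ** 2)
```

The slope estimate itself stays near the true value (no bias); only the standard errors differ between the two formulas.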

# Hypothesis Testing of \(\beta_1\)

\(H_0: \beta_1=\beta_{1,0}\), often \(H_0: \beta_1=0\)

Two-sided alternative: \(H_1: \beta_1 \neq 0\)

One-sided alternatives: \(H_1: \beta_1 > 0\) or \(H_1: \beta_1 < 0\)

\(t\)-statistic

\[ t=\frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})} \]

Compare \(t\) against critical value \(t^*\), or compute the \(p\)-value as usual

Confidence intervals (95%): \(\hat{\beta_1} \pm 1.96 \left(se(\hat{\beta_1})\right)\)
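Putting the pieces together for \(H_0: \beta_1 = 0\) on the hypothetical sample used throughout (values are illustrative only):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X)

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
u_hat = Y - (beta0_hat + beta1_hat * X)

SER = np.sqrt(np.sum(u_hat ** 2) / (n - 2))
se_b1 = np.sqrt(SER ** 2 / np.sum((X - X.mean()) ** 2))

# t-statistic for H0: beta1 = 0
t_stat = (beta1_hat - 0.0) / se_b1

# 95% confidence interval using the 1.96 normal critical value
ci = (beta1_hat - 1.96 * se_b1, beta1_hat + 1.96 * se_b1)
```

With only 5 observations the exact \(t\) critical value (df \(= n-2 = 3\)) would be larger than 1.96; the normal approximation above follows the formula in these notes, which is standard for large samples.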