Midterm Concepts

OLS Regression

Bivariate data and associations between variables (e.g. X and Y)

  • Apparent relationships are best viewed by looking at a scatterplot

  • Check whether associations are positive/negative, weak/strong, linear/nonlinear, etc.

    • Y: dependent variable

    • X: independent variable

  • Correlation coefficient ($r$) can quantify the strength of an association (see the sketch after this list)

    $r_{X,Y} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{s_X}\right)\left(\frac{Y_i - \bar{Y}}{s_Y}\right) = \frac{\sum_{i=1}^{n} Z_X Z_Y}{n-1}$

    • $-1 \leq r \leq 1$, and $r$ only measures linear associations

    • $|r|$ closer to 1 implies a stronger correlation (near a perfect straight line)

    • Correlation does not imply causation! There might be confounding or lurking variables (e.g. $Z$) affecting $X$ and/or $Y$
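
A minimal numeric sketch of the correlation formula above, in Python with numpy (not part of the original notes; the toy data are made up for illustration):

```python
import numpy as np

# Toy bivariate sample (made-up data for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(X)
# Sample standard deviations (ddof=1 matches the n-1 in the formula)
s_X, s_Y = X.std(ddof=1), Y.std(ddof=1)

# r = 1/(n-1) * sum of products of z-scores
Z_X = (X - X.mean()) / s_X
Z_Y = (Y - Y.mean()) / s_Y
r = (Z_X * Z_Y).sum() / (n - 1)

print(r)                        # by-hand correlation
print(np.corrcoef(X, Y)[0, 1])  # numpy's built-in; should match
```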

Population regression model

$Y_i = \beta_0 + \beta_1 X_i + u_i$

  • $\beta_1$: $\frac{\Delta Y}{\Delta X}$, the slope between $X$ and $Y$: the number of units $Y$ changes from a 1 unit change in $X$

  • $\beta_0$ is the $Y$-intercept: literally, the value of $Y$ when $X = 0$

  • $u_i$ is the error: the difference between the actual value of $Y_i$ given $X_i$ and the predicted value $\hat{Y}_i$
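
A small simulation sketch of the population model (Python/numpy; the parameter values are made up). It just shows that each observed $Y_i$ is the population line plus a random error $u_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

beta_0, beta_1 = 2.0, 0.5          # made-up population parameters
X = rng.uniform(0, 10, size=100)   # independent variable
u = rng.normal(0, 1, size=100)     # errors: mean 0, unrelated to X

Y = beta_0 + beta_1 * X + u        # population regression model

# Each error is the gap between observed Y and the population line
print(np.allclose(u, Y - (beta_0 + beta_1 * X)))  # True
```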

Ordinary Least Squares (OLS) regression model

$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i$

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

$\hat{u}_i = Y_i - \hat{Y}_i$

  • OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ estimate the population regression line from sample data

  • Minimize the sum of squared residuals (SSR): $\min \sum_{i=1}^{n} \hat{u}_i^2$

  • OLS regression line

$\hat{\beta}_1 = \frac{cov(X,Y)}{var(X)} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = r_{X,Y}\frac{s_Y}{s_X}$

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
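
A worked sketch of these estimator formulas in Python/numpy (not from the original notes; the data-generating values are made up). It computes $\hat{\beta}_1$ and $\hat{\beta}_0$ by hand and checks them against numpy's least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=200)   # made-up data-generating process

# beta_1_hat = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
beta_1_hat = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
# beta_0_hat = Ybar - beta_1_hat * Xbar
beta_0_hat = Y.mean() - beta_1_hat * X.mean()

print(beta_0_hat, beta_1_hat)
print(np.polyfit(X, Y, deg=1))   # returns [slope, intercept]; should match
```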

Measures of Fit

  • $R^2$: fraction of total variation in $Y$ explained by variation in $X$ according to the model (see the sketch after this section)

$R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST} = r_{X,Y}^2$

  • Where

    • $SSM = \sum (\hat{Y}_i - \bar{Y})^2$

    • $SST = \sum (Y_i - \bar{Y})^2$

    • $SSR = \sum \hat{u}_i^2$

  • Standard error of the regression (or of the residuals), SER: the average size of $\hat{u}_i$, i.e. the average distance between the points and the regression line

    $SER\ (\sigma_{\hat{u}}) = \sqrt{\frac{SSR}{n-2}} = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}}$
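
A sketch of these fit measures in Python/numpy (toy data, made up). It builds fitted values and residuals, then computes SSM, SST, SSR, $R^2$ (all three ways), and the SER with the $n-2$ degrees-of-freedom correction:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=200)
n = len(Y)

# OLS estimates (same formulas as above)
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()

Y_hat = b0 + b1 * X          # fitted values
u_hat = Y - Y_hat            # residuals

SSM = ((Y_hat - Y.mean()) ** 2).sum()   # model ("explained") sum of squares
SST = ((Y - Y.mean()) ** 2).sum()       # total sum of squares
SSR = (u_hat ** 2).sum()                # residual sum of squares

r = np.corrcoef(X, Y)[0, 1]
print(SSM / SST, 1 - SSR / SST, r ** 2)  # all three R^2 versions should agree

SER = np.sqrt(SSR / (n - 2))             # standard error of the regression
print(SER)
```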

Sampling Distribution of $\hat{\beta}_1$

$\hat{\beta}_1$ is a random variable, so it has its own sampling distribution with mean $E[\hat{\beta}_1]$ and standard error $se[\hat{\beta}_1]$

  • Mean of OLS estimator $\hat{\beta}_1$ & Bias: Endogeneity & Exogeneity

    • X is exogenous if it is not correlated with the error term

      $cor(X, u) = 0 \iff E[u|X] = 0$

      • equivalently, knowing X should tell us nothing about u (zero conditional mean assumption)

      • if X is exogenous, OLS estimate of β1 is unbiased

        $E[\hat{\beta}_1] = \beta_1$

    • X is endogenous if it is correlated with the error term

      $cor(X, u) \neq 0$

      • If X is endogenous, the OLS estimate of $\beta_1$ is biased (see the simulation sketch after this section):

        $E[\hat{\beta}_1] = \beta_1 + \underbrace{cor(X,u)\frac{\sigma_u}{\sigma_X}}_{\text{bias}}$

        • Can measure strength and direction (+ or -) of bias

        • Note if unbiased, $cor(X,u) = 0$, so $E[\hat{\beta}_1] = \beta_1$

    • Assumptions about u

      1. The mean of the errors is 0

        $E[u_i] = 0$

      2. The variance of the errors is constant over all values of X (homoskedasticity)

        $var[u_i | X_i] = \sigma_u^2$

      3. Errors are not correlated across observations i and j (no autocorrelation)

        $cor(u_i, u_j) = 0$

      4. There is no correlation between X and u, i.e. the model is exogenous

        $cor(X, u) = 0 \iff E[u|X] = 0$

  • Precision of OLS estimator $\hat{\beta}_1$ measures uncertainty/variability of the estimate

    $var[\hat{\beta}_1] = \frac{SER^2}{n \times var(X)}, \qquad se[\hat{\beta}_1] = \sqrt{var[\hat{\beta}_1]}$

    • Affected by three factors:

      • Model fit (SER)

      • Sample size, n

      • Variation in X

    • Heteroskedasticity & Homoskedasticity

      • Homoskedastic errors ($\hat{u}_i$) have the same variance over all values of $X$

      • Heteroskedastic errors ($\hat{u}_i$) have different variance over values of $X$

        • Heteroskedasticity does not bias our estimates, but it makes the usual (homoskedasticity-only) standard errors incorrect, typically understating them (inflating $t$-statistics and significance!)

        • Can correct for heteroskedasticity by using robust standard errors (see the second sketch below)
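
As referenced above, a simulation sketch (Python/numpy; all numbers made up) of the bias formula $E[\hat{\beta}_1] = \beta_1 + cor(X,u)\frac{\sigma_u}{\sigma_X}$: it generates an $X$ that is correlated with the error and compares the average OLS slope across many samples to the formula's prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_0, beta_1 = 1.0, 2.0              # made-up population parameters
rho, sigma_X, sigma_u = 0.5, 1.0, 1.0  # cor(X, u) and standard deviations
n, reps = 500, 2000

slopes = []
for _ in range(reps):
    # Build (X, u) with correlation rho through a shared common component
    common = rng.normal(0, 1, n)
    X = sigma_X * (rho * common + np.sqrt(1 - rho**2) * rng.normal(0, 1, n))
    u = sigma_u * common               # u shares the common piece, so cor(X, u) ≈ rho
    Y = beta_0 + beta_1 * X + u
    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    slopes.append(b1)

predicted = beta_1 + rho * sigma_u / sigma_X   # bias formula from the notes
print(np.mean(slopes), predicted)              # should be close (≈ 2.5 here)
```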
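
A second sketch, for precision and robust standard errors (Python, assuming numpy and statsmodels are installed; the heteroskedastic design is made up). It checks the notes' variance formula $var[\hat{\beta}_1] = \frac{SER^2}{n \times var(X)}$ against statsmodels' non-robust standard error, then reports the heteroskedasticity-robust (HC1) version.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(0, 10, size=n)
u = rng.normal(0, 0.5 + 0.3 * X)          # error spread grows with X (heteroskedastic)
Y = 2.0 + 0.5 * X + u

exog = sm.add_constant(X)                 # adds the intercept column
ols = sm.OLS(Y, exog).fit()               # classical (homoskedasticity-only) SEs
robust = sm.OLS(Y, exog).fit(cov_type="HC1")  # heteroskedasticity-robust SEs

# By-hand standard error from the notes' formula (homoskedasticity-only version)
u_hat = ols.resid
SER2 = (u_hat ** 2).sum() / (n - 2)
se_b1_formula = np.sqrt(SER2 / (n * X.var()))

print(se_b1_formula, ols.bse[1])          # formula vs. statsmodels (non-robust); should match
print(robust.bse[1])                      # robust SE is typically larger in this design
```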

Hypothesis Testing of $\beta_1$

  • $H_0: \beta_1 = \beta_{1,0}$, often $H_0: \beta_1 = 0$

  • Two-sided alternative $H_1: \beta_1 \neq 0$

  • One-sided alternatives $H_1: \beta_1 > 0$ or $H_1: \beta_1 < 0$

  • t-statistic

$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)}$

  • Compare $t$ against the critical value $t^*$, or compute the $p$-value as usual

  • Confidence interval (95%): $\hat{\beta}_1 \pm 1.96 \cdot se(\hat{\beta}_1)$
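
A closing sketch of the testing formulas above (Python with numpy and scipy; toy data made up): it computes the $t$-statistic for $H_0: \beta_1 = 0$, its two-sided $p$-value, and the 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 2, size=n)   # made-up data

# OLS estimates and (homoskedasticity-only) standard error, as in the notes
b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)
SER2 = (u_hat ** 2).sum() / (n - 2)
se_b1 = np.sqrt(SER2 / ((X - X.mean()) ** 2).sum())

# t-statistic for H0: beta_1 = 0 and its two-sided p-value (t distribution, n-2 df)
t = (b1 - 0) / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t), df=n - 2))

# 95% confidence interval (1.96 is the large-sample critical value)
ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)

print(t, p_value)
print(ci)
```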