2.7 — Hypothesis Testing (Regression)

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner
Associate Professor of Economics

## Contents

Hypothesis Testing

Digression: p-Values and the Philosophy of Science

Hypothesis Testing by Simulation with infer

Theory-Based Hypothesis Testing (What R Calculates)

The Use and Abuse of p-Values

# Hypothesis Testing

## Estimation and Hypothesis Testing I

• We want to test whether our estimates are statistically significant and whether they describe the population
• this is the “bread and butter” of using inferential statistics

Examples

• Does reducing class size improve test scores?
• Do more years of education increase your wages?
• Is the gender wage gap between men and women 23%?

All modern science is built upon statistical hypothesis testing, so understand it well

## Estimation and Hypothesis Testing II

• Note, we can test a lot of hypotheses about a lot of population parameters, e.g.
• A population mean $\mu$
• Example: average height of adults
• A population proportion $p$
• Example: percent of voters who voted for Biden
• A difference in population means $\mu_A-\mu_B$
• Example: difference in average wages of men vs. women
• A difference in population proportions $p_A-p_B$
• Example: difference in percent of patients reporting symptoms of drug A vs B
• We will focus on hypotheses about population regression slope $(\beta_1)$, i.e. the causal effect of $X$ on $Y$

## Null and Alternative Hypotheses I

• All scientific inquiries begin with a null hypothesis $(H_0)$ that proposes a specific value of a population parameter
• Notation: add a subscript 0: $\beta_{1,0}$ (or $\mu_0$, $p_0$, etc)
• We suggest an alternative hypothesis $(H_a)$, often the one we hope to verify
• Note, can be multiple alternative hypotheses: $H_1, H_2, \ldots , H_n$
• Ask: “Does our data (sample) give us sufficient evidence to reject $H_0$ in favor of $H_a$?”
• Note: the test is always about $\mathbf{H_0}$!
• See if we have sufficient evidence to reject the status quo

## Null and Alternative Hypotheses II

• Null hypothesis assigns a value (or a range) to a population parameter
• e.g. $\beta_1=2$ or $\beta_1 \leq 20$
• Most common is $\beta_1=0$ $\implies$ $X$ has no effect on $Y$ (no slope for a line)
• Note: always an equality!
• Alternative hypothesis must mathematically contradict the null hypothesis
• e.g. $\beta_1 \neq 2$ or $\beta_1 > 20$ or $\beta_1 \neq 0$
• Note: always an inequality!
• Alternative hypotheses come in two forms:
1. One-sided alternative: $\beta_1 > \beta_{1,0}$ or $\beta_1 < \beta_{1,0}$
2. Two-sided alternative: $\beta_1 \neq \beta_{1,0}$
• Note this means either $\beta_1 < \beta_{1,0}$ or $\beta_1 > \beta_{1,0}$


## Components of a Valid Hypothesis Test

• All statistical hypothesis tests have the following components:
1. A null hypothesis, $H_0$
1. An alternative hypothesis, $H_a$
1. A test statistic to determine if we reject $H_0$ when the statistic reaches a “critical value”
• Beyond the critical value is the “rejection region”, sufficient evidence to reject $H_0$
1. A conclusion whether or not to reject $H_0$ in favor of $H_a$

## Type I and Type II Errors I

• Sample statistic $(\hat{\beta_1})$ will rarely be exactly equal to the hypothesized parameter $(\beta_1)$

• Difference between observed statistic and true parameter could be because:

1. Parameter is not the hypothesized value
• $H_0$ is false
1. Parameter truly is the hypothesized value, but sampling variability gave us a different estimate
• $H_0$ is true
• We cannot distinguish between these two possibilities with any certainty

• So, we can interpret our estimates probabilistically as committing one of two types of error

## Type I and Type II Errors II

1. Type I error (false positive): rejecting $H_0$ when it is in fact true
• Believing we found an important result when there is truly no relationship
1. Type II error (false negative): failing to reject $H_0$ when it is in fact false
• Believing we found nothing when there was truly a relationship to find

## Type I and Type II Errors III

• Depending on context, committing one type of error may be more serious than the other

## Type I and Type II Errors IV

• Anglo-American common law presumes defendant is innocent: $H_0$
• Jury judges whether the evidence presented against the defendant is plausible assuming the defendant were in fact innocent
• If highly improbable (beyond a “reasonable doubt”): sufficient evidence to reject $H_0$ and convict

## Type I and Type II Errors V

William Blackstone

(1723-1780)

“It is better that ten guilty persons escape than that one innocent suffer.”

• Type I error is worse than a Type II error in law!

Blackstone, William, 1765-1770, Commentaries on the Laws of England

## Significance Level, $\alpha$, and Confidence Level $1-\alpha$

• The significance level, $\alpha$, is the probability of a Type I error

$\alpha=P(\text{Reject } H_0 | H_0 \text{ is true})$

• The confidence level is defined as $(1-\alpha)$
• Specify in advance an $\alpha$-level (0.10, 0.05, 0.01) with associated confidence level (90%, 95%, 99%)
• The probability of a Type II error is defined as $\beta$:

$\beta=P(\text{Don't reject } H_0 | H_0 \text{ is false})$
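
The meaning of $\alpha$ as a Type I error rate can be checked with a small simulation. This is a minimal sketch in base R, assuming made-up data in which $X$ and $Y$ are unrelated (so $H_0: \beta_1=0$ is true by construction); the seed, sample size, and number of simulations are arbitrary choices, not values from this lecture.

```r
# Simulate the Type I error rate when H0 (beta_1 = 0) is true by construction
set.seed(480)          # arbitrary seed, for reproducibility
n_sims <- 2000         # number of simulated samples (assumption)
n      <- 100          # observations per sample (assumption)
alpha  <- 0.05

p_values <- replicate(n_sims, {
  x <- rnorm(n)
  y <- rnorm(n)        # y is generated independently of x, so beta_1 = 0
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Share of samples in which we (wrongly) reject H0
type_1_rate <- mean(p_values < alpha)
type_1_rate            # should land close to alpha
```

Even though there is no relationship to find, roughly a share $\alpha$ of the samples still produce a "significant" slope.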

## Power and p-values

• The statistical power of the test is $(1-\beta)$: the probability of correctly rejecting $H_0$ when $H_0$ is in fact false (e.g. convicting a guilty person)

$\text{Power} = 1- \beta = P(\text{Reject }H_0|H_0 \text{ is false})$

• The p-value or significance probability is the probability that, if the null hypothesis were true, the test statistic from any sample will be at least as extreme as the test statistic from our sample

$p(\delta \geq \delta_i|H_0 \text{ is true})$

• where $\delta$ represents some test statistic
• $\delta_i$ is the test statistic we observe in our sample
• More on this in a bit
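
Power can be approximated the same way, by simulating a world where $H_0$ is false. The sketch below assumes a hypothetical true slope of 0.3 and standard normal errors; those values, the seed, and the sample size are illustrative assumptions only.

```r
# Simulate statistical power when H0 (beta_1 = 0) is false by construction
set.seed(480)             # arbitrary seed
n_sims      <- 1000       # number of simulated samples (assumption)
n           <- 100        # observations per sample (assumption)
alpha       <- 0.05
true_beta_1 <- 0.3        # hypothetical true slope, so H0 is false

rejected <- replicate(n_sims, {
  x <- rnorm(n)
  y <- 1 + true_beta_1 * x + rnorm(n)
  p <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  p < alpha               # TRUE when we correctly reject H0
})

power       <- mean(rejected)   # estimate of 1 - beta
type_2_rate <- 1 - power        # estimate of beta (Type II error rate)
c(power = power, type_2_rate = type_2_rate)
```

Raising `n` or `true_beta_1` in this sketch increases power, while lowering `alpha` reduces it.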

## p-values and Statistical Significance

• After running our test, we need to make a decision between the competing hypotheses

• Compare $p$-value with pre-determined $\alpha$ (commonly, $\alpha=0.05$, 95% confidence level)

• If $p<\alpha$: statistically significant evidence sufficient to reject $H_0$ in favor of $H_a$

• Note this does not mean $H_a$ is true! We merely have rejected $H_0$!
• If $p \geq \alpha$: insufficient evidence to reject $H_0$

• Note this does not mean $H_0$ is true! We merely have failed to reject $H_0$!

# Digression: p-Values and the Philosophy of Science

## Hypothesis Testing and the Philosophy of Science I

Sir Ronald A. Fisher

(1890-1962)

“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

Fisher, R.A., 1935, The Design of Experiments

## Hypothesis Testing and the Philosophy of Science II

• Modern philosophy of science is largely based on hypothesis testing and falsifiability, which form the “Scientific Method”

• For something to be “scientific”, it must be falsifiable, or at least testable (at least in principle)

• Hypotheses can be corroborated with evidence, but they remain tentative: data can always falsify them and suggest an alternative hypothesis

• “All swans are white” is a hypothesis rejected upon discovery of a single black swan

## Hypothesis Testing and p-Values

• Hypothesis testing, confidence intervals, and p-values are probably the hardest things to understand in statistics

Fivethirtyeight: Not Even Scientists Can Easily Explain P-values

## Hypothesis Testing: Which Test? I

• Rigorous course on statistics (ECMG 212 or MATH 112) will spend weeks going through different types of tests:
• Sample mean; difference of means
• Proportion; difference of proportions
• Z-test vs t-test
• 1 sample vs. 2 samples
• $\chi^2$ test

## There is Only One Test!

• Fortunately, some clever statisticians realized there is only one test and some built a nice R package called infer
1. Calculate a statistic, $\delta_i$, from a sample of data

2. Simulate a world where $\delta$ is null $(H_0)$

3. Examine the distribution of $\delta$ across the null world

4. Calculate the probability that $\delta_i$ could exist in the null world

5. Decide if $\delta_i$ is statistically significant
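
The five steps above can be sketched in base R without any packages. This is an illustrative sketch only: the data are simulated (a hypothetical regressor `x` and outcome `y` with a true slope of -2), and shuffling `y` stands in for "simulating the null world".

```r
set.seed(480)                              # arbitrary seed
n <- 200
x <- rnorm(n, mean = 20, sd = 2)           # hypothetical regressor
y <- 700 - 2 * x + rnorm(n, sd = 10)       # hypothetical outcome, true slope = -2

# 1. Calculate a statistic (the OLS slope) from the sample
delta_i <- coef(lm(y ~ x))[["x"]]

# 2. Simulate a null world: shuffling y breaks any link between x and y
# 3. Collect the slope from each shuffled sample to get the null distribution
null_slopes <- replicate(1000, {
  y_shuffled <- sample(y)
  coef(lm(y_shuffled ~ x))[["x"]]
})

# 4. Probability of a slope at least as extreme as ours in the null world
p_value <- mean(abs(null_slopes) >= abs(delta_i))

# 5. Decide whether delta_i is statistically significant at alpha = 0.05
is_significant <- p_value < 0.05
```

Here "at least as extreme" is measured by absolute value, which is one common way to form a two-sided p-value from a simulated null distribution.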

## Elements of a Hypothesis Test

Allen Downey: “There is still only one test”

## Hypothesis Testing with the infer Package I

• R naturally runs the following hypothesis test on any regression as part of lm():

\begin{align*} H_0: \; & \beta_1=0\\ H_1: \; & \beta_1 \neq 0 \end{align*}

• infer allows you to run through these steps manually to understand the process:
1. specify() a model
1. hypothesize() the null
1. generate() simulations of the null world
1. calculate() the $p$-value
1. visualize() with a histogram (optional)

## Theory-Based Inference: Critical Values of Test Statistic

• Test statistic $\delta$: measures how far what we observed in our sample $(\hat{\beta_1})$ is from what we would expect if the null hypothesis were true $(\beta_1=0)$
• Calculated from a sampling distribution of the estimator (i.e. $\hat{\beta_1})$
• In econometrics, we use $t$-distributions which have $n-k-1$ degrees of freedom
• Rejection region: if the test statistic reaches a “critical value” of $\delta$, then we reject the null hypothesis
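
As a rough illustration of a critical value and rejection region, the sketch below uses `qt()` from base R. The degrees of freedom (418), estimate (-2.28), and standard error (0.48) are the class-size numbers used in this lecture; the variable names are just for illustration.

```r
alpha  <- 0.05
df     <- 418                        # n - k - 1 = 420 - 1 - 1
t_crit <- qt(1 - alpha / 2, df)      # two-sided critical value, a little under 2

t_stat <- -2.28 / 0.48               # (estimate - 0) / standard error

# TRUE when the test statistic falls in the rejection region
in_rejection_region <- abs(t_stat) > t_crit
in_rejection_region
```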

# Hypothesis Testing by Simulation, with infer

## Imagine a Null World, where $H_0$ is True

Our world, and a world where $\beta_1=0$ by assumption.

## Comparing the Worlds I

• From that null world where $H_0: \, \beta_1=0$ is true, we simulate another sample and calculate OLS estimators again

## Comparing the Worlds II

• From that null world where $H_0: \, \beta_1=0$ is true, let’s simulate 1,000 samples and calculate slope $(\hat{\beta_1})$ for each

## Prepping the infer Pipeline

• Before I show you how to do this, let’s first save our estimated slope from our actual sample
• We’ll want this later!
# save as our_slope
our_slope <- school_reg %>%
tidy() %>%
filter(term == "str") %>%
pull(estimate)

# look at it
our_slope
[1] -2.279808

## The infer Pipeline: specify()

data %>%

specify(y ~ x)

• Take our data and pipe it into the specify() function, which is essentially an lm() function for regression (for our purposes)
ca_school %>%
specify(testscr ~ str)

## The infer Pipeline: hypothesize()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence")

• Describe what the null hypothesis is here
• In infer’s language, str and testscr are independent $(\beta_1=0)$
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence")

## The infer Pipeline: generate()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence") %>%

generate(reps = n, type = "permute")

• Now the magic starts, as we run a number of simulated samples
• Set the number of reps and set the type equal to "permute" (not bootstrap)
• Permutation randomly matches $X$-values and $Y$-values from the data so that there is no relationship between $X$ and $Y$
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute")
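
To see what a single permutation does, here is a small base R sketch on made-up data (not ca_school). It only illustrates the idea of shuffling; it is not infer's internal code.

```r
set.seed(480)                  # arbitrary seed
x <- rnorm(500)
y <- 3 * x + rnorm(500)        # hypothetical data with a strong relationship

y_permuted <- sample(y)        # randomly re-match Y-values to X-values

cor_original <- cor(x, y)            # strongly positive
cor_permuted <- cor(x, y_permuted)   # close to zero

# The shuffled Y contains exactly the same values, just in a different order
same_values <- identical(sort(y), sort(y_permuted))
```

Each replicate produced by generate() is one such reshuffling, so any slope estimated from it reflects sampling noise alone.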

## The infer Pipeline: calculate()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence") %>%

generate(reps = n, type = "permute") %>%

calculate(stat = "slope")

• We calculate sample statistics for each of the 1,000 replicate samples

• In our case, calculate the slope $(\hat{\beta}_1)$ for each replicate

ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute") %>%
calculate(stat = "slope")

## The infer Pipeline: get_p_value()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence") %>%

generate(reps = n, type = "permute") %>%

calculate(stat = "slope") %>%

get_p_value(obs_stat = "", direction = "both")

• We can calculate the p-value
• the probability of seeing a value at least as extreme as our_slope (-2.28) in our simulated null distribution
• Two-sided alternative $H_a: \beta_1 \neq 0$, we double the raw $p$-value
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute") %>%
calculate(stat = "slope") %>%
get_p_value(obs_stat = our_slope,
direction = "both")
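
The "double the raw p-value" logic can be sketched by hand in base R. The null slopes and observed slope below are hypothetical stand-ins (random draws, not the permuted ca_school slopes), and this is an assumption about the logic rather than a copy of infer's implementation.

```r
set.seed(480)                                     # arbitrary seed
null_slopes <- rnorm(1000, mean = 0, sd = 0.5)    # stand-in null distribution
obs_slope   <- -1.2                               # hypothetical observed slope

# One-sided ("raw") p-values: share of null slopes in each tail
left_tail  <- mean(null_slopes <= obs_slope)
right_tail <- mean(null_slopes >= obs_slope)

# Two-sided p-value: double the smaller tail, capped at 1
p_two_sided <- min(1, 2 * min(left_tail, right_tail))
p_two_sided
```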

## The infer Pipeline: visualize()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence") %>%

generate(reps = n, type = "permute") %>%

calculate(stat = "slope") %>%

visualize()

• Make a histogram of our null distribution of $\hat{\beta}_1$
• Note it is centered at $\beta_1=0$ because that’s $H_0$!
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute") %>%
calculate(stat = "slope") %>%
visualize()

## The infer Pipeline: visualize()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence") %>%

generate(reps = n, type = "permute") %>%

calculate(stat = "slope") %>%

visualize()

• Add our our_slope to show our finding on the null distr.
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute") %>%
calculate(stat = "slope") %>%
visualize(obs_stat = our_slope)

## The infer Pipeline: visualize()

data %>%

specify(y ~ x) %>%

hypothesize(null = "independence") %>%

generate(reps = n, type = "permute") %>%

calculate(stat = "slope") %>%

visualize() + shade_p_value()

• Add shade_p_value() to see what $p$ is
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute") %>%
calculate(stat = "slope") %>%
visualize() +
shade_p_value(obs_stat = our_slope,
direction = "two_sided")

## visualize() is Just a Wrapper for ggplot

# infer
ca_school %>%
specify(testscr ~ str) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000,
type = "permute") %>%
calculate(stat = "slope") %>%
# pipe into ggplot
ggplot(data = .)+
aes(x = stat)+
geom_histogram(color="white", fill="#e64173")+
geom_vline(xintercept = our_slope,
color = "blue",
size = 2,
linetype = "dashed")+
annotate(geom = "label",
x = -2.28,
y = 100,
label = expression(paste("Our ", hat(beta[1]))),
color = "blue")+
scale_y_continuous(lim=c(0,130),
expand = c(0,0))+
labs(x = expression(paste("Sampling distribution of ", hat(beta)[1], " under ", H[0], ":  ", beta[1]==0)),
y = "Samples")+
theme_classic(base_family = "Fira Sans Condensed",
base_size=20)

# Theory-Based Hypothesis Testing (What R Calculates)

## What R Does: Theory-Based Statistical Inference I

• R does things the old-fashioned way, using a theoretical null distribution instead of simulating one

• A t-distribution with $n-k-1$ df

• Calculate a $t$-statistic for $\hat{\beta_1}$:

$\text{test statistic} = \frac{\text{estimate} - \text{null hypothesis}}{\text{standard error of estimate}}$

## What R Does: Theory-Based Statistical Inference II

$\text{test statistic} = \frac{\text{estimate} - \text{null hypothesis}}{\text{standard error of estimate}}$

• $t$ has the same interpretation as $Z$: the number of standard deviations away from the sampling distribution’s expected value $E[\hat{\beta_1}]$ (if $H_0$ were true)

• Compares to a critical value of $t^*$ (pre-determined by $\alpha$-level & $n-k-1$ df)

• For 95% confidence, $\alpha=0.05$, $t^* \approx 2$

## What R Does: Theory-Based Statistical Inference III

\begin{align*} t &= \frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})}\\ t &= \frac{-2.28-0}{0.48}\\ t &= -4.75\\ \end{align*}

• Our sample slope $\hat{\beta_1}$ is 4.75 standard deviations below the expected value $E[\hat{\beta_1}]$ (i.e. 0) if $H_0$ were true

## What R Does: Theory-Based Statistical Inference IV

\begin{align*} t &= \frac{\hat{\beta_1}-\beta_{1,0}}{se(\hat{\beta_1})}\\ t &= \frac{-2.28-0}{0.48}\\ t &= -4.75\\ \end{align*}

• p-value: prob. of a test statistic at least as large (in magnitude) as ours if the null hypothesis were true
• Continuous distribution implies we need probability of area beyond our value
• p-value is 2-sided for $H_a: \beta_1 \neq 0$
• $2 \times p(t_{418}> \vert -4.75\vert)=0.0000028$

## One-Sided Tests & p-Values

$H_a: \beta_1<0$

p-value: $p(t \leq t_i)$

$H_a: \beta_1>0$

p-value: $p(t \geq t_i)$
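
Both one-sided p-values can be computed with `pt()`. The sketch below plugs in the test statistic ($t_i=-4.75$) and degrees of freedom (418) from the class-size example; the variable names are illustrative.

```r
t_i <- -4.75
df  <- 418

# H_a: beta_1 < 0  ->  area to the LEFT of t_i
p_left  <- pt(t_i, df = df)

# H_a: beta_1 > 0  ->  area to the RIGHT of t_i
p_right <- pt(t_i, df = df, lower.tail = FALSE)

c(p_left = p_left, p_right = p_right)
```

With a negative test statistic, the left-tail p-value is tiny while the right-tail p-value is close to 1, so the direction of $H_a$ matters a great deal.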

## Two-Sided Tests and p-Values

$H_a: \beta_1 \neq 0$

p-value: $2 \times p(t \geq |t_i|)$

## Calculating p-Values in R

• pt() calculates probabilities on a t distribution with arguments:
• the t-score
• df = the degrees of freedom
• lower.tail =
• TRUE if looking at area to LEFT of value
• FALSE if looking at area to RIGHT of value
2 * pt(4.75, # I'll double the right tail
df = 418,
lower.tail = F) # right tail
[1] 2.800692e-06
• $2 \times p(t_{418}> \vert -4.75\vert)=0.0000028$

## Hypothesis Tests in Regression Output I

school_reg %>% summary()

Call:
lm(formula = testscr ~ str, data = ca_school)

Residuals:
    Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
str          -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
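
The str row of this output can be reproduced by hand from the printed estimate and standard error. A minimal sketch, using only numbers shown in the output above (variable names are illustrative):

```r
estimate  <- -2.2798     # Estimate for str
std_error <- 0.4798      # Std. Error for str
df        <- 418         # residual degrees of freedom

# t value = (estimate - null value) / standard error, with null value 0
t_value <- (estimate - 0) / std_error

# Pr(>|t|): two-sided p-value from a t distribution with 418 df
p_value <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)

# With a single regressor, the F-statistic equals the squared t value
f_stat <- t_value^2

round(c(t_value = t_value, f_stat = f_stat), 3)
```

Small differences from the printed output come from rounding in the displayed estimate and standard error.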

## Hypothesis Tests in Regression Output II

• In broom’s tidy() (with confidence intervals)
tidy(school_reg, conf.int=TRUE)
• p-value on str is 0.00000278.

## Conclusions

\begin{align*} H_0: \,& \beta_1=0\\ H_a: \, & \beta_1 \neq 0\\ \end{align*}

• Because the hypothesis test’s $p$-value $<$ $\alpha$ (0.05)…

• We have sufficient evidence to reject $H_0$ in favor of our alternative hypothesis. Our sample suggests that there is a relationship between class size and test scores.

• Using the confidence intervals:

• We are 95% confident that, from similarly constructed samples, the true marginal effect of class size on test scores is between -3.22 and -1.34.

## Hypothesis Testing vs. Confidence Intervals

• Confidence intervals are all two-sided by nature

$CI_{0.95}=\left(\left[\hat{\beta_1}-\underbrace{2 \times se(\hat{\beta_1})}_{MOE}\right], \, \left[\hat{\beta_1}+\underbrace{2 \times se(\hat{\beta_1})}_{MOE}\right] \right)$

• Hypothesis test ($t$-test) of $H_0: \, \beta_1=0$ computes a $t$-value; since our null hypothesis is $\beta_{1,0}=0$, the test statistic simplifies to this neat fraction:

$t=\frac{\hat{\beta_1}}{se(\hat{\beta_1})}$

and $p<0.05$ when $|t|\geq2$ (approximately)

• If our confidence interval contains the $H_0$ value (i.e. $0$, for our test), then we fail to reject $H_0$.
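
The link between the confidence interval and the test can be checked numerically. This sketch rebuilds the 95% interval from the estimate (-2.2798) and standard error (0.4798) reported for str, using the exact critical value from `qt()` instead of the rule-of-thumb 2.

```r
estimate  <- -2.2798
std_error <- 0.4798
df        <- 418

t_crit <- qt(0.975, df)                          # critical value for 95% confidence
moe    <- t_crit * std_error                     # margin of error
ci     <- c(lower = estimate - moe, upper = estimate + moe)
round(ci, 2)

# Does the interval contain the H0 value of 0?
contains_zero <- ci[["lower"]] <= 0 && 0 <= ci[["upper"]]
contains_zero   # FALSE here, consistent with rejecting H0 at alpha = 0.05
```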

# The Use and Abuse of $p$-values

## p-Hacking

• Consider what 95% confident or $\alpha=0.05$ means

• If we repeat a procedure 20 times, we should expect $\frac{1}{20}$ (5%) to produce a fluke result!
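
A quick calculation shows how fast fluke results accumulate. The sketch assumes 20 independent tests in which every null hypothesis is actually true; independence is a simplifying assumption.

```r
alpha   <- 0.05
n_tests <- 20

# Expected number of false positives across the 20 tests
expected_false_positives <- alpha * n_tests

# Probability that at least one test is "significant" purely by chance
p_at_least_one <- 1 - (1 - alpha)^n_tests

c(expected = expected_false_positives, at_least_one = p_at_least_one)
```

Under these assumptions, the chance of at least one fluke "significant" result is about 64%, which is why running many tests and reporting only the significant ones is so misleading.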

Image source: Seeing Theory

## Abusing p-values and “Science”

Source: Washington Post

Source: SMBC

## Abusing p-Values and “Science” II

“The widespread use of ‘statistical significance’ (generally interpreted as $p \leq 0.05$) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”

Wasserstein, Ronald L. and Nicole A. Lazar, (2016), “The ASA’s Statement on p-Values: Context, Process, and Purpose,” The American Statistician 70(2): 129-133

## Abusing p-Values and “Science” III

“No economist has achieved scientific success as a result of a statistically significant coefficient. Massed observations, clever common sense, elegant theorems, new policies, sagacious economic reasoning, historical perspective, relevant accounting, these have all led to scientific success. Statistical significance has not,” (p.112).

McCloskey, Deirdre N and Stephen Ziliak, 1996, The Cult of Statistical Significance

• Common misinterpretations of $p$, and why each one is wrong:

• Wrong: “$p$ is the probability that the alternative hypothesis is false”
• We can never prove an alternative hypothesis, only tentatively reject a null hypothesis

• Wrong: “$p$ is the probability that the null hypothesis is true”
• We’re not proving that $H_0$ is false, only saying that it’s very unlikely that, if $H_0$ were true, we’d obtain a slope as rare as our sample’s slope

• Wrong: “$p$ is the probability that our observed effects were produced purely by random chance”
• $p$ is computed under a specific model (think about our null world) that assumes $H_0$ is true

• Wrong: “$p$ tells us how significant our finding is”
• $p$ tells us nothing about the size or the real-world significance of any effect deemed “statistically significant”
• It only tells us that the slope is statistically significantly different from 0 (if $H_0$ is $\beta_1=0$)

## p-Values: Restatement

• Again, p-value is the probability that, if the null hypothesis were true, we obtain (by pure random chance) a test statistic at least as extreme as the one we estimated for our sample

• A low p-value means either (and we can’t distinguish which):

1. $H_0$ is true and a highly improbable event has occurred OR
2. $H_0$ is false

## Statistical Significance In Regression Tables

| | Test Score |
|:---|:---:|
| Constant | 698.93*** |
| | (9.47) |
| STR | −2.28*** |
| | (0.48) |
| n | 420 |
| $R^2$ | 0.05 |
| SER | 18.54 |

Note: * p < 0.1, ** p < 0.05, *** p < 0.01

• Statistical significance is shown by asterisks, common (but not always!) standard:
• 1 asterisk: significant at $\alpha=0.10$
• 2 asterisks: significant at $\alpha=0.05$
• 3 asterisks: significant at $\alpha=0.01$
• Rare, but sometimes regression tables include $p$-values for estimates