2.2 — Random Variables & Distributions
ECON 480 • Econometrics • Fall 2022
Dr. Ryan Safner
Associate Professor of Economics
safner@hood.edu
ryansafner/metricsF22
metricsF22.classes.ryansafner.com
A random variable (RV) takes on values that are unknown in advance, but determined by an experiment
A numerical summary of a random outcome
Random variable \(X\) takes on individual values \((x_i)\) from a set of possible values
Capital letters are often used to denote RVs
\(x_i\) | \(P(X=x_i)\) |
---|---|
0 | 0.80 |
1 | 0.10 |
2 | 0.06 |
3 | 0.03 |
4 | 0.01 |
Probability distribution function (pdf) summarizes the possible outcomes of \(X\) and their probabilities
Notation: \(f_X\) is the pdf of \(X\), giving the probability \(p_i\) of each possible value \(x_i\):
\[f_X(x_i)=P(X=x_i)=p_i, \quad i=1,2,...,k\]
library(tidyverse)

# Probability distribution of the number of crashes
crashes <- tibble(number = c(0, 1, 2, 3, 4),
                  prob = c(0.80, 0.10, 0.06, 0.03, 0.01))

# Bar chart of the pdf: one bar per possible value
ggplot(data = crashes) +
  aes(x = number,
      y = prob) +
  geom_col(fill = "#e64173") +
  labs(x = "Number of Crashes",
       y = "Probability") +
  scale_y_continuous(breaks = seq(0, 1, 0.2),
                     limits = c(0, 1),
                     expand = c(0, 0)) +
  theme_classic(base_family = "Fira Sans Condensed",
                base_size = 20)
Cumulative distribution function (cdf) gives the probability that \(X\) will be at most (less than or equal to) a given value \(x_i\)
Notation: \(F_X(x_i)=P(X \leq x_i)\)
\(x_i\) | \(f(x)\) | \(F(x)\) |
---|---|---|
0 | 0.80 | 0.80 |
1 | 0.10 | 0.90 |
2 | 0.06 | 0.96 |
3 | 0.03 | 0.99 |
4 | 0.01 | 1.00 |
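The cumulative column in the table above is a running total of the pdf. A minimal sketch of one way to compute it, assuming the `crashes` data from earlier and the dplyr and tibble packages (the names `crashes_cdf` and `cum_prob` are illustrative choices):

```r
library(dplyr)
library(tibble)

# Probability distribution of the number of crashes (values from the table above)
crashes <- tibble(
  number = c(0, 1, 2, 3, 4),
  prob   = c(0.80, 0.10, 0.06, 0.03, 0.01)
)

# The cdf at x_i is the sum of the pdf over all values up to and including x_i
crashes_cdf <- crashes %>%
  mutate(cum_prob = cumsum(prob))

crashes_cdf
```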
# A tibble: 5 × 3
number prob cum_prob
<dbl> <dbl> <dbl>
1 0 0.8 0.8
2 1 0.1 0.9
3 2 0.06 0.96
4 3 0.03 0.99
5 4 0.01 1
\[\mathbb{E}(X)=\sum^k_{i=1} p_i x_i\]
\(\mathbb{E}(X)=p_1x_1+p_2x_2+ \cdots +p_kx_k\)
A probability-weighted average of \(X\), with each \(x_i\) weighted by its associated probability \(p_i\)
Also called the “mean” or “expectation” of \(X\), always denoted either \(\mathbb{E}(X)\) or \(\mu_X\)
\[\begin{align*}\sigma^2_X &= \mathbb{E}[(X-\mu_X)^2]\\ &=\sum_{i=1}^k(x_i-\mu_X)^2p_i \end{align*}\]
\[\sigma_X=\sqrt{\sigma_X^2}\]
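The formulas for the expected value, variance, and standard deviation can be applied step by step to the crashes distribution. The following is a sketch, assuming dplyr and tibble; the object names `expected` and `spread` are illustrative, and the intermediate column names are chosen to echo the output shown next:

```r
library(dplyr)
library(tibble)

crashes <- tibble(
  number = c(0, 1, 2, 3, 4),
  prob   = c(0.80, 0.10, 0.06, 0.03, 0.01)
)

# Expected value: each value weighted by its probability, then summed
expected <- crashes %>%
  summarize(expected_value = sum(number * prob))

# Variance: probability-weighted average of squared deviations from the mean
spread <- crashes %>%
  mutate(deviations       = number - sum(number * prob),
         deviations_sq    = deviations^2,
         weighted_devs_sq = prob * deviations_sq) %>%
  summarize(variance = sum(weighted_devs_sq),
            sd       = sqrt(variance))

expected
spread
```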
# A tibble: 1 × 1
expected_value
<dbl>
1 0.35
# A tibble: 5 × 5
number prob deviations deviations_sq weighted_devs_sq
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.8 -0.35 0.122 0.098
2 1 0.1 0.65 0.423 0.0423
3 2 0.06 1.65 2.72 0.163
4 3 0.03 2.65 7.02 0.211
5 4 0.01 3.65 13.3 0.133
# A tibble: 1 × 2
variance sd
<dbl> <dbl>
1 0.648 0.805
Continuous random variables can take on an uncountable (infinite) number of values
So many values that the probability of any specific value is infinitely small:
\[P(X=x_i)\rightarrow 0\]
Probability density function (pdf) of a continuous variable represents the probability between two values as the area under a curve
The total area under the curve is 1
Since \(P(X=a)=0\) and \(P(X=b)=0\), \(P(a<X<b)=P(a \leq X \leq b)\)
See today’s appendix for how to graph math/stats functions in ggplot
\[P(a \leq X \leq b) = \int_a^b f(x) dx \]
\[P(X \leq k)\]
\[P(X \geq k)=1-P(X \leq k)\]
\(P(X \geq 2)=\) area under the pdf curve to the right of 2
\[ X \sim N(\mu,\sigma)\]
“\(X\) is distributed Normally with mean \(\mu\) and standard deviation \(\sigma\)”
Continuous, symmetric, unimodal
\[f(k)= \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2}\big(\frac{(k-\mu)}{\sigma}\big)^2}\]
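The formula above gives the height of the Normal density curve at a value \(k\). As a rough check, it can be typed out by hand and compared with R's built-in `dnorm()`. The helper `normal_pdf()` and the values of `mu`, `sigma`, and `k` below are hypothetical, chosen only for illustration:

```r
# Hypothetical helper: the Normal density formula written out term by term
normal_pdf <- function(k, mu = 0, sigma = 1) {
  1 / sqrt(2 * pi * sigma^2) * exp(-0.5 * ((k - mu) / sigma)^2)
}

k <- c(-2, 0, 1.5)                        # illustrative values
by_hand  <- normal_pdf(k, mu = 1, sigma = 2)
built_in <- dnorm(k, mean = 1, sd = 2)    # base R Normal density

all.equal(by_hand, built_in)
```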
\[Z \sim N(0,1)\]
\[\Phi(k)=P(Z \leq k)\]
(again, the area under the pdf curve to the left of some value \(k\))
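To see that \(\Phi(k)\) is an area under the standard normal pdf, one can integrate the density numerically and compare the result with `pnorm()`. This is only a sketch; the value `k = 1.5` is an arbitrary choice for illustration:

```r
k <- 1.5  # arbitrary illustrative value

# Area under the standard normal pdf to the left of k, by numerical integration
area <- integrate(dnorm, lower = -Inf, upper = k)$value

# The same probability from the built-in standard normal cdf
cdf <- pnorm(k)

c(area = area, cdf = cdf)
```

The two numbers should agree up to the tolerance of the numerical integration.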
68-95-99.7% empirical rule: for a normal distribution:
\(P(\mu-1\sigma \leq X \leq \mu+1\sigma) \approx\) 68%
\(P(\mu-2\sigma \leq X \leq \mu+2\sigma) \approx\) 95%
\(P(\mu-3\sigma \leq X \leq \mu+3\sigma) \approx\) 99.7%
68/95/99.7% of observations fall within 1/2/3 standard deviations of the mean
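The 68-95-99.7 figures can be recovered from the standard normal cdf, since the probability of falling within \(k\) standard deviations of the mean is \(\Phi(k)-\Phi(-k)\). A short sketch using base R's `pnorm()` (the name `within_k_sd` is illustrative):

```r
k <- 1:3

# P(mu - k*sigma <= X <= mu + k*sigma) for k = 1, 2, 3
within_k_sd <- pnorm(k) - pnorm(-k)

round(within_k_sd, 4)
# 0.6827 0.9545 0.9973
```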
\[Z=\frac{x_i-\mu}{\sigma}\]
Subtract the distribution’s mean from any value, then divide by the standard deviation
\(Z\): number of standard deviations \(x_i\) value is away from the mean
\[Z = \frac{X - \mu}{\sigma}\]
\[Z = \frac{1.87-634.8}{155.28}\]
\[Z \approx -4.1\]
This is 4.1 standard deviations \((\sigma)\) beneath the mean, an extremely low probability event.
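The calculation can be reproduced in R. This sketch assumes that 1.87 is the observed value, 634.8 the mean, and 155.28 the standard deviation, which is the reading consistent with a \(Z\)-score of \(-4.1\) beneath the mean; it also assumes the variable is approximately normal when computing the tail probability:

```r
x     <- 1.87    # assumed observed value
mu    <- 634.8   # assumed mean
sigma <- 155.28  # assumed standard deviation

# Z-score: number of standard deviations x lies from the mean
z <- (x - mu) / sigma
round(z, 1)
# -4.1

# Probability of a value this low or lower, under a normal distribution
p_below <- pnorm(z)
p_below
```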
Convert to standard normal to find \(Z\)-scores for \(8\) and \(-3.2.\)
\[P(-3.2 < X < 8)\]
\[P\left(\frac{-3.2-2.4}{5.6} < \frac{X-2.4}{5.6} < \frac{8-2.4}{5.6}\right)\]
\[P(-1 < Z < 1)\]
\[P(\mu-1\sigma \leq X \leq \mu+1\sigma) \approx 0.68\]
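The same probability can be checked in R two ways: by standardizing first and using the standard normal cdf, or by giving `pnorm()` the mean and standard deviation directly. A sketch using the values from this example (mean 2.4, standard deviation 5.6); the variable names are illustrative:

```r
mu    <- 2.4
sigma <- 5.6

# Standardize the two endpoints
z_low  <- (-3.2 - mu) / sigma   # approximately -1
z_high <- (8 - mu) / sigma      # approximately  1

# Route 1: standard normal cdf on the Z-scores
p_standardized <- pnorm(z_high) - pnorm(z_low)

# Route 2: let pnorm() handle the mean and standard deviation
p_direct <- pnorm(8, mean = mu, sd = sigma) - pnorm(-3.2, mean = mu, sd = sigma)

round(c(p_standardized, p_direct), 4)
```

Both routes should give roughly 0.68, matching the empirical rule.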
Probability to the left of \(z_i\)
\[P(Z \leq z_i)= \underbrace{\Phi(z_i)}_{\text{cdf of }z_i}\]
Probability to the right of \(z_i\)
\[P(Z \geq z_i)= 1-\underbrace{\Phi(z_i)}_{\text{cdf of }z_i}\]
Probability between \(z_1\) and \(z_2\)
\[P(z_1 \leq Z \leq z_2)= \underbrace{\Phi(z_2)}_{\text{cdf of }z_2} - \underbrace{\Phi(z_1)}_{\text{cdf of }z_1}\]
pnorm() calculates probabilities with a normal distribution, with arguments:
q = the value (the first argument, usually passed without a name)
mean = the mean
sd = the standard deviation
lower.tail = TRUE if looking at area to LEFT of value, FALSE if looking at area to RIGHT of value
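The three cases (left of a value, right of a value, between two values) might look like this with `pnorm()`. The \(Z\)-scores 1.5 and -0.5 are arbitrary illustrative values, and the value is passed positionally as the first argument:

```r
z <- 1.5  # arbitrary illustrative Z-score

# Area to the LEFT of z: P(Z <= 1.5)
p_left <- pnorm(z, mean = 0, sd = 1, lower.tail = TRUE)

# Area to the RIGHT of z: P(Z >= 1.5)
p_right <- pnorm(z, mean = 0, sd = 1, lower.tail = FALSE)

# Area BETWEEN two values: P(-0.5 <= Z <= 1.5) = cdf(1.5) - cdf(-0.5)
p_between <- pnorm(1.5) - pnorm(-0.5)

round(c(left = p_left, right = p_right, between = p_between), 4)
```

Since mean = 0 and sd = 1 are the defaults, they can be omitted when working with \(Z\)-scores.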