2.1 — Data 101 & Descriptive Statistics
ECON 480 • Econometrics • Fall 2022
Dr. Ryan Safner
Associate Professor of Economics
safner@hood.edu
ryansafner/metricsF22
metricsF22.classes.ryansafner.com
Problem for identification: endogeneity
Problem for inference: randomness
An independent variable \((X)\) is exogenous if its variation is unrelated to other factors that affect the dependent variable \((Y)\)
An independent variable \((X)\) is endogenous if its variation is related to other factors that affect the dependent variable \((Y)\)
Note: unfortunately this is different from how economists talk about “endogenous” vs. “exogenous” variables in theoretical models…
Data are random due to natural sampling variation
Common in statistics, easy to fix
Inferential Statistics: making claims about a wider population using sample data
Sample \(\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}\) Population \(\color{#e64173}{\xrightarrow{\text{causal identification}}}\) Unobserved Parameters
Data are information with context
Individuals are the entities described by a set of data
Categorical data place an individual or observation into one of several possible categories
In R: character or factor type data
A factor \(\implies\) specific possible categories

cut | n | frequency | percent |
---|---|---|---|
Fair | 1610 | 0.0298480 | 2.98 |
Good | 4906 | 0.0909529 | 9.10 |
Very Good | 12082 | 0.2239896 | 22.40 |
Premium | 13791 | 0.2556730 | 25.57 |
Ideal | 21551 | 0.3995365 | 39.95 |
A good way to represent categorical data is with a frequency table
Count (n): total number of individuals in a category
Frequency: proportion of a category’s occurrence relative to all data
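As a sketch (not necessarily the slides' exact code), this frequency table can be computed from ggplot2's built-in diamonds data with dplyr:

library(tidyverse)   # dplyr plus the diamonds data (ggplot2)

diamonds %>%
  count(cut) %>%                        # n: count per category
  mutate(frequency = n / sum(n),        # relative frequency
         percent = round(100 * frequency, 2))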
Charts and graphs are always better ways to visualize data
A bar graph represents categories as bars, with lengths proportional to the count or relative frequency of each category
Avoid pie charts!
People are not good at judging 2-d differences (angles, area)
People are good at judging 1-d differences (length)
library(tidyverse)   # ggplot2, dplyr
library(ggthemes)    # for theme_pander()

# Lollipop chart of diamond cuts: a point at each count plus a segment down to 0
diamonds %>%
  count(cut) %>%
  mutate(cut_name = as.factor(cut)) %>%
  ggplot(., aes(x = cut_name, y = n, color = cut)) +
  geom_point(fill = "black",
             size = 12) +
  geom_segment(aes(x = cut_name, y = 0,
                   xend = cut_name,
                   yend = n), size = 2) +
  geom_text(aes(label = n), color = "white", size = 3) +
  coord_flip() +
  labs(x = "Cut") +
  theme_pander(base_family = "Fira Sans Condensed",
               base_size = 20) +
  guides(color = "none")
Quantitative data take on numerical values
In R: numeric type data (integer if a whole number, double if it has decimals)
Discrete data are finite, with a countable number of alternatives
Categorical: place data into categories
Quantitative: integers
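A quick sketch of how R distinguishes these types, with values chosen purely for illustration:

typeof(2L)      # "integer": the L suffix makes a whole-number integer
typeof(2.5)     # "double": has decimals
typeof(2)       # "double": numeric values default to double, even when whole
class(factor(c("Fair", "Good", "Ideal")))   # "factor": categorical data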
id | name | age | sex | income |
---|---|---|---|---|
1 | John | 23 | Male | 41000 |
2 | Emile | 18 | Male | 52600 |
3 | Natalya | 28 | Female | 48000 |
4 | Lakisha | 31 | Female | 60200 |
5 | Cheng | 36 | Male | 81900 |
The most common data structure we use is a spreadsheet (in R: a data.frame or tibble)
A row contains data about all variables for a single individual
A column contains data about a single variable across all individuals
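As an illustration (the object name people is a placeholder, not from the slides), the example table above could be stored as a tibble:

library(tibble)

people <- tibble(
  id     = 1:5,
  name   = c("John", "Emile", "Natalya", "Lakisha", "Cheng"),
  age    = c(23, 18, 28, 31, 36),
  sex    = c("Male", "Male", "Female", "Female", "Male"),
  income = c(41000, 52600, 48000, 60200, 81900)
)
people   # each row is an individual, each column is a variable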
It is common to use some notation like the following:
Let \(\{x_1, x_2, \cdots, x_n\}\) be a simple data series on variable \(X\)
Cross-sectional data: observations of individuals at a given point in time
Each observation is a unique individual
\[x_{\color{#D7250E}{i}}\]
Simplest and most common data
A “snapshot” to compare differences across individuals
Year | GDP | Unemployment | CPI |
---|---|---|---|
1950 | 8.2 | 0.06 | 100 |
1960 | 9.9 | 0.04 | 118 |
1970 | 10.2 | 0.08 | 130 |
1980 | 12.4 | 0.08 | 190 |
1985 | 13.6 | 0.06 | 196 |
Time-series data: observations of the same individual(s) over time
Each observation is a time period
\[x_{\color{#0047AB}{t}}\]
Often used for macroeconomics, finance, and forecasting
Unique challenges for time series
A “moving picture” to see how individuals change over time
City | Year | Murders | Population | UR |
---|---|---|---|---|
Philadelphia | 1986 | 5 | 3.700 | 8.7 |
Philadelphia | 1990 | 8 | 4.200 | 7.2 |
D.C. | 1986 | 2 | 0.250 | 5.4 |
D.C. | 1990 | 10 | 0.275 | 5.5 |
New York | 1986 | 3 | 6.400 | 9.6 |
Panel data: observations of the same individuals over multiple time periods
Each observation is an individual \(i\) in time period \(t\)
\[x_{\color{#D7250E}{i}\color{#0047AB}{t}}\]
More common today for serious researchers; unique challenges and benefits
A combination of “snapshot” comparisons over time
Variables take on different values; we can describe a variable’s distribution of these values
We want to visualize and analyze distributions to search for meaningful patterns using statistics
Descriptive Statistics: describes or summarizes the properties of a sample
Inferential Statistics: infers properties about a larger population from the properties of a sample
library(tidyverse)

# quizzes: assumed to hold the 13 example quiz scores used later in this lecture
quizzes <- tibble(scores = c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95))

# Histogram of scores in bins of width 10
ggplot(quizzes, aes(x = scores)) +
  geom_histogram(breaks = seq(0, 100, 10),
                 color = "white",
                 fill = "#e64173") +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  scale_y_continuous(limits = c(0, 6), expand = c(0, 0)) +
  labs(x = "Scores",
       y = "Number of Students") +
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)
The mode of a variable is simply its most frequent value
A variable can have multiple modes
Surprisingly, there is no dedicated function for the statistical mode in R (the built-in mode() returns an object’s storage mode, not its most frequent value)
A workaround in dplyr:
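A minimal sketch of such a workaround, assuming the quizzes tibble of example scores defined with the histogram code above:

quizzes %>%
  count(scores) %>%      # frequency of each distinct score
  slice_max(n, n = 1)    # keep the most frequent value(s); ties mean multiple modes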
A distribution is symmetric if it looks roughly the same on either side of the “center”
The thinner ends (far left and far right) are called the tails of a distribution
Outlier: “extreme” value that does not appear part of the general pattern of a distribution
Can strongly affect descriptive statistics
Might be the most informative part of the data
Could be the result of errors
Should always be explored and discussed!
\[\mu=\frac{x_1+x_2+...+x_N}{N} = \frac{1}{N} \sum^N_{i=1} x_i\]
For \(N\) values of variable \(X\), \(\mu\) (“mu”) is the sum of all individual \(x\) values \((x_i)\) from \(1\) to \(N\), divided by the number of values \(N\)
See today’s appendix for more about the summation operator, \(\displaystyle\sum\), it’ll come up again!
\[\bar{x}=\frac{x_1+x_2+...+x_n}{n} = \frac{1}{n} \sum^n_{i=1} x_i\]
\[\begin{align*} \bar{x}&=\frac{1}{13}(0+62+66+71+71+74+76+79+83+86+88+93+95)\\ \bar{x}&=\frac{944}{13}\\ \bar{x}&=72.62\\ \end{align*}\]
\[\begin{align*} \bar{x}&=\frac{1}{12}(62+66+71+71+74+76+79+83+86+88+93+95)\\ &=\frac{944}{12}\\ &=78.67\\ \end{align*}\]
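As a quick check, both means can be verified in R (a sketch; scores here is simply the vector of the 13 quiz grades):

scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)
mean(scores)                  # 72.62, including the outlier of 0
mean(scores[scores != 0])     # 78.67, after dropping the outlier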
\[\{0, 62, 66, 71, 71, 74, \mathbf{76}, 79, 83, 86, 88, 93, 95\}\]
The more variation in the data, the less a measure of central tendency tells us
Beyond just the center, we also want to measure the spread
The simplest metric is the range \(= \max - \min\)
R’s summary() function reports the five-number summary (plus the mean):

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 71.00 76.00 72.62 86.00 95.00
Boxplots are a great way to visualize the 5 number summary
Height of box: \(Q_1\) to \(Q_3\) (known as interquartile range (IQR), middle 50% of data)
Line inside box: median (50th percentile)
“Whiskers” identify data within \(1.5 \times IQR\)
Points beyond whiskers are outliers
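A sketch of such a boxplot for the example quiz scores, again assuming the quizzes tibble from the histogram code:

ggplot(quizzes, aes(y = scores)) +
  geom_boxplot() +    # box: Q1 to Q3; middle line: median; points: outliers
  theme_bw()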
I don’t like the options available for printing out summary statistics
So I wrote my own R function called summary_table() that makes nice summary tables (it uses dplyr and tidyr!). To use:
Download the summaries.R file from the website and move it to your working directory/project folder
Load the function with the source() command:
Use the function on a data.frame (automatically piped in if you use the pipe!) and then all variables you want to summarize, separated by commas (see the usage sketch after the table)
Rendered in Quarto, it looks nicer:

Variable | Obs | Min | Q1 | Median | Q3 | Max | Mean | Std. Dev. |
---|---|---|---|---|---|---|---|---|
cty | 234 | 9 | 14 | 17 | 19 | 35 | 16.86 | 4.26 |
cyl | 234 | 4 | 4 | 6 | 8 | 8 | 5.89 | 1.61 |
hwy | 234 | 12 | 18 | 24 | 27 | 44 | 23.44 | 5.95 |
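A minimal usage sketch following those steps, assuming summaries.R sits in the working directory and defines summary_table() as described (mpg ships with ggplot2):

library(tidyverse)
source("summaries.R")            # load the summary_table() function

mpg %>%
  summary_table(cty, cyl, hwy)   # summarize these three variables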
\[deviation_i = x_i-\mu \]
There are as many deviations as there are data points \((n)\)
We can measure the average or standard deviation of a variable from its mean
Before we get there…
\[\sigma^2 = \frac{1}{N}\displaystyle\sum^N_{i=1} (x_i-\mu)^2\]
\[\sigma=\sqrt{\sigma^2}=\sqrt{\frac{1}{N}\displaystyle\sum^N_{i=1} (x_i-\mu)^2 }\]
\[s^2 = \frac{1}{n-1}\displaystyle\sum^n_{i=1} (x_i-\bar{x})^2\]
\[s=\sqrt{s^2}=\sqrt{\frac{1}{n-1}\displaystyle\sum^n_{i=1} (x_i-\bar{x})^2 }\]
sd(), Coded I, II, and III: the standard deviation computed step by step in R (an intermediate output is a 1 × 1 tibble showing mean(x) = 6)
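A minimal sketch of that step-by-step calculation, assuming a hypothetical example vector x with mean 6 (e.g., x = c(2, 4, 6, 8, 10)):

library(tidyverse)

df <- tibble(x = c(2, 4, 6, 8, 10))    # hypothetical example data, mean = 6

df %>%
  summarize(mean(x))                   # a 1 x 1 tibble showing the mean, as above

df %>%
  mutate(deviation = x - mean(x),      # deviations from the mean
         dev_sq    = deviation^2) %>%  # squared deviations
  summarize(variance = sum(dev_sq) / (n() - 1),   # sample variance s^2
            std_dev  = sqrt(variance))            # sample standard deviation s

sd(df$x)                               # the built-in sd() gives the same answer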
Population:
Population size: \(N\)
Mean: \(\mu\)
Variance: \(\sigma^2=\frac{1}{N} \displaystyle\sum^N_{i=1} (x_i-\mu)^2\)
Standard deviation: \(\sigma = \sqrt{\sigma^2}\)
Sample:
Sample size: \(n\)
Mean: \(\bar{x}\)
Variance: \(s^2=\frac{1}{n-1} \displaystyle\sum^n_{i=1} (x_i-\bar{x})^2\)
Standard deviation: \(s = \sqrt{s^2}\)