2.1 — Data 101 & Descriptive Statistics

ECON 480 • Econometrics • Fall 2022

Dr. Ryan Safner
Associate Professor of Economics

Contents

The Two Big Problems with Data

Data 101

Descriptive Statistics

Measures of Center

Measures of Dispersion

The Two Big Problems with Data

Two Big Problems with Data

• We want to use econometrics to identify causal relationships and make inferences about them
1. Problem for identification: endogeneity

2. Problem for inference: randomness

Identification Problem: Endogeneity

• An independent variable $(X)$ is exogenous if its variation is unrelated to other factors that affect the dependent variable $(Y)$

• An independent variable $(X)$ is endogenous if its variation is related to other factors that affect the dependent variable $(Y)$

• Note: unfortunately this is different from how economists talk about “endogenous” vs. “exogenous” variables in theoretical models…

Identification Problem: Endogeneity

• An independent variable $(X)$ is exogenous if its variation is unrelated to other factors that affect the dependent variable $(Y)$

Identification Problem: Endogeneity

• An independent variable $(X)$ is endogenous if its variation is related to other factors that affect the dependent variable $(Y)$, e.g. $Z$

Inference Problem: Randomness

• Data is random due to natural sampling variation

• Taking one sample of a population will yield slightly different information than another sample of the same population
• Common in statistics, easy to fix

• Inferential Statistics: making claims about a wider population using sample data

• We use common tools and techniques to deal with randomness

The Two Problems: Where We’re Heading…Ultimately

Sample $\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}$ Population $\color{#e64173}{\xrightarrow{\text{causal indentification}}}$ Unobserved Parameters

• We want to identify causal relationships between population variables
• Logically first thing to consider
• Endogeneity problem
• We’ll use sample statistics to infer something about population parameters
• In practice, we’ll only ever have a finite sample distribution of data
• We don’t know the population distribution of data
• Randomness problem

Data 101

Data 101

• Data are information with context

• Individuals are the entities described by a set of data

• e.g. persons, households, firms, countries

Data 101

• Variables are particular characteristics about an individual
• e.g. age, income, profits, population, GDP, marital status, type of legal institutions
• Observations or cases are the separate individuals described by a collection of variables
• e.g. for one individual, we have their age, sex, income, education, etc.
• individuals and observations are not necessarily the same:
• e.g. we can have multiple observations on the same individual over time

Categorical Variables

• Categorical variables place an individual into one of several possible categories
• e.g. sex, season, political party
• may be responses to survey questions
• can be quantitative (e.g. age, zip code)
• In R: character or factor type data
• factor $\implies$ specific possible categories

Categorical Variables: Visualizing I

diamonds %>%
count(cut) %>%
mutate(frequency = n / sum(n),
percent = round(frequency * 100, 2))

Summary of diamonds by cut
cut n frequency percent
Fair 1610 0.0298480 2.98
Good 4906 0.0909529 9.10
Very Good 12082 0.2239896 22.40
Premium 13791 0.2556730 25.57
Ideal 21551 0.3995365 39.95
• Good way to represent categorical data is with a frequency table

• Count (n): total number of individuals in a category

• Frequency: proportion of a category’s occurrence relative to all data

• Multiply proportions by 100% to get percentages

Categorical Variables: Visualizing II

• Charts and graphs are always better ways to visualize data

• A bar graph represents categories as bars, with lengths proportional to the count or relative frequency of each category

ggplot(diamonds, aes(x=cut,
fill=cut))+
geom_bar()+
guides(fill=F)+
theme_pander(base_family = "Fira Sans Condensed",
base_size=20)

Categorical Data: Pie Charts

• Avoid pie charts!

• People are not good at judging 2-d differences (angles, area)

• People are good at judging 1-d differences (length)

Categorical Data: Pie Charts

• Avoid pie charts!

• People are not good at judging 2-d differences (angles, area)

• People are good at judging 1-d differences (length)

Categorical Data: Alternatives to Pie Charts I

• Try something else: a stacked bar chart
diamonds %>%
count(cut) %>%
ggplot(data = .)+
aes(x = "",
y = n)+
geom_col(aes(fill = cut))+
geom_label(aes(label = cut,
color = cut),
position = position_stack(vjust = 0.5)
)+
guides(color = F,
fill = F)+
theme_void()

Categorical Data: Alternatives to Pie Charts II

• Try something else: a lollipop chart
diamonds %>%
count(cut) %>%
mutate(cut_name = as.factor(cut)) %>%
ggplot(., aes(x = cut_name, y = n, color = cut))+
geom_point(stat="identity",
fill="black",
size=12)  +
geom_segment(aes(x = cut_name, y = 0,
xend = cut_name,
yend = n), size = 2)+
geom_text(aes(label = n),color="white", size=3) +
coord_flip()+
labs(x = "Cut")+
theme_pander(base_family = "Fira Sans Condensed",
base_size=20)+
guides(color = F)

Categorical Data: Alternatives to Pie Charts III

• Try something else: a treemap
library(treemapify)
diamonds %>%
count(cut) %>%
ggplot(., aes(area = n, fill = cut)) +
geom_treemap() +
guides(fill = FALSE) +
geom_treemap_text(aes(label = cut),
colour = "white",
place = "topleft",
grow = TRUE)

Quantitative Data I

• Quantitative variables take on numerical values of equal units that describe an individual
• Units: points, dollars, inches
• Context: GPA, prices, height
• We can mathematically manipulate only quantitative data
• e.g. sum, average, standard deviation
• In R: numeric type data
• integer if whole number
• double if has decimals

Discrete Data

• Discrete data are finite, with a countable number of alternatives

• Categorical: place data into categories

• e.g. letter grades: A, B, C, D, F
• e.g. class level: freshman, sophomore, junior, senior
• Quantitative: integers

• e.g. SAT Score, number of children, age (years)

Continuous Data

• Continuous data are infinitely divisible, with an uncountable number of alternatives
• e.g. weight, length, temperature, GPA
• Many discrete variables may be treated as if they are continuous
• e.g. SAT scores (whole points), wages (dollars and cents)

id name age sex income
1 John 23 Male 41000
2 Emile 18 Male 52600
3 Natalya 28 Female 48000
4 Lakisha 31 Female 60200
5 Cheng 36 Male 81900
• The most common data structure we use is a spreadsheet

• In R: a data.frame or tibble
• A row contains data about all variables for a single individual

• A column contains data about a single variable across all individuals

id name age sex income
1 John 23 Male 41000
2 Emile 18 Male 52600
3 Natalya 28 Female 48000
4 Lakisha 31 Female 60200
5 Cheng 36 Male 81900
• Each cell can be referenced by its row and column (in that order!), df[row,column]

example[3,2] # value in row 3, column 2
# A tibble: 1 × 1
name
<chr>
1 Natalya

• Recall with tidyverse you can do this with select() and filter() or slice()

• It is common to use some notation like the following:

• Let $\{x_1, x_2, \cdots, x_n\}$ be a simple data series on variable $X$

• $n$ individual observations
• $x_i$ is the value of the $i$th observation for $i=1,2,\cdots, n$

Quick Check

Let $x$ represent the score on a homework assignment:

$75, 100, 92, 87, 79, 0, 95$

1. What is $n$?
2. What is $x_1$?
3. What is $x_6$?

Datasets: Cross-Sectional

id name age sex income
1 John 23 Male 41000
2 Emile 18 Male 52600
3 Natalya 28 Female 48000
4 Lakisha 31 Female 60200
5 Cheng 36 Male 81900
• Cross-sectional data: observations of individuals at a given point in time

• Each observation is a unique individual

$x_{\color{#D7250E}{i}}$

• Simplest and most common data

• A “snapshot” to compare differences across individuals

Datasets: Time-Series

Year GDP Unemployment CPI
1950 8.2 0.06 100
1960 9.9 0.04 118
1970 10.2 0.08 130
1980 12.4 0.08 190
1985 13.6 0.06 196
• Time-series data: observations of the same individual(s) over time

• Each observation is a time period

$x_{\color{#0047AB}{t}}$

• Often used for macroeconomics, finance, and forecasting

• Unique challenges for time series

• A “moving picture” to see how individuals change over time

Datasets: Panel

City Year Murders Population UR
Philadelphia 1986 5 3.700 8.7
Philadelphia 1990 8 4.200 7.2
D.C. 1986 2 0.250 5.4
D.C. 1990 10 0.275 5.5
New York 1986 3 6.400 9.6
• Panel, or longitudinal dataset: a time-series for each cross-sectional entity
• Must be same individuals over time
• Each obs. is an individual in a time period

$x_{\color{#D7250E}{i}\color{#0047AB}{t}}$

• More common today for serious researchers; unique challenges and benefits

• A combination of “snapshot” comparisons over time

Descriptive Statistics

Variables and Distributions

• Variables take on different values, we can describe a variable’s distribution (of these values)

• We want to visualize and analyze distributions to search for meaningful patterns using statistics

Two Branches of Statistics

• Two main branches of statistics:
1. Descriptive Statistics: describes or summarizes the properties of a sample

2. Inferential Statistics: infers properties about a larger population from the properties of a sample1

Histogram

• A common way to present a quantitative variable’s distribution is a histogram
• The quantitative analog to the bar graph for a categorical variable
• Divide up values into bins of a certain size, and count the number of values falling within each bin, representing them visually as bars

Histogram: Bin Size

• A common way to present a quantitative variable’s distribution is a histogram
• The quantitative analog to the bar graph for a categorical variable
• Divide up values into bins of a certain size, and count the number of values falling within each bin, representing them visually as bars
• Changing the bin-width will affect the bars

Histogram: Example

Example

A class of 13 students takes a quiz (out of 100 points) with the following results:

$\{ 0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95 \}$

Histogram: Example

Example

A class of 13 students takes a quiz (out of 100 points) with the following results:

$\{ 0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95 \}$

ggplot(quizzes,aes(x=scores))+
geom_histogram(breaks = seq(0,100,10),
color = "white",
fill = "#e64173")+
scale_x_continuous(breaks = seq(0,100,10))+
scale_y_continuous(limits = c(0,6), expand = c(0,0))+
labs(x = "Scores",
y = "Number of Students")+
theme_bw(base_family = "Fira Sans Condensed",
base_size=20)

Descriptive Statistics

• We are often interested in the shape or pattern of a distribution, particularly:
• Measures of center
• Measures of dispersion
• Shape of distribution

Measures of Center

Mode

• The .himode of a variable is simply its most frequent value

• A variable can have multiple modes

Example

A class of 13 students takes a quiz (out of 100 points) with the following results:

$\{ 0, 62, 66, \mathbf{71}, \mathbf{71}, 74, 76, 79, 83, 86, 88, 93, 95 \}$

Mode

• There is no dedicated mode() function in R, surprisingly

• A workaround in dplyr:

quizzes %>%
count(scores) %>%
arrange(desc(n))

Multi-Modal Distributions

• Looking at a histogram, the modes are the “peaks” of the distribution
• Note: depends on how wide you make the bins!
• May be unimodal, bimodal, trimodal, etc

Symmetry and Skew I

• A distribution is symmetric if it looks roughly the same on either side of the “center”

• The thinner ends (far left and far right) are called the tails of a distribution

Symmetry and Skew I

• If one tail stretches farther than the other, distribution is skewed in the direction of the longer tail
• In this example, skewed to the left

Outliers

• Outlier: “extreme” value that does not appear part of the general pattern of a distribution

• Can strongly affect descriptive statistics

• Might be the most informative part of the data

• Could be the result of errors

• Should always be explored and discussed!

Arithmetic Mean (Population)

• The natural measure of the center of a population’s distribution is its “average” or arithmetic mean $\mu$

$\mu=\frac{x_1+x_2+...+x_N}{N} = \frac{1}{N} \sum^N_{i=1} x_i$

• For $N$ values of variable $x$, “mu” is the sum of all individual $x$ values $(x_i)$ from 1 to $N$, divided by the $N$ number of values1

• See today’s appendix for more about the summation operator, $\displaystyle\sum$, it’ll come up again!

Arithmetic Mean (Sample)

• When we have a sample, we compute the sample mean $\bar{x}$

$\bar{x}=\frac{x_1+x_2+...+x_n}{n} = \frac{1}{n} \sum^n_{i=1} x_i$

• For $n$ values of variable $x$, “x-bar” is the sum of all individual $x$ values $(x_i)$ divided by the $n$ number of values

Arithmetic Mean (Sample)

Example

$\{0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\}$

\begin{align*} \bar{x}&=\frac{1}{13}(0+62+66+71+71+74+76+79+83+86+88+93+95)\\ \bar{x}&=\frac{944}{13}\\ \bar{x}&=72.62\\ \end{align*}

quizzes %>%
summarize(mean = mean(scores))
# A tibble: 1 × 1
mean
<dbl>
1  72.6

Arithmetic Mean: Affected by Outliers

Example: If we drop the outlier (0)

$\{62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\}$

\begin{align*} \bar{x}&=\frac{1}{12}(62+66+71+71+74+76+79+83+86+88+93+95)\\ &=\frac{944}{12}\\ &=78.67\\ \end{align*}

quizzes %>%
filter(scores > 0) %>%
summarize(mean = mean(scores))
# A tibble: 1 × 1
mean
<dbl>
1  78.7

Median

$\{0, 62, 66, 71, 71, 74, \mathbf{76}, 79, 83, 86, 88, 93, 95\}$

• The median is the midpoint of the distribution
• 50% to the left of the median, 50% to the right of the median
• Arrange values in numerical order
• For odd $n$: median is middle observation
• For even $n$: median is average of two middle observations

Mean, Median, Symmetry, & Skew I

• Symmetric distribution: mean $\approx$ median
symmetric %>%
summarize(mean = mean(x),
median = median(x))
# A tibble: 1 × 2
mean median
<dbl>  <dbl>
1     4      4

Mean, Median, Symmetry, & Skew II

• Left-skewed: mean $<$ median
leftskew %>%
summarize(mean = mean(x),
median = median(x))
# A tibble: 1 × 2
mean median
<dbl>  <dbl>
1  4.62      5

Mean, Median, Symmetry, & Skew III

• Right-skewed: mean $>$ median
rightskew %>%
summarize(mean = mean(x),
median = median(x))
# A tibble: 1 × 2
mean median
<dbl>  <dbl>
1  3.38      3

Measures of Dispersion

Range

• The more variation in the data, the less helpful a measure of central tendency will tell us

• Beyond just the center, we also want to measure the spread

• Simplest metric is range $=max-min$

Five Number Summary I

• Common set of summary statistics of a distribution: “five number summary”:
1. Minimum value
2. 25th percentile $(Q_1$, median of first 50% of data)
3. 50th percentile (median, $Q_2)$
4. 25th percentile $(Q_3$, median of last 50% of data)
5. Maximum value
# Base R summary command
summary(quizzes\$scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00   71.00   76.00   72.62   86.00   95.00 

quizzes %>% # dplyr
summarize(Min = min(scores),
Q1 = quantile(scores, 0.25),
Median = median(scores),
Q3 = quantile(scores, 0.75),
Max = max(scores))
# A tibble: 1 × 5
Min    Q1 Median    Q3   Max
<dbl> <dbl>  <dbl> <dbl> <dbl>
1     0    71     76    86    95

Five Number Summary II

• The $n$th percentile of a distribution is the value that places $n$ percent of values beneath it
quizzes %>%
summarize("37th percentile" = quantile(scores,0.37))
# A tibble: 1 × 1
37th percentile
<dbl>
1              72.3

Boxplot I

• Boxplots are a great way to visualize the 5 number summary

• Height of box: $Q_1$ to $Q_3$ (known as interquartile range (IQR), middle 50% of data)

• Line inside box: median (50th percentile)

• “Whiskers” identify data within $1.5 \times IQR$

• Points beyond whiskers are outliers

• common definition: Outlier $>1.5 \times IQR$

Boxplot Comparisons I

• Boxplots (and five number summaries) are great for comparing two distributions

Example

\begin{align*} \text{Quiz 1}&: \{0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\} \\ \text{Quiz 2}&: \{50, 62, 72, 73, 79, 81, 82, 82, 86, 90, 94, 98, 99\} \\ \end{align*}

Boxplot Comparisons II

quizzes_new %>% summary()
    student       quiz_1          quiz_2
Min.   : 1   Min.   : 0.00   Min.   :50.00
1st Qu.: 4   1st Qu.:71.00   1st Qu.:73.00
Median : 7   Median :76.00   Median :82.00
Mean   : 7   Mean   :72.62   Mean   :80.62
3rd Qu.:10   3rd Qu.:86.00   3rd Qu.:90.00
Max.   :13   Max.   :95.00   Max.   :99.00  

Aside: Making Nice Summary Tables I

• I don’t like the options available for printing out summary statistics

• So I wrote my own R function called summary_table() that makes nice summary tables (it uses dplyr and tidyr!). To use:

1. Download the summaries.R file from the website1 and move it to your working directory/project folder

2. Load the function with the source() command:2

source("summaries.R")

Aside: Making Nice Summary Tables II

1. The function has at least 2 arguments: the data.frame (automatically piped in if you use the pipe!) and then all variables you want to summarize, separated by commas1
mpg %>%
summary_table(hwy, cty, cyl)
# A tibble: 3 × 9
Variable   Obs   Min    Q1 Median    Q3   Max  Mean Std. Dev.
<chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>       <dbl>
1 cty        234     9    14     17    19    35 16.9         4.26
2 cyl        234     4     4      6     8     8  5.89        1.61
3 hwy        234    12    18     24    27    44 23.4         5.95

Aside: Making Nice Summary Tables III

1. When rendered in Quarto, it looks nicer:
mpg %>%
summary_table(hwy, cty, cyl) %>%
knitr::kable(., format="html")
Variable Obs Min Q1 Median Q3 Max Mean Std. Dev.
cty 234 9 14 17 19 35 16.86 4.26
cyl 234 4 4 6 8 8 5.89 1.61
hwy 234 12 18 24 27 44 23.44 5.95

Measures of Dispersion: Deviations

• Every observation $i$ deviates from the mean of the data:

$deviation_i = x_i-\mu$

• There are as many deviations as there are data points $(n)$

• We can measure the average or standard deviation of a variable from its mean

• Before we get there…

Variance (Population)

• The population variance $\sigma^2$ of a population distribution measures the average of the squared deviations from the population mean $(\mu)$

$\sigma^2 = \frac{1}{N}\displaystyle\sum^N_{i=1} (x_i-\mu)^2$

• Why do we square deviations?
• What are these units?

Standard Deviation (Population)

• Square root the variance to get the population standard deviation $\sigma$, the average deviation from the population mean (in same units as $x$)

$\sigma=\sqrt{\sigma^2}=\sqrt{\frac{1}{N}\displaystyle\sum^N_{i=1} (x_i-\mu)^2 }$

Variance (Sample)

• The sample variance $s^2$ of a sample distribution measures the average of the squared deviations from the sample mean $(\bar{x})$

$\sigma^2 = \frac{1}{n-1}\displaystyle\sum^n_{i=1} (x_i-\bar{x})^2$

• Why do we divide by $n-1$?

Standard Deviation (Sample)

• Square root the sample variance to get the sample standard deviation $s$, the average deviation from the sample mean (in same units as $x$)

$s=\sqrt{s^2}=\sqrt{\frac{1}{n-1}\displaystyle\sum^n_{i=1} (x_i-\bar{x})^2 }$

Sample Standard Deviation: Example

Example

Calculate the sample standard deviation for the following series:

$\{2, 4, 6, 8, 10 \}$

sd(c(2,4,6,8,10))
[1] 3.162278

The Steps to Calculate sd(), Coded I

#  first let's save our data in a tibble
sd_example <- tibble(x = c(2,4,6,8,10))

# first find the mean (just so we know)

sd_example %>%
summarize(mean(x))
# A tibble: 1 × 1
mean(x)
<dbl>
1         6

# now let's make some more columns:
sd_example <- sd_example %>%
mutate(deviations = x - mean(x), # take deviations from mean
deviations_sq = deviations^2) # square them

The Steps to Calculate sd(), Coded II

sd_example # see what we made
# A tibble: 5 × 3
x deviations deviations_sq
<dbl>      <dbl>         <dbl>
1     2         -4            16
2     4         -2             4
3     6          0             0
4     8          2             4
5    10          4            16

The Steps to Calculate sd(), Coded III

sd_example %>%
# sum the squared deviations
summarize(sum_sq_devs = sum(deviations_sq),
# divide by n-1 to get variance
variance = sum_sq_devs/(n()-1),
# square root to get sd
std_dev = sqrt(variance)) 
# A tibble: 1 × 3
sum_sq_devs variance std_dev
<dbl>    <dbl>   <dbl>
1          40       10    3.16

Sample Standard Deviation: You Try

Example

Calculate the sample standard deviation for the following series:

$\{1, 3, 5, 7 \}$

sd(c(1,3,5,7))
[1] 2.581989

Descriptive Statistics: Populations vs. Samples

Population parameters

• Population size: $N$

• Mean: $\mu$

• Variance: $\sigma^2=\frac{1}{N} \displaystyle\sum^N_{i=1} (x_i-\mu)^2$

• Standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample statistics

• Population size: $n$

• Mean: $\bar{x}$

• Variance: $s^2=\frac{1}{n-1} \displaystyle\sum^n_{i=1} (x_i-\bar{x})^2$

• Standard deviation: $s = \sqrt{s^2}$