
In your own words, explain the difference between endogeneity and exogeneity.

An *exogenous* model is one where the independent variable \((X)\) is not associated with any other factors that affect the dependent variable \((Y)\). If a model is truly exogenous, we can estimate the **causal effect** of \(X\) on \(Y\).

An *endogenous* model is one where the independent variable \((X)\) *is* associated with other factors that affect the dependent variable \((Y)\). If a model is endogenous, we cannot accurately estimate the causal effect of \(X\) on \(Y\), since those other factors get entangled with \(X\) and \(Y\).
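To make this concrete, here is a minimal simulation sketch in `R` (the setup and variable names `ability`, `x`, and `y` are invented for illustration, not part of the assignment): when an unobserved factor affects both \(X\) and \(Y\), a regression of \(Y\) on \(X\) alone mis-estimates the causal effect.

```r
# hypothetical example: ability affects both X (e.g. education) and Y (e.g. wages)
set.seed(42)
n <- 1000
ability <- rnorm(n)                  # unobserved confounder
x <- 2 * ability + rnorm(n)          # X is associated with ability: endogenous
y <- 1 * x + 3 * ability + rnorm(n)  # true causal effect of X on Y is 1

coef(lm(y ~ x))["x"]            # biased upward: picks up ability's effect too
coef(lm(y ~ x + ability))["x"]  # controlling for ability recovers roughly 1
```

The first regression attributes part of `ability`'s effect to `x`; only when the confounder is held constant (making the model exogenous) does the estimate approach the true value.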

In your own words, explain what (sample) standard deviation *means*.

Sample standard deviation measures, roughly, the average deviation (distance) of any given value of a variable from the variable’s mean.

In your own words, explain how (sample) standard deviation *is calculated*. You may also write the formula, but it is not necessary.

The formula is \[sd(X)=\sqrt{\frac{\displaystyle \sum^n_{i=1} (x_i-\bar{X})^2}{n-1}}\]

It helps to consider the calculation as a series of steps:

- Find the mean of \(X\), \((\bar{X})\)
- Subtract the mean from each value of \(X\) in the data, to get deviations, \((x_i-\bar{X})\)
- Square the deviations to ensure they are all positive, \((x_i-\bar{X})^2\)
- Take the average of the squared deviations: add them all up and divide by \(n-1\), \(\frac{1}{n-1}\displaystyle \sum^n_{i=1} (x_i-\bar{X})^2\)
- What you have in step 4 is variance (measured in units of \(X^2)\), square root to get standard deviation (measured in original units of \(X\)): \(\sqrt{\frac{1}{n-1}\displaystyle \sum^n_{i=1} (x_i-\bar{X})^2}\)

Note on Step 4: because this is a *sample*, we have to deal with **degrees of freedom (df)** loss. We use up one df to calculate the mean, \(\bar{X}\), which is needed before calculating variance or standard deviation. Hence, instead of averaging like normal, \(\frac{1}{n} \sum x_i\), we need to divide by \(n-1\).
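The five steps above can be sketched in `R` with an arbitrary example vector (the data here is made up purely for illustration):

```r
x <- c(2, 4, 6, 8)

x_bar <- mean(x)                            # step 1: the mean
devs <- x - x_bar                           # step 2: deviations from the mean
devs_sq <- devs^2                           # step 3: squared deviations
variance <- sum(devs_sq) / (length(x) - 1)  # step 4: divide by n - 1
sqrt(variance)                              # step 5: square root; same as sd(x)
```

The last line matches `R`'s built-in `sd(x)`, which also uses the \(n-1\) denominator.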

For the remaining questions, you may use `R` to *verify*, but please calculate all sample statistics by hand and show all work.

Suppose you have a very small class of four students that all take a quiz. Their scores are reported as follows:

\[\{83, 92, 72, 81 \} \]

Calculate the median.

Arrange the values in numerical order from smallest to largest. Find the value in the middle (i.e. an equal number of values are on either side), possibly by crossing out one number on either side at a time (like in elementary school).

\[\underline{72}, 81, 83, \underline{92}\]

Since we have an even number of observations, we have two numbers in the middle, \(81\) and \(83\), so we must take the average of them:

\[\frac{81+83}{2}=82\]

Using R:

`median(c(83,92,72,81))`

`[1] 82`

Calculate the sample mean, \(\bar{x}\).

\[\begin{align*}\bar{x}&=\frac{1}{n} \sum^n_{i=1} x_i\\ \bar{x}&=\frac{72+81+83+92}{4}\\ \bar{x}&=\frac{328}{4}\\ \bar{x}&=82\\ \end{align*}\]

Using R:

`mean(c(83,92,72,81))`

`[1] 82`

Calculate the sample standard deviation, \(s\).

My suggestion is to use the “table” method, and follow the 5 steps described in problem 2b.

\(x_i\) | \(x_i-\bar{X}\) | \((x_i-\bar{X})^2\) |
---|---|---|
\(72\) | \(-10\) | \(100\) |
\(81\) | \(-1\) | \(1\) |
\(83\) | \(1\) | \(1\) |
\(92\) | \(10\) | \(100\) |
\(\sum\) | | \(202\) |
\(\frac{1}{3} \times \sum\) | | \(\approx 67.33\) |
\(\mathbf{\sqrt{\frac{1}{3} \times \sum}}\) | | \(\approx \mathbf{8.21}\) |

In R:

`sd(c(83,92,72,81))`

`[1] 8.205689`
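Each intermediate column of the table can also be checked in `R`, step by step:

```r
x <- c(72, 81, 83, 92)
devs <- x - mean(x)        # -10, -1, 1, 10
devs_sq <- devs^2          # 100, 1, 1, 100
sum(devs_sq)               # 202
sum(devs_sq) / 3           # variance, about 67.33
sqrt(sum(devs_sq) / 3)     # about 8.21, matches sd(x)
```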

Make or sketch a rough histogram of this data, with the size of each bin being 10 (i.e. 70’s, 80’s, 90’s, 100’s). You can draw this by hand or use `R`.

If you are using `ggplot`, you want to use `+ geom_histogram(breaks = seq(start, end, by))` and add another layer `+ scale_x_continuous(breaks = seq(start, end, by))`. The first layer creates bins in the histogram, and the second layer creates ticks on the x-axis; both by creating a `seq`uence starting at some `start`ing value, up to some `end`ing value, `by` a certain interval (e.g. by 2, or by 10).

Is this distribution roughly symmetric or skewed? What would we expect about the mean and the median?

```
# write your code here!
# load tidyverse (for tibble and ggplot2)
library(tidyverse)
```

```
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
```

```
# make a dataframe of our data, called df,
# with one variable in it, called quiz
df <- tibble(quiz = c(83, 92, 72, 81))

# use this as our data for plot
ggplot(data = df)+
  aes(x = quiz)+
  geom_histogram(breaks = seq(0, 100, 10), # make bins of size 10 between 0 and 100
                 color = "white", # color is for borders
                 fill = "#e64173")+ # fill is for area
  scale_x_continuous(breaks = seq(0, 100, 10))+ # have x-axis ticks same as breaks
  scale_y_continuous(limits = c(0, 2.25),
                     expand = c(0, 0))+
  theme_classic()
```

Suppose instead the person who got the 72 did not show up that day to class, and got a 0 instead. Recalculate the mean and median. What happened and why?

\[\underline{0}, 81, 83, \underline{92}\]

\[\frac{81+83}{2}=82\]

Replacing the 72 with a 0, and keeping the same number of observations does not change the median!

\[\begin{align*}\bar{x}&=\frac{1}{n} \sum^n_{i=1} x_i\\ \bar{x}&=\frac{0+81+83+92}{4}\\ \bar{x}&=\frac{256}{4}\\ \bar{x}&=64\\ \end{align*}\]

The mean is pulled down significantly by the outlier.

In R:

`mean(c(83,92,0,81))`

`[1] 64`

`median(c(83,92,0,81))`

`[1] 82`

If we were to look at the histogram, it would be skewed (to the left), and the mean would be lower than the median:

```
# make new tibble called df_2
df_2 <- tibble(quiz = c(83, 92, 0, 81)) # replace 72 with 0

# use this as our data for plot
ggplot(data = df_2)+
  aes(x = quiz)+
  geom_histogram(breaks = seq(0, 100, 10), # make bins of size 10 between 0 and 100
                 color = "white", # color is for borders
                 fill = "#e64173")+ # fill is for area
  geom_vline(aes(xintercept = median(quiz)),
             size = 1,
             color = "green",
             linetype = "dashed")+ # green dashed line is median
  geom_label(aes(x = median(quiz), y = 1),
             label = "Median",
             color = "green")+ # label median line
  geom_vline(aes(xintercept = mean(quiz)),
             size = 1,
             color = "red",
             linetype = "dotted")+ # red dotted line is mean
  geom_label(aes(x = mean(quiz), y = 1),
             label = "Mean",
             color = "red")+ # label mean line
  scale_x_continuous(breaks = seq(0, 100, 10))+ # have x-axis ticks same as breaks
  scale_y_continuous(limits = c(0, 2.5),
                     expand = c(0, 0))+
  theme_classic()
```

Suppose the probabilities of a visitor to Amazon’s website buying 0, 1, or 2 books are 0.2, 0.4, and 0.4 respectively.

Calculate the *expected number* of books a visitor will purchase.

\(x_i\) | \(P(X=x_i)\) |
---|---|
0 | 0.20 |
1 | 0.40 |
2 | 0.40 |

The expected value of \(X\) is the probability weighted average of \(X\):

\[\begin{align*}E(X)&=\sum^n_{i=1} p_ix_i\\ &=0.2(0)+(0.4)1+(0.4)2\\ &=0+0.4+0.8\\ &=1.2\\ \end{align*}\]

Calculate the *standard deviation* of book purchases.

The formula(s) for standard deviation of a random variable is:

\[\sigma_X = sd(X) = \sqrt{E[(X-E[X])^2]} = \sqrt{\sum^n_{i=1}p_i(x_i-E[X])^2}\]

I suggest using the table method, again. Working from the inside out of the formula, the steps are:

- Find the expected value of \(X\), \(E[X]\).
- Subtract the expected value from each value of \(X\) in the data, to get deviations, \((x_i-E[X])\)
- Square the deviations to ensure they are all positive, \((x_i-E[X])^2\)
- Take the probability-weighted average of the squared deviations: multiply each squared deviation by the probability of its associated \(x\) value and add them all up \(\displaystyle \sum^n_{i=1} p_i(x_i-E[X])^2\)
- What you have in step 4 is variance (measured in units of \(X^2)\), square root to get standard deviation (measured in original units of \(X\)): \(\sqrt{\displaystyle \sum^n_{i=1} p_i(x_i-E[X])^2}\)

\(x_i\) | \(P(X=x_i)\) | \(x_i-E[X]\) | \((x_i-E[X])^2\) | \(p_i(x_i-E[X])^2\) |
---|---|---|---|---|
0 | 0.20 | \(-1.20\) | \(1.44\) | \(0.288\) |
1 | 0.40 | \(-0.20\) | \(0.04\) | \(0.016\) |
2 | 0.40 | \(0.80\) | \(0.64\) | \(0.256\) |
\(\sum\) | | | | \(0.560\) |
\(\mathbf{\sqrt{\sum}}\) | | | | \(\approx \mathbf{0.748}\) |

Try doing this in `R` by making an initial tibble of the data, and then adding new columns to the “table” like we did in class.

```
# write your code here!
# make a dataframe called "amazon" of # of books and associated probabilities
amazon <- tibble(books = c(0, 1, 2),
                 prob = c(0.2, 0.4, 0.4))

# look at it
amazon
```

```
# find expected value
amazon %>%
  summarize(exp_value = sum(books * prob))
```

```
# it's 1.2, let's save exp_value
exp_value <- 1.2

# make new columns: devs, devs_sq, p_weight_devs_sq
# save to new tibble
amazon_table <- amazon %>%
  mutate(devs = books - exp_value,
         devs_sq = devs^2,
         p_weight_devs_sq = prob * devs^2)

# look at the tibble
amazon_table
```

```
# now let's take these and summarize
amazon_table %>%
  summarize(var = sum(p_weight_devs_sq), # variance
            sd = sqrt(var)) # sqrt to get sd, confirm it's the same!
```

Scores on the SAT (out of 1600) are approximately normally distributed with a mean of 500 and standard deviation of 100.

What is the probability of getting a score between a 400 and a 600?

Let random variable \(S\) be the score earned on the SAT.

Convert these numbers to \(Z\)-scores.

\[\begin{align*}P(400 \leq S \leq 600)&=P\big(\frac{400-500}{100} \leq \frac{S - 500}{100} \leq \frac{600-500}{100}\big)\\ &=P\big(-1 \leq Z \leq 1\big)\\ &\approx 0.68\\ \end{align*}\]

Using the 68-95-99.7 rule: about 68% of the values fall within one standard deviation \((\pm 1\) \(Z\)-score) of the mean.
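The 68% figure is the rule-of-thumb answer; `R`'s `pnorm()` gives the exact normal probability, either in original units or in \(Z\)-scores:

```r
# P(400 <= S <= 600) for S ~ Normal(mean = 500, sd = 100)
pnorm(600, mean = 500, sd = 100) - pnorm(400, mean = 500, sd = 100)

# equivalently, with Z-scores: P(-1 <= Z <= 1)
pnorm(1) - pnorm(-1)  # about 0.6827
```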

You don’t need to draw the pdf, but it helps to visualize what we’re looking for, and how converting to \(Z\)-scores helps:

```
# see class 2.3 appendix on how to graph and shade stats graphs
# it helps to first figure out where the x-axis ticks should be
# show about 4 standard deviations above and below the mean (mu +/- 4*sd)
# then have ticks in intervals of one sd
# in this case, with mean 500 and sd 100, it should be seq(100,900,100)
s_plot <- ggplot(data = tibble(scores = seq(from = 100,
                                            to = 900,
                                            by = 100)))+
  aes(x = scores)+
  stat_function(fun = dnorm,
                geom = "area",
                args = list(mean = 500, sd = 100),
                size = 2,
                fill = "gray")+
  labs(x = "SAT Scores (out of 1600)",
       y = "Probability")+
  scale_x_continuous(breaks = seq(from = 100,
                                  to = 900,
                                  by = 100))+
  scale_y_continuous(limits = c(0, 0.0045),
                     expand = c(0, 0))+
  theme_classic(base_family = "Fira Sans Condensed",
                base_size = 20)

# add a shaded layer for the region between 400 and 600
s_plot+
  stat_function(fun = dnorm,
                args = list(mean = 500, sd = 100),
                geom = "area",
                xlim = c(400, 600),
                size = 2,
                fill = "#e64173")
```

```
# the same picture in Z-scores
Z <- ggplot(data = tibble(Z = seq(from = -4,
                                  to = 4,
                                  by = 1)))+
  aes(x = Z)+
  stat_function(fun = dnorm,
                geom = "area",
                size = 2,
                fill = "gray")+
  labs(x = "Z-Scores",
       y = "Probability")+
  scale_x_continuous(breaks = seq(from = -4,
                                  to = 4,
                                  by = 1))+
  scale_y_continuous(limits = c(0, 0.45),
                     expand = c(0, 0))+
  theme_classic(base_family = "Fira Sans Condensed",
                base_size = 20)

# shade the region between -1 and 1
Z+
  stat_function(fun = dnorm,
                geom = "area",
                xlim = c(-1, 1),
                fill = "#e64173")
```