Problem Set 5

Author

Answer Key

Published

November 28, 2022

# load packages, they are all installed for you already (in the cloud only)
library(tidyverse) # your friend and mine
library(broom) # for tidy regression
library(modelsummary) # for nice regression tables
library(car) # for f-test

Theory and Concepts

Question 1

In your own words, describe what the “dummy variable trap” means. What precisely is the problem, and what is the standard way to prevent it?

The dummy variable trap is a specific form of perfect multicollinearity, which makes it impossible for OLS to estimate the regression at all. If two (or more) variables can be expressed as an exact linear function of one another, there is no unique set of coefficient estimates. In terms of dummy variables, we could have a dummy variable for each category an observation i can fall into, such as the four seasons:

  • Spring (\(=1\) if i is in Spring, \(=0\) if not)
  • Summer (\(=1\) if i is in Summer, \(=0\) if not)
  • Fall (\(=1\) if i is in Fall, \(=0\) if not)
  • Winter (\(=1\) if i is in Winter, \(=0\) if not)

If we were to include all four dummy variables in the regression, each and every observation would have the following property:

\[Spring_i + Summer_i+ Fall_i +Winter_i=1\]

That is, for each and every observation i, the sum of the four dummy variable values must always equal 1 (because every observation must be either Spring, Summer, Fall, or Winter).

Our constant (formerly called the intercept), \(\beta_0\), is the same value for every observation, as if we had a dummy variable \(X_0\) that always equals 1 for each observation i. Now with the 4 season dummies, we have a perfectly collinear relationship between the constant and the sum of the dummies: \[1=X_{0i}=Spring_i+Summer_i+Fall_i+Winter_i\]

In any case, we need to drop one of these variables (either \(X_0\) or one of the dummies) to break the collinearity. For example, if we drop \(Summer_i\), then \(Spring_i+Fall_i+Winter_i \neq 1\) for every observation, since this sum equals 0 for observations in Summer.
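As a minimal sketch with made-up data (the data frame, variable names, and values below are purely illustrative), R's lm() avoids the trap automatically: when a categorical variable enters as a factor, one category is dropped as the reference group.

# a minimal sketch with made-up data: R drops one season dummy automatically
set.seed(1)
seasons_df <- data.frame(
  season = rep(c("Spring", "Summer", "Fall", "Winter"), each = 25),
  y = rnorm(100, mean = 10)
)

lm(y ~ season, data = seasons_df) # only 3 season coefficients appear, plus the intercept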

Question 2

In your own words, describe what an interaction term is used for, and give an example. You can use any type of interaction to explain your answer.

Suppose we have a model of: \[Y=\beta_0+\beta_1 X_1+ \beta_2 X_2\]

\(\hat{\beta_1}\) tells us the marginal effect of \(X_1 \rightarrow Y\) holding \(X_2\) constant. However, it might be that the effect of \(X_1 \rightarrow Y\) may be different for different values of \(X_2\). We can explore this possibility with an interaction term, adding:

\[Y=\beta_0+\beta_1 X_1+ \beta_2 X_2+\beta_3 (X_1 \times X_2)\] Now the marginal effect on \(Y\) of a change in \(X_1\) is:

\[\frac{\Delta Y}{\Delta X_1}=\beta_1+\beta_3 X_2\]

So, the effect depends on what \(X_2\) is. \(\beta_3\) on the interaction term describes the increment to the effect of \(X_1\) on \(Y\) that depends on the value of \(X_2\).

Note \(X_1\) and \(X_2\) could each be a continuous variable or a dummy variable, so we could have three possible combinations for an interaction effect:

  • Two continuous variables (example above), e.g. wage and experience
  • Two dummy variables, e.g. male (\(=1\) if male, \(=0\) if female) and married (\(=1\) if married, \(=0\) if unmarried)
  • One continuous and one dummy variable (e.g. male and experience)
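A hedged sketch of the third case (one dummy, one continuous variable), using simulated data where the variable names and effect sizes are invented purely for illustration:

# a sketch with simulated data: dummy (male) x continuous (experience) interaction
set.seed(2)
wages_df <- data.frame(
  male = rbinom(500, 1, 0.5),
  experience = runif(500, 0, 30)
)
wages_df$wage <- 10 + 2 * wages_df$experience + 3 * wages_df$male +
  0.5 * wages_df$male * wages_df$experience + rnorm(500, sd = 5)

# the male:experience coefficient is the extra return to experience when male = 1 vs. male = 0
lm(wage ~ male * experience, data = wages_df)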

Question 3

In your own words, describe when and why using logged variables can be useful.

Logged variables can help us out for theoretical reasons or data-fitting reasons.

Obviously, if a log function appears to fit nonlinear data better than a linear function (or even a polynomial), then using a log function is a good idea.

Sometimes we need functional forms that behave in a way that would be predicted by theory, such as diminishing returns, where the effect of something is always positive, but always growing smaller (such as the marginal product of labor or marginal utility). A log function always increases at a decreasing rate.

Additionally, logs can allow us to talk about percentage changes in variables, rather than unit changes. If we have both \(X\) and \(Y\) logged, we can talk about the elasticity between \(X\) and \(Y\), a 1% change in \(X\) leads to a \(\hat{\beta_1}\)% change in \(Y\).
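To see where the elasticity interpretation comes from, differentiate the log-log model (a generic derivation, not tied to any dataset in this problem set):

\[\begin{align*} \ln Y &= \beta_0 + \beta_1 \ln X\\ \frac{dY}{Y} &= \beta_1 \frac{dX}{X}\\ \beta_1 &= \frac{dY/Y}{dX/X} = \frac{\% \Delta Y}{\% \Delta X}\\ \end{align*}\]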

Question 4

In your own words, describe when we would use an \(F\)-test, and give some example (null) hypotheses. Describe intuitively and specifically (no need for the formula) what exactly \(F\) is trying to test for.

\(F\)-tests are used to test joint hypotheses, which hypothesize values for multiple parameters at once. This is as opposed to an ordinary \(t\)-test, which hypothesizes a value for only one parameter (e.g. \(H_0\): \(\beta_1=0\)).

We can run \(F\)-tests for a few different scenarios, but the most common is checking whether multiple variables are jointly-significant. The null hypothesis would be that several parameters are equal to 0, e.g.: \[H_0 \text{: } \beta_1=0; \beta_2=0\]

The formula that calculates the value of the \(F\)-statistic essentially measures if the \(R^2\) improves by a statistically significant amount when we move from the restricted regression (with the null hypothesis imposed) to the unrestricted regression (with the null hypothesis not imposed).

For example, for a regression: \[\hat{Y_i}=\beta_0+\beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i}\]

the null hypothesis: \[H_0 \text{: } \beta_1=0; \beta_2=0\]

would produce a restricted regression of: \[\hat{Y_i} = \beta_0+ \beta_3 X_{3i} + \beta_4 X_{4i}\]

and compare the \(R^2\) of this regression to the \(R^2\) of the original (unrestricted) regression. If the \(R^2\) improves enough, the \(F\)-statistic is large enough, and hence we can reject the null hypothesis and conclude that \(X_{1i}\) and \(X_{2i}\) are jointly significant (we should not omit them from our regression).
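In practice, rather than running the restricted and unrestricted regressions by hand, we can ask R to run the joint test directly. A minimal sketch, assuming an already-estimated model object called unrestricted_reg with regressors named x1 and x2 (both names are hypothetical):

library(car) # for linearHypothesis()

# joint F-test of H0: beta_1 = 0 and beta_2 = 0
linearHypothesis(unrestricted_reg, c("x1 = 0", "x2 = 0"))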

Theory Problems

For the following questions, please show all work and explain answers as necessary. You may lose points if you only write the correct answer. You may use R to verify your answers, but you are expected to reach the answers in this section “manually.”

Question 5

Suppose data on many countries’ legal systems (Common Law or Civil Law) and their GDP per capita gives us the following summary statistics:

| Legal System | Avg. GDP Growth Rate | Std. dev. | \(n\) |
|---|---|---|---|
| Common Law | \(1.84\) | \(3.55\) | \(19\) |
| Civil Law | \(4.97\) | \(4.27\) | \(141\) |
| Difference | \(-3.13\) | \(1.02\) | \(-\) |

Part A

Using the group means, write a regression equation for a regression of GDP Growth rate on Common Law. Define:

\[\text{Common Law}_i = \begin{cases} 1 & \text{if country } i \text{ has common law} \\ 0 & \text{if country } i \text{ has civil law}\\ \end{cases}\]

\(\widehat{\text{GDP Growth Rate}_i}=4.97-3.13 \, \text{Common Law}_i\)

  • \(\hat{\beta_0}\): the average GDP growth rate for Civil Law countries
  • \(\hat{\beta_1}\): the difference in average GDP growth rate between Common Law and Civil Law countries

Part B

How do we use the regression to find the average GDP Growth rate for common law countries? For civil law countries? For the difference?

This helps you construct the regression in part (a). The trick here is to recognize what each coefficient describes:

  • \(\beta_0\): average GDP Growth Rate for Civil Law countries (Common Law\(=0)\)
  • \(\beta_1\): difference in average GDP Growth Rates between Civil and Common Law countries
  • \(\beta_0+\beta_1\): average GDP Growth Rate for Common Law countries (Common Law\(=1)\)

Part C

Looking at the coefficients, does there appear to be a statistically significant difference in average GDP Growth Rates between Civil and Common law countries?

For this, we want to look at the coefficient and standard error on \(\hat{\beta_1}\), which measures the difference in average GDP Growth Rates between Common and Civil law countries. The standard error of \(\hat{\beta_1}\) is given by the standard error of the difference in means (1.02). We run a \(t\)-test:

\[t=\frac{-3.13}{1.02} \approx -3.07\]

This is a fairly large \(t\)-statistic in absolute value (well beyond the usual critical value of about 2), so the difference is very likely statistically significant.
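As a quick check in R (using the normal distribution as a large-sample approximation for the two-sided \(p\)-value, since the exact degrees of freedom are not given in the table):

# t-statistic for the difference in means, and an approximate two-sided p-value
t_stat <- -3.13 / 1.02
t_stat
2 * pnorm(abs(t_stat), lower.tail = FALSE)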

Part D

Is the estimate on the difference likely to be unbiased? Why or why not?

Whether or not a country is common law or civil law is likely to be endogenous. There are a host of other variables that are correlated both with a country’s legal system and its GDP growth rate – who it was colonized by, what its history is, geography, etc.

Part E

Now using the same table above, reconstruct the regression equation if instead of Common Law, we had used:

\[\text{Civil Law}_i = \begin{cases} 1 & \text{if country } i \text{ has civil law} \\ 0 & \text{if country } i \text{ has common law}\\ \end{cases}\]

\[\widehat{\text{GDP Growth Rate}_i} = 1.84 + 3.13 \, \text{Civil Law}_i\]

Question 6

Suppose a real estate agent collects data on houses that have sold in a particular neighborhood over the past year, with the following variables:

| Variable | Description |
|---|---|
| \(Price_h\) | price of house \(h\) (in thousands of $) |
| \(Bedrooms_h\) | number of bedrooms in house \(h\) |
| \(Baths_h\) | number of bathrooms in house \(h\) |
| \(Pool_h\) | \(\begin{cases} =1 & \text{if house } h \text{ has a pool} \\ =0 & \text{if house } h \text{ does not have a pool} \end{cases}\) |
| \(View_h\) | \(\begin{cases} =1 & \text{if house } h \text{ has a nice view} \\ =0 & \text{if house } h \text{ does not have a nice view} \end{cases}\) |

Part A

Suppose she estimates the following regression:

\[\widehat{\text{Price}}_h=119.20+29.76 \, \text{Bedrooms}_h+24.09 \, \text{View}_h+14.06 \, (\text{Bedrooms}_h \times \text{View}_h)\]

What does each coefficient mean?

  • \(\hat{\beta_0}=119.20\): Houses with 0 bedrooms and no view sell for $119.20 thousand
  • \(\hat{\beta_1}=29.76\): For every additional bedroom for houses with no view, price increases by $29.76 thousand
  • \(\hat{\beta_2}=24.09\): A house with 0 bedrooms and a view sells for $24.09 thousand higher than a house with 0 bedrooms and no view
  • \(\hat{\beta_3}=14.06\): A house with a view gains $14.06 thousand more in price for every additional bedroom than does a house with no view

Part B

Write out two separate regression equations, one for houses with a nice view, and one for houses without a nice view. Explain each coefficient in each regression.

For houses without a nice view \((View_h=0)\):

\[\begin{align*} \widehat{Price_h} &= 119.20 +29.76 \, Bedrooms_h +24.09(\mathbf{0})+14.06\, Bedrooms_h(\mathbf{0})\\ &=119.20+29.76\, Bedrooms_h\\ \end{align*}\]

Houses without a nice view sell for $119.20 thousand for no bedrooms, and gain $29.76 thousand for every additional bedroom.

For houses with a nice view \((View_h=1)\):

\[\begin{align*} \widehat{Price}_h &= 119.20 +29.76\, Bedrooms_h+24.09(\mathbf{1})+14.06\, Bedrooms_h(\mathbf{1})\\ &=119.20+29.76\, Bedrooms_h+24.09+14.06\, Bedrooms_h\\ &=(119.20+24.09)+(29.76+14.06)\, Bedrooms_h\\ &=143.29+43.82\, Bedrooms_h\\ \end{align*}\]

Houses with a nice view sell for $143.29 thousand for no bedrooms, and gain $43.82 thousand for every additional bedroom.

Part C

Suppose she estimates the following regression:

\[\widehat{\text{Price}}_h=189.20+42.40 \, \text{Pool}_h+12.10 \, \text{View}_h+12.09 \, (\text{Pool}_h \times \text{View}_h)\]

What does each coefficient mean?

  • \(\hat{\beta_0}=189.20\): Houses with no pool and no view sell for $189.20 thousand on average
  • \(\hat{\beta_1}=42.40\): Houses with no view gain $42.40 thousand for having a pool
  • \(\hat{\beta_2}=12.10\): Houses with no pool gain $12.10 thousand for having a view
  • \(\hat{\beta_3}=12.09\): Houses with a pool gain $12.09 thousand more for also having a view than do houses without a pool. Note you could equivalently say houses with a view gain $12.09 thousand more for also having a pool than do houses without a view.

Part D

Find the expected price for:

  • a house with no pool and no view
  • a house with no pool and a view
  • a house with a pool and without a view
  • a house with a pool and with a view
  • No Pool & No View: \(\beta_0=\$189.20\) thousand
  • No Pool & View: \(\beta_0+\beta_2=189.20+12.10=\$201.30\) thousand
  • Pool & No view: \(\beta_0+\beta_1=189.20+42.40=\$231.60\) thousand
  • Pool & View: \(\beta_0+\beta_1+\beta_2+\beta_3=189.20+42.40+12.10+12.09=\$255.79\) thousand

Part E

Suppose she estimates the following regression:

\[\widehat{\text{Price}}_h=87.90+53.94 \, \text{Bedrooms}_h+15.29 \, \text{Baths}_h+16.19 \, (\text{Bedrooms}_h \times \text{Baths}_h)\] What is the marginal effect of adding an additional bedroom if the house has 1 bathroom? 2 bathrooms? 3 bathrooms?

  • \(\displaystyle\frac{\Delta Price}{\Delta Bed}|Bath=1=\beta_1+\beta_3(Bath)=53.94+16.19(1)=\$70.13\) thousand
  • \(\displaystyle\frac{\Delta Price}{\Delta Bed}|Bath=2=\beta_1+\beta_3(Bath)=53.94+16.19(2)=\$86.32\) thousand
  • \(\displaystyle\frac{\Delta Price}{\Delta Bed}|Bath=3=\beta_1+\beta_3(Bath)=53.94+16.19(3)=\$102.51\) thousand

Part F

Using the regression from Part E, what is the marginal effect of adding an additional bathroom if the house has 1 bedroom? 2 bedrooms? 3 bedrooms?

  • \(\displaystyle\frac{\Delta Price}{\Delta Bath}|Bed=1=\beta_2+\beta_3(Bed)=15.29+16.19(1)=\$31.48\) thousand
  • \(\displaystyle\frac{\Delta Price}{\Delta Bath}|Bed=2=\beta_2+\beta_3(Bed)=15.29+16.19(2)=\$47.67\) thousand
  • \(\displaystyle\frac{\Delta Price}{\Delta Bath}|Bed=3=\beta_2+\beta_3(Bed)=15.29+16.19(3)=\$63.86\) thousand
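The same arithmetic in R, plugging the coefficients from the Part E regression into the marginal-effect formulas:

# marginal effects for Parts E and F, from the Part E regression
b_bed <- 53.94      # coefficient on Bedrooms
b_bath <- 15.29     # coefficient on Baths
b_interact <- 16.19 # coefficient on Bedrooms x Baths

b_bed + b_interact * 1:3  # extra bedroom, with 1, 2, or 3 bathrooms
b_bath + b_interact * 1:3 # extra bathroom, with 1, 2, or 3 bedrooms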

Question 7

Suppose we want to examine the change in average global temperature over time. We have data on the deviation in temperature from pre-industrial times (in Celsius), and the year.

Part A

Suppose we estimate the following simple model relating deviation in temperature to year:

\[\widehat{\text{Temperature}}_i=-10.46+0.006 \, \text{Year}_i\] Interpret the coefficient on Year (i.e. \(\hat{\beta_1})\)

Every (additional) year, temperature increases by 0.006 degrees.

Part B

Predict the (deviation in) temperature for the year 1900 and for the year 2000.

In 1900: \[\begin{align*} \widehat{\text{Temperature}_{1900}}&=-10.46+0.006 (1900)\\ &=-10.46+11.4\\ &=0.94\text{ degrees} \\ \end{align*}\]

In 2000: \[\begin{align*} \widehat{\text{Temperature}_{2000}}&=-10.46+0.006 (2000)\\ &=-10.46+12\\ &=1.54\text{ degrees} \\ \end{align*}\]
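The same predictions in R:

# predicted temperature deviation in 1900 and 2000 from the simple model
-10.46 + 0.006 * c(1900, 2000)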

Part C

Suppose we believe temperature deviations are increasing at an increasing rate, and introduce a quadratic term and estimate the following regression model:

\[\widehat{\text{Temperature (dev)}}_i=155.68-0.116 \, \text{Year}_i+0.000044 \, \text{Year}_i^2\]

What is the marginal effect on (deviation in) global temperature of one additional year elapsing?

The marginal effect is measured by the first derivative of the regression equation with respect to Year. But you can just remember the resulting formula and plug in the parameters:

\[\begin{align*} \frac{d \, Y}{d \, X} &= \beta_1+2\beta_2 X\\ \frac{d \, Temperature}{d \, Year} &= -0.116+2(0.000044)\, Year\\ &=-0.116+0.000088 \, Year\\ \end{align*}\]

Part D

Our quadratic function is a \(U\)-shape. According to the model, at what year was temperature (deviation) at its minimum?

We can set the derivative equal to 0, or you can just remember the formula and plug in the parameters:

\[\begin{align*} \frac{d Y}{d X} &= \beta_1+2\beta_2 X\\ 0 &=\beta_1+2\beta_2 X\\ -\beta_1&=2\beta_2 X\\ -\frac{1}{2} \times \frac{\beta_1}{\beta_2}&=Year^*\\ -\frac{1}{2} \times\frac{(-0.116)}{(0.000044)} &= Year^*\\ -\frac{1}{2} \times -2636 & \approx Year^*\\ 1318 & \approx Year ^*\\ \end{align*}\]
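The same calculation in R:

# year at which the quadratic model reaches its minimum: -beta_1 / (2 * beta_2)
beta_1 <- -0.116
beta_2 <- 0.000044
-beta_1 / (2 * beta_2) # approximately 1318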

Question 8

Suppose we want to examine the effect of cell phone use while driving on traffic fatalities. While we cannot measure the amount of cell phone activity while driving, we do have a good proxy variable, the number of cell phone subscriptions (in 1000s) in a state, along with traffic fatalities in that state.

Part A

Suppose we estimate the following simple regression:

\[\widehat{\text{fatalities}}_i=123.98+0.091\, \text{cell plans}_i\] Interpret the coefficient on cell plans (i.e. \(\hat{\beta_1})\).

For every additional (thousand) cell phone plans, traffic fatalities increase by 0.091.

Part B

Now suppose we estimate the regression using a linear-log model:

\[\widehat{\text{fatalities}}_i=-3557.08+515.81\, \text{ln(cell plans}_i)\] Interpret the coefficient on ln(cell plans) (i.e. \(\hat{\beta_1})\)

A 1% increase in cell phone plans will increase traffic fatalities by \(\frac{515.81}{100}\approx 5.16\) fatalities.

Part C

Now suppose we estimate the regression using a log-linear model:

\[\widehat{\text{ln(fatalities})}_i=5.43+0.0001\,\text{cell plans}_i\] Interpret the coefficient on cell plans (i.e. \(\hat{\beta_1})\)

For every additional (thousand) cell phone plans, traffic fatalities will increase by \(0.0001 \times 100\%=0.01\%\).

Part D

Now suppose we estimate the regression using a log-log model:

\[\widehat{\text{ln(fatalities})}_i=-0.89+0.85\,\text{ln(cell plans})_i\] Interpret the coefficient on cell plans (i.e. \(\hat{\beta_1})\)

This is an elasticity. A 1% increase in cell phone plans will cause a 0.85% increase in traffic fatalities.

Part E

Suppose we include several other variables into our regression and want to determine which variable(s) have the largest effects, a State’s cell plans, population, or amount of miles driven. Suppose we decide to standardize the data to compare units, and we get:

\[\widehat{\text{fatalities}}_Z=0+0.002 \, \text{cell plans}_{Z}-0.00007\,\text{population}_{Z}+0.019\,\text{miles driven}_{Z}\] Interpret the coefficients on cell plans, population, and miles driven. Which has the largest effect on fatalities?

  • A 1-standard-deviation-increase in cell plans will lead to a 0.002 standard deviation increase in fatalities.
  • A 1-standard-deviation-increase in population will lead to a 0.00007 standard deviation decrease in fatalities.
  • A 1-standard-deviation-increase in miles driven will lead to a 0.019 standard deviation increase in fatalities.

Miles driven appears to have the largest effect on fatalities, since it has the largest standardized coefficient (0.019 standard deviations).

Part F

Suppose we wanted to make the claim that it is only miles driven, and neither population nor cell phones determine traffic fatalities. Write (i) the null hypothesis for this claim and (ii) the estimated restricted regression equation.

\[H_0 \text{: } \beta_1=\beta_2=0\]

Restricted Regression: \(\widehat{\text{fatalities}_i}=4.35+0.019\text{miles driven}^{std}\)

Part G

Suppose the \(R^2\) on the original regression from Part E was 0.9221, and the \(R^2\) from the restricted regression is 0.9062. With 50 observations, calculate the \(F\)-statistic.

Note \(q=2\) as we are hypothesizing values for \(2\) parameters, \(k=3\) as we have 3 variables in our unrestricted regression (cell plans, population, and miles driven).

\[\begin{align*} F_{q,n-k-1}&=\cfrac{\displaystyle\frac{(R^2_u-R^2_r)}{q}}{\displaystyle\frac{(1-R^2_u)}{(n-k-1)}}\\ & \\ F_{2,50-3-1}&=\cfrac{\displaystyle\frac{(0.9221-0.9062)}{2}}{\displaystyle\frac{(1-0.9221)}{(50-3-1)}}\\ & \\ F_{2,46}&=\cfrac{\displaystyle\frac{(0.0159)}{2}}{\displaystyle\frac{(0.0779)}{(46)}}\\ &=\frac{0.00795}{0.00169}\\ &\approx 4.70\\ \end{align*}\]

The number here does not mean anything literally. The purpose of this exercise is to get you to try to understand exactly how \(F\) is calculated and how to think about it. Essentially, we are seeing if the \(R^2\) on the unrestricted model has statistically significantly increased from the \(R^2\) on the restricted model. The larger the \(F\)-statistic is, the larger the increase. We can also check statistical significance by calculating the associated \(p\)-value.
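We could also compute the \(F\)-statistic itself in R before looking up the \(p\)-value (a small sketch using the numbers above):

# F-statistic from the unrestricted and restricted R^2 values
r2_u <- 0.9221     # unrestricted R^2
r2_r <- 0.9062     # restricted R^2
q    <- 2          # number of restrictions
df_d <- 50 - 3 - 1 # n - k - 1

((r2_u - r2_r) / q) / ((1 - r2_u) / df_d) # approximately 4.70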

# FYI: We could estimate the p-value in R if we wanted
#     - similar to calculating p-value for a t-statistic
#     - syntax is: 
# pf(F.value, df1, df2, lower.tail = FALSE) 
#     - we plug in our calculated F-statistic (above) for F.value
#     - df1 and df2 are degrees of freedom from numerator and denominator
#     - lower.tail=FALSE implies we are looking at probability to the RIGHT  
#       of F-value on F-distribution, p(f>F) that f is higher than our F
pf(4.70, 2,46, lower.tail=FALSE)
[1] 0.01389011

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (rendered to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your rendered quarto document) your code and the outputs of your code with the rest of your answers.

Question 9

Download and read in the LeadMortality.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
lead <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/LeadMortality.csv")
New names:
• `` -> `...1`
Rows: 172 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): city, state
dbl (14): ...1, year, age, hardness, ph, infrate, typhoid_rate, np_tub_rate,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Lead is toxic, particularly for young children, and for this reason government regulations severely restrict the amount of lead in our environment. In the early part of the 20th century, the underground water pipes in many U.S. cities contained lead, and lead from these pipes leached into drinking water. This exercise will have you investigate the effect of these lead pipes on infant mortality. This dataset contains data on:

| Variable | Description |
|---|---|
| infrate | infant mortality rate (deaths per 100 in population) |
| lead | \(=1\) if city has lead water pipes, \(=0\) if did not have lead pipes |
| pH | water pH |

and several demographic variables for 172 U.S. cities in 1900.

Part A

Using R to examine the data, find the average infant mortality rate for cities with lead pipes and for cities without lead pipes. Then, calculate the difference in mortality rates, and run a \(t\)-test to determine if this difference is statistically significant.

# mean of infrate for cities with lead
mean_lead <- lead %>%
    filter(lead == 1) %>%
    summarize(mean(infrate)) %>%
    pull() # to save as number

# look at it
mean_lead
[1] 0.4032576
# mean of infrate for cities with no lead
mean_no_lead <- lead %>%
    filter(lead == 0) %>%
    summarize(mean(infrate)) %>%
    pull() # to save as number

# look at it
mean_no_lead
[1] 0.3811679
# take difference
mean_lead - mean_no_lead
[1] 0.02208973

Cities with lead pipes have an infant mortality rate of 0.40, and cities without lead pipes have an infant mortality rate of 0.38. So the difference is 0.02.

# run t-test of difference
t.test(infrate ~ lead, data = lead)

    Welch Two Sample t-test

data:  infrate by lead
t = -0.90387, df = 109.29, p-value = 0.3681
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.07052551  0.02634606
sample estimates:
mean in group 0 mean in group 1 
      0.3811679       0.4032576 

We get a \(t\)-statistic of \(-0.90\) and a \(p\)-value of \(0.3681\), so the difference is not statistically significant.

Part B

Run a regression of infrate on lead, and write down the estimated regression equation. Use the regression coefficients to find:

  • the average infant mortality rate for cities with lead pipes
  • the average infant mortality rate for cities without lead pipes
  • the difference between the averages for cities with or without lead pipes
lead_reg1 <- lm(infrate ~ lead, data = lead)
summary(lead_reg1)

Call:
lm(formula = infrate ~ lead, data = lead)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27141 -0.10643 -0.01238  0.07528  0.44121 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.38117    0.02042  18.669   <2e-16 ***
lead         0.02209    0.02475   0.892    0.373    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1514 on 170 degrees of freedom
Multiple R-squared:  0.004662,  Adjusted R-squared:  -0.001193 
F-statistic: 0.7963 on 1 and 170 DF,  p-value: 0.3735

\[\widehat{\text{Infrate}_i} = 0.38+0.02 \, \text{Lead}_i\]

  • Cities without lead pipes have an infant mortality rate of 0.38 \((\hat{\beta_0})\)
  • Cities with lead pipes have an infant mortality rate of \(0.38+0.02=0.40\) \((\hat{\beta_0}+\hat{\beta_1})\)
  • The difference is \(0.02\) \((\hat{\beta_1})\)

Part C

Does the pH of the water matter? Include ph in your regression from part B. Write down the estimated regression equation, and interpret each coefficient. What happens to the estimate on lead?

lead_reg2 <- lm(infrate ~ lead + ph, data = lead)
summary(lead_reg2)

Call:
lm(formula = infrate ~ lead + ph, data = lead)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27074 -0.09140 -0.01517  0.07909  0.34222 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.17817    0.10676  11.035  < 2e-16 ***
lead         0.04993    0.02177   2.294    0.023 *  
ph          -0.11143    0.01472  -7.570 2.31e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1312 on 169 degrees of freedom
Multiple R-squared:  0.2567,    Adjusted R-squared:  0.2479 
F-statistic: 29.18 on 2 and 169 DF,  p-value: 1.299e-11

\[\widehat{\text{Infrate}_i}=1.17+0.05 \, \text{Lead}_i-0.11 \, \text{pH}_i\]

  • \(\hat{\beta_0}\): the average infant mortality rate for cities without lead pipes and a water pH of 0 is 1.17
  • \(\hat{\beta_1}\): the infant mortality rate for cities with lead pipes is 0.05 higher than cities without lead pipes, holding pH constant
  • \(\hat{\beta_2}\): the infant mortality rate falls by 0.11 for every increase of 1 in the city water’s pH (less acidic), holding the type of pipe constant

The estimate on lead doubled and became significant at the 5% level.

Part D

The amount of lead leached from lead pipes normally depends on the chemistry of the water running through the pipes: the more acidic the water (lower pH), the more lead is leached. Create an interaction term between lead and pH, and run a regression of infrate on lead, pH, and your interaction term. Write down the estimated regression equation. Is this interaction effect significant?

lead_reg3 <- lm(infrate ~ lead + ph + lead:ph, data = lead)
summary(lead_reg3)

Call:
lm(formula = infrate ~ lead + ph + lead:ph, data = lead)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27492 -0.09502 -0.00266  0.07965  0.35139 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.91890    0.17447   5.267  4.2e-07 ***
lead         0.46180    0.22122   2.087  0.03835 *  
ph          -0.07518    0.02427  -3.098  0.00229 ** 
lead:ph     -0.05686    0.03040  -1.871  0.06312 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1303 on 168 degrees of freedom
Multiple R-squared:  0.2719,    Adjusted R-squared:  0.2589 
F-statistic: 20.91 on 3 and 168 DF,  p-value: 1.467e-11

\[\widehat{\text{Infrate}_i}=0.92+0.46 \, \text{Lead}_i-0.08 \, \text{pH}_i- \, 0.06 \, (\text{Lead} \times \text{pH})\]

We see that this interaction effect is just barely insignificant at the 5% level (though significant at the 10% level), with a \(p\)-value of 0.06.

Part E

What we actually have now are two different regression lines. Visualize this with a scatterplot between infrate \((Y)\) and ph \((X)\), making your points and regression lines colored by lead.

lead_scatter <- ggplot(data = lead)+
    aes(x = ph,
        y = infrate,
        color = as.factor(lead))+ # making lead a factor makes color discrete rather than continuous!
    geom_point()+
    geom_smooth(method = "lm")+ # color in the global aes gives a separate regression line for each pipe type
    # now I'm just making it pretty
    # changing color
    scale_color_viridis_d("Pipes",
                          labels = c("0" = "Not Lead",
                                     "1" = "Lead"))+ # changing labels for colors
  labs(x = "pH",
       y = "Infant Mortality Rate")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "bottom")

lead_scatter
`geom_smooth()` using formula = 'y ~ x'

lead_scatter + facet_wrap(~lead, 
                        labeller = labeller(lead = c("0" = "Not Lead",
                                                     "1" = "Lead")))+ # change facet titles
    guides(color = "none") # hide the color legend (redundant with the facet titles)
`geom_smooth()` using formula = 'y ~ x'

Part F

Do the two regression lines have the same intercept? The same slope? Use the original regression in part D to test these possibilities.

\(\beta_1\) (on lead) measures the difference in intercept between the lead & no lead regression lines. So we would want to test:

\[\begin{align*} H_0: & \beta_1=0\\ H_a: & \beta_1 \neq 0\\ \end{align*}\]

The R output tells us \(\hat{\beta_1}\) is \(0.46\) with a standard error of 0.22, so the \(t\)-statistic for this test is \(2.09\) with a \(p\)-value of 0.04, so there is a statistically significant difference at the 5% level.

\(\beta_3\) (on the interaction term) measures the difference in slope between the lead & no lead regression lines. So we would want to test:

\[\begin{align*} H_0: & \beta_3=0\\ H_a: & \beta_3 \neq 0\\ \end{align*}\]

The output tells us \(\hat{\beta_3}\) is \(-0.06\) with a standard error of 0.03, so the \(t\)-statistic for this test is \(-1.87\) with a \(p\)-value of \(0.06\), so there is not a statistically significant difference at the 5% level.

Therefore, statistically, the two lines have different intercepts but the same slope.

Part G

Take your regression equation from part D and rewrite it as two separate regression equations (one for no lead and one for lead). Interpret the coefficients for each.

For no lead (lead=0):

\[\begin{align*} \widehat{Infrate}&=0.92+0.46 \, Lead-0.08 \, pH-0.06\, (Lead \times pH)\\ &=0.92+0.46(\mathbf{0})-0.08pH-0.06((\mathbf{0}) \times pH)\\ &=0.92-0.08 \, pH\\ \end{align*}\]

For lead (lead=1):

\[\begin{align*} \widehat{Infrate}&=0.92+0.46 \, Lead-0.08 \, pH-0.06\, (Lead \times pH)\\ &=0.92+0.46(\mathbf{1})-0.08 \, pH-0.06((\mathbf{1}) \times pH)\\ &=(0.92+0.46)+(-0.08-0.06) \, pH\\ &=1.38-0.14 \, pH\\ \end{align*}\]

Cities without lead pipes have an infant mortality rate of \(0.92\) (the intercept of the first equation) vs. cities with lead pipes, which have an infant mortality rate of \(1.38\) (the intercept of the second equation). For every additional unit of pH, the infant mortality rate of a city without lead pipes falls by 0.08, vs. a fall of 0.14 for a city with lead pipes.

So again, we can see that the difference in infant mortality rates between cities with and without lead pipes, at a water pH of 0, is \(1.38-0.92=0.46\) (\(\hat{\beta_1}\) from the regression in part D), and that infant mortality in cities with lead pipes falls by \(0.14-0.08=0.06\) more with each unit of pH than it does in cities without lead pipes (\(\hat{\beta_3}\) from the regression in part D).

Part H

Double check your calculations in G are correct by running the regression in D twice, once for cities without lead pipes and once for cities with lead pipes. [Hint: filter() the data first, then use the filtered data for the data= in each regression.]

# regression for no lead

lead %>%
    filter(lead == 0) %>%
    lm(data = ., infrate ~ lead + ph + lead:ph) %>%
    summary()

Call:
lm(formula = infrate ~ lead + ph + lead:ph, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.23779 -0.09733 -0.01506  0.07053  0.35139 

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.91890    0.18543   4.955 7.77e-06 ***
lead              NA         NA      NA       NA    
ph          -0.07518    0.02579  -2.915  0.00521 ** 
lead:ph           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1385 on 53 degrees of freedom
Multiple R-squared:  0.1381,    Adjusted R-squared:  0.1219 
F-statistic: 8.495 on 1 and 53 DF,  p-value: 0.005206
# regression for lead

lead %>%
    filter(lead == 1) %>%
    lm(data = ., infrate ~ lead + ph + lead:ph) %>%
    summary()

Call:
lm(formula = infrate ~ lead + ph + lead:ph, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27492 -0.08881  0.00664  0.08059  0.31646 

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.38070    0.13189   10.47  < 2e-16 ***
lead              NA         NA      NA       NA    
ph          -0.13204    0.01775   -7.44 1.97e-11 ***
lead:ph           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1263 on 115 degrees of freedom
Multiple R-squared:  0.325, Adjusted R-squared:  0.3191 
F-statistic: 55.36 on 1 and 115 DF,  p-value: 1.967e-11

Part I

Use modelsummary to make a nice output table of all of your regressions from parts B, C, and D.

modelsummary(models = list(lead_reg1,
                           lead_reg2,
                           lead_reg3),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "lead" = "Lead Pipes",
                             "ph" = "pH",
                             "lead:ph" = "lead x pH"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Constant | 0.38*** | 1.18*** | 0.92*** |
| | (0.02) | (0.11) | (0.17) |
| Lead Pipes | 0.02 | 0.05** | 0.46** |
| | (0.02) | (0.02) | (0.22) |
| pH | | −0.11*** | −0.08*** |
| | | (0.01) | (0.02) |
| lead x pH | | | −0.06* |
| | | | (0.03) |
| n | 172 | 172 | 172 |
| Adj. R² | −0.001 | 0.25 | 0.26 |
| SER | 0.15 | 0.13 | 0.13 |

* p < 0.1, ** p < 0.05, *** p < 0.01

Question 10

Let’s look at economic freedom and GDP per capita using some data I sourced from Gapminder1, Freedom House2 and Fraser Institute Data3 and cleaned up for you, with the following variables:

| Variable | Description |
|---|---|
| Country | Name of country |
| ISO | Code of country (good for plotting) |
| econ_freedom | Economic Freedom Index score (2016) from 1 (least) to 10 (most free) |
| pol_freedom | Political freedom index score (2018) from 1 (least) to 10 (most free) |
| gdp_pc | GDP per capita (2018 USD) |
| continent | Continent of country |

Download and read in the freedom.csv dataset. If you don’t want to download/upload it, you can read it in directly from the url by running this chunk:

# run or edit this chunk
freedom <- read_csv("http://metricsf22.classes.ryansafner.com/files/data/freedom.csv")
Rows: 112 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, ISO, continent
dbl (3): econ_freedom, gdp_pc, pol_freedom

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Part A

Does economic freedom affect GDP per capita? Create a scatterplot of gdp_pc (Y) against econ_freedom (x). Does the effect appear to be linear or nonlinear?

freedom_plot <- ggplot(data = freedom)+
    aes(x = econ_freedom,
        y = gdp_pc)+
    geom_point(aes(color = continent))+
    scale_y_continuous(labels = scales::dollar)+
  labs(x = "Economic Freedom Score (0-10)",
       y = "GDP per Capita")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "bottom")

freedom_plot

The effect appears to be nonlinear.

Part B

Run a simple regression of gdp_pc on econ_freedom. Write out the estimated regression equation. What is the marginal effect of econ_freedom on gdp_pc?

freedom_reg1 <- lm(gdp_pc ~ econ_freedom, data = freedom)

summary(freedom_reg1)

Call:
lm(formula = gdp_pc ~ econ_freedom, data = freedom)

Residuals:
   Min     1Q Median     3Q    Max 
-24612 -10511  -3707   9727  65562 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -86400      13362  -6.466 2.85e-09 ***
econ_freedom    14704       1935   7.599 1.06e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15880 on 110 degrees of freedom
Multiple R-squared:  0.3442,    Adjusted R-squared:  0.3383 
F-statistic: 57.74 on 1 and 110 DF,  p-value: 1.063e-11

For every additional point on the economic freedom index (out of 10), GDP per capita increases by $14,704.

Part C

Let’s try a quadratic model. Run a quadratic regression of gdp_pc on econ_freedom. Write out the estimated regression equation.

Add the quadratic regression line to your scatterplot. This just requires a special formula inside an additional layer: geom_smooth(method = "lm", formula = "y ~ x + I(x^2)").

freedom_reg2 <- lm(gdp_pc ~ econ_freedom + I(econ_freedom^2),
                   data = freedom)

summary(freedom_reg2)

Call:
lm(formula = gdp_pc ~ econ_freedom + I(econ_freedom^2), data = freedom)

Residuals:
   Min     1Q Median     3Q    Max 
-30174  -8366   -647   2980  65158 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         280416      79511   3.527 0.000617 ***
econ_freedom        -96618      23908  -4.041 9.93e-05 ***
I(econ_freedom^2)     8327       1783   4.669 8.67e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14560 on 109 degrees of freedom
Multiple R-squared:  0.4535,    Adjusted R-squared:  0.4435 
F-statistic: 45.23 on 2 and 109 DF,  p-value: 4.986e-15
freedom_plot +
  geom_smooth(method = "lm", formula = "y~x+I(x^2)", color = "green")

Part D

What is the marginal effect of econ_freedom on gdp_pc?

The marginal effect can be found by taking the derivative of the regression with respect to econ_freedom (or recalling the rule):

\[\begin{align*} \frac{d \, Y}{d \, X} &= \beta_1+2\beta_2 X\\ \frac{d \, GDPpc}{d \, econfreedom} &= -96618+2(8327)\, econfreedom\\ &=-96618+16654 \, econfreedom\\ \end{align*}\]
# if you want to calculate it in R

library(broom)

freedom_reg2_tidy <- tidy(freedom_reg2)

freedom_beta_1 <- freedom_reg2_tidy %>%
    filter(term == "econ_freedom") %>%
    pull(estimate)

freedom_beta_2 <- freedom_reg2_tidy %>%
    filter(term == "I(econ_freedom^2)") %>%
    pull(estimate)
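# note: freedom_beta_1 + 2*freedom_beta_2 below is the marginal effect evaluated at econ_freedom = 1;
# for other points, compute freedom_beta_1 + 2*freedom_beta_2*econ_freedom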

freedom_beta_1+2*freedom_beta_2
[1] -79965.3

Part E

As a quadratic model, this relationship should predict an econ_freedom score where gdp_pc is at a minimum. What is that minimum Economic Freedom score, and what is the associated GDP per capita?

We can set the derivative equal to 0, or you can just remember the formula and plug in the parameters:

\[\begin{align*} \frac{d Y}{d X} &= \beta_1+2\beta_2 X\\ 0 &=\beta_1+2\beta_2 X\\ -\beta_1&=2\beta_2 X\\ -\frac{1}{2} \times \frac{\beta_1}{\beta_2}&=econfreedom^*\\ -\frac{1}{2} \times\frac{(-96618)}{(8327)} &= econfreedom^*\\ -\frac{1}{2} \times -11.603 & \approx econfreedom^*\\ 5.801 & \approx econfreedom ^*\\ \end{align*}\]
# to calculate in R
min <- -0.5*(freedom_beta_1/freedom_beta_2)
min
[1] 5.801793
# let's visualize on the scatterplot
freedom_plot +
  geom_smooth(method = "lm", formula = "y~x+I(x^2)", color = "green")+
  geom_vline(xintercept = min, linetype = "dashed", linewidth = 1)+
  geom_label(x = min, y = 75000, label = round(min,2))

Part F

Run a cubic model to see if we should keep going up in polynomials. Write out the estimated regression equation. Should we add a cubic term?

freedom_reg3 <- lm(gdp_pc ~ econ_freedom + I(econ_freedom^2)+ I(econ_freedom^3),
                   data = freedom)

summary(freedom_reg3)

Call:
lm(formula = gdp_pc ~ econ_freedom + I(econ_freedom^2) + I(econ_freedom^3), 
    data = freedom)

Residuals:
   Min     1Q Median     3Q    Max 
-30290  -8451   -493   3356  64636 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        531003.4   470172.4   1.129    0.261
econ_freedom      -211572.5   213908.4  -0.989    0.325
I(econ_freedom^2)   25674.4    32127.4   0.799    0.426
I(econ_freedom^3)    -862.2     1594.2  -0.541    0.590

Residual standard error: 14610 on 108 degrees of freedom
Multiple R-squared:  0.455, Adjusted R-squared:  0.4399 
F-statistic: 30.05 on 3 and 108 DF,  p-value: 3.318e-14

There’s no good theoretical reason why we should expect economic freedom to “change direction” twice - go down, then up, then down again - in its effect on GDP.

Statistically, we can see that \(\hat{\beta_3}\) on I(econ_freedom^3) is not significant (p-value is 0.590), so we should not include the cubic term.

# let's visualize it on the scatterplot
freedom_plot+
  geom_smooth(method = "lm", formula = "y~x+I(x^2)", color = "green")+
  geom_smooth(method = "lm", formula = "y~x+I(x^2)+I(x^3)", color = "orange")

Part G

Another way we can test for non-linearity is to run an \(F\)-test on all non-linear variables - i.e. the quadratic term and the cubic term (\(\beta_2\) and \(\beta_3\)) - and test against the null hypothesis that:

\[H_0: \beta_2 = \beta_3 = 0\]

Run this joint hypothesis test, and what can you conclude?

# run F test
library(car)
linearHypothesis(freedom_reg3, c("I(econ_freedom^2)", "I(econ_freedom^3)"))

The null hypothesis is that the polynomial terms (quadratic and cubic) jointly do not matter (and the relationship is therefore linear). We have sufficient evidence to reject that hypothesis (p-value is very small). Thus, the relationship is in fact not linear.

Part H

Instead of a polynomial model, try out a logarithmic model. It is hard to interpret percent changes on an index, but it is easy to understand percent changes in GDP per capita, so run a log-linear regression. Write out the estimated regression equation. What is the marginal effect of econ_freedom?

# log linear model 
freedom_reg4 <- lm(log(gdp_pc) ~ econ_freedom, data = freedom)

summary(freedom_reg4)

Call:
lm(formula = log(gdp_pc) ~ econ_freedom, data = freedom)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1046 -0.9507 -0.0533  0.9021  3.3551 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.2953     1.0323  -0.286    0.775    
econ_freedom   1.2889     0.1495   8.621 5.53e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.227 on 110 degrees of freedom
Multiple R-squared:  0.4032,    Adjusted R-squared:  0.3978 
F-statistic: 74.32 on 1 and 110 DF,  p-value: 5.525e-14

For every 1 point increase on the economic freedom index, a country’s GDP per capita increases by about \(1.2889 \times 100\% \approx 128.9\%\).

Part I

Make a scatterplot of your log-linear model with a regression line.

log_freedom_plot <- ggplot(data = freedom)+
    aes(x = econ_freedom,
        y = log(gdp_pc))+
    geom_point(aes(color = continent))+
    geom_smooth(method = "lm")+
  labs(x = "Economic Freedom Score (0-10)",
       y = "Log GDP per Capita")+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size=16)+
  theme(legend.position = "bottom")

log_freedom_plot
`geom_smooth()` using formula = 'y ~ x'

Part J

Put all of your results together in a regression output table with modelsummary from your answers in questions B, C, G, and H.

modelsummary(models = list("GDP per Capita" = freedom_reg1,
                           "GDP per Capita" = freedom_reg2,
                           "GDP per Capita" = freedom_reg3,
                           "Log GDP per Capita" = freedom_reg4),
             fmt = 2, # round to 2 decimals
             output = "html",
             coef_rename = c("(Intercept)" = "Constant",
                             "econ_freedom" = "Economic Freedom Score (0-10)",
                             "I(econ_freedom^2)" = "Economic Freedom Score<sup>2</sup>",
                             "I(econ_freedom^3)" = "Economic Freedom Score<sup>3</sup>"),
             gof_map = list(
               list("raw" = "nobs", "clean" = "n", "fmt" = 0),
               #list("raw" = "r.squared", "clean" = "R<sup>2</sup>", "fmt" = 2),
               list("raw" = "adj.r.squared", "clean" = "Adj. R<sup>2</sup>", "fmt" = 2),
               list("raw" = "rmse", "clean" = "SER", "fmt" = 2)
             ),
             escape = FALSE,
             stars = c('*' = .1, '**' = .05, '***' = 0.01)
)
| | GDP per Capita | GDP per Capita | GDP per Capita | Log GDP per Capita |
|---|---|---|---|---|
| Constant | −86400.07*** | 280416.48*** | 531003.39 | −0.30 |
| | (13361.77) | (79511.27) | (470172.38) | (1.03) |
| Economic Freedom Score (0-10) | 14704.30*** | −96618.52*** | −211572.48 | 1.29*** |
| | (1935.13) | (23908.06) | (213908.44) | (0.15) |
| Economic Freedom Score² | | 8326.61*** | 25674.41 | |
| | | (1783.32) | (32127.38) | |
| Economic Freedom Score³ | | | −862.15 | |
| | | | (1594.20) | |
| n | 112 | 112 | 112 | 112 |
| Adj. R² | 0.34 | 0.44 | 0.44 | 0.40 |
| SER | 15739.48 | 14368.05 | 14348.64 | 1.22 |

* p < 0.1, ** p < 0.05, *** p < 0.01

Footnotes

  1. GDP per capita (2018)↩︎

  2. Political freedom score (2018)↩︎

  3. Economic Freedom score (2016)↩︎