Problem Set 1

Author

Answer Key

Published

September 21, 2022

The Popularity of Baby Names

Install and load the package babynames. Run ?babynames to see what the data includes. Also, don’t forget to load tidyverse!

# write your code here! 
# install.packages("babynames") # install for first use 
# Note: I've "commented out" some of these commands (with a #) so they do not run when I run this chunk or render this document

library(babynames) # load for data
library(tidyverse) # load for data wrangling

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Question 1

Part A

What are the top 5 boys names for 2017, and what percent (note not the proportion!) of overall names is each?

# write your code here! 

# save as a new tibble
top_5_boys_2017 <- babynames %>% # take data
  filter(sex == "M", # filter by males
         year == 2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals

# look at our new tibble
top_5_boys_2017

The top 5 names are

top_5_boys_2017 %>%
  select(name,percent) %>%
  knitr::kable() # for nicer table in rendered document

name	percent
Liam	0.95
Noah	0.93
William	0.76
James	0.72
Logan	0.71

Alternatively, you could just write what you found manually into an object like:

top_5_boys_2017_alt <- c("Liam", "Noah", "William", "James", "Logan")

top_5_boys_2017_alt

[1] "Liam"    "Noah"    "William" "James"   "Logan"

# you could alternatively add a command, 
# %>% pull(name) to the first chunk of code, 
# and it would do the same thing, but we'd want to save it, 
# for example:

top_5_boys_2017_alt <- babynames %>%
  filter(sex=="M",
         year==2017) %>%
  arrange(desc(n)) %>% 
  slice(1:5) %>%
  mutate(percent = round(prop*100, 2)) %>%
  pull(name)
  
top_5_boys_2017_alt

[1] "Liam"    "Noah"    "William" "James"   "Logan"

Part B

What are the top 5 girls names for 2017, and what percent of overall names is each?

# write your code here! 
# save as a new tibble
top_5_girls_2017 <- babynames %>% # take data
  filter(sex == "F", # filter by females
         year == 2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(., n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals

# look at our new tibble
top_5_girls_2017

The top 5 names are

top_5_girls_2017 %>%
  select(name,percent) %>%
  knitr::kable()

name	percent
Emma	1.05
Olivia	0.99
Ava	0.85
Isabella	0.81
Sophia	0.79

Alternatively, you could just write what you found manually into an object like:

top_5_girls_2017_alt <- c("Emma", "Olivia", "Ava", "Isabella", "Sophia")

Question 2

Make two barplots of these top 5 names, one for each sex. Map aesthetics x to name and y to prop [or percent, if you made that variable, as I did.] and use geom_col (since you are declaring a specific y, otherwise you could just use geom_bar() and just an x.)
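To see the difference between the two geoms, here is a minimal sketch on a made-up data frame (the names and percents in toy are invented for illustration, not from babynames). I use layer_data() to peek at what each layer would actually draw: geom_bar() counts rows for each x, while geom_col() uses the y you give it.

```r
library(ggplot2)

# toy data, invented for illustration: one row per name
toy <- data.frame(
  name    = c("Ann", "Bob", "Cy"),
  percent = c(1.5, 0.9, 0.4)
)

# geom_bar() only needs x: it counts how many rows each name has
bar_counts <- layer_data(ggplot(toy, aes(x = name)) + geom_bar())

# geom_col() needs x and y: bar heights come straight from percent
col_heights <- layer_data(ggplot(toy, aes(x = name, y = percent)) + geom_col())

bar_counts$count # every name appears in one row, so every bar has height 1
col_heights$y    # bar heights are the percent values
```

Since our top 5 tibbles have exactly one row per name, geom_bar() would give five bars of equal height, which is not what we want here.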

# write your code here! 
ggplot(data = top_5_boys_2017)+
  aes(x = reorder(name, n), #note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  
  # all of the above is sufficient, now I'm just making it pretty
  scale_y_continuous(labels = function(x){paste0(x, "%")}, # add percent signs
                     breaks = seq(from = 0, # make line breaks every 0.25%
                                  to = 1,
                                  by = 0.25),
                     limits = c(0,1), # limit axis to between 0 and 1
                     expand = c(0,0))+ # don't let it go beyond this
  labs(x = "Name",
       y = "Percent of All Boys With Name",
       title = "Most Popular Boys Names in 2017",
       fill = "Boy's Name",
       caption = "Source: SSA")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)+
  coord_flip()+ # flip axes to make horizontal!
  scale_fill_viridis_d(option = "viridis")+ # use viridis discrete color palette
  theme(legend.position = "none") # hide legend

ggplot(data = top_5_girls_2017)+
  aes(x = reorder(name, n), #note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  # all of the above is sufficient, now I'm just making it pretty
  scale_y_continuous(labels = function(x){paste0(x, "%")}, # add percent signs
                     breaks = seq(from = 0, # make line breaks every 0.25%
                                  to = 1.25,
                                  by = 0.25),
                     limits = c(0,1.3), # limit axis to between 0 and 1.3
                     expand = c(0,0))+ # don't let it go beyond this
  labs(x = "Name",
       y = "Percent of All Girls With Name",
       title = "Most Popular Girls Names in 2017",
       fill = "Girl's Name",
       caption = "Source: SSA")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)+
  coord_flip()+ # flip axes to make horizontal!
  scale_fill_viridis_d(option = "viridis")+ # use viridis discrete color palette
  theme(legend.position = "none") # hide legend

If you had gone the alternate route by saving an object of names (like I did above with top_5_boys_2017_alt and top_5_girls_2017_alt), you could filter the data using the %in% operator to use for your data layer of each plot.

Note you can also simply pipe your wrangling code into ggplot(), since the first layer is the data source:

babynames %>%
  filter(sex == "M", # still need to restrict to boys
         year == 2017, # and to 2017
         name %in% top_5_boys_2017_alt) %>%
  ggplot()+ # this pipes the above into the data layer
  # the rest of the plot code...
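To see what %in% does on its own, here is a minimal sketch on a made-up tibble (the values in toy and the vector keep_names are hypothetical, invented for illustration). It returns TRUE for every element on the left that appears anywhere in the vector on the right, so inside filter() it keeps only the matching rows.

```r
library(dplyr)

# toy data, invented for illustration (not real babynames values)
toy <- tibble(
  name = c("Liam", "Noah", "Zed", "Liam"),
  year = c(2016, 2017, 2017, 2017),
  n    = c(10, 8, 1, 12)
)

keep_names <- c("Liam", "Noah") # hypothetical vector of names to keep

# one TRUE/FALSE per row: is this row's name in keep_names?
toy$name %in% keep_names

# use it inside filter(), alongside any other conditions
toy_2017 <- toy %>%
  filter(name %in% keep_names,
         year == 2017)

toy_2017
```

This is a lot less typing than chaining name == "Liam" | name == "Noah" | ... for every name.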

Question 3

Find your name. [If your name isn’t in there 😟, pick a random name.] Count by sex how many babies since 1880 were given your name. [Hint: if you only use count(sex), you’ll get the number of rows (years) in the data. You want to add up the number of babies in each row (n), so inside count(), add wt = n to weight the count by n.] Also create a variable for the percent of each sex.

# write your code here! 
babynames %>%
  filter(name == "Ryan") %>%
  count(sex, wt = n) %>%
  mutate(percent = round((n/sum(n)*100),2))
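To make the hint concrete, here is a minimal sketch on a made-up tibble (the numbers in toy are invented for illustration). I set the name argument explicitly here so the two output columns are easy to tell apart; that is my choice for the illustration, not something the question requires.

```r
library(dplyr)

# toy data, invented for illustration: one row per sex-year
toy <- tibble(
  year = c(2000, 2001, 2002, 2000),
  sex  = c("M", "M", "M", "F"),
  n    = c(100, 120, 80, 5)
)

# unweighted: counts the number of rows (years) for each sex
rows_by_sex <- toy %>%
  count(sex, name = "rows")

# weighted: adds up n (babies) for each sex, then converts to a percent
babies_by_sex <- toy %>%
  count(sex, wt = n, name = "babies") %>%
  mutate(percent = round(babies / sum(babies) * 100, 2))

rows_by_sex
babies_by_sex
```

The unweighted version says there are 3 rows for M and 1 for F, while the weighted version says there are 300 boys and 5 girls, which is the quantity we actually want.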

Question 4

Make a line graph of the number of babies with your name over time, colored by sex.

# write your code here! 

# first wrangle data
babynames %>%
  filter(name == "Ryan") %>%

  # now we pipe into ggplot
  ggplot()+
  aes(x = year,
      y = n,
      color = sex)+
  geom_line(size = 1)+
  scale_color_manual(values = c("F" = "#e64173", # make my own colors
                                "M" = "#0047AB"))+
  labs(x = "Year",
       y = "Number of Babies",
       title = "Popularity of Babies Named 'Ryan'",
       color = "Sex",
       caption = "Source: SSA")+
    theme_classic(base_family = "Fira Sans Condensed", base_size=16)

Question 5

Part A

Find the most common name for boys in each year from 1980 to 2017. [Hint: you’ll want to first group_by(year). Once you’ve got all the right conditions, you’ll get a table with a lot of data. You only want to keep the first row of each year’s group, so add %>% slice(1).]

# write your code here! 

babynames %>%
  group_by(year) %>% # we want one observation per year
  filter(sex == "M",
         year > 1979) %>% # or year >= 1980
  arrange(desc(n)) %>% # start with largest n first
  slice(1) # take first row only
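To see why the grouping matters, here is a minimal sketch on a made-up tibble (the values in toy are invented for illustration). Because the data is grouped by year, slice(1) keeps the first row of each year rather than the first row of the whole table. I also show slice_max(), which is one alternative way to get the same rows in a single step.

```r
library(dplyr)

# toy data, invented for illustration
toy <- tibble(
  year = c(1980, 1980, 1981, 1981),
  name = c("A", "B", "A", "B"),
  n    = c(30, 50, 60, 40)
)

# same pattern as above: group, sort largest first, keep first row per group
top_by_year <- toy %>%
  group_by(year) %>%
  arrange(desc(n)) %>%
  slice(1) %>%
  ungroup()

# alternative: slice_max() picks the row(s) with the largest n in each group
top_by_year_alt <- toy %>%
  group_by(year) %>%
  slice_max(order_by = n, n = 1) %>%
  ungroup()

top_by_year
top_by_year_alt
```

Note that slice_max() keeps ties by default, so it could return more than one row for a year if two names had exactly the same n.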

Part B

Now do the same for girls.

# write your code here! 

babynames %>%
  group_by(year) %>% # we want one observation per year
  filter(sex == "F",
         year > 1979) %>% # or year >= 1980
  arrange(desc(n)) %>% # start with largest n first
  slice(1) # take first row only

Question 6

Now let’s graph the evolution of the most common names since 1880.

Part A

First, find the top 5 most popular names overall for boys and for girls in the data. [Hint: first group_by(name).] You may want to create two objects, each with these top 5 names as character elements.

# write your code here! 

babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex == "M") %>%
  summarize(total = sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5)

# make a vector of the names (we'll need this for our graph below)
top_boys_names <- c("James", "John", "Robert", "Michael", "William")

# you could alternatively add a command, 
# %>% pull(name) to the first chunk of code, 
# and it would do the same thing, but we'd want to save it, 
# for example:

babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex == "M") %>%
  summarize(total = sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5) %>%
  pull(name)

[1] "James"   "John"    "Robert"  "Michael" "William"

babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex == "F") %>%
  summarize(total = sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5)

# make a vector of the names (we'll need this for our graph below)
top_girls_names <- c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda")

Part B

Now make two line graphs of these 5 names over time, one for boys and one for girls. [Hint: you’ll first want to subset the data to use for your data layer in the plot. First group_by(year) and also make sure you only use the names you found in Part A. Try using the %in% operator to do this.]

# write your code here! 

babynames %>%
  group_by(year) %>%
  filter(sex == "M",
         name %in% top_boys_names) %>%
  ggplot()+
  aes(x = year,
      y = prop,
      color = name)+
  geom_line(size = 1)+
  labs(x = "Year",
       y = "Proportion of Babies with Name",
       title = "Most Popular Boys Names Since 1880",
       color = "Boy's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size = 16)

babynames %>%
  group_by(year) %>%
  filter(sex == "F",
         name %in% top_girls_names) %>%
  ggplot()+
  aes(x = year,
      y = prop,
      color = name)+
  geom_line(size = 1)+
  labs(x = "Year",
       y = "Proportion of Babies with Name",
       title = "Most Popular Girls Names Since 1880",
       color = "Girl's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size = 16)

Political and Economic Freedom Around the World

For the remaining questions, we’ll look at the relationship between Economic Freedom and Political Freedom in countries around the world today. Our data for economic freedom comes from the Fraser Institute, and our data for political freedom comes from Freedom House.

Question 7

Download these two datasets that I’ve cleaned up a bit: [If you want a challenge, try downloading them from the websites and cleaning them up yourself!]

Below is a brief description of the variables I’ve put in each dataset:

Econ Freedom

Variable	Description
`year`	Year
`ISO`	Three-letter country code
`country`	Name of the country
`ef_index`	Total economic freedom index (0 - least to 10 - most)
`rank`	Rank of the country in terms of economic freedom
`continent`	Continent the country is in

Pol Freedom

Variable	Description
`country`	Name of the country
`C/T`	Whether the location is a country (C) or territory (T)
`year`	Year
`status`	Whether the location is Free (F), Partly Free (PF) or Not Free (NF)
`fh_score`	Total political freedom index (0 - least to 100 - most)

Import and save them each as an object using my_df_name <- read_csv("name_of_the_file.csv"). I suggest one as econ and the other as pol, but it’s up to you. Look at each object you’ve created.

# write your code here! 

# import data with read_csv from readr

# note these file paths assume you have these files right in your working directory

econ <- read_csv("econ_freedom.csv")

Rows: 4050 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ISO, country, continent
dbl (3): year, ef_index, rank

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pol <- read_csv("pol_freedom.csv")

Rows: 1885 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, C/T, status
dbl (2): year, fh_score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# look at each dataframe
econ

pol

Question 8

Now let’s join them together so that we can have a single dataset to work with. You can learn more about this in the 1.4 slides. Since both datasets have both country and year (spelled exactly the same in both!), we can use these two variables as a key to combine observations. Run the following code (substituting whatever you want to name your objects):

freedom <- left_join(econ, pol, # join pol tibble to econ tibble
                     by = c("country", "year")) # keys to match variables between two tibbles!

Take a look at freedom to make sure it appears to have worked.

# write your code here! 
freedom
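To see what left_join() does with the two keys, here is a minimal sketch on two made-up tibbles (econ_toy, pol_toy, and the country names are hypothetical, invented for illustration). Every row of the left table is kept; rows with no match in the right table get NA in the new columns.

```r
library(dplyr)

# toy tables, invented for illustration
econ_toy <- tibble(
  country  = c("Country A", "Country A", "Country B"),
  year     = c(2017, 2018, 2018),
  ef_index = c(7.1, 7.3, 5.2)
)

pol_toy <- tibble(
  country  = c("Country A", "Country B"),
  year     = c(2018, 2018),
  fh_score = c(90, 35)
)

# match rows where BOTH country and year agree
joined <- left_join(econ_toy, pol_toy,
                    by = c("country", "year"))

joined

# rows from the left table that found no match have NA for fh_score
unmatched <- joined %>%
  filter(is.na(fh_score))

unmatched
```

A quick check like the last one is worth running on freedom too: if the econ data covers country-years that the pol data does not, those rows will carry NA for fh_score.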

Question 9

Part A

Make a barplot of the 10 countries with the highest Economic Freedom index score in 2018. You may want to find this first and save it as an object to use for your plot’s data layer. Use geom_col() since we will map ef_index to y. If you want to order the bars, set x = fct_reorder(ISO, desc(ef_index)) to reorder ISO (or country, if you prefer) by EF score in descending order.

# write your code here! 

# grab the top 10 countries by ef in 2018
ef_10 <- freedom %>%
  filter(year == 2018) %>%
  arrange(desc(ef_index)) %>%
  slice(1:10)

# look at it just to check
ef_10

# now plot it
ggplot(data = ef_10)+
  aes(x = fct_reorder(ISO, desc(ef_index)), # reorder ISO by ef_index, largest first
      y = ef_index)+
  geom_col(aes(fill = continent))+ # coloring is optional
  
  # above is sufficient, now let's just make it prettier
  geom_text(aes(label = ef_index), # add the score onto the bar
            vjust = 1.2, # adjust it vertically
            color = "white")+
  scale_y_continuous(breaks = seq(0,10,2),
                     limits = c(0,10),
                     expand = c(0,0)
                     )+
  labs(x = "Country",
       y = "Economic Freedom Score",
       title = "Top 10 Countries by Economic Freedom",
       caption = "Source: Fraser Institute",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold",
                                  size = rel(1.5))
        )
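To see what fct_reorder() is doing to the x axis, here is a minimal sketch with made-up values (the codes in iso and the numbers in score are hypothetical, invented for illustration). ggplot draws a discrete axis in the order of the factor levels, so changing the level order changes the bar order.

```r
library(forcats)
library(dplyr) # for desc()

# toy values, invented for illustration
iso   <- c("AAA", "BBB", "CCC") # hypothetical country codes
score <- c(7.5, 8.9, 8.1)

# default: levels ordered from smallest score to largest
levels_ascending <- levels(fct_reorder(iso, score))

# wrapping the score in desc() flips it: largest score first
levels_descending <- levels(fct_reorder(iso, desc(score)))

levels_ascending
levels_descending
```

Without any reordering, the levels (and so the bars) would simply be in alphabetical order.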

Part B

Make a barplot of the 10 countries with the highest Freedom House index score in 2018, similar to what you did for Part A.

# write your code here! 

# grab the top 10 countries by fh in 2018
pf_10 <- freedom %>%
  filter(year == 2018) %>%
  arrange(desc(fh_score)) %>%
  slice(1:10)

# look at it just to check
pf_10

# now plot it
ggplot(data = pf_10)+
  aes(x = fct_reorder(ISO, desc(fh_score)),
      y = fh_score)+
  geom_col(aes(fill = continent))+ # coloring is optional
  # above is sufficient, now let's just make it prettier
  geom_text(aes(label = fh_score), # add the score onto the bar
            vjust = 1.2, # adjust it vertically
            color = "white")+
  scale_y_continuous(breaks = seq(0,100,20),
                     limits = c(0,100),
                     expand = c(0,0))+
  labs(x = "Country",
       y = "Political Freedom Score",
       title = "Top 10 Countries by Political Freedom",
       caption = "Source: Freedom House",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5))
        )

Question 10

Now make a scatterplot of Political freedom (fh_score as y) on Economic Freedom (ef_index as x) in the year 2018, and color by continent.

# write your code here! 

# note I'm going to save the plot as an object called p, for next question
p <- freedom %>%
  filter(year == 2018) %>%
  ggplot()+
  aes(x = ef_index,
      y = fh_score)+
  # doing just geom_point() is fine, but since there's a lot of overlap, here are some things I like to do:
  geom_point(aes(fill = continent), # fill the points with color by continent
             alpha = 0.9, # make points slightly transparent
             color = "white", # outline the points with a white border
             pch = 21, # this shape has an outline and a fill color
             size = 3)+
  scale_x_continuous(breaks = seq(0,10,2),
                     limits = c(0,10),
                     expand = c(0,0))+
  scale_y_continuous(breaks = seq(0,100,20),
                     limits = c(0,105),
                     expand = c(0,0))+
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       caption = "Sources: Fraser Institute, Freedom House",
       title = "Economic Freedom & Political Freedom",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5))
        )

# look at plot
p

Warning: Removed 13 rows containing missing values (geom_point).

Question 11

Save your plot from Question 10 as an object, and add a new layer where we will highlight a few countries. Pick a few countries (I suggest using the ISO code) and create a new object filtering the data to only include these countries (again, the %in% operator will be most helpful here).

Additionally, install and load a package called "ggrepel", which will adjust labels so they do not overlap on a plot.

Then, add the following layer to your plot:

geom_label_repel(data = countries, # or whatever object name you created
                     aes(x = ef_index,
                         y = fh_score,
                         label = ISO, # show ISO as label (you could do country instead)
                         color = continent),
                     alpha = 0.5, # make it a bit transparent
                     box.padding = 0.75, # control how far labels are from points
                     show.legend = FALSE) # don't want this to add to the legend

This should highlight these countries on your plot.

library(ggrepel)

# pick some countries
some_countries <- freedom %>%
  filter(year==2018,
         country %in% c("United States",
                        "United Kingdom",
                        "Sweden",
                        "China",
                        "Singapore",
                        "Russian Federation",
                        "Korea, Rep.",
                        "Hong Kong SAR, China"))

# write your code here! 

p + geom_label_repel(data = some_countries, # or whatever object name you created
                     aes(x = ef_index,
                         y = fh_score,
                         label = ISO, # show ISO as label (you could do country instead)
                         color = continent),
                     alpha = 0.75, # make it a bit transparent
                     box.padding = 0.75, # control how far labels are from points
                     show.legend = FALSE) # don't want this to add to the legend

Warning: Removed 13 rows containing missing values (geom_point).

Question 12

Let’s just look only at the United States and see how it has fared in both measures of freedom over time. filter() the data to look only at the United States (its ISO is "USA"). Use both a geom_point() layer and a geom_path() layer, which will connect the dots over time. Let’s also see this by labeling the years with an additional layer geom_text_repel(aes(label = year)).

# write your code here! 

# save plot as us
us <- freedom %>%
  filter(ISO == "USA") %>%
  ggplot()+
  aes(x = ef_index,
      y = fh_score)+
  geom_point(color = "red")+
  geom_path(color = "red")+
  geom_text_repel(aes(label = year),
                  color = "red")+
  scale_x_continuous(breaks = seq(8,8.5,0.05),
                     limits = c(8,8.5),
                     expand = c(0,0))+
  scale_y_continuous(breaks = seq(85,95,1),
                     limits = c(85,95),
                     expand = c(0,0))+
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       caption = "Sources: Fraser Institute, Freedom House",
       title = "U.S. Political & Economic Freedom, 2013-2018",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5))
        )

# look at it
us

Warning: Removed 19 rows containing missing values (geom_point).

Warning: Removed 19 row(s) containing missing values (geom_path).

Warning: Removed 19 rows containing missing values (geom_text_repel).

Note that the way I zoomed in on the scales, these look like pretty dramatic changes!

If we maintain the full perspective, the change appears minor. Be very careful how you present your analysis!

us +
  # force scales to show full range of 0-10 for x, 0-100 for y
  scale_x_continuous(breaks = seq(0,10,1),
                     limits = c(0,10),
                     expand = c(0,0)
                     )+
  scale_y_continuous(breaks = seq(0,100,10),
                     limits = c(0,100),
                     expand = c(0,0)
                     )

Scale for 'x' is already present. Adding another scale for 'x', which will
replace the existing scale.

Scale for 'y' is already present. Adding another scale for 'y', which will
replace the existing scale.

Warning: Removed 19 rows containing missing values (geom_point).

Warning: Removed 19 row(s) containing missing values (geom_path).

Warning: Removed 19 rows containing missing values (geom_text_repel).

Knit and Submit!

When you are done, click the Render button. Because the yaml header currently specifies format: html, this will produce an html webpage, which should automatically open for your review.

Notice in the Files pane in RStudio (by default, the lower right one), there should now be a document called 01-problem-set.html (or, if you changed the filename, another file ending in .html). This is the webpage, so you can find this file on your computer (or download it from Rstudio.cloud by clicking the checkbox next to the file in the Files pane and then going to More -> Export... to download the file to your computer) and send this file.

If you want to make a PDF, install the package “tinytex” and then run tinytex::install_tinytex() in the console to install a LaTeX distribution.

Then delete the lines in the yaml header that say format: html: self-contained: TRUE, and add a simple line that says format: pdf. Clicking Render will now produce a PDF, show it, and save it as a new file in the Files pane.

Either way, send me your output file, html or pdf (or, if you like, word) so long as it shows the input and output code of every chunk. I have set it by default to do this, with echo: true in the yaml header.

Don’t forget to add your name to the author part of the header!