Install and load the package babynames. Run ?babynames to open the help page and see what the data includes. Also, don’t forget to load tidyverse!
# write your code here!
# install.packages("babynames") # install for first use
# Note I've “commented” out some of these commands (with a #) so they do not run
# when I run this chunk or render this document
library(babynames) # load for data
library(tidyverse) # load for data wrangling
What are the top 5 boys names for 2017, and what percent (note not the proportion!) of overall names is each?
# write your code here!
# save as a new tibble
top_5_boys_2017 <- babynames %>% # take data
  filter(sex == "M", # filter by males
         year == 2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals

# look at our new tibble
top_5_boys_2017
The top 5 names are
top_5_boys_2017 %>%
  select(name, percent) %>%
  knitr::kable() # for nicer table in rendered document
| name    | percent |
|:--------|--------:|
| Liam    |    0.95 |
| Noah    |    0.93 |
| William |    0.76 |
| James   |    0.72 |
| Logan   |    0.71 |
Alternatively, you could save just the names as a character vector, either by typing what you found into an object manually or by extracting them with pull(name):
# you could alternatively add a command,
# %>% pull(name) to the first chunk of code,
# and it would do the same thing, but we'd want to save it,
# for example:
top_5_boys_2017_alt <- babynames %>%
  filter(sex == "M", year == 2017) %>%
  arrange(desc(n)) %>%
  slice(1:5) %>%
  mutate(percent = round(prop*100, 2)) %>%
  pull(name)

top_5_boys_2017_alt
[1] "Liam" "Noah" "William" "James" "Logan"
Part B
What are the top 5 girls names for 2017, and what percent of overall names is each?
# write your code here!
# save as a new tibble
top_5_girls_2017 <- babynames %>% # take data
  filter(sex == "F", # filter by females
         year == 2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(., n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals

# look at our new tibble
top_5_girls_2017
Make two barplots of these top 5 names, one for each sex. Map aesthetics x to name and y to prop [or percent, if you made that variable, as I did.] and use geom_col (since you are declaring a specific y, otherwise you could just use geom_bar() and just an x.)
# write your code here!
ggplot(data = top_5_boys_2017)+
  aes(x = reorder(name, n), # note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  # all of the above is sufficient, now I'm just making it pretty
  scale_y_continuous(labels = function(x){paste0(x, "%")}, # add percent signs
                     breaks = seq(from = 0, # make line breaks every 0.25%
                                  to = 1,
                                  by = 0.25),
                     limits = c(0,1), # limit axis to between 0 and 1
                     expand = c(0,0))+ # don't let it go beyond this
  labs(x = "Name",
       y = "Percent of All Boys With Name",
       title = "Most Popular Boys Names in 2017",
       fill = "Boy's Name",
       caption = "Source: SSA")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)+
  coord_flip()+ # flip axes to make horizontal!
  scale_fill_viridis_d(option = "viridis")+ # use viridis discrete color palette ("default" is not a valid option name)
  theme(legend.position = "none") # hide legend
ggplot(data = top_5_girls_2017)+
  aes(x = reorder(name, n), # note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  # all of the above is sufficient, now I'm just making it pretty
  scale_y_continuous(labels = function(x){paste0(x, "%")}, # add percent signs
                     breaks = seq(from = 0, # make line breaks every 0.25%
                                  to = 1.25,
                                  by = 0.25),
                     limits = c(0,1.3), # limit axis to between 0 and 1.3
                     expand = c(0,0))+ # don't let it go beyond this
  labs(x = "Name",
       y = "Percent of All Girls With Name",
       title = "Most Popular Girls Names in 2017",
       fill = "Girl's Name",
       caption = "Source: SSA")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)+
  coord_flip()+ # flip axes to make horizontal!
  scale_fill_viridis_d(option = "viridis")+ # use viridis discrete color palette ("default" is not a valid option name)
  theme(legend.position = "none") # hide legend
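To make the geom_col() versus geom_bar() distinction concrete, here is a minimal sketch on made-up data (the data frames `summarized` and `raw` are hypothetical, not part of babynames). It uses ggplot2's layer_data() to show the bar heights each geom would draw.

```r
library(ggplot2)

# Toy data (made up for illustration)
summarized <- data.frame(name = c("A", "B", "C"), percent = c(0.9, 0.7, 0.5))
raw <- data.frame(name = c("A", "A", "B"))

# geom_col(): bar heights come straight from the y variable
p_col <- ggplot(summarized)+
  aes(x = name, y = percent)+
  geom_col()

# geom_bar(): only x is given, and heights are the row counts per x value
p_bar <- ggplot(raw)+
  aes(x = name)+
  geom_bar()

# inspect the heights each layer computed
layer_data(p_col)$y
layer_data(p_bar)$y
```

Since the top-5 tibbles already hold one row per name with its percent, geom_col() is the natural fit here.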
If you had gone the alternate route by saving an object of names (like top_5_boys_2017_alt above, and a top_5_girls_2017_alt made the same way with sex == "F"), you could filter the data using the %in% operator to build the data layer of each plot.
Note you can also simply pipe your wrangling code into ggplot(), since the first layer is the data source:
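babynames %>%
  filter(sex == "M", year == 2017, # keep only boys in 2017, otherwise every year and both sexes come through
         name %in% top_5_boys_2017_alt) %>%
  ggplot()+ # this pipes the above into the data layer
  # the rest of the plot code...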
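As a quick illustration of what %in% does inside filter(), here is a small sketch on made-up data; `top_names` and `toy` are hypothetical stand-ins for a saved vector of names and for babynames.

```r
library(dplyr)

top_names <- c("Liam", "Noah") # hypothetical vector of saved names

# %in% returns one TRUE/FALSE per element on its left-hand side
matches <- c("Liam", "Emma", "Noah") %in% top_names
matches

# Toy data (made up for illustration)
toy <- tibble(name = c("Liam", "Emma", "Noah", "Liam"),
              n = c(5, 7, 3, 2))

# filter() keeps the rows where the test is TRUE
kept <- toy %>%
  filter(name %in% top_names)
kept
```

Both "Liam" rows survive, which is why the real data usually also needs conditions on sex and year.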
Question 3
Find your name. [If your name isn’t in there 😟, pick a random name.] Use count() by sex to find how many babies since 1880 were given your name. [Hint: if you do only this, you’ll get the number of rows (years) there are in the data. You want to add up the number of babies in each row (n), so inside count, add , wt = n to weight the count by n.] Also create a variable for the percent of each sex.
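A minimal sketch of the weighted count described in the hint, on made-up numbers (the tibble `toy` and the output column name `total` are assumptions for illustration; with the real data you would start from babynames and filter to your name):

```r
library(dplyr)

# Toy data (made up for illustration): two years of one name, by sex
toy <- tibble(
  year = c(2000, 2000, 2001, 2001),
  sex  = c("F", "M", "F", "M"),
  name = "Ryan",
  n    = c(10, 190, 30, 170)
)

# Without wt, count() returns the number of rows (years) per sex
toy %>%
  count(sex)

# With wt = n, count() adds up n within each sex instead
ryan_by_sex <- toy %>%
  count(sex, wt = n, name = "total") %>% # name the output column explicitly
  mutate(percent = round(total / sum(total) * 100, 2))

ryan_by_sex
```

Naming the output column `total` is optional; it just avoids any confusion with the existing `n` column.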
Make a line graph of the number of babies with your name over time, colored by sex.
# write your code here!
# first wrangle data
babynames %>%
  filter(name == "Ryan") %>%
  # now we pipe into ggplot
  ggplot()+
  aes(x = year,
      y = n,
      color = sex)+
  geom_line(size = 1)+
  scale_color_manual(values = c("F" = "#e64173", # make my own colors
                                "M" = "#0047AB"))+
  labs(x = "Year",
       y = "Number of Babies",
       title = "Popularity of Babies Named 'Ryan'",
       color = "Sex",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size = 16)
Question 5
Part A
Find the most common name for boys by year between 1980-2017. [Hint: you’ll want to first group_by(year). Once you’ve got all the right conditions, you’ll get a table with a lot of data. You only want to keep just the 1st row of each grouped-year’s data, so add %>% slice(1).]
# write your code here!
babynames %>%
  group_by(year) %>% # we want one observation per year
  filter(sex == "M", year > 1979) %>% # or >= 1980
  arrange(desc(n)) %>% # start with largest n first
  slice(1) # take first row only
Part B
Now do the same for girls.
# write your code here!
babynames %>%
  group_by(year) %>% # we want one observation per year
  filter(sex == "F", year > 1979) %>% # or >= 1980
  arrange(desc(n)) %>% # start with largest n first
  slice(1) # take first row only
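The arrange-then-slice(1) pattern can also be written with dplyr's slice_max(), which picks the row with the largest value within each group. This is only an alternative sketch on made-up data (the tibble `toy` is hypothetical), not the approach the question asks for:

```r
library(dplyr)

# Toy data (made up for illustration), shaped like babynames
toy <- tibble(
  year = c(2016, 2016, 2017, 2017),
  sex  = c("F", "F", "F", "F"),
  name = c("Emma", "Olivia", "Emma", "Olivia"),
  n    = c(100, 90, 80, 95)
)

top_by_year <- toy %>%
  filter(sex == "F") %>%
  group_by(year) %>% # one group per year
  slice_max(n, n = 1, with_ties = FALSE) %>% # keep the row with the largest n in each year
  ungroup()

top_by_year
```

Setting with_ties = FALSE guarantees exactly one row per year even if two names tie.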
Question 6
Now let’s graph the evolution of the most common names since 1880.
Part A
First, find the top 5 overall most popular names for boys and for girls in the data. [Hint: first group_by(name).] You may want to create two objects, each with these top 5 names as character elements.
# write your code here!
babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex == "M") %>%
  summarize(total = sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5)

# make a vector of the names (we'll need this for our graph below)
top_boys_names <- c("James", "John", "Robert", "Michael", "William")

# you could alternatively add a command,
# %>% pull(name) to the first chunk of code,
# and it would do the same thing, but we'd want to save it,
# for example:
babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex == "M") %>%
  summarize(total = sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5) %>%
  pull(name)
[1] "James" "John" "Robert" "Michael" "William"
babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex == "F") %>%
  summarize(total = sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5)

# make a vector of the names (we'll need this for our graph below)
top_girls_names <- c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda")
Part B
Now make two line graphs of these 5 names over time, one for boys, and one for girls. [Hint: you’ll first want to subset the data to use for your data layer in the plot. First group_by(year) and also make sure you only use the names you found in Part A. Try using the %in% operator to do this.]
# write your code here!
babynames %>%
  group_by(year) %>%
  filter(sex == "M",
         name %in% top_boys_names) %>%
  ggplot()+
  aes(x = year,
      y = prop,
      color = name)+
  geom_line(size = 1)+
  labs(x = "Year",
       y = "Proportion of Babies with Name",
       title = "Most Popular Boys Names Since 1880",
       color = "Boy's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size = 16)

babynames %>%
  group_by(year) %>%
  filter(sex == "F",
         name %in% top_girls_names) %>%
  ggplot()+
  aes(x = year,
      y = prop,
      color = name)+
  geom_line(size = 1)+
  labs(x = "Year",
       y = "Proportion of Babies with Name",
       title = "Most Popular Girls Names Since 1880",
       color = "Girl's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size = 16)
Political and Economic Freedom Around the World
For the remaining questions, we’ll look at the relationship between Economic Freedom and Political Freedom in countries around the world today. Our data for economic freedom comes from the Fraser Institute, and our data for political freedom comes from Freedom House.
Question 7
Download these two datasets that I’ve cleaned up a bit: [If you want a challenge, try downloading them from the websites and cleaning them up yourself!]
Below is a brief description of the variables I’ve put in each dataset:
Econ Freedom

| Variable  | Description                                            |
|:----------|:-------------------------------------------------------|
| year      | Year                                                   |
| ISO       | Three-letter country code                              |
| country   | Name of the country                                    |
| ef_index  | Total economic freedom index (0 - least to 10 - most)  |
| rank      | Rank of the country in terms of economic freedom       |
| continent | Continent the country is in                            |
Pol Freedom

| Variable | Description                                                          |
|:---------|:---------------------------------------------------------------------|
| country  | Name of the country                                                  |
| C/T      | Whether the location is a country (C) or territory (T)               |
| year     | Year                                                                 |
| status   | Whether the location is Free (F), Partly Free (PF) or Not Free (NF)  |
| fh_score | Total political freedom index (0 - least to 100 - most)              |
Import and save them each as an object using my_df_name <- read_csv("name_of_the_file.csv"). I suggest one as econ and the other as pol, but it’s up to you. Look at each object you’ve created.
# write your code here!
# import data with read_csv from readr
# note these file paths assume you have these files right in your working directory
econ <- read_csv("econ_freedom.csv")
Rows: 4050 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ISO, country, continent
dbl (3): year, ef_index, rank
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pol <- read_csv("pol_freedom.csv")
Rows: 1885 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, C/T, status
dbl (2): year, fh_score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# look at each dataframe
econ
pol
Question 8
Now let’s join them together so that we can have a single dataset to work with. You can learn more about this in the 1.4 slides. Since both datasets have both country and year (spelled exactly the same in both!), we can use these two variables as a key to combine observations. Run the following code (substituting whatever you want to name your objects):
freedom <- left_join(econ, pol, # join pol tibble to econ tibble
                     by = c("country", "year")) # keys to match variables between two tibbles!
Take a look at freedom to make sure it appears to have worked.
# write your code here!
freedom
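Beyond eyeballing the result, one way to check a left join is to ask which rows found no match. The sketch below uses made-up stand-ins (`econ_toy`, `pol_toy`) rather than the real files, and dplyr's anti_join() to list the unmatched keys:

```r
library(dplyr)

# Toy stand-ins for econ and pol (made up for illustration)
econ_toy <- tibble(country = c("A", "B", "C"),
                   year = c(2018, 2018, 2018),
                   ef_index = c(8.1, 7.5, 6.0))
pol_toy <- tibble(country = c("A", "B"),
                  year = c(2018, 2018),
                  fh_score = c(90, 70))

# left_join() keeps every row of the first tibble;
# rows with no match in the second get NA in the new columns
joined <- left_join(econ_toy, pol_toy, by = c("country", "year"))
joined

# anti_join() returns the rows of the first tibble that found no match
unmatched <- anti_join(econ_toy, pol_toy, by = c("country", "year"))
unmatched
```

With the real data, the same check would be anti_join(econ, pol, by = c("country", "year")); countries spelled differently in the two sources would show up there.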
Question 9
Part A
Make a barplot of the 10 countries with the highest Economic Freedom index score in 2018. You may want to find this first and save it as an object to use for your plot’s data layer. Use geom_col() since we will map ef_index to y. If you want to order the bars, set x = fct_reorder(ISO, desc(ef_index)) to reorder ISO (or country, if you prefer) by EF score in descending order.
# write your code here!
# grab the top 10 countries by ef in 2018
ef_10 <- freedom %>%
  filter(year == 2018) %>%
  arrange(desc(ef_index)) %>%
  slice(1:10)

# look at it just to check
ef_10

# now plot it
ggplot(data = ef_10)+
  aes(x = fct_reorder(ISO, desc(ef_index)), # reorder ISO by ef in order
      y = ef_index)+
  geom_col(aes(fill = continent))+ # coloring is optional
  # above is sufficient, now let's just make it prettier
  geom_text(aes(label = ef_index), # add the score onto the bar
            vjust = 1.2, # adjust it vertically
            color = "white")+
  scale_y_continuous(breaks = seq(0,10,2),
                     limits = c(0,10),
                     expand = c(0,0))+
  labs(x = "Country",
       y = "Economic Freedom Score",
       title = "Top 10 Countries by Economic Freedom",
       caption = "Source: Fraser Institute",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5)))
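To see what the reordering does to the factor levels (which determine bar order on the x axis), here is a small sketch on made-up values; `iso` and `score` are hypothetical. It uses fct_reorder()'s own .desc argument, which gives the same ordering as wrapping the score in desc().

```r
library(forcats)

# Toy data (made up for illustration)
iso   <- factor(c("AAA", "BBB", "CCC"))
score <- c(2, 3, 1)

# default levels are alphabetical
levels(iso)

# reorder levels from highest score to lowest
reordered <- fct_reorder(iso, score, .desc = TRUE)
levels(reordered)
```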
Part B
Make a barplot of the 10 countries with the highest Freedom House index score in 2018, similar to what you did for Part A.
# write your code here!
# grab the top 10 countries by fh in 2018
pf_10 <- freedom %>%
  filter(year == 2018) %>%
  arrange(desc(fh_score)) %>%
  slice(1:10)

# look at it just to check
pf_10

# now plot it
ggplot(data = pf_10)+
  aes(x = fct_reorder(ISO, desc(fh_score)),
      y = fh_score)+
  geom_col(aes(fill = continent))+ # coloring is optional
  # above is sufficient, now let's just make it prettier
  geom_text(aes(label = fh_score), # add the score onto the bar
            vjust = 1.2, # adjust it vertically
            color = "white")+
  scale_y_continuous(breaks = seq(0,100,20),
                     limits = c(0,100),
                     expand = c(0,0))+
  labs(x = "Country",
       y = "Political Freedom Score",
       title = "Top 10 Countries by Political Freedom",
       caption = "Source: Freedom House",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5)))
Question 10
Now make a scatterplot of Political freedom (fh_score as y) on Economic Freedom (ef_index as x) in the year 2018, and color by continent.
# write your code here!
# note I'm going to save the plot as an object called p, for next question
p <- freedom %>%
  filter(year == 2018) %>% # year is numeric, so compare to a number
  ggplot()+
  aes(x = ef_index,
      y = fh_score)+
  # doing just geom_point() is fine, but since there's a lot of overlap, here are some things I like to do:
  geom_point(aes(fill = continent), # fill the points with color by continent
             alpha = 0.9, # make points slightly transparent
             color = "white", # outline the points with a white border
             pch = 21, # this shape has an outline and a fill color
             size = 3)+
  scale_x_continuous(breaks = seq(0,10,2),
                     limits = c(0,10),
                     expand = c(0,0))+
  scale_y_continuous(breaks = seq(0,100,20),
                     limits = c(0,105),
                     expand = c(0,0))+
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       caption = "Sources: Fraser Institute, Freedom House",
       title = "Economic Freedom & Political Freedom",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5)))

# look at plot
p
Save your plot from Question 10 as an object, and add a new layer where we will highlight a few countries. Pick a few countries (I suggest using the ISO code) and create a new object filtering the data to only include these countries (again the %in% operator will be most helpful here).
Additionally, install and load a package called "ggrepel", which will adjust labels so they do not overlap on a plot.
Then, add the following layer to your plot:
geom_label_repel(data = countries, # or whatever object name you created
                 aes(x = ef_index,
                     y = fh_score,
                     label = ISO, # show ISO as label (you could do country instead)
                     color = continent),
                 alpha = 0.5, # make it a bit transparent
                 box.padding = 0.75, # control how far labels are from points
                 show.legend = FALSE) # don't want this to add to the legend
This should highlight these countries on your plot.
library(ggrepel)

# pick some countries
some_countries <- freedom %>%
  filter(year == 2018,
         country %in% c("United States",
                        "United Kingdom",
                        "Sweden",
                        "China",
                        "Singapore",
                        "Russian Federation",
                        "Korea, Rep.",
                        "Hong Kong SAR, China"))

# write your code here!
p+
  geom_label_repel(data = some_countries, # or whatever object name you created
                   aes(x = ef_index,
                       y = fh_score,
                       label = ISO, # show ISO as label (you could do country instead)
                       color = continent),
                   alpha = 0.75, # make it a bit transparent
                   box.padding = 0.75, # control how far labels are from points
                   show.legend = FALSE) # don't want this to add to the legend
Let’s just look only at the United States and see how it has fared in both measures of freedom over time. filter() the data to look only at the United States (its ISO is "USA"). Use both a geom_point() layer and a geom_path() layer, which will connect the dots over time. Let’s also see this by labeling the years with an additional layer geom_text_repel(aes(label = year)).
# write your code here!
# save plot as us
us <- freedom %>%
  filter(ISO == "USA") %>%
  ggplot()+
  aes(x = ef_index,
      y = fh_score)+
  geom_point(color = "red")+
  geom_path(color = "red")+
  geom_text_repel(aes(label = year),
                  color = "red")+
  scale_x_continuous(breaks = seq(8,8.5,0.05),
                     limits = c(8,8.5),
                     expand = c(0,0))+
  scale_y_continuous(breaks = seq(85,95,1),
                     limits = c(85,95),
                     expand = c(0,0))+
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       caption = "Sources: Fraser Institute, Freedom House",
       title = "U.S. Political & Economic Freedom, 2013-2018",
       fill = "Continent")+
  theme_minimal(base_family = "Fira Sans Condensed")+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = rel(1.5)))

# look at it
us
Note that the way I zoomed in on the scales, these look like pretty dramatic changes!
If we maintain the full perspective, the change appears minor. Be very careful how you present your analysis!
us+
  # force scales to show full range of 0-10 for x, 0-100 for y
  scale_x_continuous(breaks = seq(0,10,1),
                     limits = c(0,10),
                     expand = c(0,0))+
  scale_y_continuous(breaks = seq(0,100,10),
                     limits = c(0,100),
                     expand = c(0,0))
Scale for 'x' is already present. Adding another scale for 'x', which will
replace the existing scale.
Scale for 'y' is already present. Adding another scale for 'y', which will
replace the existing scale.
When you are done, click the Render button. Because the yaml header currently says format: html, this will produce an html webpage, which should automatically open for your review.
Notice in the Files pane in RStudio (by default, the lower right one), there should now be a document called 01-problem-set.html (or, if you changed the filename, another file ending in .html). This is the webpage, so you can find this file on your computer (or download it from Rstudio.cloud by clicking the checkbox in front of the file in the Files pane and then going to More -> Export... to download the file to your computer) and send this file.
If you want to make a PDF, install the package “tinytex” and run the following code to install a LaTeX distribution:
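A minimal sketch of that one-time setup, assuming the usual tinytex workflow (install the R package, then let it download the TinyTeX LaTeX distribution). The install lines are commented out so they do not re-run every time the document renders; `tinytex_available` is just a hypothetical helper name for the check.

```r
# One-time setup: uncomment and run once in the console
# install.packages("tinytex")   # the R package
# tinytex::install_tinytex()    # downloads and installs the TinyTeX LaTeX distribution

# Check whether the tinytex R package is installed on this machine
tinytex_available <- requireNamespace("tinytex", quietly = TRUE)
tinytex_available
```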
Then delete the lines in the yaml header that say format: html: self-contained: TRUE, and add a simple line that says format: pdf. Clicking Render will now produce a PDF, show it, and save it as a new file in the Files pane.
Either way, send me your output file, html or pdf (or, if you like, Word), so long as it shows the input and output code of every chunk. I have set it by default to do this, with echo: true in the yaml header.
Don’t forget to add your name to the author part of the header!