1.4 — Data Wrangling

(Answer Key)

Author

Ryan Safner

Published

September 6, 2022

Required Packages

First, install the following two packages with the command install.packages("tidyverse") and install.packages("gapminder") in the console below. Alternatively, you will probably already get a yellow banner at the top of this file indicating you need to install the packages, and can install them by clicking Install. Don’t install any package in an R chunk in this document, since it needs to be installed into R Studio.

Then, load the package by running (clicking the green play button) the chunk below:

library("tidyverse") # my friend and yours
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library("gapminder") # for dataset

gapminder <- gapminder # explicitly save data as a dataframe

Warm Up to dplyr with gapminder Again

Question 1

Let’s look at the data again by running the following chunk. glimpse() is a suped-up tidyverse version of str(). You can also start to see how to use the pipe operator %>%.

gapminder %>% 
  glimpse()
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Question 2

Now select() only the variables year, lifeExp, and country.

gapminder %>% 
  select(year, lifeExp, country)
ABCDEFGHIJ0123456789
year
<int>
lifeExp
<dbl>
country
<fct>
195228.80100Afghanistan
195730.33200Afghanistan
196231.99700Afghanistan
196734.02000Afghanistan
197236.08800Afghanistan
197738.43800Afghanistan
198239.85400Afghanistan
198740.82200Afghanistan
199241.67400Afghanistan
199741.76300Afghanistan

Question 3

Now select() all variables except pop.

gapminder %>%
  select(-pop)
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
gdpPercap
<dbl>
AfghanistanAsia195228.80100779.4453
AfghanistanAsia195730.33200820.8530
AfghanistanAsia196231.99700853.1007
AfghanistanAsia196734.02000836.1971
AfghanistanAsia197236.08800739.9811
AfghanistanAsia197738.43800786.1134
AfghanistanAsia198239.85400978.0114
AfghanistanAsia198740.82200852.3959
AfghanistanAsia199241.67400649.3414
AfghanistanAsia199741.76300635.3414

Question 4

rename() the variable continent to cont.

gapminder %>%
  rename(cont = continent)
ABCDEFGHIJ0123456789
country
<fct>
cont
<fct>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia195228.801008425333779.4453
AfghanistanAsia195730.332009240934820.8530
AfghanistanAsia196231.9970010267083853.1007
AfghanistanAsia196734.0200011537966836.1971
AfghanistanAsia197236.0880013079460739.9811
AfghanistanAsia197738.4380014880372786.1134
AfghanistanAsia198239.8540012881816978.0114
AfghanistanAsia198740.8220013867957852.3959
AfghanistanAsia199241.6740016317921649.3414
AfghanistanAsia199741.7630022227415635.3414

Question 5

arrange() the data by year.

gapminder %>%
  arrange(year)
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
AfghanistanAsia195228.80100
AlbaniaEurope195255.23000
AlgeriaAfrica195243.07700
AngolaAfrica195230.01500
ArgentinaAmericas195262.48500
AustraliaOceania195269.12000
AustriaEurope195266.80000
BahrainAsia195250.93900
BangladeshAsia195237.48400
BelgiumEurope195268.00000

Question 6

Now arrange() by year, but in descending order.

gapminder %>%
  arrange(desc(year))
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
AfghanistanAsia200743.82800
AlbaniaEurope200776.42300
AlgeriaAfrica200772.30100
AngolaAfrica200742.73100
ArgentinaAmericas200775.32000
AustraliaOceania200781.23500
AustriaEurope200779.82900
BahrainAsia200775.63500
BangladeshAsia200764.06200
BelgiumEurope200779.44100

Question 7

Now arrange() by year, then by lifeExp

gapminder %>%
  arrange(year, lifeExp)
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
AfghanistanAsia195228.80100
GambiaAfrica195230.00000
AngolaAfrica195230.01500
Sierra LeoneAfrica195230.33100
MozambiqueAfrica195231.28600
Burkina FasoAfrica195231.97500
Guinea-BissauAfrica195232.50000
Yemen, Rep.Asia195232.54800
SomaliaAfrica195232.97800
GuineaAfrica195233.60900

Question 8

Let’s try subsetting some rows. filter() observations with pop greater than 1 billion (9 zeros).

gapminder %>%
  filter(pop > 1000000000)
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
ChinaAsia198265.5251000281000962.4214
ChinaAsia198767.27410840350001378.9040
ChinaAsia199268.69011649700001655.7842
ChinaAsia199770.42612300750002289.2341
ChinaAsia200272.02812804000003119.2809
ChinaAsia200772.96113186830964959.1149
IndiaAsia200262.87910341725471746.7695
IndiaAsia200764.69811103963312452.2104

Question 9

Redo the same command from question 8, but of that subset of data, only look at India.

gapminder %>%
  filter(pop > 1000000000,
         country == "India")
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
IndiaAsia200262.87910341725471746.769
IndiaAsia200764.69811103963312452.210

Question 10

Let’s pipe a bunch of commands together. select() your data to look only at year, gdpPercap, and country in the year 1997, for countries that have a gdpPercap greater than 20,000, and arrange() them alphabetically.

gapminder %>%
  select(year, gdpPercap, country) %>%
  filter(year == 1997,
         gdpPercap > 20000) %>%
  arrange(country)
ABCDEFGHIJ0123456789
year
<int>
gdpPercap
<dbl>
country
<fct>
199726997.94Australia
199729095.92Austria
199720292.02Bahrain
199727561.20Belgium
199728954.93Canada
199729804.35Denmark
199723723.95Finland
199725889.78France
199727788.88Germany
199728377.63Hong Kong, China

Question 11

Make a new variable with mutate() called GDP, which is equal to gdpPercap * pop.

gapminder %>%
  mutate(GDP = gdpPercap * pop)
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia195228.801008425333779.4453
AfghanistanAsia195730.332009240934820.8530
AfghanistanAsia196231.9970010267083853.1007
AfghanistanAsia196734.0200011537966836.1971
AfghanistanAsia197236.0880013079460739.9811
AfghanistanAsia197738.4380014880372786.1134
AfghanistanAsia198239.8540012881816978.0114
AfghanistanAsia198740.8220013867957852.3959
AfghanistanAsia199241.6740016317921649.3414
AfghanistanAsia199741.7630022227415635.3414

Question 12

Make a new variable that is pop in millions.

gapminder %>%
  mutate(pop_mil = pop / 1000000)
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia195228.801008425333779.4453
AfghanistanAsia195730.332009240934820.8530
AfghanistanAsia196231.9970010267083853.1007
AfghanistanAsia196734.0200011537966836.1971
AfghanistanAsia197236.0880013079460739.9811
AfghanistanAsia197738.4380014880372786.1134
AfghanistanAsia198239.8540012881816978.0114
AfghanistanAsia198740.8220013867957852.3959
AfghanistanAsia199241.6740016317921649.3414
AfghanistanAsia199741.7630022227415635.3414

Question 13

Use the summarize() command to get the overall average GDP per capita in the data.

gapminder %>%
  summarize(avg_gdppc = mean(gdpPercap))
ABCDEFGHIJ0123456789
avg_gdppc
<dbl>
7215.327

Question 14

Use summarize() to get the number of observations, the mean, median, minimum, maximum, and standard deviation of GDP per capita.

gapminder %>%
  summarize(n(),
            mean_GDP = mean(gdpPercap),
            median_GDP = median(gdpPercap),
            min_GDP = min(gdpPercap),
            max_GDP = max(gdpPercap),
            sd_GDP = sd(gdpPercap))
ABCDEFGHIJ0123456789
n()
<int>
mean_GDP
<dbl>
median_GDP
<dbl>
min_GDP
<dbl>
max_GDP
<dbl>
sd_GDP
<dbl>
17047215.3273531.847241.1659113523.19857.455

Question 15

The code below gets the average GDP per capita by continent. Run it to see the results.

gapminder %>%
  group_by(continent) %>%
  summarize(avg_gdppc = mean(gdpPercap))
ABCDEFGHIJ0123456789
continent
<fct>
avg_gdppc
<dbl>
Africa2193.755
Americas7136.110
Asia7902.150
Europe14469.476
Oceania18621.609

Now, modify it to show the average GDP per capita by year (i.e. over time).

gapminder %>%
  group_by(year) %>%
  summarize(avg_gdppc = mean(gdpPercap))
ABCDEFGHIJ0123456789
year
<int>
avg_gdppc
<dbl>
19523725.276
19574299.408
19624725.812
19675483.653
19726770.083
19777313.166
19827518.902
19877900.920
19928158.609
19979090.175

Question 16

Get the average GDP per capita by year and by continent with group_by() and summarize(). Then save this as an object. Next, make a line graph with ggplot() (automatically loaded with tidyverse!) using this object as the source for your data layer. Be sure to map color = continent in the aes layer! This should plot GDP per capita over time by continent.

gdp_grouped <- gapminder %>%
  group_by(year, continent) %>%
  summarize(Average_GDP = mean(gdpPercap))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
ggplot(data = gdp_grouped) +
  aes(x = year,
      y = Average_GDP,
      color = continent) +
  geom_line()

Question 17

Copy your code from question 16 and redo this all in one step: rather than saving your subsetted data as an object, pipe it directly into ggplot’s data layer!

gapminder %>%
  group_by(year, continent) %>%
  summarize(Average_GDP = mean(gdpPercap)) %>%
ggplot(data = .) +
  aes(x = year,
      y = Average_GDP,
      color = continent) +
  geom_line()
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

An Example Data: Majors

Now let’s step it up to work with some data “in the wild” to answer some research questions. This will have you combine your dplyr skills and add some new things such as importing with readr.

Let’s look at fivethirtyeight’s article ” The Economic Guide To Picking A College Major ”. fivethirtyeight is great about making the data behind their articles public, we can download all of their data here. Search for college majors and click download (the blue arrow button). We will look at the recent-grads.csv file.

The description in the readme file for the data is as follows:

Header Description
Rank Rank by median earnings
Major_code Major code, FO1DP in ACS PUMS
Major Major description
Major_category Category of major from Carnevale et al
Total Total number of people with major
Sample_size Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men Male graduates
Women Female graduates
ShareWomen Women as share of total
Employed Number employed (ESR == 1 or 2)
Full_time Employed 35 hours or more
Part_time Employed less than 35 hours
Full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed Number unemployed (ESR == 3)
Unemployment_rate Unemployed / (Unemployed + Employed)
Median Median earnings of full-time, year-round workers
P25th 25th percentile of earnigns
P75th 75th percentile of earnings
College_jobs Number with job requiring a college degree
Non_college_jobs Number with job not requiring a college degree
Low_wage_jobs Number in low-wage service jobs

Question 18

Import the data with read_csv() and assign it to an object (a tibble) called majors (or whatever you want to call it).

One way to avoid problems is to move this to the same file as R’s working directory, which again you can determine with getwd(). If you are doing this in R Studio Cloud, use the Upload button to add the file to the working directory (once you’ve downloaded it from the internet!). If you are doing this on your computer, move the downloaded file on your computer to wherever your R Project folder for this project is.

majors <- read_csv("../data/recent-grads.csv") # note my location on my computer is different than yours!
Rows: 173 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Major, Major_category
dbl (19): Rank, Major_code, Total, Men, Women, ShareWomen, Sample_size, Empl...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Once it’s loaded, get a look at the data with glimpse:

majors %>% # or whatever you named your tibble with the data
  glimpse()
Rows: 173
Columns: 21
$ Rank                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ Major_code           <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5001, 2…
$ Major                <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGI…
$ Total                <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91…
$ Men                  <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 832, 803…
$ Women                <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 10907, …
$ Major_category       <chr> "Engineering", "Engineering", "Engineering", "Eng…
$ ShareWomen           <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.341…
$ Sample_size          <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 399, 14…
$ Employed             <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 1526, 764…
$ Full_time            <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1085, 71…
$ Part_time            <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13101, 1…
$ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827, 5463…
$ Unemployed           <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3895, 2…
$ Unemployment_rate    <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05012531…
$ Median               <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000,…
$ P25th                <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 53000, …
$ P75th                <dbl> 125000, 90000, 105000, 80000, 75000, 102000, 7200…
$ College_jobs         <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972, 5284…
$ Non_college_jobs     <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16384, 1…
$ Low_wage_jobs        <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3170, 98…

Question 19

What are all the unique values of Major? How many are there?

# distinct values of major
majors %>%
  distinct(Major)
ABCDEFGHIJ0123456789
Major
<chr>
PETROLEUM ENGINEERING
MINING AND MINERAL ENGINEERING
METALLURGICAL ENGINEERING
NAVAL ARCHITECTURE AND MARINE ENGINEERING
CHEMICAL ENGINEERING
NUCLEAR ENGINEERING
ACTUARIAL SCIENCE
ASTRONOMY AND ASTROPHYSICS
MECHANICAL ENGINEERING
ELECTRICAL ENGINEERING
# count number of distinct values
majors %>%
  summarize(n_distinct(Major))
ABCDEFGHIJ0123456789
n_distinct(Major)
<int>
173

Question 20

Which major has the lowest unemployment rate?

majors %>%
  arrange(Unemployment_rate) %>%
  select(Major, Unemployment_rate)
ABCDEFGHIJ0123456789
Major
<chr>
MATHEMATICS AND COMPUTER SCIENCE
MILITARY TECHNOLOGIES
BOTANY
SOIL SCIENCE
EDUCATIONAL ADMINISTRATION AND SUPERVISION
ENGINEERING MECHANICS PHYSICS AND SCIENCE
COURT REPORTING
MATHEMATICS TEACHER EDUCATION
PETROLEUM ENGINEERING
GENERAL AGRICULTURE

Question 21

What are the top three majors that have the highest percentage of women?

majors %>%
  arrange(desc(ShareWomen)) %>%
  select(Major, ShareWomen) %>%
  slice(1:3) # head(n=3) also works
ABCDEFGHIJ0123456789
Major
<chr>
ShareWomen
<dbl>
EARLY CHILDHOOD EDUCATION0.9689537
COMMUNICATION DISORDERS SCIENCES AND SERVICES0.9679981
MEDICAL ASSISTING SERVICES0.9278072

Question 22

Make a boxplot of Median wage by Major_Category.

ggplot(data = majors)+
  aes(x = Major_category,
      y = Median,
      fill = Major_category)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1), legend.position = "")

Question 23

Is there a systematic difference between STEM majors and non-STEM majors? First, define

stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

Next, make a variable stem, for whether or not a Major_category is "stem" or "not stem".

majors <- majors %>%
  mutate(stem = ifelse(Major_category %in% stem_categories,
                       yes = "stem",
                       no = "not_stem")) %>%
  select(stem, everything()) # just so we can see it up front to verify

# instead of ifelse(), you could use case_when:

majors %>%
  mutate(stem = case_when(Major_category %in% stem_categories ~ "stem",
                          TRUE ~ "not_stem")) # remaining case if not for stem 
ABCDEFGHIJ0123456789
stem
<chr>
Rank
<dbl>
Major_code
<dbl>
stem12419
stem22416
stem32415
stem42417
stem52405
stem62418
not_stem76202
stem85001
stem92414
stem102408

Finally, summarize() Median for stem and not stem groups.

majors %>%
  group_by(stem) %>%
  summarize(mean(Median))
ABCDEFGHIJ0123456789
stem
<chr>
mean(Median)
<dbl>
not_stem35624.77
stem47860.94

Footnotes

  1. Note that you can install multiple packages at the same time by combining the package names into a vector, like so: install.packages(c("tidyverse", "gapminder")).↩︎

  2. This will download a .zip file that contains many spreadsheets. Unzip it with a program that unzips files (such as WinZip, 7-zip, the Unarchiver, etc).↩︎