1.4 — Data Wrangling

(Answer Key)

Author

Ryan Safner

Published

September 6, 2022

Required Packages

First, install the following two packages with the command install.packages("tidyverse") and install.packages("gapminder") in the console below.1 Alternatively, you will probably already get a yellow banner at the top of this file indicating you need to install the packages, and can install them by clicking Install. Don’t install any package in an R chunk in this document, since it needs to be installed into R Studio.

Then, load the package by running (clicking the green play button) the chunk below:

library("tidyverse") # my friend and yours
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library("gapminder") # for dataset

gapminder <- gapminder # explicitly save data as a dataframe

Warm Up to dplyr with gapminder Again

Question 1

Let’s look at the data again by running the following chunk. glimpse() is a suped-up tidyverse version of str(). You can also start to see how to use the pipe operator %>%.

gapminder %>% 
  glimpse()
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Question 2

Now select() only the variables year, lifeExp, and country.

gapminder %>% 
  select(year, lifeExp, country)

Question 3

Now select() all variables except pop.

gapminder %>%
  select(-pop)

Question 4

rename() the variable continent to cont.

gapminder %>%
  rename(cont = continent)

Question 5

arrange() the data by year.

gapminder %>%
  arrange(year)

Question 6

Now arrange() by year, but in descending order.

gapminder %>%
  arrange(desc(year))

Question 7

Now arrange() by year, then by lifeExp

gapminder %>%
  arrange(year, lifeExp)

Question 8

Let’s try subsetting some rows. filter() observations with pop greater than 1 billion (9 zeros).

gapminder %>%
  filter(pop > 1000000000)

Question 9

Redo the same command from question 8, but of that subset of data, only look at India.

gapminder %>%
  filter(pop > 1000000000,
         country == "India")

Question 10

Let’s pipe a bunch of commands together. select() your data to look only at year, gdpPercap, and country in the year 1997, for countries that have a gdpPercap greater than 20,000, and arrange() them alphabetically.

gapminder %>%
  select(year, gdpPercap, country) %>%
  filter(year == 1997,
         gdpPercap > 20000) %>%
  arrange(country)

Question 11

Make a new variable with mutate() called GDP, which is equal to gdpPercap * pop.

gapminder %>%
  mutate(GDP = gdpPercap * pop)

Question 12

Make a new variable that is pop in millions.

gapminder %>%
  mutate(pop_mil = pop / 1000000)

Question 13

Use the summarize() command to get the overall average GDP per capita in the data.

gapminder %>%
  summarize(avg_gdppc = mean(gdpPercap))

Question 14

Use summarize() to get the number of observations, the mean, median, minimum, maximum, and standard deviation of GDP per capita.

gapminder %>%
  summarize(n(),
            mean_GDP = mean(gdpPercap),
            median_GDP = median(gdpPercap),
            min_GDP = min(gdpPercap),
            max_GDP = max(gdpPercap),
            sd_GDP = sd(gdpPercap))

Question 15

The code below gets the average GDP per capita by continent. Run it to see the results.

gapminder %>%
  group_by(continent) %>%
  summarize(avg_gdppc = mean(gdpPercap))

Now, modify it to show the average GDP per capita by year (i.e. over time).

gapminder %>%
  group_by(year) %>%
  summarize(avg_gdppc = mean(gdpPercap))

Question 16

Get the average GDP per capita by year and by continent with group_by() and summarize(). Then save this as an object. Next, make a line graph with ggplot() (automatically loaded with tidyverse!) using this object as the source for your data layer. Be sure to map color = continent in the aes layer! This should plot GDP per capita over time by continent.

gdp_grouped <- gapminder %>%
  group_by(year, continent) %>%
  summarize(Average_GDP = mean(gdpPercap))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
ggplot(data = gdp_grouped) +
  aes(x = year,
      y = Average_GDP,
      color = continent) +
  geom_line()

Question 17

Copy your code from question 16 and redo this all in one step: rather than saving your subsetted data as an object, pipe it directly into ggplot’s data layer!

gapminder %>%
  group_by(year, continent) %>%
  summarize(Average_GDP = mean(gdpPercap)) %>%
ggplot(data = .) +
  aes(x = year,
      y = Average_GDP,
      color = continent) +
  geom_line()
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

An Example Data: Majors

Now let’s step it up to work with some data “in the wild” to answer some research questions. This will have you combine your dplyr skills and add some new things such as importing with readr.

Let’s look at fivethirtyeight’s article ” The Economic Guide To Picking A College Major ”. fivethirtyeight is great about making the data behind their articles public, we can download all of their data here. Search for college majors and click download (the blue arrow button).2 We will look at the recent-grads.csv file.

The description in the readme file for the data is as follows:

Header Description
Rank Rank by median earnings
Major_code Major code, FO1DP in ACS PUMS
Major Major description
Major_category Category of major from Carnevale et al
Total Total number of people with major
Sample_size Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men Male graduates
Women Female graduates
ShareWomen Women as share of total
Employed Number employed (ESR == 1 or 2)
Full_time Employed 35 hours or more
Part_time Employed less than 35 hours
Full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed Number unemployed (ESR == 3)
Unemployment_rate Unemployed / (Unemployed + Employed)
Median Median earnings of full-time, year-round workers
P25th 25th percentile of earnigns
P75th 75th percentile of earnings
College_jobs Number with job requiring a college degree
Non_college_jobs Number with job not requiring a college degree
Low_wage_jobs Number in low-wage service jobs

Question 18

Import the data with read_csv() and assign it to an object (a tibble) called majors (or whatever you want to call it).

One way to avoid problems is to move this to the same file as R’s working directory, which again you can determine with getwd(). If you are doing this in R Studio Cloud, use the Upload button to add the file to the working directory (once you’ve downloaded it from the internet!). If you are doing this on your computer, move the downloaded file on your computer to wherever your R Project folder for this project is.

majors <- read_csv("../data/recent-grads.csv") # note my location on my computer is different than yours!
Rows: 173 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Major, Major_category
dbl (19): Rank, Major_code, Total, Men, Women, ShareWomen, Sample_size, Empl...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Once it’s loaded, get a look at the data with glimpse:

majors %>% # or whatever you named your tibble with the data
  glimpse()
Rows: 173
Columns: 21
$ Rank                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ Major_code           <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5001, 2…
$ Major                <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGI…
$ Total                <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91…
$ Men                  <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 832, 803…
$ Women                <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 10907, …
$ Major_category       <chr> "Engineering", "Engineering", "Engineering", "Eng…
$ ShareWomen           <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.341…
$ Sample_size          <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 399, 14…
$ Employed             <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 1526, 764…
$ Full_time            <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1085, 71…
$ Part_time            <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13101, 1…
$ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827, 5463…
$ Unemployed           <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3895, 2…
$ Unemployment_rate    <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05012531…
$ Median               <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000,…
$ P25th                <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 53000, …
$ P75th                <dbl> 125000, 90000, 105000, 80000, 75000, 102000, 7200…
$ College_jobs         <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972, 5284…
$ Non_college_jobs     <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16384, 1…
$ Low_wage_jobs        <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3170, 98…

Question 19

What are all the unique values of Major? How many are there?

# distinct values of major
majors %>%
  distinct(Major)
# count number of distinct values
majors %>%
  summarize(n_distinct(Major))

Question 20

Which major has the lowest unemployment rate?

majors %>%
  arrange(Unemployment_rate) %>%
  select(Major, Unemployment_rate)

Question 21

What are the top three majors that have the highest percentage of women?

majors %>%
  arrange(desc(ShareWomen)) %>%
  select(Major, ShareWomen) %>%
  slice(1:3) # head(n=3) also works

Question 22

Make a boxplot of Median wage by Major_Category.

ggplot(data = majors)+
  aes(x = Major_category,
      y = Median,
      fill = Major_category)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1), legend.position = "")

Question 23

Is there a systematic difference between STEM majors and non-STEM majors? First, define

stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

Next, make a variable stem, for whether or not a Major_category is "stem" or "not stem".

majors <- majors %>%
  mutate(stem = ifelse(Major_category %in% stem_categories,
                       yes = "stem",
                       no = "not_stem")) %>%
  select(stem, everything()) # just so we can see it up front to verify

# instead of ifelse(), you could use case_when:

majors %>%
  mutate(stem = case_when(Major_category %in% stem_categories ~ "stem",
                          TRUE ~ "not_stem")) # remaining case if not for stem 

Finally, summarize() Median for stem and not stem groups.

majors %>%
  group_by(stem) %>%
  summarize(mean(Median))

Footnotes

  1. Note that you can install multiple packages at the same time by combining the package names into a vector, like so: install.packages(c("tidyverse", "gapminder")).↩︎

  2. This will download a .zip file that contains many spreadsheets. Unzip it with a program that unzips files (such as WinZip, 7-zip, the Unarchiver, etc).↩︎