1.4 — Data Wrangling
ECON 480 • Econometrics • Fall 2022
Dr. Ryan Safner
Associate Professor of Economics
safner@hood.edu
ryansafner/metricsF22
metricsF22.classes.ryansafner.com
Tidying (Pivoting/Reshaping) Data
NA
”) 😡Ideally, you’d want to be able to do all of this in one program
“Yet far too much handcrafted work - what data scientists call”data wrangling,” “data munging,” and “data janitor work” - is still required. Data scientists, according to interviews and expert estimates, spend from 50 to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
Source: New York Times
tidyverse
I“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
tidyverse
IItidyverse
IIItidyverse
IVtidyverse
contains a lot of packages [1] "broom" "cli" "crayon" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "googledrive"
[9] "googlesheets4" "ggplot2" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "readr"
[21] "readxl" "reprex" "rlang" "rstudioapi"
[25] "rvest" "stringr" "tibble" "tidyr"
[29] "xml2" "tidyverse"
library(tidyverse)
:
ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
tidyverse
:A tibble
(or tbl_df
) is a friendlier data.frame
Fundamental grammar of tidyverse:
Loading tidyverse
automatically converts all data.frames
to tibbles
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
tibble
from a data.frame
with as_tibble()
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
tibble
from scratch with tibble()
, works like data.frame()
tibble
row by row with tribble()
The magrittr
package allows use of the “pipe” operator (%>%
)1
%>%
“pipes” the output of the left of the pipe into the (1st) argument of the right
Running a function f
on object x
as f(x)
becomes x %>% f
in pipeable form
x
and then run function f
on it”take x
and perform function f()
on x
and then perform function g()
on that result
Without pipes:
Load common spreadsheet files (.csv
, .tsv
) with simple commands:
read_*(path/to/my_data.*)
*
can be .csv
or .tsv
Can also export your data from R into a common spreadsheet file with:
write_*(my_df, path = path/to/file_name.*)
my_df
is the name of your tibble
, and file_name
is the name of the file you want to save asOften this is enough, but much more customization possible
Read more on the tidyverse website and the Readr Cheatsheet
readxl
has equivalent commands for Excel data types:
read_*("path/to/my/data.*")
write_*(my_dataframe, path=path/to/file_name.*)
*
can be .xls
or .xlsx
haven
has equivalent commands for other data types:
read_*("path/to/my_data.dta")
for STATA .dta
fileswrite_*(my_dataframe, path=path/to/file_name.*)
*
can be .dta
(STATA), .sav
(SPSS), .sas7bdat
(SAS)“where the hell is my data file”??
Recall R
looks for files to read_*()
in the default working directory1
You can tell R
where this data is by making the path
a part of the file’s name when importing
..
to “move up one folder”/
to “enter a folder”R
where this data is by making the path
a part of the file’s name when importing
..
to “move up one folder”/
to “enter a folder”Suggestion to make your data import easier: Download and move files to R’s working directory
Your computer and working directory are different from mine (and others)
This is not a reproducible workflow!
We’ll finally fix this next class with R Projects
tidyverse
: all associated packages and functions require tidy data
“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” - Hadley Wickham
tidyr
package helps reshape data into more usable format
Most common use: reshaping data between “long” and “wide”
Source: Garrick Aden-Buie’s tidyexplain
pivot_longer()
Iyear
variable! (e.g. 2000
, 2010
)pivot_longer()
IIpivot_longer()
these columns into a new pair of variables to make a longer dataframe
year
), not variables themselves! (2000
and 2010
)names_to
: name of variable to create whose values form the column names (the “names” 2000
and 2010
are values of year
)values_to
: name of the variable to create whose values are spread over the cells (we’ll call it number of cases
for each country in each year)pivot_longer()
IIIpivot_longer()
a wide data frame into a long data frameex_wide %>%
pivot_longer(c("2000","2010"), # select columns
names_to = "year", # variable for column names
values_to = "cases") # values
# A tibble: 6 × 3
Country year cases
<chr> <chr> <dbl>
1 United States 2000 140
2 United States 2010 180
3 Canada 2000 102
4 Canada 2010 98
5 China 2000 111
6 China 2010 123
pivot_wider()
I# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 United States 2000 cases 140
2 United States 2000 population 300
3 United States 2010 cases 180
4 United States 2010 population 310
5 Canada 2000 cases 102
6 Canada 2000 population 110
7 Canada 2010 cases 98
8 Canada 2010 population 121
9 China 2000 cases 111
10 China 2000 population 1201
11 China 2010 cases 123
12 China 2010 population 1241
Cases
and one for Population
(categorized by type
of variable)pivot_wider()
II# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 United States 2000 cases 140
2 United States 2000 population 300
3 United States 2010 cases 180
4 United States 2010 population 310
5 Canada 2000 cases 102
6 Canada 2000 population 110
7 Canada 2010 cases 98
8 Canada 2010 population 121
9 China 2000 cases 111
10 China 2000 population 1201
11 China 2010 cases 123
12 China 2010 population 1241
pivot_wider()
these columns into a new pair of variables
names_from
: column that contains variable names (here, the type
)values_from
: column that contains values from multiple variables (here, the count
)pivot_wider()
IIIpivot_wider()
a long data frame into a wide data frame# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 United States 2000 cases 140
2 United States 2000 population 300
3 United States 2010 cases 180
4 United States 2010 population 310
5 Canada 2000 cases 102
6 Canada 2000 population 110
7 Canada 2010 cases 98
8 Canada 2010 population 121
9 China 2000 cases 111
10 China 2000 population 1201
11 China 2010 cases 123
12 China 2010 population 1241
ex_long %>%
pivot_wider(names_from = "type", # column with names of vars
values_from = "count") # column with values of vars
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 United States 2000 140 300
2 United States 2010 180 310
3 Canada 2000 102 110
4 Canada 2010 98 121
5 China 2000 111 1201
6 China 2010 123 1241
dplyr
Idplyr
uses more efficient & intuitive commands to manipulate tibblesBase R
grammar passively runs functions on nouns: function(object)
dplyr
grammar actively uses verbs: verb(df, conditions)
1dplyr
II%>%
pipe operatortibble
dbplyr
), Apache Spark (sparklyr
)dplyr
Verbsdplyr
verbsVerb | Does |
---|---|
filter() |
Keep only selected observations |
select() |
Keep only selected variables |
arrange() |
Reorder rows (e.g. in numerical order) |
mutate() |
Create new variables |
summarize() |
Collapse data into summary statistics |
group_by() |
Perform any of the above functions by groups/categories |
arrange()
: Reorder observationsarrange()
arrange
reorders observations (rows) in a logical order
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Sao Tome and Principe Africa 1952 46.5 60011 880.
2 Sao Tome and Principe Africa 1957 48.9 61325 861.
3 Djibouti Africa 1952 34.8 63149 2670.
4 Sao Tome and Principe Africa 1962 51.9 65345 1072.
5 Sao Tome and Principe Africa 1967 54.4 70787 1385.
6 Djibouti Africa 1957 37.3 71851 2865.
7 Sao Tome and Principe Africa 1972 56.5 76595 1533.
8 Sao Tome and Principe Africa 1977 58.6 86796 1738.
9 Djibouti Africa 1962 39.7 89898 3021.
10 Sao Tome and Principe Africa 1982 60.4 98593 1890.
# … with 1,694 more rows
arrange()
: Ties# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Sao Tome and Principe Africa 1952 46.5 60011 880.
2 Djibouti Africa 1952 34.8 63149 2670.
3 Bahrain Asia 1952 50.9 120447 9867.
4 Iceland Europe 1952 72.5 147962 7268.
5 Comoros Africa 1952 40.7 153936 1103.
6 Kuwait Asia 1952 55.6 160000 108382.
7 Equatorial Guinea Africa 1952 34.5 216964 376.
8 Reunion Africa 1952 52.7 257700 2719.
9 Gambia Africa 1952 30 284320 485.
10 Swaziland Africa 1952 41.4 290243 1148.
# … with 1,694 more rows
arrange()
: Descending Orderdesc()
around a variable re-order in the opposite directionselect()
Variablesselect()
select
keeps only selected variables (columns)
# A tibble: 1,704 × 3
country year pop
<fct> <int> <int>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# … with 1,694 more rows
select()
exceptselect
“all except” by negating a variable with -
# A tibble: 1,704 × 5
country continent year lifeExp pop
<fct> <fct> <int> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
4 Afghanistan Asia 1967 34.0 11537966
5 Afghanistan Asia 1972 36.1 13079460
6 Afghanistan Asia 1977 38.4 14880372
7 Afghanistan Asia 1982 39.9 12881816
8 Afghanistan Asia 1987 40.8 13867957
9 Afghanistan Asia 1992 41.7 16317921
10 Afghanistan Asia 1997 41.8 22227415
# … with 1,694 more rows
select()
: Reordering columnsselect
reorders the columns in the order you provide
everything()
at the end# A tibble: 1,704 × 6
pop country continent year lifeExp gdpPercap
<int> <fct> <fct> <int> <dbl> <dbl>
1 8425333 Afghanistan Asia 1952 28.8 779.
2 9240934 Afghanistan Asia 1957 30.3 821.
3 10267083 Afghanistan Asia 1962 32.0 853.
4 11537966 Afghanistan Asia 1967 34.0 836.
5 13079460 Afghanistan Asia 1972 36.1 740.
6 14880372 Afghanistan Asia 1977 38.4 786.
7 12881816 Afghanistan Asia 1982 39.9 978.
8 13867957 Afghanistan Asia 1987 40.8 852.
9 16317921 Afghanistan Asia 1992 41.7 649.
10 22227415 Afghanistan Asia 1997 41.8 635.
# … with 1,694 more rows
select()
Helper Functionsselect
has a lot of helper functions, useful for when you have hundreds of variables
?select()
for a list# A tibble: 1,704 × 2
country continent
<fct> <fct>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# … with 1,694 more rows
select()
Helper Functionsselect
has a lot of helper functions, useful for when you have hundreds of variables
?select()
for a list# A tibble: 1,704 × 2
country gdpPercap
<fct> <dbl>
1 Afghanistan 779.
2 Afghanistan 821.
3 Afghanistan 853.
4 Afghanistan 836.
5 Afghanistan 740.
6 Afghanistan 786.
7 Afghanistan 978.
8 Afghanistan 852.
9 Afghanistan 649.
10 Afghanistan 635.
# … with 1,694 more rows
rename()
Variablesrename
changes the name of a variable (column)
new_name = old_name
# rename gdpPercap to GDP and lifeExp to population
gapminder %>%
rename(GDP = gdpPercap,
LE = lifeExp)
# A tibble: 1,704 × 6
country continent year LE pop GDP
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
filter()
Select Rows by Conditionfilter()
filter
keeps only selected observations (rows)# A tibble: 624 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1952 43.1 9279525 2449.
2 Algeria Africa 1957 45.7 10270856 3014.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Algeria Africa 1967 51.4 12760499 3247.
5 Algeria Africa 1972 54.5 14760787 4183.
6 Algeria Africa 1977 58.0 17152804 4910.
7 Algeria Africa 1982 61.4 20033753 5745.
8 Algeria Africa 1987 65.8 23254956 5681.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# … with 614 more rows
TRUE
1> |
greater than | < |
less than |
>= |
greater than or equal to | <= |
less than or equal to |
== 2 |
is equal to | != |
is not equal to |
& |
and | \(\vert\) | or |
%in% |
is member of | %notin% |
is not a member of |
filter()
with Conditionals I,
# look only at African observations in 1997
gapminder %>%
filter(continent == "Africa",
year == 1997)
# A tibble: 52 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1997 69.2 29072015 4797.
2 Angola Africa 1997 41.0 9875024 2277.
3 Benin Africa 1997 54.8 6066080 1233.
4 Botswana Africa 1997 52.6 1536536 8647.
5 Burkina Faso Africa 1997 50.3 10352843 946.
6 Burundi Africa 1997 45.3 6121610 463.
7 Cameroon Africa 1997 52.2 14195809 1694.
8 Central African Republic Africa 1997 46.1 3696513 741.
9 Chad Africa 1997 51.6 7562011 1005.
10 Comoros Africa 1997 60.7 527982 1174.
# … with 42 more rows
filter()
with Conditionals II# look only at African observations OR observations in 1997
gapminder %>%
filter(continent == "Africa" |
year == 1997)
# A tibble: 714 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1997 41.8 22227415 635.
2 Albania Europe 1997 73.0 3428038 3193.
3 Algeria Africa 1952 43.1 9279525 2449.
4 Algeria Africa 1957 45.7 10270856 3014.
5 Algeria Africa 1962 48.3 11000948 2551.
6 Algeria Africa 1967 51.4 12760499 3247.
7 Algeria Africa 1972 54.5 14760787 4183.
8 Algeria Africa 1977 58.0 17152804 4910.
9 Algeria Africa 1982 61.4 20033753 5745.
10 Algeria Africa 1987 65.8 23254956 5681.
# … with 704 more rows
filter()
with Conditionals III# look only at U.S. and U.K. observations in 2002
gapminder %>%
filter(country %in%
c("United States",
"United Kingdom"),
year == 2002)
# A tibble: 2 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 United Kingdom Europe 2002 78.5 59912431 29479.
2 United States Americas 2002 77.3 287675526 39097.
mutate()
: Create New Variablesmutate()
mutate
creates a new variable (column)
new_variable_name = operation
mutate()
: Setting a Specific Value# create variable called "europe" if country is in Europe
mutate(gapminder,
europe = case_when(continent == "Europe" ~ "In Europe",
continent != "Europe" ~ "Not in Europe"))
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap europe
<fct> <fct> <int> <dbl> <int> <dbl> <chr>
1 Afghanistan Asia 1952 28.8 8425333 779. Not in Europe
2 Afghanistan Asia 1957 30.3 9240934 821. Not in Europe
3 Afghanistan Asia 1962 32.0 10267083 853. Not in Europe
4 Afghanistan Asia 1967 34.0 11537966 836. Not in Europe
5 Afghanistan Asia 1972 36.1 13079460 740. Not in Europe
6 Afghanistan Asia 1977 38.4 14880372 786. Not in Europe
7 Afghanistan Asia 1982 39.9 12881816 978. Not in Europe
8 Afghanistan Asia 1987 40.8 13867957 852. Not in Europe
9 Afghanistan Asia 1992 41.7 16317921 649. Not in Europe
10 Afghanistan Asia 1997 41.8 22227415 635. Not in Europe
# … with 1,694 more rows
mutate()
: Changing a Variable’s Scale# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap pop_mil
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 8.43
2 Afghanistan Asia 1957 30.3 9240934 821. 9.24
3 Afghanistan Asia 1962 32.0 10267083 853. 10.3
4 Afghanistan Asia 1967 34.0 11537966 836. 11.5
5 Afghanistan Asia 1972 36.1 13079460 740. 13.1
6 Afghanistan Asia 1977 38.4 14880372 786. 14.9
7 Afghanistan Asia 1982 39.9 12881816 978. 12.9
8 Afghanistan Asia 1987 40.8 13867957 852. 13.9
9 Afghanistan Asia 1992 41.7 16317921 649. 16.3
10 Afghanistan Asia 1997 41.8 22227415 635. 22.2
# … with 1,694 more rows
mutate()
: Variable Based on Other Variables# create GDP variable from gdpPercap and pop, in billions
gapminder %>%
mutate(GDP = ((gdpPercap * pop) / 1000000000))
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap GDP
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6.57
2 Afghanistan Asia 1957 30.3 9240934 821. 7.59
3 Afghanistan Asia 1962 32.0 10267083 853. 8.76
4 Afghanistan Asia 1967 34.0 11537966 836. 9.65
5 Afghanistan Asia 1972 36.1 13079460 740. 9.68
6 Afghanistan Asia 1977 38.4 14880372 786. 11.7
7 Afghanistan Asia 1982 39.9 12881816 978. 12.6
8 Afghanistan Asia 1987 40.8 13867957 852. 11.8
9 Afghanistan Asia 1992 41.7 16317921 649. 10.6
10 Afghanistan Asia 1997 41.8 22227415 635. 14.1
# … with 1,694 more rows
mutate()
: Change Class of Variableclass
of a variable inside mutate()
with as.*()
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <fct> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
mutate()
: Create Multiple Variables# A tibble: 1,704 × 8
country continent year lifeExp pop gdpPercap GDP pop_mil…¹
<fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330. 8.43
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670. 9.24
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797. 10.3
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150. 11.5
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274. 13.1
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231. 14.9
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401. 12.9
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309. 13.9
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589. 16.3
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875. 22.2
# … with 1,694 more rows, and abbreviated variable name ¹pop_millions
transmute()
: Keep Only New Variablestransmute
keeps only newly created variables (it select()
s only the new mutate()
d variables)# A tibble: 1,704 × 2
GDP pop_millions
<dbl> <dbl>
1 6567086330. 8.43
2 7585448670. 9.24
3 8758855797. 10.3
4 9648014150. 11.5
5 9678553274. 13.1
6 11697659231. 14.9
7 12598563401. 12.9
8 11820990309. 13.9
9 10595901589. 16.3
10 14121995875. 22.2
# … with 1,694 more rows
mutate()
: Conditionalsmutate()
:gapminder %>%
select(country, year, lifeExp) %>%
mutate(long_life_1 = lifeExp > 70,
long_life_2 = case_when(lifeExp > 70 ~ "Long",
lifeExp <= 70 ~ "Short"))
# A tibble: 1,704 × 5
country year lifeExp long_life_1 long_life_2
<fct> <int> <dbl> <lgl> <chr>
1 Afghanistan 1952 28.8 FALSE Short
2 Afghanistan 1957 30.3 FALSE Short
3 Afghanistan 1962 32.0 FALSE Short
4 Afghanistan 1967 34.0 FALSE Short
5 Afghanistan 1972 36.1 FALSE Short
6 Afghanistan 1977 38.4 FALSE Short
7 Afghanistan 1982 39.9 FALSE Short
8 Afghanistan 1987 40.8 FALSE Short
9 Afghanistan 1992 41.7 FALSE Short
10 Afghanistan 1997 41.8 FALSE Short
# … with 1,694 more rows
mutate()
is Order Awaremutate()
is order-aware, so you can chain multiple mutates that depend on previous mutatesgapminder %>%
select(country, year, lifeExp) %>%
mutate(dog_years = lifeExp * 7,
comment = paste("Life expectancy in", country, "is", dog_years, "in dog years.", sep = " "))
# A tibble: 1,704 × 5
country year lifeExp dog_years comment
<fct> <int> <dbl> <dbl> <chr>
1 Afghanistan 1952 28.8 202. Life expectancy in Afghanistan is 201.60…
2 Afghanistan 1957 30.3 212. Life expectancy in Afghanistan is 212.32…
3 Afghanistan 1962 32.0 224. Life expectancy in Afghanistan is 223.97…
4 Afghanistan 1967 34.0 238. Life expectancy in Afghanistan is 238.14…
5 Afghanistan 1972 36.1 253. Life expectancy in Afghanistan is 252.61…
6 Afghanistan 1977 38.4 269. Life expectancy in Afghanistan is 269.06…
7 Afghanistan 1982 39.9 279. Life expectancy in Afghanistan is 278.97…
8 Afghanistan 1987 40.8 286. Life expectancy in Afghanistan is 285.75…
9 Afghanistan 1992 41.7 292. Life expectancy in Afghanistan is 291.71…
10 Afghanistan 1997 41.8 292. Life expectancy in Afghanistan is 292.34…
# … with 1,694 more rows
mutate()
: Scoped-functions Imutate
that work on a subset of variables:
mutate_all()
affects every variablemutate_at()
affects named or selected variablesmutate_if()
affects variables that meet a criteria# round all observations of numeric variables to 2 digits
gapminder %>%
mutate_if(is.numeric, round, digits = 2)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.8 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
mutate()
: Scoped-functions IImutate
that work on a subset of variables:
mutate_all()
affects every variablemutate_at()
affects named or selected variablesmutate_if()
affects variables that meet a criteria# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <int> <dbl> <int> <dbl>
1 AFGHANISTAN ASIA 1952 28.8 8425333 779.
2 AFGHANISTAN ASIA 1957 30.3 9240934 821.
3 AFGHANISTAN ASIA 1962 32.0 10267083 853.
4 AFGHANISTAN ASIA 1967 34.0 11537966 836.
5 AFGHANISTAN ASIA 1972 36.1 13079460 740.
6 AFGHANISTAN ASIA 1977 38.4 14880372 786.
7 AFGHANISTAN ASIA 1982 39.9 12881816 978.
8 AFGHANISTAN ASIA 1987 40.8 13867957 852.
9 AFGHANISTAN ASIA 1992 41.7 16317921 649.
10 AFGHANISTAN ASIA 1997 41.8 22227415 635.
# … with 1,694 more rows
dplyr
functions never modify their inputs (i.e. never overwrite the original tibble
)<-
to assign it to a new tibble
tibble
by name# A tibble: 624 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1952 43.1 9279525 2449.
2 Algeria Africa 1957 45.7 10270856 3014.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Algeria Africa 1967 51.4 12760499 3247.
5 Algeria Africa 1972 54.5 14760787 4183.
6 Algeria Africa 1977 58.0 17152804 4910.
7 Algeria Africa 1982 61.4 20033753 5745.
8 Algeria Africa 1987 65.8 23254956 5681.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# … with 614 more rows
# A tibble: 624 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1952 43.1 9279525 2449.
2 Algeria Africa 1957 45.7 10270856 3014.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Algeria Africa 1967 51.4 12760499 3247.
5 Algeria Africa 1972 54.5 14760787 4183.
6 Algeria Africa 1977 58.0 17152804 4910.
7 Algeria Africa 1982 61.4 20033753 5745.
8 Algeria Africa 1987 65.8 23254956 5681.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# … with 614 more rows
# Save and view at same time by wrapping whole command with ()
(africa <- gapminder %>%
filter(continent == "Africa"))
# A tibble: 624 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Algeria Africa 1952 43.1 9279525 2449.
2 Algeria Africa 1957 45.7 10270856 3014.
3 Algeria Africa 1962 48.3 11000948 2551.
4 Algeria Africa 1967 51.4 12760499 3247.
5 Algeria Africa 1972 54.5 14760787 4183.
6 Algeria Africa 1977 58.0 17152804 4910.
7 Algeria Africa 1982 61.4 20033753 5745.
8 Algeria Africa 1987 65.8 23254956 5681.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# … with 614 more rows
summarize()
: Create Statisticssummarize()
summarize
1 outputs a tibble of desired summary statistics
mutate()
-ing a new variablesummarize()
: Useful commandssummarize()
commands:Command | Does |
---|---|
n() |
Number of observations |
n_distinct() |
Number of unique observations |
sum() |
Sum all observations of a variable |
mean() |
Average of all observations of a variable |
median() |
50th percentile of all observations of a variable |
sd() |
Standard deviation of all observations of a variable |
Most commands require you to put a variable name inside the command’s argument parentheses. n()
and n_distinct()
require empty parentheses!
summarize()
: Useful commands IIsummarize()
commands:Command | Does |
---|---|
min() |
Minimum value of a variable |
max() |
Maximum value of a variable |
quantile(., 0.25) |
Specified percentile (e.g. 25 th percentile) of a variable |
first() |
First value of a variable |
last() |
Last value of a variable |
nth(., 2) |
Specified position of a variable (example 2 nd) |
The .
in quantile()
and nth()
are where you would put your variable name.
summarize()
counts# summarize with n() gives size of current group, has no arguments
gapminder %>%
summarize(amount = n()) # I've called it "amount"
# A tibble: 1 × 1
amount
<int>
1 1704
# count() is a dedicated command, counts observations by specified variable
gapminder %>%
count(year) # counts how many observations per year
# A tibble: 12 × 2
year n
<int> <int>
1 1952 142
2 1957 142
3 1962 142
4 1967 142
5 1972 142
6 1977 142
7 1982 142
8 1987 142
9 1992 142
10 1997 142
11 2002 142
12 2007 142
summarize()
ConditionallyTRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportionsummarize()
Multiple Variablessummarize()
multiple variables at once, separate by commassummarize()
Multiple Statisticssummarize()
multiple statistics of a variable at once, separate by commassummarize()
Scoped Versionssummarize()
that work on a subset of variables
summarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteriagroup_by()
: Grouped summariesgroup_by() + summarize()
Ifactor
variables grouping a variable into categories, we can run dplyr
verbs by group
summarize()
group_by()
# get average life expectancy and gdp by continent
gapminder %>%
group_by(continent) %>%
summarize(avg_life = mean(lifeExp),
avg_GDP = mean(gdpPercap))
# A tibble: 5 × 3
continent avg_life avg_GDP
<fct> <dbl> <dbl>
1 Africa 48.9 2194.
2 Americas 64.7 7136.
3 Asia 60.1 7902.
4 Europe 71.9 14469.
5 Oceania 74.3 18622.
group_by() + summarize()
II# track changes in average life expectancy and gdp over time
gapminder %>%
group_by(year) %>%
summarize(mean_life = mean(lifeExp),
mean_GDP = mean(gdpPercap))
# A tibble: 12 × 3
year mean_life mean_GDP
<int> <dbl> <dbl>
1 1952 49.1 3725.
2 1957 51.5 4299.
3 1962 53.6 4726.
4 1967 55.7 5484.
5 1972 57.6 6770.
6 1977 59.6 7313.
7 1982 61.5 7519.
8 1987 63.2 7901.
9 1992 64.2 8159.
10 1997 65.0 9090.
11 2002 65.7 9918.
12 2007 67.0 11680.
group_by() + summarize()
III# track changes in average life expectancy and gdp over time
gapminder %>%
group_by(continent, year) %>%
summarize(mean_life = mean(lifeExp),
mean_GDP = mean(gdpPercap))
# A tibble: 60 × 4
# Groups: continent [5]
continent year mean_life mean_GDP
<fct> <int> <dbl> <dbl>
1 Africa 1952 39.1 1253.
2 Africa 1957 41.3 1385.
3 Africa 1962 43.3 1598.
4 Africa 1967 45.3 2050.
5 Africa 1972 47.5 2340.
6 Africa 1977 49.6 2586.
7 Africa 1982 51.6 2482.
8 Africa 1987 53.3 2283.
9 Africa 1992 53.6 2282.
10 Africa 1997 53.6 2379.
# … with 50 more rows
tidyverse
uses same grammar and design philosophygapminder %>%
group_by(continent, year) %>%
summarize(mean_life = mean(lifeExp),
mean_GDP = mean(gdpPercap)) %>%
# now pipe this tibble in as data for ggplot!
ggplot(data = ., # . pipes the above in (to data layer)
aes(x = year,
y = mean_life,
color = continent))+
geom_path(size = 1)+
labs(x = "Year",
y = "Average Life Expectancy (Years)",
color = "Continent",
title = "Average Life Expectancy Over Time")+
theme_classic(base_family = "Fira Sans Condensed",
base_size = 20)
dplyr
: Other Useful Commandstally()
: counts for categoriestally
provides counts, best used with group_by
for factors
slice()
: Filter row by positionslice()
subsets observations by position instead of filter
ing by values# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Albania Europe 1962 64.8 1728137 2313.
2 Albania Europe 1967 66.2 1984060 2760.
3 Albania Europe 1972 67.7 2263554 3313.
# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1957 30.3 9240934 821.
2 Afghanistan Asia 1962 32.0 10267083 853.
3 Bosnia and Herzegovina Europe 1977 69.9 4086000 3528.
pull()
: Extract columnspull()
extracts a column from a tibble
(just like $
for a data.frame
)# Get all U.S. life expectancy observations
gapminder %>%
filter(country == "United States") %>%
pull(lifeExp)
[1] 68.440 69.490 70.210 70.760 71.340 73.380 74.650 75.020 76.090 76.810
[11] 77.310 78.242
# Get U.S. life expectancy in 2007
gapminder %>%
filter(country == "United States" & year == 2007) %>%
pull(lifeExp)
[1] 78.242
distinct()
: Show unique valuesdistinct()
shows the distinct values of a specified variable (recall n_distinct()
inside summarize()
just gives you the number of values)# A tibble: 142 × 1
country
<fct>
1 Afghanistan
2 Albania
3 Algeria
4 Angola
5 Argentina
6 Australia
7 Austria
8 Bahrain
9 Bangladesh
10 Belgium
# … with 132 more rows