Getting Started With `dplyr`
Overview
- tidy data
- dplyr
- ggplot2
Tidy Data
- each variable is in a column
- each row is an observation
- each dataset only has one unit of observation
(implicit: no lists)
@wickham2014tidy
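As a minimal sketch of these rules, the example below reshapes a small untidy table into a tidy one. The data and column names are hypothetical, and the use of `tidyr::pivot_longer()` is an assumption; the slides do not say which reshaping tool they have in mind.

```r
library(tidyr)

# Hypothetical untidy data: one row per country, with the year stored in
# the column names instead of in its own column.
wide <- data.frame(
  country  = c("A", "B"),
  gdp_2000 = c(1.0, 2.0),
  gdp_2001 = c(1.1, 2.2)
)

# Tidy version: each row is one country-year and each column is one variable.
tidy <- pivot_longer(
  wide,
  cols         = starts_with("gdp_"),
  names_to     = "year",
  names_prefix = "gdp_",
  values_to    = "gdp"
)
tidy$year <- as.integer(tidy$year)

tidy
```

After reshaping, the unit of observation is the country-year, and `year` and `gdp` are ordinary columns that the dplyr verbs can work with.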
Tidy Data
Variables: time, sex, age, GDP, ongoing war, party_id,…
“Observational units”: country-year, respondent, respondent-wave, district, district-month,…
What’s wrong with this data?
Why Tidy Data?
Consistently formatted data makes cool tools possible.
dplyr
Philosophy
Describe 80% of data analysis operations in five consistent verbs:
- select (conditionally return columns)
- filter (conditionally return rows)
- mutate (add a column to every row)
- group by (specify a group to operate on)
- summarize (return one row per group)
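The five verbs can be sketched one at a time on a toy table. The data frame and its column names below are made up for illustration; only the verb functions themselves come from dplyr.

```r
library(dplyr)

# Hypothetical toy data; the column names are assumptions for illustration.
df <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  gdp     = c(1000, 1100, 3000, 3300),
  pop     = c(10, 10, 20, 20)
)

cols     <- select(df, country, gdp)         # keep only the named columns
recent   <- filter(df, year > 2000)          # keep rows where the condition holds
with_pc  <- mutate(df, gdp_pc = gdp / pop)   # add a column computed for every row
by_ctry  <- group_by(with_pc, country)       # mark groups; the rows are unchanged
averages <- summarize(by_ctry, mean_gdp_pc = mean(gdp_pc))  # one row per group

averages
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the calls be chained together.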
The pipe/magrittr
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
The pipe
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
http://r4ds.had.co.nz/pipes.html
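The foo_foo functions are illustrative and do not exist, so here is a runnable sketch of the same idea using base R functions. The vector `x` is an arbitrary example; the point is only that the nested and piped forms compute the same thing.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: read top to bottom. Each result is passed on as the first
# argument of the next call.
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

In general, `x %>% f(y)` is evaluated as `f(x, y)`.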
Pipes
Pipes can be used with most functions!
df %>%
  filter(year > 2000) %>%
  lm(income ~ age + educ, data = .)  # lm() takes the formula first, so pass the data as `.`
rnorm(100) %>%
  matrix(ncol = 2) %>%
  plot()
df %>%
  mutate(high_GDP = GDP > 2000) %>%
  View()
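One caveat is that the pipe fills the first argument by default, while some functions, such as `lm()`, expect the data elsewhere. A minimal sketch of the magrittr `.` placeholder follows; it uses the built-in `mtcars` data as a stand-in, since the slides' `df` is not available.

```r
library(magrittr)

# mtcars ships with R; it stands in for the slides' `df`.
fit <- mtcars %>%
  subset(cyl > 4) %>%            # data goes in as the first argument
  lm(mpg ~ wt + hp, data = .)    # `.` places the data in the `data` argument

coef(fit)
```

When `.` appears as an argument, magrittr puts the left-hand side there and does not also insert it as the first argument.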
Motivating Example
Among countries that have experienced a civil war since 1980, what is the average GDP per capita by continent?
Fearon and Laitin
fl <- haven::read_dta("Fearon_Laitin_repdata.dta")
names(fl)
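Before working with the real file, the motivating question can be sketched on toy data. Everything about the table below is hypothetical: the column names (`country`, `continent`, `year`, `civil_war`, `gdp_pc`) are assumptions and may not match what `names(fl)` reports. The sketch also assumes that GDP per capita is averaged over all years of the qualifying countries, which is only one reading of the question.

```r
library(dplyr)

# Hypothetical stand-in for the replication data.
toy <- tibble(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(1975, 1985, 1975, 1985, 1975, 1985),
  civil_war = c(1, 0, 0, 1, 0, 0),
  gdp_pc    = c(2, 4, 6, 8, 10, 12)
)

result <- toy %>%
  group_by(country) %>%
  # Keep every row of a country that had any civil-war year from 1980 on.
  filter(any(civil_war == 1 & year >= 1980)) %>%
  # Regroup by continent, replacing the country grouping.
  group_by(continent) %>%
  summarize(mean_gdp_pc = mean(gdp_pc))

result
```

Only country B qualifies here, because A's civil war predates 1980, so the result has a single row for continent X.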