Getting Started With `dplyr`

Overview

  • tidy data
  • dplyr
  • ggplot2

Tidy Data

Tidy Data

  1. each variable is in a column
  2. each row is an observation
  3. each dataset only has one unit of observation

(implicit: each cell holds a single value, not a list)

@wickham2014tidy
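
As a rough illustration of the three rules, the sketch below builds a small untidy table and its tidy counterpart in base R. The data and the column names (`country`, `gdp_2000`, `gdp_2001`) are invented for this example and do not come from any dataset used in these slides.

```r
# Untidy: the variable "year" is hidden in the column names,
# so each row mixes two country-year observations.
untidy <- data.frame(
  country  = c("A", "B"),
  gdp_2000 = c(1.0, 2.0),
  gdp_2001 = c(1.1, 2.2)
)

# Tidy: one row per country-year, one column per variable.
tidy <- data.frame(
  country = rep(untidy$country, times = 2),
  year    = rep(c(2000, 2001), each = nrow(untidy)),
  gdp     = c(untidy$gdp_2000, untidy$gdp_2001)
)

print(tidy)
```

Here the unit of observation in `tidy` is the country-year, and every column is a single variable.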

Tidy Data

Variables: time, sex, age, GDP, ongoing war, party_id,…

“Observational units”: country-year, respondent, respondent-wave, district, district-month,…

What’s wrong with this data?

Why Tidy Data?

Consistently formatted data makes cool tools possible.

dplyr

Philosophy

Describe 80% of data analysis operations in five consistent verbs:

  1. select (conditionally return columns)
  2. filter (conditionally return rows)
  3. mutate (add a column to every row)
  4. group by (specify a group to operate on)
  5. summarize (return one row per group)
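
The five verbs can be chained in a single pipeline. The following is a minimal sketch on a made-up data frame; the object `toy` and all of its columns are hypothetical and serve only to show each verb once.

```r
library(dplyr)

# Hypothetical toy data: one row per country-year
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(2000, 2001, 2000, 2001, 2000, 2001),
  gdp       = c(10, 12, 20, 22, 30, 36),
  pop       = c(2, 2, 4, 4, 3, 3)
)

result <- toy %>%
  select(continent, year, gdp, pop) %>%   # keep only these columns
  filter(year == 2001) %>%                # keep only rows from 2001
  mutate(gdp_pc = gdp / pop) %>%          # add a column computed for every row
  group_by(continent) %>%                 # later steps operate per continent
  summarize(mean_gdp_pc = mean(gdp_pc))   # collapse to one row per continent

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the steps be chained.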

The pipe/magrittr

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ), 
  on = head
)

The pipe

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ), 
  on = head
)

foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

http://r4ds.had.co.nz/pipes.html
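
The `foo_foo` functions are not real, so here is a small runnable sketch of the same idea using base R functions. It assumes only that the magrittr package is installed. The pipe passes its left-hand side as the first argument of the call on its right, so both forms compute the same value.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```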

Pipes

Pipes can be used with most functions!

df %>%
  filter(year > 2000) %>%
  lm(income ~ age + educ, data = .)

rnorm(100) %>%
  matrix(ncol = 2) %>%
  plot()

df %>%
  mutate(high_GDP = GDP > 2000) %>%
  View()
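
One caveat: the pipe fills the first argument by default, and the first argument of `lm()` is the formula, not the data. magrittr's `.` placeholder marks where the piped object should go instead. The sketch below uses the built-in `mtcars` data rather than the `df` object above, whose contents are not shown in these slides.

```r
library(dplyr)

fit <- mtcars %>%
  filter(cyl > 4) %>%           # keep 6- and 8-cylinder cars
  lm(mpg ~ wt, data = .)        # `.` sends the filtered data to `data`

# Number of rows the model was fitted on
nobs(fit)
```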

Motivating Example

Of countries that have experienced a civil war since 1980, what’s the continent-average GDP per capita?
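
One way such a question could be expressed with the verbs above is sketched here on invented data. The column names (`country`, `continent`, `year`, `civil_war`, `gdp_pc`) are assumptions for illustration and are not taken from the Fearon and Laitin replication file. The sketch also averages over all country-years of the selected countries, which is only one possible reading of the question.

```r
library(dplyr)

# Hypothetical country-year data; names and values are illustrative only
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C", "D", "D"),
  continent = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  year      = c(1975, 1985, 1975, 1985, 1975, 1985, 1975, 1985),
  civil_war = c(1, 0, 0, 0, 0, 1, 0, 1),
  gdp_pc    = c(1, 2, 3, 4, 5, 6, 7, 8)
)

answer <- toy %>%
  group_by(country) %>%
  # flag countries with at least one civil-war year from 1980 onward
  mutate(war_since_1980 = any(civil_war == 1 & year >= 1980)) %>%
  ungroup() %>%
  filter(war_since_1980) %>%               # keep only the flagged countries
  group_by(continent) %>%
  summarize(mean_gdp_pc = mean(gdp_pc))    # one row per continent

print(answer)
```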

Fearon and Laitin

fl <- haven::read_dta("Fearon_Laitin_repdata.dta")
names(fl)