R语言

Lecture05 : 使用tidyverse处理数据

罗益民

2024-09-20

本节内容一览

Part01 tidyverse简介
Part02 使用dplyr进行数据转换
- dplyr核心函数
- dplyr包其它常用函数
- 管道
Part03 使用dplyr处理关系数据

Part01 tidyverse介绍

tidyverse 系统

tidyverse是一系列用于数据输入输出、数据整理和数据汇总的R扩展包集合，使用这些包遵循相近的编程风格，比直接使用基本R编程要更直观、容易理解。

tidyverse 扩展包

package	使用场景
readr	数据读取
ggplot2	数据可视化
tibble	简单数据框
tidyr	数据整洁，数据清理
dplyr	数据处理、关系数据
stringr	处理字符串
forcats	处理因子
lubridate	处理日期和时间
purr	循环和迭代
magrittr	管道

由于篇幅有限，本章仅介绍dplyr和管道操作

Part02 使用dplyr进行数据转换

dplyr核心函数

我们将学习5个dplyr 核心函数，这些函数可以解决数据处理中的绝大多数难题。

按值筛选观测(按行筛选)：filter()
对行进行重新排序：arrange()
选取变量（按列筛选）：select()
使用现有变量的函数创建新变量：mutate()
将多个值总结为一个摘要统计量：summarize()

这些函数都可以和group_by()函数联合起来使用， group_by() 函数可以改变以上每个函数的作用范围，让其从在整个数据集上操作变为在每个分组上分别操作。

5 个函数的工作方式都是相同的:

第一个参数是一个tibble。
随后的参数使用变量名称（不带引号）描述了在数据框上进行的操作。
输出结果是一个新tibble。

数据

我们使用 nycflights13::flights数据集进行演示。这个数据集包含了 2013 年从纽约市出发的所有 336776 架次航班的信息。

if(!require("nycflights13")){
  installed.packages("nycflights13")
}

library(nycflights13)
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

使用`filter()`筛选行

library(tidyverse)
filter(flights,month==1,day==1)

# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights,month==11 | month==12)

# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights,month==11 %in% c(11,12))

# A tibble: 27,004 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

filter(flights,dep_delay==2 & arr_time < 900)

# A tibble: 517 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1     2157           2155         2       43             41
 3  2013     1     2      602            600         2      646            659
 4  2013     1     2      617            615         2      837            817
 5  2013     1     2      647            645         2      849            916
 6  2013     1     2      734            732         2      844            853
 7  2013     1     2     2030           2028         2        8           2351
 8  2013     1     2     2137           2135         2      208            224
 9  2013     1     3      532            530         2      851            831
10  2013     1     4      602            600         2      718            730
# ℹ 507 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

还可以进一步使用slice(n)函数取部分结果：

filter(flights,month==1,day==1) |> slice(1:3)

# A tibble: 3 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

使用`arrange()`排列行

arrange() 与 filter()函数非常相似，但不是选择行，而是改变行的顺序

# 根据多列依次排序
arrange(flights, year, month, day)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

# 降序排序
arrange(flights, desc(arr_time))

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     2209           2155        14     2400           2337
 2  2013     1     5     2116           2130       -14     2400             18
 3  2013     1    13     2243           2129        74     2400           2224
 4  2013     1    16     2138           2107        31     2400           2322
 5  2013     1    17     2256           2249         7     2400           2357
 6  2013     1    22     2212           2055        77     2400           2250
 7  2013     1    22     2249           2125        84     2400           2250
 8  2013     1    25     2055           1725       210     2400           1933
 9  2013     1    28     2303           2250        13     2400           2354
10  2013     1    30     2155           1915       160     2400           2137
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

使用`select()`选择列

# 按名称选择列
select(flights, year, month, day)

# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

# 选择“year”和“day”之间的所有列（包括“year”和“day”）
select(flights, year:day)

# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

# 选择不在“year”和“day”之间的所有列（不包括“year”和“day”）
select(flights, -(year:day))

# A tibble: 336,776 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

可以在 select () 函数中使用一些辅助函数，如: starts_with("abc"), ends_with("xyz"), where, contains("ijk"), matches("(.)\\1")等

# 选择列名以arr开头的列 
select(flights, starts_with("arr"))

# A tibble: 336,776 × 2
   arr_time arr_delay
      <int>     <dbl>
 1      830        11
 2      850        20
 3      923        33
 4     1004       -18
 5      812       -25
 6      740        12
 7      913        19
 8      709       -14
 9      838        -8
10      753         8
# ℹ 336,766 more rows

# 选择字符串类型的列
select(flights,where(is.character))

# A tibble: 336,776 × 4
   carrier tailnum origin dest 
   <chr>   <chr>   <chr>  <chr>
 1 UA      N14228  EWR    IAH  
 2 UA      N24211  LGA    IAH  
 3 AA      N619AA  JFK    MIA  
 4 B6      N804JB  JFK    BQN  
 5 DL      N668DN  LGA    ATL  
 6 UA      N39463  EWR    ORD  
 7 B6      N516JB  EWR    FLL  
 8 EV      N829AS  LGA    IAD  
 9 B6      N593JB  JFK    MCO  
10 AA      N3ALAA  LGA    ORD  
# ℹ 336,766 more rows

使用`mutate()`添加新变量

mutate()将新列添加在现有数据集的最后，因此我们需要先创建一个更狭窄的数据集，以便能够看到新变量。

slim_flights <- select(flights,year:day,ends_with("delay"),distance,air_time)
mutate(slim_flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

# A tibble: 336,776 × 9
    year month   day dep_delay arr_delay distance air_time  gain speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
 1  2013     1     1         2        11     1400      227     9  370.
 2  2013     1     1         4        20     1416      227    16  374.
 3  2013     1     1         2        33     1089      160    31  408.
 4  2013     1     1        -1       -18     1576      183   -17  517.
 5  2013     1     1        -6       -25      762      116   -19  394.
 6  2013     1     1        -4        12      719      150    16  288.
 7  2013     1     1        -5        19     1065      158    24  404.
 8  2013     1     1        -3       -14      229       53   -11  259.
 9  2013     1     1        -3        -8      944      140    -5  405.
10  2013     1     1        -2         8      733      138    10  319.
# ℹ 336,766 more rows

如果只想保留新变量，可以使用 transmute() 函数：

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# A tibble: 336,776 × 3
    gain hours gain_per_hour
   <dbl> <dbl>         <dbl>
 1     9 3.78           2.38
 2    16 3.78           4.23
 3    31 2.67          11.6 
 4   -17 3.05          -5.57
 5   -19 1.93          -9.83
 6    16 2.5            6.4 
 7    24 2.63           9.11
 8   -11 0.883        -12.5 
 9    -5 2.33          -2.14
10    10 2.3            4.35
# ℹ 336,766 more rows

使用`summarize()`进行分组摘要

summarize()一般与 group_by() 一起使用。 group_by() 可以将分析单位从整个数据集更改为单个分组。接下来，在分组后的数据框上使用 dplyr 函数时，它们会自动地应用到每个分组。

by_day <- group_by(flights, year, month)
# 分组统计每个月的航班延误均值
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 12 × 3
# Groups:   year [1]
    year month delay
   <int> <int> <dbl>
 1  2013     1 10.0 
 2  2013     2 10.8 
 3  2013     3 13.2 
 4  2013     4 13.9 
 5  2013     5 13.0 
 6  2013     6 20.8 
 7  2013     7 21.7 
 8  2013     8 12.6 
 9  2013     9  6.72
10  2013    10  6.24
11  2013    11  5.44
12  2013    12 16.6

group_by() 和 summarize() 的组合构成了使用 dplyr 包时最常用的操作之一：分组摘要

使用`ungroup()`取消分组

flights <- ungroup(by_day)
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

summarize(flights,count=n())

# A tibble: 1 × 1
   count
   <int>
1 336776

dplyr包其它常用函数

使用`count()`进行分组计数

因为计数太常用了，所以 dplyr 提供了一个简单的辅助函数，用于只需要计数的情况。dplyr包的count()功能与基础包中的table()类似，都是分组统计数量

table(flights$year)


  2013 
336776

count(flights,year)

# A tibble: 1 × 2
   year      n
  <int>  <int>
1  2013 336776

# 根据多个变量分组统计数量
count(flights,year,month)

# A tibble: 12 × 3
    year month     n
   <int> <int> <int>
 1  2013     1 27004
 2  2013     2 24951
 3  2013     3 28834
 4  2013     4 28330
 5  2013     5 28796
 6  2013     6 28243
 7  2013     7 29425
 8  2013     8 29327
 9  2013     9 27574
10  2013    10 28889
11  2013    11 27268
12  2013    12 28135

# 排序统计结果
count(flights,month,sort=TRUE)

# A tibble: 12 × 2
   month     n
   <int> <int>
 1     7 29425
 2     8 29327
 3    10 28889
 4     3 28834
 5     5 28796
 6     4 28330
 7     6 28243
 8    12 28135
 9     9 27574
10    11 27268
11     1 27004
12     2 24951

在分组摘要中使用`n()`计数

n()不需要任何参数，返回当前分组的大小。

by_day <- group_by(flights, year, month)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE),count=n())

# A tibble: 12 × 4
# Groups:   year [1]
    year month delay count
   <int> <int> <dbl> <int>
 1  2013     1 10.0  27004
 2  2013     2 10.8  24951
 3  2013     3 13.2  28834
 4  2013     4 13.9  28330
 5  2013     5 13.0  28796
 6  2013     6 20.8  28243
 7  2013     7 21.7  29425
 8  2013     8 12.6  29327
 9  2013     9  6.72 27574
10  2013    10  6.24 28889
11  2013    11  5.44 27268
12  2013    12 16.6  28135

n()和count()有何不同？ 在本质上，count() 相当于 summarize(n = n())，两者达到的效果一样。但是count()不能用于summarize()函数中。

列拆分与合并

1. 使用separate()将一列拆成多列

当你有一列包含多个值，并且想要将这些值分隔到不同的列时。这个函数特别有用。

df <- tibble(name = c("王小明,男", "李敏,女", "张爱夏,女"))
# separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, ...)
df2 <- separate(df,name,into=c("name","sex"),sep=",")

2. 使用unite()将多列合成一列

# unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
unite(df2, "name_sex", c("name","sex"),sep=",", remove = FALSE)

# A tibble: 3 × 3
  name_sex  name   sex  
  <chr>     <chr>  <chr>
1 王小明,男 王小明 男   
2 李敏,女   李敏   女   
3 张爱夏,女 张爱夏 女

去除重复行

# tibble::tribble 是 tibble 包中的一个函数，它提供了一种更易读的方式来创建小的数据集。
# 这个函数特别适用于创建行数较少的数据框，因为它允许你按行来定义数据，使得数据的布局更加清晰。
df <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, 1,
  1, 1, 2,
  1, 1, 1,
  2, 1, 2,
  2, 2, 3,
  3, 3, 1
)
# 整体去重
distinct(df)

# A tibble: 5 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     1     1
2     1     1     2
3     2     1     2
4     2     2     3
5     3     3     1

# 针对x列去重
distinct(df,x)

# A tibble: 3 × 1
      x
  <dbl>
1     1
2     2
3     3

# 针对x列去重,保留所有列
distinct(df,x,.keep_all=TRUE)

# A tibble: 3 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     1     1
2     2     1     2
3     3     3     1

管道（pipe）

数据集经常需要选行子集、选列子集、排序、定义新变量、横向合并、长宽转换等操作，而且经常会用若干个连续的操作分步处理。

magrittr包的%>%首先提出了管道的概念，特别适用于这种分步处理，即将上一步的运算结果作为下一步的初始变量依次传入。在R4.1.0版本中，提出了原生内置的管道符|>。¹

由于%>%出现的更早，目前大量代码仍在使用。但鼓励大家使用原生|>，它具有更好的性能，且不依赖任何第三方库。

管道示例

flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay)) |>
  group_by(year, month, day) |>
    summarize(
      # 平均延误时间：
      avg_delay1 = mean(arr_delay),
      # 平均正延误时间：
      avg_delay2 = mean(arr_delay[arr_delay > 0])
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day avg_delay1 avg_delay2
   <int> <int> <int>      <dbl>      <dbl>
 1  2013     1     1     12.7         32.5
 2  2013     1     2     12.7         32.0
 3  2013     1     3      5.73        27.7
 4  2013     1     4     -1.93        28.3
 5  2013     1     5     -1.53        22.6
 6  2013     1     6      4.24        24.4
 7  2013     1     7     -4.95        27.8
 8  2013     1     8     -3.23        20.8
 9  2013     1     9     -0.264       25.6
10  2013     1    10     -5.90        27.3
# ℹ 355 more rows

管道操作中的缺失值问题

flights |> 
  group_by(year,month,day) |>
    summarize(mean=mean(dep_delay))

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day  mean
   <int> <int> <int> <dbl>
 1  2013     1     1    NA
 2  2013     1     2    NA
 3  2013     1     3    NA
 4  2013     1     4    NA
 5  2013     1     5    NA
 6  2013     1     6    NA
 7  2013     1     7    NA
 8  2013     1     8    NA
 9  2013     1     9    NA
10  2013     1    10    NA
# ℹ 355 more rows

在管道操作中必须要注意缺失值的问题

上面的例子出现大量NA，这是因为在管道操作中缺失值具有传染性：在一个环节上出现NA，后续的所有步骤都是NA。

使用na.rm=TRUE

flights |> 
  group_by(year,month,day) |>
    summarize(mean=mean(dep_delay,na.rm=TRUE))

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day  mean
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# ℹ 355 more rows

使用is.na()提前进行数据筛选

flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay))

# A tibble: 327,346 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 327,336 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Part03 使用dplyr处理关系数据

关系数据

只涉及一张数据表的数据分析是非常罕见的。通常来说，你会有很多个数据表，而且必须综合使用它们才能回答你所感兴趣的问题。存在于多个表中的这种数据统称为关系数据。有三类操作来处理关系数据。

合并连接：向数据框中加入新变量，新变量的值是另一个数据框中的匹配观测。
筛选连接：根据是否匹配另一个数据框中的观测，筛选数据框中的观测。
集合操作：将观测作为集合元素来处理。

其中最常用的是合并连接

理解连接

下图中，有颜色的列表示作为“键”的变量：它们用于在表间匹配行。灰色列表示“值”列，是与键对应的值。

连接是将 x 中每行连接到 y 中 0 行、一行或多行的一种方法。在下图中用圆点表示匹配。圆点的数量 = 匹配的数量 = 结果中行的数量

我们将用两个简单的数据框说明连接：

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

内连接（inner join）

内连接是最简单的一种连接。只要两个观测的键是相等的，内连接就可以匹配它们。

x |> 
  inner_join(y,by="key")

# A tibble: 2 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2

内连接一般不适合在分析中使用，因为没有匹配的行不会包含在结果中，容易丢失观测。

外连接（outer join）

外连接则保留至少存在于一个表中的观测。外连接有3种类型。

左连接：保留 x 中的所有观测。
右连接：保留 y 中的所有观测
全连接：保留 x 和 y 中的所有观测。

示例数据

nycflights13 中包含了与 flights 相关的 4 个tibble，我们将使用这些数据来演示连接：

flights：航班信息，主要数据。
airlines：航空公司。通过carrier标识
airports：给出了每个机场的信息，通过 faa 机场编码进行标识
planes：给出了每架飞机的信息，通过 tailnum 进行标识。
weather：给出了纽约机场每小时的天气状况。

flights2 <- flights |>
  select(year:day, hour, origin, dest, tailnum, carrier)

自然连接

默认情况下by=NULL，这会使用存在于两个表中的所有变量，这种方式称为自然连接 例如，匹配航班表和天气表时使用的就是其公共变量： year、 month、 day、 hour 和 origin。

flights2

# A tibble: 336,776 × 8
    year month   day  hour origin dest  tailnum carrier
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
 1  2013     1     1     5 EWR    IAH   N14228  UA     
 2  2013     1     1     5 LGA    IAH   N24211  UA     
 3  2013     1     1     5 JFK    MIA   N619AA  AA     
 4  2013     1     1     5 JFK    BQN   N804JB  B6     
 5  2013     1     1     6 LGA    ATL   N668DN  DL     
 6  2013     1     1     5 EWR    ORD   N39463  UA     
 7  2013     1     1     6 EWR    FLL   N516JB  B6     
 8  2013     1     1     6 LGA    IAD   N829AS  EV     
 9  2013     1     1     6 JFK    MCO   N593JB  B6     
10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
# ℹ 336,766 more rows

weather

# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

flights2 |> 
  left_join(weather)

# A tibble: 336,776 × 18
    year month   day  hour origin dest  tailnum carrier  temp  dewp humid
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
 7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
 8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
 9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
# ℹ 336,766 more rows
# ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

指定键列

更为常见的情况是两张表通过一个单变量来连接，而且这个变量在两张表中具有同样的名称。这种限制条件可通过 by = “key” 来实现。例如，flights 和 planes 表中都有 year 变量，但是它们的意义不同，因此我们只通过 tailnum进行连接：¹

head(flights2,2)

# A tibble: 2 × 8
   year month   day  hour origin dest  tailnum carrier
  <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1  2013     1     1     5 EWR    IAH   N14228  UA     
2  2013     1     1     5 LGA    IAH   N24211  UA

head(planes,2)

# A tibble: 2 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…

flights2 |> 
  left_join(planes,by="tailnum")

# A tibble: 336,776 × 16
   year.x month   day  hour origin dest  tailnum carrier year.y type            
    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
 7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
 8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
 9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>

使用命名字符向量指定匹配键列

命名字符向量 by = c("a" = "b")。这种方式会匹配 x 表中的 a 变量和 y 表中的 b 变量。输出结果中使用的是 x 表中的变量。例如，我们想在航班数据中加入机场数据，后者包含了每个机场的位置。因为每次航班都有起点机场和终点机场，所以需要指定使用哪个机场进行连接：

head(flights2,2)

# A tibble: 2 × 8
   year month   day  hour origin dest  tailnum carrier
  <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1  2013     1     1     5 EWR    IAH   N14228  UA     
2  2013     1     1     5 LGA    IAH   N24211  UA

head(airports,2)

# A tibble: 2 × 8
  faa   name                            lat   lon   alt    tz dst   tzone       
  <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr> <chr>       
1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A     America/New…
2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A     America/Chi…

使用终点机场连接

flights2 |>
  left_join(airports, c("dest" = "faa"))

# A tibble: 336,776 × 15
    year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
 2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
 6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
 8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
 9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

使用起点机场连接

flights2 |>
  left_join(airports, c("origin" = "faa"))

# A tibble: 336,776 × 15
    year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
 2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
 3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
 4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
 5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
 6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
 8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
 9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

比较内置`merge()`函数和`dplyr`

base::merge() 函数可以实现所有 4 种合并连接操作。

dplyr 连接操作的优点:

可以更加清晰地表达出代码的意图
dplyr 连接操作的速度更快，而且不会弄乱行的顺序。