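Suppose a data frame df has several columns, one of which is a "description" field, and suppose a set of keywords is stored in a separate vector keywords. What is the best practice for creating a column in df for each keyword, named accordingly, and filling it with 1 where the keyword appears in that row's description and 0 otherwise? For example: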
(df <- data.frame(
ID = letters[1:10],
DESCRIPTION = c("blue", "red", "this was green", "this was red", "blue and red", "green", NA, "green", "green, blue, and red", NA)
))
ID DESCRIPTION
1 a blue
2 b red
3 c this was green
4 d this was red
5 e blue and red
6 f green
7 g <NA>
8 h green
9 i green, blue, and red
10 j <NA>
keywords <- c("blue", "red", "green")
would return
ID DESCRIPTION blue red green
1 a blue 1 0 0
2 b red 0 1 0
3 c this was green 0 0 1
4 d this was red 0 1 0
5 e blue and red 1 1 0
6 f green 0 0 1
7 g <NA> 0 0 0
8 h green 0 0 1
9 i green, blue, and red 1 1 1
10 j <NA> 0 0 0
Preferably using base R or dplyr (i.e., avoiding data.table).
Note: the answer needs to be scalable (many possible keywords).
An approach using unnest and pivot_wider:
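library(dplyr)
library(tidyr)
df %>%
  # attach the full keyword vector to every row as a list-column
  mutate(nms = list(!!keywords)) %>%
  # expand to one row per (ID, keyword) pair
  unnest(nms) %>%
  # grepl() is not vectorised over its pattern, so match row by row
  rowwise() %>%
  mutate(values = (grepl(nms, DESCRIPTION))*1) %>%
  ungroup() %>%
  # spread the keywords back out into one indicator column each
  pivot_wider(names_from=nms, values_from=values)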
# A tibble: 10 × 5
ID DESCRIPTION blue red green
<chr> <chr> <dbl> <dbl> <dbl>
1 a blue 1 0 0
2 b red 0 1 0
3 c this was green 0 0 1
4 d this was red 0 1 0
5 e blue and red 1 1 0
6 f green 0 0 1
7 g NA 0 0 0
8 h green 0 0 1
9 i green, blue, and red 1 1 1
10 j NA 0 0 0
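The pipeline above goes through a long table with one row per ID and keyword. With many keywords it may be worth trying a variant that builds the indicator columns directly. The following is only a sketch of that idea, not part of the answer above. It assumes the keywords should be matched as literal substrings (hence fixed = TRUE), and the names indicator_cols and result are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  ID = letters[1:10],
  DESCRIPTION = c("blue", "red", "this was green", "this was red",
                  "blue and red", "green", NA, "green",
                  "green, blue, and red", NA)
)
keywords <- c("blue", "red", "green")

# One integer vector per keyword; setNames() makes the list names
# (and therefore the new column names) equal to the keywords.
# Assumption: literal substring matching, so fixed = TRUE.
indicator_cols <- lapply(
  setNames(keywords, keywords),
  function(k) as.integer(grepl(k, df$DESCRIPTION, fixed = TRUE))
)

# grepl() returns FALSE for NA input, so NA descriptions become 0.
result <- bind_cols(df, as_tibble(indicator_cols))
```

Each keyword costs one vectorised grepl() call over the whole column, so the work grows with the number of keywords without an intermediate reshape.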
In base R, try this using sapply:
cbind(df, (sapply(keywords, grepl, df$DESCRIPTION))*1)
ID DESCRIPTION blue red green
1 a blue 1 0 0
2 b red 0 1 0
3 c this was green 0 0 1
4 d this was red 0 1 0
5 e blue and red 1 1 0
6 f green 0 0 1
7 g <NA> 0 0 0
8 h green 0 0 1
9 i green, blue, and red 1 1 1
10 j <NA> 0 0 0
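One thing to watch with plain grepl is that each keyword is treated as a regular expression and matched as a substring, so "red" would also hit "reddish", and matching is case-sensitive. If whole-word, case-insensitive matching is wanted, a variation along these lines might work. This is a sketch under assumptions: the small df is invented for the demonstration, patterns and flags are illustrative names, and the keywords are assumed to contain no regex metacharacters.

```r
# Toy data (hypothetical) chosen to show the substring and case pitfalls
df <- data.frame(
  ID = letters[1:4],
  DESCRIPTION = c("Blue", "reddish", "green, blue", NA)
)
keywords <- c("blue", "red", "green")

# Wrap each keyword in word boundaries so "red" does not match "reddish".
# Assumption: keywords contain no regex metacharacters.
patterns <- paste0("\\b", keywords, "\\b")

# vapply() fixes the return type: one integer vector of length nrow(df)
# per pattern, collected into a matrix with one column per keyword.
flags <- vapply(
  patterns,
  function(p) as.integer(grepl(p, df$DESCRIPTION, ignore.case = TRUE, perl = TRUE)),
  integer(nrow(df))
)
colnames(flags) <- keywords

result <- cbind(df, flags)
```

Here "Blue" counts as a match for blue, "reddish" does not count for red, and the NA description yields 0 in every column.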