Indicate occurrences of keywords from another vector in a df column in R

Question description (votes: 0, answers: 1)

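Assume a data frame df with several columns, one of which is a "description" field, and assume a set of keywords stored in a separate vector keywords. What is the best practice to:

  • create a column in df for each keyword, named accordingly; and
  • store in each column the count of that keyword's occurrences in the "description" field of df?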

For example:

(df <- data.frame(
  ID = letters[1:10],
  DESCRIPTION = c("blue", "red", "this was green", "this was red", "blue and red", "green", NA, "green", "green, blue, and red", NA)
))
   ID          DESCRIPTION
1   a                 blue
2   b                  red
3   c       this was green
4   d         this was red
5   e         blue and red
6   f                green
7   g                 <NA>
8   h                green
9   i green, blue, and red
10  j                 <NA>
keywords <- c("blue", "red", "green")

would return:

   ID          DESCRIPTION blue red green
1   a                 blue    1   0     0
2   b                  red    0   1     0
3   c       this was green    0   0     1
4   d         this was red    0   1     0
5   e         blue and red    1   1     0
6   f                green    0   0     1
7   g                 <NA>    0   0     0
8   h                green    0   0     1
9   i green, blue, and red    1   1     1
10  j                 <NA>    0   0     0

Preferably using base R or dplyr (i.e. avoiding data.table).

Note: the answer needs to scale (many possible keywords).

r text-mining
1 answer
1 vote

An approach using unnest and pivot_wider:

library(dplyr)
library(tidyr)

df %>% 
  mutate(nms = list(!!keywords)) %>% 
  unnest(nms) %>% 
  rowwise() %>% 
  mutate(values = (grepl(nms, DESCRIPTION))*1) %>% 
  ungroup() %>% 
  pivot_wider(names_from=nms, values_from=values)
# A tibble: 10 × 5
   ID    DESCRIPTION           blue   red green
   <chr> <chr>                <dbl> <dbl> <dbl>
 1 a     blue                     1     0     0
 2 b     red                      0     1     0
 3 c     this was green           0     0     1
 4 d     this was red             0     1     0
 5 e     blue and red             1     1     0
 6 f     green                    0     0     1
 7 g     NA                       0     0     0
 8 h     green                    0     0     1
 9 i     green, blue, and red     1     1     1
10 j     NA                       0     0     0
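
One caveat about the grepl call above: it treats each keyword as a regular expression and matches it anywhere in the string, so "red" would also be flagged inside a word such as "bored". If whole-word matching is wanted, a minimal sketch in base R is shown here. The vector descriptions and the \\b-wrapped patterns are illustrative assumptions, not part of the original answer, and the sketch assumes the keywords contain no regex metacharacters.

```r
# Illustrative data (hypothetical, not the df from the question)
descriptions <- c("blue", "bored", "this was red", NA)
keywords <- c("blue", "red")

# Wrap each keyword in word boundaries so it only matches as a whole word.
# Assumption: keywords contain no regex metacharacters.
patterns <- paste0("\\b", keywords, "\\b")

# One column per keyword; grepl() returns FALSE for NA input, so NA rows get 0.
flags <- sapply(patterns, grepl, x = descriptions, perl = TRUE) * 1
colnames(flags) <- keywords
print(flags)
```

Here "bored" is no longer counted as a match for "red", while "this was red" still is.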

In base R, try this with sapply:

cbind(df, (sapply(keywords, grepl, df$DESCRIPTION))*1)
   ID          DESCRIPTION blue red green
1   a                 blue    1   0     0
2   b                  red    0   1     0
3   c       this was green    0   0     1
4   d         this was red    0   1     0
5   e         blue and red    1   1     0
6   f                green    0   0     1
7   g                 <NA>    0   0     0
8   h                green    0   0     1
9   i green, blue, and red    1   1     1
10  j                 <NA>    0   0     0
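
Both approaches above record presence (0 or 1) rather than a count, which is enough for the example because no keyword repeats within a description. If true occurrence counts are needed, as the question's wording suggests, one possible sketch uses gregexpr. The helper count_keyword and the small sample data frame are hypothetical names introduced for illustration, and matching is literal (fixed = TRUE), which is an assumption about the desired behaviour.

```r
# Illustrative data (hypothetical); the second row repeats a keyword
df <- data.frame(
  ID = letters[1:4],
  DESCRIPTION = c("blue", "red and more red", NA, "green, blue, and red"),
  stringsAsFactors = FALSE
)
keywords <- c("blue", "red", "green")

# Hypothetical helper: count non-overlapping literal matches of one keyword
# in each element of text.
count_keyword <- function(keyword, text) {
  matches <- gregexpr(keyword, text, fixed = TRUE)
  # gregexpr() yields -1 when nothing matches and NA for NA input;
  # both cases count as zero occurrences here.
  vapply(matches, function(m) sum(!is.na(m) & m > 0), integer(1))
}

# One integer column per keyword, named after the keyword
counts <- vapply(keywords, count_keyword, integer(nrow(df)),
                 text = df$DESCRIPTION)
result <- cbind(df, counts)
print(result)
```

Like the sapply version, this loops once per keyword and is vectorised over rows, so it should extend to a long keyword list without reshaping the data.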