Build a data frame of instances where two different keywords are found in a text string


I have a data frame with two columns: an ID number and a text string:

df <- data.frame(ID=c(1, 2, 3, 4, 5, 6, 7, 8), 
                 text = c("lorem ipsum dolor sit ABC, consectetur adipiscing XYZ",
                          "veritatis et quasi ABC architecto beatae vitae dicta YXZ explicabo", 
                          "dignissimos ducimus CBA blanditiis praesentium ZXY deleniti", 
                          "earum rerum hic BCA tenetur a sapiente delectus, ut aut XYZ", 
                          "enim ad minima veniam, ACB quis nostrum corporis ZYX suscipit",
                          "cillum dolore BAC eu fugiat nulla pariatur ZXY",
                          "sunt CBA, ABC in culpa qui officia deserunt mollit XYZ anim",
                          "debitis ACB aut rerum necessitatibus YZX, XZY saepe eveniet"))

I also have two different lists of specific search terms:

listX <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")
listY <- c("XYZ", "XZY", "YXZ", "YZX", "ZXY", "ZYX")

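I want to search the text in each row of the data frame and build a new data frame with one column containing the ID number, and the other columns containing each match/combination of the search terms from listX and listY: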

output <- data.frame(ID=c(1,2,3,4,5,6,7,7,8,8),
                     X=c("ABC","ABC","CBA","BCA","ACB","BAC","CBA","ABC","ACB","ACB"),
                     Y=c("XYZ","YXZ","ZXY","XYZ","ZYX","ZXY","XYZ","XYZ","YZX","XZY"))

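Is there a way to programmatically generate the output data frame with every possible combination? I know I could probably do this with grepl and then perhaps merge the different results, but that would be an ugly brute-force approach, and my real lists are much longer than the ones in this example. Thanks in advance!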

r dplyr

1 Answer
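library(dplyr)
library(stringr)
library(tidyr)

df |>
  # Collapse each list into one alternation pattern, e.g. "ABC|ACB|...",
  # and extract every match per row into a list-column
  mutate(X = str_extract_all(text, str_flatten(listX, "|")),
         Y = str_extract_all(text, str_flatten(listY, "|"))) |>
  # Expand both list-columns to one row per match
  unnest_longer(X:Y)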
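In the example data, every row has exactly one match on at least one side. As far as I can tell, unnest_longer(X:Y) handles that by recycling the single match, but it would not produce every pairing if a row matched several terms from both lists. If that can happen in the real data, one option is to build the cross product explicitly. The following is only a base R sketch of that idea, not the answerer's method. The names df_small and pairs_for_row are made up for illustration, and the row with ID 9 is a hypothetical addition that matches two terms from each list.

```r
listX <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")
listY <- c("XYZ", "XZY", "YXZ", "YZX", "ZXY", "ZYX")

# Rows 1, 7 and 8 from the question, plus a hypothetical row (ID 9)
# in which BOTH lists match twice
df_small <- data.frame(
  ID = c(1, 7, 8, 9),
  text = c("lorem ipsum dolor sit ABC, consectetur adipiscing XYZ",
           "sunt CBA, ABC in culpa qui officia deserunt mollit XYZ anim",
           "debitis ACB aut rerum necessitatibus YZX, XZY saepe eveniet",
           "ABC then CBA, followed by XYZ and ZYX"),
  stringsAsFactors = FALSE)

# One alternation pattern per list, e.g. "ABC|ACB|BAC|BCA|CAB|CBA"
patX <- paste(listX, collapse = "|")
patY <- paste(listY, collapse = "|")

# Return every X/Y pairing found in a single text, or NULL if either
# list has no match in that text
pairs_for_row <- function(id, txt) {
  x <- regmatches(txt, gregexpr(patX, txt))[[1]]
  y <- regmatches(txt, gregexpr(patY, txt))[[1]]
  if (length(x) == 0 || length(y) == 0) return(NULL)
  combos <- expand.grid(X = x, Y = y, stringsAsFactors = FALSE)
  data.frame(ID = id, combos)
}

# Apply row by row and stack the per-row results
result <- do.call(rbind, Map(pairs_for_row, df_small$ID, df_small$text))
rownames(result) <- NULL
result
```

For IDs 1, 7 and 8 this gives the same rows as the desired output in the question, and ID 9 expands to four rows (2 x 2 pairings). The patterns are plain alternations, so a term embedded in a longer word would also match. Wrapping each term in \\b word boundaries would be one way to tighten that, if needed.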