A tidyverse/dplyr solution for str_detect with case_when/mutate


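I've seen bits and pieces of this elsewhere, but unfortunately no complete answer so far, so I thought I'd ask.

I'm developing a function that assigns a value based on whether certain keywords, ranked by severity, are present. Like this: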

severity <- c("kw1", "kw2", "kw3", "kw4", "kw5", "kw6")

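It essentially walks through a single column in the dataset and assigns a value based on the "first/most severe" entry from the severity list that is present. From the following question, I realized you can use str_detect to detect multiple strings: "How to check if multiple strings exist in another string?"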

severity_rankings <- severity_df |>
  dplyr::mutate(
    # Classify severity based on strings
    severity_kw = dplyr::case_when(
      if (any(stringr::str_detect(tolower(severity_string), severity))) ~ severity[min(which(str_detect(tolower(severity_string), severity) == TRUE))],
      .default = NA
    )
  )

But this keeps throwing an error, as if it were trying to parse the entire column at once:

Error in `dplyr::mutate()`:
ℹ In argument: `severity_kw = dplyr::case_when(...)`.
Caused by error in `stringr::str_detect()`:
! Can't recycle `string` (size 20) to match `pattern` (size 6).
Run `rlang::last_trace()` to see where the error occurred.

Ultimately, the output I want looks like this:

ID  severity_string           severity_kw
1   kw1 with KW2 and kw6      kw1
2   kw6                       kw6
3   kw6 with kW5, kw2 also    kw2
4   KW3                       kw3
5   KW5                       kw5
6   KW4 with kw2, kw1 also    kw1
7   KW1                       kw1
8   KW2                       kw2
9   KW4 with KW5              kw4
10  KW6                       kw6
11  KW6 with KW1 on the side  kw1
12  KW2 with KW4 and KW1      kw1
13  kw5 with kw6              kw5
14  kw7                       <NA>
15  KW3 and KW2               kw2
16  KW2                       kw2
17  KW1 and KW6               kw1
18  KW3                       kw3
19  KW3 and KW1               kw1
20  kw1                       kw1

I'm sure this is incorrect syntax or a wrong dplyr call, but I don't know where to start. Any and all suggestions would be appreciated.

Code to generate the initial data frame:

severity_df <- data.frame(
  ID = c(1:20),
  severity_string = c(
    "kw1 with KW2 and kw6", "kw6", "kw6 with kW5, kw2 also", "KW3", "KW5",
    "KW4 with kw2, kw1 also", "KW1", "KW2", "KW4 with KW5", "KW6",
    "KW6 with KW1 on the side", "KW2 with KW4 and KW1", "kw5 with kw6", "kw7",
    "KW3 and KW2", "KW2", "KW1 and KW6", "KW3", "KW3 and KW1", "kw1"
  ),
  stringsAsFactors = FALSE
)


r dplyr tidyverse stringr
1 Answer
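You are calling str_detect() with values of string and pattern whose lengths are incompatible: inside mutate(), severity_string is the whole column (size 20) while severity has 6 elements, and str_detect() only recycles length-1 inputs. You can reproduce the error like this: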
str_detect(c("foo", "bar"), c("foo", "bar", "baz"))
#> Error in `str_detect()`:
#> ! Can't recycle `string` (size 2) to match `pattern` (size 3).
#> Run `rlang::last_trace()` to see where the error occurred.
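
For contrast, here is a small sketch of the shapes that do work. The `strings` vector is an illustrative subset of the question's data, and `hits` is a name introduced here for the example only.

```r
library(stringr)

severity <- c("kw1", "kw2", "kw3", "kw4", "kw5", "kw6")

# Illustrative subset of the question's data (an assumption for this sketch)
strings <- c("kw6 with kW5, kw2 also", "KW3", "kw7")

# One pattern against many strings: the length-1 pattern is recycled
str_detect(tolower(strings), "kw2")
#> [1]  TRUE FALSE FALSE

# Many patterns against one string: the length-1 string is recycled
str_detect(tolower(strings[1]), severity)
#> [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

# Every string against every pattern: loop over the keywords,
# giving a logical matrix with one row per string and one column per keyword
hits <- sapply(severity, function(kw) str_detect(tolower(strings), kw))
```

Only the last form compares all strings with all keywords, which is what the question needs.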

I think you also have a misplaced if, but that seems to be beside the point of the question.
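
On the if: each case_when() branch is written as `condition ~ value`, where the condition is a logical vector as long as the column, so no if is needed. A minimal sketch of that form, assuming dplyr 1.1.0 or later for `.default`; `demo_df` and `demo_out` are hypothetical names and the keywords are spelled out by hand purely for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical three-row stand-in for severity_df
demo_df <- data.frame(
  severity_string = c("KW3 and KW1", "kw5 with kw6", "kw7"),
  stringsAsFactors = FALSE
)

demo_out <- demo_df |>
  mutate(
    severity_kw = case_when(
      # Branches are tried in order, so the most severe keyword goes first
      str_detect(tolower(severity_string), "kw1") ~ "kw1",
      str_detect(tolower(severity_string), "kw2") ~ "kw2",
      str_detect(tolower(severity_string), "kw3") ~ "kw3",
      str_detect(tolower(severity_string), "kw4") ~ "kw4",
      str_detect(tolower(severity_string), "kw5") ~ "kw5",
      str_detect(tolower(severity_string), "kw6") ~ "kw6",
      .default = NA_character_
    )
  )

demo_out$severity_kw
#> [1] "kw1" "kw5" NA
```

This works because every str_detect() call receives a single pattern, but it hard-codes the keyword list instead of reusing severity.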

For your use case, I would change tactics and use something like map_chr() with a custom function:

library(dplyr)
library(stringr)

severity_df |>
  mutate(
    severity_kw = severity_string |>

      # For each value of severity_string...
      purrr::map_chr(function(x) {
        
        # For each value of severity...
        for (pattern in severity) {
          
          # Return the value of severity, if there's a match
          if (str_detect(x, regex(pattern, ignore_case = TRUE))) {
            return(pattern)
          }
        }
        
        # If no values match, return NA
        NA_character_
      })
  )
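
As a quick check, the same loop can be wrapped in a small helper and run on a few rows from the question. This is only a sketch: `first_severity()`, `sample_strings`, and `result` are hypothetical names introduced here, not part of any package.

```r
library(stringr)
library(purrr)

severity <- c("kw1", "kw2", "kw3", "kw4", "kw5", "kw6")

# Hypothetical helper: for each string, return the first keyword
# (in the order given) that matches case-insensitively, or NA if none does
first_severity <- function(strings, keywords) {
  map_chr(strings, function(x) {
    for (kw in keywords) {
      if (str_detect(x, regex(kw, ignore_case = TRUE))) {
        return(kw)
      }
    }
    NA_character_
  })
}

# A few rows taken from the question's data
sample_strings <- c("kw6 with kW5, kw2 also", "KW4 with KW5", "kw7", "KW3 and KW2")

result <- first_severity(sample_strings, severity)
result
#> [1] "kw2" "kw4" NA    "kw2"
```

These values agree with the expected output in the question. Note that the sketch assumes severity_string contains no missing values: str_detect() returns NA for an NA input, and if (NA) raises an error.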
