Transmute new columns based on exact matches of multiple words in a string

Question

I have a data frame:

df <- data.frame(
  Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
  Dominantspp = c("OM", "OM", "RSS", "CH"),
  Commonspp = c(" ", " ", " ", "OM"),
  Rarespp = c(" ", " ", "SD", "NP"),
  NP = rep("northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM", 4),
  OM = rep("steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM", 4),
  RSS = rep("redside shiner|REDSIDE SHINER|rs|RS|rss|RSS", 4),
  suck = rep("suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS", 4)
)

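I need to use the columns populated with common fish codes/names (NP, OM, RSS, suck) as regular expressions, evaluate them against the first four columns, and output 1 or 0 for each code column depending on whether its expression matches. My code below does not match whole words only (it also matches partial words), so it produces incorrect results (see the output below).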

df %>%
  rowwise() %>%
  transmute_at(vars(NP, OM, RSS, suck), 
               funs(case_when(
                 grepl(., Dominantspp) ~ "1",
                 grepl(., Commonspp) ~ "1",
                 grepl(., Rarespp) ~ "1",
                 grepl(., Otherspp) ~ "1",
                 TRUE ~ "0"))) %>%
  ungroup()

Result: note that in the third row both "suck" and "RSS" receive a "1", although only RSS should.

# A tibble: 4 x 4
     NP    OM   RSS  suck
  <chr> <chr> <chr> <chr>
1     0     1     0     1
2     0     1     0     0
3     0     0     1     1
4     1     1     1     1

Desired output:

  NP OM RSS suck
1  0  1   0    1
2  0  1   0    0
3  0  0   1    0
4  1  1   1    0

Answer 1

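The quickest way to fix this while keeping the same approach is to add word boundaries (\\b in an R string) at the start and end of each regular expression: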

df <- data.frame(
  Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
  Dominantspp = c("OM", "OM", "RSS", "CH"),
  Commonspp = c(" ", " ", " ", "OM"),
  Rarespp = c(" ", " ", "SD", "NP"),
  NP = rep("\\b(northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM)\\b", 4),
  OM = rep("\\b(steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM)\\b", 4),
  RSS = rep("\\b(redside shiner|REDSIDE SHINER|rs|RS|rss|RSS)\\b", 4),
  suck = rep("\\b(suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS)\\b", 4),
  stringsAsFactors = FALSE
)

This makes the regular expressions match whole words only, which lets your existing solution work.
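To see the effect in isolation, here is a minimal base R sketch that reuses the suck pattern from the question. The variable names (loose, strict, values) are made up for illustration only.

```r
# The suck pattern from the question, without and with word boundaries.
loose  <- "suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS"
strict <- paste0("\\b(", loose, ")\\b")

# Illustrative values: "RSS" contains "SS" but is a different code.
values <- c("RSS", "suck SD", "SS")

loose_hits  <- grepl(loose, values)   # TRUE TRUE TRUE: "SS" is found inside "RSS"
strict_hits <- grepl(strict, values)  # FALSE TRUE TRUE: whole words only

loose_hits
strict_hits
```

Wrapping the alternatives in parentheses matters: without the group, the boundaries would apply only to the first and last alternative rather than to all of them.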

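That said, I don't think this is necessarily the best way to approach the problem (rowwise() is rarely recommended these days, and this approach won't scale well to many fish codes). I think you would have an easier time with this data if you normalised it into a tidy format, with one row per combination of original row and code: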

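library(dplyr)     # select(), mutate(), row_number(), %>%
library(tidyr)     # gather()
library(tidytext)  # unnest_tokens()

row_codes <- df %>%
  select(Otherspp:Rarespp) %>%
  mutate(row = row_number()) %>%
  # one row per (original row, column type)
  gather(type, codes, -row) %>%
  # split on spaces; unnest_tokens() lower-cases tokens by default
  unnest_tokens(code, codes, token = "regex", pattern = " ")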

This results in:

   row        type code
1    1 Dominantspp   om
2    1    Otherspp suck
3    1    Otherspp   sd
4    2 Dominantspp   om
5    2    Otherspp   bt
6    3 Dominantspp  rss
7    3    Otherspp   sd
8    3    Otherspp   rs
9    3     Rarespp   sd
10   4   Commonspp   om
11   4 Dominantspp   ch
12   4    Otherspp  rss
13   4     Rarespp   np

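At this point the codes are much easier to work with (you no longer need regular expressions). For example, you could inner_join them to a table of fish codes.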
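As a rough sketch of that join: the fish_codes lookup table below is hypothetical (it is not part of the original post) and lists only the short lower-case codes, and row_codes is retyped from the output above so the example stands on its own.

```r
library(dplyr)

# Long table from the output above, retyped so this sketch is self-contained.
row_codes <- data.frame(
  row  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  code = c("om", "suck", "sd", "om", "bt", "rss", "sd", "rs", "sd",
           "om", "ch", "rss", "np"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per accepted lower-case spelling.
fish_codes <- data.frame(
  code    = c("np", "npm", "om", "st", "rb", "rs", "rss", "suck", "su", "ss"),
  species = c("NP", "NP", "OM", "OM", "OM", "RSS", "RSS", "suck", "suck", "suck"),
  stringsAsFactors = FALSE
)

# Exact matching is now a plain equality join; no regular expressions involved.
matched <- row_codes %>%
  inner_join(fish_codes, by = "code") %>%
  distinct(row, species)

# Cross-tabulate into a row-by-species table of 0/1 flags.
flags <- table(
  row     = factor(matched$row, levels = 1:4),
  species = factor(matched$species, levels = c("NP", "OM", "RSS", "suck"))
)
flags
```

With these inputs the flags agree with the desired output in the question. Multi-word names such as "northern pikeminnow" would need a different tokenisation, since splitting on spaces breaks them apart.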
