Flagging a specific pattern with stringr

Question  Votes: 1  Answers: 2

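I am working with a dataset in which I need to flag every code that starts with "C13.xxx". The column also holds other tree codes, all delimited like this: "C13.xxx|B12.xxx". Every tree code contains a period. However, the column also contains other values, and these cause my stringr call to flag strings that are not tree codes. For example: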

library(tidyverse)

# test data
test <- tribble(
  ~id, ~treecode, ~contains_c13_xxx,
  #--|--|----
  1, "B12.123|C13.234.432|A11.123", "yes",
  2, "C12.123|C13039|", "no"
)

# what I tried 
test  %>% mutate(contains_C13_error = ifelse(str_detect(treecode, "C13."), 1, 0))

# code above is flagging both id's as containing C13.xxx

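In id 2 there is a value that starts with C13, but it is not a tree code (all tree codes contain a period). The contains_c13_xxx variable is the result I want to generate. I included the period in the pattern passed to str_detect, so I am not sure what is going wrong here.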

r text tidyverse stringr
2 Answers
1
vote

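The tricky part is that a single column holds several tree codes joined by a delimiter, which makes them hard to flag. We can put each treecode on its own row and then check for the required code, using separate_rows from tidyr: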

library(dplyr)

test %>%
  tidyr::separate_rows(treecode, sep = "\\|") %>%
  group_by(id) %>%
  summarise(contains_C13_error = any(startsWith(treecode, "C13.")),
            treecode = paste(treecode, collapse = "|"))

# A tibble: 2 x 3
#     id contains_C13_error treecode                   
#  <dbl> <lgl>              <chr>                      
#1     1 TRUE               B12.123|C13.234.432|A11.123
#2     2 FALSE              C12.123|C13039|         
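
To see why splitting helps, it can be useful to look at the intermediate long table before it is summarised. The following is a minimal sketch on the question's example data; the `long` and `starts_with_c13` names are illustrative only and do not appear in the original answer.

```r
library(tidyr)

# Example data from the question
test <- data.frame(
  id = c(1, 2),
  treecode = c("B12.123|C13.234.432|A11.123", "C12.123|C13039|"),
  stringsAsFactors = FALSE
)

# One row per individual code; sep is a regex, so the pipe is escaped
long <- separate_rows(test, treecode, sep = "\\|")

# startsWith() compares literal text, so the dot here is a plain dot
long$starts_with_c13 <- startsWith(long$treecode, "C13.")

print(long)
```

Only the row holding "C13.234.432" is flagged, because each code is now tested on its own rather than as part of the combined string.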

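This assumes that codes matching "C13" without a dot may be present. If "C13" in a treecode is always followed by a dot, it is enough to escape the dot in the regular expression.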
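
The original attempt flagged id 2 because an unescaped `.` in a regular expression matches any character, so "C13." also matches the "C130" in "C13039". Below is a hedged sketch of a few ways to require a literal period with stringr. The anchored pattern in the last call is an assumption of mine (that codes are delimited only by "|"), not something stated in the question.

```r
library(stringr)

treecode <- c("B12.123|C13.234.432|A11.123", "C12.123|C13039|")

# Unescaped: "." matches any character, so "C130" in "C13039" matches too
unescaped <- str_detect(treecode, "C13.")

# Escaped: "\\." matches only a literal period
escaped <- str_detect(treecode, "C13\\.")

# fixed() treats the whole pattern as literal text, so no escaping is needed
literal <- str_detect(treecode, fixed("C13."))

# Assumption: codes are separated only by "|". This variant additionally
# requires "C13." to sit at the start of a code, not in the middle of one.
anchored <- str_detect(treecode, "(^|\\|)C13\\.")

print(data.frame(treecode, unescaped, escaped, literal, anchored))
```

The escaped and `fixed()` versions still match "C13." anywhere inside a code, which is why the anchored pattern or the row-splitting approach is safer if that can occur in the data.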


0
votes

Base R solution:

# Split each treecode string on the "|" delimiter:

split_treecode <- strsplit(df$treecode, "[|]")

# Repeat each id once per code it carries, giving one row per (id, code):

rolled_out_df <- data.frame(
  id = rep(df$id, lengths(split_treecode)),
  tc = unlist(split_treecode)
)

# Test whether each code contains the literal text "C13."
# (fixed = TRUE turns off regex matching, so the dot is a plain dot):

rolled_out_df$contains_c13_xxx <- grepl("C13.", rolled_out_df$tc, fixed = TRUE)

# Does the id have at least one code containing "C13."?

rolled_out_df$contains_c13_xxx <- ifelse(
  ave(rolled_out_df$contains_c13_xxx, rolled_out_df$id, FUN = any),
  "yes", "no"
)

# Rebuild the original df with the new flag:

df <- merge(df[, c("id", "treecode")],
            unique(rolled_out_df[, c("id", "contains_c13_xxx")]),
            by = "id")

Data:

df <- 
  structure(
  list(
    id = c(1, 2),
    treecode = c("B12.123|C13.234.432|A11.123",
                 "C12.123|C13039|"),
    contains_c13_xxx = c("yes", "no")
  ),
  row.names = c(NA,-2L),
  class = "data.frame"
)
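
The split, flag, and aggregate steps of the base R approach can also be collapsed into one row-wise helper, which avoids the merge. This is only a sketch: `has_c13_code` is a hypothetical function name of mine, and it uses `startsWith()` instead of `grepl()`, so it checks that a code begins with the prefix and not merely that it contains it.

```r
# Same example data as in the question
df <- data.frame(
  id = c(1, 2),
  treecode = c("B12.123|C13.234.432|A11.123", "C12.123|C13039|"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when any "|"-delimited code starts with `prefix`
has_c13_code <- function(x, prefix = "C13.") {
  # fixed = TRUE splits on a literal pipe, with no regex interpretation
  codes <- strsplit(x, "|", fixed = TRUE)
  vapply(codes, function(tc) any(startsWith(tc, prefix)), logical(1))
}

df$contains_c13_xxx <- ifelse(has_c13_code(df$treecode), "yes", "no")

print(df)
```

Because the helper returns one logical value per input row, the original row order and any other columns are left untouched.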