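I am working with a dataset in which I need to flag every code that starts with "C13.xxx". The column also holds other tree codes, delimited like this: "C13.xxx|B12.xxx". Every tree code contains a period. However, the column also contains other values, and these cause my stringr function to flag strings that are not tree codes. For example: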
library(tidyverse)
# test data
test <- tribble(
~id, ~treecode, ~contains_c13_xxx,
#--|--|----
1, "B12.123|C13.234.432|A11.123", "yes",
2, "C12.123|C13039|", "no"
)
# what I tried
test %>% mutate(contains_C13_error = ifelse(str_detect(treecode, "C13."), 1, 0))
# code above is flagging both id's as containing C13.xxx
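In id 2 there is a value that starts with C13, but it is not a tree code (all tree codes contain a period). The contains_c13_xxx column is the result I want to produce. I included the period in the str_detect pattern, so I am not sure what is going wrong here.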
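The tricky part is that the same column holds multiple tree codes separated by a delimiter, which makes flagging difficult. We can put each treecode on its own row and then check for the code we need, using separate_rows from tidyr.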
library(dplyr)
test %>%
  tidyr::separate_rows(treecode, sep = "\\|") %>%
  group_by(id) %>%
  summarise(contains_C13_error = any(startsWith(treecode, "C13.")),
            treecode = paste(treecode, collapse = "|"))
# A tibble: 2 x 3
# id contains_C13_error treecode
# <dbl> <lgl> <chr>
#1 1 TRUE B12.123|C13.234.432|A11.123
#2 2 FALSE C12.123|C13039|
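This is the safe choice if "C13." could also show up somewhere other than the start of a code. If "C13" followed by a dot only ever occurs at the start of a tree code, you just need to escape the dot in the regular expression: an unescaped . matches any single character, which is why "C13." also matched "C13039".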
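To illustrate the escaping point, here is a minimal sketch using the two strings from the question. The variable names and the anchored pattern in the last call are my own additions for illustration, not part of the original answer.

```r
library(stringr)

treecode <- c("B12.123|C13.234.432|A11.123", "C12.123|C13039|")

# Unescaped "." matches any single character, so "C130" in the second string matches
unescaped <- str_detect(treecode, "C13.")

# Escaped dot matches only a literal period
escaped <- str_detect(treecode, "C13\\.")

# fixed() treats the whole pattern as a literal string, no regex involved
literal <- str_detect(treecode, fixed("C13."))

# Anchored variant: "C13." must start the string or directly follow a "|" delimiter
anchored <- str_detect(treecode, "(^|\\|)C13\\.")
```

The anchored variant is only needed if "C13." might appear in the middle of some other code; otherwise the escaped or fixed() form should be enough.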
Base R solution:
# Split each treecode string on the "|" delimiter:
split_treecode <- strsplit(df$treecode, "[|]")
# Repeat each id once per tree code found in its row:
rolled_out_df <- data.frame(id = rep(df$id, sapply(split_treecode, length)), tc = unlist(split_treecode))
# Test whether each code contains the literal string "C13." (fixed = TRUE disables regex):
rolled_out_df$contains_c13_xxx <- grepl("C13.", rolled_out_df$tc, fixed = TRUE)
# Does the id have at least one element containing "C13."?
rolled_out_df$contains_c13_xxx <- ifelse(ave(rolled_out_df$contains_c13_xxx,
                                             rolled_out_df$id,
                                             FUN = function(x){as.logical(sum(x))}), "yes", "no")
# Build back the original df:
df <- merge(df[,c("id", "treecode")], unique(rolled_out_df[,c("id", "contains_c13_xxx")]), by = "id")
Data:
df <-
structure(
list(
id = c(1, 2),
treecode = c("B12.123|C13.234.432|A11.123",
"C12.123|C13039|"),
contains_c13_xxx = c("yes", "no")
),
row.names = c(NA,-2L),
class = "data.frame"
)
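As a hedged alternative to the roll-out-and-merge steps in the base R answer, the same split-and-test idea can be sketched without building an intermediate data frame. The helper names `codes` and `has_c13` are made up for this example, and the data frame is redefined here only so that the block runs on its own.

```r
# Same two rows as the example data, without the target column
df <- data.frame(
  id = c(1, 2),
  treecode = c("B12.123|C13.234.432|A11.123", "C12.123|C13039|"),
  stringsAsFactors = FALSE
)

# Split each row's string on the literal "|" delimiter (one character vector per row)
codes <- strsplit(df$treecode, "|", fixed = TRUE)

# For each row, check whether any individual code starts with the literal "C13."
has_c13 <- vapply(codes, function(x) any(startsWith(x, "C13.")), logical(1))

# Store the flag in the yes/no form used in the question
df$contains_c13_xxx <- ifelse(has_c13, "yes", "no")
```

Because `startsWith` compares literal prefixes, no regex escaping is needed here, and a code such as "C13039" is not flagged.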