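Consider a data frame df with two columns, word and stem. I want to create a new column that checks whether word contains the value in stem, and whether that value is preceded or followed by other characters. The final result should look like this: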
WORD STEM NEW
rerun run prefixed
runner run suffixed
run run none
... ... ...
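Below is my code. It does not work, however, because the grepl pattern is not matched row by row: grepl uses only the first element of a pattern vector and applies it to every row of df. Still, I think it should make my idea clearer.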
df$new <- ifelse(grepl(paste0('.+', df$stem, '.+'), df$word), 'both',
ifelse(grepl(paste0(df$stem, '.+'), df$word), 'suffixed',
ifelse(grepl(paste0('.+', df$stem), df$word), 'prefixed','none')))
You can use startsWith and endsWith to index into a vector of labels, for example:
c("none", "suffixed", "prefixed", "both")[1 + startsWith(x$WORD, x$STEM) +
2*endsWith(x$WORD, x$STEM)]
#[1] "prefixed" "suffixed" "both"
Or, if none should be returned when WORD and STEM are equal:
c("none", "suffixed", "prefixed", "both")[1 + (startsWith(x$WORD, x$STEM) +
2*endsWith(x$WORD, x$STEM)) * !(x$WORD == x$STEM)]
#[1] "prefixed" "suffixed" "none"
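To make the index arithmetic above concrete, here is a minimal, self-contained sketch. The data frame x is an assumption rebuilt from the example rows in the question, and the names labels and code are hypothetical helpers. startsWith contributes 1 and endsWith contributes 2, so the sum (0 to 3) plus 1 selects one of the four labels.

```r
# Example data rebuilt from the question (an assumption, not the asker's real data)
x <- data.frame(WORD = c("rerun", "runner", "run"),
                STEM = c("run", "run", "run"),
                stringsAsFactors = FALSE)

labels <- c("none", "suffixed", "prefixed", "both")

# Add 1 if WORD starts with STEM (something follows it),
# add 2 if WORD ends with STEM (something precedes it)
code <- startsWith(x$WORD, x$STEM) + 2 * endsWith(x$WORD, x$STEM)

# Reset the code to 0 when WORD and STEM are identical, so the label is "none"
code <- code * (x$WORD != x$STEM)

x$NEW <- labels[1 + code]
print(x)
```

Note that a word such as reruns, where the stem sits in the middle, neither starts nor ends with the stem, so this approach labels it none.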
You can create the new column like this:
df$new <- ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'both',
ifelse(startsWith(df$word, df$stem), 'suffixed',
ifelse(endsWith(df$word, df$stem), 'prefixed',
'none')))
Or, if you are in a dplyr pipeline and want to avoid all the annoying df$ prefixes:
df %>%
mutate(new = ifelse(startsWith(word, stem) & endsWith(word, stem), 'both',
ifelse(startsWith(word, stem), 'suffixed',
ifelse(endsWith(word, stem), 'prefixed',
'none'))))
Output
# word stem new
# 1 rerun run prefixed
# 2 runner run suffixed
# 3 run run both
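The output above labels the row run/run as both, whereas the question asks for none when word and stem are identical. One possible way to get that result is to test for equality before the other conditions. The following is a minimal base R sketch; the data frame is reconstructed from the question for illustration only.

```r
# Example data rebuilt from the question (an assumption)
df <- data.frame(word = c("rerun", "runner", "run"),
                 stem = c("run", "run", "run"),
                 stringsAsFactors = FALSE)

# Check equality first, then fall through to the prefix/suffix tests
df$new <- ifelse(df$word == df$stem, "none",
          ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), "both",
          ifelse(startsWith(df$word, df$stem), "suffixed",
          ifelse(endsWith(df$word, df$stem), "prefixed", "none"))))
print(df)
```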
Here is an approach using str_locate from stringr together with dplyr:
library(dplyr)
library(stringr)
data %>%
mutate_at(vars(WORD,STEM), as.character) %>%
mutate(NEW =
case_when(str_locate(WORD,STEM)[,"start"] > 1 &
str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "both",
str_locate(WORD,STEM)[,"start"] > 1 ~ "prefixed",
str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "suffixed",
TRUE ~ "none"))
WORD STEM NEW
1 rerun run prefixed
2 runner run suffixed
3 run run none
I added a line that converts WORD and STEM to character, in case they are factors.
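The same conversion matters for the base R answers, since startsWith and endsWith expect character vectors. A minimal sketch without dplyr, assuming hypothetical data whose columns were read in as factors (the default for data.frame before R 4.0):

```r
# Hypothetical data with factor columns
data <- data.frame(WORD = c("rerun", "runner", "run"),
                   STEM = c("run", "run", "run"),
                   stringsAsFactors = TRUE)

# Convert only the factor columns to character
is_fct <- vapply(data, is.factor, logical(1))
data[is_fct] <- lapply(data[is_fct], as.character)

# startsWith() can now be called on the converted columns
starts <- startsWith(data$WORD, data$STEM)
print(starts)
```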