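I have the following data, where everything is a string:
Column A | Column B |
---|---|
"Do you like banana splits?" | "Do you like banana splits? Yes, I do" |
"I am John" | "I am John I am Steve" |
"I Voted for Biden" | "I Voted for Biden Why would you vote for Biden?" |
"apple" | "apple" |
How can I remove the duplicated part from column B, so that I am left with the following:
Column A | Column B |
---|---|
"Do you like banana splits?" | "Yes, I do" |
"I am John" | "I am Steve" |
"I Voted for Biden" | "Why would you vote for Biden?" |
"apple" | "" |
I can't seem to find anything that doesn't just replace every overlapping word (for example, turning column B from "I am John I am Steve" into "Steve"). I have tried using a for loop with gsub, but nothing seemed to change either.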
tib <- dplyr::tribble(
  ~col_a, ~col_b,
  "Do you like banana splits?", "Do you like banana splits? Yes, I do",
  "I am John", "I am John I am Steve",
  "I Voted for Biden", "I Voted for Biden Why would you vote for Biden?",
  "apple", "apple"
)
tib |>
  dplyr::mutate(
    # Escape regex metacharacters (e.g. "?") so col_a is matched literally
    esc_col_a = stringr::str_escape(col_a),
    # Remove the first occurrence of col_a from col_b, then tidy leftover whitespace
    new_col_b = stringr::str_squish(stringr::str_replace(
      string = col_b,
      pattern = esc_col_a,
      replacement = ""
    ))
  ) |>
  dplyr::select(-esc_col_a, -col_b)
#> # A tibble: 4 × 2
#>   col_a                      new_col_b
#>   <chr>                      <chr>
#> 1 Do you like banana splits? "Yes, I do"
#> 2 I am John                  "I am Steve"
#> 3 I Voted for Biden          "Why would you vote for Biden?"
#> 4 apple                      ""
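
For comparison, here is a minimal base R sketch of the same idea, in case pulling in stringr is not an option. It assumes the two columns are available as plain character vectors (`col_a` and `col_b` below are stand-ins for the real data), and `remove_literal()` is a hypothetical helper name, not a function from any package. It also shows why an unescaped pattern can misbehave: the `?` in the first string is read as a regex quantifier rather than a literal question mark.

```r
# Stand-in vectors mirroring the example data (assumption: plain character vectors)
col_a <- c("Do you like banana splits?", "I am John", "I Voted for Biden", "apple")
col_b <- c(
  "Do you like banana splits? Yes, I do",
  "I am John I am Steve",
  "I Voted for Biden Why would you vote for Biden?",
  "apple"
)

# Unescaped, "?" makes the preceding "s" optional, so the literal "?" is left behind
unescaped <- sub(col_a[1], "", col_b[1])
unescaped
#> [1] "? Yes, I do"

# Hypothetical helper: remove the first literal occurrence of each element of
# `a` from the matching element of `b`, then trim leading/trailing whitespace.
# sub() is not vectorised over `pattern`, hence the mapply() over both vectors.
remove_literal <- function(a, b) {
  out <- mapply(
    function(pat, x) sub(pat, "", x, fixed = TRUE),
    a, b,
    USE.NAMES = FALSE
  )
  trimws(out)
}

new_col_b <- remove_literal(col_a, col_b)
new_col_b
#> [1] "Yes, I do"                     "I am Steve"
#> [3] "Why would you vote for Biden?" ""
```

One difference from the stringr version: `trimws()` only strips leading and trailing whitespace, whereas `stringr::str_squish()` also collapses repeated internal spaces. For the example data the results are the same.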