I have a data frame with the following columns:
User df_text
A Hi, how are you ?
B This is beautiful!
C Hello guys
D Originally posted by A Hi, how are you? I am doing good
E Whats going on ?
F Originally posted by B I am doing good Welcome
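I want to remove text in the df_text column that partially matches the text of another row. For example, in the data above, user D replied to user A, which is why D's row contains the "Originally posted by ..." string. I need to keep user D's actual text and remove every "Originally posted by" string together with the quoted user and the quoted text.
I can't figure out how to do this. I tried the following code: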
df_text[!df_text %in% grep(paste0(df_text, collapse = "|"), df_text, value = T)]
What I expect to get is:
User df_text
A Hi, how are you ?
B This is beautiful!
C Hello guys
D I am doing good
E Whats going on ?
F Welcome
Is it possible to get the result above?
Thanks in advance!
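You can actually test for text that has already been written by using a backreference, and then reset the match just before the part you need to remove. See this regular expression:
(?:[A-Z] {4})?(.+?$)\n[\s\S]*?\KOriginally posted by [A-Z] \1
- (?:[A-Z] {4})? - matches the first part of the line (the user letter and the spaces after it, e.g. "A    ").
- (.+?$) - a capture group that will be referenced as \1; this is the text sent by A.
- \n[\s\S]*? - the line break, and then keeps consuming text until it finds Originally po...
- \K - resets the start of the match, discarding everything consumed so far, so that when you .replace() you don't delete anything important.
- Originally posted by [A-Z] - the text that introduces the quote of A's message.
- \1 - the text sent by A, so you can remove it from B's message.
- And of course there is one trailing space to remove as well (so the final text doesn't get messed up).
I don't know how to convert this algorithm to R, but here it is anyway: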
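// JavaScript regexes have no \K, so the text before the quote is captured as $1 and put back.
// The m flag lets $ match at the end of every line, not only at the end of the string.
var rgx = /((?:[A-Z] {4})?(.+?)$\n[\s\S]*?)Originally posted by [A-Z] \2 /m;
while (rgx.test(str))
    str = str.replace(rgx, "$1");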
In R, pass perl=TRUE to sub/gsub so that PCRE features such as \K are available.
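For what it's worth, here is a rough sketch of how the regex idea above might be carried over to R. It is not the answerer's code. It assumes the messages contain no line breaks, that a quote repeats the earlier message character for character (I wrote the quoted "Hi, how are you ?" with the same spacing as row A), and that only the df_text column is joined, so the optional user-letter prefix is dropped and a ^ anchor is used instead. Like the original pattern, it does not check that the quoted user letter is the real author.

```r
# Sample data modelled on the question (quote spacing made consistent: an assumption)
df <- data.frame(
  User = c("A", "B", "C", "D", "E", "F"),
  df_text = c("Hi, how are you ?",
              "This is beautiful!",
              "Hello guys",
              "Originally posted by A Hi, how are you ? I am doing good",
              "Whats going on ?",
              "Originally posted by B I am doing good Welcome"),
  stringsAsFactors = FALSE
)

# One message per line, so the backreference can look back at earlier rows
txt <- paste(df$df_text, collapse = "\n")

# (?m) lets ^ and $ match at line breaks; \K (PCRE only) discards what was matched so far,
# so only "Originally posted by X <earlier text> " is replaced
rgx <- "(?m)^(.+)$\\n[\\s\\S]*?\\KOriginally posted by [A-Z] \\1 "

# Remove one quote per pass until none is left
while (grepl(rgx, txt, perl = TRUE)) {
  txt <- sub(rgx, "", txt, perl = TRUE)
}

# Back to one message per row (assumes no message ended up empty)
df$df_text <- strsplit(txt, "\n", fixed = TRUE)[[1]]
```

Because each pass rescans the whole string, this will probably be slow on a large data frame.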
User df_text
A Hi, how are you ?
B This is beautiful!
C Heuwi
D Originally posted by C Heuwi Hellou
E Hello guys
FOriginally posted by A Hi, how are you ?I am doing good
G Whats going on ?
H Test2
IOriginally posted by B I am doing goodWelcome
J Originally posted by C Test2 Hellou
User df_text
A Hi, how are you ?
B This is beautiful!
C Heuwi
DOriginally posted by C HeuwiHellou
E Hello guys
F I am doing good
G Whats going on ?
H Test2
I Welcome
JOriginally posted by C Test2Hellou
User df_text
A Hi, how are you ?
B This is beautiful!
C Heuwi
D Hellou
E Hello guys
F I am doing good
G Whats going on ?
H Test2
I Welcome
J Hellou
You can use gsub to replace a text/pattern with, for example, "":
df$df_text <- gsub("Originally posted by ","",df$df_text)
where df is your data frame, with the columns User and df_text.
To remove more than that, you can use a loop:
df
This should give you the desired result, but it is slow.
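The loop itself is not shown above, so the following is only my guess at what a loop-based version could look like, not the answerer's code. strip_quotes is a made-up helper name. It assumes each quote sits at the very start of a message, repeats an earlier row's text exactly, and is followed by a single space. The quoted user letter is ignored and only the quoted text is compared, because in the question's data row F names B but quotes D's reply.

```r
# Hypothetical helper: strips a leading "Originally posted by X <earlier text> "
# from each message, comparing against the (already cleaned) earlier rows.
strip_quotes <- function(texts) {
  marker <- "^Originally posted by [A-Z] "
  for (i in seq_along(texts)) {
    if (!grepl(marker, texts[i])) next      # no quote in this row
    rest <- sub(marker, "", texts[i])       # quoted text followed by the reply
    for (j in seq_len(i - 1)) {             # earlier rows only
      quoted <- paste0(texts[j], " ")
      if (startsWith(rest, quoted)) {       # literal comparison, no regex escaping needed
        texts[i] <- substring(rest, nchar(quoted) + 1)
        break
      }
    }
  }
  texts
}

# Sample data modelled on the question (quote spacing made consistent: an assumption)
df <- data.frame(
  User = c("A", "B", "C", "D", "E", "F"),
  df_text = c("Hi, how are you ?",
              "This is beautiful!",
              "Hello guys",
              "Originally posted by A Hi, how are you ? I am doing good",
              "Whats going on ?",
              "Originally posted by B I am doing good Welcome"),
  stringsAsFactors = FALSE
)

df$df_text <- strip_quotes(df$df_text)
```

Every quoting row is compared against all rows before it, so the cost grows roughly quadratically with the number of rows, which fits the "slow" warning.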