Regex to match sentences with adjacent and non-adjacent word repetition in R

Question

I have a data frame of sentences; in some sentences, words are used more than once:

df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
                          "it 's like being in a play-group , in n it ?",
                          "oh is that that steak i got the other night ?",
                          "well where have the middle sized soda stream bottle gone ?",
                          "this is a half day , right ? needs a full day",
                          "yourself , everybody 'd be changing your hair in n it ?",
                          "cos he finishes at four o'clock on that day anyway .",
                          "no no no i 'm dave and you 're alan .",
                          "yeah , i mean the the film was quite long though",
                          "it had steve martin in it , it 's a comedy",
                          "oh it is a dreary old day in n it ?",
                          "no it 's not mother theresa , it 's saint theresa .",
                          "oh have you seen that face lift job he wants ?",
                          "yeah bolshoi 's right so which one is it then ?"))

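I want to match those sentences in which one or more words are repeated once or more times.

Edit 1:

The repeated words **can** be adjacent, but they need not be. That is why Regular Expression For Consecutive Duplicate Words does not answer my question.

I have been modestly successful with this code: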

df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?      
[2] it 's like being in a play-group , in n it ?           
[3] oh is that that steak i got the other night ?          
[4] this is a half day , right ? needs a full day          
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .                  
[7] yeah , i mean the the film was quite long though       
[8] it had steve martin in it , it 's a comedy             
[9] oh it is a dreary old day in n it ?

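The success is only modest because some sentences are matched that should not be, e.g. yourself , everybody 'd be changing your hair in n it ?, while others that should be matched are not, e.g. no it 's not mother theresa , it 's saint theresa .. How can the code be improved to produce exact matches?

Expected result: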

df
                                                         Turn
2                it 's like being in a play-group , in n it ?
3               oh is that that steak i got the other night ?
5               this is a half day , right ? needs a full day
8                       no no no i 'm dave and you 're alan .
9            yeah , i mean the the film was quite long though
10                 it had steve martin in it , it 's a comedy
11                        oh it is a dreary old day in n it ?
12        no it 's not mother theresa , it 's saint theresa .

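Edit 2:

Another issue is how to define the exact number of repeated words. The imperfect regex above matches words that are repeated at least once. If I change the quantifier to {2}, and thus look for three occurrences of a word, I get this code and this result: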

df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan .         # "no" occurs 3 times

But again the match is imperfect, as the expected result would be:

[1] no no no i 'm dave and you 're alan .          # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy     # "it" occurs 3 times

Help is much appreciated!

Answer 1

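Options for defining the exact number of repeated words.

Extract sentences in which the same word occurs 3 times

Change the regex to
(\s?\b\w+\b\s)(.*\1){2}
(\s?\b\w+\b\s) is captured by group 1
- \s?: a whitespace character occurs zero or one time.
- \b\w+\b: a whole word (one or more word characters between word boundaries).
- \s: a whitespace character occurs once.
(.*\1) is captured by group 2
- (.*\1): any characters, zero or more times, before the text of group 1 matches again.
- (.*\1){2}: group 2 matches twice.

Code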

df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
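
A possible variant, sketched here and not part of the original answer: the backreference \1 above is not anchored by word boundaries, so if the quantifier is relaxed to {1,} it can also match inside a longer word (for example "at " inside "that "). Assuming a "word" is a run of \w characters, putting \b around both the capture group and the backreference seems to reproduce the expected results from the question. Note that both patterns mean "at least" that many occurrences, not "exactly". The vector name turns is made up for this example.

```r
# Sample sentences copied from the question.
turns <- c(
  "well this is what the grumble about do n't they ?",
  "it 's like being in a play-group , in n it ?",
  "oh is that that steak i got the other night ?",
  "well where have the middle sized soda stream bottle gone ?",
  "this is a half day , right ? needs a full day",
  "yourself , everybody 'd be changing your hair in n it ?",
  "cos he finishes at four o'clock on that day anyway .",
  "no no no i 'm dave and you 're alan .",
  "yeah , i mean the the film was quite long though",
  "it had steve martin in it , it 's a comedy",
  "oh it is a dreary old day in n it ?",
  "no it 's not mother theresa , it 's saint theresa .",
  "oh have you seen that face lift job he wants ?",
  "yeah bolshoi 's right so which one is it then ?"
)

# A whole word that occurs again later as a whole word
# (adjacent or not): at least two occurrences.
repeated_once <- grepl("\\b(\\w+)\\b.*\\b\\1\\b", turns, perl = TRUE)

# The same whole word occurring at least three times.
repeated_twice <- grepl("\\b(\\w+)\\b(?:.*\\b\\1\\b){2}", turns, perl = TRUE)

which(repeated_once)
# [1]  2  3  5  8  9 10 11 12
which(repeated_twice)
# [1]  8 10
```

Because \w does not include the apostrophe, tokens such as 's are treated as the word s here; whether that is acceptable depends on the data.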

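Use strsplit(split="\\s") to split the sentences into words.
- Use sapply and table to count the occurrences of each word in each list element, then select the sentences that satisfy the requirement.

Code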

library(magrittr)
df$Turn %<>% as.character()
s<-strsplit(df$Turn,"\\s") %>% sapply(.,function(i)table(i) %>% .[.==3])
df$Turn[which(s!=0)]
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
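
The same counting idea can also be sketched in base R only. This is an illustration under assumptions, not the answerer's code: has_word_count is a hypothetical helper name, "words" are whatever is separated by whitespace (so punctuation tokens such as "," count as words too), and the exact switch chooses between "exactly n" and "at least n" occurrences.

```r
# A few sentences from the question, plus one without repetition.
turns <- c(
  "no no no i 'm dave and you 're alan .",
  "it had steve martin in it , it 's a comedy",
  "no it 's not mother theresa , it 's saint theresa .",
  "well where have the middle sized soda stream bottle gone ?"
)

# Hypothetical helper: TRUE for each sentence in which some
# whitespace-separated token occurs exactly n times (exact = TRUE)
# or at least n times (exact = FALSE).
has_word_count <- function(x, n, exact = TRUE) {
  vapply(strsplit(x, "\\s+"), function(words) {
    counts <- table(words)
    if (exact) any(counts == n) else any(counts >= n)
  }, logical(1))
}

turns[has_word_count(turns, 3)]
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
```

Using vapply with logical(1) gives one TRUE or FALSE per sentence, so the result can index the original vector or data frame directly.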

Hope this helps :)

Answer 2

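I would take a different approach to this task. First, I added a group variable to the original data frame. Then, I counted how many times each word appears in each sentence and created a data frame, mytemp.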

library(tidyverse)

mutate(df, id = 1:n()) -> df

mutate(df, id = 1:n()) %>% 
mutate(word = strsplit(x = Turn, split = " ")) %>% 
unnest(word) %>% 
count(id, word, name = "frequency", sort = TRUE) -> mytemp

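Using this data frame, it is straightforward to identify the sentences. I subsetted the data and obtained the ids of the sentences in which a word appears three times. In the same way, I identified the words that appear more than once and obtained their ids. Finally, I subsetted the original data using the id numbers in three and twice.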

# Search words that appear 3 times 

three <- filter(mytemp, frequency == 3) %>% 
         pull(id) %>% 
         unique()

# Search words that appear more than once.

twice <- filter(mytemp, frequency > 1) %>% 
         pull(id) %>% 
         unique()

# Go back to the original data and handle subsetting
filter(df, id %in% three)

  Turn                                          id
  <chr>                                      <int>
1 no no no i 'm dave and you 're alan .          8
2 it had steve martin in it , it 's a comedy    10

filter(df, id %in% twice)

  Turn                                                   id
  <chr>                                               <int>
1 it 's like being in a play-group , in n it ?            2
2 oh is that that steak i got the other night ?           3
3 this is a half day , right ? needs a full day           5
4 no no no i 'm dave and you 're alan .                   8
5 yeah , i mean the the film was quite long though        9
6 it had steve martin in it , it 's a comedy             10
7 oh it is a dreary old day in n it ?                    11
8 no it 's not mother theresa , it 's saint theresa .    12
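
If the threshold changes often, the steps above could be folded into one function. The following is only a sketch under assumptions: filter_repeated and min_times are hypothetical names, words are split on single spaces as in the answer, and it keeps sentences in which some word occurs at least min_times times (swap >= for == to require an exact count). It uses tidyr::separate_rows in place of strsplit plus unnest.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: keep the rows of `data` in which at least one
# space-separated token of `Turn` occurs `min_times` times or more.
filter_repeated <- function(data, min_times = 2) {
  with_id <- mutate(data, id = row_number())

  hits <- with_id %>%
    separate_rows(Turn, sep = " ") %>%       # one row per token
    count(id, Turn, name = "frequency") %>%  # token counts per sentence
    filter(frequency >= min_times) %>%
    distinct(id)

  # Keep only the sentences whose id has a qualifying token.
  semi_join(with_id, hits, by = "id")
}

demo <- data.frame(
  Turn = c("no no no i 'm dave and you 're alan .",
           "oh is that that steak i got the other night ?",
           "well where have the middle sized soda stream bottle gone ?"),
  stringsAsFactors = FALSE
)

filter_repeated(demo, min_times = 3)$id
# [1] 1
filter_repeated(demo, min_times = 2)$id
# [1] 1 2
```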
