Separating two words joined by a dot


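I have a large data frame of news articles. I noticed that in some articles two words are joined by a dot, as in this example: The government.said it was important to quit. I am going to do some topic modelling, so I need every word to be separated.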

This is the code I am using to separate these words:

    #String example
    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

    #Code to separate the words
    test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))

    #This is what I get
    > test
    [1] "i need to separate the words connected by dots  however, I need to keep having the dots separating sentences"

As you can see, this removes every dot (period) from the text. How can I get the following result instead?

"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"

Final note

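My data frame consists of 17,000 articles; all of the text is lowercase. I have only provided a small example of the problem I run into when trying to separate two words joined by a dot. Also, is there any way to use strsplit on a list?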

r regex string strsplit
1 answer

You can use

    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
    # Replace each dot that is in between word characters
    gsub("\\b\\.\\b", " ", test, perl=TRUE)
    # Replace each dot that is in between letters
    gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
    # Replace each dot that is in between word characters, but not in URLs
    gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)

See the R demo online.

Output:

    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
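
The question also mentions a data frame and asks about working on a list. Since gsub() is vectorized over its input, there is no need to go through strsplit() at all. Here is a minimal sketch, assuming a hypothetical data frame `articles` with a character column named `text` (both names are made up for illustration; adapt them to the real data):

```r
# Hypothetical stand-in for the real data frame; `articles` and `text` are assumed names
articles <- data.frame(
  text = c("the government.said it was important to quit.",
           "i need.to separate the words. however, i need.to keep the dots."),
  stringsAsFactors = FALSE
)

# Dot with a letter directly on both sides (second pattern from the answer)
pat <- "(?<=\\p{L})\\.(?=\\p{L})"

# gsub() handles the whole character column in one call
articles$text <- gsub(pat, " ", articles$text, perl = TRUE)
print(articles$text)

# For a list of character vectors, apply the same replacement to each element
article_list <- list(a = "one.two three.", b = c("four.five", "six. seven"))
cleaned_list <- lapply(article_list, function(x) gsub(pat, " ", x, perl = TRUE))
print(cleaned_list)
```

The dots that end sentences stay in place because they are followed by a space or by the end of the string rather than by a letter.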

Details

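  • \b\.\b matches a dot enclosed by word boundaries, i.e. the characters immediately before and after the . must both be word characters (letters, digits or underscores).
  • (?<=\p{L})\.(?=\p{L}) matches a dot that is immediately preceded and immediately followed by a letter ((?<=\p{L}) is a positive lookbehind and (?=\p{L}) is a positive lookahead).
  • (?:ht|f)tps?://\S*(*SKIP)(*F)|\b\.\b matches http/ftp or https/ftps, then ://, then zero or more non-whitespace characters, then discards that match and resumes the search from the position where the (*SKIP) PCRE verb was reached; everywhere else, the \b\.\b alternative matches a dot between word characters.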
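
The practical difference between the first two patterns shows up with digits, which may matter if the articles contain decimal numbers. A small sketch on a made-up sentence (the string `x` is only an illustration, not taken from the question's data):

```r
# Made-up example containing both a decimal number and two words joined by a dot
x <- "pi is about 3.14 but the government.said otherwise."

# Word-boundary version: digits are word characters, so the decimal point is replaced too
by_boundary <- gsub("\\b\\.\\b", " ", x, perl = TRUE)

# Letter-only lookarounds: the decimal point is left alone
by_letters <- gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", x, perl = TRUE)

print(by_boundary)
print(by_letters)
```

If numbers such as 3.14 should survive intact, the lookaround pattern is probably the safer choice of the two.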