How do I split two words joined by a dot in R?

Question

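I have a large data frame of news articles. I noticed that in some articles two words are joined by a dot, as in this example: "The government.said it was important to quit." I am going to do some topic modeling, so I need each word separated.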

This is the code I am using to separate these words:

    #String example
    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

    #Code to separate the words
    test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))

    #This is what I get
    > test
    [1] "i need to separate the words connected by dots  however, I need to keep having the dots separating sentences"

As you can see, this removes every dot (period) from the text. How can I get the following result instead?

"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"

Final note

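My data frame contains 17,000 articles; all text is lowercase. I have only given a small example of the problem I run into when trying to separate two words joined by a dot. Also, is there a way to use strsplit on a list?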

Answer 1

Make each match consist of a run of non-dot characters followed by a dot.

    library(stringr)
    str_extract_all(test, pattern = "[^.]*\\.")

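You can add more punctuation marks to the negated character class, then repeat the same set as the positive match that ends each piece.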

    str_extract_all(test, pattern = "[^\\.\\?\\!]*[\\.\\?\\!]")
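
The extraction patterns above return sentence-sized pieces rather than the single string shown in the question. As a minimal sketch of another route (not part of the original answer), base R's gsub with perl = TRUE can replace a dot with a space only when a letter sits directly on both sides of it, so sentence-ending dots are left alone. The names joined_dot, fixed, articles and fixed_list are hypothetical, and the pattern assumes that abbreviations such as "u.s." and URLs are not a concern, since those would be split as well.

```r
# Example string from the question
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

# Hypothetical pattern: a dot with a letter immediately before and after it.
# Lookbehind/lookahead keep the letters out of the match, so only the dot
# is replaced. Digits are excluded on purpose so "17.000" is not split.
joined_dot <- "(?<=[[:alpha:]])\\.(?=[[:alpha:]])"

# Replace only the joining dots with a space; a dot followed by a space
# or by the end of the string is not matched.
fixed <- gsub(joined_dot, " ", test, perl = TRUE)
print(fixed)

# For a list of texts, lapply applies the same replacement element by element.
articles <- list(
  "i need.to separate the words.",
  "the government.said it was important to quit."
)
fixed_list <- lapply(articles, function(x) gsub(joined_dot, " ", x, perl = TRUE))
print(fixed_list)
```

Because gsub is vectorized over character vectors, the same call could likely be applied directly to a text column of a data frame, for example df$text <- gsub(joined_dot, " ", df$text, perl = TRUE), where df and text are placeholder names.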
