R str_count: counting occurrences of words within a given distance of each other, e.g. 1 to 30 words apart


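In a text document, I want to count the cases where uncertainty|unclear occurs within 1 to 30 words of global|decrease in demand|fall in demand. However, my code, shown below, appears to be insensitive to the {1,30}: changing these values does not change the output. Any help would be appreciated.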

str_count(texttw,"\\buncertainty|unclear(?:\\W+\\w+){1,30} ?\\W+global|decrease in demand|fall in demand\\b")
r text nlp stringr
1 Answer

I'm not sure whether the misspelling of "uncertainty" in your text was intentional, so I corrected it, but try something like this:

library(stringr)

x <- "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand. When the economic environment is fraught with uncertainty and the future is unclear businesses and firms may hold back their decisions until uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."

regex <- "(uncertainty|unclear)\\s(\\w+\\s+){1,30}(global|decrease in demand|fall in demand)"

str_count(x, regex)
# [1] 2

str_extract_all(x, regex)
# [[1]]
# [1] "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand"
# [2] "unclear with unprecedented uncertainty leading to fall in demand"    
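A likely reason the pattern in the question ignores {1,30} is operator precedence: without parentheses, each top-level | splits the whole pattern into independent alternatives, so a bare "uncertainty" or "fall in demand" already counts as a match. The sketch below illustrates this on small inputs; `build_regex` and `y` are hypothetical names introduced only for this illustration.

```r
library(stringr)

# The pattern from the question, unchanged. Its top-level "|" yields four
# separate alternatives; only the second one contains the {1,30} quantifier.
ungrouped <- "\\buncertainty|unclear(?:\\W+\\w+){1,30} ?\\W+global|decrease in demand|fall in demand\\b"

# A keyword on its own is already a match, whatever the quantifier says.
str_detect("uncertainty", ungrouped)     # TRUE
str_detect("fall in demand", ungrouped)  # TRUE

# Hypothetical helper: the grouped pattern with an adjustable upper bound.
build_regex <- function(max_gap) {
  paste0("(uncertainty|unclear)\\s(\\w+\\s+){1,", max_gap,
         "}(global|decrease in demand|fall in demand)")
}

y <- "unclear with unprecedented uncertainty leading to fall in demand."

# With grouping, a lone keyword no longer matches and the bound matters.
str_detect("uncertainty", build_regex(30))  # FALSE
str_count(y, build_regex(30))               # 1
str_count(y, build_regex(1))                # 0
```

With the alternations wrapped in groups, both a trigger word and a target word are required, and narrowing the gap changes the count.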

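In the text you posted, there is an interesting case in the last sentence: "Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."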

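Depending on your interpretation, this could actually be two matches:

  1. unclear with unprecedented uncertainty leading to fall in demand
  2. uncertainty leading to fall in demand

If that is your interpretation, then technically the text you posted should have three matches rather than two.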
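If overlapping matches should be counted, one possible approach (a sketch, not part of the original answer) is to wrap the pattern in a lookahead so that a match consumes no characters and every starting position is tested. `last_sentence` and `overlapping` are names made up for this example.

```r
library(stringr)

last_sentence <- "Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."

regex <- "(uncertainty|unclear)\\s(\\w+\\s+){1,30}(global|decrease in demand|fall in demand)"

# Ordinary matching consumes the text, so the inner "uncertainty" is
# swallowed by the match that starts at "unclear".
str_count(last_sentence, regex)        # 1

# A zero-width lookahead tests each position without consuming anything,
# so both "unclear ..." and "uncertainty ..." are counted.
overlapping <- paste0("(?=", regex, ")")
str_count(last_sentence, overlapping)  # 2
```

Whether overlapping spans should count separately depends on what is being measured, so this is a design choice rather than a correction.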

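Note:

"uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand." does not match because of the period after "subsides": the pattern only allows whitespace between words.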
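If matches are meant to span sentence boundaries, a possible variant (an assumption about intent, not something the answer recommends) replaces \\s with \\W so that punctuation may appear between words. The names `strict`, `loose`, and `y` are illustrative only.

```r
library(stringr)

y <- "until uncertainty subsides. Ever since the start of the pandemic global economic outlook"

# Pattern from the answer: only whitespace may separate words.
strict <- "(uncertainty|unclear)\\s(\\w+\\s+){1,30}(global|decrease in demand|fall in demand)"

# Variant: any run of non-word characters (spaces, periods, commas) may
# separate words, so the gap can cross the period after "subsides".
loose <- "(uncertainty|unclear)\\W+(\\w+\\W+){1,30}(global|decrease in demand|fall in demand)"

str_detect(y, strict)  # FALSE
str_detect(y, loose)   # TRUE
```

The looser pattern will also pair words from unrelated sentences, so it is worth checking the extracted spans with `str_extract_all` before relying on the counts.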
