Finding word co-occurrences in R

Question

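I ran topic modelling in R. It returned a number of topics, along with the probability that each word in my data frame belongs to a given topic. For example, topic 7 is 'religion/christianity', and the words below belong to that topic.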

"jesus" "mary" "christ" "margaret" ...

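My data frame is a list of Amazon reviews, and each cell in the "text" column is one review. I now want to run sentiment analysis on the words surrounding these exact words in my data frame, but I'm not sure what the best way to search for them is. Essentially, I'd like to search for these exact words (e.g. using grep()) and return the 10 words on either side of each one, wherever the search word occurs in the review. Can anyone help? Have I explained my problem clearly enough? Any help is much appreciated!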

I've tried running grep(), but I don't know how to include the words surrounding the match.

Answer 1

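One approach is:

Split each review into a vector of individual words
For each vector, create a numeric vector containing the positions of all matching words
For each element of that vector, create a vector containing the positions of all words in the range to extract
Extract the words and paste them back together into a sentence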

Here is a small example using the stringr and purrr libraries:

library(stringr)  # for dealing with strings
library(purrr)    # for doing the same thing to every element of a list

reviews <- c(
  "I really love this book about Mary and Jesus!",
  "This book is about fruits - including lemon which is my favourite",
  "I learned so much about hinduism, which I thought was really interesting"
)

faith_words <- c("mary", "jesus", "hinduism")

## clean strings - remove punctuation and leading/trailing whitespace
## (str_trim() leaves interior double spaces; the word split below ignores them)
## then make lowercase

reviews <- 
  str_remove_all(reviews, "[:punct:]") |> 
  str_trim() |> 
  str_to_lower()

reviews
#> [1] "i really love this book about mary and jesus"                           
#> [2] "this book is about fruits  including lemon which is my favourite"       
#> [3] "i learned so much about hinduism which i thought was really interesting"


## split strings into lists of words
review_words <- str_split(reviews, pattern = boundary("word"))
review_words[[1]]
#> [1] "i"      "really" "love"   "this"   "book"   "about"  "mary"   "and"   
#> [9] "jesus"

## find the indexes of matching words in each string
matches <- map(review_words, \(x) which(x %in% faith_words))
matches
#> [[1]]
#> [1] 7 9
#> 
#> [[2]]
#> integer(0)
#> 
#> [[3]]
#> [1] 6

## Specify words either side
words_either_side <- 2

## I think nested lists will be the easiest way to process this data
matches <- map(matches, as.list)

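# Which words define the range we want to include?
# (the lower bound is clamped at 1 so a match near the start of a review
# cannot mix negative and positive indexes, which would be an error)
word_ranges <- map(matches, map, \(x) max(1, x - words_either_side):(x + words_either_side))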
word_ranges[[1]]
#> [[1]]
#> [1] 5 6 7 8 9
#> 
#> [[2]]
#> [1]  7  8  9 10 11


## map2 lets us use word_ranges to extract words from matching
## elements of review_words

map2(review_words, word_ranges, 
     \(x,y){
       # for each word_range assigned to this review...
       map(y,
           \(z){
             ## select the non-missing words in that range from the review_words
             words <- x[z]
             words <- words[!is.na(words)]
             words <- str_c(words, collapse = " ")
             return(words)
             })
     })
#> [[1]]
#> [[1]][[1]]
#> [1] "book about mary and jesus"
#> 
#> [[1]][[2]]
#> [1] "mary and jesus"
#> 
#> 
#> [[2]]
#> list()
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] "much about hinduism which i"

Created on 2023-02-22 with reprex v2.0.2
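
The question asks for 10 words on either side of each match, taken from the "text" column of a data frame. As a minimal sketch of how the same steps could be wrapped up for that case: words_around() is a hypothetical helper name, and reviews_df is made-up data standing in for the real reviews, so both are assumptions rather than part of the original example. The window is clamped to the bounds of each review, so a match near the start or end simply returns fewer words.

```r
library(stringr)
library(purrr)

# Hypothetical helper (an assumption, not part of the original example):
# for each review, return a list of context windows, one per matching word.
words_around <- function(text, targets, n = 10) {
  # remove punctuation and lowercase, as in the example above
  cleaned <- str_to_lower(str_remove_all(text, "[:punct:]"))
  words_list <- str_split(cleaned, pattern = boundary("word"))

  map(words_list, \(words) {
    hits <- which(words %in% targets)
    map(hits, \(i) {
      # clamp the window so it never runs past either end of the review
      window <- max(1, i - n):min(length(words), i + n)
      str_c(words[window], collapse = " ")
    })
  })
}

# Made-up data frame with a "text" column, as described in the question
reviews_df <- data.frame(
  text = c(
    "Mary is the main character in this book",
    "This book is about fruits"
  )
)

# n = 2 keeps the output short; use n = 10 for the window in the question
contexts <- words_around(reviews_df$text, c("mary", "jesus"), n = 2)
contexts[[1]]
```

Each element of contexts lines up with one row of the data frame and holds one string per match (an empty list when the review contains none of the target words), so the strings can be passed on to whichever sentiment scoring function is used.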
