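I ran topic modelling in R. It returned a number of topics, along with the probability that a word in my data frame belongs to each topic. For example, topic 7 is 'religion/christianity', and the words below belong to it.
"jesus" "mary" "christ" "margaret" ...
My data frame is a list of Amazon reviews, and each cell in the "text" column is a review. I now want to run sentiment analysis on the words surrounding these exact words in my data frame, but I'm not sure of the best way to search for them. Basically, I'd like to search for these exact words (e.g. using grep()) and return the 10 words on either side of each one, wherever the search word appears in the review. Can anyone help? Have I explained my problem clearly enough? Any help is much appreciated!
I've tried running the grep() function, but I don't know how to include the words around the matched word.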
One approach is the small example below, which uses the stringr and purrr libraries:
library(stringr) # for dealing with strings
library(purrr) # for doing the same thing to every element of a list
reviews <- c(
  "I really love this book about Mary and Jesus!",
  "This book is about fruits - including lemon which is my favourite",
  "I learned so much about hinduism, which I thought was really interesting"
)
faith_words <- c("mary", "jesus", "hinduism")
## clean strings - remove punctuation and double spacing
## (str_squish() trims the ends and collapses repeated spaces)
## then make lowercase
reviews <-
  str_remove_all(reviews, "[:punct:]") |>
  str_squish() |>
  str_to_lower()
reviews
#> [1] "i really love this book about mary and jesus"
#> [2] "this book is about fruits including lemon which is my favourite"
#> [3] "i learned so much about hinduism which i thought was really interesting"
## split strings into lists of words
review_words <- str_split(reviews, pattern = boundary("word"))
review_words[[1]]
#> [1] "i" "really" "love" "this" "book" "about" "mary" "and"
#> [9] "jesus"
## find the indexes of matching words in each string
matches <- map(review_words, \(x) which(x %in% faith_words))
matches
#> [[1]]
#> [1] 7 9
#>
#> [[2]]
#> integer(0)
#>
#> [[3]]
#> [1] 6
## Specify words either side
words_either_side <- 2
## I think nested lists will be the easiest way to process this data
matches <- map(matches, as.list)
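# Which words define the range we want to include?
# max(1, ...) keeps the start index positive, so a match near the start of a
# review does not mix negative and positive subscripts (an error in R)
word_ranges <- map(matches, map, \(x) max(1, x - words_either_side):(x + words_either_side))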
word_ranges[[1]]
#> [[1]]
#> [1] 5 6 7 8 9
#>
#> [[2]]
#> [1] 7 8 9 10 11
## map2 lets us use word_ranges to extract words from matching
## elements of review_words
map2(review_words, word_ranges,
  \(x, y) {
    # for each word_range assigned to this review...
    map(y,
      \(z) {
        ## select the non-missing words in that range from the review_words
        words <- x[z]
        words <- words[!is.na(words)]
        words <- str_c(words, collapse = " ")
        return(words)
      })
  })
#> [[1]]
#> [[1]][[1]]
#> [1] "book about mary and jesus"
#>
#> [[1]][[2]]
#> [1] "mary and jesus"
#>
#>
#> [[2]]
#> list()
#>
#> [[3]]
#> [[3]][[1]]
#> [1] "much about hinduism which i"
Created on 2023-02-22 with reprex v2.0.2
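To apply the same idea to a data frame with a "text" column, the steps above could be wrapped in one function. The following is only a minimal sketch in base R, not part of the original answer: the helper name `context_windows()`, the data frame `reviews_df` and its `context` column are all hypothetical, and it splits on whitespace rather than using `boundary("word")`. It clamps each window to the bounds of the review, so a match near the start or end still works.

```r
# Hypothetical helper (not from the original answer): returns one string per
# matching word, holding up to n words on either side of the match.
context_windows <- function(text, targets, n = 10) {
  # lowercase, strip punctuation, then split on runs of whitespace
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  hits <- which(words %in% targets)
  # clamp each window to the first and last word of the review
  vapply(hits, function(i) {
    from <- max(1, i - n)
    to <- min(length(words), i + n)
    paste(words[from:to], collapse = " ")
  }, character(1))
}

# Assumed example data: a data frame with one review per row in `text`
reviews_df <- data.frame(
  text = c(
    "I really love this book about Mary and Jesus!",
    "This book is about fruits - including lemon which is my favourite"
  ),
  stringsAsFactors = FALSE
)
faith_words <- c("mary", "jesus")

# One list element per review; reviews without a match get character(0)
reviews_df$context <- lapply(reviews_df$text, context_windows,
                             targets = faith_words, n = 2)
```

With `n = 10` this should give the ten words on either side that the question asks for. The resulting strings in `context` could then be passed to whichever sentiment analysis tool you are using.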