R - Cosine similarity between every two rows in a data frame


I have a data frame named `text` with two columns: `year` and `text`. The dput output below serves as an example:

text <- structure(list(year = 2000:2007, text = c("I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.", 
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.", 
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.", 
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.", 
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.", 
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.", 
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!", 
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))

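I want to compute the cosine similarity between the text in each row and the text in the row before it. I am using the `cosine_distance` function from the `textTinyR` package to compute the cosine similarity. I wrote the following code, but it fails: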

result <- text %>%
  mutate(cosine_similarity = cosine_distance(text, dplyr::lag(text)))

Specifically, I get this error:

Error in `mutate()`:
ℹ In argument: `cosine_similarity = cosine_distance(text, dplyr::lag(text))`.
Caused by error:
! Expecting a single string value: [type=character; extent=8].

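I understand that `cosine_distance()` takes single string values, and that because I call it inside `dplyr::mutate`, it receives the entire column vector, which causes the error.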

For example, I tried the following on two single strings, and it works:

sentence1 = 'this is one sentence'

sentence2 = 'this is second sentence'

cds = cosine_distance(sentence1, sentence2)

print(cds)

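However, I am not sure how to apply it so that the cosine similarity ends up as an additional variable in the data frame. For example, the row for 2003 should hold the cosine similarity between the 2003 text and the 2002 text, and so on. Any help is greatly appreciated.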
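The failure described above can be reproduced without textTinyR. The following is a minimal base-R sketch, assuming a hypothetical stand-in function `scalar_only()` that, like `cosine_distance()` as described in the question, accepts only single strings. Its return value is a placeholder and not a similarity measure.

```r
# Hypothetical stand-in for a function that accepts only single strings
scalar_only <- function(a, b) {
  stopifnot(is.character(a), length(a) == 1,
            is.character(b), length(b) == 1)
  nchar(a) + nchar(b)  # placeholder computation, not a similarity measure
}

x <- c("aa", "bbb", "c")
y <- c("d", "ee", "fff")

# Passing whole vectors fails; mutate() hands the function whole columns too
vector_call <- tryCatch(scalar_only(x, y), error = function(e) "error")

# Calling the function once per pair of elements works
pairwise_call <- mapply(scalar_only, x, y, USE.NAMES = FALSE)
```

Here `vector_call` ends up as `"error"`, while `pairwise_call` holds one value per pair of elements. The same element-by-element pattern is what a fix inside `mutate()` needs.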

r dplyr similarity cosine-similarity
1 Answer

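You can use `Map` to call `cosine_distance` once for each pair of elements from `text` and `lag(text)`, which effectively vectorizes it: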

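library(dplyr)      # %>%, mutate(), lag(), as_tibble()
library(textTinyR)  # cosine_distance()

# Map() calls cosine_distance() on each (current, previous) pair of texts
text %>%
  as_tibble() %>%
  mutate(cosine_similarity = unlist(Map(cosine_distance,
                                text, dplyr::lag(text))))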
#> # A tibble: 8 x 3
#>    year text                                           cosin~1
#>   <int> <chr>                                            <dbl>
#> 1  2000 I went to McDonald's and they charge me 50 fo~   0    
#> 2  2001 I went to McDonald's and they charge me 50 fo~   1    
#> 3  2002 I really think that if you can buy breakfast ~   0.336
#> 4  2003 I guess the employee decided to buy their lun~   0.285
#> 5  2004 Never order McDonald's from Uber or Skip or a~   0.179
#> 6  2005 Employees left me out in the snow and wouldn’~   0.170
#> 7  2006 McDonalds food was always so good but ever si~   0.252
#> 8  2007 I just ordered the new crispy chicken sandwic~   0.291
#> # ... with abbreviated variable name 1: cosine_similarity
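
The same row-against-previous-row pattern can be sketched in base R alone. The helper `cosine_sim_words()` below is hypothetical and is not textTinyR's implementation: it assumes a simple bag-of-words model with lowercasing and whitespace splitting, so its values may differ from those shown above. It also returns `NA` for the first row, which has no previous text, whereas the output above shows 0 there.

```r
# Hypothetical helper: cosine similarity of word-count vectors (base R only)
cosine_sim_words <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_real_)
  words_a <- strsplit(tolower(a), "\\s+")[[1]]
  words_b <- strsplit(tolower(b), "\\s+")[[1]]
  vocab <- union(words_a, words_b)
  # Count each vocabulary word in both texts, with 0 for absent words
  va <- as.numeric(table(factor(words_a, levels = vocab)))
  vb <- as.numeric(table(factor(words_b, levels = vocab)))
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# Small illustrative data frame (not the data from the question)
docs <- data.frame(
  year = 2000:2002,
  text = c("big mac meal", "big mac meal", "cold fries"),
  stringsAsFactors = FALSE
)

# Base-R lag: shift the column down by one, padding with NA
previous <- c(NA_character_, head(docs$text, -1))

# One call per (current, previous) pair, collected into a numeric column
docs$cosine_similarity <- mapply(cosine_sim_words, docs$text, previous,
                                 USE.NAMES = FALSE)
```

Using `mapply()` here plays the same role as `unlist(Map(...))` in the answer: it applies a scalar function pair by pair and simplifies the result to a vector.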