Compute the cosine similarity of all terms relative to one specific term

Question (votes: 0, answers: 1)

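I have a very large corpus/DFM/DTM object and I want to compute linguistic similarity on it. However, the object is so large that R crashes every time I try to compute the cosine similarity statistic. This is what I use to compute the cosine similarity scores: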

# margin = "features" compares terms (columns); this computes all term-by-term pairs
test_cosine <- textstat_simil(myTFIDF, margin = "features", method = "cosine")

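The problem is that I am really only interested in the scores for a few terms. For example, I would like to see the cosine similarity of every other term with "conservative" and "liberal". Is there a way to have R produce similarity scores against just these two terms? I saw a post here that recommended another way to compute the cosine similarity between two terms:

d <- stringdist("conserv", "liberal", method = "cosine")

This did produce a score, but it confuses me: it never asked me to specify the data, so I do not know how the score was calculated.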

If not, is there another way to get linguistic similarity scores for terms from a large corpus/DTM/DFM object?

Edit: here is a dput of a small portion of the data.

structure(list(`twitterdata[1:50, ]` = c("matter time", "beatl", 
"craze left wing hippi weirdo freak rant observ", " officialmonstax", 
"bienvenido club fan dedicado informar apoyar hermosa talentosa estrella mexicana angelicaval sigueno", 
"boa lagoa", "forget thing hurt lesson learn make mistak can never never regret thing made smile", 
"vynox come soon", "offici th parkad downtown sandiego locat next petco park histor gaslamp quarter", 
"offici salvat armi chicago metropolitan divis largest direct provid social servic state illinoi", 
"jackson ude journalist skill practition field polit communic media public polici polit manag", 
"encourag motiv inspir lead execut", " negat world", "seattl bremerton elliot thorsen ethorsen", 
"laxxxxx", "west geauga high school hockey", "baltimor born rais dundalk", 
"eight thirti thirteen", "talk show produc newstalk humber colleg radio broadcast graduat dolphin magic blue jay aggi mapl leaf hotspur fan", 
"sdsu ", "read book host literari lair fix can also found gomer product now", 
"va dancer yrs old basketbal player volleybal setter yes may loser best damn loser ever meet", 
"can star magic workshopp", "offici page strongsvill ladi mustang varsiti soccer team", 
"queen hous wife mom sis aunt garden chef caregiv teacher counselor pro life liberti happi godfear christfollow biblebeliev john ", 
"sport star war hous music", "alexandria atletico jenni tcw ", 
"fun guitar", "jesus dont give fail everyth mean noth realiz good bro aredhel nargothrond elf name", 
"life give lemon return ask zayn malik pleas ", "collector thing beauti past present find rebelmous vintagedressparlour", 
"streamer aspir musician fit health", "artist illustr design d model busi commiss open charact simpl background", 
"musico poeta loco artista naturaleza", "totalment fascinado afeccion emocional biologica cuerpo humano eterno enamorado letra lavida dio guia hoy manana siempr", 
"girl mani dream shawnmend girlfriend dream", "alway look delici bigup friend", 
"sassi sexi wild lover music writer blogger product junki pierc tattoo enthusiast hopeless romant makeup artist", 
"stop useless start pizza", "keep upto date fixtur result latest news updat across gfa leagu", 
"ez lab onlin portal various nabl iso certifi diagnost lab avail provid qualiti assur healthcar consum", 
" that flick tho", " pinch dinosaurio risueno guey music drug physiotherapi campus puebla", 
"look sharp cut edg design get", "proud eph gopher track alum bs kin umn mba sp mgt cuc ski racer climber around athlet ao", 
"professor nerdi abound warn fond book turn brain", "gotta risk get biscuit mdp presleyy aspir sing avocado ladi", 
"dragonapothek ist onlin apothek allgemein dieser bieten manner sexuell gesundheit medizin kamagra", 
"help compani individu discov fit clariti therapist connect agent outgo introvert flaw believ husband dad bbq er", 
"keep negat aliv babi termin hate spread posit cudfam cudlif"
)), row.names = c(NA, -50L), class = "data.frame")
r text cosine-similarity quanteda
1 answer
0 votes

Pass a DFM containing only the selected columns as y:

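# margin = "features" compares terms (columns) rather than documents
textstat_simil(x = dfmt, y = dfmt[, c("conservative", "liberal")], margin = "features", method = "cosine")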
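
To see why restricting y helps, here is a minimal base R sketch of the same idea on a small dense matrix. The counts are made up for illustration and cosine_to_selected is a hypothetical helper, not a quanteda function. The result has one row per term but only one column per target term, so its size grows with the number of terms rather than with its square.

```r
# Toy document-feature counts (made up for illustration):
# rows are documents, columns are terms.
m <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    0, 3, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3),
                  c("conservative", "liberal", "tax", "vote"))
)

# Hypothetical helper: cosine similarity of every column of `m`
# against only the columns named in `targets`.
# Note: a column of all zeros has norm 0 and would yield NaN.
cosine_to_selected <- function(m, targets) {
  norms <- sqrt(colSums(m^2))                       # Euclidean norm of each term vector
  dots <- crossprod(m, m[, targets, drop = FALSE])  # dot products: all terms x targets
  dots / outer(norms, norms[targets])               # divide by the product of the norms
}

sims <- cosine_to_selected(m, c("conservative", "liberal"))
print(round(sims, 3))
```

Each row of sims is a term and each column is one of the two target terms, so sorting a column ranks all terms by their similarity to that target.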