Is there a better way to classify narrations based on keywords in R?


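I am trying to classify narrations against a dictionary of keywords. My approach is to pick the keyword that has the smallest string distance to the narration. This works well in general, but I ran into an example where it does not seem appropriate. Here is the code snippet: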

#a is the narration and b(s) are some keywords
a = "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
b1 = "tarik"
b2 = "pajak"
b3 = "trf inw rtgs"

library(stringdist)  # provides stringdist()
dis1 = stringdist(tolower(a),b1,method = "jw") 
dis2 = stringdist(tolower(a),b2,method = "jw")
dis3 = stringdist(tolower(a),b3,method = "jw")

#Output 
> dis1
[1] 0.3810606

> dis2
[1] 0.3143939

> dis3
[1] 0.4406566

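As I understand it, the stringdist function first recycles the shorter string to match the length of the longer one, and then computes the distance from the number of iterations needed to match the two strings.

What I don't understand is that b3 is a substring of the (lower-cased) narration a, yet it does not have the closest distance compared with the other keywords.

Is there a reason behind this, and what alternative approaches could I try to get better matches?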

r text-mining text-classification fuzzy-search
1 Answer

The key point here is that stringdist() works on characters, whereas the question seems to be about word similarity. So consider the following:

# Note: this does not try to explain every nuance, only the word-versus-character aspect.
library(stringdist)

a  = "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
b1 = "tarik"
b2 = "pajak"
b3 = "trf inw rtgs"
b4 = "PRAJA GHUPTA"   # prefix of a, nchar = 12, but upper case while a is lower-cased below: higher distance
b5 = "PRAJA G"        # prefix of a, nchar = 7, also upper case: slightly lower distance
b6 = "PRAJA G"        # identical to b5, so stringdist(b5, b6, method = "jw") = 0 as expected
b7 = "paa gua uaa"    # nchar = 11, built only from the letters p, a, g, u and spaces

dis1 = stringdist(tolower(a), b1, method = "jw")
dis2 = stringdist(tolower(a), b2, method = "jw")
dis3 = stringdist(tolower(a), b3, method = "jw")
dis4 = stringdist(tolower(a), b4, method = "jw")   # case mismatch: only the space can match
dis5 = stringdist(tolower(a), b5, method = "jw")   # case mismatch again
dis6 = stringdist(b5, b6, method = "jw")

# b7 is not a meaningful keyword, but all of its characters are available early in a,
# so it gets the lowest distance after the exact match dis6
dis7 = stringdist(tolower(a), b7, method = "jw")   # 0.2916667

dis1; dis2; dis3; dis4; dis5; dis6; dis7

> dis1;dis2;dis3;dis4;dis5; dis6; dis7
[1] 0.3810606
[1] 0.3143939
[1] 0.4406566
[1] 0.635101
[1] 0.6152597
[1] 0
[1] 0.2916667
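
To make the character-level behaviour concrete, the printed numbers can be reproduced by hand. With method = "jw" and the default p = 0, stringdist reports the plain Jaro distance, 1 - (m/|a| + m/|b| + (m - t)/m) / 3, where m is the number of matched characters and t is half the number of matched characters that are out of order. The sketch below uses a hypothetical helper, jaro_from_counts, which is not part of stringdist. The values of m and t are assumptions back-calculated from the outputs above rather than read from the library. They suggest that all 12 characters of b3 are matched, but most likely to occurrences scattered over the narration, so many of them count as out of order and push the distance up.

```r
# Jaro distance computed from match counts.
# Hypothetical helper for illustration; not part of the stringdist package.
jaro_from_counts <- function(m, t, len_a, len_b) {
  1 - (m / len_a + m / len_b + (m - t) / m) / 3
}

a <- "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
len_a <- nchar(a)

# m and t below are back-calculated assumptions, chosen to reproduce the printed distances
d1 <- jaro_from_counts(m = 5,  t = 1,   len_a, len_b = nchar("tarik"))         # dis1
d2 <- jaro_from_counts(m = 5,  t = 0,   len_a, len_b = nchar("pajak"))         # dis2
d3 <- jaro_from_counts(m = 12, t = 5.5, len_a, len_b = nchar("trf inw rtgs"))  # dis3
d4 <- jaro_from_counts(m = 1,  t = 0,   len_a, len_b = nchar("PRAJA GHUPTA"))  # dis4: only the space matches

round(c(d1, d2, d3, d4), 7)
```

Under these assumptions the short keywords b1 and b2 do well simply because their few letters all occur somewhere in the long narration, which is why a whole-string Jaro distance is a weak signal for "keyword appears in text".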

# Other aspects are explained in the vignette and help pages.
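
As for alternatives, one possible approach (a sketch, not something the code above already does) is to compare each keyword with every same-length window of the narration and keep the smallest distance, so that a keyword contained in the text scores 0. The helper name min_window_dist is hypothetical, and the choice of method = "osa" is an assumption; any stringdist method could be substituted.

```r
library(stringdist)

a <- "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
keywords <- c("tarik", "pajak", "trf inw rtgs")

# Smallest distance between `keyword` and any window of `text` with the same length.
# Hypothetical helper for illustration; not part of the stringdist package.
min_window_dist <- function(text, keyword, method = "osa") {
  n <- nchar(text)
  k <- nchar(keyword)
  if (k >= n) {
    return(stringdist(text, keyword, method = method))
  }
  starts  <- seq_len(n - k + 1)
  windows <- substring(text, starts, starts + k - 1)
  min(stringdist(windows, keyword, method = method))
}

# One score per keyword; vapply names the result after the keywords
scores <- vapply(keywords, function(kw) min_window_dist(tolower(a), kw), numeric(1))
scores

# Keyword with the smallest windowed distance
best <- names(which.min(scores))
best
```

Because longer keywords can accumulate more edits, it may be worth dividing each score by nchar(keyword) before comparing. For exact containment only, base R's grepl(keyword, tolower(a), fixed = TRUE) is a simpler check.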