Giving numbers more weight than strings in cosine similarity

Question

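I have a program that pulls addresses from the Internet and checks them against a database. This works well, but I am now trying to introduce a similarity function to compare the addresses found on the Internet with the addresses in the database.

I am using the following script to check how cosine similarity compares addresses: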

import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

addresses = [
  '705 Sherlock House, 221B Baker Street, London NW1 6XE', 
  '75 Sherlock House, 221B Baker Street, London NW1 6XE', 
  'Apartment 704 Sherlock House, 221B Baker Street, London NW1 6XE', 
  'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE', 
  '705, 221B Baker Street, London NW1 6XE', 
  '75, 221B Baker Street, London NW1 6XE',
  '705 Watson House, 219 Baker Street, London NW1 6XE',
  '32 Baker Street, London NW1 6XE',
  '1060 West Addison, London, W2 6SR',
  '705 Sherlock Hse, Baker Street, London, NW1'
  ]

def clean_address(text):
  # Drop punctuation characters, then lowercase the address.
  text = ''.join([char for char in text if char not in string.punctuation])
  text = text.lower()
  return text

cleaned = list(map(clean_address, addresses))

vectorizer = CountVectorizer()
transformedVectorizer = vectorizer.fit_transform(cleaned)
vectors = transformedVectorizer.toarray()

csim = cosine_similarity(vectors)

def cosine_sim_vectors(vec1, vec2):
  vec1 = vec1.reshape(1, -1)
  vec2 = vec2.reshape(1, -1)

  return cosine_similarity(vec1, vec2)[0][0]

# Compare the first address against each of the other addresses.
for i in range(1, len(addresses)):
  similarity = cosine_sim_vectors(vectors[0], vectors[i])
  print("{} is {:.1f}% similar to {}".format(addresses[0], similarity * 100, addresses[i]))

The output is:

705 Sherlock House, 221B Baker Street, London NW1 6XE is 88.9% similar to 75 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 84.3% similar to Apartment 704 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 94.9% similar to Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 88.2% similar to 705, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 75.6% similar to 75, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 77.8% similar to 705 Watson House, 219 Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 68.0% similar to 32 Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 13.6% similar to 1060 West Addison, London, W2 6SR
705 Sherlock House, 221B Baker Street, London NW1 6XE is 75.6% similar to 705 Sherlock Hse, Baker Street, London, NW1

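This does a reasonable job, since I would probably check by eye anything scoring above 60-70%, and I am impressed that it largely withstood my deliberate attempts to fool it with 705 Watson House and 705 Sherlock Hse. Still, I think the algorithm would improve if it recognised, for example, that 705 is more important than London. Alternatively, I could remove parts such as London 6XE before comparing.

I am also open to using a different similarity function if there is a more suitable one, as I understand that cosine similarity converts the strings into vectors and essentially treats every token equally.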

python scikit-learn cosine-similarity
1 Answer

There is no need to put more weight on part of my address string; cosine similarity works out of the box.

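Cosine similarity is better than string edit distance here, because '75 Sherlock House, 221B Baker Street, London NW1 6XE' is less similar to '705 Sherlock House, 221B Baker Street, London NW1 6XE' than 'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE' is. Cosine similarity captures this intuition.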
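The contrast can be checked with a small sketch. It assumes Python's `difflib.SequenceMatcher` ratio as a stand-in for a character-level edit-distance score (it is not a true Levenshtein distance) and compares it with the token-level cosine similarity used in the question.

```python
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = '705 Sherlock House, 221B Baker Street, London NW1 6XE'
candidates = [
    '75 Sherlock House, 221B Baker Street, London NW1 6XE',
    'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE',
]

# Token-level score: cosine similarity on word-count vectors.
vectors = CountVectorizer().fit_transform([reference] + candidates)
cosine_scores = cosine_similarity(vectors[0], vectors[1:])[0]

# Character-level score: SequenceMatcher ratio, used here as a rough
# proxy for edit distance (1.0 means the strings are identical).
edit_scores = [SequenceMatcher(None, reference, c).ratio() for c in candidates]

for candidate, cos, edit in zip(candidates, cosine_scores, edit_scores):
    print("{}: cosine {:.1f}%, character ratio {:.1f}%".format(
        candidate, cos * 100, edit * 100))
```

The character-level ratio ranks the 75 address higher, because it differs from the reference by a single character, whereas the cosine score ranks the Apartment 705 address higher, in line with the 88.9% and 94.9% figures in the question's output.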
