sklearn matching results become misaligned as the dataset grows

Question

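I have been using sklearn NearestNeighbors for name matching, and at some point the results become misaligned. My list of standardized names has 100 million entries. The list of names I want to match is much smaller, but can still be between 250k and 500k. After a certain point, the indices seem to start shifting by 1 or more.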

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# tfidf, vectorizer, names and clean_org_names are built earlier (not shown).
# Row k of tfidf must correspond to row k of clean_org_names for the lookup below to be valid.
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
# de-duplicate the query names, then fix their order in a list before querying
unique_org = list(set(names['VariationName'].values))

# matching query:
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

print('Getting nearest n...')
distances, indices = getNearestN(unique_org)

print('Finding matches...')
matches = []
for i, j in enumerate(indices):
    # j holds the row position of the nearest standardized name in the fitted matrix
    temp = [round(distances[i][0], 2), clean_org_names.values[j][0][0], unique_org[i]]
    matches.append(temp)

print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched name', 'Original name'])
print('Data frame built')

It seems that once my standardized list exceeds 80k entries, the results start shifting down.

A "messy name" such as VITALI, ANGELO (with a comma):

VITALI, ANGELO

The standardized name list may contain these (without the comma):

VITALI ANGELO   
SENSABLE TECHNOLOGIES INC

After running the matching above, the result below shows VITALI, ANGELO as a near-perfect match for SENSABLE TECHNOLOGIES INC, because the index has shifted down by one... I think.

 0.00   SENSABLE TECHNOLOGIES INC   VITALI, ANGELO

Is it possible that the size or number of records exceeds some matrix limit and somehow messes up the indices?
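
One way to test the shifted-index idea is to confirm that the matrix passed to fit() has exactly one row per row of the name table, and to look names up by position. The sketch below is only an illustration under assumptions: the tiny `raw` frame, the `name` column, and the `char_wb` n-gram settings are made up and are not the real data or vectorizer. It shows how a dropped NaN or duplicate row leaves gaps in the index unless the index is reset before vectorizing.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the standardized list, with a gap (None) and a duplicate.
raw = pd.DataFrame(
    {"name": ["VITALI ANGELO", None, "SENSABLE TECHNOLOGIES INC", "VITALI ANGELO"]}
)

# Clean, then reset the index so row k of the frame is row k of the TF-IDF matrix.
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

# Assumed vectorizer settings, chosen only for this example.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(clean["name"])

# If these differ, the indices from kneighbors cannot be used to look up names.
assert tfidf.shape[0] == len(clean)

nbrs = NearestNeighbors(n_neighbors=1).fit(tfidf)

queries = ["VITALI, ANGELO"]
distances, indices = nbrs.kneighbors(vectorizer.transform(queries))

# kneighbors returns row positions in the fitted matrix, so look up by position.
matched = clean["name"].to_numpy()[indices[:, 0]]
print(list(zip(queries, matched)))
```

If the row counts differ in the real pipeline, or the name table still carries its pre-cleaning index, a lookup by label rather than by position could produce exactly this kind of off-by-one result.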

Answer 1

A shot in the dark, but do you think the radius parameter of NearestNeighbors might be affecting this?
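
For what it is worth, `radius` is the default range for `radius_neighbors` queries, while the code in the question calls `kneighbors`. A quick way to check whether it matters is to fit two models that differ only in `radius` and compare their `kneighbors` output. This is a minimal sketch on made-up random points (`X` and `Q` are assumptions, not the poster's data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # made-up reference points
Q = rng.random((20, 5))   # made-up query points

# Two models that differ only in the radius parameter.
small = NearestNeighbors(n_neighbors=1, radius=0.01).fit(X)
large = NearestNeighbors(n_neighbors=1, radius=100.0).fit(X)

d_small, i_small = small.kneighbors(Q)
d_large, i_large = large.kneighbors(Q)

# Compare the nearest-neighbor indices and distances from both models.
same_indices = np.array_equal(i_small, i_large)
same_distances = np.allclose(d_small, d_large)
print(same_indices, same_distances)
```

If both comparisons come back true on the real data as well, the radius setting is probably not the cause, and the row alignment between the fitted matrix and the name table is the more likely place to look.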
