Clustering lines that each contain several words, preferably using Levenshtein

Question (votes: 0, answers: 1)

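I have a text file in which each line contains several words. I want to cluster the lines as whole units, rather than splitting each line into individual words. I wrote some code, but the output is strange. My code: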

import numpy as np
import sklearn.cluster
import distance

f = open("names.txt", "r")
words = f.read().split(',')
#for line in f:
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Output:

 - *BRAZEMAX ESTATYS:*  Inc.,  Inc.
BBAZEMAX ESTATES, BRAZEMAX ESTATYS
 - * LTD
Gramkai Books
Bras5emax Estates:*  Jr
John Smith
PC Adelman
Gramkai,  LTD
BOZEMAN Ent.
Gramkat Estates,  LTD
Gramkai Books
Bras5emax Estates
 - * L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman:*  Inc.
Bozeman Enterprises
Michele LTD
Gramkat,  L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman

File:

BRAZEMAX ESTATYS, LTD
Gramkai Books
Bras5emax Estates, L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman, Jr
John Smith
PC Adelman
Gramkai, Inc.
Bozeman Enterprises
Michele LTD
Gramkat, Inc.
BBAZEMAX ESTATES, LTD
BOZEMAN Ent.
Gramkat Estates, Inc.

What is going wrong here?

python file hierarchical-clustering
1 answer
0 votes

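You probably also need to remove the \n characters. Because the file is split only on commas, the words stay joined to the newline (line feed) characters. That is why you see multi-line output.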
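To make the effect visible, here is a minimal sketch that splits a short sample on commas only. The three-line `text` string is a hypothetical stand-in for the contents of names.txt, not the real file.

```python
# Hypothetical three-line sample standing in for the contents of names.txt
text = "BRAZEMAX ESTATYS, LTD\nGramkai Books\nBras5emax Estates, L.T.D."

# Splitting on commas only, as in the question's code
tokens = text.split(',')

# repr() shows the newline characters that remain inside each token
for token in tokens:
    print(repr(token))
```

The middle token spans three lines of the sample, which matches the multi-line exemplars shown in the question's output.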

You can update the code after reading the file:

original_file = """BRAZEMAX ESTATYS, LTD
Gramkai Books
Bras5emax Estates, L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman, Jr
John Smith
PC Adelman
Gramkai, Inc.
Bozeman Enterprises
Michele LTD
Gramkat, Inc.
BBAZEMAX ESTATES, LTD
BOZEMAN Ent.
Gramkat Estates, Inc."""

original_file
'BRAZEMAX ESTATYS, LTD\nGramkai Books\nBras5emax Estates, L.T.D.\nBOZEMAN Enterprises\nBOZERMAN ENTERPRISES\nNadelman, Jr\nJohn Smith\nPC Adelman\nGramkai, Inc.\nBozeman Enterprises\nMichele LTD\nGramkat, Inc.\nBBAZEMAX ESTATES, LTD\nBOZEMAN Ent.\nGramkat Estates, Inc.'

import re
re.split('[\n,]', original_file)
['BRAZEMAX ESTATYS',
 ' LTD',
 'Gramkai Books',
 'Bras5emax Estates',
 ' L.T.D.',
 'BOZEMAN Enterprises',
 'BOZERMAN ENTERPRISES',
 'Nadelman',
 ' Jr',
 'John Smith',
 'PC Adelman',
 'Gramkai',
 ' Inc.',
 'Bozeman Enterprises',
 'Michele LTD',
 'Gramkat',
 ' Inc.',
 'BBAZEMAX ESTATES',
 ' LTD',
 'BOZEMAN Ent.',
 'Gramkat Estates',
 ' Inc.']

Now the words are split on both newlines and commas.
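Since the question asks to cluster whole lines rather than comma-separated pieces, an alternative worth trying is to split on newlines only. The sketch below is one possible approach, not the answer's method. The `levenshtein` helper is a hypothetical pure-Python stand-in for `distance.levenshtein`, and `sample` is a shortened, made-up excerpt of the file.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]

# Shortened sample; in practice this would be the text read from names.txt
sample = """BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Gramkai, Inc.
Gramkat, Inc.
"""

# One entry per line: strip surrounding whitespace and skip empty lines
names = [line.strip() for line in sample.splitlines() if line.strip()]

# Negated distances, the same convention as lev_similarity in the question
similarity = [[-levenshtein(a, b) for a in names] for b in names]
```

Under these assumptions, `similarity` could be converted with `np.array(similarity)` and passed to `AffinityPropagation(affinity="precomputed")` exactly as in the question, with `names` taking the place of `words`. Lowercasing each line before comparing may also help, since the file mixes upper and lower case, but that is a design choice rather than a requirement.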
