Find the most "consensus" string in a list [closed]

Problem description  Votes: 0  Answers: 1

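I have a list of strings that all represent the same object, but each string may name it slightly differently. I am trying to find the most "consensus" string in the list and use it as a "golden source" value.

An example of such data might be: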

Procter & Gamble Co.
Procter & Gamble co
Procter & Gamble Co (The)

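I have implemented a working sample, but its logic is not ideal, and I wonder whether there is a library that could help me do this efficiently. My algorithm basically looks for the best pair of values rather than the best one-to-many set (I really can't figure out how to do that). It works well enough because my lists usually have 3-5 elements, but as a list grows I may end up with two identical bad values that beat a better one.

My sample looks like this: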

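import logging

from fuzzywuzzy import fuzz  # assumption: fuzz.ratio is fuzzywuzzy's


def best_name(frame):
    """Return the entry whose value is the best consensus name in frame."""
    # Build a list of {'source': ..., 'value': ...} dicts from the frame data
    # (frame2dict is the asker's own helper, not shown here)
    data = frame2dict(frame)
    logging.info("Getting the best name, source data: {}".format(data))

    # Compare values in each row, skipping comparison with self
    for item in data:
        item['matches'] = dict()
        for each in data:
            if item['source'] != each['source']:
                item['matches'][each['source']] = fuzz.ratio(item['value'], each['value'])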
    logging.info("Data with fuzz ratios: {}".format(data))

    # Build a summary array to identify the closest match
    summary = list()
    for item in data:
        for match in item['matches']:
            row = [item['source'], item['matches'][match], match]
            # Skip pairs already recorded in either direction
            # (row[::-1] stands in for the undefined reverse_array helper)
            if row not in summary and row[::-1] not in summary:
                summary.append(row)
    logging.info("Summary table: {}".format(summary))

    # Extract the best match from the summary array
    best_pair = None
    for item in summary:
        if best_pair is None or best_pair[1] < item[1]:
            best_pair = item  # keep the whole row, not just its score
    logging.info("Best pair: {}".format(best_pair))

    # Compare the lengths of the two candidate values and return the entry with the shorter one
    a = next(x for x in data if x['source'] == best_pair[0])
    b = next(x for x in data if x['source'] == best_pair[2])
    logging.info("Two final candidates: {} and {}, returning shortest".format(a, b))

    if len(a['value']) > len(b['value']):
        return b
    else:
        return a

In action, here is the trace:

INFO:root:Getting the best name, source data: [{'value': 'Procter & Gamble Co.', 'source': 'WSJ'}, {'value': 'Procter & Gamble Co', 'source': 'RTS'}, {'value': 'Procter & Gamble Company (The)', 'source': 'NYSE'}]
INFO:root:Data with fuzz ratios: [{'value': 'Procter & Gamble Co.', 'source': 'WSJ', 'matches': {'RTS': 97, 'NYSE': 76}}, {'value': 'Procter & Gamble Co', 'source': 'RTS', 'matches': {'WSJ': 97, 'NYSE': 78}}, {'value': 'Procter & Gamble Company (The)', 'source': 'NYSE', 'matches': {'WSJ': 76, 'RTS': 78}}]
INFO:root:Summary table: [['WSJ', 97, 'RTS'], ['WSJ', 76, 'NYSE'], ['RTS', 78, 'NYSE']]
INFO:root:Best pair: ['WSJ', 97, 'RTS']
INFO:root:Two final candidates: {'value': 'Procter & Gamble Co.', 'source': 'WSJ', 'matches': {'RTS': 97, 'NYSE': 76}} and {'value': 'Procter & Gamble Co', 'source': 'RTS', 'matches': {'WSJ': 97, 'NYSE': 78}}, returning shortest

It works, but I wonder whether there is something similar to difftools that could do this a bit more cleverly?
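The "one-to-many" selection the question asks about could be sketched roughly as follows: score every candidate against all the others and keep the one with the highest total. This is only a minimal sketch, not the asker's code. It assumes the standard library's difflib.SequenceMatcher as the similarity measure in place of fuzz.ratio, and consensus_name is a hypothetical name chosen for illustration.

```python
from difflib import SequenceMatcher


def consensus_name(names):
    """Return the name with the highest total similarity to all the others."""
    def total_similarity(i):
        # Compare candidate i with every other entry; indexing by position
        # means duplicate strings still count as separate votes.
        return sum(
            SequenceMatcher(None, names[i], other).ratio()
            for j, other in enumerate(names)
            if j != i
        )

    # On ties, max() keeps the earliest candidate in the list.
    best_index = max(range(len(names)), key=total_similarity)
    return names[best_index]


variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)",
]
print(consensus_name(variants))
```

Because every candidate is compared with every other one, the cost grows quadratically with the list length, which should be fine for lists of a handful of names.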

python algorithm text-processing
1 Answer
4 votes

Use the Levenshtein module:

variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)"
]

import Levenshtein
Levenshtein.median(variants)
# => 'Procter & Gamble Co'
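Note that the median shown above is not literally one of the three variants; it is a generalized median string. If the result has to be one of the original inputs, a "set median" (the input with the smallest total edit distance to all inputs) may be a better fit. The module, as far as I know, also offers Levenshtein.setmedian for this. The sketch below only illustrates the idea in pure Python, so it runs without the package; edit_distance and set_median are hypothetical helper names, not the library's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def set_median(strings):
    """Return the input string with the smallest total edit distance to all inputs."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


variants = [
    "Procter & Gamble Co.",
    "Procter & Gamble co",
    "Procter & Gamble Co (The)",
]
print(set_median(variants))
```

Unlike the generalized median, this always returns a string that some source actually supplied, which can matter when the value has to be traced back to its origin.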