An efficient way to find approximate string matches and replace them with a predefined string

Question

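I need to build an NER (Named Entity Recognition) system. For simplicity, I am doing it with approximate string matching, because the input can contain typos and other minor modifications. I have come across some great libraries such as fuzzywuzzy and the even faster RapidFuzz. Unfortunately, I did not find a way to get back the position where a match occurs. For my purpose I need not only to find the matches but also to know where they occur, because for NER I need to replace those matches with a predefined string.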

For example, if any of the following lines is found in the input string, I want to replace it with the string COMPANY_NAME:

google
microsoft
facebook
International Business Machine

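For example, the input

S/he works at Google

will be converted to

S/he works at COMPANY_NAME

You can safely assume that all inputs and all patterns to match have already been preprocessed and, most importantly, are lowercase by then. So case sensitivity is not a problem.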

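Currently, I use a sliding window technique. The window is passed over the input string from left to right, and its size is exactly the size of the pattern we want to match. For example, when I want to match International Business Machine, I run a window as wide as that pattern from left to right and try to find the best match by looking at that many consecutive tokens at a time, moving forward by a fixed stride. I do believe this is not the best approach, and it does not find the best match.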
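To make the sliding window idea concrete, here is a minimal sketch. It is not the asker's code: the function name `best_window` is hypothetical, the stride of one token is an assumption, and the standard library's `difflib.SequenceMatcher` stands in for a fuzzywuzzy or RapidFuzz scorer, so the scores are on a 0 to 1 scale.

```python
from difflib import SequenceMatcher


def best_window(pattern, text):
    """Return (score, start_token, end_token) of the best-scoring window."""
    tokens = text.split()
    size = len(pattern.split())  # window size = number of tokens in the pattern
    best = (0.0, -1, -1)
    # assumption: the window advances one token at a time
    for start in range(len(tokens) - size + 1):
        window = " ".join(tokens[start:start + size])
        score = SequenceMatcher(None, pattern, window).ratio()
        if score > best[0]:
            best = (score, start, start + size)
    return best


score, start, end = best_window(
    "international business machine",
    "she works at internatinal busines machine in london",
)
print(score, start, end)
```

The returned indices are token positions, not character offsets, and only windows with exactly as many tokens as the pattern are considered, which is one reason this approach can miss the best match.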

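So, what is an efficient way to find the "best" possible matches, together with a quantification of each match (how similar it is) and its position, so that we can replace it with a given fixed string (if the computed similarity is not less than a threshold)? Obviously, a single input may contain several parts to be replaced, and each of them is replaced separately. For example, Google and Microsoft are big companies will become

COMPANY_NAME and COMPANY_NAME are big companies

and so on.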

EDIT: fixed the link to RapidFuzz

Answer 1

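As far as I know, fuzzywuzzy and RapidFuzz do not have a function for this. You could try process.extract() or process.extractOne(), but you would have to split the text into smaller parts (i.e. words) and check every part separately. For a longer name such as International Business Machine the text would have to be split into groups of 3 words, so it needs more work.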

I think you need the module fuzzysearch

import fuzzysearch

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like facebook International Business Machine'

print(' text:', text)
print('---')

# search for the longest names first
for word in sorted(words, key=len, reverse=True):
    print(' word:', word)

    # allow at most one edit (Levenshtein distance <= 1)
    results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
    print('found:', results)

    # replace from the last match backwards so earlier offsets stay valid
    for item in reversed(results):
        text = text[:item.start] + 'COMPANY' + text[item.end:]

    print(' text:', text)
    print('---')

Result:

 text: Google and Microsoft are big companies like facebook International Business Machine
---
 word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
 text: Google and Microsoft are big companies like facebook COMPANY
---
 word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
 text: Google and COMPANY are big companies like facebook COMPANY
---
 word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
 text: Google and COMPANY are big companies like COMPANY COMPANY
---
 word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
 text: COMPANY and COMPANY are big companies like COMPANY COMPANY
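The dist values in this output are edit distances, which is what max_l_dist limits: the number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below is a plain textbook implementation for illustration, not fuzzysearch's own code, and the name `levenshtein` is mine.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# differs only in the case of the first letter
print(levenshtein('microsoft', 'Microsoft'))
# identical strings
print(levenshtein('facebook', 'facebook'))
```

This is why Microsoft and Google are reported with dist=1 (the capital letter counts as one substitution) while facebook is reported with dist=0.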

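If it finds several results for one word, it is better to start replacing at the last position, so that the other matches keep their positions. That is why I use reversed().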
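A small illustration of why the order matters, using made-up offsets rather than real fuzzysearch results (the `spans` list is hypothetical):

```python
text = 'google and google'
spans = [(0, 6), (11, 17)]  # hypothetical (start, end) offsets of two matches

# Left to right: the first replacement changes the length of the text,
# so the second pair of offsets no longer points at the second match.
forward = text
for start, end in spans:
    forward = forward[:start] + 'COMPANY' + forward[end:]

# Right to left: everything before the replaced span is untouched,
# so the remaining offsets stay valid.
backward = text
for start, end in reversed(spans):
    backward = backward[:start] + 'COMPANY' + backward[end:]

print(forward)   # corrupted text
print(backward)  # both matches replaced cleanly
```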

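I would also start with the longest word/name, so that shorter words, for example Business, can still be searched for afterwards without breaking up a longer name. That is why I use sorted(..., key=len, reverse=True).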
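The effect of the ordering can be seen in a simplified sketch. It uses exact `str.replace` instead of fuzzy matching to keep the example short, and the name list and helper `replace_all` are hypothetical:

```python
names = ['business', 'international business machine']  # hypothetical name list
text = 'she works at international business machine'


def replace_all(text, ordered_names):
    """Replace each name in the given order (exact matching, for illustration)."""
    for name in ordered_names:
        text = text.replace(name, 'COMPANY')
    return text


shortest_first = replace_all(text, sorted(names, key=len))
longest_first = replace_all(text, sorted(names, key=len, reverse=True))

print(shortest_first)  # the long name is broken up and never matched
print(longest_first)   # the long name is replaced as a whole
```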

But I am not sure it works exactly the way you want. It may run into problems when a word contains many mistakes.

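EDIT:

I tried to use fuzzywuzzy and created this version, but it works only for names that consist of a single word. For International Business Machine it needs some other idea.

It splits the full text into words and compares each word with every name. Then it replaces the words that have a ratio > 80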

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process

new_words = []

for part in text.split():

    matches = []

    for word in words:
        result = fuzz.token_sort_ratio(part, word)
        matches.append([result, part, word])
        print([result, part, word])  # debug output, listed in the result

    # best-scoring name first
    matches = sorted(matches, reverse=True)

    if matches and matches[0][0] > 80:
        new_words.append('COMPANY')
    else:
        new_words.append(part)  # keep the original word

print(" ".join(new_words))

Result:

[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine

EDIT:

The second version also checks names that consist of several words

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz

for name in all_names:

    length = len(name.split(' '))  # how many words the name has

    print('name length:', length, '|', name)

    words = text.split()  # split text into words

    # compare name with all groups of `length` words in text
    matches = []

    for index in range(0, len(words)-length+1):
        # join words if name has more than 1 word
        part = " ".join(words[index:index+length])
        #print('part:', part)

        result = fuzz.token_sort_ratio(part, name)
        matches.append([result, name, part, [index, index+length]])

        print([result, name, part, [index, index+length]])

    # reverse to start at last position
    matches = list(reversed(matches))

    # default=0 avoids an error when the text has fewer words than the name
    max_match = max((x[0] for x in matches), default=0)
    print('max match:', max_match)

    # replace
    if max_match > 80:
        for match in matches:
            if match[0] == max_match:
                idx = match[3]
                words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]

    text = " ".join(words)

    print('text:', text)
    print('---')

Result:

name length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY

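EDIT:

A version with fuzzywuzzy.process

This time I do not have positions; I just use the standard text.replace(item[0], 'COMPANY').

I think it will work correctly in most situations and a better method is not needed.

This time I test it on a text with mistakes: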

'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'


all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like Facebook and International Business Machine'

# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

# ---

import fuzzywuzzy.process
#import fuzzywuzzy.fuzz

for name in sorted(all_names, key=len, reverse=True):
    length = len(name.split())  # number of words in the name

    # build all groups of `length` consecutive words
    words = text.split()
    words = [" ".join(words[i:i+length]) for i in range(0, len(words)-length+1)]
    #print(words)

    #result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
    result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
    print(name, result)

    for item in result:
        text = text.replace(item[0], 'COMPANY')

print(text)
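If the positions are needed after all, the word-window idea can be kept while tracking character offsets. The following is only a sketch under stated assumptions: it uses the standard library (`re` and `difflib.SequenceMatcher`) instead of fuzzywuzzy, the helper names `find_spans`, `pick_non_overlapping` and `replace_names` are hypothetical, and the threshold of 0.75 is chosen for difflib's 0 to 1 scale, so it is not comparable to the score_cutoff=80 used above.

```python
import re
from difflib import SequenceMatcher


def find_spans(name, text, threshold):
    """Return (score, start, end) character spans of word windows similar to name."""
    tokens = list(re.finditer(r'\S+', text))  # each token keeps its offsets
    size = len(name.split())
    spans = []
    for i in range(len(tokens) - size + 1):
        start, end = tokens[i].start(), tokens[i + size - 1].end()
        score = SequenceMatcher(None, name.lower(), text[start:end].lower()).ratio()
        if score >= threshold:
            spans.append((score, start, end))
    return spans


def pick_non_overlapping(spans):
    """Greedily keep the best-scoring spans that do not overlap each other."""
    chosen = []
    for score, start, end in sorted(spans, reverse=True):  # best score first
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
    return chosen


def replace_names(text, names, placeholder='COMPANY', threshold=0.75):
    for name in sorted(names, key=len, reverse=True):  # longest names first
        chosen = pick_non_overlapping(find_spans(name, text, threshold))
        # replace from right to left so the remaining offsets stay valid
        for _, start, end in sorted(chosen, key=lambda s: s[1], reverse=True):
            text = text[:start] + placeholder + text[end:]
    return text


all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
result = replace_names(text, all_names)
print(result)
```

Because `find_spans` returns the score together with the start and end offsets, the same data can be used to report how similar each match was and where it occurred, not only to replace it.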
