Goal: find all fuzzy (approximate) occurrences of a query string in a longer string.
How can I do this in Python?
Example:
long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""
query_string = "classical music"
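I would like the Python code to find "classical music" and possibly "classic music", "classic musical", and "classical musics", depending on the string-matching threshold I set.
Research: I found "Checking fuzzy/approximate substring existing in a longer string, in Python?", but that question focuses on the best match only (i.e., not all occurrences), and its answers either also focus on the best match, do not work on multi-word query strings (since the question only had a single-word query string), or return incorrect scores (an exact match does not get a perfect score).
Here is my weak solution so far: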
import regex
long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""
query_string = "classical music"
threshold = 5
results = regex.finditer(r'(classical music){e<5}', long_string, flags=regex.IGNORECASE)
for result in results:
    print(result)
Output:
<regex.Match object; span=(9, 28), match='kes classical music', fuzzy_counts=(0, 4, 0)>
<regex.Match object; span=(49, 64), match='s classic music', fuzzy_counts=(0, 2, 2)>
<regex.Match object; span=(77, 92), match='a classic music', fuzzy_counts=(0, 2, 2)>
<regex.Match object; span=(108, 127), match=' of classical music', fuzzy_counts=(0, 4, 0)>
2 weaknesses:
It does not use query_string and threshold, but instead hard-codes them in the regex query.
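The hard-coding weakness could probably be addressed by building the pattern from the two variables. The following is only a minimal sketch, not a tested solution to the whole question: the helper names build_fuzzy_pattern and find_fuzzy are hypothetical, and it assumes the `{e<=N}` fuzzy-matching syntax of the third-party regex module, where N is the maximum number of allowed errors (so `{e<5}` above corresponds to N = 4).

```python
import re


def build_fuzzy_pattern(query: str, max_errors: int) -> str:
    """Build a pattern for the third-party `regex` module that matches
    `query` with at most `max_errors` insertions, deletions, or substitutions.
    (Hypothetical helper, not part of any library.)"""
    # Escape the query so characters such as "." or "+" are matched literally.
    return f"({re.escape(query)}){{e<={max_errors}}}"


def find_fuzzy(query: str, text: str, max_errors: int):
    """Return an iterator over fuzzy matches of `query` in `text`.
    (Hypothetical helper; requires `pip install regex`.)"""
    import regex  # third-party; imported lazily so the helper above works without it

    pattern = build_fuzzy_pattern(query, max_errors)
    return regex.finditer(pattern, text, flags=regex.IGNORECASE)
```

With these helpers, the loop above would become `for result in find_fuzzy(query_string, long_string, threshold - 1): print(result)`. This only removes the hard-coding; the matching behavior itself is unchanged.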