Beginner here.
I have a block of text.
For example: 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
And a list of words: ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']
My end goal is to find the entries in the word list that match, exactly or fuzzily, some string inside the text block.
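What I tried: difflib.get_close_matches
Needed output: 'angiotensin enzyme serum', 'angiotensin enzyme a1'
The order of the output does not matter.
For other text blocks, other strings from the list will match. The block is not constant.
Is there a way to do this?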
Use fuzzywuzzy (from PyPI):
from fuzzywuzzy import fuzz
text = 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']
matches = [w for w in words if fuzz.partial_ratio(text, w) > 70.]
Obviously you will need to tune the threshold to suit your data, but in this example the scores are well separated:
>>> print(matches)
['angiotensin enzyme serum', 'angiotensin enzyme a1']
>>> for w in words:
...     print(w, fuzz.partial_ratio(text, w))
...
angiotensin enzyme serum 83
some diff enzyme 56
angiotensin enzyme a1 90
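If you would rather stay in the standard library, here is a rough sketch of the same idea built on difflib. Two assumptions on my part: the question does not say how get_close_matches was called, so I am guessing the whole block was passed as the query, and `best_window_ratio` is a helper name I made up for illustration, not something difflib provides. It slides a phrase-length window across the text and keeps the best SequenceMatcher ratio. That is similar in spirit to fuzz.partial_ratio but not the same algorithm, so the scores (on a 0 to 1 scale here) will not necessarily agree with the ones above.

```python
from difflib import SequenceMatcher, get_close_matches

text = 'hey this is a block of text, for an example, wow looks cool blah blah blah angiotensin enzyme looks cool okay.But what about angiotensin enzym well I dont know.'
words = ['angiotensin enzyme serum', 'some diff enzyme', 'angiotensin enzyme a1']

# Comparing the whole block against each short phrase fails: the length
# mismatch keeps every ratio far below the default cutoff of 0.6.
print(get_close_matches(text, words))  # []


def best_window_ratio(text, phrase):
    """Hypothetical helper: best ratio (0..1) of `phrase` against any
    slice of `text` that has the same length as `phrase`."""
    n = len(phrase)
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(phrase)  # difflib caches seq2, so keep the phrase fixed there
    best = 0.0
    # max(1, ...) still compares once when the text is shorter than the phrase
    for start in range(max(1, len(text) - n + 1)):
        matcher.set_seq1(text[start:start + n])
        best = max(best, matcher.ratio())
    return best


# 0.7 mirrors the threshold of 70 used above; tune it for your own data.
matches = [w for w in words if best_window_ratio(text, w) > 0.7]
print(matches)  # ['angiotensin enzyme serum', 'angiotensin enzyme a1']
```

This runs one comparison per character offset for every phrase, so it gets slow on long texts or long word lists. For anything beyond small inputs the fuzzywuzzy version is probably the more practical choice.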