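I used the nltk library to extract titles from multiple documents in a folder; a single file can contain several titles. The extraction itself works fine. The only problem is that the result contains near-duplicate titles, for example:
Output: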
Daily Update (6/20) REDACTED.,
Daily Update (9/9).,
Daily Update (10/10).,
RE: General/ABC Update.
RE: General/ABC Update RELEASE IN PART BS.
General/ABC Update.
RE: General/ABC Article.
RE: General/ABC Articie.
Wrap Up for Friday, September 2017.,
Wrap Up for Monday, January 2018.,
Wrap Up for Monday, January 2018.
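My question is: can I apply fuzzy matching to clean this up, and if so, how? Or can I keep only the most important record in a file and drop the others?
What is the best way to handle this?
Here is the code I used: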
import re
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

# data1 holds the raw text of one document
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(data1)]
sent_idx_with_event = [idx for idx, sent in enumerate(tokenized_text)
                       if 'Event' in sent or 'Subject' in sent]
window = 1  # Number of sentences of context to keep before each match.
list1 = []
for idx in sent_idx_with_event:
    start = max(idx - window, 0)  # 0, not 1, so a match in the first sentence is kept
    end = min(idx + window, len(tokenized_text))
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    # Keep only the text after the last "Subject:" or "Event:" marker
    result = re.split("Subject:|Event:", result)[-1]
    # Normalize reply/forward prefixes (Re, R.E, Fw, FW, Fwd) to "RE"
    result = re.sub(r"\b(?:Re|R\.E|Fwd?|FW)\b", "RE", result)
    list1.append(result)
print(list1)
Finally, I removed the exact duplicates.
I expect a result like this:
Daily Update (6/20) REDACTED.,
Daily Update (9/9).,
Daily Update (10/10).
General/ABC Article.
Wrap Up for Friday, September 2017.,
Wrap Up for Monday, January 2018.,
Wrap Up for Monday, January 2018.
It seems you want a single title that is the "best match" for a given set of input titles. We can also use thefuzz for this:
import itertools
from thefuzz import fuzz

examples = {
    "RE: General/ABC Update.",
    "RE: General/ABC Update RELEASE IN PART BS.",
    "General/ABC Update.",
    "RE: General/ABC Article.",
    "RE: General/ABC Articie.",
}
score = {}
# Go through every ordered pair of distinct titles
for item1, item2 in itertools.permutations(examples, 2):
    # Calculate the Levenshtein ratio score (0 to 100)
    ratio = fuzz.ratio(item1, item2)
    # Accumulate the total similarity of item1 to all other titles
    score[item1] = score.get(item1, 0) + ratio
# Choose the single entry with the best score (closest match to all inputs)
result = max(score, key=score.get)
In the end, our output will be
'RE: General/ABC Update.'
This is the single string from examples that has the best match ratio against all the other strings in the input set.
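The snippet above picks one winner from a set that is already known to be a single group. For a full title list like the one in the question, the titles would first have to be split into groups. Here is a minimal sketch of one way to do that, not a definitive implementation. It makes several assumptions: it uses the standard library's difflib.SequenceMatcher as a stand-in for thefuzz's fuzz.ratio so it runs without extra packages, the threshold of 80 is an arbitrary starting point, titles whose numbers differ (dates, years) are treated as distinct, and the helper names similarity, cluster_titles, best_representative and dedupe_titles are made up for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Return a 0-100 score; a stdlib stand-in for a fuzzy ratio."""
    # Assumption: titles whose numbers differ (e.g. 9/9 vs 10/10) are distinct.
    if re.findall(r"\d+", a) != re.findall(r"\d+", b):
        return 0.0
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_titles(titles, threshold=80):
    """Greedy grouping: a title joins the first cluster whose first
    member scores at least `threshold`, otherwise it starts a new one."""
    clusters = []
    for title in dict.fromkeys(titles):  # drop exact duplicates, keep order
        for cluster in clusters:
            if similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


def best_representative(cluster):
    """Pick the member with the highest total similarity to the others."""
    totals = {title: 0.0 for title in cluster}
    for a, b in permutations(cluster, 2):
        totals[a] += similarity(a, b)
    return max(cluster, key=totals.get)


def dedupe_titles(titles, threshold=80):
    """Return one representative title per cluster, in input order."""
    return [best_representative(c) for c in cluster_titles(titles, threshold)]


if __name__ == "__main__":
    sample = [
        "Daily Update (9/9).,",
        "Daily Update (10/10).,",
        "Wrap Up for Monday, January 2018.,",
        "Wrap Up for Monday, January 2018.",
    ]
    print(dedupe_titles(sample))
```

The threshold needs tuning on real data. A plain ratio penalizes length differences, so a title with a long suffix such as "RELEASE IN PART BS." may not land in the same group as its shorter variant at 80; lowering the threshold, or switching to thefuzz's fuzz.partial_ratio or fuzz.token_set_ratio, are options to try.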