Remove duplicates and keep the earlier record


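I used the nltk library to extract titles from multiple documents in a folder, where a single file can contain several titles. The extraction works fine. The only problem is that the result contains near-duplicate titles, for example:

Output: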

 Daily Update (6/20) REDACTED.,
 Daily Update (9/9).,
 Daily Update (10/10).,
 RE: General/ABC Update.
 RE: General/ABC Update RELEASE IN PART BS.
 General/ABC Update.
 RE: General/ABC Article.
 RE: General/ABC Articie.
 Wrap Up for Friday, September 2017.,
 Wrap Up for Monday, January 2018.,
 Wrap Up for Monday, January 2018.

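My question: can I apply fuzzy matching to clean this up, and if so, how? Or can I keep only the top-most record in the file and delete the others?

What is the best way to handle this?

Here is the code I applied: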

import re
from itertools import chain
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize


tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(data1)]

sent_idx_with_event= [idx for idx, sent in enumerate(tokenized_text) 
                       if 'Event' in sent or 'Subject' in sent]

window = 1 # Context size: the slice below covers sentences idx - window .. idx + window - 1.

list1 = []
for idx in sent_idx_with_event:
    start = max(idx - window, 0)  # clamp at 0 so a match in the first sentence is not dropped
    end = min(idx+window, len(tokenized_text))
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    # Keep only the text after the last "Subject:" or "Event:" label.
    result = re.split("Subject:|Event:", result)[-1]
    # Normalize reply/forward prefixes (Re, R.E, Fw, FW, Fwd) to "RE".
    # Word boundaries keep words such as "Release" from being altered.
    result = re.sub(r"\b(?:Re|R\.E|Fwd|Fw|FW)\b", "RE", result)
    #print(result)
    
    list1.append(result)
print(list1) 

Finally, I removed the exact duplicates.

I expect a result like this:

 Daily Update (6/20) REDACTED.,
 Daily Update (9/9).,
 Daily Update (10/10).
 General/ABC Article.
 Wrap Up for Friday, September 2017.,
 Wrap Up for Monday, January 2018.,
 Wrap Up for Monday, January 2018.
pandas machine-learning nlp nltk
1 Answer

It seems you want the single title that is the "best match" for a given set of input titles. We can use thefuzz for this:

from thefuzz import fuzz
import itertools

examples = {
    "RE: General/ABC Update.",
    "RE: General/ABC Update RELEASE IN PART BS.",
    "General/ABC Update.",
    "RE: General/ABC Article.",
    "RE: General/ABC Articie.",
}

score = {}
#  Go through every ordered pair of distinct titles
for item1, item2 in itertools.permutations(examples, 2):
    #  Similarity score (0-100) based on Levenshtein edit distance
    ratio = fuzz.ratio(item1, item2)
    #  Accumulate each title's total similarity to all the others
    score[item1] = score.get(item1, 0) + ratio
#  Choose the single entry with the best score (closest match to all inputs)
result = max(score, key=score.get)

The resulting output is

'RE: General/ABC Update.'

This is the single string from examples that has the best overall match ratio against all the other strings in the input set.
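
If the goal is instead to keep the first record of each group of near-duplicates, as the question asks, one possible approach is a single pass that skips any title too similar to one already kept. The sketch below is only an illustration under several assumptions: it swaps thefuzz for the standard library's difflib.SequenceMatcher so it runs without extra installs, and the names normalize, dedupe_keep_first, PREFIX and the 0.9 threshold are hypothetical choices, not something from the question.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for leading reply/forward markers such as "RE:" or "Fwd:".
PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize(title):
    """Drop reply/forward prefixes and trailing '.' or ',', then lowercase."""
    return PREFIX.sub("", title).strip().rstrip(".,").lower()


def dedupe_keep_first(titles, threshold=0.9):
    """Return titles in input order, skipping near-duplicates of earlier ones.

    threshold is an assumed cut-off on SequenceMatcher.ratio() (0.0 to 1.0).
    """
    kept = []  # (original title, normalized title) pairs
    for title in titles:
        norm = normalize(title)
        is_duplicate = any(
            SequenceMatcher(None, norm, prev).ratio() >= threshold
            for _, prev in kept
        )
        if not is_duplicate:
            kept.append((title, norm))
    return [title for title, _ in kept]


titles = [
    "Daily Update (6/20) REDACTED.,",
    "Daily Update (9/9).,",
    "Daily Update (10/10).,",
    "RE: General/ABC Update.",
    "RE: General/ABC Update RELEASE IN PART BS.",
    "General/ABC Update.",
    "RE: General/ABC Article.",
    "RE: General/ABC Articie.",
    "Wrap Up for Friday, September 2017.,",
    "Wrap Up for Monday, January 2018.,",
    "Wrap Up for Monday, January 2018.",
]

for title in dedupe_keep_first(titles):
    print(title)
```

With these assumptions, "General/ABC Update." and "RE: General/ABC Articie." are dropped as repeats of earlier titles, while the three "Daily Update" entries stay separate because their dates differ enough. "RE: General/ABC Update RELEASE IN PART BS." is also kept, since it is much longer than the plain title; catching that case would need a lower threshold or a different measure, and a lower threshold would start merging the "Daily Update" entries too.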
