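I have several dictionaries that were saved to a file with pickle.dump.
I have now reloaded them and would like to compare them; if there are duplicate entries, they should be removed. The starting point is a key named "linkhash". If the value of this key occurs twice among the dicts, the duplicate should be deleted immediately. Finally, all remaining dictionaries should be dumped again and possibly saved to a file.
This is my code: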
import pickle

def load_pickle_raw(filename):
    # savefile is a path prefix that is defined elsewhere (not shown here)
    with open(savefile + filename, 'rb') as l_file:
        load_raw = pickle.load(l_file)
        # possible exception:
        # EOFError: Ran out of input, if the file cannot be loaded yet
    return load_raw

pickle_raw_new = load_pickle_raw('pickle_raw.txt')
print("pickle_raw_new")
print(len(pickle_raw_new))
for il in pickle_raw_new:
    print(il)
    print(il['linkhash'])

old_RawIfExist = load_pickle_raw('pickle_old_RawIfExist.txt')
print("###################")
print("old_RawIfExist")
print(len(old_RawIfExist))
for ild in old_RawIfExist:
    print(ild)
    print(ild['linkhash'])
An illustration of my pickled dictionaries:
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.url1.com/index.html', 'linkhash': 'c4a9fd5cbc11a08263ef3b33bb5cbebf', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.url2.com/index.html', 'linkhash': '4d70755b21c05c5fb1fa57293af5862a', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.ur3.com/index.html', 'linkhash': '1c49a456b1c53e2bcfd1926bd9a73f51', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.url.com/index.html', 'linkhash': '68d081d578c9b2a3221eb976da1cd6ff', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
I would like to know how to compare multiple dicts so that the duplicates are removed.
I created a folder named pickles and stored your four examples plus one duplicate in it.
import os
import pickle
# define the path to the folder containing the pickled files
folder_path = "pickles"
# get a list of the file names in the folder
file_names = os.listdir(folder_path)
objects = [] # create an empty list to store the loaded objects
linkhashes = {} # to check for duplicates
# loop through the file names and load each object
for file_name in file_names:
    # construct the full path to the file
    file_path = os.path.join(folder_path, file_name)
    # open the file for reading in binary mode
    with open(file_path, "rb") as f:
        content = pickle.load(f)
    if content['linkhash'] in linkhashes:
        print("Linkhash", content['linkhash'], "already known from", linkhashes[content['linkhash']])
    else:
        linkhashes[content['linkhash']] = file_name
        print("Read", content['linkhash'])
        # Store for later use (only the first occurrence of each linkhash)
        objects.append(content)
This is the output:
Read 1c49a456b1c53e2bcfd1926bd9a73f51
Linkhash 1c49a456b1c53e2bcfd1926bd9a73f51 already known from 1c49a456b1c53e2bcfd1926bd9a73f51_2
Read 4d70755b21c05c5fb1fa57293af5862a
Read 68d081d578c9b2a3221eb976da1cd6ff
Read c4a9fd5cbc11a08263ef3b33bb5cbebf
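Your own loading code iterates over each loaded object, which suggests that every pickle file holds a list of dicts rather than a single dict. If that is the case, the same idea can be applied to the lists directly. The following is only a minimal sketch under that assumption: the helper names load_pickle, drop_duplicates and save_pickle, the file name deduplicated.pkl and the shortened sample data are all made up for illustration. It keeps the first dict seen for each linkhash, treats an empty file (the EOFError from your comment) as an empty list, and dumps the result again.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickled object; an empty file is treated as an empty list."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except EOFError:
        # "Ran out of input": the file exists but contains no pickle yet
        return []


def drop_duplicates(*dict_lists, key="linkhash"):
    """Return one filtered list per input list.

    A dict is kept only if its key value has not been seen before,
    neither earlier in the same list nor in a previous list.
    """
    seen = set()
    result = []
    for dicts in dict_lists:
        kept = []
        for d in dicts:
            if d[key] in seen:
                continue  # duplicate linkhash: skip this dict
            seen.add(d[key])
            kept.append(d)
        result.append(kept)
    return result


def save_pickle(obj, path):
    """Dump obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


# Made-up sample data; only the 'linkhash' key matters for the comparison
new = [
    {"linkhash": "aaa", "link": "https://www.url1.com/index.html"},
    {"linkhash": "bbb", "link": "https://www.url2.com/index.html"},
    {"linkhash": "aaa", "link": "https://www.url1.com/index.html"},
]
old = [
    {"linkhash": "bbb", "link": "https://www.url2.com/index.html"},
    {"linkhash": "ccc", "link": "https://www.url.com/index.html"},
]

new_unique, old_unique = drop_duplicates(new, old)

# Write the remaining dicts to a (temporary) file and read them back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "deduplicated.pkl")
    save_pickle(new_unique + old_unique, path)
    restored = load_pickle(path)

print([d["linkhash"] for d in restored])
```

The order of the arguments decides which copy survives: lists passed first win, so pass the list whose entries you want to keep as the first argument.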