Compare many dictionaries loaded with pickle.load, remove duplicates, and dump them again

Question. Votes: 0, Answers: 1

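I have several dictionaries that were saved to a file with pickle.dump.

I have now reloaded them and would like to compare them so that duplicate entries are removed. The comparison is based on a key named "linkhash": if the same value for this key occurs in more than one dictionary, the duplicate should be deleted immediately. Finally, all remaining dictionaries should be dumped again and, if possible, saved to a file.

This is my code: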

import pickle


def load_pickle_raw(filename):
    # savefile is a path prefix defined elsewhere in the script (not shown here)
    with open(savefile + filename, 'rb') as l_file:
        load_raw = pickle.load(l_file)
        # possible exception:
        # EOFError: Ran out of input, if the file cannot be loaded yet
    return load_raw


pickle_raw_new = load_pickle_raw('pickle_raw.txt')
print("pickle_raw_new")
print(len(pickle_raw_new))
for il in pickle_raw_new:
    print(il)
    print(il['linkhash'])

old_RawIfExist = load_pickle_raw('pickle_old_RawIfExist.txt')
print("###################")
print("old_RawIfExist")
print(len(old_RawIfExist))
for ild in old_RawIfExist:
    print(ild)
    print(ild['linkhash'])

A sample of my pickled dictionaries:

{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.url1.com/index.html', 'linkhash': 'c4a9fd5cbc11a08263ef3b33bb5cbebf', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.url2.com/index.html', 'linkhash': '4d70755b21c05c5fb1fa57293af5862a', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.ur3.com/index.html', 'linkhash': '1c49a456b1c53e2bcfd1926bd9a73f51', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}
{'resort': 'Draft', 'line1': '"Li Europan lingues es membres del sam familie."', 'line2': 'Li Europan lingues es membres del sam familie.', 'snippet': 'Li Europan lingues es membres del sam familie.', 'link': 'https://www.url.com/index.html', 'linkhash': '68d081d578c9b2a3221eb976da1cd6ff', 'timestamp': '2023-04-11 19:32:10', 'status_create': 'False', 'status_release': 'False'}

I would like to know how to compare several dictionaries so that the duplicates are removed.

python dictionary compare pickle
1 Answer
0
votes

I created a folder named pickles and stored your four examples plus one duplicate in it, one pickled dictionary per file.
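To reproduce that setup, something like the following sketch could be used. It is only an assumption about how the folder was filled: the helper name `write_examples` is hypothetical, the records are cut down to the two keys that matter, and the `_2` suffix for the duplicate is inferred from the file name that appears in the output further down.

```python
import os
import pickle

# Shortened stand-ins for the four records in the question (only two keys kept).
records = [
    {"link": "https://www.url1.com/index.html", "linkhash": "c4a9fd5cbc11a08263ef3b33bb5cbebf"},
    {"link": "https://www.url2.com/index.html", "linkhash": "4d70755b21c05c5fb1fa57293af5862a"},
    {"link": "https://www.ur3.com/index.html", "linkhash": "1c49a456b1c53e2bcfd1926bd9a73f51"},
    {"link": "https://www.url.com/index.html", "linkhash": "68d081d578c9b2a3221eb976da1cd6ff"},
]


def write_examples(folder_path, records):
    """Pickle each record into its own file, named after its linkhash.

    A repeated linkhash gets a numeric suffix (_2, _3, ...) so that the
    duplicate does not overwrite the first file.
    """
    os.makedirs(folder_path, exist_ok=True)
    counts = {}
    for record in records:
        key = record["linkhash"]
        counts[key] = counts.get(key, 0) + 1
        name = key if counts[key] == 1 else f"{key}_{counts[key]}"
        with open(os.path.join(folder_path, name), "wb") as f:
            pickle.dump(record, f)
    return sorted(os.listdir(folder_path))


# Example call (writes five files: the four records plus one duplicate):
# write_examples("pickles", records + [records[2]])
```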

import os
import pickle

# define the path to the folder containing the pickled files
folder_path = "pickles"

# get a list of the file names in the folder
file_names = os.listdir(folder_path)

objects = [] # create an empty list to store the loaded objects
linkhashes = {} # to check for duplicates

# loop through the file names and load each object
for file_name in file_names:
    # construct the full path to the file
    file_path = os.path.join(folder_path, file_name)
    
    # open the file for reading in binary mode
    with open(file_path, "rb") as f:
        content = pickle.load(f)
        if content['linkhash'] in linkhashes:
            print("Linkhash", content['linkhash'], "already known from", linkhashes[content['linkhash']] )
        else:
            linkhashes[content['linkhash']] = file_name
            print("Read", content['linkhash'])

        # Store for later use
        objects.append(content)

This is the output:

Read 1c49a456b1c53e2bcfd1926bd9a73f51
Linkhash 1c49a456b1c53e2bcfd1926bd9a73f51 already known from 1c49a456b1c53e2bcfd1926bd9a73f51_2
Read 4d70755b21c05c5fb1fa57293af5862a
Read 68d081d578c9b2a3221eb976da1cd6ff
Read c4a9fd5cbc11a08263ef3b33bb5cbebf
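Note that the loop above only reports duplicates; it still appends every loaded object to objects. If the goal is to actually drop the duplicates and dump the result again, a minimal sketch could look like the following. It assumes, as the question's own loops suggest, that each loaded pickle is a list of dictionaries that all carry a 'linkhash' key. The function names `dedupe_by_linkhash` and `dump_records` and the output file name are made up for illustration.

```python
import pickle


def dedupe_by_linkhash(*record_lists):
    """Merge several lists of dicts, keeping the first dict seen per 'linkhash'."""
    seen = set()
    unique = []
    for records in record_lists:
        for record in records:
            key = record["linkhash"]
            if key in seen:
                continue  # duplicate linkhash: skip this record
            seen.add(key)
            unique.append(record)
    return unique


def dump_records(records, file_path):
    """Pickle the list of records into a single file."""
    with open(file_path, "wb") as f:
        pickle.dump(records, f)


# Usage with the variable names from the question:
# merged = dedupe_by_linkhash(pickle_raw_new, old_RawIfExist)
# dump_records(merged, "pickle_merged.txt")  # hypothetical file name
```

Because the first occurrence wins, the order in which the lists are passed decides which copy of a duplicated record survives.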