How to resolve memory problems when creating a large DataFrame in Python

Question

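Background:

I have a list of about 80,000 words that may contain spelling errors

(e.g., "apple" vs "applee" vs "  apple" vs "     aplee   ").

I first apply standard text cleaning (trimming, removing special characters, collapsing double spaces, and so on) and reduce the result to a unique list. I then plan to build a DataFrame grid by taking the words two at a time and to apply a fuzzy-scoring function to each pair to compare similarity.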
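The cleaning step described above could look roughly like the sketch below. The function names `clean_word` and `unique_clean_words` are hypothetical, and both the lower-casing and the exact definition of "special characters" (anything other than ASCII letters, digits, and whitespace) are assumptions, since the question does not spell them out.

```python
import re

def clean_word(word: str) -> str:
    """Normalize one raw word (assumed rules, not the asker's exact ones)."""
    word = re.sub(r"[^0-9A-Za-z\s]", "", word)  # drop special characters
    word = re.sub(r"\s+", " ", word)            # collapse runs of whitespace
    return word.strip().lower()                 # trim; lower-casing is an assumption

def unique_clean_words(words):
    """Clean every word and keep the first occurrence of each distinct result."""
    cleaned = dict.fromkeys(clean_word(w) for w in words)  # dict preserves order
    cleaned.pop("", None)                                  # discard words that became empty
    return list(cleaned)

raw_words = ["apple", "applee", "  apple", "     aplee   ", "App-le!"]
print(unique_clean_words(raw_words))
```

Deduplicating after cleaning matters here because the number of pairs grows with the square of the list length, so every duplicate removed saves many comparisons.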

Approach:

I use the itertools.combinations function to build the DataFrame grid:

#Sample python code
import itertools
import pandas as pd

#Step1: build every unordered pair of words as one DataFrame row
my_unique_list = ['apple','applee','aplee']
data_grid = pd.DataFrame(itertools.combinations(my_unique_list,2),columns = ['name1','name2'])

print(data_grid)


    name1   name2
0   apple   applee
1   apple   aplee
2   applee  aplee

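I defined a function that computes the fuzzy scores:

from fuzzywuzzy import fuzz  # assumed import, not shown in the question; thefuzz exposes the same fuzz module

def fuzzy_score_func(row):
    # Score one row (one pair of words) with two fuzzy-matching measures.
    fuzzywuzzy_partial_ratio = fuzz.partial_ratio(row['name1'],row['name2'])
    thefuzz_ratio = fuzz.ratio(row['name1'],row['name2'])

    return fuzzywuzzy_partial_ratio, thefuzz_ratio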

and I use apply to get the final scores:

#Step2:

data_grid[['partial_ratio','ratio']] = data_grid.apply(fuzzy_score_func,axis = 1, result_type='expand')

print(data_grid)

    name1   name2   partial_ratio   ratio
0   apple   applee  100             91
1   apple   aplee   80              80
2   applee  aplee   80              91

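This approach works well when the list has about 8k words, where checking all combinations gives a DataFrame of about 25 million rows.

But when I scale the list up to 80k words, I get a memory error in Step 1, while initializing the DataFrame with all possible combinations. That makes sense, given that the DataFrame would have roughly 3.2 billion rows (80,000 × 79,999 / 2):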

File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\frame.py:738, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    736         data = np.asarray(data)
    737     else:
--> 738         data = list(data)
    739 if len(data) > 0:
    740     if is_dataclass(data[0]):

MemoryError: 
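The size of the grid explains the error: a list of n words has n(n-1)/2 unordered pairs, and the traceback shows pandas calling list(data), which materializes all of them at once. A minimal sketch of the arithmetic follows; the helper name `pair_count` is hypothetical, and the per-pair byte figure is only a rough lower bound for 64-bit CPython, because it ignores the DataFrame's own storage.

```python
import math
import sys

def pair_count(n_words: int) -> int:
    """Number of unordered pairs that itertools.combinations(words, 2) yields."""
    return math.comb(n_words, 2)

n_words = 80_000
pairs = pair_count(n_words)

# Rough lower bound for list(data): one 2-tuple object per pair plus one
# 8-byte list slot pointing at it (the strings themselves are shared).
bytes_per_pair = sys.getsizeof(("a", "b")) + 8
estimated_gib = pairs * bytes_per_pair / 2**30

print(f"{n_words:,} words -> {pairs:,} pairs")
print(f"at least ~{estimated_gib:,.0f} GiB just for the list of tuples")
```

Whatever the exact per-tuple overhead is on a given build, the total is far more than the 32 GB of RAM available, so the pairs have to be processed without ever being held in memory together.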

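Any suggestions on how to resolve this memory problem, or is there a better way to approach the task? I tried exploring multiprocessing, nested loops, and so on, but without much success.

I am using an Intel Windows laptop: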

Processor: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz   3.00 GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor
python pandas numpy out-of-memory python-itertools
1 Answer

1 vote

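I would probably start by trying this with itertools alone, without pandas. The code below scores the pairs one at a time and writes only the acceptably close matches to a file: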

import csv
import itertools
import fuzzywuzzy.fuzz

MIN_RATIO = 90  # keep a pair only if either score reaches this threshold

## ----------------------
## the result of cleaning and filtering your input data...
## ----------------------
my_unique_list = ['apple','applee','aplee']
## ----------------------

## ----------------------
## Create a result file of acceptably close matches 
## ----------------------
with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
    processed = 0  # stays 0 if the list has fewer than two words
    for processed, (word1, word2) in enumerate(itertools.combinations(my_unique_list, 2), start=1):
        if processed % 1000 == 0:
            print(f"combinations processed: {processed}", end="\r", flush=True)

        partial_ratio = fuzzywuzzy.fuzz.partial_ratio(word1, word2)
        ratio = fuzzywuzzy.fuzz.ratio(word1, word2)
        if max(partial_ratio, ratio) >= MIN_RATIO:
            writer.writerow([word1, word2, partial_ratio, ratio])
    print()
    print(f"Total combinations processed: {processed}")
## ----------------------
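If you would rather keep part of the DataFrame workflow from the question, the same streaming idea can be applied in batches. The sketch below is one possible way to do that, using only the standard library; the helper name `pair_batches` and the batch size are hypothetical choices.

```python
import itertools

def pair_batches(words, batch_size):
    """Yield lists of at most batch_size pairs without building all pairs at once."""
    pairs = itertools.combinations(words, 2)  # lazy: nothing is generated yet
    while True:
        batch = list(itertools.islice(pairs, batch_size))  # pull the next slice only
        if not batch:
            return
        yield batch

words = ["apple", "applee", "aplee", "maple"]
for number, batch in enumerate(pair_batches(words, batch_size=4), start=1):
    print(f"batch {number}: {len(batch)} pairs, first pair {batch[0]}")
```

Each batch could then be passed to pd.DataFrame(batch, columns=['name1','name2']), scored with the apply call from the question, filtered, and appended to the output file, so that only one batch is in memory at any time.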

Although I am not a multiprocessing expert, something like this might help. You may want to test it on a smaller subset first:

import csv
import functools
import itertools
import multiprocessing

import fuzzywuzzy.fuzz

MIN_RATIO = 90  # keep a pair only if either score reaches this threshold

def get_ratios(pair, queue):
    # Runs in a worker process: score one pair and queue it if it is close enough.
    partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
    ratio = fuzzywuzzy.fuzz.ratio(*pair)
    if max(partial_ratio, ratio) >= MIN_RATIO:
        queue.put(list(pair) + [partial_ratio, ratio])

def main(my_unique_list):
    with multiprocessing.Manager() as manager:
        queue = manager.Queue()

        with multiprocessing.Pool(processes=8) as pool:
            # imap_unordered pulls pairs from the generator as it dispatches them;
            # pool.map would first convert all combinations to a list, which is
            # the same memory problem as in the question.
            scored = pool.imap_unordered(functools.partial(get_ratios, queue=queue), itertools.combinations(my_unique_list, 2), chunksize=1000)
            for _ in scored:
                pass  # the matches themselves are collected through the queue
            pool.close()
            pool.join()

        with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
            writer = csv.writer(file_out)
            writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
            while not queue.empty():
                item = queue.get()
                writer.writerow(item)
                print(item)

if __name__ == "__main__":
    ## ----------------------
    ## the result of cleaning and filtering your input data...
    ## ----------------------
    my_unique_list = ['apple','applee','aplee']
    ## ----------------------

    main(my_unique_list)
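A possible variation, sketched below under stated assumptions, lets each worker return its row (or None) and has the parent process write matches as they arrive, so no matches accumulate in a shared queue. The names `score_pair` and `write_matches` are hypothetical. To keep the sketch runnable without third-party packages it scores pairs with `difflib.SequenceMatcher` from the standard library as a stand-in; that is a swapped-in scorer, not the fuzzywuzzy functions used above, and its scores can differ.

```python
import csv
import difflib
import itertools
import multiprocessing

MIN_RATIO = 90  # same threshold as in the code above

def score_pair(pair):
    """Return [word1, word2, ratio] for an acceptably close pair, else None."""
    word1, word2 = pair
    # Stand-in scorer (assumption): difflib instead of fuzzywuzzy.fuzz.ratio.
    ratio = round(100 * difflib.SequenceMatcher(None, word1, word2).ratio())
    return [word1, word2, ratio] if ratio >= MIN_RATIO else None

def write_matches(words, path, processes=4, chunksize=1000):
    """Score all pairs in worker processes and stream the matches to a CSV file."""
    pairs = itertools.combinations(words, 2)  # lazy generator of pairs
    written = 0
    with multiprocessing.Pool(processes=processes) as pool, \
            open(path, "w", encoding="utf-8", newline="") as file_out:
        writer = csv.writer(file_out)
        writer.writerow(["name1", "name2", "ratio"])
        # imap_unordered yields results as workers finish them; rows are
        # written immediately and non-matches (None) are dropped.
        for row in pool.imap_unordered(score_pair, pairs, chunksize=chunksize):
            if row is not None:
                writer.writerow(row)
                written += 1
    return written
```

On Windows, call it from under an `if __name__ == "__main__":` guard, for example `write_matches(my_unique_list, "good_matches.csv")`, exactly as the code above does with main. Rows arrive in no particular order, so sort the output file afterwards if order matters.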