Reading a large number of files in a directory causes a memory error


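Thank you for reading. I have a dataset of 2.6 million Java files, and I am trying to strip whitespace and comments from them. I wrote a function that processes the files one at a time. After processing 34,501 files, it raised a memory error. `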


import os
import re

# Folder Path
path = r"H:\DataZip\dataset\dataset\selected"

# Change the directory
os.chdir(path)


# Read text File

counter = 1
def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:

        file = f.read()
        wtalc = re.sub(r"//.*?\n", '', file)
        wtslc = re.sub(r"/\*.*?\*/", '', wtalc, flags=re.S)
        x = re.sub(r"import.*?\n", '', wtslc)
        x = re.sub(r"package.*?\n", '', x)
        y = re.sub(r"@Override.*?\n", '', x)
        y = y.split("\n")
        regex = re.compile(r' Data_Type \s+([^=;,:\(\)]+)')
        result = []
        for line in y:
            result.extend(regex.findall(line))


        lis = []
        for d in y:
            for i in result:
                d = d.replace(i, "data_variable")
            lis.append(d)
        new_list = list(filter(lambda x: x != '', lis))



    # The with-block closes the file on exit, so no explicit close() is needed
    with open(file_path, 'w', encoding='utf-8', errors='ignore') as file:
        for item in new_list:
            item = item.strip()
            file.write(item + "\n")
    global counter
    print(f"{file_path} is being preprocessed and total number of preprocessed files : {counter}")
    counter = counter+1



for file in os.listdir():
    # Check whether file is in text format or not
    if file.endswith(".java"):
        file_path = os.path.join(path, file)

        # call read text file function
        read_text_file(file_path)

` Error
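One thing that may be worth checking in the code above, offered as an assumption rather than a confirmed diagnosis: `result` can hold the same captured name many times, and a short name such as `a` also occurs inside the placeholder `data_variable`, so every pass of the inner `replace` loop can multiply the length of a line. The sketch below reproduces that growth with made-up input and compares it with a single-pass substitution. The helper `replace_names` is hypothetical, and unlike the original loop it matches whole words only.

```python
import re

PLACEHOLDER = "data_variable"

def replace_repeatedly(line, names):
    # Mirrors the nested loop in the question: one str.replace per captured name.
    for name in names:
        line = line.replace(name, PLACEHOLDER)
    return line

def replace_names(line, names):
    # Hypothetical alternative: deduplicate the names, then substitute
    # whole words in a single pass so replaced text is never rescanned.
    unique = sorted({n.strip() for n in names if n.strip()}, key=len, reverse=True)
    if not unique:
        return line
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, unique)) + r")\b")
    return pattern.sub(PLACEHOLDER, line)

line = "int a = 1;"
names = ["a"] * 10  # made-up input: the same one-letter name captured ten times

grown = replace_repeatedly(line, names)
fixed = replace_names(line, names)
print(len(line), len(grown), len(fixed))
```

If the printed length of `grown` is orders of magnitude larger than the input, a single file with many repeated short names could plausibly exhaust memory on its own, regardless of how many files came before it.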

Is this error caused by all the files being loaded into memory at once? I believed my function reads the files one at a time.
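One way to test that belief is to iterate over the directory lazily and record which file, if any, triggers the error. This is a minimal sketch, not a fix: `process_directory` and its `handler` parameter are hypothetical names, and it assumes the per-file function raises `MemoryError` rather than crashing the interpreter.

```python
import os

def process_directory(path, handler, suffix=".java"):
    """Call handler(file_path) for each matching file, one at a time.

    Returns (processed_count, failures), where failures lists
    (file_path, size_in_bytes) for files that raised MemoryError.
    """
    processed = 0
    failures = []
    # os.scandir yields entries lazily instead of building one big list.
    with os.scandir(path) as entries:
        for entry in entries:
            if not (entry.is_file() and entry.name.endswith(suffix)):
                continue
            try:
                handler(entry.path)
                processed += 1
            except MemoryError:
                # Record the offending file and continue with the rest.
                failures.append((entry.path, entry.stat().st_size))
    return processed, failures
```

Calling `process_directory(path, read_text_file)` with the function from the question would then show whether the failure is tied to one particular file (for example, an unusually large one) rather than to the number of files processed so far.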

Thank you very much for your time.

python python-3.x loops file-handling