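Thank you for reading this post. I have a dataset of 2.6 million Java files, and I am trying to remove whitespace and comments from them. I wrote a function that processes the files one at a time. After processing 34,501 files, it raises a memory error: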
import os
import re

# Folder path
path = r"H:\DataZip\dataset\dataset\selected"
# Change the directory
os.chdir(path)

# Number of the file currently being preprocessed
counter = 1

def read_text_file(file_path):
    global counter
    # Read the whole file into memory
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        file = f.read()
    # Strip single-line comments, then block comments
    wtalc = re.sub(r"//.*?\n", '', file)
    wtslc = re.sub(r"/\*.*?\*/", '', wtalc, flags=re.S)
    # Strip import, package and @Override lines
    x = re.sub(r"import.*?\n", '', wtslc)
    x = re.sub(r"package.*?\n", '', x)
    y = re.sub(r"@Override.*?\n", '', x)
    y = y.split("\n")
    # Collect the names that follow the data type
    regex = re.compile(r' Data_Type \s+([^=;,:\(\)]+)')
    result = []
    for line in y:
        result.extend(regex.findall(line))
    # Replace every collected name with a placeholder
    lis = []
    for d in y:
        for i in result:
            d = d.replace(i, "data_variable")
        lis.append(d)
    new_list = list(filter(lambda x: x != '', lis))
    # Overwrite the file; the with block closes it automatically
    with open(file_path, 'w', encoding='utf-8', errors='ignore') as file:
        for item in new_list:
            item = item.strip()
            file.write(item + "\n")
    print(f"{file_path} is being preprocessed and total number of preprocessed files : {counter}")
    counter = counter + 1

for file in os.listdir():
    # Check whether the file is a Java source file
    if file.endswith(".java"):
        file_path = os.path.join(path, file)
        # Call the preprocessing function
        read_text_file(file_path)
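Is this error happening because all the files are being loaded into memory at once? I believed my function reads the files one at a time.
Thank you very much for your time.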
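To narrow this down, here is a minimal diagnostic sketch. It assumes the problem may come from one unusually large or unusual file rather than from memory building up across files, which I have not confirmed. The names `find_problem_files` and `process` are hypothetical and not part of my script; `process` stands for any per-file function such as `read_text_file`. The sketch walks the folder lazily with `os.scandir`, catches `MemoryError` for each file, and records the path and size of every file that fails.

```python
import os

def find_problem_files(folder, process, suffix=".java"):
    """Call process(path) on each matching file in folder.

    Returns a list of (path, size_in_bytes) for files whose
    processing raised MemoryError, so they can be inspected later.
    """
    failures = []
    # os.scandir yields entries lazily instead of building a full list of names
    with os.scandir(folder) as entries:
        for entry in entries:
            if not (entry.is_file() and entry.name.endswith(suffix)):
                continue
            try:
                process(entry.path)
            except MemoryError:
                # Record the offending file and keep going with the rest
                failures.append((entry.path, entry.stat().st_size))
    return failures
```

Calling `find_problem_files(path, read_text_file)` would then report which files fail and how large they are, which should show whether a single input is responsible.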