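Once the user supplies a source directory, the script below reads the list of CSV files in it. It then takes one CSV and copies its contents row by row into a new CSV until 100,000 rows have been reached, at which point it creates a new CSV and continues until the original CSV has been copied in full. It then repeats the process for the next CSV file in the directory.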
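I sometimes hit the PermissionError mentioned above and am not sure how to fix it, yet at other times the script runs without any problem. I have verified that neither the input nor the output files are open on my machine. I have also tried changing the directory folder's properties so that it is not read-only, although the setting always reverts. When the error does occur, it is always within the first few seconds of starting to process a CSV. Once about 5 seconds have passed, the error no longer occurs for that CSV, but it may come up again later when a new input CSV is reached.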
"""
Script processes all csv's in a provided directory and returns
csv's with a maximum of 100,000 rows
"""
import csv
import pathlib
import argparse
import os
import glob


def _get_csv_list(
    *, description: str = "Process csv file directory.",
):
    """
    Uses argument parser to set up working directory, then
    extracts list of csv file names from directory
    Args: Directory string
    Returns list of csv file name strings
    """
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "SRC", type=pathlib.Path, help="source (input) directory"
    )
    parsed_arg = parser.parse_args()
    os.chdir(parsed_arg.SRC)
    return glob.glob("*.{}".format("csv"))


def _process_csv(file_name):
    """
    Iterates through csv file and copies each row to output
    file. Once 100,000 rows is reached, a new file is started
    Args: file name string
    """
    file_index = 0
    max_records_per_file = 100_000
    with open(file_name) as _file:
        reader = csv.reader(_file)
        first_line = _file.readline()
        first_line_list = first_line.split(",")
        for index, row in enumerate(reader):
            if index % max_records_per_file == 0:
                file_index += 1
                with open(
                    f"output_{file_name.strip('.csv')}_{file_index}.csv",
                    mode="xt",
                    encoding="utf-8",
                    newline="\n",
                ) as buffer:
                    writer = csv.writer(buffer)
                    writer.writerow(first_line_list)
            else:
                try:
                    with open(
                        f"output_{file_name.strip('.csv')}_{file_index}.csv",
                        mode="at",
                        encoding="utf-8",
                        newline="\n",
                    ) as buffer:
                        writer = csv.writer(buffer)
                        writer.writerow(row)
                except FileNotFoundError as error:
                    print(error)
                    with open(
                        f"output_{file_name.strip('.csv')}_{file_index}.csv",
                        mode="xt",
                        encoding="utf-8",
                        newline="\n",
                    ) as buffer:
                        writer = csv.writer(buffer)
                        writer.writerow(first_line_list)
                        writer.writerow(row)


def main():
    """
    Primary function for limiting csv file size
    Cmd Line: python csv_row_limiter.py . (Replace '.' with other path
    if csv_row_limiter.py directory and csv directory are different)
    """
    csv_list = _get_csv_list()
    for file_name in csv_list:
        _process_csv(file_name)


if __name__ == "__main__":
    main()
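Also, note that the only requirements for the input CSVs are that they have a large number of rows (100,000+) and contain some amount of data.
Any ideas on how to resolve this?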
Try running it as root, i.e. run this Python script with root or su privileges. In other words, log in as the root user and then run the script. Hope this helps.
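Separately from privileges, it may be worth noting that the script in the question reopens the output file in append mode for every single row, so each chunk goes through roughly 100,000 open/close cycles. If something else on the machine (a virus scanner, a file indexer, a sync client) briefly holds a freshly created file, any one of those opens could plausibly fail with a PermissionError, which would fit the "only in the first few seconds" pattern. This is a guess, not a confirmed diagnosis. The sketch below keeps each output file open for the whole chunk instead. The function name `split_csv` and its parameters are hypothetical, and the input is assumed to be UTF-8 with a header row. It also uses `Path.stem` rather than `str.strip('.csv')`, because `strip` removes any of the characters `.`, `c`, `s`, `v` from both ends (for example, `scores.csv` would become `ore`), and it writes the row that starts each new chunk, which the original loop appears to skip.

```python
import csv
from pathlib import Path


def split_csv(src, max_rows=100_000, out_dir=None, encoding="utf-8"):
    """Split src into output_<stem>_<n>.csv files with at most max_rows data rows.

    Hypothetical helper: the name, parameters and UTF-8 default are assumptions.
    Returns the list of output paths that were written.
    """
    src = Path(src)
    out_dir = Path(out_dir) if out_dir is not None else src.parent
    written = []
    # newline="" lets the csv module handle line endings itself.
    with open(src, newline="", encoding=encoding) as in_file:
        reader = csv.reader(in_file)
        header = next(reader, None)  # parse the header with csv, not str.split
        if header is None:
            return written  # empty input: nothing to split
        out_file = None
        writer = None
        try:
            for index, row in enumerate(reader):
                if index % max_rows == 0:
                    # Start a new chunk: close the previous file, open the next once.
                    if out_file is not None:
                        out_file.close()
                    out_path = out_dir / f"output_{src.stem}_{len(written) + 1}.csv"
                    # mode="x" still refuses to overwrite an existing output file.
                    out_file = open(out_path, mode="x", newline="", encoding=encoding)
                    writer = csv.writer(out_file)
                    writer.writerow(header)
                    written.append(out_path)
                writer.writerow(row)
        finally:
            if out_file is not None:
                out_file.close()
    return written
```

In the question's script, the body of `_process_csv(file_name)` could then be reduced to a call such as `split_csv(file_name)`. If the error still appears with only one open per chunk, the cause is more likely to be the directory's permissions than the repeated reopening.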