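Here is my code. I am using it to find a set of text files in a folder and then parse the data they contain into a CSV file. Could you give me some pointers on how to do this? I have been struggling with it.
The first step of my code detects where the data sits in each txt file. I found that all the data starts with "Read", so I locate the line where the data begins in each file. After that, I am stuck on how to export the data to a CSV file.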
import os
import argparse
import csv
from typing import List


def validate_directory(path):
    # argparse turns ArgumentTypeError into a clean usage error instead of a traceback.
    if os.path.isdir(path):
        return path
    raise argparse.ArgumentTypeError(f"not a directory: {path}")


def get_data_from_file(file) -> List[str]:
    # Lines containing any of these phrases are not data markers.
    ignore_list = ["Read Segment", "Read Disk", "Read a line", "Read in"]
    data = []
    with open(file, "r", encoding="latin1") as f:
        try:
            lines = f.readlines()
        except Exception as e:
            print(f"Unable to process {file}: {e}")
            return []
    for line_number, line in enumerate(lines, start=1):
        if not any(variation in line for variation in ignore_list):
            if line.strip().startswith("Read ") and not line.strip().startswith("Read ("):  # TODO: fix this with better regex
                data.append(f'Found "Read" at line {line_number} in {file}')
                print(f'Found "Read" at {file}:{line_number}')
                print(lines[line_number - 1])
    return data


def list_read_data(directory_path: str) -> List[str]:
    # Walk the whole tree and collect the matches from every .txt file.
    total_data = []
    for root, _, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".txt"):
                data = get_data_from_file(os.path.join(root, file_name))
                total_data.extend(data)
    return total_data


def write_results_to_csv(output_file: str, data: List[str]):
    # One single-column row per match, under a "Results" header.
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Results"])
        for line in data:
            writer.writerow([line])


def main(directory_path: str, output_file: str):
    data = list_read_data(directory_path)
    write_results_to_csv(output_file, data)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process the 2020Model folder for input data."
    )
    # required=True: without it a missing --directory reaches os.path.abspath(None).
    parser.add_argument(
        "--directory", type=validate_directory, required=True, help="folder to be processed"
    )
    parser.add_argument("--output", type=str, help="Output file name (e.g., outputfile.csv)", default="outputfile.csv")
    args = parser.parse_args()
    main(os.path.abspath(args.directory), args.output)
Below is my ideal CSV output:
1985 | 1986 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 |
---|---|---|---|---|---|---|---|---|---|---|
37839 | 36962 | 37856 | 41971 | 40838 | 44640.87 | 42826.34 | 44883.03 | 43077.59 | 45006.49 | 46789 |
Could you give me some pointers?
Below is a sample txt file:
Select Year(2007-2025)
Read TotPkSav
/2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025
00 27 53 78 108 133 151 161 169 177 186 195 205 216 229 242 257 273 288
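If all of your files look like those 4 lines, I recommend turning each file into a list of lines up front instead of trying to step/iterate through the lines. I also recommend just using glob with recursive=True and not trying to walk the tree yourself.
Because the files are read inside a for-loop, skipping any file with a bad property is as simple as `continue`-ing to the next file in the loop: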
import csv
import glob

all_rows: list[list[str]] = []

# "**" with recursive=True matches .txt files here and in every subdirectory.
for fname in glob.glob("**/*.txt", recursive=True):
    with open(fname, encoding="iso-8859-1") as f:
        print(f"reading {fname}")
        lines = [x.strip() for x in f]

    # Expect exactly four lines: selector, "Read <name>", years, values.
    if len(lines) != 4:
        print(f'skipping {fname} with too few lines"')
        continue

    line2 = lines[1]
    if not line2.startswith("Read") or line2.startswith("Read ("):
        print(f'skipping {fname} with line2 = "{line2}"')
        continue

    line3, line4 = lines[2:4]
    # Drop the "/" in front of the first year.
    if line3.startswith("/"):
        line3 = line3[1:]

    # split() with no argument collapses runs of whitespace.
    header = line3.split()
    data = line4.split()

    all_rows.append(header)
    all_rows.append(data)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Result"])
    writer.writerows(all_rows)
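The line-2 check can also be written as a regular expression, which is what the TODO in the question's code asks for. This is only a minimal sketch under assumptions: that a data marker is the word Read followed by exactly one identifier, and that the single-word entries of the question's ignore_list are the only names to exclude. The names READ_MARKER, IGNORED_NAMES and read_variable are hypothetical, not part of the code above.

```python
import re
from typing import Optional

# Assumption: a data marker is "Read", whitespace, then one identifier and nothing else.
READ_MARKER = re.compile(r"Read\s+(\w+)")

# Assumption: single-word variants taken from the question's ignore_list.
IGNORED_NAMES = {"Segment", "Disk", "in"}


def read_variable(line: str) -> Optional[str]:
    """Return the name from a 'Read <name>' line, or None if the line is not a marker."""
    match = READ_MARKER.fullmatch(line.strip())
    if match is None or match.group(1) in IGNORED_NAMES:
        return None
    return match.group(1)


print(read_variable("Read TotPkSav"))    # the variable name
print(read_variable("Read (TotPkSav)"))  # None: parentheses are not word characters
```

Because fullmatch has to consume the whole stripped line, multi-word variants such as "Read a line" are rejected without listing them explicitly.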
I mocked up a few more files and spread them around my tree:
- .
  - a
    input3.txt
  - b
    foo.txt
  input1.txt
  input2.txt
  main.py
When I run that program from the root of that tree, I get:
reading input1.txt
reading input2.txt
skipping input2.txt with line2 = "Read (TotPkSav)"
reading a/input3.txt
reading b/foo.txt
skipping b/foo.txt with too few lines"
and output.csv looks like:
| Result |
|--------|
| 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| 00 | 27 | 53 | 78 | 108 | 133 | 151 | 161 | 169 | 177 | 186 | 195 | 205 | 216 | 229 | 242 | 257 | 273 | 288 |
| 2099 | 2098 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| 00 | 27 | 53 | 78 | 108 | 133 | 151 | 161 | 169 | 177 | 186 | 195 | 205 | 216 | 229 | 242 | 257 | 273 | 288 |
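If you would rather have one shared header row, as in the ideal output in the question, the header/data pairs could be merged before writing. The following is a rough sketch, not a drop-in replacement: merge_tables is a hypothetical helper, it assumes each file's year labels are unique within that file, and it returns the CSV text as a string so it is easy to inspect.

```python
import csv
import io


def merge_tables(tables: list[tuple[list[str], list[str]]]) -> str:
    """Return CSV text with one shared header row and one data row per input table."""
    # Collect every column label once, keeping first-seen order.
    columns: list[str] = []
    for header, _ in tables:
        for label in header:
            if label not in columns:
                columns.append(label)

    buf = io.StringIO()
    # restval fills the cells for years a given file does not contain.
    writer = csv.DictWriter(buf, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for header, data in tables:
        # zip() pairs each label with its value; unmatched extras on either side are dropped.
        writer.writerow(dict(zip(header, data)))
    return buf.getvalue()


sample = [
    (["2007", "2008", "2009"], ["00", "27", "53"]),
    (["2008", "2009", "2010"], ["5", "6", "7"]),
]
print(merge_tables(sample))
```

In the loop above you would append (header, data) tuples to a list instead of appending the two rows separately, then write the returned text to output.csv.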