Detect data in txt files, parse the strings, and output to a CSV file

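Here is my code. I am using it to scan a set of text files in a folder, then parse the data strings and write them to a CSV file. Could you give me some hints on how to do this? I am really struggling.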

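The first step of my code is to detect where the data sits in each txt file. I found that every data block starts with "Read", so I located the line where the data begins in each file. After that, I got stuck on how to export the data to a CSV file.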

import os
import argparse
import csv
from typing import List


def validate_directory(path):
    if os.path.isdir(path):
        return path
    else:
        raise NotADirectoryError(path)


def get_data_from_file(file) -> List[str]:
    ignore_list = ["Read Segment", "Read Disk", "Read a line", "Read in"]
    data = []
    with open(file, "r", encoding="latin1") as f:
        try:
            lines = f.readlines()
        except Exception as e:
            print(f"Unable to process {file}: {e}")
            return []
        for line_number, line in enumerate(lines, start=1):
            if not any(variation in line for variation in ignore_list):
                if line.strip().startswith("Read ") and not line.strip().startswith("Read ("): # TODO: fix this with better regex
                    data.append(f'Found "Read" at line {line_number} in {file}')
                    print(f'Found "Read" at {file}:{line_number}')
                    print(lines[line_number-1])
    return data


def list_read_data(directory_path: str) -> List[str]:
    total_data = []
    for root, _, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".txt"):
                data = get_data_from_file(os.path.join(root, file_name))
                total_data.extend(data)

    return total_data


def write_results_to_csv(output_file: str, data: List[str]):
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Results"])
        for line in data:
            writer.writerow([line])


def main(directory_path: str, output_file: str):
    data = list_read_data(directory_path)
    write_results_to_csv(output_file, data)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process the 2020Model folder for input data."
    )
    parser.add_argument(
        "--directory", type=validate_directory, help="folder to be processed"
    )
    parser.add_argument("--output", type=str, help="Output file name (e.g., outputfile.csv)", default="outputfile.csv")

    args = parser.parse_args()
    main(os.path.abspath(args.directory), args.output)

Below is my ideal CSV output:

1985 1986 1986 1987 1988 1989 1990 1991 1992 1993 1994
37839 36962 37856 41971 40838 44640.87 42826.34 44883.03 43077.59 45006.49 46789

Could you give me some hints on:

  • Where should the string parsing go?
  • How do I write the output to a CSV file?

Here is a sample txt file:

Select Year(2007-2025)
Read TotPkSav
/2007     2008     2009     2010     2011     2012     2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023     2024     2025 
   00       27       53       78      108      133      151      161      169      177      186      195      205      216      229      242      257      273      288 
python string csv string-parsing
1 Answer

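If all of your files look like these 4 lines, I suggest reading each file into a list of lines up front rather than stepping through the lines one at a time. I also suggest using glob with recursive=True instead of walking the tree yourself.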
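As a minimal sketch of the glob suggestion, the helper below collects every .txt file under a given folder. The name find_txt_files is hypothetical, and rooting the pattern at a folder argument (rather than the current working directory) is an assumption made so that it fits the --directory option in the question.

```python
import glob
import os
from typing import List


def find_txt_files(root: str) -> List[str]:
    """Return every .txt file under root, including those in subfolders."""
    # "**" matches zero or more directory levels when recursive=True,
    # so files directly inside root are found as well.
    pattern = os.path.join(root, "**", "*.txt")
    return sorted(glob.glob(pattern, recursive=True))
```

Calling find_txt_files(args.directory) would then take the place of the nested os.walk loops.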

Because each file is read inside a for loop, you can skip any file with an undesirable property simply by using continue to move on to the next file in the loop:

import csv
import glob

all_rows: list[list[str]] = []

for fname in glob.glob("**/*.txt", recursive=True):
    with open(fname, encoding="iso-8859-1") as f:
        print(f"reading {fname}")
        lines = [x.strip() for x in list(f)]

        if len(lines) != 4:
            print(f'skipping {fname} with too few lines"')
            continue

        line2 = lines[1]
        if line2[:4] != "Read" or line2[:6] == "Read (":
            print(f'skipping {fname} with line2 = "{line2}"')
            continue

        line3, line4 = lines[2:4]

        # startswith() is safe even when line3 is empty, unlike line3[0]
        if line3.startswith("/"):
            line3 = line3[1:]

        header = [x for x in line3.split(" ") if x]
        data = [x for x in line4.split(" ") if x]
      
        all_rows.append(header)
        all_rows.append(data)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Result"])
    writer.writerows(all_rows)
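To answer the "where does the string parsing go" part more directly, the parsing can be pulled out into its own function that takes the text of one file and returns the header and data cells. This is only a sketch under the same 4-line assumption; parse_read_block is a hypothetical name, the sample is shortened to three columns, and the stricter "Read " prefix (with a trailing space) is borrowed from the question's code rather than from the loop above.

```python
from typing import List, Optional, Tuple


def parse_read_block(text: str) -> Optional[Tuple[List[str], List[str]]]:
    """Return (header, values) for a 4-line "Read" file, or None if the layout differs."""
    lines = [line.strip() for line in text.splitlines()]
    if len(lines) != 4:
        return None
    if not lines[1].startswith("Read ") or lines[1].startswith("Read ("):
        return None
    # str.split() with no argument splits on any run of whitespace,
    # so there are no empty strings to filter out afterwards.
    header = lines[2].lstrip("/").split()
    values = lines[3].split()
    return header, values


# Shortened version of the sample file from the question.
sample = (
    "Select Year(2007-2025)\n"
    "Read TotPkSav\n"
    "/2007     2008     2009 \n"
    "   00       27       53 \n"
)
print(parse_read_block(sample))
# → (['2007', '2008', '2009'], ['00', '27', '53'])
```

Keeping the parsing separate from the file loop makes it easy to try out on a string before pointing it at a whole folder.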

I mocked up a few more files and spread them across my tree:

 .
    a/
        input3.txt
    b/
        foo.txt
    input1.txt
    input2.txt
    main.py

When I run the program from the root of that tree, I get:

reading input1.txt
reading input2.txt
skipping input2.txt with line2 = "Read (TotPkSav)"
reading a/input3.txt
reading b/foo.txt
skipping b/foo.txt with too few lines"

And output.csv looks like:

Result
2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025
00,27,53,78,108,133,151,161,169,177,186,195,205,216,229,242,257,273,288
2099,2098,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025
00,27,53,78,108,133,151,161,169,177,186,195,205,216,229,242,257,273,288
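Note that the single-cell Result header gives the first row a different column count from the rows that follow. If every row should have the same shape, as in the ideal output in the question, one option is to leave that header row out. The sketch below writes to an in-memory buffer so the resulting text is easy to inspect; rows_to_csv_text is a hypothetical helper, and the rows are shortened to three columns.

```python
import csv
import io
from typing import List


def rows_to_csv_text(rows: List[List[str]]) -> str:
    """Write rows with csv.writer and return the resulting CSV text."""
    # newline="" leaves the writer's own "\r\n" line endings untouched,
    # mirroring open(..., newline="") when writing to a real file.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["2007", "2008", "2009"], ["00", "27", "53"]]
print(rows_to_csv_text(rows))
```

Writing to a real file works the same way: pass the object returned by open("output.csv", "w", newline="", encoding="utf-8") to csv.writer instead of the buffer.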