Python: filtering and extracting text


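I'm still new to coding, and I recently came across a problem that I'd like to try solving with Python. Below is the text I want to query, extracting certain fields into a new file. The content repeats and can run to thousands of lines. So far I can only parse and output the first two columns, and even those still look wrong. Hoping for some guidance here. Cheers!

Original TXT file: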


Classroom arrangement : 1A-1
(Student Name: Jess, Subject: EC001, Time: 9am - 10am)
(Student Name: Whit, Subject: EC001, Time: 9am - 10am)
(Student Name: Jon, Subject: EC0011, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC011, Time: 11am - 12pm)
(Student Name: Jess, Subject: EC011, Time: 11am - 12pm)


Classroom arrangement : 1A-2
(Student Name: Jess, Subject: EC002, Time: 11am - 12pm)
(Student Name: Whit, Subject: EC002, Time: 11am - 12pm)
(Student Name: Jon, Subject: EC002, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC002, Time: 11am - 12pm)
(Student Name: Claire, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Joshua, Subject: EC0011, Time: 2pm - 3pm)
(Student Name: Florence, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Neil, Subject: EC011, Time: 2am - 3pm)

Expected output:


Classroom: 1A-1, Jess, Subject: EC001 Time: 9am - 10am, Subject: EC011, Time: 11am - 12pm
Classroom: 1A-1, Whit, Subject: EC001 Time: 9am - 10am
Classroom: 1A-1, Jon, Subject: EC0011 Time: 11am - 12pm
Classroom: 1A-1, Kevin, Subject: EC011 Time: 11am - 12pm
Classroom: 1A-2, Jess, Subject: EC002 Time: 11am - 12pm
Classroom: 1A-2, Jon, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Whit, Subject: EC002 Time: 11am - 12pm
Classroom: 1A-2, Kevin, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Claire, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Joshua, Subject: EC0011, Time: 2pm - 3pm
Classroom: 1A-2, Florence, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Neil, Subject: EC011, Time: 2am - 3pm

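I tried passing the lines I read into a module before printing the output to the console, but that seems wrong, because I need to prefix every line with the classroom, 1A-1.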

Current output:


Class 1A-1
Jess, Subject: EC001 Time: 9am - 10am
Jess, Subject: EC011, Time: 11am - 12pm
Whit, Subject: EC001 Time: 9am - 10am
Jon, Subject: EC0011 Time: 11am - 12pm
Kevin, Subject: EC011 Time: 11am - 12pm

python pandas filter extract
1 Answer

Here is one solution. You will need to adjust the input and output file paths:

classroom.py

import collections


def ingest(infilepath):
    """
    Read all the input from the input file.
    Store it in a dictionary so that we can parse it out later.
    We'll use a collections.defaultdict to make life easier
        {classroom name: {student name: [classes...]} }
            key'd by student name since a student can have multiple courses in a classroom
    """
    answer = collections.defaultdict(lambda: collections.defaultdict(list))
    with open(infilepath) as infile:
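        # Divide the input into blocks of classrooms (separated by blank lines)
        classes = infile.read().split('\n\n')
        # Strip extra whitespace and drop empty blocks left by runs of blank lines,
        # which would otherwise break the "name, *records" unpacking below
        classes = [c.strip() for c in classes if c.strip()]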

    for classblock in classes:
        name, *records = classblock.splitlines()  # student records per classroom
        name = name.split(':',1)[-1].strip()
        for record in records:
            record = record.replace("(", "").replace(")", '')  # strip out the "()". We don't need that
            kvs = record.split(',')

            # Split on the first ":" only, so a value may itself contain ":"
            d = dict(kv.split(":", 1) for kv in kvs)
            d = {k.strip():v.strip() for k,v in d.items()}

            answer[name][d['Student Name']].append(d)

    return answer


def output(outfilepath, data):
    order = ("Subject", "Time")  # the order in which we want to write the output
    with open(outfilepath, 'w') as outfile:
        for classname, students in data.items():
            for studentname, courses in students.items():
                outfile.write(f"Classroom: {classname}, {studentname}, ")
                out = []  # maintain the line output in a list. We'll join everything up later
                for course in courses:  # one dict per course the student takes in this classroom
                    for k in order:
                        out.append(f"{k}: {course[k]}, ")

                out = ''.join(out)  # this is the file output
                out = out.strip().rstrip(',')  # strip out the trailing ','
                outfile.write(f'{out}\n')


if __name__ == "__main__":
    print('starting')

    data = ingest('path/to/input/file')
    output('path/to/output/file', data)

    print('done')
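
To make the parsing step in ingest easier to follow, here is a minimal sketch of what happens to a single record line and how it ends up in the nested defaultdict. The sample line and the classroom key "1A-1" are taken from the input shown in the question; the variable names `pieces` and `fields` are only illustrative and do not appear in classroom.py.

```python
import collections

# A single record line, in the same shape as the sample input
record = "(Student Name: Jess, Subject: EC001, Time: 9am - 10am)"

# Drop the surrounding parentheses, then split into "key: value" pieces
pieces = record.replace("(", "").replace(")", "").split(",")

# Split each piece on the first ":" only and trim the whitespace
fields = {k.strip(): v.strip() for k, v in (p.split(":", 1) for p in pieces)}
print(fields)
# {'Student Name': 'Jess', 'Subject': 'EC001', 'Time': '9am - 10am'}

# Nested defaultdict: classroom -> student -> list of course records
answer = collections.defaultdict(lambda: collections.defaultdict(list))
answer["1A-1"][fields["Student Name"]].append(fields)
print(answer["1A-1"]["Jess"][0]["Subject"])
# EC001
```

Because both levels are defaultdicts, a new classroom or a new student never needs an explicit "is this key present?" check before appending.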

I used this input (note the blank line at the start of the file):



Classroom arrangement : 1A-1
(Student Name: Jess, Subject: EC001, Time: 9am - 10am)
(Student Name: Whit, Subject: EC001, Time: 9am - 10am)
(Student Name: Jon, Subject: EC0011, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC011, Time: 11am - 12pm)
(Student Name: Jess, Subject: EC011, Time: 11am - 12pm)


Classroom arrangement : 1A-2
(Student Name: Jess, Subject: EC002, Time: 11am - 12pm)
(Student Name: Whit, Subject: EC002, Time: 11am - 12pm)
(Student Name: Jon, Subject: EC002, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC002, Time: 11am - 12pm)
(Student Name: Claire, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Joshua, Subject: EC0011, Time: 2pm - 3pm)
(Student Name: Florence, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Neil, Subject: EC011, Time: 2am - 3pm)

And I got this output:

Classroom: 1A-1, Jess, Subject: EC001, Time: 9am - 10am, Subject: EC011, Time: 11am - 12pm
Classroom: 1A-1, Whit, Subject: EC001, Time: 9am - 10am
Classroom: 1A-1, Jon, Subject: EC0011, Time: 11am - 12pm
Classroom: 1A-1, Kevin, Subject: EC011, Time: 11am - 12pm
Classroom: 1A-2, Jess, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Whit, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Jon, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Kevin, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Claire, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Joshua, Subject: EC0011, Time: 2pm - 3pm
Classroom: 1A-2, Florence, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Neil, Subject: EC011, Time: 2am - 3pm
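
If the file does not reliably separate classrooms with blank lines, a line-by-line pass that remembers the most recent classroom header may be more forgiving. The following is only a sketch of that alternative, not part of classroom.py: the two regular expressions assume the exact line shapes in the sample input, and the names `parse_lines` and `format_rows` are hypothetical helpers introduced here for illustration.

```python
import re
from collections import defaultdict

# Assumed line shapes, taken from the sample input
CLASSROOM_RE = re.compile(r"^Classroom arrangement\s*:\s*(\S+)")
RECORD_RE = re.compile(
    r"^\(Student Name:\s*(?P<name>[^,]+),"
    r"\s*Subject:\s*(?P<subject>[^,]+),"
    r"\s*Time:\s*(?P<time>[^)]+)\)"
)


def parse_lines(lines):
    """Group (subject, time) pairs by (classroom, student), in first-seen order."""
    grouped = defaultdict(list)
    classroom = None  # most recent classroom header seen so far
    for line in lines:
        line = line.strip()
        header = CLASSROOM_RE.match(line)
        if header:
            classroom = header.group(1)
            continue
        record = RECORD_RE.match(line)
        if record and classroom is not None:
            key = (classroom, record["name"].strip())
            grouped[key].append((record["subject"].strip(), record["time"].strip()))
    return grouped


def format_rows(grouped):
    """Yield one output line per (classroom, student) pair."""
    for (classroom, student), courses in grouped.items():
        parts = ", ".join(f"Subject: {s}, Time: {t}" for s, t in courses)
        yield f"Classroom: {classroom}, {student}, {parts}"


sample = """\
Classroom arrangement : 1A-1
(Student Name: Jess, Subject: EC001, Time: 9am - 10am)
(Student Name: Whit, Subject: EC001, Time: 9am - 10am)
(Student Name: Jess, Subject: EC011, Time: 11am - 12pm)
"""
rows = list(format_rows(parse_lines(sample.splitlines())))
for row in rows:
    print(row)
```

Since parse_lines accepts any iterable of lines, an open file object can be passed in directly, so a file with thousands of lines is read one line at a time. Lines that match neither pattern are skipped silently, which may or may not be what you want.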

Hope this helps.
