How do I extract text from multiple text files into a CSV file using Python?


I have a folder full of subfolders containing text (.txt) files that look like this:

some random information here
ignore it

author: Lisa Smith

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sit amet leo quis risus viverra varius pretium sed nunc. Nullam vitae tempor nisl.

Quisque viverra interdum nibh, id malesuada magna scelerisque sit amet. Quisque sed arcu tempus, feugiat dolor at, convallis justo. Suspendisse euismod, metus non pretium pulvinar, odio eros rhoncus eros, eu scelerisque ex risus id mauris. Praesent id vulputate augue.

Aliquam erat volutpat. Pellentesque dignissim pharetra commodo. Vivamus risus leo, posuere eu odio eget, vestibulum auctor lorem. Aenean volutpat finibus lectus sed pretium. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ullamcorper mauris nec elit tempor, vitae finibus ante aliquam.

I want to create a CSV file that looks like this:

filename | author | text
/full/file/path/ | Lisa Smith | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sit amet leo quis risus viverra varius pretium sed nunc. Nullam vitae tempor nisl. Quisque viverra interdum nibh, id malesuada magna scelerisque sit amet. Quisque sed arcu tempus, feugiat dolor at, convallis justo. Suspendisse euismod, metus non pretium pulvinar, odio eros rhoncus eros, eu scelerisque ex risus id mauris. Praesent id vulputate augue. Aliquam erat volutpat. Pellentesque dignissim pharetra commodo. Vivamus risus leo, posuere eu odio eget, vestibulum auctor lorem. Aenean volutpat finibus lectus sed pretium. Lorem ipsum dolor sit amet, consectetur adipiscing elit. In ullamcorper mauris nec elit tempor, vitae finibus ante aliquam.

This is the code I have so far, pieced together from a previous question and several other posts:

from glob import glob
import os
import re
import csv
import nltk

path = '**/*.txt'

def extract_fields(fname):
    with open(fname) as f:
        author, txt = "", ""

        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break

        next(f)  # discard the following blank line

        txt = f.read()

        return author, txt


rows = []
for fname in glob(path):
    author, txt = extract_fields(fname)
    rows.append([fname, author, txt])

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "author", "txt"])
    writer.writerows(rows)

I get the following error:

Traceback (most recent call last):
  File "print_text.py", line 28, in <module>
    author, txt = extract_fields(fname)
  File "print_text.py", line 19, in extract_fields
    next(f)  # discard the following blank line
StopIteration

Any guidance would be greatly appreciated!

python csv
1 Answer

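The biggest problem I see in your code is structural; there may be an issue with the regular expression, but where does text come from, and how are you iterating over it?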

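I suggest writing a function that takes a filename and returns the extracted author and text. The main body of the script would then look like this: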

def extract_fields(fname):
    ...
    return author, txt


rows = []
for fname in glob(...):
    author, txt = extract_fields(fname)
    rows.append([fname, author, txt])


with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "author", "txt"])
    writer.writerows(rows)

Inside extract_fields, you open the file at fname, perform the extraction, and return the extracted author and text.
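Since the files live in nested subfolders, the pattern passed to glob matters. In the glob module, "**" only descends through any number of directory levels when recursive=True is passed; without it, "**" behaves like a single "*". A minimal sketch, assuming the search starts at some root folder (the find_txt_files name and the root parameter are hypothetical, not from the question):

```python
import os
from glob import glob


def find_txt_files(root="."):
    """Return every .txt file under root, at any depth, in sorted order."""
    pattern = os.path.join(root, "**", "*.txt")
    # recursive=True makes "**" match zero or more directory levels
    return sorted(glob(pattern, recursive=True))
```

The returned list can be used in place of the glob(...) call in the main loop.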

If you know your regex is good and you like it, you can ignore the rest.

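As for the mechanics of the extraction, for something that looks this simple I prefer to process the data as individual lines of text (I don't like multi-line regexes and try to avoid them).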

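I would iterate over the first few lines of the file until I find the anchor line, "author: ...". Once I have identified that line, I know how to get the author's name: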

for line in f:
    line = line.strip()
    if line.startswith("author: "):
        author = line[8:]
        break

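That loop reads and discards lines (after stripping them) until it finds one that starts with "author: ", extracts the author's name, and then breaks out of the loop.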
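The 8 in line[8:] is simply the length of the prefix "author: " (six letters, a colon, and a space). A small sketch that makes this explicit; the PREFIX constant and the parse_author helper are names made up for illustration:

```python
PREFIX = "author: "


def parse_author(line):
    """Return the author name if line is the anchor line, else None."""
    line = line.strip()
    if line.startswith(PREFIX):
        # len(PREFIX) == 8, so this is the same slice as line[8:]
        return line[len(PREFIX):]
    return None


print(parse_author("author: Lisa Smith\n"))  # Lisa Smith
print(parse_author("ignore it\n"))           # None
```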

Having broken out of the loop, I know the next line is blank and can be discarded:

next(f, None)  # discard the following blank line; the None default avoids StopIteration at end of file

What remains is the text I want:

txt = f.read()

Here is the complete function:

def extract_fields(fname):
    with open(fname) as f:
        author, txt = "", ""

        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break

        next(f, None)  # discard the following blank line; the None default avoids StopIteration at end of file

        txt = f.read()

        return author, txt
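
One caveat: calling next() on a file without a default raises StopIteration when there are no lines left, which is the error in the question's traceback. That happens when no line starts with "author: " (the for loop then consumes the whole file) or when the file ends right after the author line. Below is a sketch of a more defensive variant, assuming you would rather skip such files than crash. The name extract_fields_or_none, the None return value, and the utf-8 encoding are assumptions of this sketch, not part of the original code:

```python
def extract_fields_or_none(fname):
    """Return (author, txt), or None if the file has no "author: " line."""
    with open(fname, encoding="utf-8") as f:
        author = None

        for line in f:
            line = line.strip()
            if line.startswith("author: "):
                author = line[8:]
                break

        if author is None:
            return None  # reached end of file without finding the anchor line

        next(f, None)  # skip the blank line; the default avoids StopIteration

        return author, f.read()
```

The calling loop can then check for None and continue to the next file, which also tells you which files do not match the expected layout.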

With these two files:

file1.txt
=========
some random information here
ignore it

author: Lisa Smith

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

file2.txt
=========
foo

bar

author: Doug

Nulla sit amet leo quis risus viverra varius pretium sed nunc.
Nullam vitae tempor nisl.

I get this output.csv:

+-----------+------------+----------------------------------------------------------------+
| filename  | author     | txt                                                            |
+-----------+------------+----------------------------------------------------------------+
| file1.txt | Lisa Smith | Lorem ipsum dolor sit amet, consectetur adipiscing elit.       |
|           |            |                                                                |
+-----------+------------+----------------------------------------------------------------+
| file2.txt | Doug       | Nulla sit amet leo quis risus viverra varius pretium sed nunc. |
|           |            | Nullam vitae tempor nisl.                                      |
+-----------+------------+----------------------------------------------------------------+
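
Note that txt still contains the file's own line breaks (the empty line under file1's text is presumably its trailing newline), so the csv module writes it as a quoted multi-line field. The desired output in the question shows all paragraphs on one line. A minimal sketch of one way to get that, assuming it is acceptable to collapse every run of whitespace into a single space (the flatten helper and the shortened sample text are hypothetical):

```python
import csv
import io


def flatten(txt):
    """Collapse newlines and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(txt.split())


rows = [["file2.txt", "Doug", flatten("Nulla sit amet.\nNullam vitae tempor nisl.\n")]]

buf = io.StringIO()  # stands in for open("output.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["filename", "author", "txt"])
writer.writerows(rows)
print(buf.getvalue())
```

In the real script, flatten(txt) would be applied where each row is appended, before the rows are written.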