Parsing a text file of contacts with Python and transposing rows into columns


I have a plain-text file of SMS messages that I want to convert to a CSV file. The format is as follows:

Sent on 1/1/2023 7:30:33 AM to Person1

Message

-----

Received on 5/20/2023 4:55:33 PM from Person1

Message

I want to loop through the text file, read the lines, and produce a result like the one below.

Status    Date                   Contact   Message
Sent      1/1/2023 7:30:33 AM    Person1   Message
Received  5/20/2023 4:55:33 PM   Person1   Message

I started with the code below, but I am new to Python and don't know how to transpose the rows into columns.

import csv
import openpyxl

input_file = 'output.txt'
output_file = 'allmessages.csv'

wb = openpyxl.Workbook()
ws = wb.worksheets[0]

with open(input_file, 'r') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        ws.append(row)
        
wb.save(output_file)

Any suggestions would be greatly appreciated!

python csv parsing text-files
1 Answer

Welcome to the wonderful world of regular expressions!

I cobbled this together:

regexp = r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"

You can try it here: Regex101

The expression returns 4 groups per match, containing the status, timestamp, contact, and message, respectively.

I suggest you look at Python's re package to see exactly how to extract this information; here is a post with more details: How can i find all matches to a regular expression in Python
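As a rough illustration of that extraction step, here is a minimal sketch that applies the pattern above with re.finditer and writes one CSV row per match using the standard csv module. The helper names parse_messages and write_csv, the inline SAMPLE text, and the header labels are assumptions made for this example, not part of the original answer.

```python
import csv
import re
import sys

# The pattern from the answer, unchanged (split across two raw strings).
PATTERN = re.compile(
    r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n"
    r"([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"
)

# Assumed sample input, laid out like the excerpt in the question.
SAMPLE = (
    "Sent on 1/1/2023 7:30:33 AM to Person1\n\n"
    "Message\n\n"
    "-----\n\n"
    "Received on 5/20/2023 4:55:33 PM from Person1\n\n"
    "Message\n"
)


def parse_messages(text):
    """Return one (status, date, contact, message) tuple per match."""
    rows = []
    for match in PATTERN.finditer(text):
        status, date, contact, message = match.groups()
        # The body can carry a trailing newline before "-----"; trim it.
        rows.append((status, date, contact, message.strip()))
    return rows


def write_csv(rows, handle):
    """Write a header row and the parsed rows to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(["Status", "Date", "Contact", "Message"])
    writer.writerows(rows)


# Demonstration on the inline sample; output goes to the console.
write_csv(parse_messages(SAMPLE), sys.stdout)
```

To process the file from the question, read output.txt into a string, pass it to parse_messages, and hand write_csv a file opened with open("allmessages.csv", "w", newline=""). Note that (\w*?) only matches a single word, so a contact name containing spaces would cause that message to be skipped.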

To explain the expression above a little:

(Sent|Received)
// A group matching either the literal "Sent" or "Received"
.*?
// Lazily consume as little as possible until the next part of the pattern can match (here, the " on " before the timestamp)
(\d.*?(?:AM|PM))
// A group containing anything between a digit and either the literal "AM" or "PM". 1234AM would match here, but so would 5/20/2023 4:55:33 PM, so as long as the format is consistent it'll work.
\s
// Consume a single whitespace character
(?:to\s|from\s)
// Consume either the literal "to " or "from "
(\w*?)
// A group containing sequential alphanumeric characters (letters, digits, underscore), matched lazily so it stops where the next part of the pattern begins
\n\n
// Consume 2 newlines
([\s\S]*?)
// A group containing literally anything, but making sure it doesn't skip the next match
(?:\n\-{5}|$)
// Consume either the literal "\n-----" or the end of the entire file
[\S\s]*?
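// Intended to consume anything up to the start of the next match. Because it is lazy and comes last in the pattern, it actually matches nothing; the search for the next "Sent" or "Received" does the skipping.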
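The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and "#" comments inside the pattern. The following is a sketch under that assumption; the names COMPACT_RE, MESSAGE_RE, and SAMPLE are made up for the example, and the sample text mirrors the excerpt in the question.

```python
import re

# The one-line pattern from the answer, for comparison.
COMPACT_RE = re.compile(
    r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n"
    r"([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"
)

# The same pieces, one per line, each with its own comment.
MESSAGE_RE = re.compile(
    r"""
    (Sent|Received)        # group 1: status
    .*?                    # lazily skip the text before the timestamp
    (\d.*?(?:AM|PM))       # group 2: first digit through AM or PM
    \s                     # one whitespace character
    (?:to\s|from\s)        # direction word, not captured
    (\w*?)                 # group 3: contact (letters, digits, underscore)
    \n\n                   # blank line after the header
    ([\s\S]*?)             # group 4: message body, newlines included
    (?:\n\-{5}|$)          # stop at a line of five dashes or end of text
    [\S\s]*?               # lazy and last, so it matches nothing
    """,
    re.VERBOSE,
)

# Assumed sample input in the layout shown in the question.
SAMPLE = (
    "Sent on 1/1/2023 7:30:33 AM to Person1\n\n"
    "Message\n\n"
    "-----\n\n"
    "Received on 5/20/2023 4:55:33 PM from Person1\n\n"
    "Message\n"
)

# Print the four groups of every match found by the verbose pattern.
for groups in MESSAGE_RE.findall(SAMPLE):
    print(groups)
```

Both compiled patterns should return the same groups for the same input, so the verbose form is only a readability choice.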