Parsing a text file of contacts with Python and transposing rows into columns

Question

I have a plain-text file of text messages that I want to convert to a CSV file. The format is as follows:

Sent on 1/1/2023 7:30:33 AM to Person1

Message

-----

Received on 5/20/2023 4:55:33 PM from Person1

Message

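I want to loop through the text file, retrieve the lines, and produce a result like the one below.

Status	Date	Contact	Message
Sent	1/1/2023 7:30:33 AM	Person1	Message
Received	5/20/2023 4:55:33 PM	Person1	Message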

I started with the code below, but I am new to Python and don't know how to transpose the rows into columns.

import csv
import openpyxl

input_file = 'output.txt'
output_file = 'allmessages.csv'

wb = openpyxl.Workbook()
ws = wb.worksheets[0]

with open(input_file, 'r') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        ws.append(row)
        
wb.save(output_file)

Any suggestions would be greatly appreciated!

Answer 1

Welcome to the wonderful world of regular expressions!

I cobbled this together:

regexp = r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"

You can try it here: Regex101

This expression returns 4 groups per match, containing the status, timestamp, contact, and message, respectively.

I suggest looking at Python's re module to see exactly how to extract this information; there is also a post with more detail: How can i find all matches to a regular expression in Python
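As a minimal sketch of what that extraction could look like, the snippet below runs re.findall with the pattern above on a sample string copied from the question's format. The names sample and matches are made up for illustration, and the sketch assumes the whole file has been read into one string with plain "\n" line endings.

```python
import re

# The pattern from this answer, unchanged.
regexp = r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"

# Sample text in the format shown in the question (hypothetical stand-in
# for the contents of the real file).
sample = (
    "Sent on 1/1/2023 7:30:33 AM to Person1\n"
    "\n"
    "Message\n"
    "\n"
    "-----\n"
    "\n"
    "Received on 5/20/2023 4:55:33 PM from Person1\n"
    "\n"
    "Message\n"
)

# With four capture groups, findall returns one 4-tuple per message.
matches = re.findall(regexp, sample)

for status, timestamp, contact, message in matches:
    # strip() drops the blank line that may trail the message body.
    print(status, timestamp, contact, message.strip(), sep="\t")
```

Each tuple already has the row shape asked for in the question: status, date, contact, message.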

To explain the expression above a little:

(Sent|Received)
// A group matching either the literal "Sent" or "Received"

.*?
// Lazily consume as few characters as possible (other than newlines) until the next part of the pattern can match; here that skips " on "

(\d.*?(?:AM|PM))
// A group containing anything between a digit and either the literal "AM" or "PM". 1234AM would match here, but so would 5/20/2023 4:55:33 PM, so as long as the format is consistent it'll work.

\s
// Consume a single whitespace character

(?:to\s|from\s)
// Consume either the literal "to " or "from "

(\w*?)
// A group containing sequential word characters (letters, digits, underscore), stopping as soon as the next part of the pattern can match

\n\n
// Consume 2 newlines

([\s\S]*?)
// A group lazily capturing literally anything, newlines included, up to the separator or end of file that follows

(?:\n\-{5}|$)
// Consume either the literal "\n-----" or the end of the entire file

[\S\s]*?
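// Meant to consume anything up to the start of the next match; because it is lazy and nothing follows it, it matches zero characters in practice, and the search for the next match simply resumes from this point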
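Putting the pieces together, here is a hedged sketch of turning the matches into CSV rows with the standard csv module. The helper names parse_messages and write_rows are made up for this example, and the header names are taken from the table in the question. The demo writes to an in-memory buffer so it runs without any files. Note that openpyxl writes .xlsx workbooks, so it is not needed for CSV output.

```python
import csv
import io
import re

# The pattern from this answer, unchanged.
regexp = r"(Sent|Received).*?(\d.*?(?:AM|PM))\s(?:to\s|from\s)(\w*?)\n\n([\s\S]*?)(?:\n\-{5}|$)[\S\s]*?"


def parse_messages(text):
    """Return one [status, date, contact, message] row per regex match."""
    rows = []
    for status, timestamp, contact, message in re.findall(regexp, text):
        # strip() removes blank lines around the message body.
        rows.append([status, timestamp, contact, message.strip()])
    return rows


def write_rows(rows, out):
    """Write a header line and the rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["Status", "Date", "Contact", "Message"])
    writer.writerows(rows)


# Hypothetical sample in the question's format.
sample = (
    "Sent on 1/1/2023 7:30:33 AM to Person1\n\n"
    "Message\n\n"
    "-----\n\n"
    "Received on 5/20/2023 4:55:33 PM from Person1\n\n"
    "Message\n"
)

buffer = io.StringIO()
write_rows(parse_messages(sample), buffer)
print(buffer.getvalue())
```

For the real files from the question, the same two helpers could be used by reading output.txt into a string with open(input_file).read() and passing a file opened with open(output_file, 'w', newline='') to write_rows; the encoding of the input file is something to check first.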
