Regular expression to identify text containing commas and spaces between semicolons

Question

I am trying to identify text that contains a comma (,) followed by whitespace (\s+) in a semicolon (;) delimited CSV. Sample CSV entries are shown below:

09/03/2023;13;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(UNSC RESOLUTION 1483);;;;;;;;;;;;;;;;;;;;;;;;;;;14;13;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;;;;;;;;;;;;;;;EU.27.28
09/03/2023;20;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(Saddam's second son);26;20;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;Hussein Al-Tikriti;Qusay;Saddam;Qusay Saddam Hussein Al-Tikriti;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;EU.39.56

From the sample data, I am trying to extract the following text:

al-Awja, near Tikrit
Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard

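Both instances of the target text contain commas (,), which causes a problem when converting the semicolon (;) delimited file to a comma (,) delimited file, because each existing comma inside the string adds an extra column.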

So far, I have the following regex, which gets me to the desired text. However, I cannot use it to retrieve the whole string.

Regex:

([A-Za-z0-9-]+)([,])(\s+)([A-Za-z0-9-]+)

Please help.

Answer 1

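If it doesn't have to be a regex, you can read the CSV with, for example, pandas. Given that you are always looking for the same columns, your code could look like this: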

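import pandas as pd

# The file has no header row, so columns are addressed by position.
df = pd.read_csv('yourFile.csv', sep=';', header=None)
# Columns 20 and 41 hold the free-text fields in the sample rows.
df[[20, 41]]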

For the sample data, this returns:

	20	41
0	NaN	al-Awja, near Tikrit
1	Oversaw Special Republican Guard, Special Security...	NaN
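
If a regex is still preferred, one possible approach is to match the whole semicolon-delimited field rather than only the words around the comma. The sketch below is an illustration, not part of the original answer: the helper names `find_comma_fields` and `semicolon_to_comma` and the shortened `sample` line are assumptions made for the example. It also shows that, for the conversion itself, Python's standard `csv` module quotes any field that contains the new delimiter, so embedded commas should no longer create extra columns.

```python
import csv
import io
import re

# Matches one whole semicolon-delimited field that contains a comma
# followed by whitespace. [^;\n]* keeps the match inside a single field.
FIELD_WITH_COMMA = re.compile(r"[^;\n]*,\s+[^;\n]*")


def find_comma_fields(line):
    """Return every field of a single line that contains a comma plus whitespace."""
    return FIELD_WITH_COMMA.findall(line)


def semicolon_to_comma(text):
    """Re-emit semicolon-separated text as comma-separated CSV.

    csv.writer quotes fields that contain the output delimiter, so a
    field such as 'al-Awja, near Tikrit' stays in one column.
    """
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Shortened, hypothetical line modelled on the sample data in the question.
sample = "IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;EU.27.28"
print(find_comma_fields(sample))
print(semicolon_to_comma(sample), end="")
```

The pattern relies on `[^;\n]*` on both sides of the comma, so it expands to the field boundaries instead of stopping at the first non-alphanumeric character, which is where the pattern in the question falls short. If the goal is only the file conversion, the `csv`-based function alone may be enough and the regex can be skipped.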
