Python:正则表达式-提取中文文本

问题描述 投票:0回答:1

我正在尝试从以下文本中提取省名和城市名(这是html,但是我删除了一些转义符)。但是,我编写的正则表达式返回一个空白列表。

[当我在re网站(例如https://regex101.com/)上测试代码时,它似乎可以工作,但是当我在脚本中编写代码时,它就无法工作。

这里是我的代码的简化版本(html转储要长得多)。

任何帮助将不胜感激。

import re
text = 'try  window.getAreaStat = [provinceName:湖北省,provinceShortName:湖北,confirmedCount:3554,suspectedCount:0,curedCount:80,deadCount:125,comment:待明确地区:治愈 30,cities:[cityName:武汉,confirmedCount:1905,suspectedCount:0,curedCount:47,deadCount:104,cityName:黄冈,confirmedCount:324,suspectedCount:0,curedCount:2,deadCount:5,cityName:孝感,confirmedCount:274,suspectedCount:0,curedCount:0,deadCount:3,cityName:荆门,confirmedCount:142,suspectedCount:0,curedCount:0,deadCount:4,cityName:襄阳,confirmedCount:131,suspectedCount:0,curedCount:0,deadCount:0,cityName:随州,confirmedCount:116,suspectedCount:0,curedCount:0,deadCount:0,cityName:咸宁,confirmedCount:112,suspectedCount:0,curedCount:0,deadCount:0,cityName:荆州,confirmedCount:101,suspectedCount:0,curedCount:1,deadCount:2,cityName:十堰,confirmedCount:88,suspectedCount:0,curedCount:0,deadCount:0,cityName:黄石,confirmedCount:86,suspectedCount:0,curedCount:0,deadCount:1,cityName:鄂州,confirmedCount:84,suspectedCount:0,curedCount:0,deadCount:1,cityName:宜昌,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:1,cityName:恩施州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:天门,confirmedCount:34,suspectedCount:0,curedCount:0,deadCount:3,cityName:仙桃,confirmedCount:32,suspectedCount:0,curedCount:0,deadCount:0,cityName:潜江,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:1,cityName:神农架林区,confirmedCount:3,suspectedCount:0,curedCount:0,deadCount:0],provinceName:浙江省,provinceShortName:浙江,confirmedCount:296,suspectedCount:0,curedCount:3,deadCount:0,comment:,cities:[cityName:温州,confirmedCount:114,suspectedCount:0,curedCount:3,deadCount:0,cityName:杭州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:台州,confirmedCount:40,suspectedCount:0,curedCount:0,deadCount:0,cityName:宁波,confirmedCount:20,suspectedCount:0,curedCount:0,deadCount:0,cityName:绍兴,confirmedCount:19,suspectedCount:0,curedCount:0,deadCount:0,cityName:嘉兴,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:金华,confirmedCount:13,suspectedCount:0,curedCount:0,deadCount:0,cityName:衢州,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:0,cityName:舟山,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:丽水,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:湖州,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0],provinceName:广东省,provinceShortName:广东,confirmedCount:241,suspectedCount:0,curedCount:5,deadCount:0,comment:,cities:[cityName:广州,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:0,cityName:深圳,confirmedCount:63,suspectedCount:0,curedCount:4,deadCount:0,cityName:佛山,confirmedCount:18,suspectedCount:0,curedCount:0,deadCount:0,cityName:珠海,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:惠州,confirmedCount:12,suspectedCount:0,curedCount:1,deadCount:0,cityName:中山,confirmedCount:12,suspectedCount:0,curedCount:0,deadCount:0,cityName:阳江,confirmedCount:10,suspectedCount:0,curedCount:0,deadCount:0,cityName:湛江,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:东莞,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:清远,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕头,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:揭阳,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:肇庆,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0,cityName:韶关,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:梅州,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:茂名,confirmedCount:2,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕尾,confirmedCount:1,suspectedCount:0,curedCount:0,deadCount:0,cityName:河源'

regex = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"
province = re.findall(regex, text)

print(province)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
python regex cjk
1个回答
1
投票

从此answer开始,re.findall将返回所有捕获的组。我在https://regexr101.com中尝试了您的正则表达式,所有返回的都是捕获的空白组。

您可以通过添加(?:...)使用非捕获组

regex = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"

Preview on Repl.it

© www.soinside.com 2019 - 2024. All rights reserved.