corpus = """In the US 555-0198 and 1-206-5705-0100 are examples fictitious numbers.
In the UK, 044-113-496-1834 is a fictitious number.
In Ireland, the number 353-020-917-1234 is fictitious.
And in Australia, 061-970-654-321 is a fictitious number.
311 is a joke."""
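I'm new to Python and am learning regular expressions. I'm trying to change every 7-, 11-, 12-, and 13-digit number to zeros while keeping it looking like a phone number, for example changing 555-0198 to 000-0000. Is there a way to leave 311 as it is instead of turning it into zeros? Here is what I have come up with so far.
At first I tried this, but it turns every digit into zero: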
import re

for word in corpus.split():
    nums = re.sub(r"(\d)", "0", word)
    print(nums)
Then I tried this, but I realized it does not handle the 11- and 13-digit numbers correctly:
def sub_nums():
    for word in corpus.split():
        nums = re.sub(r"(\d{1,4})-+(\d{1,4})", "000-0000", word)
        print(nums)

sub_nums()
The regex I would use is:
r'(?<!\S)(?:(?=(-*\d-*){7}(\s|\Z))[\d-]+|(?=(-*\d-*){11}(\s|\Z))[\d-]+|(?=(-*\d-*){12}(\s|\Z))[\d-]+|(?=(-*\d-*){13}(\s|\Z))[\d-]+)'
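The 7-, 11-, 12-, and 13-digit phone numbers share a repeating "theme" or pattern, so I will explain only the pattern for 7-digit numbers:

(?<!\S)
This is a negative lookbehind that applies to all of the alternatives. It says that the phone number must not be preceded by a non-whitespace character. That double negative is equivalent to the positive lookbehind (?<=\s|\A), which says that the phone number must be preceded by whitespace or by the start of the string. However, that is a variable-length lookbehind, which the re module that ships with Python does not support (the regex package from the PyPI repository does).

(?=(-*\d-*){7}(\s|\Z))
This is a positive lookahead for the 7-digit case. It asserts that the next characters contain exactly 7 digits, with any number of hyphens around them, followed by whitespace or the end of the string.

[\d-]+
This performs the actual match of the next digits and hyphens in the input.

import re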
corpus = """In the US 555-0198 and 1-206-5705-0100 are examples fictitious numbers.
In the UK, 044-113-496-1834 is a fictitious number.
In Ireland, the number 353-020-917-1234 is fictitious.
And in Australia, 061-970-654-321 is a fictitious number.
311 is a joke."""
regex = r'(?<!\S)(?:(?=(-*\d-*){7}(\s|\Z))[\d-]+|(?=(-*\d-*){11}(\s|\Z))[\d-]+|(?=(-*\d-*){12}(\s|\Z))[\d-]+|(?=(-*\d-*){13}(\s|\Z))[\d-]+)'
new_corpus = re.sub(regex, lambda m: re.sub(r'\d', '0', m[0]), corpus)
print(new_corpus)
Prints:
In the US 000-0000 and 0-000-0000-0000 are examples fictitious numbers.
In the UK, 000-000-000-0000 is a fictitious number.
In Ireland, the number 000-000-000-0000 is fictitious.
And in Australia, 000-000-000-000 is a fictitious number.
311 is a joke.