Python regex: remove all punctuation except hyphens from a Unicode string

Question

I have this code for removing all punctuation from a string with a regex:

import regex as re
re.sub(r"\p{P}+", "", txt)

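How do I change it to keep hyphens? If you could explain how you did it, that would be great. My understanding, correct me if I am wrong, is that the P in \p{P} stands for punctuation.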

Answer 1

[^\P{P}-]+

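\P is the complement of \p, that is, "not punctuation". So this matches anything that is not (not punctuation or a dash), which leaves all punctuation except dashes. The trailing - is taken literally because it sits at the end of the character class.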

Example: http://www.rubular.com/r/JsdNM3nFJ3
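The double negation can also be checked character by character, without any regex engine. The following is only a minimal sketch: it assumes that \p{P} corresponds to the Unicode general categories whose code starts with "P", and the helper names are made up for illustration.

```python
import unicodedata

def is_punctuation(ch):
    # Assumption: \p{P} means a general category starting with "P"
    # (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    return unicodedata.category(ch).startswith("P")

def in_class(ch):
    # Mirrors [^\P{P}-]: NOT (non-punctuation OR hyphen)
    return not (not is_punctuation(ch) or ch == "-")

def strip_punctuation_keep_hyphen(text):
    # Drop every character the class would match, keep the rest
    return "".join(ch for ch in text if not in_class(ch))

sample = "well-known «quote», ok!"
cleaned = strip_punctuation_keep_hyphen(sample)
print(cleaned)  # well-known quote ok
```

Characters such as $, + and ^ belong to the symbol categories (Sc, Sm, Sk), so a test based on the punctuation categories leaves them in place.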

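If you want a non-convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then check that it was not a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk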
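The lookbehind trick is not tied to \p{P}. As a rough sketch with the standard library re module, the class [^\w\s] is used here as a stand-in for "punctuation"; this is an assumption for illustration and not an exact equivalent, since it also matches symbols such as $ and does not match the underscore.

```python
import re

# Stand-in for \p{P}: anything that is neither a word character nor whitespace.
# Not equivalent: it also matches symbols such as "$" and skips "_".
candidate = r"[^\w\s]"

# Consume one candidate character, then look back and reject it if it was "-".
pattern = re.compile(candidate + r"(?<!-)")

text = "a-b, c! «d»"
print(pattern.sub("", text))  # a-b c d
```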

Answer 2

Here is how to do it with the re module, in case you have to stick to the standard library:

# works in python 2 and 3
import re
import string

remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(re.escape(remove)) # create the pattern; escaping keeps "\" and "]" literal

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt) 
# >>> 'this - is - a - test'

If performance matters, you may want to use str.translate, since it is faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
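For reference, a minimal Python 3 sketch of the str.translate route, using the three-argument form of str.maketrans to build the deletion table once. It covers only the ASCII characters in string.punctuation, not Unicode punctuation in general.

```python
import string

# Everything in string.punctuation except the hyphen
remove = string.punctuation.replace("-", "")

# With three arguments, str.maketrans maps every character of the
# third argument to None, i.e. those characters are deleted.
table = str.maketrans("", "", remove)

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
print(txt.translate(table))  # this - is - a - test
```

The table can be reused across many strings, which avoids rebuilding the mapping on every call.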

Answer 3

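You could either specify the punctuation you want to remove manually, as in [._,], or supply a function instead of the replacement string (with the regex module imported as re, as in the question):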

re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
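A replacement function also makes it easy to keep more than one character. The sketch below uses the standard library re module with [^\w\s] as an approximate stand-in for \p{P}; the KEEP set and the function name are hypothetical.

```python
import re

# Hypothetical allow-list: characters that should survive the cleanup
KEEP = {"-", "'"}

def drop_unless_kept(match):
    # re.sub calls this for every match and inserts the returned string
    ch = match.group(0)
    return ch if ch in KEEP else ""

text = "rock-'n'-roll: it's (really) loud!"
result = re.sub(r"[^\w\s]", drop_unless_kept, text)
print(result)  # rock-'n'-roll it's really loud
```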

Answer 4

You can try:

import re, string

text = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."

exclusion_pattern = r"([{}])".format(re.escape(string.punctuation.replace("-", "")))

result = re.sub(exclusion_pattern, r"", text)

print(result)

Output: this - is - a - test
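One detail worth checking when a character class is built from string.punctuation: the string contains \ immediately followed by ], so without escaping the backslash is read as an escape for the bracket and is itself never matched. A small sketch comparing both variants (the sample string is made up):

```python
import re
import string

remove = string.punctuation.replace("-", "")

# Unescaped: "\]" inside the class is read as an escaped "]",
# so the backslash itself is not part of the class.
unescaped = re.compile("[{}]".format(remove))

# Escaped: every special character is taken literally.
escaped = re.compile("[{}]".format(re.escape(remove)))

sample = r"path\to]file - ok"
print(unescaped.sub("", sample))  # path\tofile - ok
print(escaped.sub("", sample))    # pathtofile - ok
```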
