正则表达式，用于确认字符串是否是Python中的有效标识符

Question

我对标识符有以下定义：

Identifier --> letter{ letter| digit}

基本上我有一个标识符函数，它从文件中获取一个字符串并对其进行测试，以确保它是上面定义的有效标识符。

我试过这个：

if re.match('\w+(\w\d)?', i):     
  return True
else:
  return False

但是当我每次遇到一个整数运行我的程序时，它认为它是一个有效的标识符。

例如

c = 0 ;

它打印c作为有效的标识符是好的，但它也打印0作为有效的标识符。

我在这做错了什么？

Answer 1

来自official reference：identifier ::= (letter|"_") (letter | digit | "_")*

所以正则表达式是：

^[^\d\W]\w*\Z

示例（对于Python 2，只省略re.UNICODE）：

import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))

结果：

'a' = True
'a1'    = True
'_a1'   = True
'1a'    = False
'aa$%@%'    = False
'aa bb' = False
'aa_bb' = True
'aa\n'  = False

Answer 2

str.isidentifier()的作品。正则表达式的答案错误地无法匹配某些有效的python标识符并错误地匹配一些无效的标识符。

str.isidentifier()如果字符串是根据语言定义，标识符和关键字部分的有效标识符，则返回true。

使用keyword.iskeyword()测试保留标识符，例如def和class。

@ martineau的评论给出了正则表达式解决方案失败的'℘᧚'的例子。

>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False

Why does this happen?

让我们定义匹配给定正则表达式的代码点集，以及与str.isidentifier匹配的集合。

import re
import unicodedata

chars = {chr(i) for i in range(0x10ffff) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x10ffff) if chr(i).isidentifier()}

有多少正则表达式匹配不是标识符？

In [26]: len(chars - identifiers)                                                                                                               
Out[26]: 698

有多少标识符不是正则表达式匹配？

In [27]: len(identifiers - chars)                                                                                                               
Out[27]: 4

有意思 - 哪些？

In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}                                                       
Out[37]: 
set([
    ('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
    ('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
    ('℘', 'SCRIPT CAPITAL P', 'Sm'),
    ('℮', 'ESTIMATED SYMBOL', 'So'),
])

What's different about these two sets?

它们具有不同的Unicode“常规类别”值。

In [31]: {unicodedata.category(c) for c in chars - identifiers}                                                                                 
Out[31]: set(['Lm', 'Lo', 'No'])

来自wikipedia，那是Letter, modifier; Letter, other; Number, other。这与re docs一致，因为\d只是十进制数字：

\d匹配任何Unicode十进制数字（即Unicode字符类别[Nd]中的任何字符）

另一种方式呢？

In [32]: {unicodedata.category(c) for c in identifiers - chars}                                                                                 
Out[32]: set(['Mn', 'Sm', 'So'])

那就是qazxsw poi; qazxsw poi; qazxsw poi。

Where is this all documented?

在Mark, nonspacing
在Symbol, math

Where is it implemented?

Symbol, other

I still want a regular expression

查看PyPI上的Python Language Reference模块。

此正则表达式实现与标准“re”模块向后兼容，但提供了其他功能。

它包括“常规类别”的过滤器。

Answer 3

对于Python 3，您需要处理Unicode字母和数字。所以，如果这是一个问题，你应该相处：

PEP 3131 - Supporting non-ascii identifiers

https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255匹配一个不是数字而不是“非字母数字”的字符，该字符转换为“字母或下划线”。

Answer 4

2
投票

\ w匹配数字和字符。试试regex

Answer 5

像魅力一样：re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)

正则表达式，用于确认字符串是否是Python中的有效标识符

问题描述投票：14回答：5

5个回答

Why does this happen?

What's different about these two sets?

Where is this all documented?

Where is it implemented?

I still want a regular expression

最新问题

正则表达式，用于确认字符串是否是Python中的有效标识符

问题描述 投票：14回答：5

5个回答

Why does this happen?

What's different about these two sets?

Where is this all documented?

Where is it implemented?

I still want a regular expression

最新问题

问题描述投票：14回答：5