NLTK regular expression tokenizer

Question

I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the result I want is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

Answer 1

You should turn all the capturing groups into non-capturing ones:

([A-Z]\.)+ -> (?:[A-Z]\.)+
\w+(-\w+)* -> \w+(?:-\w+)*
\$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

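The issue is that regexp_tokenize seems to be using re.findall, which returns a list of tuples of captured groups, rather than the whole matches, when the pattern defines more than one capturing group. See the nltk.tokenize package reference:

pattern (str) - The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead)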
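
To illustrate the re.findall behavior described above, here is a minimal sketch that uses only the standard re module and a single alternative from the question's pattern. It assumes, as the answer suggests, that regexp_tokenize hands the pattern to re.findall; the variable names are only for illustration.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# One capturing group: findall returns that group's content for each match
# (an empty string when the group did not participate), not the whole match.
with_group = re.findall(r'\w+(-\w+)*', text)
print(with_group)     # ['', '', '', '', '-print', '', '', '']

# Non-capturing group: findall returns each whole match.
without_group = re.findall(r'\w+(?:-\w+)*', text)
print(without_group)  # ['That', 'U', 'S', 'A', 'poster-print', 'costs', '12', '40']
```

With several capturing groups in the pattern, each element becomes a tuple with one entry per group, which is the shape of the output shown in the question.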

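Also, I am not sure you meant to use :-_ inside the character class: it forms a range from : to _ that includes all uppercase letters. To match a literal hyphen, put the - at the end of the character class.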
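
The range issue can be checked with a short sketch using the standard re module. The two compiled patterns below are reduced, hypothetical versions of the character class from the question, kept small so the difference is easy to see.

```python
import re

# Inside a character class, ":-_" is a range from ':' (0x3A) to '_' (0x5F).
# That range covers the uppercase letters A-Z, but not '-' itself (0x2D).
range_class = re.compile(r'[:-_]')

# With '-' placed last, it is a literal hyphen and no range is formed.
literal_class = re.compile(r'[:_-]')

print(bool(range_class.fullmatch('A')))    # True: 'A' falls inside the range
print(bool(range_class.fullmatch('-')))    # False: the hyphen is not matched
print(bool(literal_class.fullmatch('A')))  # False
print(bool(literal_class.fullmatch('-')))  # True
```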

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
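
As a quick check, the corrected pattern can be run through re.findall directly. This is a sketch under the assumption made above that regexp_tokenize behaves like re.findall here; with NLTK installed, nltk.regexp_tokenize(text, pattern) should give the same list.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

# No capturing groups remain, so findall returns the whole matches.
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```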
