How do I match an integer in an NLTK CFG?

Question  Votes: 2  Answers: 2

If I want to define a grammar in which one token matches an integer, how can I do that with NLTK's string CFG?

For example:

S -> SK SO FK
SK -> 'SELECT'
SO -> '\d+'
FK -> 'FROM'
python regex nlp nltk
2 Answers
1 vote

Create a number phrase like this:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
""")

sent = 'I shot 3 elephants'.split()
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))
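
Terminals in an NLTK CFG are compared with the input tokens as literal strings, so a number parses only if it is spelled out in the NUM rule. A minimal sketch of that behaviour follows, assuming an NLTK release in which ChartParser checks grammar coverage before parsing and raises ValueError for unknown tokens. The helper name try_parse is hypothetical and used only for illustration.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
""")
parser = nltk.ChartParser(grammar)

def try_parse(sentence):
    """Return the list of parse trees, or None if a token is not in the grammar."""
    try:
        return list(parser.parse(sentence.split()))
    except ValueError:
        # Raised by the grammar's coverage check when a token matches no terminal.
        return None

print(len(try_parse('I shot 3 elephants')))  # '3' is listed in NUM
print(try_parse('I shot 333 elephants'))     # '333' is not listed
```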

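Note, however, that this only handles the integers listed explicitly in the NUM rule (0 through 10). So let's try collapsing every integer into a single token type, e.g. '#NUM#':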

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in sent]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))

To put the numbers back into the tree, try:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

original_sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
numbers = [i for i in original_sent if i.isdigit()]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    treestr = str(tree)
    for n in numbers:
        treestr = treestr.replace('#NUM#', n, 1)
    print(treestr)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))
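
The string replacement above produces text rather than a tree. If the Tree object itself is needed afterwards, one alternative is to write the original tokens back into the leaves. This is only a sketch: the function name parse_keeping_numbers is hypothetical, and it relies on the leaves of a parse appearing in the same left-to-right order as the input tokens.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

def parse_keeping_numbers(tokens, placeholder='#NUM#'):
    """Parse with integers masked, then restore the original tokens as leaves."""
    masked = [placeholder if t.isdigit() else t for t in tokens]
    restored = []
    for tree in nltk.ChartParser(grammar).parse(masked):
        tree = tree.copy(deep=True)  # work on a private copy of each parse
        # One leaf position per input token, in the same order.
        for position, token in zip(tree.treepositions('leaves'), tokens):
            tree[position] = token
        restored.append(tree)
    return restored

for tree in parse_keeping_numbers('I shot 333 elephants'.split()):
    print(tree)
```

Because every leaf is overwritten with its original token, this also copes with sentences that contain several different numbers.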

0 votes

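A simple solution is to define a function that builds a parser from a given sentence and grammar. The integer problem is solved by extending the grammar on each call with productions for the integers that appear in the sentence. The grammar must refer to the nonterminal INT wherever an integer is expected. Here is an example function: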

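import nltk

def name_parser(G, sent):
    # Collect the integer tokens that appear in this sentence.
    ints = [i for i in sent if i.isdigit()]
    # Copy the grammar's productions and add one INT -> '<integer>' rule per token.
    lproductions = list(G.productions())
    lproductions.extend([nltk.grammar.Production(nltk.grammar.Nonterminal('INT'), [i]) for i in ints])
    # Build the extended grammar and parse the sentence with it.
    lgrammar = nltk.grammar.CFG(G.start(), lproductions)
    parser = nltk.ChartParser(lgrammar)
    for tree in parser.parse(sent):
        print(tree)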
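
As a usage sketch for this approach, the grammar from the question can be rewritten with SO -> INT and extended per sentence. The helper name build_int_parser and the choice to return the parser instead of printing are assumptions made for illustration, not part of the original answer. The sketch also assumes that NLTK accepts a base grammar in which INT has no productions yet.

```python
import nltk
from nltk.grammar import CFG, Nonterminal, Production

def build_int_parser(grammar, tokens):
    """Return a ChartParser whose grammar has INT -> '<n>' for each integer token."""
    ints = sorted({t for t in tokens if t.isdigit()})  # one rule per distinct integer
    extra = [Production(Nonterminal('INT'), [t]) for t in ints]
    extended = CFG(grammar.start(), list(grammar.productions()) + extra)
    return nltk.ChartParser(extended)

# The question's grammar, with the regex terminal replaced by the INT nonterminal.
base = CFG.fromstring("""
S -> SK SO FK
SK -> 'SELECT'
SO -> INT
FK -> 'FROM'
""")

tokens = 'SELECT 42 FROM'.split()
trees = list(build_int_parser(base, tokens).parse(tokens))
for tree in trees:
    print(tree)
```

Building a new grammar for every sentence is simple but repeats the setup work on each call, so for large inputs the '#NUM#' placeholder approach may be the cheaper option.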
