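I am trying to replicate the algorithm described in this paper: https://arxiv.org/pdf/1811.11008.pdf
On the last page it describes extracting what the grammar defines under the label 'NP JJ', using the following example: Operating profit margin was 8.3%, compared to 11.8% a year earlier.
I expected to see a leaf labelled 'NP JJ', but I don't, and I can't work out why (I'm relatively new to regex).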
import nltk
from nltk import word_tokenize

def split_sentence(sentence_as_string):
    '''Split a sentence into a list of words.'''
    words = word_tokenize(sentence_as_string)
    return words

def pos_tagging(sentence_as_list):
    '''Attach a POS tag to each word.'''
    words = nltk.pos_tag(sentence_as_list)
    return words

def get_regex(sentence, grammar):
    '''Tokenize, tag, and chunk the sentence with the given grammar.'''
    sentence = pos_tagging(split_sentence(sentence))
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(sentence)
    return result

example_sentence = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
# Grammar taken from the paper.
grammar = """JJ : {< JJ.∗ > ∗}
V B : {< V B.∗ >}
NP : {(< NNS|NN >)∗}
NP P : {< NNP|NNP S >}
RB : {< RB.∗ >}
CD : {< CD >}
NP JJ : : {< NP|NP P > +(< (>< .∗ > ∗ <) >) ∗ (< IN >< DT > ∗ < RB > ∗ < JJ > ∗ < NP|NP P >) ∗ < RB > ∗(< V B >< JJ >< NP >)∗ < V B > (< DT >< CD >< NP >) ∗ < NP|NP P > ∗ < CD > ∗ < .∗ > ∗ < CD > ∗| < NP|NP P >< IN >< NP|NP P >< CD >< .∗ > ∗ <, >< V B > < IN >< NP|NP P >< CD >}"""
# Replace the typographic asterisk (∗) with the ASCII one.
grammar = grammar.replace('∗','*')
tree = get_regex(example_sentence, grammar)
print(tree)
First, see "How to use nltk regex pattern to extract a specific phrase chunk?"
Let's see what the POS tags for this sentence are:
from nltk import word_tokenize, pos_tag
text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
pos_tag(word_tokenize(text))
[out]:
[('Operating', 'NN'),
('profit', 'NN'),
('margin', 'NN'),
('was', 'VBD'),
('8.3', 'CD'),
('%', 'NN'),
(',', ','),
('compared', 'VBN'),
('to', 'TO'),
('11.8', 'CD'),
('%', 'NN'),
('a', 'DT'),
('year', 'NN'),
('earlier', 'RBR'),
('.', '.')]
None of the POS tags in that sentence is JJ.
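A quick way to confirm which tags the grammar has to work with is to scan the tagger output directly. The sketch below hard-codes the output listed above rather than calling the tagger again; the variable names are only illustrative.

```python
# Tagger output copied by hand from the listing above.
tagged = [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'),
          ('was', 'VBD'), ('8.3', 'CD'), ('%', 'NN'), (',', ','),
          ('compared', 'VBN'), ('to', 'TO'), ('11.8', 'CD'), ('%', 'NN'),
          ('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')]

# Distinct tags present in the sentence.
present_tags = sorted({tag for _, tag in tagged})
print(present_tags)

# Words whose tag starts with JJ (JJ, JJR, JJS).
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]
print(adjectives)  # [] for this sentence
```

Any chunk rule that requires a `<JJ>` tag therefore has nothing to match in this particular sentence.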
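Looking at the paper (https://arxiv.org/pdf/1811.11008.pdf), NP JJ is not the final label for the sentence; it is a step towards the DOWN or UP label. Let's rephrase the steps:
1. Parse the sentence with a parser (in this case, a regular-expression parser driven by some grammar).
2. Identify that the sentence has a pattern that signals which final label to use.
2a. Traverse the parse tree to extract another pattern that tells us about the performance indicator and the numeric values.
2b. Use the extracted numeric values to determine the direction, DOWN / UP, using some heuristics.
2c. Tag the sentence with the DOWN / UP label identified in (2b).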
2a. Extract another pattern
We know that a percentage always comes out as CD NN, from:
('8.3', 'CD'), ('%', 'NN') ('11.8', 'CD'), ('%', 'NN')
So let's try to capture that in the grammar.
from nltk import RegexpParser
patterns = """
PERCENT: {<CD><NN>}
"""
PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(word_tokenize(text)))
[out]:
Tree('S', [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
Tree('PERCENT', [('8.3', 'CD'), ('%', 'NN')]),
(',', ','), ('compared', 'VBN'), ('to', 'TO'),
Tree('PERCENT', [('11.8', 'CD'), ('%', 'NN')]),
('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')])
Now, how do we get this:
- Identify that the sentence has a pattern that signals which final label to use.
We know that <PERCENT> compared to <PERCENT> is a good pattern, so let's try to encode it.
The words compared to carry the tags VBN TO, as the tagger output shows:
('8.3', 'CD'), ('%', 'NN'), (',', ','), ('compared', 'VBN'), ('to', 'TO'), ('11.8', 'CD'), ('%', 'NN'),
How about this:
patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT>}
"""
PChunker = RegexpParser(patterns)
PChunker.parse(pos_tag(word_tokenize(text)))
[out]:
Tree('S', [('Operating', 'NN'), ('profit', 'NN'), ('margin', 'NN'), ('was', 'VBD'),
    Tree('P2P', [
        Tree('PERCENT', [('8.3', 'CD'), ('%', 'NN')]),
        (',', ','), ('compared', 'VBN'), ('to', 'TO'),
        Tree('PERCENT', [('11.8', 'CD'), ('%', 'NN')])]
    ),
    ('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')]
)
But that pattern could match any arbitrary numbers. We need a signal for the performance indicator.
Since I'm not a domain expert in finance, simply checking for the presence of operating profit margin might be a good signal, i.e.
from nltk import word_tokenize, pos_tag, RegexpParser
patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT>}
"""
PChunker = RegexpParser(patterns)
text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
indicators = ['operating profit margin']
for i in indicators:
    # Only parse sentences that mention a known indicator.
    if i in text.lower():
        print(PChunker.parse(pos_tag(word_tokenize(text))))
[out]:
(S
  Operating/NN
  profit/NN
  margin/NN
  was/VBD
  (P2P
    (PERCENT 8.3/CD %/NN)
    ,/,
    compared/VBN
    to/TO
    (PERCENT 11.8/CD %/NN))
  a/DT
  year/NN
  earlier/RBR
  ./.)
Now how do we get UP / DOWN?
2b. Use the extracted numeric values to determine the direction, UP / DOWN, using some heuristics
Just from the example sentence, nothing tells us which number came first other than "earlier".
So let's hypothesize: if we have the pattern PERCENT VBN TO PERCENT earlier, we say that the second percentage is the older number.
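The hypothesis that a PERCENT chunk, a verb, TO, another PERCENT chunk and a trailing comparative adverb (RBR, here "earlier") appear in that order can be sanity-checked before writing any chunk grammar, by treating the tag sequence as a string and testing an ordinary regular expression against it. The sketch below uses only the standard library and merely approximates how NLTK tag patterns behave; the `ANY` helper, the hard-coded tag list, and the `variant` sequence are assumptions made for illustration.

```python
import re

# Tag sequence of the example sentence once each <CD><NN> pair is a PERCENT chunk
# (copied by hand from the parser output for the example sentence).
tags = ["NN", "NN", "NN", "VBD", "PERCENT", ",", "VBN", "TO",
        "PERCENT", "DT", "NN", "RBR", "."]

def to_tag_string(tag_list):
    """Join tags as <TAG><TAG>... so a plain regex can scan them."""
    return "".join(f"<{tag}>" for tag in tag_list)

# Inside a tag, "any character" must not cross a tag boundary.
ANY = r"[^<>]"

# PERCENT, an optional tag, a VB* tag, TO, PERCENT, any tags, then RBR.
hypothesis = re.compile(
    rf"<PERCENT>(?:<{ANY}*>)?<VB{ANY}*><TO><PERCENT>(?:<{ANY}*>)*<RBR>"
)

match = hypothesis.search(to_tag_string(tags))
print(match.group(0) if match else None)

# Hypothetical variant in which a preposition (IN) replaces "compared to".
variant = ["NN", "VBD", "PERCENT", ",", "IN", "PERCENT", "DT", "NN", "RBR", "."]
print(hypothesis.search(to_tag_string(variant)))
```

The first search matches the span from the first PERCENT to RBR, while the hypothetical variant is not matched at all, which hints at how narrowly such a rule is tied to one phrasing.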
With the P2P rule extended to end at the comparative adverb (RBR, here "earlier"), we can pull out the matching subtree:
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser
patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""
def traverse_tree(tree, label=None):
    # Yield the top-level subtrees that carry the given label.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() == label:
            yield subtree
PChunker = RegexpParser(patterns)
text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
parsed_text = PChunker.parse(pos_tag(word_tokenize(text)))
for p2p in traverse_tree(parsed_text, 'P2P'):
    print(p2p)
[out]:
(P2P (PERCENT 8.3/CD %/NN) ,/, compared/VBN to/TO (PERCENT 11.8/CD %/NN) a/DT year/NN earlier/RBR)
And the UP / DOWN label?
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser
patterns = """
PERCENT: {<CD><NN>}
P2P: {<PERCENT><.*>?<VB.*><TO><PERCENT><.*>*<RBR>}
"""
PChunker = RegexpParser(patterns)
def traverse_tree(tree, label=None):
    # Yield the top-level subtrees that carry the given label.
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() == label:
            yield subtree
def labelme(text):
    parsed_text = PChunker.parse(pos_tag(word_tokenize(text)))
    for p2p in traverse_tree(parsed_text, 'P2P'):
        # Check that the subtree ends with "earlier".
        if p2p.leaves()[-1] == ('earlier', 'RBR'):
            # Collect the numbers: the first is the current value, the second the earlier one.
            percentages = [float(num[0]) for num in p2p.leaves() if num[1] == 'CD']
            # Sanity check that there are only 2 numbers from our pattern.
            assert len(percentages) == 2
            # A current value below the earlier one means the indicator went down.
            if percentages[0] < percentages[1]:
                return 'DOWN'
            else:
                return 'UP'
text = "Operating profit margin was 8.3%, compared to 11.8% a year earlier."
labelme(text)
[out]:
DOWN
Now the question arises...
**Do you want to write this many rules and catch them with the labelme() above?**
Are the patterns you've written foolproof?
For example, are there cases where the pattern that compares percentages using the indicator and "earlier" will not give "UP" or "DOWN" as expected?
Why are we writing rules in the AI age?
Do you already have human-annotated data with sentences and their corresponding UP / DOWN labels? If so, let me suggest training a model on that data instead of writing rules by hand.
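One concrete case for the "foolproof" question: labelme() makes a single comparison and returns 'UP' in every other case, so two equal percentages would still be labelled 'UP'. Below is a minimal sketch of a stricter comparison, assuming the first number is the current value and the second the earlier one. The function name `direction` and the choice to return `None` for a tie are illustrative, not something taken from the paper.

```python
def direction(current, earlier):
    """Compare the reported value with the earlier one.

    Returns 'UP', 'DOWN', or None when the two values are equal.
    """
    if current > earlier:
        return 'UP'
    if current < earlier:
        return 'DOWN'
    return None  # no change, so neither label applies

print(direction(8.3, 11.8))  # DOWN
print(direction(5.0, 5.0))   # None
```

Whether a tie should get its own label or be dropped is a modelling decision that the example sentence alone does not settle.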