NLTK relation extraction returns nothing

Question description  Votes: 6  Answers: 2

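I have recently been working on extracting relations from text with NLTK, so I built a sample text, "Tom is the cofounder of Microsoft.", and tested it with the following program. It returns nothing, and I cannot figure out why.

I am using NLTK version 3.2.1 and Python version 3.5.2.

Here is my code: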

import re
import nltk
from nltk.sem.relextract import extract_rels, rtuple
from nltk.tokenize import sent_tokenize, word_tokenize


def test():
    with open('sample.txt', 'r') as f:
        sample = f.read()   # "Tom is the cofounder of Microsoft"

    sentences = sent_tokenize(sample)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.tag.pos_tag(sentence) for sentence in tokenized_sentences]

    OF = re.compile(r'.*\bof\b.*')

    for i, sent in enumerate(tagged_sentences):
        sent = nltk.chunk.ne_chunk(sent) # ne_chunk method expects one tagged sentence
        rels = extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10) 
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))

if __name__ == '__main__':
    test()

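1. After some debugging, I found that when I changed the input to

"Gates was born in Seattle, Washington on October 28, 1955."

the nltk.chunk.ne_chunk() output is:

(S (PERSON Gates/NNS) was/VBD born/VBN in/IN (GPE Seattle/NNP) ,/, (GPE Washington/NNP) on/IN October/NNP 28/CD ,/, 1955/CD ./.)

and test() returns:

[PER: 'Gates/NNS'] 'was/VBD born/VBN in/IN' [GPE: 'Seattle/NNP']

2. After I changed the input to:

"Gates was born in Seattle on October 28, 1955."

test() returns nothing.

I dug into nltk/sem/relextract.py and found this strange:

The output is determined by the function semi_rel2reldict(pairs, window=5, trace=False), which only produces results when len(pairs) > 2. That is why a sentence with fewer than three NEs returns nothing.

Is this a bug, or am I using NLTK the wrong way?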

python nltk semantics relation knowledge-base-population
2 answers

6 votes

First, to chunk NEs with ne_chunk, the idiom looks something like this:

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')])])

(see also https://stackoverflow.com/a/31838373/610569)

Next, let's look at the extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.
    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').
    """

When you invoke this function:

extract_rels('PER', 'GPE', sent, corpus='ace', pattern=OF, window=10)

it performs 4 steps in sequence.

1. It checks whether your subjclass and objclass are valid

i.e. https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L202:

if subjclass and subjclass not in NE_CLASSES[corpus]:
    if _expand(subjclass) in NE_CLASSES[corpus]:
        subjclass = _expand(subjclass)
    else:
        raise ValueError("your value for the subject type has not been recognized: %s" % subjclass)
if objclass and objclass not in NE_CLASSES[corpus]:
    if _expand(objclass) in NE_CLASSES[corpus]:
        objclass = _expand(objclass)
    else:
        raise ValueError("your value for the object type has not been recognized: %s" % objclass)

2. It extracts "pairs" from your NE tagged inputs:

if corpus == 'ace' or corpus == 'conll2002':
    pairs = tree2semi_rel(doc)
elif corpus == 'ieer':
    pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
else:
    raise ValueError("corpus type not recognized")

Now let's see what tree2semi_rel() returns for your input sentence, Tom is the cofounder of Microsoft:

>>> from nltk.sem.relextract import tree2semi_rel, semi_rel2reldict
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]

So it returns a list of 2 lists. The first inner list consists of an empty list and the Tree that contains the "PERSON" tag:

[[], Tree('PERSON', [('Tom', 'NNP')])]

The second inner list consists of the phrase is the cofounder of and the Tree that contains "ORGANIZATION".
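The pairing step can be illustrated without NLTK. The following is a minimal sketch, not NLTK's actual implementation: it assumes a flat chunked sentence is a list whose items are either (word, tag) tuples or a hypothetical Entity record standing in for an NE subtree. The names Entity and to_pairs are made up for this illustration.

```python
from typing import List, NamedTuple, Tuple, Union

Token = Tuple[str, str]  # (word, POS tag)


class Entity(NamedTuple):
    """Hypothetical stand-in for an NE subtree such as Tree('PERSON', [...])."""
    label: str
    leaves: List[Token]


def to_pairs(sentence: List[Union[Token, Entity]]) -> list:
    """Group a flat sentence into [filler_tokens, entity] pairs.

    Each pair holds the plain tokens seen since the previous entity,
    followed by the entity itself. Tokens after the last entity are
    dropped in this sketch.
    """
    pairs = []
    filler: List[Token] = []
    for item in sentence:
        if isinstance(item, Entity):
            pairs.append([filler, item])
            filler = []
        else:
            filler.append(item)
    return pairs


sentence = [
    Entity("PERSON", [("Tom", "NNP")]),
    ("is", "VBZ"), ("the", "DT"), ("cofounder", "NN"), ("of", "IN"),
    Entity("ORGANIZATION", [("Microsoft", "NNP")]),
]
pairs = to_pairs(sentence)
print(len(pairs))         # 2
print(pairs[0][0])        # []
print(pairs[1][1].label)  # ORGANIZATION
```

Two entities give exactly two pairs, which is the situation the example sentence ends up in.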

Let's move on.

3. extract_rels then tries to change the pairs to some sort of relation dictionary

reldicts = semi_rel2reldict(pairs)

If we look at what the semi_rel2reldict function returns for your example sentence, we see that this is where the empty list comes from:

>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[]

So let's look at the code of semi_rel2reldict:

https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L144

def semi_rel2reldict(pairs, window=5, trace=False):
    """
    Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
    stores information about the subject and object NEs plus the filler between them.
    Additionally, a left and right context of length =< window are captured (within
    a given input sentence).
    :param pairs: a pair of list(str) and ``Tree``, as generated by
    :param window: a threshold for the number of items to include in the left and right context
    :type window: int
    :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
    :rtype: list(defaultdict)
    """
    result = []
    while len(pairs) > 2:
        reldict = defaultdict(str)
        reldict['lcon'] = _join(pairs[0][0][-window:])
        reldict['subjclass'] = pairs[0][1].label()
        reldict['subjtext'] = _join(pairs[0][1].leaves())
        reldict['subjsym'] = list2sym(pairs[0][1].leaves())
        reldict['filler'] = _join(pairs[1][0])
        reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
        reldict['objclass'] = pairs[1][1].label()
        reldict['objtext'] = _join(pairs[1][1].leaves())
        reldict['objsym'] = list2sym(pairs[1][1].leaves())
        reldict['rcon'] = _join(pairs[2][0][:window])
        if trace:
            print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
        result.append(reldict)
        pairs = pairs[1:]
    return result

The first thing semi_rel2reldict() does is check whether the output of tree2semi_rel() has more than 2 elements, which is not the case for your example sentence:

>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> len(tree2semi_rel(chunked))
2
>>> len(tree2semi_rel(chunked)) > 2
False

Aha, that is why extract_rels returns nothing.
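To see the effect of the len(pairs) > 2 guard in isolation, here is a small sketch that mimics only the loop structure on plain Python data. sliding_relations is a hypothetical helper, and the pairs are simplified to (filler, label) tuples instead of NLTK trees, so this is an illustration of the control flow and not the library code.

```python
def sliding_relations(pairs, minimum=3):
    """List the (subject, filler, object) triples a sliding loop would emit.

    With minimum=3 this has the same shape as `while len(pairs) > 2`:
    the first pair is the subject, the second is the object, and a third
    pair must exist before anything is emitted.
    """
    emitted = []
    while len(pairs) >= minimum:
        subj, obj = pairs[0], pairs[1]
        emitted.append((subj[1], obj[0], obj[1]))
        pairs = pairs[1:]  # slide one pair to the right
    return emitted


two = [("", "PERSON"), ("is the cofounder of", "ORGANIZATION")]
three = two + [("and now he is the founder of", "PERSON")]

print(sliding_relations(two))             # []
print(sliding_relations(three))           # [('PERSON', 'is the cofounder of', 'ORGANIZATION')]
print(sliding_relations(two, minimum=2))  # [('PERSON', 'is the cofounder of', 'ORGANIZATION')]
```

With the default threshold, a sentence with two entities never enters the loop, and a sentence with three entities yields only one relation because the last entity is never used as an object.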

Now the question is how to make extract_rels() return something even with only 2 elements from tree2semi_rel(). Is that even possible?

Let's try a different sentence:

>>> text = "Tom is the cofounder of Microsoft and now he is the founder of Marcohard"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> chunked
Tree('S', [Tree('PERSON', [('Tom', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Microsoft', 'NNP')]), ('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN'), Tree('PERSON', [('Marcohard', 'NNP')])])
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])], [[('and', 'CC'), ('now', 'RB'), ('he', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('founder', 'NN'), ('of', 'IN')], Tree('PERSON', [('Marcohard', 'NNP')])]]
>>> len(tree2semi_rel(chunked)) > 2
True
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': 'and/CC now/RB he/PRP is/VBZ the/DT', 'subjtext': 'Tom/NNP'})]

But that only confirms that extract_rels cannot extract anything when tree2semi_rel returns fewer than 3 pairs. What happens if we remove the condition while len(pairs) > 2?

Why can't we do while len(pairs) > 1?

If we look closer at the code, we see the last line that populates the reldict:

https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L169

reldict['rcon'] = _join(pairs[2][0][:window])

It tries to access a 3rd element of pairs, and if the length of pairs is 2, you get an IndexError.
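The failure is easy to reproduce on plain lists. The sketch below also shows one possible guard that keeps a right context when a third pair exists and falls back to an empty one otherwise. This is an illustrative alternative under simplified assumptions (pairs are [filler_tokens, label] lists, and right_context is a hypothetical helper), not something NLTK provides.

```python
def right_context(pairs, window=5):
    """Return up to `window` filler tokens of the third pair, or [] if there is none."""
    if len(pairs) > 2:
        return pairs[2][0][:window]
    return []


# Pairs are simplified here to [filler_tokens, entity_label].
two_pairs = [
    [[], "PERSON"],
    [[("is", "VBZ"), ("the", "DT"), ("cofounder", "NN"), ("of", "IN")], "ORGANIZATION"],
]
three_pairs = two_pairs + [
    [[("and", "CC"), ("now", "RB"), ("he", "PRP")], "PERSON"],
]

try:
    two_pairs[2][0][:5]  # the unguarded lookup fails with only 2 pairs
    unguarded_failed = False
except IndexError:
    unguarded_failed = True

print(unguarded_failed)                      # True
print(right_context(two_pairs))              # []
print(right_context(three_pairs, window=2))  # [('and', 'CC'), ('now', 'RB')]
```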

So what happens if we drop that rcon lookup and simply change the loop condition to while len(pairs) >= 2?

To do that, we have to override the semi_rel2reldict() function:

>>> from nltk.sem.relextract import _join, list2sym
>>> from collections import defaultdict
>>> def semi_rel2reldict(pairs, window=5, trace=False):
...     """
...     Converts the pairs generated by ``tree2semi_rel`` into a 'reldict': a dictionary which
...     stores information about the subject and object NEs plus the filler between them.
...     Additionally, a left and right context of length =< window are captured (within
...     a given input sentence).
...     :param pairs: a pair of list(str) and ``Tree``, as generated by
...     :param window: a threshold for the number of items to include in the left and right context
...     :type window: int
...     :return: 'relation' dictionaries whose keys are 'lcon', 'subjclass', 'subjtext', 'subjsym', 'filler', objclass', objtext', 'objsym' and 'rcon'
...     :rtype: list(defaultdict)
...     """
...     result = []
...     while len(pairs) >= 2:
...         reldict = defaultdict(str)
...         reldict['lcon'] = _join(pairs[0][0][-window:])
...         reldict['subjclass'] = pairs[0][1].label()
...         reldict['subjtext'] = _join(pairs[0][1].leaves())
...         reldict['subjsym'] = list2sym(pairs[0][1].leaves())
...         reldict['filler'] = _join(pairs[1][0])
...         reldict['untagged_filler'] = _join(pairs[1][0], untag=True)
...         reldict['objclass'] = pairs[1][1].label()
...         reldict['objtext'] = _join(pairs[1][1].leaves())
...         reldict['objsym'] = list2sym(pairs[1][1].leaves())
...         reldict['rcon'] = []
...         if trace:
...             print("(%s(%s, %s)" % (reldict['untagged_filler'], reldict['subjclass'], reldict['objclass']))
...         result.append(reldict)
...         pairs = pairs[1:]
...     return result
... 
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

Ah! It works, but there is still a 4th step in extract_rels().

4. It filters the reldicts using the regex you passed to the pattern parameter of extract_rels():

https://github.com/nltk/nltk/blob/develop/nltk/sem/relextract.py#L222

relfilter = lambda x: (x['subjclass'] == subjclass and
                       len(x['filler'].split()) <= window and
                       pattern.match(x['filler']) and
                       x['objclass'] == objclass)

Now let's try it with the hacked version of semi_rel2reldict:

>>> import re
>>> text = "Tom is the cofounder of Microsoft"
>>> chunked = ne_chunk(pos_tag(word_tokenize(text)))
>>> tree2semi_rel(chunked)
[[[], Tree('PERSON', [('Tom', 'NNP')])], [[('is', 'VBZ'), ('the', 'DT'), ('cofounder', 'NN'), ('of', 'IN')], Tree('ORGANIZATION', [('Microsoft', 'NNP')])]]
>>> semi_rel2reldict(tree2semi_rel(chunked))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]
>>> pattern = re.compile(r'.*\bof\b.*')
>>> reldicts = semi_rel2reldict(tree2semi_rel(chunked))
>>> relfilter = lambda x: (x['subjclass'] == subjclass and
...                            len(x['filler'].split()) <= window and
...                            pattern.match(x['filler']) and
...                            x['objclass'] == objclass)
>>> relfilter
<function <lambda> at 0x112e591b8>
>>> subjclass = 'PERSON'
>>> objclass = 'ORGANIZATION'
>>> window = 5
>>> list(filter(relfilter, reldicts))
[defaultdict(<type 'str'>, {'lcon': '', 'untagged_filler': 'is the cofounder of', 'filler': 'is/VBZ the/DT cofounder/NN of/IN', 'objsym': 'microsoft', 'objclass': 'ORGANIZATION', 'objtext': 'Microsoft/NNP', 'subjsym': 'tom', 'subjclass': 'PERSON', 'rcon': [], 'subjtext': 'Tom/NNP'})]

It works! Now let's see it in tuple form:

>>> from nltk.sem.relextract import rtuple
>>> rels = list(filter(relfilter, reldicts))
>>> for rel in rels:
...     print rtuple(rel)
... 
[PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']
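The filter step can also be tried on a plain dictionary, without running the tagger or chunker. The sketch below copies three fields from the reldict printed in this answer and wraps the same four conditions in make_filter, a hypothetical helper introduced only for this illustration. It also shows that an object class of 'GPE' rejects this relation, because the chunker output above labels Microsoft as ORGANIZATION.

```python
import re


def make_filter(subjclass, objclass, pattern, window=10):
    """Build a predicate with the same four conditions as relfilter."""
    def relfilter(reldict):
        return bool(
            reldict["subjclass"] == subjclass
            and len(reldict["filler"].split()) <= window
            and pattern.match(reldict["filler"])
            and reldict["objclass"] == objclass
        )
    return relfilter


OF = re.compile(r".*\bof\b.*")

# Fields copied from the reldict shown for "Tom is the cofounder of Microsoft".
reldict = {
    "subjclass": "PERSON",
    "filler": "is/VBZ the/DT cofounder/NN of/IN",
    "objclass": "ORGANIZATION",
}

person_org = make_filter("PERSON", "ORGANIZATION", OF)
person_gpe = make_filter("PERSON", "GPE", OF)

print(person_org(reldict))  # True
print(person_gpe(reldict))  # False
```

Note that the pattern is matched against the tagged filler string (with the /TAG suffixes), so a regex written for plain words still needs to tolerate those suffixes.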

-1 votes

alvas's solution works very well! One small modification, though: instead of writing

>>> from nltk.sem.relextract import rtuple
>>> rels = list(filter(relfilter, reldicts))
>>> for rel in rels:
...     print rtuple(rel)
... 
[PER: 'Tom/NNP'] 'is/VBZ the/DT cofounder/NN of/IN' [ORG: 'Microsoft/NNP']

please use

>>> for rel in rels:
...     print(rtuple(rel))

- can't add comments
