Extracting relationships using NLTK

Question (votes: 0, answers: 4)

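This is a follow-up to my earlier question. I am using nltk to parse out persons, organizations, and the relationships between them. Using this example, I was able to create chunks of persons and organizations; however, the nltk.sem.extract_rels call raises an error: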

AttributeError: 'Tree' object has no attribute 'text'

The full code is below:

import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)

# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]

# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc,corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)

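This is very similar to the example given in the book, but that example uses prepared "parsed docs", which appear out of nowhere, and I don't know where to find their object type. I have also looked through the git repositories. Any help is appreciated.

My ultimate goal is to extract persons, organizations, and titles (with dates) for some companies, and then build network maps of the persons and organizations.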

python nlp nltk
4 Answers
6 votes
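It looks like, to be a "parsed doc", an object needs to have a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works: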

import nltk
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

class doc():
    pass

doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print nltk.sem.relextract.show_raw_rtuple(rel)

When run, it produces the output:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']

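Obviously you wouldn't actually code it like this, but it provides a working example of the data format that extract_rels expects; you just need to work out how to do the preprocessing steps to massage your data into that format.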
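To make the shape of that format concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation: Entity and fillers_between_entities are hypothetical names, with Entity standing in for nltk.Tree so the sketch runs without NLTK. The assumption is only that a relation candidate is formed from two neighbouring entities plus the plain tokens between them (the "filler").

```python
from collections import namedtuple

# Hypothetical stand-in for nltk.Tree: an entity label plus its leaf tokens.
Entity = namedtuple("Entity", ["label", "leaves"])


def fillers_between_entities(tokens):
    """Return (left_entity, filler_text, right_entity) for each adjacent entity pair."""
    triples = []
    left = None    # most recent entity seen so far
    filler = []    # plain tokens seen since that entity
    for tok in tokens:
        if isinstance(tok, Entity):
            if left is not None:
                triples.append((left, " ".join(filler), tok))
            left, filler = tok, []
        else:
            filler.append(tok)
    return triples


# Same token list as the hacky example, with Entity in place of nltk.Tree.
text = [Entity('ORGANIZATION', ['WHYY']), 'in',
        Entity('LOCATION', ['Philadelphia']), '.', 'Ms.',
        Entity('PERSON', ['Gross']), ',']

for subj, filler, obj in fillers_between_entities(text):
    print(subj.label, repr(filler), obj.label)
```

The first candidate pairs WHYY with Philadelphia through the filler 'in', which is the one the IN pattern accepts in the example above; the second candidate has the filler '. Ms.' and would be rejected by that pattern.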


5 votes
Here is the source code of the nltk.sem.extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
    ....

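So if you pass the corpus parameter as ieer, nltk.sem.extract_rels expects the doc parameter to be an IEERDocument object. You should pass corpus as ace, or simply not pass it (it defaults to ace). In that case it expects a list of chunk trees (which is what you want). I modified the code as follows: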

import nltk
import re
from nltk.sem import extract_rels,rtuple

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# here i changed reg ex and below i exchanged subj and obj classes' places
OF = re.compile(r'.*\bof\b.*')

for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent) # ne_chunk method expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7) # extract_rels method expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))

It gives the result:

[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
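Note that the filler in this output is a string of word/TAG tokens, which affects which patterns can match. The sketch below uses only the standard re module and assumes the pattern is applied with match() from the start of the filler. The first filler is copied from the output above; the second filler and its tags are made up for illustration.

```python
import re

# Filler copied from the output above: every token appears as word/TAG.
filler = "and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT"

# Pattern from this answer: a word boundary also falls between "of" and "/".
OF = re.compile(r'.*\bof\b.*')
of_hit = OF.match(filler) is not None

# Hypothetical tagged filler for a "<person> served as <title>" sentence.
tagged = "served/VBD as/IN chairman/NN"

# Pattern from the question: requires whitespace right after "as",
# but in a tagged filler "as" is followed by "/IN".
AS_SPACE = re.compile(r'.+\s+as\s+')
as_space_hit = AS_SPACE.match(tagged) is not None

# Word-boundary variant, written in the same style as the OF pattern.
AS_BOUNDARY = re.compile(r'.*\bas\b')
as_boundary_hit = AS_BOUNDARY.match(tagged) is not None

print(of_hit, as_space_hit, as_boundary_hit)
```

Under these assumptions, a pattern written with \b around the keyword is more robust against tagged fillers than one that expects whitespace after it.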
    

0 votes
This is an nltk version issue. Your code should work in nltk 2.x, but for nltk 3 you should write it like this:

import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))

See also: NLTK relation extraction example does not work


0 votes
import nltk
from nltk import word_tokenize, pos_tag
from nltk.chunk import ne_chunk
from nltk.sem import relextract

def semantic_role_labeling(sentence):
    # Tokenize the sentence
    tokens = word_tokenize(sentence)

    # Perform part-of-speech tagging
    tagged_tokens = pos_tag(tokens)
    # Extract named entities
    entities = ne_chunk(tagged_tokens)
    # Perform semantic role labeling
    # NOTE: the extract_rels docstring types `pattern` as a compiled regex, not a str
    roles = relextract.extract_rels('PER', 'ORG', entities, corpus='ace', pattern='ie')
    # Print a message before the loop
    print("Semantic Roles:")
    for role in roles:
        subjtext = role['subjtext'][0] if 'subjtext' in role else None
        filler = role['filler'][0] if 'filler' in role else None
        objtext = role['objtext'][0] if 'objtext' in role else None
        print(f"Agent: {subjtext}, Action: {filler}, Patient: {objtext}")

# Example sentence
sentence = "John went to school"

# Perform semantic role labeling
semantic_role_labeling(sentence)

This gives no result.
