How can I give CoreNLP some pre-labeled named entities?

Question

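I want to use Stanford CoreNLP to pull out coreferences and start working on the dependencies of pre-labeled text. Eventually I hope to build graph nodes and edges between related named entities. I am working in Python, but I use nltk's Java helper functions to call the "edu.stanford.nlp.pipeline.StanfordCoreNLP" jar directly (which is what nltk does behind the scenes anyway).

My pre-labeled text is in the following format: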

PRE-LABELED:  During his youth, [PERSON: Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16.  Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela].  He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety.

What I tried was to tokenize my sentences myself, building a list of tuples in IOB format: [("During","O"), ("his","O"), ("youth","O"), ("Alexander","B-PERSON"), ("III","I-PERSON"), ...]
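For reference, this is roughly how such a tuple list can be built from the bracketed text. It is a minimal sketch, not the exact code I ran: the names `LABEL_RE`, `TOKEN_RE` and `to_iob` are made up for illustration, and the tokenizer is a naive regex that will not match CoreNLP's own tokenization in every case.

```python
import re

# Matches labeled spans such as "[PERSON: Alexander III of Macedon]".
LABEL_RE = re.compile(r"\[(?P<tag>[A-Za-z]+):\s*(?P<ner>[^\]]+?)\s*\]")
# Naive tokenizer (an assumption): runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def to_iob(text):
    """Return [(token, iob_tag), ...] for bracket-labeled text."""
    pairs = []
    pos = 0
    for m in LABEL_RE.finditer(text):
        # Everything before the labeled span is outside any entity.
        pairs += [(tok, "O") for tok in TOKEN_RE.findall(text[pos:m.start()])]
        # The first token of the span gets B-, the remaining tokens get I-.
        for i, tok in enumerate(TOKEN_RE.findall(m.group("ner"))):
            prefix = "B-" if i == 0 else "I-"
            pairs.append((tok, prefix + m.group("tag")))
        pos = m.end()
    # Trailing text after the last labeled span.
    pairs += [(tok, "O") for tok in TOKEN_RE.findall(text[pos:])]
    return pairs

sample = ("During his youth, [PERSON: Alexander III of Macedon] was tutored "
          "by [PERSON: Aristotle] until age 16.")
iob = to_iob(sample)
print(iob[:8])
```

Punctuation is kept as separate "O" tokens here, so the comma after "youth" shifts the index of every later token by one.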

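However, I can't figure out how to tell CoreNLP to take this tuple list as a starting point, build additional named entities that weren't initially labeled, and find coreferences on these new, higher-quality tokenized sentences. I did try simply stripping out my labels and letting CoreNLP do this by itself, but CoreNLP is not as good at finding the named entities as the human-tagged pre-labeled text.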

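I need output like the following. I understand that it will be difficult to use dependencies to get edges in this way, but I need to see how far I can get.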

DESIRED OUTPUT:
[Person 1]:
Name: Alexander III of Macedon
Mentions:
* "Alexander III of Macedon"; Sent1 [4,5,6,7] # List of tokens
* "Alexander"; Sent2 [6]
* "He"; Sent3 [1]
Edges:
* "Person 2"; "tutored by"; "Aristotle"

[Person 2]:
Name: Aristotle
[....]

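How can I give CoreNLP some pre-identified named entities and still get help with additional named entities, coreference, and basic dependencies?

P.S. Note that this is not a duplicate of "NLTK Named Entity Recognition with Custom Data". I'm not trying to train a new classifier with my pre-labeled NER; I'm only trying to add CoreNLP's results to my own when running coreference (including mentions) and dependencies on a given sentence.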

Answer 1

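The answer is to make a rules file using Additional TokensRegexNER Rules.

I used a regex to group out the labeled names. From these I built a temporary rules file, which I passed to the corenlp jar with -ner.additional.regexner.mapping mytemprulesfile.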

Alexander III of Macedon    PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Aristotle                   PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Anatolia                    LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Alexander                   PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Persia                      LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Issus                       LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Gaugamela                   LOCATION    PERSON,LOCATION,ORGANIZATION,MISC
Persian King Darius III     PERSON      PERSON,LOCATION,ORGANIZATION,MISC
Achaemenid Empire           ORGANIZATION    PERSON,LOCATION,ORGANIZATION,MISC

I've aligned this list for readability, but these are tab-separated values.
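To make the tab-separated layout concrete, here is a small sketch that writes such a file from (text, tag) pairs. The helper names `build_rules` and `write_rules_file` are hypothetical, and skipping repeated pairs is my own addition, not something CoreNLP requires. As far as I understand, the first column is read as a sequence of token patterns, so entity text containing regex metacharacters would probably need escaping; that case is not handled here.

```python
import os
import tempfile

# Entity types a rule is allowed to overwrite, copied from the listing above.
OVERWRITABLE = "PERSON,LOCATION,ORGANIZATION,MISC"

def build_rules(entities):
    """Build tab-separated rule lines from (text, tag) pairs, skipping repeats."""
    seen = set()
    lines = []
    for text, tag in entities:
        if (text, tag) in seen:
            continue
        seen.add((text, tag))
        # Columns: pattern <TAB> entity type <TAB> overwritable types
        lines.append("\t".join([text, tag, OVERWRITABLE]))
    return "".join(line + "\n" for line in lines)

def write_rules_file(entities):
    """Write the rules to a temporary file and return its path (caller deletes it)."""
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                     encoding="utf-8") as fh:
        fh.write(build_rules(entities))
        return fh.name

pairs = [
    ("Alexander III of Macedon", "PERSON"),
    ("Aristotle", "PERSON"),
    ("Aristotle", "PERSON"),      # repeated on purpose; written only once
    ("Anatolia", "LOCATION"),
]
path = write_rules_file(pairs)
with open(path, encoding="utf-8") as fh:
    print(fh.read())
os.unlink(path)
```

The returned path is what would be passed as the value of -ner.additional.regexner.mapping.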

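One interesting finding is that some multi-word pre-labeled entities stay multi-word as originally labeled, whereas running corenlp without the rules file sometimes splits the tokens into separate entities.

I had wanted to identify the named-entity tokens specifically, figuring it would make coreference easier, but I guess this will do for now. How often are entity names going to be identical but unrelated within one document anyway?

Example (execution takes about 70 seconds)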

import os, re, tempfile, json, nltk, pprint
from subprocess import PIPE
from nltk.internals import (
    find_jar_iter,
    config_java,
    java,
    _java_options,
    find_jars_within_path,
)

def ExtractLabeledEntitiesByRegex( text, regex ):
    rgx = re.compile(regex)
    nelist = []
    for mobj in rgx.finditer( text ):
        ne = mobj.group('ner')
        try:
            tag = mobj.group('tag')
        except IndexError:
            tag = 'PERSON'
        mstr = text[mobj.start():mobj.end()]
        nelist.append( (ne,tag,mstr) )
    # Raw string: "\g" is not a valid escape in a normal string literal
    cleantext = rgx.sub(r"\g<ner>", text)
    return (nelist, cleantext)

def GenerateTokensNERRules( nelist ):
    rules = ""
    for ne in nelist:
        rules += ne[0]+'\t'+ne[1]+'\tPERSON,LOCATION,ORGANIZATION,MISC\n'
    return rules

def GetEntities( origtext ):
    # Raw string so the regex escapes (\[, \s, \w) reach the re module unchanged
    nelist, cleantext = ExtractLabeledEntitiesByRegex( origtext, r'(\[(?P<tag>[a-zA-Z]+)\:\s*)(?P<ner>(\s*\w)+)(\s*\])' )

    origfile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    origfile.write( cleantext.encode('utf-8') )
    origfile.flush()
    origfile.seek(0)
    nerrulefile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    nerrulefile.write( GenerateTokensNERRules(nelist).encode('utf-8') )
    nerrulefile.flush()
    nerrulefile.seek(0)

    java_options='-mx4g'
    config_java(options=java_options, verbose=True)
    stanford_jar = '../stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar'
    stanford_dir = os.path.split(stanford_jar)[0]
    _classpath = tuple(find_jars_within_path(stanford_dir))

    cmd = ['edu.stanford.nlp.pipeline.StanfordCoreNLP',
        '-annotators','tokenize,ssplit,pos,lemma,ner,parse,coref,coref.mention,depparse,natlog,openie,relation',
        '-ner.combinationMode','HIGH_RECALL',
        '-ner.additional.regexner.mapping',nerrulefile.name,
        '-coref.algorithm','neural',
        '-outputFormat','json',
        '-file',origfile.name
        ]

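    # Run CoreNLP; the result is written to <input file name>.json in the current directory.
    # (I couldn't get stdin=textfile working, hence the -file option.)
    stdout, stderr = java( cmd, classpath=_classpath, stdout=PIPE, stderr=PIPE )
    # Echo whatever the Java process printed, for debugging
    print( stdout.decode('utf-8', errors='replace') )
    print( stderr.decode('utf-8', errors='replace') )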

    # CoreNLP names its output after the input file
    jsonfilename = os.path.basename(origfile.name) + '.json'

    # Close before deleting; an open file cannot be unlinked on Windows
    origfile.close()
    nerrulefile.close()
    os.unlink( origfile.name )
    os.unlink( nerrulefile.name )

    with open( jsonfilename ) as jsonfile:
        jsondata = json.load(jsonfile)

    currentid = 0
    entities = []
    for sent in jsondata['sentences']:
        for thisentity in sent['entitymentions']:
            tag = thisentity['ner']
            if tag == 'PERSON' or tag == 'LOCATION' or tag == 'ORGANIZATION':
                entity = {
                    'id':currentid,
                    'label':thisentity['text'],
                    'tag':tag
                }
                entities.append( entity )
                currentid += 1

    return entities

#### RUN ####
corpustext = "During his youth, [PERSON:Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16.  Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela].  He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety."

entities = GetEntities( corpustext )
for thisent in entities:
    pprint.pprint( thisent )

Output

{'id': 0, 'label': 'Alexander III of Macedon', 'tag': 'PERSON'}
{'id': 1, 'label': 'Aristotle', 'tag': 'PERSON'}
{'id': 2, 'label': 'his', 'tag': 'PERSON'}
{'id': 3, 'label': 'Anatolia', 'tag': 'LOCATION'}
{'id': 4, 'label': 'Alexander', 'tag': 'PERSON'}
{'id': 5, 'label': 'Persia', 'tag': 'LOCATION'}
{'id': 6, 'label': 'Issus', 'tag': 'LOCATION'}
{'id': 7, 'label': 'Gaugamela', 'tag': 'LOCATION'}
{'id': 8, 'label': 'Persian King Darius III', 'tag': 'PERSON'}
{'id': 9, 'label': 'Achaemenid Empire', 'tag': 'ORGANIZATION'}
{'id': 10, 'label': 'He', 'tag': 'PERSON'}
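The list above covers only the NER part of the question. For the coreference part, the same JSON should also contain a "corefs" section when the coref annotator runs. Below is a rough sketch of how its chains could be grouped into the "Name / Mentions" layout from the question. It assumes each mention carries the keys `text`, `sentNum`, `startIndex`, `endIndex` and `isRepresentativeMention`, with `endIndex` exclusive. The function name `group_coref_mentions` is hypothetical, and the sample data is hand-written for illustration, not real CoreNLP output.

```python
def group_coref_mentions(jsondata):
    """Turn the 'corefs' section of the parsed JSON into one record per chain."""
    records = []
    for chain_id, mentions in jsondata.get("corefs", {}).items():
        if not mentions:
            continue
        # Use the representative mention as the display name, if one is flagged.
        rep = next((m for m in mentions if m.get("isRepresentativeMention")),
                   mentions[0])
        records.append({
            "id": chain_id,
            "name": rep["text"],
            "mentions": [
                {
                    "text": m["text"],
                    "sentence": m["sentNum"],
                    # Assumes endIndex is exclusive, so range() lists the token numbers.
                    "tokens": list(range(m["startIndex"], m["endIndex"])),
                }
                for m in mentions
            ],
        })
    return records

# Hand-written sample in the assumed layout (index values are illustrative only).
sample = {
    "corefs": {
        "1": [
            {"text": "Alexander III of Macedon", "sentNum": 1,
             "startIndex": 5, "endIndex": 9, "isRepresentativeMention": True},
            {"text": "Alexander", "sentNum": 2,
             "startIndex": 7, "endIndex": 8, "isRepresentativeMention": False},
            {"text": "He", "sentNum": 3,
             "startIndex": 1, "endIndex": 2, "isRepresentativeMention": False},
        ]
    }
}

records = group_coref_mentions(sample)
for rec in records:
    print(rec["name"])
    for mention in rec["mentions"]:
        print("  *", mention["text"], "Sent%d" % mention["sentence"], mention["tokens"])
```

To use this with the example above, GetEntities would have to return or otherwise expose jsondata, since it currently returns only the entity list.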
