Formatting the NER output of Stanford CoreNLP

Question

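I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that every word is tagged with the annotation separately. So if the entity is "NEW YORK TIMES", it gets recorded as three different entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that we get the combined output as a single entity?

With Stanford NER, the command-line utility lets us choose inlineXML as the output format. Can we somehow set a property to select the output format in Stanford CoreNLP as well?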

Answer 1

If you just want the complete string of each named entity found by Stanford NER, try this:

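import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third)); // full entity string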

In case you are wondering, the entity class is given by entity.first.

Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
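If you go the inline XML route and still need the entities as (class, text) pairs afterwards, a small regular expression is one way to pull them back out. This is only a sketch: InlineXmlEntities and extract are made-up names, not CoreNLP classes, and it assumes the tags are upper-case, never nested, and that the input text itself contains nothing that looks like such a tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CoreNLP): reads entities back out of inline-XML output.
class InlineXmlEntities {

    // Matches <TAG>text</TAG>; assumes upper-case tag names and no nesting.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns one {entityClass, entityText} pair per tagged span, in order of appearance. */
    static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2) });
        }
        return entities;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String[] entity : extract(tagged)) {
            System.out.println(entity[0] + "\t" + entity[1]);
        }
    }
}
```

For the sample string this yields a PERSON pair for "Bill Smith" and a LOCATION pair for "Paris". If you already have the classifier object at hand, classifyToCharacterOffsets above is probably the more direct route.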

Answer 2

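No, CoreNLP 3.5.0 has no utility for merging NER tags. The next release (due sometime next week) has a new MentionsAnnotator that handles this merging for you. For now, you can either (a) use the CoreNLP master branch, where the MentionsAnnotator is available, or (b) merge the tags manually.

Use the -outputFormat xml option to get CoreNLP to output XML. (Is this what you want?)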
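For option (b), a manual merge could look roughly like the sketch below. It does not touch the CoreNLP API at all: it assumes you have already copied each token's word and NER tag into two parallel lists, and NerTagMerger is a made-up helper name. Tokens tagged "O" are treated as non-entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): merges adjacent tokens that share an NER tag.
class NerTagMerger {

    /** Returns "TAG: joined words" for each run of adjacent tokens with the same non-"O" tag. */
    static List<String> merge(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed: flush the run collected so far if it was an entity.
                if (!currentTag.equals("O")) {
                    entities.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Flush a run that reaches the end of the input.
        if (!currentTag.equals("O")) {
            entities.add(currentTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("The", "New", "York", "Times", "is", "based", "in", "New", "York");
        List<String> tags = Arrays.asList("O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION",
                "O", "O", "O", "LOCATION", "LOCATION");
        System.out.println(merge(words, tags));
    }
}
```

One limitation of this approach: two different entities of the same class that sit directly next to each other end up merged into one, because the plain per-token tags carry no boundary information.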

Answer 3

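You can set any property in the properties file, including the "outputFormat" property. Stanford CoreNLP supports several different formats, such as json, xml and text. However, the xml option is not the inlineXML format. The xml format gives a per-token annotation for NER: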

    <tokens> 
      <token id="1"> 
        <word>New</word> 
        <lemma>New</lemma> 
        <CharacterOffsetBegin>0</CharacterOffsetBegin> 
        <CharacterOffsetEnd>3</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="2"> 
        <word>York</word> 
        <lemma>York</lemma> 
        <CharacterOffsetBegin>4</CharacterOffsetBegin> 
        <CharacterOffsetEnd>8</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="3"> 
        <word>Times</word> 
        <lemma>Times</lemma> 
        <CharacterOffsetBegin>9</CharacterOffsetBegin> 
        <CharacterOffsetEnd>14</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
    </tokens>
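If you are stuck with this per-token XML, you can rebuild whole entities from it yourself. The sketch below is one possible way, using only the JDK's DOM parser: it walks the token elements, groups adjacent tokens that share an NER tag, and slices the original text with CharacterOffsetBegin and CharacterOffsetEnd. TokenXmlEntities and its methods are made-up names, and the code assumes every token element carries the NER and offset children shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of CoreNLP): rebuilds entities from per-token XML output.
class TokenXmlEntities {

    private static String child(Element token, String name) {
        return token.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    /** Returns "TAG: text" for each run of adjacent tokens with the same non-"O" NER tag. */
    static List<String> entities(String xml, String text) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse token XML", e);
        }
        NodeList tokens = doc.getElementsByTagName("token");
        List<String> result = new ArrayList<>();
        String runTag = "O";
        int runBegin = 0;
        int runEnd = 0;
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String tag = child(token, "NER");
            int begin = Integer.parseInt(child(token, "CharacterOffsetBegin"));
            int end = Integer.parseInt(child(token, "CharacterOffsetEnd"));
            if (tag.equals(runTag)) {
                runEnd = end; // same tag as the previous token: extend the current run
            } else {
                if (!runTag.equals("O")) {
                    result.add(runTag + ": " + text.substring(runBegin, runEnd));
                }
                runTag = tag;
                runBegin = begin;
                runEnd = end;
            }
        }
        if (!runTag.equals("O")) {
            result.add(runTag + ": " + text.substring(runBegin, runEnd));
        }
        return result;
    }

    // Builds a trimmed-down token element for the demo; real output has more children.
    private static String token(String word, int begin, int end, String ner) {
        return "<token><word>" + word + "</word>"
                + "<CharacterOffsetBegin>" + begin + "</CharacterOffsetBegin>"
                + "<CharacterOffsetEnd>" + end + "</CharacterOffsetEnd>"
                + "<NER>" + ner + "</NER></token>";
    }

    public static void main(String[] args) {
        String text = "New York Times";
        String xml = "<tokens>"
                + token("New", 0, 3, "ORGANIZATION")
                + token("York", 4, 8, "ORGANIZATION")
                + token("Times", 9, 14, "ORGANIZATION")
                + "</tokens>";
        System.out.println(entities(xml, text));
    }
}
```

Slicing the original text by offsets keeps the original spacing between tokens, which joining the word elements with single spaces would not guarantee. Note that this sketch does not reset at sentence boundaries, so a run could in principle span two sentences.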

Answer 4

From Stanford CoreNLP 3.6 onwards, you can use the entitymentions annotator in the pipeline and get a list of all entities. I have shown an example here. It works.

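import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.MentionsAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// entitymentions must come after ner (and regexner, if used) in the annotator list
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
props.setProperty("regexner.mapping", "jg-regexner.txt"); // custom RegexNER mapping file
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);

pipeline.annotate(annotation);

// Each mention is a CoreMap covering one whole (possibly multi-word) entity
List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
    String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
    System.out.println(multiWord + " : " + custNERClass);
}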
