How do I get the original token positions in the string from a Stanza constituency parse tree?

Question

I am using Stanza to extract noun phrases (NPs) from text. The code below extracts the NPs and stores them by depth.

import stanza
from collections import defaultdict

nlp = stanza.Pipeline('en', tokenize_pretokenized=True)
sentence_tokens = ['This', 'is', 'a', 'sentence', '.']
# Pretokenized input is a list of sentences, each a list of tokens
doc = nlp([sentence_tokens])

def extract_NPs(tree, np_dict):
    # Recursively collect every NP subtree, keyed by its depth
    for child in tree.children:
        if child.label == 'NP':
            np_dict[child.depth()].append(child)
        extract_NPs(child, np_dict)
    return np_dict

for sent in doc.sentences:
    tree = sent.constituency
    nps = extract_NPs(tree, np_dict=defaultdict(list))

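The output is a dictionary keyed by depth, where each value is the list of NP trees with that depth. Each NP is a Tree object, as described in the Stanza GitHub repository.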

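I have combed through the code and documentation and cannot find a way to map the text of an NP back to its position in the original input sentence. Simply looking up each token's index in sentence_tokens does not work for me, because many of my sentences contain repeated tokens.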
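To illustrate the problem with repeated tokens, here is a minimal sketch in plain Python (no Stanza involved; the example sentence is made up for illustration). `list.index` always returns the first match, so the second occurrence of a token can never be located this way.

```python
# Hypothetical example sentence with a repeated token
tokens = ['the', 'cat', 'saw', 'the', 'dog']

# list.index returns only the first occurrence
first = tokens.index('the')

# The token actually occurs at two positions
positions = [i for i, t in enumerate(tokens) if t == 'the']

print(first)      # 0
print(positions)  # [0, 3]
```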

Any ideas?

Answer 1

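Before processing the tree object, you can use replace_words() to replace each word in the constituency parse with that word's id (the replacement words must still be strings):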

tree = tree.replace_words(map(str, range(len(sentence_tokens))))

You can then use leaf_labels() to recover the word ids for a given NP tree. For example, calling leaf_labels() on the root will now return:

['0', '1', '2', '3', '4']
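Putting the two steps together, the following is a rough sketch of how the ids could be turned into token spans for each NP. So that it runs without downloading any models, it uses a small stand-in class, ToyTree, which is hypothetical and only mimics the label, children, leaf_labels() and depth() members used above; the hand-built parse of the example sentence is likewise an assumption, not real parser output. The helper np_spans is also a made-up name. With a real Stanza tree you would call it on the result of replace_words().

```python
from collections import defaultdict


class ToyTree:
    """Hypothetical stand-in for a Stanza parse tree (illustration only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def leaf_labels(self):
        # Leaves carry the word (here: the word id as a string)
        if not self.children:
            return [self.label]
        return [w for c in self.children for w in c.leaf_labels()]

    def depth(self):
        if not self.children:
            return 0
        return 1 + max(c.depth() for c in self.children)


def np_spans(tree, spans=None):
    """Collect (start, end) token spans, end exclusive, for every NP, keyed by depth."""
    if spans is None:
        spans = defaultdict(list)
    for child in tree.children:
        if child.label == 'NP':
            ids = [int(i) for i in child.leaf_labels()]
            spans[child.depth()].append((ids[0], ids[-1] + 1))
        np_spans(child, spans)
    return spans


sentence_tokens = ['This', 'is', 'a', 'sentence', '.']

# Assumed parse of the sentence, with words already replaced by their ids
T = ToyTree
root = T('ROOT', [T('S', [
    T('NP', [T('DT', [T('0')])]),
    T('VP', [T('VBZ', [T('1')]),
             T('NP', [T('DT', [T('2')]), T('NN', [T('3')])])]),
    T('.', [T('4')]),
])])

spans = np_spans(root)
for depth, ranges in spans.items():
    for start, end in ranges:
        print(depth, (start, end), sentence_tokens[start:end])
```

Because the leaves of a constituency subtree are contiguous in the sentence, the first and last id are enough to describe the span, and slicing sentence_tokens with them recovers the NP text even when tokens repeat.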
