How can I enhance the morphological information for English models in spaCy?

Question

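I am trying to detect verbs in the imperative mood using the English models in spaCy, but the morphological features I get are inconsistent with the examples in the Morphology documentation. This is similar to the still-unanswered question "Extracting English imperative mood from verb tags with spaCy". Specifically, very few Mood features seem to be identified.

I am not sure whether I am missing some configuration or whether I need to train the model in some way so that it identifies morphological features better. Before I start training, I would like to understand why what I am doing does not match the documentation.

I have written a small example to demonstrate the discrepancy.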

'''
Prerequisites

pip install spacy
python -m spacy download en_core_web_lg
'''
import spacy

nlp = spacy.load("en_core_web_lg")

def show_morph_as_markdown_table(doc):
    print("|Context|Token|Lemma|POS|TAG|MORPH|")
    print("|----|----|----|----|----|----|")
    for token in doc:
        print(f'|{doc}|{token.text}|{token.lemma_}|{token.pos_}|{token.tag_}|{token.morph.to_dict()}|')

def show_morph_for_sentences_as_markdown_table(sentences):
    sentence_docs = list(nlp.pipe(sentences))
    for sentence_doc in sentence_docs:
        show_morph_as_markdown_table(sentence_doc)

example_sentences = [
    "I was reading the paper",
    "I don’t watch the news, I read the paper",
    "I read the paper yesterday"
]

show_morph_for_sentences_as_markdown_table(example_sentences)

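I have trimmed the output to include only the rows shown in the Morphology documentation.

Context	Token	Lemma	POS	TAG	MORPH
I was reading the paper	reading	read	VERB	VBG	{'Aspect': 'Prog', 'Tense': 'Pres', 'VerbForm': 'Part'}
I don’t watch the news, I read the paper	read	read	VERB	VBD	{'Tense': 'Past', 'VerbForm': 'Fin'}
I read the paper yesterday	read	read	VERB	VBP	{'Tense': 'Pres', 'VerbForm': 'Fin'}

This is quite different from the expected output:

Context	Token	Lemma	POS	TAG	MORPH
I was reading the paper	reading	read	VERB	VBG	{'VerbForm': 'Ger'}
I don’t watch the news, I read the paper	read	read	VERB	VBD	{'VerbForm': 'Fin', 'Mood': 'Ind', 'Tense': 'Pres'}
I read the paper yesterday	read	read	VERB	VBP	{'VerbForm': 'Fin', 'Mood': 'Ind', 'Tense': 'Past'}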
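
As a rough illustration of where the two tables diverge, the feature dictionaries can be diffed directly. The sketch below is plain Python and does not call spaCy; `morph_diff` is a hypothetical helper name, and the two dictionaries are copied from the second row of each table above.

```python
def morph_diff(actual, expected):
    """Summarise how two morphological feature dicts differ."""
    shared = set(actual) & set(expected)
    return {
        # features the documentation shows but the model did not produce
        "missing": sorted(k for k in expected if k not in actual),
        # features the model produced that the documentation does not show
        "extra": sorted(k for k in actual if k not in expected),
        # features present in both but with different values
        "changed": sorted(k for k in shared if actual[k] != expected[k]),
    }


# Second row of each table ("I don’t watch the news, I read the paper")
actual = {"Tense": "Past", "VerbForm": "Fin"}
expected = {"VerbForm": "Fin", "Mood": "Ind", "Tense": "Pres"}

print(morph_diff(actual, expected))
# {'missing': ['Mood'], 'extra': [], 'changed': ['Tense']}
```

For this row the Mood feature is absent altogether, which is the gap that matters for detecting imperatives.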

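I tried adding a morphologizer to the pipeline using DEFAULT_MORPH_MODEL, but ran into initialization errors. I do not yet understand the pipeline well enough to see why.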

from spacy.pipeline.morphologizer import DEFAULT_MORPH_MODEL

config = {"model": DEFAULT_MORPH_MODEL}
nlp.add_pipe("morphologizer", config=config)

# ValueError: [E109] Component 'morphologizer' could not be run. Did you forget to call `initialize()`?

# trying to fix above error with the following
nlp.initialize()

# [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.

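Further research shows that spaCy version 3 uses an AttributeRuler to manage the tag_map and morph_rules. Is it possible that the downloadable models do not contain the same information that the documentation uses?

I am hoping there is a simple configuration fix that I am missing, or a pointer to the right rabbit hole (I have been down plenty of wrong ones).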

Answer 1

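The table in the documentation is only a generic example of the kinds of annotation you might see. The exact annotation can differ for each individual model, and for each individual release/version of a model.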

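You won't have much luck detecting imperatives with the en_core_web_* models, because the training data does not distinguish imperatives from other forms. The rules that handle the tag set conversion are largely based on this table (note that no PTB tag has Mood=Imp):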

https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

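However, it looks like some UD English corpora do include Mood=Imp or use fine-grained tags that distinguish imperatives. As a first step, you could test a pretrained UD English EWT model with a tool such as Stanza or Trankit to see whether it works well enough for your task. This can be a difficult distinction to make, so I don't know how good the overall performance is.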

If you would like to keep using spaCy, you can use spacy-stanza with the default "en" model, which is trained on UD English EWT.
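
A minimal sketch of that last suggestion, assuming the `stanza` and `spacy-stanza` packages are installed. The helper names `load_stanza_pipeline` and `imperative_tokens` are hypothetical and only for illustration; whether a given token actually receives Mood=Imp depends on the Stanza model, so no output is shown.

```python
def load_stanza_pipeline(lang="en"):
    """Load a Stanza model wrapped as a spaCy pipeline (needs spacy-stanza)."""
    # Imported lazily so that the helper below works without these packages.
    import stanza
    import spacy_stanza

    stanza.download(lang)  # fetches the default model for the language
    return spacy_stanza.load_pipeline(lang)


def imperative_tokens(doc):
    """Return the text of every token whose morphology includes Mood=Imp."""
    # token.morph.get("Mood") returns a list of values (empty if unset)
    return [token.text for token in doc if "Imp" in token.morph.get("Mood")]


# Example usage (downloads the Stanza English model on first run):
# nlp = load_stanza_pipeline()
# print(imperative_tokens(nlp("Read the paper.")))
```

Keeping the filter separate from the loader means the same check can be run against any spaCy `Doc`, which makes it easy to compare the Stanza-backed pipeline with en_core_web_lg on the same sentences.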
