Adding lemmas and the concept of normalization/lemmatization for new words in spaCy

Question Votes: 0 Answers: 1

Following the example in the documentation on tokenization, I have the following code:

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

doc = nlp("gimme that. he gave me that. Going to someplace.")

Then I check the tokenization:

doc[0].norm_  # 'give'  (as expected)

But the lemmatizer does not return the same output:

lemmatizer = nlp.get_pipe("lemmatizer")
lemmatizer.lemmatize(doc[0])  # ['gim']  (expected ['give'])

On the other hand:

lemmatizer.lemmatize(doc[5]) # ['give']
lemmatizer.lemmatize(doc[9]) # ['go']

What am I doing wrong? How do I fix it? What is the difference between a normalized token and a lemmatized token in spaCy? And how can I "teach" the lemmatization of a single token (like the `gim` token in the example)?

nlp spacy lemmatization
1 Answer
0 votes
In your code you've customized the tokenizer to handle the special case "gimme" and normalize its first piece to "give". However, the rule-based lemmatizer works from the token's text (`gim`), not its norm, so the special case has no effect on lemmatization.
    Here's how you can achieve consistent lemmatization results with your custom normalization:
    
    import spacy
    from spacy.language import Language
    from spacy.symbols import ORTH, NORM
    
    nlp = spacy.load("en_core_web_sm")
    special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
    nlp.tokenizer.add_special_case("gimme", special_case)
    
    # Define a custom component that overrides lemmas after the lemmatizer runs.
    # In spaCy v3, components must be registered with @Language.component.
    @Language.component("custom_lemmatizer")
    def custom_lemmatizer(doc):
        for token in doc:
            if token.norm_ == "give":
                token.lemma_ = "give"
            # Add more custom rules for other words if needed
        return doc
    
    # Add the custom component to the pipeline by its registered name
    nlp.add_pipe("custom_lemmatizer", after="lemmatizer")
    
    doc = nlp("gimme that. he gave me that. Going to someplace.")
    print(doc[0].lemma_)  # 'give' (as expected)
    print(doc[5].lemma_)  # 'give' (as expected)
    print(doc[9].lemma_)  # 'go' (as expected)
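As a side note, spaCy v3 also ships a built-in `attribute_ruler` pipe that can map a surface form directly to a lemma, without writing a custom component. A minimal sketch on a blank English pipeline (with a trained model like `en_core_web_sm` you would fetch the existing ruler via `nlp.get_pipe("attribute_ruler")` instead of adding a new one):

```python
import spacy
from spacy.symbols import ORTH, NORM

# Blank pipeline so no trained model download is needed for this sketch
nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}])

# AttributeRuler matches token patterns and sets attributes on the match
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "gim"}]], attrs={"LEMMA": "give"})

doc = nlp("gimme that")
print(doc[0].lemma_)  # 'give'
```

This keeps the lemma exception declarative, alongside the tokenizer special case, instead of hiding it in pipeline code.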