How do I get spaCy to stop splitting hyphenated numbers and words into separate tokens?


Thanks for stopping by. I'm using spaCy to perform named entity recognition on a block of text, and I've run into a particular problem I can't seem to get past. Here is some sample code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")

doc = nlp('The Indo-European Caucus won the all-male election 58-32.')

Which results in this:

['The', 'Indo', '-', 'European', 'Caucus', 'won', 'the', 'all', '-', 'male', 'election', '58', '-', '32', '.']
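(For context, these splits come from spaCy's default infix patterns. If you want to see which rules are responsible, a minimal sketch like the following prints the hyphen- and digit-related patterns; the exact pattern strings vary between spaCy versions:)

import spacy

nlp = spacy.load("en_core_web_sm")
# Show the default infix patterns that mention hyphens or digits; these are
# the rules that break 'Indo-European' and '58-32' into separate tokens.
for pattern in nlp.Defaults.infixes:
    if '-' in pattern or '0-9' in pattern:
        print(pattern)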

My problem is that I need those hyphenated words and numbers to come through as single tokens. I followed the example given in this answer, using the following code:

from spacy.util import compile_infix_regex

inf = list(nlp.Defaults.infixes)
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x] # remove the hyphen-between-letters pattern from infix patterns
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)

That helps with the alphabetic characters, and I get this:

['The', 'Indo-European', 'Caucus', 'won', 'the', 'all-male', 'election', '58', '-', '32', '.']

That's much better, but '58-32' is still being split into separate tokens. I tried this answer and got the opposite effect:

['The', 'Indo', '-', 'European', 'Caucus', 'won', 'the', 'all', '-', 'male', 'election', '58-32', '.']
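(For reference, that second approach presumably boils down to dropping the digit-operator infix rule while leaving the letter-hyphen rule alone. A minimal sketch of the idea, assuming the pattern string cited in the answer below; the code in the linked answer may differ:)

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
inf = list(nlp.Defaults.infixes)
# Drop the rule that splits on +, -, *, ^ between digits, so '58-32' stays
# whole, but keep the letter-hyphen rule, so 'Indo-European' is still split.
inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")
nlp.tokenizer.infix_finditer = compile_infix_regex(tuple(inf)).finditer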

How can I change the tokenizer so that I get the correct result in both cases?

python regex tokenize spacy
1 Answer

You can combine the two solutions:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

def custom_tokenizer(nlp):
    inf = list(nlp.Defaults.infixes)               # Default infixes
    inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")    # Remove the generic op between numbers or between a number and a -
    inf = tuple(inf)                               # Convert inf to tuple
    infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"])  # Add the removed rule after subtracting (?<=[0-9])-(?=[0-9]) pattern
    infixes = [x for x in infixes if '-|–|—|--|---|——|~' not in x] # Remove - between letters rule
    infix_re = compile_infix_regex(infixes)

    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp('The Indo-European Caucus won the all-male election 58-32.')
print([token.text for token in doc]) 

Output:

['The', 'Indo-European', 'Caucus', 'won', 'the', 'all-male', 'election', '58-32', '.']
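(As a quick follow-up check, my own sketch rather than part of the original answer: with the custom tokenizer in place, the merged tokens flow through to the rest of the pipeline, which you can confirm by inspecting what the NER component finds; the labels you get depend on the en_core_web_sm model:)

doc = nlp('The Indo-European Caucus won the all-male election 58-32.')
# Hyphenated words and numbers now reach the NER component as single tokens.
print([(token.text, token.ent_type_) for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])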