Converting spaCy token vectors to text

Question (votes: 0, answers: 3)

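I am using spaCy to create sentence vectors. If the sentence is "I am working", it gives me a vector of shape (3, 300). Is there a way to get the text of the sentence back from these vectors?

Thanks in advance, Harathi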

python vector text nlp spacy
3 Answers
3
votes

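Actually, you can get the strings directly from the doc object with the .orth_ attribute, which returns the string representation of a token rather than a spaCy Token object.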

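    import en_core_web_sm

    nlp = en_core_web_sm.load()
    # nlp.tokenizer runs only the tokenizer; nlp.Defaults.create_tokenizer(nlp)
    # did the same job in spaCy v2 but is not available in v3
    tokenizer = nlp.tokenizer
    text = 'I am working'
    tokens = [token.orth_ for token in tokenizer(text)]
    print(tokens)
    # ['I', 'am', 'working']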

1
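vote

There is no way to translate from vector → word. You can, however, build a second sequence that maps the token sequence to a sequence of integers, each giving the id of the token in the spaCy model's vocabulary.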

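    import spacy

    nlp = spacy.load('en_core_web_sm')  # assumption: any English pipeline works here
    sentence = 'I am working'
    document = nlp(sentence)
    # token.orth is the integer id (hash) of the token's verbatim text
    id_sequence = [token.orth for token in document]
    # look each id up in the vocabulary to recover the string
    text = [nlp.vocab[orth_id].text for orth_id in id_sequence]
    print(text)
    # ['I', 'am', 'working']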
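
The id route works because the lookup is exact, whereas several strings can share one vector, so a vector on its own does not identify a word. Here is a minimal sketch of that difference, using a made-up three-dimensional table rather than spaCy's real vectors. The names `TOY_VECTORS`, `encode`, `decode`, and `words_with_vector` are hypothetical, and a simple counter stands in for spaCy's hash ids.

```python
# Hypothetical toy data: the two case variants of "am" share one vector.
TOY_VECTORS = {
    "I": (1.0, 0.0, 0.0),
    "am": (0.0, 1.0, 0.0),
    "AM": (0.0, 1.0, 0.0),
    "working": (0.0, 0.0, 1.0),
}

# Give every string its own integer id (a counter stands in for a hash).
STRING_TO_ID = {word: i for i, word in enumerate(TOY_VECTORS)}
ID_TO_STRING = {i: word for word, i in STRING_TO_ID.items()}


def encode(tokens):
    """Return the parallel id and vector sequences for a list of tokens."""
    ids = [STRING_TO_ID[t] for t in tokens]
    vectors = [TOY_VECTORS[t] for t in tokens]
    return ids, vectors


def decode(ids):
    """Recover the exact tokens from their ids."""
    return [ID_TO_STRING[i] for i in ids]


def words_with_vector(vector):
    """List every string in the table whose vector equals the given one."""
    return [w for w, v in TOY_VECTORS.items() if v == vector]


ids, vectors = encode(["I", "AM", "working"])
print(decode(ids))                    # exact round trip through the ids
print(words_with_vector(vectors[1]))  # ambiguous: more than one candidate
```

Keeping the id sequence next to the vector matrix costs one integer per token and makes the text recoverable without any similarity search.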

0
votes

Have you tried looking up the "most similar" words?

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_lg")
    doc1 = nlp("I am working")
    # look up the most similar words in the vocabulary
    keys, best_rows, scores = nlp.vocab.vectors.most_similar(
        np.asarray([
            doc1.vector,  # a single query row, so the input has shape (1, 300)
            ]),
        n=20
        )
    # keys, best_rows and scores each have shape (1, n)
    for key, best_row, score in zip(keys[0, :], best_rows[0, :], scores[0, :]):
        print(f'text: {nlp.vocab[key].text}, score: {score}')  # key: {key}

This returns the following:

text: Am, score: 0.8314999938011169
text: aM, score: 0.8314999938011169
text: am, score: 0.8314999938011169
text: AM, score: 0.8314999938011169
text: I, score: 0.8113999962806702
text: i, score: 0.8113999962806702
text: İ, score: 0.8113999962806702
text: 'M, score: 0.7860000133514404
text: 'm, score: 0.7860000133514404
text: MYSELF, score: 0.7333999872207642
text: Myself, score: 0.7333999872207642
text: myself, score: 0.7333999872207642
text: WORKING, score: 0.7249000072479248
text: WOrking, score: 0.7249000072479248
text: working, score: 0.7249000072479248
text: Working, score: 0.7249000072479248
text: knOw, score: 0.7063999772071838
text: know, score: 0.7063999772071838
text: Know, score: 0.7063999772071838
text: KNow, score: 0.7063999772071838
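
One reason the list mixes neighbours of "I", "am", and "working" is that `doc1.vector` is the average of the token vectors, so token boundaries and word order are gone before the lookup happens. Querying each row of the per-token matrix separately keeps one candidate per position. The sketch below illustrates the idea in plain Python with a made-up three-dimensional table. `TOY_VECTORS`, `cosine`, `nearest`, and `mean` are illustrative names, not spaCy API, and the numbers are invented for the example.

```python
import math

# Hypothetical toy embedding table (3 dimensions instead of 300).
TOY_VECTORS = {
    "I": (0.9, 0.1, 0.0),
    "am": (0.1, 0.9, 0.1),
    "working": (0.0, 0.2, 0.9),
    "know": (0.5, 0.5, 0.5),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector):
    """Return the table entry whose vector is most similar to the query."""
    return max(TOY_VECTORS, key=lambda w: cosine(TOY_VECTORS[w], vector))


def mean(vectors):
    """Element-wise average of a list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


sentence = ["I", "am", "working"]
token_vectors = [TOY_VECTORS[w] for w in sentence]  # one row per token

per_token = [nearest(v) for v in token_vectors]  # one lookup per row
averaged = nearest(mean(token_vectors))          # one lookup for the whole sentence
print(per_token)
print(averaged)
```

In this toy table the per-row lookups return the original three words, while the averaged query lands on a word that is not in the sentence at all. With spaCy, the analogous call would presumably pass `np.asarray([token.vector for token in doc1])` to `nlp.vocab.vectors.most_similar` with `n=1`; treat that as an untested assumption. Even then the result is only a nearest neighbour, so casing can differ from the input, as the tied scores for "Am", "aM", "am", and "AM" above suggest.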
