Converting spaCy token vectors to text

Question

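I am using spaCy to create sentence vectors. If the sentence is "I am working", it gives a vector of shape (3, 300). Is there a way to get the text of the sentence back from these vectors?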

Thanks in advance, Harathi

Answer 1

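Actually, you can get the strings directly from the tokens of the doc object with the .orth_ attribute, which returns the string representation of a token rather than a spaCy Token object.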

    import en_core_web_sm

    nlp = en_core_web_sm.load()
    # nlp.tokenizer is used here because Defaults.create_tokenizer is not available in spaCy v3
    text = 'I am working'
    tokens = [token.orth_ for token in nlp.tokenizer(text)]
    print(tokens)  # ['I', 'am', 'working']

Answer 2

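There is no way to translate from a vector back to a word. However, you can build a second sequence that maps the token sequence to a sequence of integers, each one being the ID of that token in the spaCy model's vocabulary.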

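    import spacy

    nlp = spacy.load('en_core_web_sm')
    sentence = 'I am working'
    document = nlp(sentence)
    # token.orth is the integer ID of the token's verbatim text
    id_sequence = [token.orth for token in document]
    # look each ID up in the vocabulary to recover the string
    text = [nlp.vocab[i].text for i in id_sequence]
    print(text)  # ['I', 'am', 'working']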

Answer 3

Have you tried looking up the "most similar" words?

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_lg")
    doc1 = nlp("I am working")
    # find the vocabulary entries whose vectors are most similar to the query
    keys, best_rows, scores = nlp.vocab.vectors.most_similar(
        np.asarray([
            doc1.vector,  # a single query row, so the input has shape (1, 300)
            ]),
        n=20
        )
    # keys, best_rows and scores each have shape (1, n)
    for key, best_row, score in zip(keys[0, :], best_rows[0, :], scores[0, :]):
        print(f'text: {nlp.vocab[key].text}, score: {score}')

This returns the following:

text: Am, score: 0.8314999938011169
text: aM, score: 0.8314999938011169
text: am, score: 0.8314999938011169
text: AM, score: 0.8314999938011169
text: I, score: 0.8113999962806702
text: i, score: 0.8113999962806702
text: İ, score: 0.8113999962806702
text: 'M, score: 0.7860000133514404
text: 'm, score: 0.7860000133514404
text: MYSELF, score: 0.7333999872207642
text: Myself, score: 0.7333999872207642
text: myself, score: 0.7333999872207642
text: WORKING, score: 0.7249000072479248
text: WOrking, score: 0.7249000072479248
text: working, score: 0.7249000072479248
text: Working, score: 0.7249000072479248
text: knOw, score: 0.7063999772071838
text: know, score: 0.7063999772071838
text: Know, score: 0.7063999772071838
text: KNow, score: 0.7063999772071838
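Note that the query above uses doc1.vector, which by default is the average of the token vectors, so the matches are words close to the sentence as a whole rather than the three tokens in order. To get each token back, the lookup would have to be run once per row of the (3, 300) matrix. Below is a minimal sketch of that idea in plain Python, so it runs without spaCy or a model download. The 3-dimensional `TOY_VECTORS` table and the helper names `cosine` and `nearest_word` are made up for illustration and are not part of spaCy.

```python
import math

# Hypothetical 3-dimensional vectors standing in for a real 300-dimensional table.
TOY_VECTORS = {
    "I": [1.0, 0.0, 0.0],
    "am": [0.0, 1.0, 0.0],
    "working": [0.0, 0.0, 1.0],
    "work": [0.0, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector, table):
    """Return the key of `table` whose vector is most similar to `vector`."""
    return max(table, key=lambda word: cosine(vector, table[word]))


# One row per token, like the (3, 300) matrix from the question.
sentence_matrix = [TOY_VECTORS[w] for w in ["I", "am", "working"]]

# Looking up each row separately recovers one word per token.
recovered = [nearest_word(row, TOY_VECTORS) for row in sentence_matrix]

# Averaging the rows first leaves a single vector, which can only be
# matched to a single word and no longer carries the token order.
mean_vector = [sum(col) / len(sentence_matrix) for col in zip(*sentence_matrix)]
closest_to_mean = nearest_word(mean_vector, TOY_VECTORS)

print(recovered)
print(closest_to_mean)
```

With a real model the same per-row lookup is only approximate: as the repeated scores in the output above suggest, several vocabulary entries (for example different casings of the same word) can share a vector, so the original casing cannot always be recovered from the vector alone.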
