Indexing into a torch tensor along an axis with variable-length indices

Question

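I am trying to compute word probabilities for a list of tokenized words under a language model, and I need some fancy indexing.

My inputs, illustrated with the toy example below:

word_token_list: n_words x max_tokenization_length (e.g., three words with a maximum tokenization length of 3)
pxhs: n_words x (max_tokenization_length + 1) x |vocabulary| (e.g., three words, four sets of logits for the 3+1 token positions, and a vocabulary of size 1000)
new_word_token_ids: the list of tokens that start a new word (e.g., all tokens that begin with a space character).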

pxhs = torch.rand((3,4,1000))

pad_token_id = tokenizer.pad_token_id
word_token_list = [
    [120, pad_token_id, pad_token_id],
    [131, 132, pad_token_id],
    [140, 141, 142],
]

new_word_token_ids = [0,1,2,3,5]

The desired output is a list of word probabilities of length 3, computed as follows:

word 1: pxhs[0, 0, 120] * pxhs[0, 1, new_word_token_ids].sum()
word 2: pxhs[1, 0, 131] * pxhs[1, 1, 132] * pxhs[1, 2, new_word_token_ids].sum()
word 3: pxhs[2, 0, 140] * pxhs[2, 1, 141] * pxhs[2, 2, 142] * pxhs[2, 3, new_word_token_ids].sum()

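In practice, I want to index by replacing the first pad_token_id with the new-word token ids and using nothing after that (this does not work as an index, it is only an illustration):

actual_idx = [
    [[120], new_word_token_ids, [None], [None]],
    [[131], [132], new_word_token_ids, [None]],
    [[140], [141], [142], new_word_token_ids],
]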

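I wrote a very slow function that does this:

all_word_probs = []
for word_tokens, word_probs in zip(word_token_list, pxhs):
    counter = 0
    p_word = 1
    # multiply the probabilities of the word's tokens up to the first pad token
    while (counter < len(word_tokens) and
            word_tokens[counter] != tokenizer.pad_token_id):
        p_word = p_word * word_probs[counter, word_tokens[counter]]
        counter += 1
    # probability mass of a new word starting at the next position
    new_word_prob = word_probs[counter, new_word_token_ids].sum()
    p_word = p_word * new_word_prob
    all_word_probs.append(p_word)

I need something faster. Thanks in advance for your help!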

Answer 1

To get your desired output, e.g.

word 2: pxhs[1, 0, 131] * pxhs[1, 1, 132] * pxhs[1, 2, new_word_token_ids].sum()

the idea is to split the computation into two parts: the word_token_list probability (pxhs[1, 0, 131] * pxhs[1, 1, 132]) and the new_word_token_ids probability (pxhs[1, 2, new_word_token_ids].sum()).
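Written out in log space, which is what the code in this answer computes, the split looks as follows. The notation is mine and only illustrative: $p_{w,i}(v)$ stands for pxhs[w, i, v], $t_{w,i}$ for the i-th token of word $w$, $L_w$ for the number of non-pad tokens of word $w$, and $N$ for the set new_word_token_ids.

```latex
\log p(\text{word } w)
  = \underbrace{\sum_{i=0}^{L_w - 1} \log p_{w,i}\left(t_{w,i}\right)}_{\text{tokens of the word}}
  + \underbrace{\log \sum_{v \in N} p_{w,L_w}(v)}_{\text{a new word starts next}}
```

For word 2 in the toy example, $L_w = 2$, so the first term covers positions 0 and 1 and the second term is taken at position 2.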

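I assume you have access to a seq_lens tensor of length n_words that stores the index of the first pad token of each sequence (for a sequence without padding, that is its full length). I also assume that all objects are tensors and that the maximum sequence length (max_len) below is not too long (otherwise the for loop becomes the bottleneck).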
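If seq_lens is not already available, it can probably be derived from the padded token tensor. The following is a minimal sketch, assuming right-padding and that the pad id never occurs as a real token; the value 999 for pad_token_id is a placeholder for tokenizer.pad_token_id.

```python
import torch

# Placeholder value for illustration; in practice use tokenizer.pad_token_id.
pad_token_id = 999

word_token_list = torch.tensor([
    [120, pad_token_id, pad_token_id],
    [131, 132, pad_token_id],
    [140, 141, 142],
])

# Count the non-pad tokens in each row. With right-padding, this count equals
# the index of the first pad token (or the row length if there is no padding).
seq_lens = (word_token_list != pad_token_id).sum(dim=1)
print(seq_lens)  # tensor([1, 2, 3])
```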

n = len(word_token_list)
max_len = seq_lens.max()
seq_idxs = torch.arange(n)
log_pxhs = pxhs.log()

log_p_x = torch.zeros(n)

# Step 1:
# Compute the probability of the sequences in `word_token_list`
# (clever indexing could also vectorize this at the expense of intelligibility)
for i in range(max_len):
    tok_idxs = word_token_list[:, i]
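    # zero out padded positions; masked_fill avoids NaN from -inf * 0 when a probability is 0
    log_p_x += log_pxhs[seq_idxs, i, tok_idxs].masked_fill(i >= seq_lens, 0.0)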

# Step 2:
# Compute the probability of the `new_word_token_ids` at the end of the sequence
v = pxhs.shape[2]
v_mask = torch.isin(torch.arange(v), new_word_token_ids, assume_unique=True)
# mask pxhs to only include nonzero values for `new_word_token_ids`
p_new = pxhs * v_mask
# index those probabilities for the end of the sequence
p_new_given_x = p_new[seq_idxs, seq_lens].sum(1)

# Step 3: compute the final log-probability
print(log_p_x + p_new_given_x.log())

I haven't benchmarked anything, but I think this should give a significant speedup.
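The "clever indexing" mentioned in the Step 1 comment could look roughly like the sketch below, which replaces the loop over positions with a single torch.gather call. It is not benchmarked either. It reuses the toy data from the question, with 999 as a placeholder for tokenizer.pad_token_id, and it assumes right-padding and that the pad id is a valid vocabulary index (gather needs in-range indices).

```python
import torch

torch.manual_seed(0)
pad_token_id = 999  # placeholder; normally tokenizer.pad_token_id
pxhs = torch.rand((3, 4, 1000))
word_token_list = torch.tensor([
    [120, pad_token_id, pad_token_id],
    [131, 132, pad_token_id],
    [140, 141, 142],
])
new_word_token_ids = torch.tensor([0, 1, 2, 3, 5])

n, max_len = word_token_list.shape
is_token = word_token_list != pad_token_id   # (n, max_len), True for real tokens
seq_lens = is_token.sum(dim=1)               # (n,), index of the first pad token

# Step 1: pick pxhs[w, i, word_token_list[w, i]] for all words and positions at once.
tok_probs = pxhs[:, :max_len].gather(2, word_token_list.unsqueeze(-1)).squeeze(-1)
# Padded positions should contribute a factor of 1, i.e. a log-probability of 0.
log_p_x = tok_probs.log().masked_fill(~is_token, 0.0).sum(dim=1)

# Step 2: mass of the new-word tokens at the position right after the last token.
p_new_given_x = pxhs[torch.arange(n), seq_lens][:, new_word_token_ids].sum(dim=1)

# Step 3: final log-probability per word.
log_p_word = log_p_x + p_new_given_x.log()
print(log_p_word.exp())
```

If the pad id lies outside the vocabulary range, one option is to replace it with any valid index (for example via masked_fill on word_token_list) before the gather, since those positions are masked out afterwards anyway.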
