How does model.resize_token_embeddings() build the embeddings for tokens newly added to the tokenizer?

Question

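I am new to natural language processing and am currently using Hugging Face's ALMA-7B model for machine translation. I want to build a custom tokenizer from the tokens in my Word2Vec embeddings, and I also have the corresponding embeddings (weights). I add the tokens to the tokenizer with the following code:

alma_tokenizer.add_tokens(word_chunks)

where alma_tokenizer is the tokenizer of the ALMA-7B model and word_chunks is the list of words I want to add. I also want to update the model with the corresponding word embeddings, and I was advised to use the resize_token_embeddings() method of AutoModelForCausalLM. When I call it, it does create new embeddings for the tokens I added, and I have confirmed this. My question is how these embeddings are created. Are they random (they are not zero tensors)? Can I insert my own embeddings in place of the ones it creates?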

Any help would be greatly appreciated!

embeddings=model.resize_token_embeddings(len(tokenizer))

Answer 1

transformers.modeling_utils.PreTrainedModel.resize_token_embeddings (https://github.com/huggingface/transformers/blob/38611086d293ea4a5809bcd7fadd8081d55cb74e/src/transformers/modeling_utils.py#L1855C14-L1855C27) eventually calls _get_resized_embeddings, and the model's _init_weights method is used to initialize the new embeddings. The line

new_embeddings.weight.data[:n, :] = old_embeddings.weight.data[:n, :]

then ensures that the embeddings of the existing tokens stay unchanged.
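The two steps described above (random initialization, then copying the old rows back) can be sketched with plain PyTorch. This is a simplified stand-in, not the library's actual implementation: the helper name resize_embedding and the default std of 0.02 are assumptions made for illustration.

```python
import torch
from torch import nn


def resize_embedding(old: nn.Embedding, new_num_tokens: int, std: float = 0.02) -> nn.Embedding:
    """Return a new embedding with new_num_tokens rows that keeps the old rows."""
    new = nn.Embedding(new_num_tokens, old.embedding_dim)
    # Randomly initialize every row first (stand-in for model._init_weights).
    new.weight.data.normal_(mean=0.0, std=std)
    # Copy back the rows that already existed so known tokens keep their vectors.
    n = min(old.num_embeddings, new_num_tokens)
    new.weight.data[:n, :] = old.weight.data[:n, :]
    return new


torch.manual_seed(0)
old_embeddings = nn.Embedding(10, 4)                    # toy vocabulary of 10 tokens
new_embeddings = resize_embedding(old_embeddings, 13)   # 3 tokens added
```

Only the three appended rows hold fresh random values here; the first ten rows are identical to the original matrix.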

As far as I know, ALMA has the same architecture as Llama. Below is the _init_weights function in transformers.models.llama.modeling_llama:

    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

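For ALMA, in the transformers version linked above, the new token embeddings are therefore initialized from a normal distribution with mean 0 and standard deviation std (config.initializer_range in the model config).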
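Because initializer_range is passed as std, it is a standard deviation, and the variance of the new rows is its square. A quick check of that relationship, assuming an initializer_range of 0.02 (an illustrative value, not read from the ALMA config):

```python
import torch

torch.manual_seed(0)
std = 0.02  # assumed value of config.initializer_range
rows = torch.empty(1000, 512).normal_(mean=0.0, std=std)

empirical_std = rows.std().item()   # close to std
empirical_var = rows.var().item()   # close to std ** 2
```

The same statistics can be computed on the rows appended to a real model's input embedding matrix to see how the installed transformers version initialized them, since the behavior may differ between versions.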

Of course you can insert your own embeddings.

Method 1: override model._init_weights

def _my_init_weights(self, module):
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        # replace the following line with your own embedding initialization
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()

Method 2: do it manually

my_embedding = nn.Embedding(...)
# initialize the weights here
alma_model.model.embed_tokens = my_embedding

If you do this manually, don't forget to resize lm_head as well. You may also need to update the parameters in model.config (for example vocab_size).
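A minimal sketch of the manual route, using small stand-in modules instead of the real ALMA model so it runs without downloading weights. The helper names extend_embedding and extend_lm_head are hypothetical, the random word2vec_rows tensor is a placeholder for real Word2Vec vectors, and the sketch assumes those vectors already have the model's hidden size (vectors of a different dimension would first need to be projected).

```python
import torch
from torch import nn


def extend_embedding(old: nn.Embedding, new_vectors: torch.Tensor) -> nn.Embedding:
    """Append pretrained vectors below the existing embedding rows."""
    assert new_vectors.shape[1] == old.embedding_dim, "dimension mismatch"
    weight = torch.cat([old.weight.data, new_vectors.to(old.weight.dtype)], dim=0)
    # freeze=False keeps the combined matrix trainable.
    return nn.Embedding.from_pretrained(weight, freeze=False, padding_idx=old.padding_idx)


def extend_lm_head(old: nn.Linear, n_new: int, std: float = 0.02) -> nn.Linear:
    """Grow the output projection by n_new rows so the logits cover the new tokens."""
    new = nn.Linear(old.in_features, old.out_features + n_new, bias=old.bias is not None)
    new.weight.data.normal_(mean=0.0, std=std)
    new.weight.data[: old.out_features] = old.weight.data
    if old.bias is not None:
        new.bias.data.zero_()
        new.bias.data[: old.out_features] = old.bias.data
    return new


torch.manual_seed(0)
hidden, vocab, n_new = 8, 20, 3
embed_tokens = nn.Embedding(vocab, hidden)       # stand-in for alma_model.model.embed_tokens
lm_head = nn.Linear(hidden, vocab, bias=False)   # stand-in for alma_model.lm_head
word2vec_rows = torch.randn(n_new, hidden)       # placeholder for real Word2Vec vectors

new_embed = extend_embedding(embed_tokens, word2vec_rows)
new_head = extend_lm_head(lm_head, n_new)
```

With a real model, the results would be assigned back to alma_model.model.embed_tokens and alma_model.lm_head, and the row order of the custom vectors must match the ids that alma_tokenizer.add_tokens assigned to the new tokens. An alternative that avoids replacing modules is to call resize_token_embeddings first and then overwrite only the appended rows of model.get_input_embeddings().weight.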
