Tokenizer and model objects of a Hugging Face pretrained model have different maximum input lengths

Question · Votes: 0 · Answers: 3

I am using the Hugging Face pretrained model symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli. My task requires running it on fairly long texts, so it is important to know the maximum input length.

The following code loads the pretrained model and its tokenizer:

from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

encoding_model_name = "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli"
encoding_tokenizer = AutoTokenizer.from_pretrained(encoding_model_name)
encoding_model = SentenceTransformer(encoding_model_name)

So when I print information about them:

encoding_tokenizer
encoding_model

I get:

PreTrainedTokenizerFast(name_or_path='symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli', vocab_size=250002, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

As you can see, the model_max_len=512 parameter reported by the tokenizer does not match the max_seq_length=128 parameter reported by the model.

How can I tell which one is correct? Or, if they describe different properties, how can I check the model's maximum input length?

nlp huggingface-transformers huggingface-tokenizers sentence-transformers
3 Answers
4 votes

Since you are loading the model with the SentenceTransformer class, it will truncate your input to 128 tokens, as stated in the documentation (the relevant code is here):

Property max_seq_length
Property to get the maximal input sequence length for the model. Longer inputs will be truncated.
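The mismatch is easy to see if you read the two attributes directly; a minimal sketch, using the model name from the question:

from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

model_name = "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = SentenceTransformer(model_name)

# Truncation limit enforced by the SentenceTransformer wrapper
print(model.max_seq_length)        # 128

# Limit the tokenizer knows about (taken from the base architecture)
print(tokenizer.model_max_length)  # 512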

You can also check this yourself:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")

# Inputs longer than max_seq_length are truncated, so their embeddings become identical
fifty = model.encode(["This " * 50], convert_to_tensor=True)
two_hundred = model.encode(["This " * 200], convert_to_tensor=True)
four_hundred = model.encode(["This " * 400], convert_to_tensor=True)

print(torch.allclose(fifty, two_hundred))
print(torch.allclose(two_hundred, four_hundred))

Output:

False
True

The underlying model (xlm-roberta-base) can handle sequences of up to 512 tokens, but I assume Symanto limited it to 128 because they also used that limit during training (i.e., the embeddings are probably not well suited to sequences longer than 128 tokens).
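If you want to confirm the 512-token capability of the underlying transformer programmatically, you can inspect the Hugging Face model wrapped inside the SentenceTransformer. A sketch: auto_model is the attribute sentence-transformers uses to expose the wrapped model, and note that XLM-R configs actually report 514 positions (512 usable plus a 2-position offset):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")

# Limit applied by the wrapper (where the truncation to 128 happens)
print(model.max_seq_length)

# Architectural limit of the underlying XLM-RoBERTa model
hf_config = model[0].auto_model.config
print(hf_config.max_position_embeddings)  # 514 for XLM-R: 512 usable positions + offset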


0 votes

model_max_length is the maximum length of the position embeddings the model can take. To check this, do the following:

print(model.config)

You will see "max_position_embeddings": 512 among the other configuration values.

How do I check the maximum input length of my model?

When encoding a text sequence, you can pass max_length (the maximum length the model can accept):

tokenizer.encode(txt, max_length=512)
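For example (a small sketch using the tokenizer from the question; truncation=True makes the limit explicit):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")

txt = "This is a sentence. " * 400  # deliberately longer than 512 tokens

ids_full = tokenizer.encode(txt)                                  # full length, with a length warning
ids_cut = tokenizer.encode(txt, truncation=True, max_length=512)  # capped at 512 tokens

print(len(ids_full), len(ids_cut))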


0 votes

An excerpt from the Sentence Transformers documentation:

# Input Sequence Length
# Transformer models like BERT / RoBERTa / DistilBERT etc. the runtime and the memory requirement grows quadratic with the input length. 
# This limits transformers to inputs of certain lengths. A common value for BERT & Co. are 512 word pieces, which corresponds to about 300-400 words (for English). 
# Longer texts than this are truncated to the first x word pieces.

# By default, the provided methods use a limit of 128 word pieces, longer inputs will be truncated. 
# You can get and set the maximal sequence length like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

print("Max Sequence Length:", model.max_seq_length)

# Change the length to 200
model.max_seq_length = 200

print("Max Sequence Length:", model.max_seq_length)

#Note: You cannot increase the length higher than what is maximally supported by the respective transformer model. 
# Also note that if a model was trained on short texts, the representations for long texts might not be that good.

链接 - https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length
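Applied to the model from the question, that looks like this (a sketch; as the note above says, raising the limit beyond the 128 tokens the model was trained with may hurt embedding quality):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")
print("Max Sequence Length:", model.max_seq_length)  # 128 as shipped

# Raise the limit up to the 512 tokens the underlying transformer supports
model.max_seq_length = 512
print("Max Sequence Length:", model.max_seq_length)  # 512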
