# Formatting
```python
block_size = 128  # or any number suitable to your context

def group_texts(examples):
    # Concatenate all 'input_ids' in the batch into one long list
    concatenated_examples = sum(examples["input_ids"], [])
    total_length = len(concatenated_examples)
    # Cut the concatenated ids into sequences of fixed length
    sequences = [
        concatenated_examples[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    result = {
        "input_ids": sequences,
        # Shift the labels left by one token for causal language modeling,
        # filling the final position with the end-of-sequence id
        "labels": [sequence[1:] + [tokenizer.eos_token_id] for sequence in sequences],
    }
    return result

tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,  # or any number suitable to your context
)
```
I don't understand: what do block_size and batch_size refer to here?
batch_size determines how many examples `map` passes to the function in a single call. For example, in your code batch_size=1000 means group_texts receives up to 1000 examples at a time.

block_size determines the fixed length of each output sequence: the concatenated_examples list is cut into consecutive, non-overlapping chunks of block_size tokens (the loop steps by block_size, so this is plain chunking rather than a sliding window).
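To make this concrete, here is a toy run of the same chunking and label-shifting logic (a standalone sketch: block_size is shrunk to 4 and a dummy eos_token_id of 0 stands in for the tokenizer, so it needs no other setup):

```python
# Toy illustration of the chunking in group_texts, not training-scale settings.
block_size = 4
eos_token_id = 0  # stand-in for tokenizer.eos_token_id

batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
concatenated = sum(batch["input_ids"], [])  # [1, 2, ..., 10]
sequences = [
    concatenated[i : i + block_size]
    for i in range(0, len(concatenated), block_size)
]
print(sequences)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
labels = [s[1:] + [eos_token_id] for s in sequences]
print(labels)     # [[2, 3, 4, 0], [6, 7, 8, 0], [10, 0]]
```

Note that the last chunk is shorter than block_size; your snippet keeps it, though many CLM recipes first round total_length down to a multiple of block_size to avoid ragged sequences.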
When I run the map I get:

```
Column 4 named input_ids expected length 1000 but got length 328
```
You may need to ensure that your number of examples is divisible by the batch size: you only have 328 examples, and 328 is not divisible by 1000, so the column lengths no longer match. You can use a smaller batch_size, for example 8 (328 is divisible by 8). Set:
```python
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=8,
)
```
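If the length mismatch persists, another common pattern (the one the `datasets` documentation uses when grouping texts) is to drop the carried-over input columns during the map, since group_texts returns a different number of rows than it receives and any retained column still has the old row count. A sketch, assuming tokenized_dataset still holds the original tokenizer output columns:

```python
# Sketch: removing the old columns lets map accept a changed row count,
# because no leftover column keeps the original number of rows.
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=8,
    remove_columns=tokenized_dataset.column_names,  # e.g. ["input_ids", "attention_mask"]
)
```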