# Formatting
```python
block_size = 128  # or any number suitable to your context

def group_texts(examples):
    # Concatenate all 'input_ids' in the batch into one long list
    concatenated_examples = sum(examples["input_ids"], [])
    total_length = len(concatenated_examples)
    # Cut the concatenated ids into sequences of fixed length
    sequences = [
        concatenated_examples[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    result = {
        "input_ids": sequences,
        # Shift the labels left by one token for causal language modeling,
        # filling the final position with the end-of-sequence id
        "labels": [sequence[1:] + [tokenizer.eos_token_id] for sequence in sequences],
    }
    return result

tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,  # or any number suitable to your context
)
```
I don't understand: what do block_size and batch_size refer to here?
batch_size determines how many examples `map` passes to the function in a single call. For example, in your code batch_size=1000 means group_texts receives up to 1000 examples at a time.

block_size determines the fixed length of each output sequence: the concatenated_examples list is cut into consecutive, non-overlapping chunks of block_size tokens (the loop steps by block_size, so this is plain chunking rather than a sliding window).
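To make this concrete, here is a toy run of the same chunking and label-shifting logic (a standalone sketch: block_size is shrunk to 4 and a dummy eos_token_id of 0 stands in for the tokenizer, so it needs no other setup):

```python
# Toy illustration of the chunking in group_texts, not training-scale settings.
block_size = 4
eos_token_id = 0  # stand-in for tokenizer.eos_token_id

batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
concatenated = sum(batch["input_ids"], [])  # [1, 2, ..., 10]
sequences = [
    concatenated[i : i + block_size]
    for i in range(0, len(concatenated), block_size)
]
print(sequences)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
labels = [s[1:] + [eos_token_id] for s in sequences]
print(labels)     # [[2, 3, 4, 0], [6, 7, 8, 0], [10, 0]]
```

Note that the last chunk is shorter than block_size; your snippet keeps it, though many CLM recipes first round total_length down to a multiple of block_size to avoid ragged sequences.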
When I run the map I get:

```
Column 4 named input_ids expected length 1000 but got length 328
```
You may need to ensure that your number of examples is divisible by the batch size: you only have 328 examples, and 328 is not divisible by 1000, so the column lengths no longer match. You can use a smaller batch_size, for example 8 (328 is divisible by 8). Set:
```python
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=8,
)
```
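If the length mismatch persists, another common pattern (the one the `datasets` documentation uses when grouping texts) is to drop the carried-over input columns during the map, since group_texts returns a different number of rows than it receives and any retained column still has the old row count. A sketch, assuming tokenized_dataset still holds the original tokenizer output columns:

```python
# Sketch: removing the old columns lets map accept a changed row count,
# because no leftover column keeps the original number of rows.
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=8,
    remove_columns=tokenized_dataset.column_names,  # e.g. ["input_ids", "attention_mask"]
)
```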