Problem with truncation and/or padding using 'padding=True' 'truncation=True'


I want to train my model from scratch. I have my text in file.txt, which I split 90/10 into train.txt and validation.txt, and I also have merges.txt and vocab.json; everything is Serbian in Latin script. But I have a problem, and I think I have found where it comes from: this line of code causes it

tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, padding=True, max_length=512), batched=True)

The message says the problem is with `padding`, but I really don't know how to fix it. Here are my code and the error that confirms where the problem is:

wandb: Currently logged in as: mynick. Use `wandb login --relogin` to force relogin
    [I 2023-04-29 15:23:40,553] A new study created in memory with name: no-name-f040-4cc1-90e6
    Using pad_token, but it is not set yet.
    pad_token: <pad>
    pad_token_id: 5356
    Downloading and preparing dataset text/default to /home/tea/.cache/huggingface/datasets/text/default-77a17d1be/0.0.0/bd71a82ad27976be3b12b407850...
    Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10686.12it/s]
    Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1755.67it/s]
    Dataset text downloaded and prepared to /home/tea/.cache/huggingface/datasets/text/default-7a17d1be/0.0.0/bd71a82ad27976be3b12b407850. Subsequent calls will reuse this data.
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1184.83it/s]
    Tokenized dataset example:                                                                                                                                    
    text: Zakon o javnim nabavkama (Sl. glasnik RS, br. 912019)
    input_ids: [3809, None, 58, None, 938, None, 963, None, 5, 4226, 8, None, 2412, None, 2093, 7, None, 1843, 8, None, 18, 1534, 1207, 6, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356]
    attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    wandb: Tracking run with wandb version 0.15.0
    wandb: Run data is saved locally in /var/www/html/Documents/finet/wandb/run-2023_43-kbsflswo
    wandb: Run `wandb offline` to turn off syncing.
    wandb: Syncing run resilient-sea-520
    wandb: ⭐️ View project at https://wandb.ai/mynick/huggingface
    wandb: 🚀 View run at https://wandb.ai/mynick/huggingface/runs/kbsflswo
      0%|                                                                              | 0/4896 [00:00<?, ?it/s]
    [W 2023-04-29 15:23:44,669] Trial 0 failed with parameters: {'learning_rate': 4.8088958599499364e-05, 'weight_decay': 5.687828891515473e-06, 'warmup_steps': 115, 'num_train_epochs': 9} because of the following error:
    ValueError("Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).")

Below is the code that does the tokenization. I think the padding is the problem, so please help me! Thank you.

tokenized_dataset = None

vocab_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/vocab.json"
merges_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/merges.txt"
train_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/train.txt"
validation_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/validation.txt"
output_dir = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-model"
tokenizer_directory = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn"
model_directory = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn"


with open(train_path, 'r', encoding='utf-8') as file:
    train_data = file.readlines()

with open(validation_path, 'r', encoding='utf-8') as file:
    validation_data = file.readlines()

# Save the data to the files
with open(train_path, 'w', encoding='utf-8') as file:
    file.writelines(train_data)

with open(validation_path, 'w', encoding='utf-8') as file:
    file.writelines(validation_data)
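
The 90/10 split itself happens outside this script; a minimal sketch of how it could look, assuming a line-level shuffle-and-split (the file names come from above, the shuffling and the seed are my assumptions):

```python
import random

# Hypothetical sketch of the 90/10 split described above (the actual split script is not shown).
with open("file.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

random.seed(42)        # assumed seed, for reproducibility
random.shuffle(lines)  # assumed: shuffle before splitting
cut = int(len(lines) * 0.9)

with open(train_path, "w", encoding="utf-8") as f:
    f.writelines(lines[:cut])
with open(validation_path, "w", encoding="utf-8") as f:
    f.writelines(lines[cut:])
```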

```python
# Adding the following lines to create the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_directory)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '<pad>'})
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids('<pad>')

print("pad_token:", tokenizer.pad_token)
print("pad_token_id:", tokenizer.pad_token_id)

# Adding the following lines to create the model
model = GPT2LMHeadModel.from_pretrained(model_directory)
model.resize_token_embeddings(len(tokenizer))
# Update the model with the new tokenizer
model.config.pad_token_id = tokenizer.pad_token_id
```
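
The `dataset` object used below is not defined in this excerpt; judging from the "Downloading and preparing dataset text/default" lines in the log, it is presumably loaded with the `datasets` text builder, roughly like this (a sketch; the exact call is an assumption):

```python
from datasets import load_dataset

# Presumed loading step (not shown in the excerpt): the log mentions the "text" builder,
# so the train/validation files are most likely loaded along these lines.
dataset = load_dataset(
    "text",
    data_files={"train": train_path, "validation": validation_path},
)
```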

```python
def encode_function(example):
    tokenized_example = tokenizer(example["text"], padding=False, truncation=True, return_tensors='pt', max_length=512)
    return {key: value.squeeze(0) for key, value in tokenized_example.items()}

tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, padding=True, max_length=512), batched=True)

print("Tokenized dataset example:")
for key, value in tokenized_dataset["train"][0].items():
    print(f"{key}: {value}")
```
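
The printed example above already contains the suspicious part: several entries of `input_ids` are `None`, and those alone break tensor creation no matter how padding is configured. A quick check like this (a sketch, not part of the original script) counts how many tokenized rows are affected:

```python
# Sketch: count tokenized training rows that contain None token ids,
# since the example printed above clearly has them.
bad_rows = sum(
    1
    for ids in tokenized_dataset["train"]["input_ids"]
    if any(tok is None for tok in ids)
)
print("rows with None in input_ids:", bad_rows)
```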

```python
class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer, mlm: bool = True, mlm_probability: float = 0.15):
        super().__init__(tokenizer=tokenizer, mlm=mlm, mlm_probability=mlm_probability)

    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        labels = inputs.clone()
        # Sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        special_tokens_mask = [
            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
        ]
        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
        if self.tokenizer._pad_token is not None:
            padding_mask = labels.eq(self.tokenizer.pad_token_id)
            probability_matrix.masked_fill_(padding_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # Compute the loss only on masked tokens

        # 80% of the time, replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.mask_token_id

        # 10% of the time, replace masked input tokens with a random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10%), keep the masked input tokens unchanged
        return inputs, labels

    def __call__(self, examples: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        print("Using pad_token_id in CustomDataCollatorForLanguageModeling:", self.tokenizer.pad_token_id)
        examples = [{"input_ids": example["input_ids"]} for example in examples]
        cleaned_examples = []
        max_length = max([len(example["input_ids"]) for example in examples])

        for example in examples:
            cleaned_example = [token if token is not None else self.tokenizer.pad_token_id for token in example["input_ids"]]
            cleaned_example = cleaned_example + [self.tokenizer.pad_token_id] * (max_length - len(cleaned_example))
            cleaned_examples.append(cleaned_example)

        input_ids = torch.stack(cleaned_examples)
        input_ids, labels = self.mask_tokens(input_ids)
        return {"input_ids": input_ids, "labels": labels}


data_collator = CustomDataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```
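
The training code itself is not shown; based on the log (an Optuna study, a wandb run, and trial parameters such as learning_rate and num_train_epochs), the collator is presumably wired into a `Trainer`, roughly like this sketch (the hyperparameter values here are placeholders, not the searched ones, and the original script most likely calls `trainer.hyperparameter_search()` rather than a plain `trainer.train()`):

```python
from transformers import Trainer, TrainingArguments

# Sketch of the presumed training setup (the original training code is not shown);
# the hyperparameters below are placeholders, not the Optuna-searched values from the log.
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,  # assumed batch size
    num_train_epochs=3,             # assumed epoch count
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
)
trainer.train()
```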
Tags: python, machine-learning, padding, tokenize