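I want to train my model from scratch. My text is in file.txt, which I split 90%/10% into train.txt and validation.txt, and I also have merges.txt and vocab.json. Everything is in Serbian Latin. I have a problem, and I think I have found where it comes from. This line in the code causes it:
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, padding=True, max_length=512), batched=True)
The error message says the problem is with padding, but I really don't know how to solve it. Here are my code and the output with the error that confirms where the problem is: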
wandb: Currently logged in as: mynick. Use `wandb login --relogin` to force relogin
[I 2023-04-29 15:23:40,553] A new study created in memory with name: no-name-f040-4cc1-90e6
Using pad_token, but it is not set yet.
pad_token: <pad>
pad_token_id: 5356
Downloading and preparing dataset text/default to /home/tea/.cache/huggingface/datasets/text/default-77a17d1be/0.0.0/bd71a82ad27976be3b12b407850...
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10686.12it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1755.67it/s]
Dataset text downloaded and prepared to /home/tea/.cache/huggingface/datasets/text/default-7a17d1be/0.0.0/bd71a82ad27976be3b12b407850. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1184.83it/s]
Tokenized dataset example:
text: Zakon o javnim nabavkama (Sl. glasnik RS, br. 912019)
input_ids: [3809, None, 58, None, 938, None, 963, None, 5, 4226, 8, None, 2412, None, 2093, 7, None, 1843, 8, None, 18, 1534, 1207, 6, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356, 5356]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
wandb: Tracking run with wandb version 0.15.0
wandb: Run data is saved locally in /var/www/html/Documents/finet/wandb/run-2023_43-kbsflswo
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run resilient-sea-520
wandb: ⭐️ View project at https://wandb.ai/mynick/huggingface
wandb: 🚀 View run at https://wandb.ai/mynick/huggingface/runs/kbsflswo
0%| 0/4896 [00:00<?, ?it/s]
[W 2023-04-29 15:23:44,669] Trial 0 failed with parameters: {'learning_rate': 4.8088958599499364e-05, 'weight_decay': 5.687828891515473e-06, 'warmup_steps': 115, 'num_train_epochs': 9} because of the following error:
ValueError("Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).").
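One thing that stands out in the tokenized example above is that `input_ids` contains `None` entries, and a list with `None` in it cannot be turned into an integer tensor whatever the padding setting is. A minimal sketch for locating them follows. `find_none_positions` is a hypothetical helper name, and `sample_ids` is a copy of the first 24 ids printed above.

```python
from typing import List, Optional


def find_none_positions(input_ids: List[Optional[int]]) -> List[int]:
    """Return the indices at which a token id is missing (None)."""
    return [i for i, token_id in enumerate(input_ids) if token_id is None]


# First 24 ids from the printed example (the trailing 5356 pad ids are omitted).
sample_ids = [3809, None, 58, None, 938, None, 963, None, 5, 4226, 8, None,
              2412, None, 2093, 7, None, 1843, 8, None, 18, 1534, 1207, 6]

none_positions = find_none_positions(sample_ids)
print(none_positions)
```

There are eight `None` entries and eight spaces in the example text, so the missing ids may correspond to the spaces, although the output alone does not prove that.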
Below is the code that performs the tokenization. I think something is wrong with the padding, so please help me! Thank you.
from typing import Dict, List, Tuple, Union

import torch
from transformers import DataCollatorForLanguageModeling, GPT2LMHeadModel, GPT2Tokenizer

tokenized_dataset = None
vocab_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/vocab.json"
merges_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/merges.txt"
train_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/train.txt"
validation_path = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn/validation.txt"
output_dir = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-model"
tokenizer_directory = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn"
model_directory = "/var/www/html/ainabavka/Documents/finet/finetune/gpt2-zjn"

# Read the training and validation data
with open(train_path, 'r', encoding='utf-8') as file:
    train_data = file.readlines()
with open(validation_path, 'r', encoding='utf-8') as file:
    validation_data = file.readlines()

# Save the data to the files
with open(train_path, 'w', encoding='utf-8') as file:
    file.writelines(train_data)
with open(validation_path, 'w', encoding='utf-8') as file:
    file.writelines(validation_data)
`# I add the following lines to create the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_directory)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '<pad>'})
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids('<pad>')
print("pad_token:", tokenizer.pad_token)
print("pad_token_id:", tokenizer.pad_token_id)

# I add the following lines to create the model
model = GPT2LMHeadModel.from_pretrained(model_directory)
model.resize_token_embeddings(len(tokenizer))

# Update the model with the new tokenizer
model.config.pad_token_id = tokenizer.pad_token_id`
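One possible explanation for `None` ids, offered as an assumption rather than a diagnosis: if a token is absent from vocab.json and the unknown token is absent as well, a dictionary-style lookup that falls back to the unknown token returns `None`. The sketch below imitates such a lookup with a toy vocabulary. `lookup_id`, `missing_tokens`, `toy_vocab`, its ids, and the choice of "Ġ" as the missing token are all hypothetical and are not taken from the real vocab.json.

```python
from typing import Dict, List, Optional


def lookup_id(token: str, vocab: Dict[str, int], unk_token: str) -> Optional[int]:
    """Look up a token id, falling back to the unknown token's id if it exists."""
    return vocab.get(token, vocab.get(unk_token))


def missing_tokens(tokens: List[str], vocab: Dict[str, int], unk_token: str) -> List[str]:
    """Return the tokens for which the lookup yields None."""
    return [t for t in tokens if lookup_id(t, vocab, unk_token) is None]


# Toy vocabulary with made-up ids; it has no "Ġ" entry and no unknown token.
toy_vocab = {"Zakon": 0, "o": 1, "javnim": 2}
tokens = ["Zakon", "Ġ", "o", "Ġ", "javnim"]

ids = [lookup_id(t, toy_vocab, "<|endoftext|>") for t in tokens]
print(ids)
print(missing_tokens(tokens, toy_vocab, "<|endoftext|>"))
```

With the real tokenizer, a similar check could be run over the output of `tokenizer.tokenize(text)` against the contents of vocab.json to see which tokens, if any, have no id.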
`def encode_function(example):
    tokenized_example = tokenizer(example["text"], padding=False, truncation=True, return_tensors='pt', max_length=512)
    return {key: value.squeeze(0) for key, value in tokenized_example.items()}

tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, padding=True, max_length=512), batched=True)
print("Tokenized dataset example:")
for key, value in tokenized_dataset["train"][0].items():
    print(f"{key}: {value}")
`
`class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer, mlm: bool = True, mlm_probability: float = 0.15):
        super().__init__(tokenizer=tokenizer, mlm=mlm, mlm_probability=mlm_probability)

    def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        labels = inputs.clone()
        # Sample a few tokens in each sequence for MLM training (with probability self.mlm_probability)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        special_tokens_mask = [
            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
        ]
        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
        if self.tokenizer._pad_token is not None:
            padding_mask = labels.eq(self.tokenizer.pad_token_id)
            probability_matrix.masked_fill_(padding_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # I compute the loss only on masked tokens

        # 80% of the time, I replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.mask_token_id

        # In 10% of cases, I replace masked input tokens with a random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time), I keep the masked input tokens unchanged
        return inputs, labels

    def __call__(self, examples: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        print("Using pad_token_id in CustomDataCollatorForLanguageModeling:", self.tokenizer.pad_token_id)
        examples = [{"input_ids": example["input_ids"]} for example in examples]
        cleaned_examples = []
        max_length = max([len(example["input_ids"]) for example in examples])
        for example in examples:
            # Replace missing (None) ids with the pad id, then pad to the longest sequence in the batch
            cleaned_example = [token if token is not None else self.tokenizer.pad_token_id for token in example["input_ids"]]
            cleaned_example = cleaned_example + [self.tokenizer.pad_token_id] * (max_length - len(cleaned_example))
            cleaned_examples.append(cleaned_example)
        input_ids = torch.stack(cleaned_examples)
        input_ids, labels = self.mask_tokens(input_ids)
        return {"input_ids": input_ids, "labels": labels}

data_collator = CustomDataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)`
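For causal language modeling (`mlm=False`), the collation step amounts to padding each batch to a common length and leaving the padded positions out of the loss. The sketch below shows that idea in plain Python, without torch, so that it can be run on its own. `pad_batch` is a hypothetical helper, the pad id 5356 is the one printed in the output above, -100 is the ignored label value already used in `mask_tokens`, and the input is assumed to contain no `None` ids.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by the loss, as used in mask_tokens above


def pad_batch(batch: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build matching masks and labels."""
    max_length = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


padded = pad_batch([[3809, 58, 938], [963, 5]], pad_token_id=5356)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```

Because every row in the result has the same length, the three lists could be passed to `torch.tensor(..., dtype=torch.long)` to obtain batched tensors.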