I'm trying to fine-tune a model on two datasets. Following the example on the Hugging Face website, I'm training the model on the Yelp Review dataset, but I also want to train it on the Short Jokes dataset.
I chose these two datasets purely to illustrate that the datasets I want to fine-tune on are completely unrelated.
I've seen the
interleave_datasets
function, but I'm not sure it's exactly what I should be using.
I've tried training on a single dataset:
from datasets import load_dataset
yelp_dataset = load_dataset("yelp_review_full")
jokes_dataset = load_dataset("Fraser/short-jokes")
from transformers import AutoTokenizer, Trainer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = yelp_dataset.map(tokenize_function, batched=True)
small_train_yelp_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_yelp_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
# Where do these go?
small_train_short_jokes_dataset = jokes_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_short_jokes_dataset = jokes_dataset["test"].shuffle(seed=42).select(range(1000))
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")
trainer = Trainer(
model=model,
args=training_args,
train_dataset=small_train_yelp_dataset,
eval_dataset=small_eval_yelp_dataset,
)
trainer.train()
How can I train my model on both datasets at the same time?
from datasets import load_dataset
from datasets import interleave_datasets
from transformers import AutoTokenizer, Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification
yelp_dataset = load_dataset("yelp_review_full")
jokes_dataset = load_dataset("Fraser/short-jokes")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# Select before preprocessing, otherwise you'll waste compute tokenizing data you won't use.
ds1_train = yelp_dataset['train'].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True)
ds2_train = jokes_dataset['train'].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True)
ds1_test = yelp_dataset['test'].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True)
# Use a slice disjoint from ds2_train so eval examples aren't also training examples.
ds2_test = jokes_dataset['train'].shuffle(seed=42).select(range(1000, 2000)).map(tokenize_function, batched=True)
ds_train = interleave_datasets([ds1_train, ds2_train], probabilities=[0.7, 0.3], seed=42)
ds_test = interleave_datasets([ds1_test, ds2_test], probabilities=[0.7, 0.3], seed=42)
training_args = TrainingArguments(output_dir="output_model")
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds_train,
eval_dataset=ds_test,
)
trainer.train()
From the documentation, https://huggingface.co/docs/datasets/en/process#interleave, you just combine the datasets with
interleave_datasets
, e.g.
ds_train = interleave_datasets(
[ds1_train, ds2_train],
probabilities=[0.7, 0.3], seed=42
)
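To see what probabilistic interleaving does conceptually, here is a toy standard-library sketch (not the library's actual implementation): at each step it picks a source according to the given probabilities and emits that source's next example, stopping as soon as any source runs out, which mirors the default stopping_strategy="first_exhausted".

```python
import random

def interleave_sketch(sources, probabilities, seed=42):
    """Toy illustration of probabilistic interleaving: repeatedly pick a
    source index with the given weights and yield its next example; stop
    when any source is exhausted (like stopping_strategy="first_exhausted")."""
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    mixed = []
    while True:
        i = rng.choices(range(len(sources)), weights=probabilities)[0]
        try:
            mixed.append(next(iters[i]))
        except StopIteration:
            return mixed

reviews = [f"review_{n}" for n in range(100)]
jokes = [f"joke_{n}" for n in range(100)]
mixed = interleave_sketch([reviews, jokes], probabilities=[0.7, 0.3])
print(sum(x.startswith("review") for x in mixed) / len(mixed))  # roughly 0.7
```

If you want every example from both datasets to appear instead of stopping at the first exhausted one, the real interleave_datasets also accepts stopping_strategy="all_exhausted".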