How do I train a Hugging Face model on multiple datasets?


I am trying to fine-tune a model on two datasets. Following the example on the Hugging Face website, I am training the model on the Yelp Review dataset, but I also want to train it on the Short Jokes dataset.

These two datasets were chosen purely to illustrate that the datasets I want to fine-tune on are completely unrelated.

I have seen the interleave_datasets function, but I am not sure it is exactly what I should use.
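
To make sure I understand it, here is a minimal sketch of what interleave_datasets seems to do on toy data (the toy columns below are made up purely for illustration):

from datasets import Dataset, interleave_datasets

ds_a = Dataset.from_dict({"text": ["a1", "a2", "a3"], "label": [0, 0, 0]})
ds_b = Dataset.from_dict({"text": ["b1", "b2", "b3"], "label": [1, 1, 1]})

# With no probabilities given, interleave_datasets alternates between the
# datasets one example at a time
mixed = interleave_datasets([ds_a, ds_b])
print(mixed["text"])  # ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']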

Here is what I have tried, training on a single dataset:

from datasets import load_dataset

yelp_dataset = load_dataset("yelp_review_full")
jokes_dataset = load_dataset("Fraser/short-jokes")

from transformers import AutoTokenizer, Trainer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = yelp_dataset.map(tokenize_function, batched=True)

small_train_yelp_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_yelp_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Where do these go?
small_train_short_jokes_dataset = jokes_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_short_jokes_dataset = jokes_dataset["test"].shuffle(seed=42).select(range(1000))

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_yelp_dataset,
    eval_dataset=small_eval_yelp_dataset,
)

trainer.train()

How can I train my model on both datasets at the same time?

python nlp huggingface-transformers huggingface-datasets google-bert
1 Answer

TL;DR

from datasets import load_dataset
from datasets import interleave_datasets
from transformers import AutoTokenizer, Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification

yelp_dataset = load_dataset("yelp_review_full")
jokes_dataset = load_dataset("Fraser/short-jokes")

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Select before preprocessing, otherwise you waste compute tokenizing data you won't use.
ds1_train = yelp_dataset['train'].shuffle(seed=42).select(range(1000)).map(tokenize_function)
ds2_train = jokes_dataset['train'].shuffle(seed=42).select(range(1000)).map(tokenize_function)

ds1_test = yelp_dataset['test'].shuffle(seed=42).select(range(1000)).map(tokenize_function)
# short-jokes appears to ship only a train split, so the eval sample is also drawn
# from train, but from a disjoint slice so it does not overlap with ds2_train
ds2_test = jokes_dataset['train'].shuffle(seed=42).select(range(1000, 2000)).map(tokenize_function)

ds_train = interleave_datasets([ds1_train, ds2_train], probabilities=[0.7, 0.3], seed=42)
ds_test = interleave_datasets([ds1_test, ds2_test], probabilities=[0.7, 0.3], seed=42)

training_args = TrainingArguments(output_dir="output_model")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_test,
)

trainer.train()
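
One caveat, based on my reading of the datasets docs: interleave_datasets expects the input datasets to share the same columns and features. Yelp Review has a label column while the jokes dataset presumably does not, so you may need to add a placeholder label before interleaving (the constant 0 below is an arbitrary choice):

ds2_train = ds2_train.map(lambda x: {"label": 0})
ds2_test = ds2_test.map(lambda x: {"label": 0})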

From the documentation, https://huggingface.co/docs/datasets/en/process#interleave, you simply combine the datasets with interleave_datasets, e.g.

ds_train = interleave_datasets(
  [ds1_train, ds2_train], 
  probabilities=[0.7, 0.3], seed=42
)
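
Note that if the two datasets differ in size, interleave_datasets stops by default as soon as the smaller one runs out of examples (stopping_strategy="first_exhausted"). If you want every example from both datasets to be seen, you can pass stopping_strategy="all_exhausted" instead, which oversamples the smaller dataset:

ds_train = interleave_datasets(
    [ds1_train, ds2_train],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="all_exhausted",
)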