I am trying to fine-tune the Facebook BART model, following this article, in order to classify text with my own dataset.
I am using the Trainer object to train:
from transformers import TrainingArguments, Trainer, BartForSequenceClassification

training_args = TrainingArguments(
    output_dir=model_directory,          # output directory
    num_train_epochs=1,                  # total number of training epochs - 3
    per_device_train_batch_size=4,       # batch size per device during training - 16
    per_device_eval_batch_size=16,       # batch size for evaluation - 64
    warmup_steps=50,                     # number of warmup steps for learning rate scheduler - 500
    weight_decay=0.01,                   # strength of weight decay
    logging_dir=model_logs,              # directory for storing logs
    logging_steps=10,
)

model = BartForSequenceClassification.from_pretrained("facebook/bart-base")  # bart-large-mnli

trainer = Trainer(
    model=model,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    compute_metrics=new_compute_metrics,  # a function to compute the metrics
    train_dataset=train_dataset,          # training dataset
    eval_dataset=val_dataset,             # evaluation dataset
)
This is the tokenizer I am using:
from transformers import BartTokenizerFast
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-base')
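In case the dataset preparation matters: I build `train_dataset` and `val_dataset` roughly the way the article does, by wrapping the tokenizer output in a small `torch.utils.data.Dataset`. The sketch below is a simplified stand-in (the `SimpleDataset` name and the data it receives are placeholders, not my exact code):

```python
import torch

# Hypothetical stand-in for how train_dataset / val_dataset are constructed:
# encodings is the dict returned by the tokenizer (input_ids, attention_mask, ...),
# labels is a list of integer class ids.
class SimpleDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, which is what Trainer expects
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
```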
But when I call
trainer.train()
it first prints the following:
***** Running training *****
Num examples = 172
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 11
followed immediately by this error:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1496, in forward
    outputs = self.model(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1222, in forward
    encoder_outputs = self.encoder(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 846, in forward
    layer_outputs = encoder_layer(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 323, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 191, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I have searched this site, GitHub, and the Hugging Face forums, but I still have not found anything that resolves this (I tried adding more memory, lowering the batch size and warmup steps, restarting, specifying CPU or GPU, and so on, but none of it worked for me).
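To clarify what "specifying CPU or GPU" looked like: one of the things I tried was pinning the run to a single GPU (so `torch.nn.DataParallel` and its replicas are not used) and making CUDA errors synchronous to get a more precise traceback. A sketch of that attempt (both environment variables are standard; they must be set before torch initializes CUDA):

```python
import os

# Restrict the process to one GPU so DataParallel replication does not kick in
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Run CUDA kernels synchronously so the traceback points at the real failing op
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```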
I am running this on Databricks; the cluster is a Standard_NC24s_v3 with 4 GPUs and 2 to 6 workers.
If you need any other information, please comment and I will add it as soon as possible.