I fine-tuned a Mistral 7B model on a Georgian dataset of roughly 100,000 articles, including fine-tuning a custom tokenizer. The fine-tuning run took about 9 hours. However, when I try to generate text, the output is not what I expect: regardless of the input, the model always returns the input itself as the output.
Here is the code I used for fine-tuning:
import time
import json
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
# Load dataset, preprocess, and fine-tuning details...
training_args = TrainingArguments(
    output_dir="mistral_georgian_news_finetuning",
    max_steps=3125,
    per_device_train_batch_size=32,
    learning_rate=3e-4,
    # Other arguments...
)
# Fine-tuning setup...
# Start fine-tuning
trainer.train()
To test the fine-tuned model, I used the following code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "/path/to/fine-tuned-model"
tokenizer_path = "/path/to/tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
def generate_text(prompt_text, max_length=500):
    input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output[0], skip_special_tokens=True)
prompt = "რამდენიმე დღეში შესრულდება ..."
generated_text = generate_text(prompt)
print(generated_text)
During testing, the model simply echoes the prompt instead of generating new text. Here are the logs observed:
config.json: 0%| | 0.00/571 [00:00<?, ?B/s]
model.safetensors.index.json: 0%| | 0.00/25.1k [00:00<?, ?B/s]
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
model-00001-of-00002.safetensors: 0%| | 0.00/9.94G [00:00<?, ?B/s]
model-00002-of-00002.safetensors: 0%| | 0.00/4.54G [00:00<?, ?B/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
generation_config.json: 0%| | 0.00/116 [00:00<?, ?B/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
I'm not sure whether the problem lies in how I load and test the model, or whether it is related to the fine-tuning process itself. The model should generate a continuation of the input prompt, but instead it returns the input as the output.
Has anyone run into a similar issue, or can anyone spot what might be wrong with my approach?
Let's refactor your testing code to make sure the model and tokenizer are loaded correctly and that the generation step runs as expected. Here is the updated code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/fine-tuned-model"
tokenizer_path = "/path/to/tokenizer"

def load_model(model_path, tokenizer_path):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer

def generate_text(model, tokenizer, prompt_text, max_new_tokens=500):
    # Pass the attention mask along with the input IDs; this also
    # silences the `pad_token_id` warning seen in your logs.
    inputs = tokenizer(prompt_text, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for new tokens only, not prompt + output
            do_sample=True,                 # sampling makes it obvious if the model can continue at all
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

def main():
    model, tokenizer = load_model(model_path, tokenizer_path)
    prompt = "რამდენიმე დღეში შესრულდება ..."
    generated_text = generate_text(model, tokenizer, prompt)
    print("Generated Text:", generated_text)

if __name__ == "__main__":
    main()
Make sure to replace "/path/to/fine-tuned-model" and "/path/to/tokenizer" with the actual paths to your fine-tuned model and tokenizer. This code verifies that the model and tokenizer load correctly and that the generation step runs as expected. If the problem persists, the fine-tuning process and the data quality likely need further investigation.
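Since you fine-tuned a custom tokenizer, one common culprit for degenerate output is a broken encode/decode round-trip: if Georgian text does not survive tokenization, the model never received a meaningful training signal. As a quick sanity check, you could run something like the sketch below (the helper function is hypothetical; pass in your actual tokenizer):

```python
def check_roundtrip(tokenizer, sample):
    """Encode then decode `sample`; return the decoded text and the
    number of unknown tokens produced along the way."""
    ids = tokenizer(sample).input_ids
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    unk = ids.count(tokenizer.unk_token_id) if tokenizer.unk_token_id is not None else 0
    return decoded, unk

# Usage (hypothetical path; replace with your tokenizer directory):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("/path/to/tokenizer")
#   decoded, unk = check_roundtrip(tok, "რამდენიმე დღეში შესრულდება")
#   print(decoded, unk)
```

If the decoded text does not match the input (beyond whitespace), or most of the IDs map to the unknown token, the tokenizer itself is the problem rather than the generation code.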