Problem generating text from a Mistral 7B model fine-tuned on a Georgian dataset

Votes: 0 · Answers: 1

I fine-tuned a Mistral 7B model on a Georgian dataset of roughly 100,000 articles, which also involved fine-tuning a custom tokenizer. The fine-tuning run took about 9 hours. However, when I try to generate text, the output is not what I expect: no matter what input I provide, the model always returns the input as the output.

Here is the code I used for fine-tuning:

import time
import json
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Load dataset, preprocess, and fine-tuning details...

training_args = TrainingArguments(
    output_dir="mistral_georgian_news_finetuning",
    max_steps=3125,
    per_device_train_batch_size=32,
    learning_rate=3e-4,
    # Other arguments...
)

# Fine-tuning setup...

# Start fine-tuning
trainer.train()
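
(For context, a minimal sketch of what the elided setup step might look like with the Hugging Face Trainer; tokenized_dataset and the collator choice here are assumptions for illustration, not the original code.)

from transformers import DataCollatorForLanguageModeling

# Hypothetical: tokenized_dataset stands in for the preprocessed Georgian articles.
# For causal LM training the collator derives labels from input_ids (mlm=False).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)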

To test the fine-tuned model, I used the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/fine-tuned-model"
tokenizer_path = "/path/to/tokenizer"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

def generate_text(prompt_text, max_length=500):
    input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output[0], skip_special_tokens=True)

prompt = "რამდენიმე დღეში შესრულდება ..."
generated_text = generate_text(prompt)
print(generated_text)

During testing, the model simply echoes the prompt and does not generate any new text. Here are the logs I observed:

config.json: 0%| | 0.00/571 [00:00<?, ?B/s]
model.safetensors.index.json: 0%| | 0.00/25.1k [00:00<?, ?B/s]
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
model-00001-of-00002.safetensors: 0%| | 0.00/9.94G [00:00<?, ?B/s]
model-00002-of-00002.safetensors: 0%| | 0.00/4.54G [00:00<?, ?B/s]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
generation_config.json: 0%| | 0.00/116 [00:00<?, ?B/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

I'm not sure whether the problem lies in how I load and test the model, or whether it stems from the fine-tuning process itself. The model should generate a continuation of the input prompt, but instead it returns the input as the output.

Has anyone run into a similar issue, or can anyone spot what might be wrong with my approach?

nlp huggingface language-model fine-tuning text-generation
1 Answer

0 votes

Let's restructure your testing code to make sure the model and tokenizer load correctly and that the generation step behaves as expected. Here is the updated code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/fine-tuned-model"
tokenizer_path = "/path/to/tokenizer"

def load_model(model_path, tokenizer_path):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    # Move the model to the GPU if one is available and switch to inference mode.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()
    return model, tokenizer

def generate_text(model, tokenizer, prompt_text, max_new_tokens=200):
    # Tokenize the prompt and place the tensors on the same device as the model.
    input_ids = tokenizer(prompt_text, return_tensors="pt").input_ids.to(model.device)
    # max_new_tokens bounds only the generated continuation, so a long prompt
    # cannot eat into the generation budget the way max_length can.
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

def main():
    model, tokenizer = load_model(model_path, tokenizer_path)

    prompt = "რამდენიმე დღეში შესრულდება ..."
    generated_text = generate_text(model, tokenizer, prompt)
    print("Generated Text:", generated_text)

if __name__ == "__main__":
    main()
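
If the model still tends to repeat the prompt under greedy decoding, sampling can be worth trying. A sketch (the parameter values are illustrative, not tuned for this model):

# Illustrative sampling settings; the values are assumptions, not tuned.
output = model.generate(
    input_ids,
    max_new_tokens=200,              # bound only the newly generated tokens
    do_sample=True,                  # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))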

Make sure to replace "/path/to/fine-tuned-model" and "/path/to/tokenizer" with the actual paths to your fine-tuned model and tokenizer. This version places the model on the available device, runs generation in inference mode, and bounds only the newly generated tokens. If the problem persists, the fine-tuning process and the data quality are worth investigating further.
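
Since a custom tokenizer was involved, one likely culprit is a mismatch between the tokenizer's vocabulary and the model's embedding matrix. A quick sanity check (a sketch, assuming the model and tokenizer loaded above):

# Sanity checks for a custom tokenizer: compare the vocabulary size with the
# number of embedding rows, and round-trip some Georgian text.
print("tokenizer vocab size:", len(tokenizer))
print("model embedding rows:", model.get_input_embeddings().weight.shape[0])
# If these differ, the fine-tuning script should have called:
# model.resize_token_embeddings(len(tokenizer))

sample = "რამდენიმე დღეში"
ids = tokenizer(sample).input_ids
print("round trip:", tokenizer.decode(ids, skip_special_tokens=True))

If the round trip mangles the text or the sizes disagree, the model was trained against token IDs it cannot interpret, which would explain the degenerate output.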
