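Background
I am fine-tuning the Mistral-7B-Instruct-v0.1 model using the same workflow (with SageMaker) outlined in these two blog posts:
Everything looks great, and the fine-tuned model produces very good results. However, I am curious about the progress bar.
When I fine-tune on a small dataset of 100 observations with the following settings:
This is the progress bar I get during fine-tuning: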
0%| | 0/4 [00:00<?, ?it/s]
25%|██▌ | 1/4 [00:24<01:14, 24.76s/it]
50%|█████ | 2/4 [00:49<00:49, 24.51s/it]
75%|███████▌ | 3/4 [01:13<00:24, 24.47s/it]
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, 'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, 'epoch': 1.78}
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
100%|██████████| 4/4 [01:37<00:00, 24.47s/it]
When I run fine-tuning on a dataset of 10,000 observations, the progress bar looks like this (only the final iterations are shown here):
100%|█████████▉| 491/492 [3:19:46<00:24, 24.41s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
{'train_runtime': 12010.6264, 'train_samples_per_second': 0.164, 'train_steps_per_second': 0.041, 'train_loss': 0.5181044475819038, 'epoch': 2.0}
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.41s/it]
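Question: I don't understand the iteration counts in the progress bar
With only 100 observations, two epochs, a per_device_train_batch_size of 1, and gradient_accumulation_steps of 4, I would expect the number of steps to be 200 / 4 = 50.
Similarly, with 10,000 observations, the number of steps should be 20,000 / 4 = 5,000.
Why does the progress bar show 4 and 492 iteration steps instead?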
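For reference, the arithmetic above can be written out as a small sketch. It assumes (the logs do not confirm this) that the Hugging Face Trainer takes one optimizer step per gradient_accumulation_steps batches and floors the per-epoch count; both function names are hypothetical. The second helper works backwards from the logged 'epoch' value to estimate how many training sequences the trainer actually iterated over.

```python
import math


def expected_optimizer_steps(num_sequences, epochs, per_device_batch_size=1,
                             gradient_accumulation_steps=4, num_devices=1):
    """Optimizer steps a Trainer-style loop would report for num_sequences rows."""
    batches_per_epoch = math.ceil(num_sequences / (per_device_batch_size * num_devices))
    # Assumption: batches that do not fill an accumulation window are floored away.
    steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(epochs * steps_per_epoch)


def implied_num_sequences(steps, logged_epoch, per_device_batch_size=1,
                          gradient_accumulation_steps=4, num_devices=1):
    """Estimate the dataset size the trainer saw from the step count and logged epoch."""
    sequences_seen = steps * per_device_batch_size * gradient_accumulation_steps * num_devices
    return sequences_seen / logged_epoch


# What the question expects from the raw observation counts.
print(expected_optimizer_steps(100, epochs=2))      # 50
print(expected_optimizer_steps(10_000, epochs=2))   # 5000

# What the logged values imply (an estimate, since 'epoch' is rounded in the log).
print(round(implied_num_sequences(4, 1.78)))        # 9
print(round(implied_num_sequences(492, 2.0)))       # 984
```

Under these assumptions, roughly 9 and 984 training sequences would reproduce the 4 and 492 steps in the progress bars, which suggests the trainer is iterating over far fewer rows than the raw observation count. One thing worth checking in run_sft.py is whether samples are packed or otherwise regrouped up to max_seq_len before training.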
Code
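import boto3
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.huggingface import HuggingFace

job_name = 'mistralinstruct-7b-hf-mini'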
hyperparameters = {
    'dataset_path': '/opt/ml/input/data/training/train_dataset.json',
    'model_id': "mistralai/Mistral-7B-Instruct-v0.1",
    'max_seq_len': 3872,
    'use_qlora': True,
    'num_train_epochs': 2,
    'per_device_train_batch_size': 1,
    'gradient_accumulation_steps': 4,
    'gradient_checkpointing': True,
    'optim': "adamw_torch_fused",
    'logging_steps': 25,
    'save_strategy': "steps",
    'save_steps': 100,
    'learning_rate': 2e-4,
    'bf16': True,
    'tf32': True,
    'max_grad_norm': 1.0,
    'warmup_ratio': 0.03,
    'lr_scheduler_type': "constant",
    'report_to': "tensorboard",
    'output_dir': "/opt/ml/checkpoints",
    'merge_adapters': True,
}
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = 'com.ravenpack.dsteam.research.testing'
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
print(sagemaker_session_bucket)
try:
    role = sagemaker.get_execution_role()
except ValueError:
    # outside a SageMaker environment, look the role up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='SageMaker-ds-research-testing')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
tensorboard_output_config = TensorBoardOutputConfig(
    container_local_output_path='/opt/ml/output/tensorboard',
    s3_output_path=f's3://{sess.default_bucket()}/...{my_path}...',
)
# raw strings so that \s reaches the regex engine without an invalid-escape warning
metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss':\s*([0-9.]+)"},
    {'Name': 'grad_norm', 'Regex': r"'grad_norm':\s*([0-9.]+)"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate':\s*([0-9.]+)"},
    {'Name': 'epoch', 'Regex': r"'epoch':\s*([0-9.]+)"},
]
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_sft.py',  # train script (used Philipp Schmid's from https://github.com/philschmid/llm-sagemaker-sample/blob/main/scripts/trl/run_sft.py)
    source_dir='...{my_path}...',
    instance_type='ml.g5.4xlarge',
    instance_count=1,
    max_run=1*24*60*60,
    max_wait=2*24*60*60,
    use_spot_instances=True,
    base_job_name=job_name,
    role=role,
    volume_size=300,
    transformers_version='4.36',
    pytorch_version='2.1',
    py_version='py310',
    hyperparameters=hyperparameters,
    disable_output_compression=True,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",
    },
    metric_definitions=metric_definitions,
    tensorboard_output_config=tensorboard_output_config,
    checkpoint_s3_uri=f's3://{sess.default_bucket()}/...{my_path}...',
)
training_input_path = f's3://{sess.default_bucket()}/...{my_path}...'
data = {'training': training_input_path}
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
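The metric_definitions patterns can be sanity-checked locally against a line from the training log before launching a job. The sketch below is only a local check with Python's re module, not a reproduction of how SageMaker applies the patterns; extract_metrics is a hypothetical helper, and the log line is the final summary line from the 100-observation run.

```python
import re

# Two of the patterns from metric_definitions, written as raw strings.
metric_definitions = [
    {"Name": "loss", "Regex": r"'loss':\s*([0-9.]+)"},
    {"Name": "epoch", "Regex": r"'epoch':\s*([0-9.]+)"},
]

# Final summary line printed at the end of the 100-observation run.
log_line = (
    "{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, "
    "'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, "
    "'epoch': 1.78}"
)


def extract_metrics(line, definitions):
    """Return {metric name: float value} for every pattern that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_line, metric_definitions))  # {'epoch': 1.78}
```

Note that the 'loss' pattern does not match 'train_loss' in this summary line, because the pattern requires a quote immediately before loss. With logging_steps set to 25 and only 4 steps in the short run, no intermediate 'loss' line would be expected either.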
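Does this only happen in the SageMaker notebook environment, or does it also happen in a plain terminal environment (for example, an SSH session)?
I found a similar post here: progress bar goes crazy in Jupyter notebook
Depending on how a character-based progress bar is rendered, a similar problem can occur in which multiple lines are rendered unintentionally.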
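The rendering point can be illustrated with a toy sketch. It assumes the progress bar redraws itself by emitting a carriage return before each update, which is how character-based bars commonly work; the stream contents and both function names are made up for illustration.

```python
def render_terminal(stream):
    """Emulate a terminal: a carriage return restarts the current line."""
    lines, current = [], ""
    for ch in stream:
        if ch == "\r":
            current = ""  # simplified: a real terminal overwrites in place
        elif ch == "\n":
            lines.append(current)
            current = ""
        else:
            current += ch
    if current:
        lines.append(current)
    return lines


def render_log_capture(stream):
    """Emulate a log viewer that turns every carriage return into a line break."""
    return [part for part in stream.replace("\r", "\n").split("\n") if part]


# Hypothetical character stream a progress bar might emit for four updates.
stream = "\r 25% 1/4\r 50% 2/4\r 75% 3/4\r100% 4/4\n"

print(render_terminal(stream))     # ['100% 4/4']
print(render_log_capture(stream))  # [' 25% 1/4', ' 50% 2/4', ' 75% 3/4', '100% 4/4']
```

This kind of rendering difference would explain why every update shows up on its own line in captured logs, but it changes only how many lines are printed, not the totals (the 4 in 1/4 or the 492 in 491/492).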