Read a pretrained Hugging Face Transformers model directly from S3

Asked · Votes: 0 · Answers: 2

Loading a Hugging Face pretrained transformer model seems to require you to save the model locally (as described here), so that you can simply pass the local path to the model and config:

model = PreTrainedModel.from_pretrained('path/to/model', local_files_only=True)

Is this possible with the model stored on S3?

amazon-s3 huggingface-transformers
2 Answers
5 votes

Answering my own question... (this is apparently encouraged).

I got this working with a transient file (NamedTemporaryFile), which does the trick. I was hoping to find an in-memory solution (i.e. passing a BytesIO directly to from_pretrained), but that would require a patch to the transformers codebase.

import boto3
import json

from contextlib import contextmanager
from io import BytesIO
from tempfile import NamedTemporaryFile
from transformers import PretrainedConfig, PreTrainedModel

@contextmanager
def s3_fileobj(bucket, key):
    """
    Yields a file object for the object stored at {bucket}/{key}

    Args:
        bucket (str): Name of the S3 bucket where your model is stored
        key (str): Relative path from the base of your bucket, including the filename and extension of the object to be retrieved.
    """
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    yield BytesIO(obj["Body"].read())

def load_model(bucket, path_to_model, model_name='pytorch_model'):
    """
    Load a model at the given S3 path. It is assumed that your model is stored at the key:

        '{path_to_model}/{model_name}.bin'

    and that a config has also been generated at the same path named:

        f'{path_to_model}/config.json'

    """
    tempfile = NamedTemporaryFile()
    with s3_fileobj(bucket, f'{path_to_model}/{model_name}.bin') as f:
        tempfile.write(f.read())
    # Flush so the buffered bytes are on disk before from_pretrained reads the file
    tempfile.flush()

    with s3_fileobj(bucket, f'{path_to_model}/config.json') as f:
        dict_data = json.load(f)
        config = PretrainedConfig.from_dict(dict_data)

    model = PreTrainedModel.from_pretrained(tempfile.name, config=config)
    return model

model = load_model('my_bucket', 'path/to/model')
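
A note on the in-memory route mentioned above: it does not strictly require a patch if you are willing to name the architecture yourself, because torch.load accepts a BytesIO and the resulting state dict can be applied with load_state_dict. The sketch below is only an illustration under that assumption; BertConfig/BertModel and the helper name load_model_in_memory are placeholders for whatever classes your checkpoint was actually saved from.

import json
from io import BytesIO

import boto3
import torch
from transformers import BertConfig, BertModel  # assumption: a BERT-style checkpoint

def load_model_in_memory(bucket, path_to_model, model_name='pytorch_model'):
    """Build the model from config.json and the .bin weights without touching disk."""
    s3 = boto3.client("s3")
    config_bytes = s3.get_object(Bucket=bucket, Key=f'{path_to_model}/config.json')["Body"].read()
    weight_bytes = s3.get_object(Bucket=bucket, Key=f'{path_to_model}/{model_name}.bin')["Body"].read()

    config = BertConfig.from_dict(json.loads(config_bytes))
    model = BertModel(config)
    # torch.load can read straight from an in-memory buffer
    state_dict = torch.load(BytesIO(weight_bytes), map_location="cpu")
    # strict=False skips keys that do not line up exactly (e.g. task-specific heads)
    model.load_state_dict(state_dict, strict=False)
    model.eval()
    return model

If the checkpoint was saved from a task-specific class such as BertForSequenceClassification, instantiate that class instead so the state-dict keys match and strict=True can be kept.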

0 votes

You can also do something similar with a model directory:

import os
import shutil
import tempfile

from transformers import AutoModel

def load_project_processing_models_from_s3():
    # get_s3_client() and secret_vault are project-specific helpers/configuration
    s3_client = get_s3_client()
    result = s3_client.list_objects(
        Bucket=secret_vault.AWS_S3_MODELS_DIRECTORY,
        Prefix=secret_vault.AWS_S3_DOC_PARSER_MODEL_DIR,
    )
    # Create a temporary directory
    temp_dir = tempfile.mkdtemp()
    keys = [obj["Key"] for obj in result.get("Contents", [])]
    for key in keys:
        # Download each object under the prefix into the temporary directory
        file_path = os.path.join(temp_dir, os.path.basename(key))
        with open(file_path, "wb") as file:
            file_data = s3_client.get_object(
                Bucket=secret_vault.AWS_S3_MODELS_DIRECTORY, Key=key
            )
            file.write(file_data["Body"].read())
    # Load the Hugging Face model from the temporary directory
    model = AutoModel.from_pretrained(temp_dir)

    # Clean up the temporary directory
    shutil.rmtree(temp_dir)

    return model
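
One design note on the approach above: tempfile.TemporaryDirectory used as a context manager removes the directory even if from_pretrained raises, and boto3's download_file streams objects straight to disk. Below is a hedged variant of the same idea; the function name and the bucket/prefix parameters are placeholders, not part of the original project code.

import os
import tempfile

import boto3
from transformers import AutoModel

def load_model_from_s3_prefix(bucket, prefix):
    """Download every object under `prefix` into a temporary directory and load it."""
    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as temp_dir:
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip "folder" placeholder keys
            local_path = os.path.join(temp_dir, os.path.basename(obj["Key"]))
            s3.download_file(bucket, obj["Key"], local_path)
        # Load while the directory still exists; it is deleted on leaving the block
        model = AutoModel.from_pretrained(temp_dir)
    return model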