Llama 3 70B with vLLM fails with "assert len(running_scheduled.prefill_seq_groups) == 0" while waiting for predictions

Question

vLLM 0.4.1, Llama 3 Instruct 70B, FastAPI, serving on GCP Vertex AI.

Calling predict multiple times triggers the assertion:

assert len(running_scheduled.prefill_seq_groups) == 0

The model is loaded as follows:

    from vllm import LLM
    ...
    self.model = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        tensor_parallel_size=8,
        enable_prefix_caching=False,
        max_model_len=4096,
        download_dir="/dev/shm/cache/huggingface",
    )
Tags: vllm

1 Answer

Apparently the example above uses the vLLM model from asynchronous code (the FastAPI handlers), so the synchronous LLM class should be replaced with AsyncLLMEngine.

Here is an example:

    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.usage.usage_lib import UsageContext

    engine_args = AsyncEngineArgs(
        model=model_config.hf_model_path,
        engine_use_ray=bool(model_config.tensor_parallel_size > 1),
        ...
    )

    self.model = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.API_SERVER
    )

To generate responses, follow vLLM's own API server implementation.
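
For illustration, here is a minimal sketch of a handler method driving the async engine; the method name predict, the prompt argument, and the SamplingParams values are assumptions for this sketch, not part of the original answer:

    import uuid

    from vllm import SamplingParams

    async def predict(self, prompt: str) -> str:
        # Every request submitted to AsyncLLMEngine needs a unique request id.
        request_id = str(uuid.uuid4())
        # Placeholder sampling settings; tune for your workload.
        sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
        final_output = None
        # generate() returns an async generator that streams incremental
        # RequestOutput objects; the last one holds the full completion.
        async for request_output in self.model.generate(
            prompt, sampling_params, request_id
        ):
            final_output = request_output
        return final_output.outputs[0].text

Passing a fresh request_id per call lets the engine's scheduler track, schedule, and abort each request independently, which is what the synchronous LLM class cannot do under concurrent calls.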
