vLLM 0.4.1 Llama 3 Instruct 70B served with FastAPI on GCP Vertex AI
Multiple calls to predict trigger the assertion:
assert len(running_scheduled.prefill_seq_groups) == 0
The model is loaded as follows:
from vllm import LLM
...
self.model = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    enable_prefix_caching=False,
    max_model_len=4096,
    download_dir="/dev/shm/cache/huggingface",
)
Apparently the example above uses the vLLM model asynchronously through the synchronous LLM class, whereas AsyncLLMEngine should be used instead. Here is an example:
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.usage.usage_lib import UsageContext

engine_args = AsyncEngineArgs(
    model=model_config.hf_model_path,
    engine_use_ray=bool(model_config.tensor_parallel_size > 1),
    ...
)
self.model = AsyncLLMEngine.from_engine_args(
    engine_args, usage_context=UsageContext.API_SERVER)
To generate responses, follow their server implementation.
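A minimal sketch of that pattern is below, loosely following vLLM's api_server example (vllm/entrypoints/api_server.py). The route path, request fields, and sampling parameters are illustrative assumptions, and the engine arguments are abbreviated to those shown earlier:

from fastapi import FastAPI, Request
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.usage.usage_lib import UsageContext
from vllm.utils import random_uuid

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    max_model_len=4096,
    engine_use_ray=True,  # tensor_parallel_size > 1
)
engine = AsyncLLMEngine.from_engine_args(
    engine_args, usage_context=UsageContext.API_SERVER)

app = FastAPI()

@app.post("/generate")
async def generate(request: Request) -> dict:
    body = await request.json()
    prompt = body["prompt"]
    sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
    request_id = random_uuid()

    # generate() returns an async generator that yields RequestOutput
    # snapshots; the last snapshot holds the completed generation.
    final_output = None
    async for request_output in engine.generate(prompt, sampling_params,
                                                request_id):
        final_output = request_output

    return {"text": [output.text for output in final_output.outputs]}

Unlike the synchronous LLM class, the async engine multiplexes concurrent requests through its own background scheduling loop, so overlapping calls to the route are handled safely.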