vLLM
High-throughput inference engine for serving large language models.
vLLM is a serving framework optimised for production LLM inference, using a technique called PagedAttention to manage attention-key memory the way an operating system manages virtual memory. The result is high throughput at long context lengths and efficient batching of many concurrent requests.
It runs on Linux with CUDA or ROCm and exposes an OpenAI-compatible HTTP API. vLLM is widely deployed behind cloud LLM endpoints and on on-premises GPU clusters.
Install
pip install vllm
Authors
- UC Berkeley Sky Computing Lab
- vLLM contributors
