vLLM

High-throughput inference engine for serving large language models.

vLLM is a serving framework optimised for production LLM inference, using a technique called PagedAttention to manage attention-key memory the way an operating system manages virtual memory. The result is high throughput at long context lengths and efficient batching of many concurrent requests.

It runs on Linux with CUDA or ROCm and exposes an OpenAI-compatible HTTP API. vLLM is widely deployed behind cloud LLM endpoints and on on-premises GPU clusters.

License: Apache-2.0

Category: AI / ML

Website: https://github.com/vllm-project/vllm

Install

pip install vllm

Authors

UC Berkeley Sky Computing Lab
vLLM contributors

PreviousVLC media player Nextvnstat