← Software

vLLM

High-throughput inference engine for serving large language models.

vLLM is a serving framework optimised for production LLM inference, using a technique called PagedAttention to manage attention-key memory the way an operating system manages virtual memory. The result is high throughput at long context lengths and efficient batching of many concurrent requests.

It runs on Linux with CUDA or ROCm and exposes an OpenAI-compatible HTTP API. vLLM is widely deployed behind cloud LLM endpoints and on on-premises GPU clusters.

License: Apache-2.0

Category: AI / ML

Website: https://github.com/vllm-project/vllm

Install

pip install vllm

Authors

  • UC Berkeley Sky Computing Lab
  • vLLM contributors
PreviousVLC media player Nextvnstat
Textbook of Linux — Learn Linux on iPhone — Download on the App Store