llama.cpp
Plain C/C++ inference engine for LLaMA-family models.
llama.cpp is a portable C/C++ implementation of LLM inference focused on running quantised models efficiently on CPUs and consumer GPUs. It supports a wide range of quantisation formats (Q4_K_M, Q5_K_M, Q8_0, and many others) and runs on x86, ARM (including Apple Silicon), CUDA, Metal, Vulkan, and OpenCL backends.
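To make the idea of a quantisation format concrete, here is a minimal sketch of block-wise 8-bit quantisation in the spirit of Q8_0: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. The names `BlockQ8`, `quantize_q8`, and `dequantize_q8` are hypothetical and chosen for illustration; this is not the library's actual code, and it keeps the scale as a 32-bit float for simplicity, whereas the real format stores it in half precision.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not llama.cpp's actual types.
constexpr int kBlockSize = 32;  // Q8_0 groups weights into blocks of 32

struct BlockQ8 {
    float  scale;           // assumption: float here, fp16 in the real format
    int8_t q[kBlockSize];   // quantised values in [-127, 127]
};

// Quantise n floats; n is assumed to be a multiple of kBlockSize.
std::vector<BlockQ8> quantize_q8(const float* x, int n) {
    std::vector<BlockQ8> out(n / kBlockSize);
    for (int b = 0; b < n / kBlockSize; ++b) {
        const float* src = x + b * kBlockSize;

        // Largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (int i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }

        const float d   = amax / 127.0f;
        const float inv = (d != 0.0f) ? 1.0f / d : 0.0f;  // all-zero block

        out[b].scale = d;
        for (int i = 0; i < kBlockSize; ++i) {
            out[b].q[i] = static_cast<int8_t>(std::lround(src[i] * inv));
        }
    }
    return out;
}

// Reconstruct approximate floats: y = scale * q for every element.
void dequantize_q8(const std::vector<BlockQ8>& blocks, float* y) {
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (int i = 0; i < kBlockSize; ++i) {
            y[b * kBlockSize + i] = blocks[b].scale * blocks[b].q[i];
        }
    }
}
```

Calling `quantize_q8` on 64 floats yields two blocks, and `dequantize_q8` recovers each value to within about half of that block's scale. The K-quant formats such as Q4_K_M use fewer bits per weight and a more elaborate block layout, which this sketch does not attempt to reproduce.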
It is the inference runtime behind Ollama, LM Studio, and many other local-LLM tools, and it serves as the reference implementation of the GGUF model format.
Install
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
Newer checkouts build with CMake rather than make. If make fails, run cmake -B build followed by cmake --build build --config Release from the repository root.
Authors
- Georgi Gerganov (creator)
- llama.cpp contributors
