LLM Inference Framework Benchmarks
2025 · Lunit

A systematic evaluation of open-source LLM inference frameworks to select the best option for Lunit's Report Generation project. Benchmarked llama.cpp, ExLlamaV2, and Ollama across multiple models (Phi-3.5 Mini, Llama-3.1-8B) and quantization levels (INT4, INT8, FP16) on both CPU and GPU.
Goal
Pick an inference framework that balances speed, memory efficiency, and deployment flexibility for production medical report generation.
What Was Measured
- Time to First Token (TTFT)
- Tokens per second and per-token latency (TPOT)
- Peak memory footprint across quantization levels
- CPU vs GPU inference performance
- Docker vs local deployment overhead
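The latency metrics above can be derived from per-token timestamps of a streaming generation call. A minimal sketch (the `measure` helper and `token_stream` argument are hypothetical; any iterator that yields tokens, e.g. llama-cpp-python with `stream=True`, would fit):

```python
import time
from dataclasses import dataclass

@dataclass
class GenMetrics:
    ttft_s: float        # time to first token
    tokens_per_s: float  # decode-phase throughput
    tpot_ms: float       # time per output token, excluding the first

def measure(token_stream) -> GenMetrics:
    """Compute TTFT, tokens/s, and TPOT from a streaming token iterator."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in token_stream]  # one timestamp per token
    if not stamps:
        raise ValueError("no tokens generated")
    ttft = stamps[0] - start
    gen_time = stamps[-1] - stamps[0]            # decode phase only
    n_decode = len(stamps) - 1
    tpot = (gen_time / n_decode) if n_decode else 0.0
    tps = (n_decode / gen_time) if gen_time > 0 else float("inf")
    return GenMetrics(ttft, tps, tpot * 1000)
```

Separating TTFT (prompt processing) from TPOT (decode) matters because quantization and hardware affect the two phases very differently, as the tables below show.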
Frameworks Tested
llama.cpp / llama-cpp-python
Tested via native CLI, Python bindings, llama-server, and Docker. Covered GGUF model conversion, quantization (Q4_0, Q8_0, FP16), and GPU offloading with CUDA. Emerged as the best overall option — good speed, low memory footprint, wide model support, and flexible deployment options.
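The conversion-and-quantization flow looks roughly like the following; the helper below only assembles the CLI invocations as strings. The tool names (`convert_hf_to_gguf.py`, `llama-quantize`, `llama-server`) come from the llama.cpp repo, while the helper itself, the paths, and the defaults are illustrative assumptions:

```python
import shlex

def gguf_pipeline(hf_model_dir: str, out_prefix: str,
                  qtype: str = "Q4_0", n_gpu_layers: int = 99) -> list[str]:
    """Build shell commands for the HF checkpoint -> GGUF -> quantize -> serve flow.

    Hypothetical helper: tool names are llama.cpp's, everything else is a placeholder.
    """
    fp16 = f"{out_prefix}-f16.gguf"
    quant = f"{out_prefix}-{qtype.lower()}.gguf"
    return [
        # 1. Convert the Hugging Face checkpoint to an FP16 GGUF file
        f"python convert_hf_to_gguf.py {shlex.quote(hf_model_dir)} --outfile {fp16} --outtype f16",
        # 2. Quantize to the target type (e.g. Q4_0 or Q8_0)
        f"llama-quantize {fp16} {quant} {qtype}",
        # 3. Serve with layers offloaded to the GPU via -ngl
        f"llama-server -m {quant} -ngl {n_gpu_layers}",
    ]
```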
ExLlamaV2
GPU-only framework with its own quantization format (EXL2). Achieved the highest throughput: 208 tokens/s on Q4 with Phi-3.5 on an NVIDIA L4. The best raw speed, but limited to GPU environments.
Ollama
Built on llama.cpp with a higher-level API and Docker-first workflow. Simplest setup and model management via Modelfiles, but less fine-grained control over inference parameters.
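For reference, Ollama's Modelfile-based model management looks roughly like this (a sketch; the GGUF path, parameter values, and system prompt are placeholders):

```
# Modelfile: import a local GGUF and pin inference parameters
FROM ./phi-3.5-mini-q4_0.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM "You draft structured medical report text from findings."
```

Built and run with `ollama create report-gen -f Modelfile` followed by `ollama run report-gen`; the trade-off is that knobs like thread count and GPU offload split are less directly exposed than in llama.cpp.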
Key Findings
- llama.cpp offered the best balance of speed, memory, and deployment flexibility
- INT4 quantization reduces model size ~3x with only modest quality trade-offs
- GPU inference (CUDA) achieved ~5-15x speedup over CPU
- ExLlamaV2 matched or exceeded llama.cpp on GPU for raw throughput
- Docker adds slight inference overhead compared to native deployment
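The headline ratios can be read straight off the tables below; as a quick arithmetic check (numbers copied from the Phi-3.5 CPU and GPU tables):

```python
# Model-size reduction from FP16 to Q4_0 (Phi-3.5 Mini)
fp16_gb, q4_gb = 7.2, 2.1
size_ratio = fp16_gb / q4_gb      # ~3.4x smaller

# CPU -> GPU throughput gain at Q4 (2080 Ti vs Xeon Gold 6240)
cpu_tps, gpu_tps = 21.82, 112.16
speedup = gpu_tps / cpu_tps       # ~5.1x faster; TTFT gains are larger still

print(f"{size_ratio:.1f}x smaller, {speedup:.1f}x faster")
```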
- Repository: GitHub
- Context: Lunit — Report Generation project
- Models: Phi-3.5 Mini Instruct, Llama-3.1-8B Instruct
- Hardware: Intel Xeon Gold 6240 (CPU), NVIDIA 2080 Ti / L4 (GPU)
- Stack: llama.cpp, llama-cpp-python, ExLlamaV2, Ollama, CUDA, Docker
CPU Results — Phi-3.5 Mini Instruct
llama-cpp-python · Intel Xeon Gold 6240 · 18 threads
| Metric | Q4_0 | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 6.18 ± 0.98 s | 6.18 ± 0.47 s | 5.84 ± 1.19 s |
| Generation Time | 23.47 ± 0.37 s | 50.14 ± 0.42 s | 69.97 ± 0.27 s |
| Tokens/s | 21.82 | 10.21 | 7.32 |
| TPOT | 45.84 ms | 97.93 ms | 136.67 ms |
| Peak Memory | 3,035 MB | 4,859 MB | 8,275 MB |
| Model Size | 2.1 GB | 3.8 GB | 7.2 GB |
CPU Results — Llama-3.1-8B Instruct
llama-cpp-python · Intel Xeon Gold 6240 · 18 threads
| Metric | Q4_0 | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 9.80 ± 4.32 s | 8.84 ± 0.67 s | 10.12 ± 3.07 s |
| Generation Time | 26.69 ± 3.08 s | 74.26 ± 0.26 s | 114.13 ± 1.77 s |
| Tokens/s | 15.13 | 6.05 | 3.78 |
| TPOT | 66.90 ms | 165.39 ms | 264.79 ms |
| Peak Memory | 4,925 MB | 8,657 MB | 15,836 MB |
| Model Size | 4.4 GB | 8.0 GB | 15 GB |
GPU Results — Phi-3.5 Mini Instruct
llama-cpp-python + CUDA · NVIDIA 2080 Ti (11 GB)
| Metric | Q4_K_S | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 0.39 s | 0.37 s | 0.33 s |
| Generation Time | 4.86 s | 6.29 s | 9.28 s |
| Tokens/s | 112.16 | 85.50 | 57.00 |
| TPOT | 8.92 ms | 11.70 ms | 17.50 ms |
| GPU Memory | 3,193 MB | 4,933 MB | 8,247 MB |
GPU Results — Llama-3.1-8B Instruct
llama-cpp-python + CUDA · NVIDIA 2080 Ti (11 GB)
| Metric | Q4_0 | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 0.48 s | 0.49 s | OOM |
| Generation Time | 5.18 s | 6.21 s | OOM |
| Tokens/s | 82.49 | 56.42 | OOM |
| TPOT | 12.12 ms | 17.72 ms | OOM |
| GPU Memory | 4,895 MB | 8,309 MB | > 11 GB |
ExLlamaV2 — Phi-3.5 Mini Instruct
NVIDIA L4 (24 GB)
| Metric | Q4_0 | Q8_0 |
|---|---|---|
| Tokens/s | 208.10 | 148.35 |
| GPU Memory | 4,600 MB | 6,350 MB |
Memory Footprint — Phi-3.5 Mini (CPU)
| Precision | Peak Memory | Model Size |
|---|---|---|
| INT4 | 3.0 GB | 2.1 GB |
| INT8 | 4.9 GB | 3.8 GB |
| FP16 | 8.3 GB | 7.2 GB |
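Subtracting model size from peak memory gives the non-weight runtime overhead (presumably KV cache plus runtime buffers), which stays roughly constant at ~1 GB across precisions. A quick check on the figures above:

```python
# (peak memory, model file size) in GB, Phi-3.5 Mini on CPU
footprints = {"INT4": (3.0, 2.1), "INT8": (4.9, 3.8), "FP16": (8.3, 7.2)}
for prec, (peak, model) in footprints.items():
    overhead = peak - model   # roughly ~1 GB regardless of precision
    print(f"{prec}: {overhead:.1f} GB overhead")
```

This suggests that at low quantization levels a larger share of the footprint goes to fixed runtime costs rather than weights.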