LLM Inference Framework Benchmarks
2025 · Lunit

A systematic evaluation of open-source LLM inference frameworks to select the best option for Lunit's Report Generation project. Benchmarked llama.cpp, ExLlamaV2, and Ollama across multiple models (Phi-3.5 Mini, Llama-3.1-8B) and quantization levels (INT4, INT8, FP16) on both CPU and GPU.
Goal
Pick an inference framework that balances speed, memory efficiency, and deployment flexibility for production medical report generation.
What Was Measured
- Time to First Token (TTFT)
- Tokens per second and per-token latency (TPOT)
- Peak memory footprint across quantization levels
- CPU vs GPU inference performance
- Docker vs local deployment overhead
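The latency metrics above can be derived from per-token timestamps of a streaming generation call. A minimal sketch (the `measure` helper and `token_stream` argument are hypothetical; any iterator that yields tokens, e.g. llama-cpp-python with `stream=True`, would fit):

```python
import time
from dataclasses import dataclass

@dataclass
class GenMetrics:
    ttft_s: float        # time to first token
    tokens_per_s: float  # decode-phase throughput
    tpot_ms: float       # time per output token, excluding the first

def measure(token_stream) -> GenMetrics:
    """Compute TTFT, tokens/s, and TPOT from a streaming token iterator."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in token_stream]  # one timestamp per token
    if not stamps:
        raise ValueError("no tokens generated")
    ttft = stamps[0] - start
    gen_time = stamps[-1] - stamps[0]            # decode phase only
    n_decode = len(stamps) - 1
    tpot = (gen_time / n_decode) if n_decode else 0.0
    tps = (n_decode / gen_time) if gen_time > 0 else float("inf")
    return GenMetrics(ttft, tps, tpot * 1000)
```

Separating TTFT (prompt processing) from TPOT (decode) matters because quantization and hardware affect the two phases very differently, as the tables below show.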
Frameworks Tested
llama.cpp / llama-cpp-python
Tested via native CLI, Python bindings, llama-server, and Docker. Covered GGUF model conversion, quantization (Q4_0, Q8_0, FP16), and GPU offloading with CUDA. Emerged as the best overall option — good speed, low memory footprint, wide model support, and flexible deployment options.
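The conversion-and-quantization flow looks roughly like the following; the helper below only assembles the CLI invocations as strings. The tool names (`convert_hf_to_gguf.py`, `llama-quantize`, `llama-server`) come from the llama.cpp repo, while the helper itself, the paths, and the defaults are illustrative assumptions:

```python
import shlex

def gguf_pipeline(hf_model_dir: str, out_prefix: str,
                  qtype: str = "Q4_0", n_gpu_layers: int = 99) -> list[str]:
    """Build shell commands for the HF checkpoint -> GGUF -> quantize -> serve flow.

    Hypothetical helper: tool names are llama.cpp's, everything else is a placeholder.
    """
    fp16 = f"{out_prefix}-f16.gguf"
    quant = f"{out_prefix}-{qtype.lower()}.gguf"
    return [
        # 1. Convert the Hugging Face checkpoint to an FP16 GGUF file
        f"python convert_hf_to_gguf.py {shlex.quote(hf_model_dir)} --outfile {fp16} --outtype f16",
        # 2. Quantize to the target type (e.g. Q4_0 or Q8_0)
        f"llama-quantize {fp16} {quant} {qtype}",
        # 3. Serve with layers offloaded to the GPU via -ngl
        f"llama-server -m {quant} -ngl {n_gpu_layers}",
    ]
```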
ExLlamaV2
GPU-only framework with its own quantization format (EXL2). Achieved the highest throughput: 208 tokens/s on Q4 with Phi-3.5 on an NVIDIA L4. The best raw speed, but limited to GPU environments.
Ollama
Built on llama.cpp with a higher-level API and Docker-first workflow. Simplest setup and model management via Modelfiles, but less fine-grained control over inference parameters.
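For reference, Ollama's Modelfile-based model management looks roughly like this (a sketch; the GGUF path, parameter values, and system prompt are placeholders):

```
# Modelfile: import a local GGUF and pin inference parameters
FROM ./phi-3.5-mini-q4_0.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM "You draft structured medical report text from findings."
```

Built and run with `ollama create report-gen -f Modelfile` followed by `ollama run report-gen`; the trade-off is that knobs like thread count and GPU offload split are less directly exposed than in llama.cpp.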
Key Findings
- llama.cpp offered the best balance of speed, memory, and deployment flexibility
- INT4 quantization reduces model size ~3x with only modest quality trade-offs
- GPU inference (CUDA) achieved ~5-15x speedup over CPU
- ExLlamaV2 matched or exceeded llama.cpp on GPU for raw throughput
- Docker adds slight inference overhead compared to native deployment
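The headline ratios can be read straight off the tables below; as a quick arithmetic check (numbers copied from the Phi-3.5 CPU and GPU tables):

```python
# Model-size reduction from FP16 to Q4_0 (Phi-3.5 Mini)
fp16_gb, q4_gb = 7.2, 2.1
size_ratio = fp16_gb / q4_gb      # ~3.4x smaller

# CPU -> GPU throughput gain at Q4 (2080 Ti vs Xeon Gold 6240)
cpu_tps, gpu_tps = 21.82, 112.16
speedup = gpu_tps / cpu_tps       # ~5.1x faster; TTFT gains are larger still

print(f"{size_ratio:.1f}x smaller, {speedup:.1f}x faster")
```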
- Repository: GitHub
- Context: Lunit — Report Generation project
- Models: Phi-3.5 Mini Instruct, Llama-3.1-8B Instruct
- Hardware: Intel Xeon Gold 6240 (CPU), NVIDIA 2080 Ti / L4 (GPU)
- Stack: llama.cpp, llama-cpp-python, ExLlamaV2, Ollama, CUDA, Docker
CPU Results — Phi-3.5 Mini Instruct
llama-cpp-python · Intel Xeon Gold 6240 · 18 threads
| Metric | Q4_0 | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 6.18 ± 0.98 s | 6.18 ± 0.47 s | 5.84 ± 1.19 s |
| Generation Time | 23.47 ± 0.37 s | 50.14 ± 0.42 s | 69.97 ± 0.27 s |
| Tokens/s | 21.82 | 10.21 | 7.32 |
| TPOT | 45.84 ms | 97.93 ms | 136.67 ms |
| Peak Memory | 3,035 MB | 4,859 MB | 8,275 MB |
| Model Size | 2.1 GB | 3.8 GB | 7.2 GB |
CPU Results — Llama-3.1-8B Instruct
llama-cpp-python · Intel Xeon Gold 6240 · 18 threads
| Metric | Q4_0 | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 9.80 ± 4.32 s | 8.84 ± 0.67 s | 10.12 ± 3.07 s |
| Generation Time | 26.69 ± 3.08 s | 74.26 ± 0.26 s | 114.13 ± 1.77 s |
| Tokens/s | 15.13 | 6.05 | 3.78 |
| TPOT | 66.90 ms | 165.39 ms | 264.79 ms |
| Peak Memory | 4,925 MB | 8,657 MB | 15,836 MB |
| Model Size | 4.4 GB | 8.0 GB | 15 GB |
GPU Results — Phi-3.5 Mini Instruct
llama-cpp-python + CUDA · NVIDIA 2080 Ti (11 GB)
| Metric | Q4_K_S | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 0.39 s | 0.37 s | 0.33 s |
| Generation Time | 4.86 s | 6.29 s | 9.28 s |
| Tokens/s | 112.16 | 85.50 | 57.00 |
| TPOT | 8.92 ms | 11.70 ms | 17.50 ms |
| GPU Memory | 3,193 MB | 4,933 MB | 8,247 MB |
GPU Results — Llama-3.1-8B Instruct
llama-cpp-python + CUDA · NVIDIA 2080 Ti (11 GB)
| Metric | Q4_0 | Q8_0 | FP16 |
|---|---|---|---|
| TTFT | 0.48 s | 0.49 s | OOM |
| Generation Time | 5.18 s | 6.21 s | OOM |
| Tokens/s | 82.49 | 56.42 | OOM |
| TPOT | 12.12 ms | 17.72 ms | OOM |
| GPU Memory | 4,895 MB | 8,309 MB | > 11 GB |
ExLlamaV2 — Phi-3.5 Mini Instruct
NVIDIA L4 (24 GB)
| Metric | Q4_0 | Q8_0 |
|---|---|---|
| Tokens/s | 208.10 | 148.35 |
| GPU Memory | 4,600 MB | 6,350 MB |
Memory Footprint — Phi-3.5 Mini (CPU)
| Precision | Peak Memory | Model Size |
|---|---|---|
| INT4 | 3.0 GB | 2.1 GB |
| INT8 | 4.9 GB | 3.8 GB |
| FP16 | 8.3 GB | 7.2 GB |
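Subtracting model size from peak memory gives the non-weight runtime overhead (presumably KV cache plus runtime buffers), which stays roughly constant at ~1 GB across precisions. A quick check on the figures above:

```python
# (peak memory, model file size) in GB, Phi-3.5 Mini on CPU
footprints = {"INT4": (3.0, 2.1), "INT8": (4.9, 3.8), "FP16": (8.3, 7.2)}
for prec, (peak, model) in footprints.items():
    overhead = peak - model   # roughly ~1 GB regardless of precision
    print(f"{prec}: {overhead:.1f} GB overhead")
```

This suggests that at low quantization levels a larger share of the footprint goes to fixed runtime costs rather than weights.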