LLM Inference Framework Benchmarks

2025 Lunit

A systematic evaluation of open-source LLM inference frameworks to select the best option for Lunit's Report Generation project. Benchmarked llama.cpp, ExLlamaV2, and Ollama across multiple models (Phi-3.5 Mini, Llama-3.1-8B) and quantization levels (INT4, INT8, FP16) on both CPU and GPU.

Goal

Pick an inference framework that balances speed, memory efficiency, and deployment flexibility for production medical report generation.

What Was Measured

Frameworks Tested

llama.cpp / llama-cpp-python

Tested via native CLI, Python bindings, llama-server, and Docker. Covered GGUF model conversion, quantization (Q4_0, Q8_0, FP16), and GPU offloading with CUDA. Emerged as the best overall option — good speed, low memory footprint, wide model support, and flexible deployment options.
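The conversion-quantization-serving pipeline described above can be sketched as a shell workflow (paths and model names are placeholders; binary names follow recent llama.cpp releases, where older versions used `quantize` and `server` instead of `llama-quantize` and `llama-server`):

```shell
# 1. Convert a Hugging Face checkpoint to GGUF at FP16
#    (convert_hf_to_gguf.py ships with the llama.cpp repo).
python convert_hf_to_gguf.py ./Phi-3.5-mini-instruct \
    --outfile phi-3.5-mini-f16.gguf --outtype f16

# 2. Quantize the FP16 GGUF down to Q4_0 (or Q8_0).
./llama-quantize phi-3.5-mini-f16.gguf phi-3.5-mini-q4_0.gguf Q4_0

# 3. Serve with CUDA offloading: -ngl 99 pushes all layers to the GPU.
./llama-server -m phi-3.5-mini-q4_0.gguf -ngl 99 --port 8080
```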

ExLlamaV2

GPU-only framework with its own quantization format (EXL2). Delivered the highest throughput of any framework tested: 208 tokens/s on Q4 with Phi-3.5 on an NVIDIA L4. Best raw speed, but restricted to GPU environments.

Ollama

Built on llama.cpp with a higher-level API and Docker-first workflow. Simplest setup and model management via Modelfiles, but less fine-grained control over inference parameters.
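A minimal Ollama Modelfile illustrating the workflow described above (the GGUF path, system prompt, and parameter values are illustrative, not taken from the benchmark):

```
FROM ./phi-3.5-mini-q4_0.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM "You are an assistant that drafts medical report text."
```

A model defined this way is registered with `ollama create <name> -f Modelfile` and served with `ollama run <name>`; `FROM`, `PARAMETER`, and `SYSTEM` are standard Modelfile directives.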

Repository: GitHub
Context: Lunit — Report Generation project
Models: Phi-3.5 Mini Instruct, Llama-3.1-8B Instruct
Hardware: Intel Xeon Gold 6240 (CPU), NVIDIA 2080 Ti / L4 (GPU)
Stack: llama.cpp, llama-cpp-python, ExLlamaV2, Ollama, CUDA, Docker

Key Findings

CPU Results — Phi-3.5 Mini Instruct

llama-cpp-python · Intel Xeon Gold 6240 · 18 threads

| Metric | Q4_0 | Q8_0 | FP16 |
| --- | --- | --- | --- |
| TTFT | 6.18 ± 0.98 s | 6.18 ± 0.47 s | 5.84 ± 1.19 s |
| Generation Time | 23.47 ± 0.37 s | 50.14 ± 0.42 s | 69.97 ± 0.27 s |
| Tokens/s | 21.82 | 10.21 | 7.32 |
| TPOT | 45.84 ms | 97.93 ms | 136.67 ms |
| Peak Memory | 3,035 MB | 4,859 MB | 8,275 MB |
| Model Size | 2.1 GB | 3.8 GB | 7.2 GB |
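As a quick consistency check on the metrics above (TTFT is time to first token; TPOT is time per output token), TPOT should be the reciprocal of the decode throughput. A minimal sketch using the reported Q4_0, Q8_0, and FP16 figures:

```python
def tpot_ms(tokens_per_second: float) -> float:
    """Per-token decode latency (ms) implied by a throughput figure."""
    return 1000.0 / tokens_per_second

# Reported (tokens/s, TPOT ms) pairs from the three columns above.
for tps, reported in [(21.82, 45.84), (10.21, 97.93), (7.32, 136.67)]:
    derived = tpot_ms(tps)
    # Derived values match the reported TPOT within rounding error.
    assert abs(derived - reported) < 0.2, (tps, derived, reported)
    print(f"{tps:6.2f} tok/s -> {derived:6.2f} ms/token (reported {reported} ms)")
```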

CPU Results — Llama-3.1-8B Instruct

llama-cpp-python · Intel Xeon Gold 6240 · 18 threads

| Metric | Q4_0 | Q8_0 | FP16 |
| --- | --- | --- | --- |
| TTFT | 9.80 ± 4.32 s | 8.84 ± 0.67 s | 10.12 ± 3.07 s |
| Generation Time | 26.69 ± 3.08 s | 74.26 ± 0.26 s | 114.13 ± 1.77 s |
| Tokens/s | 15.13 | 6.05 | 3.78 |
| TPOT | 66.90 ms | 165.39 ms | 264.79 ms |
| Peak Memory | 4,925 MB | 8,657 MB | 15,836 MB |
| Model Size | 4.4 GB | 8.0 GB | 15 GB |

GPU Results — Phi-3.5 Mini Instruct

llama-cpp-python + CUDA · NVIDIA 2080 Ti (11 GB)

| Metric | Q4_K_S | Q8_0 | FP16 |
| --- | --- | --- | --- |
| TTFT | 0.39 s | 0.37 s | 0.33 s |
| Generation Time | 4.86 s | 6.29 s | 9.28 s |
| Tokens/s | 112.16 | 85.50 | 57.00 |
| TPOT | 8.92 ms | 11.70 ms | 17.50 ms |
| GPU Memory | 3,193 MB | 4,933 MB | 8,247 MB |

GPU Results — Llama-3.1-8B Instruct

llama-cpp-python + CUDA · NVIDIA 2080 Ti (11 GB)

| Metric | Q4_0 | Q8_0 | FP16 |
| --- | --- | --- | --- |
| TTFT | 0.48 s | 0.49 s | OOM |
| Generation Time | 5.18 s | 6.21 s | OOM |
| Tokens/s | 82.49 | 56.42 | OOM |
| TPOT | 12.12 ms | 17.72 ms | OOM |
| GPU Memory | 4,895 MB | 8,309 MB | > 11 GB |

ExLlamaV2 — Phi-3.5 Mini Instruct

NVIDIA L4 (24 GB)

| Metric | Q4_0 | Q8_0 |
| --- | --- | --- |
| Tokens/s | 208.10 | 148.35 |
| GPU Memory | 4,600 MB | 6,350 MB |
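To put the two GPU result sets side by side, a quick ratio from the reported throughput figures. Note the caveat: ExLlamaV2 ran on an L4 while llama.cpp ran on a 2080 Ti, so this mixes framework and hardware effects rather than isolating the framework.

```python
# Throughput (tokens/s) copied from the Phi-3.5 GPU tables above.
llama_cpp_2080ti = {"Q4": 112.16, "Q8": 85.50}   # Q4_K_S / Q8_0 on the 2080 Ti
exllamav2_l4 = {"Q4": 208.10, "Q8": 148.35}      # Q4_0 / Q8_0 on the L4

def speedup(quant: str) -> float:
    """ExLlamaV2-on-L4 throughput relative to llama.cpp-on-2080 Ti."""
    return exllamav2_l4[quant] / llama_cpp_2080ti[quant]

for quant in ("Q4", "Q8"):
    print(f"{quant}: {speedup(quant):.2f}x")
```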

Memory Footprint — Phi-3.5 Mini (CPU)

| Precision | Peak Memory | Model Size |
| --- | --- | --- |
| INT4 | 3.0 GB | 2.1 GB |
| INT8 | 4.9 GB | 3.8 GB |
| FP16 | 8.3 GB | 7.2 GB |
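The model sizes reported throughout track a simple bits-per-weight estimate. A sketch, assuming roughly 3.8B parameters for Phi-3.5 Mini and 8.0B for Llama-3.1-8B (my assumptions, not figures from the benchmark), with GGUF Q4_0 at about 4.5 bits/weight, Q8_0 at about 8.5, and FP16 at 16; the estimates land within roughly 10% of the reported file sizes, with GB-vs-GiB reporting accounting for part of the gap:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB: parameter count times bits/weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Phi-3.5 Mini", 3.8), ("Llama-3.1-8B", 8.0)]:
    for quant, bpw in [("Q4_0", 4.5), ("Q8_0", 8.5), ("FP16", 16.0)]:
        print(f"{name} {quant}: ~{approx_size_gb(params, bpw):.1f} GB")
```

The same arithmetic explains the FP16 OOM for Llama-3.1-8B on the 2080 Ti: roughly 16 GB of weights alone exceeds its 11 GB of VRAM.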