
Running LLMs on CPU: Quantisation in Practice

5 April 2026 · 6 min read

Not every deployment has access to GPU infrastructure. In defence environments especially, you’re often working with air-gapped systems running commodity hardware. We’ve spent the last year making AI inference practical on CPU-only machines.

Why CPU Matters

GPU availability is a luxury. Many of our clients — particularly in defence and healthcare — operate in environments where GPU hardware isn’t available, isn’t approved, or isn’t practical. Air-gapped networks, embedded systems, edge deployments: these are CPU territory.

The good news is that modern quantisation techniques have closed the gap dramatically. A 7B parameter model quantised to 4-bit runs comfortably on a modern desktop CPU with 16GB of RAM.
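The back-of-the-envelope arithmetic behind that claim, as a runnable sketch (the overhead note is an approximation; real quantised files also store per-block scale factors, so actual sizes run slightly higher):

```go
package main

import "fmt"

func main() {
	const params = 7_000_000_000.0 // 7B parameters
	const bytesPerWeight = 0.5     // 4 bits per weight

	// Weights alone: leaves headroom on a 16GB machine for the
	// KV cache, activations, and the operating system.
	fmt.Printf("%.1f GB\n", params*bytesPerWeight/1e9) // prints "3.5 GB"
}
```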

Quantisation: The Practical Bits

Quantisation reduces the precision of model weights from 32-bit or 16-bit floating point down to 8-bit or 4-bit integers. The maths is straightforward — you’re trading precision for speed and memory.
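To make the trade concrete, here is a minimal sketch of symmetric per-tensor int8 quantisation. Production schemes are finer-grained (per-block scales, 4-bit groups), and the function names are ours, but the core mapping is exactly this:

```go
package main

import (
	"fmt"
	"math"
)

// quantise maps float32 weights onto an int8 grid using one symmetric
// per-tensor scale: scale = max|w| / 127, q = round(w / scale).
// Assumes at least one non-zero weight.
func quantise(w []float32) ([]int8, float32) {
	var maxAbs float64
	for _, v := range w {
		maxAbs = math.Max(maxAbs, math.Abs(float64(v)))
	}
	scale := float32(maxAbs / 127)
	q := make([]int8, len(w))
	for i, v := range w {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return q, scale
}

// dequantise recovers approximate float32 values; the error per weight
// is bounded by half a quantisation step.
func dequantise(q []int8, scale float32) []float32 {
	w := make([]float32, len(q))
	for i, v := range q {
		w[i] = float32(v) * scale
	}
	return w
}

func main() {
	w := []float32{0.12, -0.98, 0.55, -0.03}
	q, scale := quantise(w)
	fmt.Println(q, scale, dequantise(q, scale))
}
```

Four bytes per weight become one (or half a byte at 4-bit), and the rounding error is what you pay for it.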

What’s less obvious is which quantisation method to use. We’ve benchmarked GPTQ, AWQ, and GGUF across a range of tasks relevant to our clients. For CPU-only inference the practical answer is GGUF: it’s the format llama.cpp consumes, whereas GPTQ and AWQ tooling is built around GPU kernels.

Real-World Performance

On an Intel Xeon with 64GB RAM (typical server hardware in defence environments), throughput isn’t going to win benchmarks against an A100, but it’s absolutely practical for document summarisation, classification, and structured extraction: the tasks our defence and healthcare clients actually need.

Deployment Architecture

We wrap llama.cpp in a Go service that provides an OpenAI-compatible API. This means client applications don’t need to know whether they’re talking to a local CPU model or a cloud GPU — the interface is identical. Swap one for the other with a config change.

If you’re running AI workloads in constrained environments, we’d like to hear about your use case.