
Running LLMs on CPU: Quantisation in Practice

5 April 2026 · 6 min read

Not every deployment has access to GPU infrastructure. In defence environments especially, you’re often working with air-gapped systems running commodity hardware. We’ve spent the last year making AI inference practical on CPU-only machines.

Why CPU Matters

GPU availability is a luxury. Many of our clients — particularly in defence and healthcare — operate in environments where GPU hardware isn’t available, isn’t approved, or isn’t practical. Air-gapped networks, embedded systems, edge deployments: these are CPU territory.

The good news is that modern quantisation techniques have closed the gap dramatically. A 7B parameter model quantised to 4-bit runs comfortably on a modern desktop CPU with 16GB of RAM.
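The back-of-the-envelope arithmetic behind that claim, as a runnable sketch (the overhead note is an approximation; real quantised files also store per-block scale factors, so actual sizes run slightly higher):

```go
package main

import "fmt"

func main() {
	const params = 7_000_000_000.0 // 7B parameters
	const bytesPerWeight = 0.5     // 4 bits per weight

	// Weights alone: leaves headroom on a 16GB machine for the
	// KV cache, activations, and the operating system.
	fmt.Printf("%.1f GB\n", params*bytesPerWeight/1e9) // prints "3.5 GB"
}
```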

Quantisation: The Practical Bits

Quantisation reduces the precision of model weights from 32-bit or 16-bit floating point down to 8-bit or 4-bit integers. The maths is straightforward — you’re trading precision for speed and memory.
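To make the trade concrete, here is a minimal sketch of symmetric per-tensor int8 quantisation. Production schemes are finer-grained (per-block scales, 4-bit groups), and the function names are ours, but the core mapping is exactly this:

```go
package main

import (
	"fmt"
	"math"
)

// quantise maps float32 weights onto an int8 grid using one symmetric
// per-tensor scale: scale = max|w| / 127, q = round(w / scale).
// Assumes at least one non-zero weight.
func quantise(w []float32) ([]int8, float32) {
	var maxAbs float64
	for _, v := range w {
		maxAbs = math.Max(maxAbs, math.Abs(float64(v)))
	}
	scale := float32(maxAbs / 127)
	q := make([]int8, len(w))
	for i, v := range w {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return q, scale
}

// dequantise recovers approximate float32 values; the error per weight
// is bounded by half a quantisation step.
func dequantise(q []int8, scale float32) []float32 {
	w := make([]float32, len(q))
	for i, v := range q {
		w[i] = float32(v) * scale
	}
	return w
}

func main() {
	w := []float32{0.12, -0.98, 0.55, -0.03}
	q, scale := quantise(w)
	fmt.Println(q, scale, dequantise(q, scale))
}
```

Four bytes per weight become one (or half a byte at 4-bit), and the rounding error is what you pay for it.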

What’s less obvious is which quantisation method to use. We’ve benchmarked GPTQ, AWQ, and GGUF across a range of tasks relevant to our clients. For CPU-only inference the practical answer is GGUF: it’s the format llama.cpp consumes, whereas GPTQ and AWQ tooling is built around GPU kernels.

Real-World Performance

On an Intel Xeon with 64GB RAM (typical server hardware in defence environments), throughput isn’t going to win benchmarks against an A100, but it’s absolutely practical for document summarisation, classification, and structured extraction: the tasks our defence and healthcare clients actually need.

Deployment Architecture

We wrap llama.cpp in a Go service that provides an OpenAI-compatible API. This means client applications don’t need to know whether they’re talking to a local CPU model or a cloud GPU — the interface is identical. Swap one for the other with a config change.

If you’re running AI workloads in constrained environments, we’d like to hear about your use case.