AI Model Quantization Preference for Customer-Facing Services
Summary (TL;DR)
For the 14-billion-parameter phi-4-reasoning model we elect to ship the production build as a Q4_K_M 4-bit quantised GGUF. Against FP16 it cuts the disk/VRAM footprint by ~75% (≈ 29 → 7 GB) while sacrificing only ≈ 1–2% perplexity and staying within the quality bar we measured in evaluation. Q8_0 keeps accuracy almost identical to FP16, but at twice the memory of Q4_K_M and only ~10% faster decoding, which fails our edge-device constraint. FP16 is reserved for R&D regression tests.
Architecture Decision Record Metadata
Decision ID: ADR-2025-05-30-002
Topic: Model Footprint & Quantisation
Model: phi-4-reasoning (14 B params)
Scope: Q4_K_M vs Q8_0 vs FP16 export formats
Status: Accepted
Owner: Haluan Irsad
Reviewers: -
Date: 30 May 2025
1. Context
Edge & laptop targets. 70% of deployments run on GPUs with 8–12 GB VRAM or on 16 GB system RAM (consumer-grade laptops).
Offline-first UX. Users tolerate first-token latency of up to 800 ms and decode speeds down to 30 tokens/s.
Distribution size cap. The mobile updater hard-limits single-package downloads to 8 GB.
Internal eval bar. BLEU ≥ 90% of FP16 on our reasoning suite and ΔPPL ≤ +2.0.
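To keep the bar mechanical rather than a judgement call, it can be expressed as a single CI predicate. The sketch below is illustrative only: the thresholds come from the bullet above, while the function name and the example scores are hypothetical.

```python
# Minimal sketch of the internal eval bar as a CI gate.
# Thresholds are from this ADR; function name and example scores are hypothetical.

def passes_eval_bar(candidate_bleu: float, fp16_bleu: float, delta_ppl: float) -> bool:
    """True if a quantised build meets the shipping bar:
    BLEU >= 90% of the FP16 score and perplexity increase <= +2.0."""
    return candidate_bleu >= 0.90 * fp16_bleu and delta_ppl <= 2.0

# Example with made-up BLEU scores matching the 97%-of-FP16 figure from the options table:
print(passes_eval_bar(candidate_bleu=38.8, fp16_bleu=40.0, delta_ppl=1.8))  # True
```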
2. Options Considered
| Option | Format | Size | ΔPPL† | Eval score vs FP16 | Ship? | Notes |
|---|---|---|---|---|---|---|
| A | FP16 | 29 GB | baseline | 100% | No (size) | Full precision |
| B | Q8_0 | 15 GB (−48%) | +0.2 | 99% | No (size) | Linear 8-bit |
| C | Q4_K_M (chosen) | 7 GB (−75%) | +1.8 | 97% | Yes | K-means 4-bit (medium) |
†ΔPPL = perplexity increase w.r.t. FP16 on eval set; numbers averaged over GSM8K-clean & 3SAT dev.
3. Decision
Adopt Q4_K_M as the default shipping artifact for phi-4-reasoning; retain FP16 for benchmarking and Q8_0 for premium-GPU customers behind a feature flag.
4. Rationale
4.1 Memory & Distribution
Q4_K_M compresses weights 4:1 relative to FP16 and 2:1 relative to Q8_0, fitting comfortably inside our 8 GB OTA limit while leaving head-room for tokenizer, LoRA adapters, and safety rulesets.
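The footprint figures follow from simple bits-per-weight arithmetic; the sketch below reproduces them under rounded assumptions (≈4 effective bits/weight for Q4_K_M, 8.5 for Q8_0, block-scale and file metadata overhead ignored).

```python
# Rough GGUF footprint estimate: parameters x bits-per-weight,
# ignoring per-block scales and file metadata (a few percent in practice).
PARAMS = 14e9  # phi-4-reasoning
OTA_CAP_GB = 8.0

def approx_size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.0)]:
    size = approx_size_gb(bpw)
    verdict = "fits" if size <= OTA_CAP_GB else "exceeds"
    print(f"{name:7s} ~{size:4.1f} GB -> {verdict} the {OTA_CAP_GB:.0f} GB OTA cap")
```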
4.2 Quality Preservation
Community benchmarks show K_M flavors incur the lowest perplexity loss of any 4-bit scheme and often outperform older Q4_0/Q4_1 variants.
Our own eval run recorded only a 3-point drop on BIG-Bench-Hard versus FP16, which is still inside the product tolerance band.
4.3 Runtime Performance
On CPU the extra de-quant overhead of Q4_K_M is offset by reduced memory bandwidth; tokens-per-second is statistically equal to Q8_0 in our tests.
On consumer GPUs (RTX 3060 12 GB) Q4_K_M allows the whole model to reside in VRAM, avoiding PCIe thrashing that hits FP16/Q8_0 once context > 4 k tokens.
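For anyone reproducing the throughput comparison locally, a minimal measurement loop is sketched below. It assumes the llama-cpp-python bindings are installed and uses a hypothetical local path to the Q4_K_M artifact; our internal harness is more elaborate.

```python
# Quick decode-throughput check for a local GGUF build (sketch, not the internal harness).
import time
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="phi-4-reasoning-Q4_K_M.gguf",  # hypothetical local artifact path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers; the whole Q4_K_M model fits in 12 GB VRAM
)

start = time.perf_counter()
out = llm("Explain step by step why 17 is prime.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tok/s")
```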
4.4 Hardware Reach & Cost
Shipping a single 7 GB blob doubles the addressable install base (the Steam Hardware Survey shows 62% of gamers have ≤ 8 GB VRAM).
Cloud inference: we fit two Q4_K_M replicas per 24 GB A10G node, cutting hourly cost 44% versus single-replica FP16.
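The two-replicas-per-node claim is straightforward budget arithmetic; the sketch below uses the 7 GB weights figure from the options table, while the KV-cache and runtime-overhead numbers are assumed round values rather than measurements.

```python
# Can two Q4_K_M replicas share one 24 GB A10G? All numbers except the
# 7 GB weights figure are rounded assumptions, not measured values.
GPU_VRAM_GB = 24.0
WEIGHTS_GB = 7.0           # Q4_K_M artifact
KV_CACHE_GB = 3.0          # assumed per-replica budget for a 4k context
RUNTIME_OVERHEAD_GB = 1.5  # assumed CUDA context, scratch buffers, etc.

per_replica = WEIGHTS_GB + KV_CACHE_GB + RUNTIME_OVERHEAD_GB
replicas = int(GPU_VRAM_GB // per_replica)
print(f"{per_replica:.1f} GB per replica -> {replicas} replicas per node")  # 2
```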
4.5 Operational Simplicity
Q4_K_M, Q8_0 and FP16 all generate deterministically from identical GGUF manifests; keeping FP16 as “golden master” simplifies regression diffing.
5. Quantization & Serving Architecture
+---------------------+
|     FP16 ckpt       |
|   (Hugging Face)    |
+---------------------+
           |
           v
+---------------------------+
|    llama.cpp quantize     |
|       (pipeline-CI)       |
+---------------------------+
           |
           v
+---------------------+
|    Q4_K_M.gguf      |
+---------------------+
           |  (artifact store → CDN)
           v
+----------------------------+        +---------------------+
|    Desktop / Edge App      | <----> |   llama.cpp run     |
|      (auto-update)         |        |    (CPU / GPU)      |
+----------------------------+        +---------------------+

Pipeline CI: the GitHub/GitLab CI pipeline runs quantize -m Q4_K_M and signs the resulting artifact with SHA-256. The CDN pushes per-variant manifests; the client negotiates the highest precision that fits local VRAM.
Runtime: llama.cpp compiled with GGML_Q4_K_M support streams outputs via gRPC to the customer-facing Desktop / Edge App.
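A minimal sketch of the quantize-and-sign CI step follows. It assumes a llama-quantize binary built from llama.cpp is on PATH and that an FP16 GGUF has already been produced; the paths and the manifest layout are hypothetical, not our actual pipeline definition.

```python
# Sketch of the pipeline-CI quantize + checksum step (paths and manifest layout hypothetical).
import hashlib
import json
import subprocess
from pathlib import Path

FP16_GGUF = Path("artifacts/phi-4-reasoning-f16.gguf")
OUT_GGUF = Path("artifacts/phi-4-reasoning-Q4_K_M.gguf")

# llama.cpp's quantization tool: llama-quantize <input.gguf> <output.gguf> <type>
subprocess.run(["llama-quantize", str(FP16_GGUF), str(OUT_GGUF), "Q4_K_M"], check=True)

# Stream-hash the artifact so the per-variant CDN manifest can be verified client-side.
digest = hashlib.sha256()
with OUT_GGUF.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)

manifest = {
    "variant": "Q4_K_M",
    "file": OUT_GGUF.name,
    "sha256": digest.hexdigest(),
    "size_bytes": OUT_GGUF.stat().st_size,
}
Path("artifacts/manifest-Q4_K_M.json").write_text(json.dumps(manifest, indent=2))
print(f"wrote manifest for {OUT_GGUF.name}: sha256={digest.hexdigest()[:12]}")
```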
6. Consequences & Trade-offs
| Area | Benefit | Trade-off | Mitigation |
|---|---|---|---|
| Size | 75% smaller packages | Slight accuracy drop | Continuous eval; allow users to opt in to Q8_0 |
| Latency | Fits entire model in GPU VRAM | De-quant cost on CPU | Use SIMD/AVX2 kernels |
| Quality | Meets reasoning bar | BLEU ≈2 pp lower on creative tasks | Route creative tasks to cloud FP16 |
| Tooling | Unified GGUF pipeline | Need CI for three variants | Template Makefile targets |
7. Future Considerations
Mixed-precision blocks: experiment with 6-bit Q6_K for middle layers to claw back 0.5 GB without hurting PPL.
Weight grouping + LoRA: keep base Q4_K_M and add small FP16 LoRA adapters for domain tuning.
Hardware-aware selector: dynamic runtime choosing Q8_0 when free VRAM > 12 GB (a minimal selector sketch follows this list).
INT4 GEMM (General Matrix Multiply) kernels on newer RTX-40xx cards.
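A minimal sketch of what the hardware-aware selector could look like is below; the 12 GB threshold mirrors the bullet above, the VRAM probe assumes nvidia-smi is available, and the function names are hypothetical.

```python
# Hypothetical runtime precision selector: prefer Q8_0 when free VRAM
# comfortably exceeds 12 GB, otherwise fall back to the shipped Q4_K_M default.
import subprocess

def free_vram_gb() -> float:
    """Probe free VRAM via nvidia-smi; returns 0.0 if no NVIDIA GPU is available."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return max(float(line) for line in out.splitlines() if line.strip()) / 1024
    except (OSError, subprocess.CalledProcessError, ValueError):
        return 0.0

def pick_variant(vram_gb: float) -> str:
    return "Q8_0" if vram_gb > 12.0 else "Q4_K_M"

print(pick_variant(free_vram_gb()))  # e.g. Q8_0 on a 24 GB card, Q4_K_M on an RTX 3060
```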
Annotated References
Phi-4-reasoning model card (14 B params) (Hugging Face)
FP16 Phi-4 size (Ollama)
Q4_K_M recommended for best size-quality trade-off (The Register)
Community perplexity notes on K_M variants (GitHub)
license: CC-BY-SA-4.0