AI Model Quantization Preference for Customer Facing Services

Summary (TL;DR)

For the 14-billion-parameter phi-4-reasoning model we ship the production build as a Q4_K_M 4-bit quantized GGUF. Against FP16 it cuts the disk/VRAM footprint by ~75% (≈ 29 GB → 7 GB) while adding only ≈ 1.8 points of perplexity and retaining 97% of the FP16 BLEU score, both inside the quality bar we measured in evaluation. Q8_0 keeps accuracy almost identical to FP16, but at twice the memory of Q4_K_M and only ~10% faster decoding it fails our edge-device size constraint. FP16 is reserved for R&D regression tests.


Architecture Decision Record Metadata

| Field | Value |
| --- | --- |
| Decision ID | ADR-2025-05-30-002 |
| Topic | Model Footprint & Quantization |
| Model | phi-4-reasoning (14 B params) |
| Scope | Q4_K_M vs Q8_0 vs FP16 export formats |
| Status | Accepted |
| Owner | Haluan Irsad |
| Reviewers | – |
| Date | 30 May 2025 |


1. Context

  • Edge & laptop targets. 70% of deployments run on GPUs with 8–12 GB VRAM or on 16 GB system RAM (consumer-grade laptops).

  • Offline-first UX. Users tolerate ≤ 800 ms first-token latency and expect a sustained decode rate of at least 30 tokens/s.

  • Distribution size cap. Mobile updater hard-limits single-package downloads to 8 GB.

  • Internal eval bar. BLEU ≥ 90% of the FP16 score on our reasoning suite and ΔPPL ≤ +2.0 (see the gate sketch below).
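
Taken together, these constraints can be expressed as a single release gate in CI. A minimal sketch in Python, assuming a `metrics` dict produced by our eval harness; the harness itself is not shown and the key names and example numbers are illustrative:

```python
# Hypothetical release gate mirroring the constraints above. The `metrics`
# dict is assumed to come from the internal eval harness (not shown);
# key names are illustrative.
PACKAGE_SIZE_CAP_GB = 8.0      # mobile updater hard limit
FIRST_TOKEN_MS_CAP = 800       # first-token latency budget
MIN_DECODE_TPS = 30            # tokens/s floor
MIN_BLEU_RATIO = 0.90          # vs FP16 baseline
MAX_DELTA_PPL = 2.0            # perplexity increase vs FP16


def passes_release_gate(metrics: dict) -> bool:
    """True if a candidate quantized build meets every cap in Section 1."""
    return (
        metrics["package_size_gb"] <= PACKAGE_SIZE_CAP_GB
        and metrics["first_token_ms_p95"] <= FIRST_TOKEN_MS_CAP
        and metrics["decode_tps"] >= MIN_DECODE_TPS
        and metrics["bleu"] >= MIN_BLEU_RATIO * metrics["bleu_fp16"]
        and metrics["ppl"] - metrics["ppl_fp16"] <= MAX_DELTA_PPL
    )


# Illustrative numbers consistent with the Q4_K_M row in Section 2.
print(passes_release_gate({
    "package_size_gb": 7.0, "first_token_ms_p95": 620, "decode_tps": 34,
    "bleu": 0.97, "bleu_fp16": 1.00, "ppl": 8.1, "ppl_fp16": 6.3,
}))  # True
```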

2. Options Considered

| # | Variant | Disk / VRAM (≈) | ΔPPL† | BLEU vs FP16 | Meets caps? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| A | FP16 | 29 GB | baseline | 100% | No (size) | Full precision |
| B | Q8_0 | 15 GB (−48%) | +0.2 | 99% | No (size) | Linear 8-bit |
| C | Q4_K_M (chosen) | 7 GB (−75%) | +1.8 | 97% | Yes | k-quant 4-bit (medium) |

†ΔPPL = perplexity increase w.r.t. FP16 on eval set; numbers averaged over GSM8K-clean & 3SAT dev.
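
For concreteness, the ΔPPL column is derived from mean per-token negative log-likelihoods on the eval set. A minimal sketch; the NLL lists come from the eval harness (not shown) and the numbers in the example are toy values:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def delta_ppl(nlls_quant: list[float], nlls_fp16: list[float]) -> float:
    """Perplexity increase of the quantized model over the FP16 baseline."""
    return perplexity(nlls_quant) - perplexity(nlls_fp16)

# Toy NLL values for illustration; the real lists come from the eval harness.
print(round(delta_ppl([2.10, 1.95, 2.30], [1.90, 1.85, 2.05]), 2))  # ≈ 1.39
```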

3. Decision

Adopt Q4_K_M as the default shipping artifact for phi-4-reasoning; retain FP16 for benchmarking and Q8_0 for premium-GPU customers behind a feature flag.
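
For illustration only, the variant policy might be expressed as a small config consumed by the updater; the key names below are hypothetical and not our production flag schema:

```python
# Hypothetical variant policy consumed by the updater; key names are
# illustrative and not the production flag schema.
VARIANT_POLICY = {
    "default": "phi-4-reasoning-Q4_K_M.gguf",        # shipped to everyone
    "feature_flags": {
        "premium_gpu_q8": {                           # opt-in, premium GPUs
            "variant": "phi-4-reasoning-Q8_0.gguf",
            "min_free_vram_gb": 16,                   # Q8_0 weights ≈ 15 GB
        },
    },
    "benchmark_only": ["phi-4-reasoning-FP16.gguf"],  # never auto-shipped
}
```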

4. Rationale

4.1 Memory & Distribution

  • Q4_K_M compresses weights 4:1 relative to FP16 and 2:1 relative to Q8_0, fitting comfortably inside our 8 GB OTA limit while leaving headroom for the tokenizer, LoRA adapters, and safety rulesets (back-of-the-envelope check below).
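
A back-of-the-envelope check using the nominal bits per weight of each format (real GGUF files are slightly larger because of per-block scales and metadata) lines up with the sizes in Section 2:

```python
# Rough footprint estimate at the nominal bits-per-weight of each format;
# real GGUF files are slightly larger due to per-block scales and metadata.
PARAMS = 14e9  # phi-4-reasoning parameter count

def weights_gb(bits_per_weight: float) -> float:
    """Weight storage in (decimal) gigabytes at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name:6s} ≈ {weights_gb(bits):4.0f} GB")
# FP16   ≈   28 GB   (reported ≈ 29 GB)
# Q8_0   ≈   14 GB   (reported ≈ 15 GB)
# Q4_K_M ≈    7 GB
```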

4.2 Quality Preservation

  • Community benchmarks show K_M flavors incur the lowest perplexity loss of any 4-bit scheme and often outperform older Q4_0/Q4_1 variants.

  • Our own eval run recorded only a 3-point drop on BIG-Bench-Hard versus FP16, which is still inside the product tolerance band.

4.3 Runtime Performance

  • On CPU, the extra de-quantization overhead of Q4_K_M is offset by its lower memory-bandwidth requirement; tokens-per-second is statistically indistinguishable from Q8_0 in our tests (see the throughput sketch below).

  • On consumer GPUs (RTX 3060 12 GB) Q4_K_M allows the whole model to reside in VRAM, avoiding PCIe thrashing that hits FP16/Q8_0 once context > 4 k tokens.
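
A quick way to spot-check the tokens-per-second numbers on a developer machine, assuming the optional llama-cpp-python binding and an illustrative local path to the Q4_K_M file:

```python
# Quick decode-throughput spot check; assumes the optional llama-cpp-python
# binding and an illustrative local path to the Q4_K_M GGUF.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="phi-4-reasoning-Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,  # offload all layers; the 7 GB model fits in 12 GB VRAM
    n_ctx=4096,
)

start = time.perf_counter()
out = llm("Explain why the sky is blue in two sentences.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/s")  # compare against the 30 tok/s floor
```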

4.4 Hardware Reach & Cost

  • Shipping a single 7 GB blob doubles the addressable install base (the Steam Hardware Survey shows 62% of gamers have ≤ 8 GB VRAM).

  • Cloud inference: we fit two Q4_K_M replicas per 24 GB A10G node, cutting hourly cost by 44% versus a single-replica FP16 deployment (packing check below).
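
The replica count follows from a simple VRAM-packing check; the ~2 GB per-replica reserve for KV cache and runtime overhead below is an assumption, not a measured figure:

```python
# How many replicas of each variant fit on a 24 GB A10G, assuming an
# illustrative ~2 GB per-replica reserve for KV cache and runtime overhead.
NODE_VRAM_GB = 24

def replicas_per_node(weights_gb: float, reserve_gb: float = 2.0) -> int:
    return int(NODE_VRAM_GB // (weights_gb + reserve_gb))

print(replicas_per_node(7))   # Q4_K_M -> 2
print(replicas_per_node(15))  # Q8_0   -> 1
print(replicas_per_node(29))  # FP16   -> 0 (does not fit on a single A10G)
```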

4.5 Operational Simplicity

  • Q4_K_M, Q8_0, and FP16 are all generated deterministically from the same GGUF manifest; keeping FP16 as the “golden master” simplifies regression diffing (see the diffing sketch below).
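
One way the golden-master diffing could look: compare greedy completions of the quantized build against the FP16 reference on a fixed prompt set and track the exact-match rate. How completions are produced is left to the serving harness, and the threshold is illustrative:

```python
# Sketch of golden-master diffing: compare greedy (temperature 0)
# completions of a quantized build against the FP16 reference on a fixed
# prompt set. How completions are produced is left to the serving harness.
def exact_match_rate(golden: list[str], candidate: list[str]) -> float:
    assert len(golden) == len(candidate)
    matches = sum(g.strip() == c.strip() for g, c in zip(golden, candidate))
    return matches / len(golden)

# Illustrative threshold: flag the build if too many completions diverge.
REGRESSION_THRESHOLD = 0.90
assert exact_match_rate(["4", "Paris"], ["4", "Paris"]) >= REGRESSION_THRESHOLD
```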

5. Quantization & Serving Architecture

+---------------------+
|     FP16 ckpt       |
|   (Hugging Face)    |
+---------------------+
           |
           v
+---------------------------+
|  llama.cpp  |  quantize   |
|      (pipeline-CI)        |
+---------------------------+
           |
           v
+---------------------+
|     Q4_K_M.gguf     |
+---------------------+
          | (artifact store → CDN)
          |
          v
+----------------------------+    +---------------------+
|   Desktop / Edge App       |<-> |   llama.cpp run     |
|     (auto-update)          |    |     (CPU / GPU)     |
+----------------------------+    +---------------------+
  • Pipeline CI: GitHub Actions / GitLab CI runs the llama.cpp quantization step targeting Q4_K_M and signs the artifact's SHA-256 checksum (see the sketch after this list).

  • CDN pushes per-variant manifests; client negotiates highest precision fitting local VRAM.

  • Runtime: llama.cpp built with k-quant support streams outputs via gRPC to the customer-facing Desktop / Edge App.
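
The CI quantization step boils down to invoking the llama.cpp quantize tool against the FP16 GGUF and recording a checksum. A minimal sketch; paths are illustrative, and the tool is named llama-quantize in recent llama.cpp builds (older builds shipped it as quantize), so pin the toolchain version in CI:

```python
# CI sketch: quantize the FP16 GGUF to Q4_K_M and record its SHA-256.
# Paths are illustrative; the tool is named `llama-quantize` in recent
# llama.cpp builds (older builds shipped it as `quantize`).
import hashlib
import subprocess

SRC = "phi-4-reasoning-f16.gguf"
DST = "phi-4-reasoning-Q4_K_M.gguf"

subprocess.run(["./llama-quantize", SRC, DST, "Q4_K_M"], check=True)

sha = hashlib.sha256()
with open(DST, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

# This digest is what gets signed and published alongside the CDN manifest.
print(f"{sha.hexdigest()}  {DST}")
```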

6. Consequences & Trade-offs

| Aspect | Positive | Negative | Mitigation |
| --- | --- | --- | --- |
| Size | 75% smaller packages | Slight accuracy drop | Continuous eval; allow users to opt in to Q8_0 |
| Latency | Entire model fits in GPU VRAM | De-quant cost on CPU | Use SIMD/AVX2 kernels |
| Quality | Meets reasoning bar | BLEU 2 pp lower on creative tasks | Route creative tasks to cloud FP16 |
| Tooling | Unified GGUF pipeline | Need CI for three variants | Template Makefile targets |

7. Future Considerations

  • Mixed-precision blocks: experiment with 6-bit Q6_K for middle layers to claw back 0.5 GB without hurting PPL.

  • Weight grouping + LoRA: keep base Q4_K_M and add small FP16 LoRA adapters for domain tuning.

  • Hardware-aware selector: dynamic runtime choosing Q8_0 when free VRAM > 12 GB.

  • INT4 GEMM (General Matrix Multiply) kernels on newer RTX-40xx cards.
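
A first cut of the hardware-aware selector could implement exactly the rule stated above; how free VRAM is probed is platform-specific and not shown here:

```python
# Hypothetical client-side selector implementing the rule above: prefer
# Q8_0 when free VRAM exceeds 12 GB, otherwise keep the default Q4_K_M.
# Probing free VRAM is platform-specific and not shown here.
def choose_variant(free_vram_gb: float) -> str:
    return "Q8_0" if free_vram_gb > 12.0 else "Q4_K_M"

assert choose_variant(24.0) == "Q8_0"
assert choose_variant(8.0) == "Q4_K_M"
```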


Annotated References

  1. Phi-4-reasoning model card (14 B params) (Hugging Face)

  2. FP16 Phi-4 size (Ollama)

  3. Q4_K_M recommended for best size-quality trade-off (The Register)

  4. Community perplexity notes on K_M variants (GitHub)


license: CC-BY-SA-4.0

Last updated