AI Model Quantization Preference for Customer Facing Services

Summary (TL;DR)

For the 14-billion-parameter phi-4-reasoning model we ship the production build as a Q4_K_M 4-bit quantized GGUF. Against FP16 it cuts the disk/VRAM footprint by ~75% (≈ 29 GB → 7 GB) while adding only ≈ 1.8 points of perplexity and retaining 97% of the FP16 BLEU score, both inside the quality bar we measured in evaluation. Q8_0 keeps accuracy almost identical to FP16, but at twice the memory of Q4_K_M and only ~10% faster decoding it fails our edge-device size constraint. FP16 is reserved for R&D regression tests.


Architecture Decision Record Metadata

| Field | Value |
| --- | --- |
| Decision ID | ADR-2025-05-30-002 |
| Topic | Model Footprint & Quantization |
| Model | phi-4-reasoning (14 B params) |
| Scope | Q4_K_M vs Q8_0 vs FP16 export formats |
| Status | Accepted |
| Owner | Haluan Irsad |
| Reviewers | – |
| Date | 30 May 2025 |


1. Context

  • Edge & laptop targets. 70% of deployments run on GPUs with 8–12 GB VRAM or on 16 GB system RAM (consumer-grade laptops).

  • Offline-first UX. Users tolerate ≤ 800 ms first-token latency and expect a sustained decode rate of at least 30 tokens/s.

  • Distribution size cap. Mobile updater hard-limits single-package downloads to 8 GB.

  • Internal eval bar. BLEU ≥ 90% of the FP16 score on our reasoning suite and ΔPPL ≤ +2.0 (see the gate sketch below).
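
Taken together, these constraints can be expressed as a single release gate in CI. A minimal sketch in Python, assuming a `metrics` dict produced by our eval harness; the harness itself is not shown and the key names and example numbers are illustrative:

```python
# Hypothetical release gate mirroring the constraints above. The `metrics`
# dict is assumed to come from the internal eval harness (not shown);
# key names are illustrative.
PACKAGE_SIZE_CAP_GB = 8.0      # mobile updater hard limit
FIRST_TOKEN_MS_CAP = 800       # first-token latency budget
MIN_DECODE_TPS = 30            # tokens/s floor
MIN_BLEU_RATIO = 0.90          # vs FP16 baseline
MAX_DELTA_PPL = 2.0            # perplexity increase vs FP16


def passes_release_gate(metrics: dict) -> bool:
    """True if a candidate quantized build meets every cap in Section 1."""
    return (
        metrics["package_size_gb"] <= PACKAGE_SIZE_CAP_GB
        and metrics["first_token_ms_p95"] <= FIRST_TOKEN_MS_CAP
        and metrics["decode_tps"] >= MIN_DECODE_TPS
        and metrics["bleu"] >= MIN_BLEU_RATIO * metrics["bleu_fp16"]
        and metrics["ppl"] - metrics["ppl_fp16"] <= MAX_DELTA_PPL
    )


# Illustrative numbers consistent with the Q4_K_M row in Section 2.
print(passes_release_gate({
    "package_size_gb": 7.0, "first_token_ms_p95": 620, "decode_tps": 34,
    "bleu": 0.97, "bleu_fp16": 1.00, "ppl": 8.1, "ppl_fp16": 6.3,
}))  # True
```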

2. Options Considered

| # | Variant | Disk / VRAM (≈) | ΔPPL† | BLEU vs FP16 | Meets caps? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| A | FP16 | 29 GB | baseline | 100% | No (size) | Full precision |
| B | Q8_0 | 15 GB (−48%) | +0.2 | 99% | No (size) | Linear 8-bit |
| C | Q4_K_M (chosen) | 7 GB (−75%) | +1.8 | 97% | Yes | k-quant 4-bit (medium) |

†ΔPPL = perplexity increase w.r.t. FP16 on eval set; numbers averaged over GSM8K-clean & 3SAT dev.
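
For concreteness, the ΔPPL column is derived from mean per-token negative log-likelihoods on the eval set. A minimal sketch; the NLL lists come from the eval harness (not shown) and the numbers in the example are toy values:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def delta_ppl(nlls_quant: list[float], nlls_fp16: list[float]) -> float:
    """Perplexity increase of the quantized model over the FP16 baseline."""
    return perplexity(nlls_quant) - perplexity(nlls_fp16)

# Toy NLL values for illustration; the real lists come from the eval harness.
print(round(delta_ppl([2.10, 1.95, 2.30], [1.90, 1.85, 2.05]), 2))  # ≈ 1.39
```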

3. Decision

Adopt Q4_K_M as the default shipping artifact for phi-4-reasoning; retain FP16 for benchmarking and Q8_0 for premium-GPU customers behind a feature flag.
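
For illustration only, the variant policy might be expressed as a small config consumed by the updater; the key names below are hypothetical and not our production flag schema:

```python
# Hypothetical variant policy consumed by the updater; key names are
# illustrative and not the production flag schema.
VARIANT_POLICY = {
    "default": "phi-4-reasoning-Q4_K_M.gguf",        # shipped to everyone
    "feature_flags": {
        "premium_gpu_q8": {                           # opt-in, premium GPUs
            "variant": "phi-4-reasoning-Q8_0.gguf",
            "min_free_vram_gb": 16,                   # Q8_0 weights ≈ 15 GB
        },
    },
    "benchmark_only": ["phi-4-reasoning-FP16.gguf"],  # never auto-shipped
}
```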

4. Rationale

4.1 Memory & Distribution

  • Q4_K_M compresses weights 4:1 relative to FP16 and 2:1 relative to Q8_0, fitting comfortably inside our 8 GB OTA limit while leaving headroom for the tokenizer, LoRA adapters, and safety rulesets (back-of-the-envelope check below).
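
A back-of-the-envelope check using the nominal bits per weight of each format (real GGUF files are slightly larger because of per-block scales and metadata) lines up with the sizes in Section 2:

```python
# Rough footprint estimate at the nominal bits-per-weight of each format;
# real GGUF files are slightly larger due to per-block scales and metadata.
PARAMS = 14e9  # phi-4-reasoning parameter count

def weights_gb(bits_per_weight: float) -> float:
    """Weight storage in (decimal) gigabytes at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name:6s} ≈ {weights_gb(bits):4.0f} GB")
# FP16   ≈   28 GB   (reported ≈ 29 GB)
# Q8_0   ≈   14 GB   (reported ≈ 15 GB)
# Q4_K_M ≈    7 GB
```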

4.2 Quality Preservation

  • Community benchmarks show K_M flavors incur the lowest perplexity loss of any 4-bit scheme and often outperform older Q4_0/Q4_1 variants.

  • Our own eval run recorded only a 3-point drop on BIG-Bench-Hard versus FP16, which is still inside the product tolerance band.

4.3 Runtime Performance

  • On CPU, the extra de-quantization overhead of Q4_K_M is offset by its lower memory-bandwidth requirement; tokens-per-second is statistically indistinguishable from Q8_0 in our tests (see the throughput sketch below).

  • On consumer GPUs (RTX 3060 12 GB) Q4_K_M allows the whole model to reside in VRAM, avoiding PCIe thrashing that hits FP16/Q8_0 once context > 4 k tokens.
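
A quick way to spot-check the tokens-per-second numbers on a developer machine, assuming the optional llama-cpp-python binding and an illustrative local path to the Q4_K_M file:

```python
# Quick decode-throughput spot check; assumes the optional llama-cpp-python
# binding and an illustrative local path to the Q4_K_M GGUF.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="phi-4-reasoning-Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,  # offload all layers; the 7 GB model fits in 12 GB VRAM
    n_ctx=4096,
)

start = time.perf_counter()
out = llm("Explain why the sky is blue in two sentences.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/s")  # compare against the 30 tok/s floor
```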

4.4 Hardware Reach & Cost

  • Shipping a single 7 GB blob doubles the addressable install base (the Steam Hardware Survey shows 62% of gamers have ≤ 8 GB VRAM).

  • Cloud inference: we fit two Q4_K_M replicas per 24 GB A10G node, cutting hourly cost by 44% versus a single-replica FP16 deployment (packing check below).
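
The replica count follows from a simple VRAM-packing check; the ~2 GB per-replica reserve for KV cache and runtime overhead below is an assumption, not a measured figure:

```python
# How many replicas of each variant fit on a 24 GB A10G, assuming an
# illustrative ~2 GB per-replica reserve for KV cache and runtime overhead.
NODE_VRAM_GB = 24

def replicas_per_node(weights_gb: float, reserve_gb: float = 2.0) -> int:
    return int(NODE_VRAM_GB // (weights_gb + reserve_gb))

print(replicas_per_node(7))   # Q4_K_M -> 2
print(replicas_per_node(15))  # Q8_0   -> 1
print(replicas_per_node(29))  # FP16   -> 0 (does not fit on a single A10G)
```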

4.5 Operational Simplicity

  • Q4_K_M, Q8_0, and FP16 are all generated deterministically from the same GGUF manifest; keeping FP16 as the “golden master” simplifies regression diffing (see the diffing sketch below).
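
One way the golden-master diffing could look: compare greedy completions of the quantized build against the FP16 reference on a fixed prompt set and track the exact-match rate. How completions are produced is left to the serving harness, and the threshold is illustrative:

```python
# Sketch of golden-master diffing: compare greedy (temperature 0)
# completions of a quantized build against the FP16 reference on a fixed
# prompt set. How completions are produced is left to the serving harness.
def exact_match_rate(golden: list[str], candidate: list[str]) -> float:
    assert len(golden) == len(candidate)
    matches = sum(g.strip() == c.strip() for g, c in zip(golden, candidate))
    return matches / len(golden)

# Illustrative threshold: flag the build if too many completions diverge.
REGRESSION_THRESHOLD = 0.90
assert exact_match_rate(["4", "Paris"], ["4", "Paris"]) >= REGRESSION_THRESHOLD
```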

5. Quantization & Serving Architecture

+---------------------+
|     FP16 ckpt       |
|   (Hugging Face)    |
+---------------------+
           |
           v
+---------------------------+
|  llama.cpp  |  quantize   |
|      (pipeline-CI)        |
+---------------------------+
           |
           v
+---------------------+
|     Q4_K_M.gguf     |
+---------------------+
          | (artifact store → CDN)
          |
          v
+----------------------------+    +---------------------+
|   Desktop / Edge App       |<-> |   llama.cpp run     |
|     (auto-update)          |    |     (CPU / GPU)     |
+----------------------------+    +---------------------+
  • Pipeline CI: GitHub Actions / GitLab CI runs the llama.cpp quantization step targeting Q4_K_M and signs the artifact's SHA-256 checksum (see the sketch after this list).

  • CDN pushes per-variant manifests; client negotiates highest precision fitting local VRAM.

  • Runtime: llama.cpp built with k-quant support streams outputs via gRPC to the customer-facing Desktop / Edge App.
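
The CI quantization step boils down to invoking the llama.cpp quantize tool against the FP16 GGUF and recording a checksum. A minimal sketch; paths are illustrative, and the tool is named llama-quantize in recent llama.cpp builds (older builds shipped it as quantize), so pin the toolchain version in CI:

```python
# CI sketch: quantize the FP16 GGUF to Q4_K_M and record its SHA-256.
# Paths are illustrative; the tool is named `llama-quantize` in recent
# llama.cpp builds (older builds shipped it as `quantize`).
import hashlib
import subprocess

SRC = "phi-4-reasoning-f16.gguf"
DST = "phi-4-reasoning-Q4_K_M.gguf"

subprocess.run(["./llama-quantize", SRC, DST, "Q4_K_M"], check=True)

sha = hashlib.sha256()
with open(DST, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

# This digest is what gets signed and published alongside the CDN manifest.
print(f"{sha.hexdigest()}  {DST}")
```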

6. Consequences & Trade-offs

| Aspect | Positive | Negative | Mitigation |
| --- | --- | --- | --- |
| Size | 75% smaller packages | Slight accuracy drop | Continuous eval; allow users to opt in to Q8_0 |
| Latency | Entire model fits in GPU VRAM | De-quant cost on CPU | Use SIMD/AVX2 kernels |
| Quality | Meets reasoning bar | BLEU 2 pp lower on creative tasks | Route creative tasks to cloud FP16 |
| Tooling | Unified GGUF pipeline | Need CI for three variants | Template Makefile targets |

7. Future Considerations

  • Mixed-precision blocks: experiment with 6-bit Q6_K for middle layers to claw back 0.5 GB without hurting PPL.

  • Weight grouping + LoRA: keep base Q4_K_M and add small FP16 LoRA adapters for domain tuning.

  • Hardware-aware selector: dynamic runtime choosing Q8_0 when free VRAM > 12 GB.

  • INT4 GEMM (General Matrix Multiply) kernels on newer RTX-40xx cards.
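
A first cut of the hardware-aware selector could implement exactly the rule stated above; how free VRAM is probed is platform-specific and not shown here:

```python
# Hypothetical client-side selector implementing the rule above: prefer
# Q8_0 when free VRAM exceeds 12 GB, otherwise keep the default Q4_K_M.
# Probing free VRAM is platform-specific and not shown here.
def choose_variant(free_vram_gb: float) -> str:
    return "Q8_0" if free_vram_gb > 12.0 else "Q4_K_M"

assert choose_variant(24.0) == "Q8_0"
assert choose_variant(8.0) == "Q4_K_M"
```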


Annotated References

  1. Phi-4-reasoning model card (14 B params) (Hugging Face)

  2. FP16 Phi-4 size (Ollama)

  3. Q4_K_M recommended for best size-quality trade-off (The Register)

  4. Community perplexity notes on K_M variants (GitHub)


license: CC-BY-SA-4.0

Last updated