Chunking Strategy × Freshness Window for RAG
Summary (TL;DR)
For our multilingual knowledge assistant we will index content with a hybrid semantic + recursive chunking pipeline and collection-specific sliding freshness windows. Paired with granite-embedding-278M for dense retrieval and phi-4-reasoning-14B-fp16 for generation, this configuration lifts answer F1 by 4 points over fixed-size baselines, trims the vector store by 35%, and guarantees that fast-moving collections (e.g. news, chat) never serve content older than seven days. Evergreen corporate-policy content remains searchable indefinitely. The decision log below records the context, alternatives, rationale, and consequences.
Architecture Decision Record Metadata
Decision ID: ADR-2025-05-27-003
Topic: Retrieval-Augmented Generation (RAG) Integration
Scope: Chunking strategy × Freshness window
Embedding Model: granite-embedding-278M
Foundation Model: phi-4-reasoning-14B-fp16
Status: Accepted
Owner: Haluan Mohammad Irsad
Reviewers: -
Date: 27 May 2025
1. Context
Workloads. Internal Q & A, policy lookup, incident retros, and customer-chat triage across 12 languages.
Quality bar. ≥ 92% exact-match on held-out FAQ set; ≤ 1.5 s retrieval p95.
Data velocity.
News / Chat: changes within hours; irrelevant after one week.
HR / Finance: updated monthly; retain 90 days.
Policies / Manuals: long-tail evergreen, versioned.
Hardware. Vector DB on 3 × r6g.4xlarge; GPU inference on A10G instances; index growth must stay < 400 GB for the budget year.
2. Options Considered
| Option | Chunking strategy | Freshness policy | Index size▲ | Exact-match | p95 latency | Adopted? | Notes |
|---|---|---|---|---|---|---|---|
| A | Fixed 1 k tokens, no overlap | None | 100% | 84% | 1.1 s | No (quality) | Sparse context gaps |
| B | Fixed 512 tokens, 20% overlap | Global 30-day TTL | 155% | 88% | 1.3 s | No (size) | Index inflation |
| C | Semantic + recursive 128–384 tok (chosen) | 7 / 90 / ∞-day sliding | 65% | 92% | 1.2 s | Yes | Cohesive chunks |
| D | On-the-fly doc retrieval (no chunking) | Query-time filter | N/A | 91% | 3.4 s | No (latency) | Full-doc re-embed |

▲ Relative to Option A size.
3. Decision
Adopt Option C: semantic boundary detection (sentence tokenizer → embedding-similarity splitter with a 0.64 cosine-similarity threshold, recursive fallback to 256-character chunks) and per-collection rolling freshness windows (7 days for news/chat, 90 days for HR/finance, indefinite for policies). Documents that fall outside their window are tombstoned in the vector store but preserved in cold object storage.
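For concreteness, here is a minimal sketch of the splitter logic in plain Python (the production pipeline composes LlamaIndex splitters, per section 4.5). `embed` and `count_tokens` are assumed helpers standing in for the granite-embedding-278M client and the model tokenizer.

```python
from typing import Callable, List

import numpy as np

SIM_THRESHOLD = 0.64   # cosine-similarity breakpoint from the decision above
MAX_TOKENS = 384       # upper bound per chunk
FALLBACK_CHARS = 256   # recursive character fallback size


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def recursive_char_split(text: str, size: int = FALLBACK_CHARS) -> List[str]:
    """Fallback: cut at the last space before `size` until every piece fits."""
    if len(text) <= size:
        return [text]
    cut = text.rfind(" ", 0, size)
    if cut <= 0:                      # no space found -> hard cut
        cut = size
    return [text[:cut]] + recursive_char_split(text[cut:].lstrip(), size)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], np.ndarray],    # assumed: granite-embedding-278M client
    count_tokens: Callable[[str], int],    # assumed: model tokenizer
) -> List[str]:
    """Group consecutive sentences while they stay semantically cohesive."""
    chunks: List[str] = []
    current: List[str] = []
    prev_vec = None
    for sent in sentences:
        vec = embed(sent)
        boundary = (
            prev_vec is not None and cosine(prev_vec, vec) < SIM_THRESHOLD
        ) or count_tokens(" ".join(current + [sent])) > MAX_TOKENS
        if boundary and current:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    # Recursive character fallback for any chunk that is still too long.
    out: List[str] = []
    for c in chunks:
        out.extend(recursive_char_split(c) if count_tokens(c) > MAX_TOKENS else [c])
    return out
```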
4. Rationale
4.1 Retrieval Quality
Semantic chunks preserve discourse units, and overlap is added only where inter-sentence similarity falls below the threshold, avoiding context dilution; this added 4 F1 points over fixed-size windows.
4.2 Index Size & Cost
An average chunk length of 220 tokens (vs. 512) shrinks the vector count by 35% while maintaining coverage; the r6g.4xlarge fleet drops from 4 to 3 nodes.
4.3 Freshness & Compliance
Sliding windows enforce "right-to-be-forgotten" retention for sensitive chat and HR streams without re-ingesting the whole corpus.
Policy docs remain indexed indefinitely, version-tagged in metadata for traceability.
4.4 Multilingual Fit
granite-embedding-278M supports 12 languages natively and maintains cosine recall parity across EN/ID/FR corpora.
4.5 Operational Complexity
LlamaIndex SemanticSplitter + RecursiveCharacterTextSplitter compose cleanly; freshness windows are enforced via vector-store metadata filters and nightly TTL sweeps.
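As an illustration of the query-time side of freshness enforcement, the following pgvector-style sketch restricts a top-k search to a collection's window. The table and column names (rag_chunks, collection, ts, embedding) are assumptions, not the actual schema.

```python
from datetime import datetime, timedelta, timezone

# Per-collection windows from the decision above; None means evergreen.
FRESHNESS_DAYS = {"news": 7, "chat": 7, "hr": 90, "finance": 90, "policy": None}


def freshness_filtered_query(collection: str, query_vec: list[float], k: int = 8):
    """Build a pgvector top-k query restricted to the collection's freshness window."""
    sql = "SELECT chunk_id, text, ts FROM rag_chunks WHERE collection = %s "
    params: list = [collection]
    days = FRESHNESS_DAYS[collection]
    if days is not None:                       # evergreen collections skip the cutoff
        sql += "AND ts >= %s "
        params.append(datetime.now(timezone.utc) - timedelta(days=days))
    sql += "ORDER BY embedding <=> %s::vector LIMIT %s;"
    params += ["[" + ",".join(str(x) for x in query_vec) + "]", k]
    return sql, params

# usage (psycopg cursor assumed): cur.execute(*freshness_filtered_query("news", q_vec))
```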
5. RAG Architecture

Ingest Worker: Scrapes or receives new documents, tags them with collection and timestamp, then drops each file onto a Kafka-backed ingest queue.
Semantic + Recursive Chunker: Reads from the queue; splits documents at semantic sentence boundaries (cos-sim < 0.64) and, if still too long, recursively down to ≤ 384 tokens. Outputs {chunk, embedding, meta} JSON.
Vector Store (Pinecone / pgvector): Persists each chunk's 768-d Granite embedding plus metadata. Supports recency-decay scoring and metadata filters for collection and language.
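An illustrative shape for one {chunk, embedding, meta} record is shown below; the exact field names are assumptions based on the description above.

```python
import uuid
from datetime import datetime, timezone

# Illustrative chunker output record; field names are assumptions.
record = {
    "chunk_id": str(uuid.uuid4()),
    "chunk": "Employees may carry over at most five unused leave days ...",
    "embedding": [0.0132, -0.0871] + [0.0] * 766,  # 768-d granite embedding (dummy values)
    "meta": {
        "collection": "policy",        # drives the freshness window (7 d / 90 d / ∞)
        "language": "en",
        "source_doc": "hr-handbook-v7.pdf",
        "version": "v7",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    },
}
```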
TTL Sweeper: Nightly cron job that tombstones vectors older than their collection's freshness window (7 d news/chat, 90 d HR, ∞ policies) and archives them to cold S3.
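A sketch of the nightly sweep, again assuming a pgvector-style backend with the hypothetical rag_chunks table, a tombstoned flag, and an s3_put archiver helper.

```python
from datetime import datetime, timedelta, timezone

WINDOWS = {"news": 7, "chat": 7, "hr": 90, "finance": 90}   # policies never expire


def nightly_sweep(conn, s3_put) -> None:
    """conn: DB connection; s3_put(key, rows): archiver -- both assumed helpers."""
    now = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        for collection, days in WINDOWS.items():
            cutoff = now - timedelta(days=days)
            cur.execute(
                "SELECT chunk_id, text, meta FROM rag_chunks "
                "WHERE collection = %s AND ts < %s AND NOT tombstoned",
                (collection, cutoff),
            )
            expired = cur.fetchall()
            if not expired:
                continue
            # Archive to cold object storage before tombstoning.
            s3_put(f"rag-archive/{collection}/{now.date()}.jsonl", expired)
            cur.execute(
                "UPDATE rag_chunks SET tombstoned = TRUE "
                "WHERE collection = %s AND ts < %s",
                (collection, cutoff),
            )
    conn.commit()
```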
Retriever API: Embeds the user query with granite-embedding-278M, executes a top-k similarity search with recency-decay scoring (sim × e^{-λΔdays}), and returns the chunks plus metadata.
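The recency-decay re-ranking can be sketched as follows; the starting λ value is an assumption, to be tuned by the feedback analytics described below.

```python
import math
from datetime import datetime, timezone

LAMBDA = 0.1  # decay rate per day (assumed starting value)


def decayed_score(similarity: float, chunk_ts: datetime) -> float:
    """score = sim * e^(-lambda * age_in_days)"""
    age_days = (datetime.now(timezone.utc) - chunk_ts).total_seconds() / 86_400
    return similarity * math.exp(-LAMBDA * age_days)


def rerank(candidates, k: int = 8):
    """candidates: iterable of (similarity, timestamp, chunk) tuples."""
    scored = [(decayed_score(sim, ts), chunk) for sim, ts, chunk in candidates]
    return sorted(scored, key=lambda x: x[0], reverse=True)[:k]
```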
Generation Service: Prompts phi-4-reasoning-14B-fp16 with system instructions, retrieved chunks, and the user question; streams the answer back through gRPC, meeting the ≤ 1.5 s p95 end-to-end latency target.
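A hedged sketch of prompt assembly for this step; the system-instruction wording and the generate_stream() client are assumptions, not the production code.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Concatenate retrieved chunks (with provenance) ahead of the user question."""
    context = "\n\n".join(
        f"[{c['meta']['collection']} | {c['meta']['timestamp']}]\n{c['chunk']}"
        for c in chunks
    )
    return (
        "You are an internal knowledge assistant. Answer only from the context "
        "below and cite the collection each fact comes from.\n\n"
        f"### Context\n{context}\n\n### Question\n{question}\n\n### Answer\n"
    )

# usage (assumed streaming client):
# for token in generate_stream(model="phi-4-reasoning-14b-fp16",
#                              prompt=build_prompt(question, top_chunks)):
#     yield token   # streamed back over gRPC
```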
Feedback Logger: Captures query_id, selected chunks, model answer, and user thumbs-up/down; feeds nightly analytics that adjust λ or flag stale content for re-indexing.
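As a sketch of how the nightly analytics might adjust λ from thumbs-up/down signals; the age cutoff and thresholds are illustrative assumptions, not measured values.

```python
def adjust_lambda(events: list[dict], current_lambda: float) -> float:
    """events: feedback records with max_chunk_age_days and thumbs_up fields (assumed)."""
    stale = [e for e in events if e["max_chunk_age_days"] > 3]
    if not stale:
        return current_lambda
    downvote_rate = sum(not e["thumbs_up"] for e in stale) / len(stale)
    if downvote_rate > 0.30:          # stale results are hurting answer quality
        return current_lambda * 1.2   # decay older content faster
    if downvote_rate < 0.10:
        return current_lambda * 0.9   # relax decay, recall more history
    return current_lambda
```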
6. Consequences & Trade-offs
| Dimension | Benefit | Trade-off | Mitigation |
|---|---|---|---|
| Quality | Highest F1, natural boundaries | Extra pre-proc pass | Chunker runs in async ingest queue |
| Latency | Still ≤ 1.5 s p95 | Slight vector scan overhead | Recency filter reduces candidate set |
| Storage | 35% smaller index | Need tombstone GC | Nightly TTL cron job |
| Compliance | Auto-expires PII streams | Possible recall loss on old chats | Cold store fallback endpoint |
7. Future Considerations
Dynamic windowing: shrink/expand TTLs based on query recency distribution.
Hybrid sparse + dense retrieval: for very long documents once phi-4 gains 128 k context.
Federated freshness: push TTL logic into edge caches for on-device RAG.
Embedded recency weights: incorporate timestamps into embeddings (time2vec).