Chunking Strategy × Freshness Window for RAG

Summary (TL;DR)

For our multilingual knowledge assistant we will index content with a hybrid semantic + recursive chunking pipeline and collection-specific sliding freshness windows. Paired with granite-embedding-278M for dense retrieval and phi-4-reasoning-14B-fp16 for generation, this configuration lifts answer F1 by 4 points over fixed-size baselines, trims the vector store by 35%, and guarantees that fast-moving collections (e.g. news, chat) never serve content older than seven days. Evergreen corporate-policy content remains searchable indefinitely. The decision log below records the context, alternatives, rationale, and consequences.

Architecture Decision Record Metadata

| Field | Value |
| --- | --- |
| Decision ID | ADR-2025-05-27-003 |
| Topic | Retrieval-Augmented Generation (RAG) Integration |
| Scope | Chunking strategy × Freshness window |
| Embedding Model | granite-embedding-278M |
| Foundation Model | phi-4-reasoning-14B-fp16 |
| Status | Accepted |
| Owner | Haluan Mohammad Irsad |
| Reviewers | - |
| Date | 27 May 2025 |

1. Context

  • Workloads. Internal Q&A, policy lookup, incident retros, and customer-chat triage across 12 languages.

  • Quality bar. ≥ 92% exact-match on held-out FAQ set; ≤ 1.5 s retrieval p95.

  • Data velocity.

    • News / Chat: updated hourly; content becomes irrelevant after one week.

    • HR / Finance: updated monthly; retain 90 days.

    • Policies / Manuals: long-tail evergreen, versioned.

  • Hardware. Vector DB on 3 × r6g.4xlarge; GPU inference on A10G instances; index growth must stay < 400 GB for the budget year.

2. Options Considered

| # | Chunking | Freshness Window | Index Size▲ | Eval F1 | p95 Latency | Meets Targets? | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | Fixed 1k tokens, no overlap | None | 100% | 84% | 1.1 s | No (quality) | Sparse context gaps |
| B | Fixed 512 tokens, 20% overlap | Global 30-day TTL | 155% | 88% | 1.3 s | No (size) | Index inflation |
| C | Semantic + recursive, 128–384 tokens (chosen) | 7 / 90 / ∞-day sliding | 65% | 92% | 1.2 s | Yes | Cohesive chunks |
| D | On-the-fly doc retrieval (no chunking) | Query-time filter | N/A | 91% | 3.4 s | No (latency) | Full-doc re-embed |

▲ Relative to Option A index size.

3. Decision

Adopt Option C: semantic boundary detection (sentence tokenizer → embedding-similarity splitter with a 0.64 cosine-similarity threshold, recursive fallback to 256-character chunks) and per-collection rolling freshness windows (7 days for news/chat, 90 days for HR/finance, indefinite for policies). Documents outside their window are tombstoned in the vector store but preserved in cold object storage.
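A minimal sketch of the per-collection window policy, assuming hypothetical collection names; the actual enforcement lives in the TTL sweeper described in Section 5.

```python
from datetime import datetime, timedelta, timezone

# Per-collection freshness windows; None marks an evergreen collection that is
# never expired from the vector store (collection names are illustrative).
FRESHNESS_WINDOWS = {
    "news": timedelta(days=7),
    "chat": timedelta(days=7),
    "hr": timedelta(days=90),
    "finance": timedelta(days=90),
    "policies": None,   # evergreen, version-tagged in metadata
    "manuals": None,
}

def is_fresh(collection: str, ingested_at: datetime, now: datetime | None = None) -> bool:
    """Return True while a chunk is still inside its collection's freshness window."""
    window = FRESHNESS_WINDOWS.get(collection)
    if window is None:                      # evergreen collections never expire
        return True
    now = now or datetime.now(timezone.utc)
    return now - ingested_at <= window
```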


4. Rationale

4.1 Retrieval Quality

  • Semantic chunks preserve discourse units, and overlap is applied only when similarity falls below the threshold, avoiding context dilution. This added 4 F1 points over the fixed-size baselines.

4.2 Index Size & Cost

  • An average chunk length of 220 tokens (vs. 512) shrinks the vector count by 35% while maintaining coverage; the r6g.4xlarge fleet drops from 4 to 3 nodes.

4.3 Freshness & Compliance

  • Sliding windows satisfy right-to-be-forgotten obligations for sensitive chat and HR streams without re-ingesting the whole corpus.

  • Policy docs remain indefinitely, version-tagged in metadata for traceability.

4.4 Multilingual Fit

  • granite-embedding-278M supports 12 languages natively and maintains cosine-recall parity across EN/ID/FR corpora.

4.5 Operational Complexity

  • The LlamaIndex SemanticSplitter and RecursiveCharacterTextSplitter compose cleanly; freshness windows are enforced via vector-store metadata filters and nightly TTL sweeps.
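A minimal composition sketch under stated assumptions: it uses LlamaIndex's SemanticSplitterNodeParser and LangChain's RecursiveCharacterTextSplitter as the recursive fallback, assumes the Hugging Face model id ibm-granite/granite-embedding-278m-multilingual for the Granite encoder, and maps the ADR's 0.64 cosine cut-off onto LlamaIndex's percentile-based breakpoint, which would need empirical tuning.

```python
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Dense encoder used both for chunk-boundary detection and for retrieval
# (model id assumed; verify against the Granite model card).
embed_model = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-278m-multilingual"
)

# Semantic pass: split where adjacent-sentence similarity drops below the
# breakpoint (percentile-based in LlamaIndex; the ADR's 0.64 cosine cut-off
# has to be mapped onto this percentile empirically).
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=90,
    embed_model=embed_model,
)

# Recursive fallback for chunks that are still too long after the semantic pass.
recursive_fallback = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)

MAX_TOKENS = 384  # upper bound from this ADR

def chunk_document(doc: Document) -> list[str]:
    """Semantic split first, then recursively re-split any oversized chunk."""
    chunks: list[str] = []
    for node in semantic_splitter.get_nodes_from_documents([doc]):
        text = node.get_content()
        # Rough token estimate (~1.3 tokens per word); swap in a real tokenizer.
        if len(text.split()) * 1.3 <= MAX_TOKENS:
            chunks.append(text)
        else:
            chunks.extend(recursive_fallback.split_text(text))
    return chunks
```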

5. RAG Architecture

  • Ingest Worker: Scrapes or receives new documents, tags them with collection and timestamp, then drops each file onto a Kafka-backed ingest queue.

  • Semantic + Recursive Chunker: Reads from the queue; splits documents at semantic sentence boundaries (cos-sim < 0.64) and, if still too long, recursively down to ≤ 384 tokens. Outputs {chunk, embedding, meta} JSON.

  • Vector Store (Pinecone / pgvector): Persists each chunk’s 768-d Granite embedding plus metadata. Supports recency-decay scoring and metadata filters for collection and language.

  • TTL Sweeper: Nightly cron job that tombstones vectors older than their collection’s freshness window (7 d news/chat, 90 d HR, ∞ policies) and archives them to cold S3.

  • Retriever API:

    • Embeds the user query with granite-embedding-278 M.

    • Executes a top-k similarity search with recency-decay scoring (sim × e^{−λ·Δdays}; see the sketch after this list) and returns the chunks + metadata.

  • Generation Service: Prompts phi-4-reasoning-14 B-fp16 with system instructions, retrieved chunks, and the user question; streams the answer back through gRPC, hitting ≤ 1.5 s p95 E2E latency.

  • Feedback Logger: Captures query_id, selected chunks, model answer, and user thumbs-up/down; feeds nightly analytics that adjust λ or flag stale content for re-indexing.
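A minimal sketch of the recency-decay re-ranking step, assuming hypothetical hit dictionaries as returned by the vector store’s metadata-filtered top-k search; the per-collection λ values are illustrative tuning constants, not numbers fixed by this ADR.

```python
import math
from datetime import datetime, timezone

# Illustrative decay rates per collection; evergreen collections get no decay.
LAMBDA = {"news": 0.30, "chat": 0.30, "hr": 0.05, "finance": 0.05}

def rerank_with_recency(hits: list[dict], top_k: int = 5, now: datetime | None = None) -> list[dict]:
    """Re-score raw similarity hits as sim * exp(-lambda * age_in_days).

    Each hit is assumed to carry {"score", "collection", "ingested_at", "text", ...}.
    """
    now = now or datetime.now(timezone.utc)
    for hit in hits:
        age_days = (now - hit["ingested_at"]).total_seconds() / 86_400
        lam = LAMBDA.get(hit["collection"], 0.0)
        hit["score"] *= math.exp(-lam * age_days)
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
```

Because the vector store already filters by collection, language, and freshness window at query time, the decay only re-orders candidates that are inside their window; it never resurrects expired content.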

6. Consequences & Trade-offs

| Aspect | Positive | Negative | Mitigation |
| --- | --- | --- | --- |
| Quality | Highest F1, natural boundaries | Extra pre-processing pass | Chunker runs in async ingest queue |
| Latency | Still ≤ 1.5 s p95 | Slight vector-scan overhead | Recency filter reduces candidate set |
| Storage | 35% smaller index | Needs tombstone GC | Nightly TTL cron job |
| Compliance | Auto-expires PII streams | Possible recall loss on old chats | Cold-store fallback endpoint |

7. Future Considerations

  • Dynamic windowing: shrink/expand TTLs based on the query-recency distribution.

  • Hybrid sparse + dense retrieval for very long documents once phi-4 gains 128k context.

  • Federated freshness: push TTL logic into edge caches for on-device RAG.

  • Embedded recency weights: incorporate the timestamp into the embedding (time2vec); see the sketch below.
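A minimal sketch of the Time2Vec encoding referenced above: the first component is linear in time and the remaining components are periodic; in practice the frequencies ω and phases φ would be learned jointly with the retriever rather than supplied by hand.

```python
import numpy as np

def time2vec(tau: float, omega: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Time2Vec: v[0] = omega[0]*tau + phi[0] (linear trend);
    v[i] = sin(omega[i]*tau + phi[i]) for i >= 1 (periodic components).

    tau is a scalar timestamp (e.g. days since epoch); the result can be
    concatenated with the chunk's text embedding before indexing.
    """
    v = omega * tau + phi
    v[1:] = np.sin(v[1:])
    return v
```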


References

  1. IBM Granite Embedding 278M model card (IBM)

  2. Phi-4-reasoning 14B release notes (Hugging Face; Microsoft Tech Community)

  3. Semantic chunking best practices (Medium; Superlinked)

  4. Sliding window and context preservation in RAG pipelines (Elastic; arXiv)

Last updated