Chunking Strategy × Freshness Window for RAG

Summary (TL;DR)

For our multilingual knowledge assistant we will index content with a hybrid semantic + recursive chunking pipeline and collection-specific sliding freshness windows. Paired with granite-embedding-278M for dense retrieval and phi-4-reasoning-14B-fp16 for generation, this configuration lifts answer F1 by 4 points over fixed-size baselines, trims the vector store by 35%, and guarantees that fast-moving collections (e.g. news, chat) never serve content older than seven days. Evergreen corporate-policy content remains searchable indefinitely. The decision log below records the context, alternatives, rationale, and consequences.

Architecture Decision Record Metadata

| Field | Value |
| --- | --- |
| Decision ID | ADR-2025-05-27-003 |
| Topic | Retrieval-Augmented Generation (RAG) Integration |
| Scope | Chunking strategy × Freshness window |
| Embedding Model | granite-embedding-278M |
| Foundation Model | phi-4-reasoning-14B-fp16 |
| Status | Accepted |
| Owner | Haluan Mohammad Irsad |
| Reviewers | - |
| Date | 27 May 2025 |

1. Context

  • Workloads. Internal Q&A, policy lookup, incident retros, and customer-chat triage across 12 languages.

  • Quality bar. ≥ 92% exact-match on held-out FAQ set; ≤ 1.5 s retrieval p95.

  • Data velocity.

    • News / Chat: updated hourly; content becomes irrelevant after one week.

    • HR / Finance: updated monthly; retain 90 days.

    • Policies / Manuals: long-tail evergreen, versioned.

  • Hardware. Vector DB on 3 × r6g.4xlarge; GPU inference on A10G instances; index growth must stay < 400 GB for the budget year.

2. Options Considered

| # | Chunking | Freshness Window | Index Size▲ | Eval F1 | p95 Latency | Meets Targets? | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | Fixed 1k tokens, no overlap | None | 100% | 84% | 1.1 s | No (quality) | Sparse context gaps |
| B | Fixed 512 tokens, 20% overlap | Global 30-day TTL | 155% | 88% | 1.3 s | No (size) | Index inflation |
| C | Semantic + recursive, 128–384 tokens (chosen) | 7 / 90 / ∞-day sliding | 65% | 92% | 1.2 s | Yes | Cohesive chunks |
| D | On-the-fly doc retrieval (no chunking) | Query-time filter | N/A | 91% | 3.4 s | No (latency) | Full-doc re-embed |

▲ Relative to Option A index size.

3. Decision

Adopt Option C: semantic boundary detection (sentence tokenizer → embedding-similarity splitter with a 0.64 cosine-similarity threshold, recursive fallback to 256-character chunks) and per-collection rolling freshness windows (7 days for news/chat, 90 days for HR/finance, indefinite for policies). Documents outside their window are tombstoned in the vector store but preserved in cold object storage.
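A minimal sketch of the per-collection window policy, assuming hypothetical collection names; the actual enforcement lives in the TTL sweeper described in Section 5.

```python
from datetime import datetime, timedelta, timezone

# Per-collection freshness windows; None marks an evergreen collection that is
# never expired from the vector store (collection names are illustrative).
FRESHNESS_WINDOWS = {
    "news": timedelta(days=7),
    "chat": timedelta(days=7),
    "hr": timedelta(days=90),
    "finance": timedelta(days=90),
    "policies": None,   # evergreen, version-tagged in metadata
    "manuals": None,
}

def is_fresh(collection: str, ingested_at: datetime, now: datetime | None = None) -> bool:
    """Return True while a chunk is still inside its collection's freshness window."""
    window = FRESHNESS_WINDOWS.get(collection)
    if window is None:                      # evergreen collections never expire
        return True
    now = now or datetime.now(timezone.utc)
    return now - ingested_at <= window
```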


4. Rationale

4.1 Retrieval Quality

  • Semantic chunks preserve discourse units, and overlap is applied only when similarity falls below the threshold, avoiding context dilution. This added 4 F1 points over the fixed-size baselines.

4.2 Index Size & Cost

  • An average chunk length of 220 tokens (vs. 512) shrinks the vector count by 35% while maintaining coverage; the r6g.4xlarge fleet drops from 4 to 3 nodes.

4.3 Freshness & Compliance

  • Sliding windows satisfy right-to-be-forgotten obligations for sensitive chat and HR streams without re-ingesting the whole corpus.

  • Policy docs remain indefinitely, version-tagged in metadata for traceability.

4.4 Multilingual Fit

  • granite-embedding-278M supports 12 languages natively and maintains cosine-recall parity across EN/ID/FR corpora.

4.5 Operational Complexity

  • The LlamaIndex SemanticSplitter and RecursiveCharacterTextSplitter compose cleanly; freshness windows are enforced via vector-store metadata filters and nightly TTL sweeps.
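A minimal composition sketch under stated assumptions: it uses LlamaIndex's SemanticSplitterNodeParser and LangChain's RecursiveCharacterTextSplitter as the recursive fallback, assumes the Hugging Face model id ibm-granite/granite-embedding-278m-multilingual for the Granite encoder, and maps the ADR's 0.64 cosine cut-off onto LlamaIndex's percentile-based breakpoint, which would need empirical tuning.

```python
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Dense encoder used both for chunk-boundary detection and for retrieval
# (model id assumed; verify against the Granite model card).
embed_model = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-278m-multilingual"
)

# Semantic pass: split where adjacent-sentence similarity drops below the
# breakpoint (percentile-based in LlamaIndex; the ADR's 0.64 cosine cut-off
# has to be mapped onto this percentile empirically).
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=90,
    embed_model=embed_model,
)

# Recursive fallback for chunks that are still too long after the semantic pass.
recursive_fallback = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)

MAX_TOKENS = 384  # upper bound from this ADR

def chunk_document(doc: Document) -> list[str]:
    """Semantic split first, then recursively re-split any oversized chunk."""
    chunks: list[str] = []
    for node in semantic_splitter.get_nodes_from_documents([doc]):
        text = node.get_content()
        # Rough token estimate (~1.3 tokens per word); swap in a real tokenizer.
        if len(text.split()) * 1.3 <= MAX_TOKENS:
            chunks.append(text)
        else:
            chunks.extend(recursive_fallback.split_text(text))
    return chunks
```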

5. RAG Architecture

  • Ingest Worker: Scrapes or receives new documents, tags them with collection and timestamp, then drops each file onto a Kafka-backed ingest queue.

  • Semantic + Recursive Chunker: Reads from the queue; splits documents at semantic sentence boundaries (cos-sim < 0.64) and, if still too long, recursively down to ≤ 384 tokens. Outputs {chunk, embedding, meta} JSON.

  • Vector Store (Pinecone / pgvector): Persists each chunk’s 768-d Granite embedding plus metadata. Supports recency-decay scoring and metadata filters for collection and language.

  • TTL Sweeper: Nightly cron job that tombstones vectors older than their collection’s freshness window (7 d news/chat, 90 d HR, ∞ policies) and archives them to cold S3.

  • Retriever API:

    • Embeds the user query with granite-embedding-278 M.

    • Executes a top-k similarity search with recency-decay scoring (sim × e^{−λ·Δdays}; see the sketch after this list) and returns the chunks + metadata.

  • Generation Service: Prompts phi-4-reasoning-14 B-fp16 with system instructions, retrieved chunks, and the user question; streams the answer back through gRPC, hitting ≤ 1.5 s p95 E2E latency.

  • Feedback Logger: Captures query_id, selected chunks, model answer, and user thumbs-up/down; feeds nightly analytics that adjust λ or flag stale content for re-indexing.
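A minimal sketch of the recency-decay re-ranking step, assuming hypothetical hit dictionaries as returned by the vector store’s metadata-filtered top-k search; the per-collection λ values are illustrative tuning constants, not numbers fixed by this ADR.

```python
import math
from datetime import datetime, timezone

# Illustrative decay rates per collection; evergreen collections get no decay.
LAMBDA = {"news": 0.30, "chat": 0.30, "hr": 0.05, "finance": 0.05}

def rerank_with_recency(hits: list[dict], top_k: int = 5, now: datetime | None = None) -> list[dict]:
    """Re-score raw similarity hits as sim * exp(-lambda * age_in_days).

    Each hit is assumed to carry {"score", "collection", "ingested_at", "text", ...}.
    """
    now = now or datetime.now(timezone.utc)
    for hit in hits:
        age_days = (now - hit["ingested_at"]).total_seconds() / 86_400
        lam = LAMBDA.get(hit["collection"], 0.0)
        hit["score"] *= math.exp(-lam * age_days)
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
```

Because the vector store already filters by collection, language, and freshness window at query time, the decay only re-orders candidates that are inside their window; it never resurrects expired content.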

6. Consequences & Trade-offs

| Aspect | Positive | Negative | Mitigation |
| --- | --- | --- | --- |
| Quality | Highest F1, natural boundaries | Extra pre-processing pass | Chunker runs in async ingest queue |
| Latency | Still ≤ 1.5 s p95 | Slight vector-scan overhead | Recency filter reduces candidate set |
| Storage | 35% smaller index | Needs tombstone GC | Nightly TTL cron job |
| Compliance | Auto-expires PII streams | Possible recall loss on old chats | Cold-store fallback endpoint |

7. Future Considerations

  • Dynamic windowing: shrink/expand TTLs based on the query-recency distribution.

  • Hybrid sparse + dense retrieval for very long documents once phi-4 gains 128k context.

  • Federated freshness: push TTL logic into edge caches for on-device RAG.

  • Embedded recency weights: incorporate the timestamp into the embedding (time2vec); see the sketch below.
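A minimal sketch of the Time2Vec encoding referenced above: the first component is linear in time and the remaining components are periodic; in practice the frequencies ω and phases φ would be learned jointly with the retriever rather than supplied by hand.

```python
import numpy as np

def time2vec(tau: float, omega: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Time2Vec: v[0] = omega[0]*tau + phi[0] (linear trend);
    v[i] = sin(omega[i]*tau + phi[i]) for i >= 1 (periodic components).

    tau is a scalar timestamp (e.g. days since epoch); the result can be
    concatenated with the chunk's text embedding before indexing.
    """
    v = omega * tau + phi
    v[1:] = np.sin(v[1:])
    return v
```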


References

  1. IBM Granite Embedding 278M model card (IBM)

  2. Phi-4-reasoning 14B release notes (Hugging Face; Microsoft Tech Community)

  3. Semantic chunking best practices (Medium; Superlinked)

  4. Sliding window and context preservation in RAG pipelines (Elastic; arXiv)

Last updated