AI Workloads: Micro-batch Inference over Real-time Inference

Summary (TL;DR)

I chose micro-batch inference as the primary serving pattern because it strikes the best balance between throughput-per-dollar, GPU/CPU utilization, and operational predictability for machine-driven workloads, while still meeting the latency Service-Level Objective (SLO) of ≤ 5 seconds end-to-end. Compared with always-on real-time (per-request) inference, micro-batching lets us coalesce many small requests into short-lived bursts, saturate hardware, run fewer model replicas, and pay 40-80% less in total compute, without materially degrading user experience for near-real-time use cases such as ranking, scoring, and LLM embedding generation. This architecture decision record (ADR) captures the context, options, decision rationale, and consequences so future engineers can audit or revisit the trade-offs.

Architecture Decision Record Metadata

| Field | Value |
|---|---|
| Decision ID | ADR-2025-05-22-001 |
| Date | 22 May 2025 |
| Status | Accepted |
| Owner | Haluan Irsad (haluan.irsad@gmail.com) |
| Reviewers | - |
| Scope | Online inferencing for user-facing ML & LLM services |

1. Context

  • The product surfaces personalised scores, recommendations, and short LLM outputs that must appear within ≤ 5 s after an upstream event (click, transaction, message).

  • Traffic is highly bursty (peaks 20× off-peak) and prediction payloads vary in size.

  • GPU capacity is shared with training workloads; cloud costs are under aggressive optimisation targets (-30% YoY).

  • Most feature pipelines already land in a streaming buffer (Apache Kafka) and can tolerate sub-second queuing delay.

2. Options Considered

| # | Pattern | Description | Meets SLO? | Est. Monthly Cost* | Complexity |
|---|---|---|---|---|---|
| A | Real-time (per request) | One RPC → one prediction; autoscale pods | ✅ (≈ 200 ms p95) | $110 k | High (traffic spikes, cold-starts) |
| B | Micro-batch (chosen) | Accumulate 10–500 reqs or 250 ms window; infer as a batch | ✅ (1.5 s p95) | $48 k | Medium |
| C | Classic offline batch | Nightly Spark/Flume job, write to DB | ❌ (> mins) | $15 k | Low |

*Cost model based on AWS g5.2xlarge On-Demand GPU pricing and NVIDIA Triton throughput benchmarks; a back-of-envelope check follows.
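
For auditability, the sketch below shows a back-of-envelope way to sanity-check this column. The hourly rate and replica counts are illustrative assumptions only, not the actual cost-model inputs.

```python
# Back-of-envelope check for the cost column above. The hourly rate and
# replica counts are illustrative assumptions, not the real cost-model inputs.
HOURS_PER_MONTH = 730

def monthly_on_demand_cost(replicas: int, usd_per_hour: float) -> float:
    """Cost of keeping `replicas` GPU instances on On-Demand pricing for a month."""
    return replicas * usd_per_hour * HOURS_PER_MONTH

# Real-time serving keeps many lightly loaded replicas warm to absorb 20x bursts;
# micro-batching saturates far fewer GPUs for the same traffic.
print(f"per-request tier: ${monthly_on_demand_cost(120, 1.25):,.0f}/month")
print(f"micro-batch tier: ${monthly_on_demand_cost(53, 1.25):,.0f}/month")
```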

3. Decision

Adopt micro-batch inference as the default serving mode, backed by NVIDIA Triton with dynamic batching and Spark Structured Streaming (200 ms trigger) for request coalescing. Real-time pods will remain only for strictly latency-critical endpoints (< 300 ms), gated by feature flags.

4. Rationale

4.1 Latency vs Throughput Trade-off

  • Throughput on a single A100 GPU rises 14× when batch size goes from 1 to 64, while latency rises only 4× (still < 2 s p95), well inside the SLO.

  • NVIDIA Triton’s dynamic batching groups requests that arrive within a short window, boosting utilisation without violating per-batch latency budgets (a config sketch follows this list).

  • Research on Sarathi-Serve (OSDI ’24) shows micro-batching sits on the efficient frontier of the latency-throughput curve for LLMs.
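
To make the dynamic-batching window concrete, here is a minimal Triton model-configuration sketch. The model name, platform, batch sizes, and 250 ms queue delay are illustrative assumptions, not the production configuration.

```
# Hypothetical config.pbtxt sketch; names and values are illustrative.
name: "ranking_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 250000   # hold requests for up to 250 ms
}
instance_group [ { kind: KIND_GPU, count: 1 } ]
```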

4.2 Cost Efficiency

  • AWS Bedrock prices batch inference roughly 50% lower than its continuous on-demand endpoints.

    • SageMaker and Vertex AI show similar economics.

  • Independent benchmarks report 70-80% GPU-hour savings when moving from per-request to batched LLM inference.

4.3 Hardware Utilisation & Scalability

  • Micro-batching keeps GPUs at > 85% utilization during bursts, minimizing the number of replicas needed.

  • Spark Structured Streaming’s micro-batch engine naturally handles back-pressure and can scale linearly across Kafka partitions.

4.4 Operational Predictability

  • Fixed-interval micro-batches give us deterministic checkpoints for monitoring and retry; Uber reports similar benefits in its offline-inference pipelines on Michelangelo.

  • Easier A/B rollout: an entire batch can be shadow-scored and compared before commit (see the sketch below).
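
As an illustration of batch-level shadow scoring, a minimal sketch follows. The function names and drift threshold are hypothetical, not an existing internal API.

```python
# Hedged sketch of per-batch shadow scoring. `champion`, `challenger`,
# `commit`, and DRIFT_THRESHOLD are hypothetical names for illustration.
from statistics import mean
from typing import Callable, Sequence

DRIFT_THRESHOLD = 0.02  # assumed tolerance on mean-score drift per batch

def shadow_score(
    batch: Sequence[dict],
    champion: Callable[[Sequence[dict]], list],
    challenger: Callable[[Sequence[dict]], list],
    commit: Callable[[Sequence[dict], list], None],
) -> bool:
    """Score one micro-batch with both model versions, commit only the
    champion's results, and report whether the challenger stayed in-bounds."""
    champ_scores = champion(batch)
    chall_scores = challenger(batch)
    commit(batch, champ_scores)              # champion remains authoritative
    drift = abs(mean(champ_scores) - mean(chall_scores))
    return drift <= DRIFT_THRESHOLD          # gate for promoting the challenger
```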

4.5 Use-Case Fit

  • Machine-driven workloads (ranking, fraud scoring, daily embedding refresh) are latency-tolerant relative to human perception; real-time inference is over-engineered for these needs.

5. Micro-Batch Inference Architecture

Key components

| Component | Role |
|---|---|
| Kafka | Durable buffer; evens out spikes |
| Spark Structured Streaming (micro-batch) | Reads events, aggregates until size/timeout, forwards to Triton (see the sketch below) |
| Triton Inference Server (GPU) | Executes model with dynamic batching; exposes Prometheus-ready stats |
| Result DB / Cache | Downstream services fetch predictions |
| Grafana / OpenTelemetry / Alertmanager | Monitors per-batch latency, queue depth |
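
A minimal PySpark sketch of the coalescing job described above is shown below. The Kafka topic, checkpoint path, and Triton call are assumptions, included only to make the 200 ms trigger and the foreachBatch hand-off concrete.

```python
# Minimal sketch of the request-coalescing job. Topic name, checkpoint path,
# and the Triton call are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-coalescer").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "inference-requests")        # assumed topic name
    .load()
)

def score_batch(df, epoch_id):
    """Send one micro-batch to Triton and persist the predictions."""
    rows = df.selectExpr("CAST(value AS STRING) AS payload").collect()
    if not rows:
        return
    # Placeholder: call Triton (e.g. via tritonclient.http) with the whole
    # batch, then write results to the downstream result store / cache.
    ...

query = (
    events.writeStream
    .trigger(processingTime="200 milliseconds")       # micro-batch window from the ADR
    .foreachBatch(score_batch)
    .option("checkpointLocation", "/tmp/checkpoints/microbatch")  # assumed path
    .start()
)
query.awaitTermination()
```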

6. Consequences & Trade-offs

| Aspect | Positive Impact | Negative Impact | Mitigation |
|---|---|---|---|
| Cost | Fewer GPUs, 40-80% cheaper | Slight infra overhead for batch engine | Consolidate Spark jobs |
| Latency | Meets ≤ 5 s SLO | Not viable for sub-second UX | Keep slim real-time tier |
| Reliability | Easier retries (idempotent batches) | Larger blast radius per failure | Circuit-break & progressive batch size (see the sketch below) |
| DevX | Single code path for offline + near-online | Extra schema for batch inputs | Generate stubs via protobuf |
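
The "circuit-break & progressive batch size" mitigation can be sketched as follows; the names, exception type, and thresholds are illustrative assumptions.

```python
# Hedged sketch of the progressive-batch-size mitigation: after a failed
# batch, halve the batch size and retry; stop (circuit open) after repeated
# failures. Names and thresholds are illustrative.
MIN_BATCH, MAX_FAILURES = 1, 3

def infer_with_backoff(items, infer_fn, batch_size):
    """Run inference over `items`; on failure, shrink the batch size and retry."""
    failures = 0
    while failures < MAX_FAILURES:
        try:
            return [infer_fn(items[i:i + batch_size])
                    for i in range(0, len(items), batch_size)]
        except RuntimeError:                 # assumed transient inference error
            failures += 1
            batch_size = max(MIN_BATCH, batch_size // 2)
    raise RuntimeError("circuit open: too many failed micro-batches")
```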

7. Future Considerations

  • Adaptive Windowing: dynamically shrink the batch window during low load to cut tail latency (a minimal sketch follows this list). Ref: NVIDIA Autotune API.

  • Multi-model Batching: explore model-ensemble execution within the same GPU pass to further cut cost. Ref: Databricks.

  • Observability Enhancements: per-batch lineage tags to unify with the feature store. Ref: Uber Palette.
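
A minimal sketch of the adaptive-windowing idea follows; it is not tied to any specific NVIDIA API, and the queue-depth thresholds are assumptions.

```python
# Hedged sketch of adaptive windowing: shrink the coalescing window when the
# queue is shallow, widen it when the queue is deep. Thresholds are assumed.
MIN_WINDOW_MS, MAX_WINDOW_MS = 50, 250

def next_window_ms(queue_depth: int, current_ms: int) -> int:
    """Pick the next micro-batch window from the observed queue depth."""
    if queue_depth < 16:                  # low load: favour tail latency
        return max(MIN_WINDOW_MS, current_ms // 2)
    if queue_depth > 256:                 # bursty load: favour throughput
        return min(MAX_WINDOW_MS, current_ms * 2)
    return current_ms
```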


References

  1. Upsolver: cheat sheet on batch vs micro-batch processing.

  2. Amazon Web Services: SageMaker batch vs real-time trade-offs.

  3. Google Cloud: Vertex AI batch prediction guide.

  4. Databricks Documentation: AI Functions for batch inference.

  5. Databricks: LLM inference performance study.

  6. NVIDIA Developer Blog: Triton dynamic batching.

  7. Adaline: batch inference cost analysis.

  8. Amazon Web Services: Bedrock batch inference pipeline.

  9. USENIX OSDI ’24: Sarathi paper (throughput-latency).

  10. Uber: Michelangelo offline inference experience.


license: CC-BY-SA-4.0
