AI Workloads: Micro-batch Inference over Real-time Inference
Summary (TL;DR)
I chose micro-batch inference as the primary serving pattern because it strikes the best balance between throughput-per-dollar, GPU/CPU utilisation, and operational predictability for machine-to-machine workloads, while still meeting the end-to-end latency Service-Level Objective (SLO) of ≤ 5 seconds. Compared with always-on real-time (per-request) inference, micro-batching lets us coalesce many small requests into short-lived bursts, saturate the hardware, run fewer model replicas, and pay 40-80% less for compute, without materially degrading the user experience for near-real-time use cases such as ranking, scoring, and LLM embedding generation. This architecture decision record (ADR) captures the context, the options considered, the rationale for the decision, and its consequences so future engineers can audit or revisit the trade-offs.
Architecture Decision Record Metadata
Decision ID: ADR-2025-05-22-001
Date: 22 May 2025
Status: Accepted
Owner: Haluan Irsad (haluan.irsad@gmail.com)
Reviewers: -
Scope: Online inference for user-facing ML & LLM services
1. Context
The product surfaces personalised scores, recommendations, and short LLM outputs that must appear within ≤ 5 s after an upstream event (click, transaction, message).
Traffic is highly bursty (peaks 20× off-peak) and prediction payloads vary in size.
GPU capacity is shared with training workloads; cloud costs are under aggressive optimisation targets (-30% YoY).
Most feature pipelines already land in a streaming buffer (Apache Kafka) and can tolerate sub-second queuing delay.
2. Options Considered
Option A – Real-time (per request): one RPC → one prediction, served by autoscaled pods. Meets ≤ 5 s SLO: ✅ (≈ 200 ms p95). Est. monthly cost: $110 k. Operational risk: high (traffic spikes, cold starts).
Option B – Micro-batch (chosen): accumulate 10–500 requests or a 250 ms window, then infer as one batch. Meets ≤ 5 s SLO: ✅ (1.5 s p95). Est. monthly cost: $48 k. Operational risk: medium.
Option C – Classic offline batch: nightly Spark/Flume job that writes predictions to a DB. Meets ≤ 5 s SLO: ❌ (minutes or more). Est. monthly cost: $15 k. Operational risk: low.
*Cost model based on AWS g5.2xlarge On-Demand GPU pricing and NVIDIA Triton throughput benchmarks.
3. Decision
Adopt micro-batch inference as the default serving mode, backed by NVIDIA Triton with dynamic batching and Spark Structured Streaming (200 ms trigger) for request coalescing; a sketch of the implied Triton batching configuration follows. Real-time pods remain only for highly latency-critical endpoints (< 300 ms), gated behind feature flags.
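A minimal sketch of the batching settings this decision implies, written as a Triton config.pbtxt. The model name, backend, and preferred batch sizes are illustrative assumptions, and the 250 ms queue delay mirrors Option B rather than a tuned value.
```
# config.pbtxt (sketch) -- illustrative, not tuned values
name: "ranker"                   # hypothetical model name
platform: "onnxruntime_onnx"     # assumed backend
max_batch_size: 64               # upper bound per GPU pass (see Section 4.1)

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]    # Triton coalesces queued requests toward these sizes
  max_queue_delay_microseconds: 250000    # 250 ms coalescing window, matching Option B
}
```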
4. Rationale
4.1 Latency vs Throughput Trade-off
Throughput on a single A100 GPU rises about 14× when the batch size grows from 1 to 64, while latency rises only about 4× (still < 2 s p95), well inside the SLO; a sizing sketch follows this list.
NVIDIA Triton's dynamic batching groups requests that arrive within a short window, boosting utilisation without violating per-batch latency budgets.
Research on Sarathi (OSDI '24) shows micro-batching sits on the efficient frontier of the latency-throughput curve for LLMs.
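A back-of-envelope sizing sketch in Python makes the replica math explicit. The per-GPU request rate and peak load below are hypothetical placeholders; only the 14× batching speed-up is taken from the benchmark above.
```python
import math

# Hypothetical figures -- only BATCH_SPEEDUP (~14x at batch size 64) comes from Section 4.1.
PER_GPU_RPS_AT_BATCH_1 = 90   # assumed single-request throughput of one A100 replica
BATCH_SPEEDUP = 14            # ~14x throughput going from batch size 1 to 64
PEAK_RPS = 5_000              # assumed peak request rate (traffic peaks 20x off-peak)

replicas_realtime = math.ceil(PEAK_RPS / PER_GPU_RPS_AT_BATCH_1)
replicas_microbatch = math.ceil(PEAK_RPS / (PER_GPU_RPS_AT_BATCH_1 * BATCH_SPEEDUP))

print(f"real-time replicas needed:   {replicas_realtime}")    # 56 with these placeholders
print(f"micro-batch replicas needed: {replicas_microbatch}")  # 4 with these placeholders
```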
4.2 Cost Efficiency
AWS Bedrock prices batch inference roughly 50% lower than its continuously provisioned (on-demand) endpoints.
SageMaker and Vertex AI batch prediction show similar economics.
Independent benchmarks report 70-80% GPU-hour savings when moving from per-request to batched LLM inference; a worked check against our own cost estimates follows.
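A worked check of the savings claim, using only the monthly cost figures already quoted in the options table (Section 2):
```python
# Monthly compute estimates from Section 2.
realtime_monthly_usd = 110_000    # Option A, real-time
microbatch_monthly_usd = 48_000   # Option B, micro-batch

savings = 1 - microbatch_monthly_usd / realtime_monthly_usd
print(f"estimated saving: {savings:.0%}")  # ~56%, inside the 40-80% range cited above
```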
4.3 Hardware Utilisation & Scalability
Micro-batching keeps GPUs above 85% utilisation during bursts, minimising the number of replicas needed.
Spark Structured Streaming's micro-batch engine naturally handles back-pressure and scales roughly linearly across Kafka partitions; see the ingestion sketch below.
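A minimal PySpark ingestion sketch under stated assumptions: the broker address and topic name are hypothetical, and `maxOffsetsPerTrigger` is used to cap how much of a burst each micro-batch pulls from Kafka.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-inference").getOrCreate()

# Read the request stream; Kafka absorbs bursts, and maxOffsetsPerTrigger
# bounds how many records each micro-batch drains from the topic.
requests = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
    .option("subscribe", "prediction-requests")        # hypothetical topic name
    .option("maxOffsetsPerTrigger", 500)               # aligns with the 10-500 request range
    .load()
)
# Payload parsing (Kafka value -> feature columns) is omitted here; the
# scoring step is sketched in Section 5.
```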
4.4 Operational Predictability
Fixed-interval micro-batches give us deterministic checkpoints for monitoring and retry; Uber reports similar benefits in its offline-inference pipelines on Michelangelo.
Easier A/B rollout: an entire batch can be shadow-scored and compared before results are committed; a minimal sketch follows.
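A minimal sketch of per-batch shadow scoring; `score_prod` and `score_shadow` are hypothetical callables for the two model versions, and the disagreement threshold is illustrative.
```python
import numpy as np

def shadow_compare(batch_features, score_prod, score_shadow, tol=0.05):
    """Score one micro-batch with both model versions and report agreement.

    score_prod / score_shadow are hypothetical callables returning one score
    per row; tol is an illustrative absolute-difference threshold.
    """
    prod = np.asarray(score_prod(batch_features), dtype=float)
    shadow = np.asarray(score_shadow(batch_features), dtype=float)
    disagreement_rate = float(np.mean(np.abs(prod - shadow) > tol))
    return {"rows": int(prod.shape[0]), "disagreement_rate": disagreement_rate}
```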
4.5 Use-Case Fit
Machine-to-machine workloads (ranking, fraud scoring, daily embedding refresh) are latency-tolerant relative to human perception; per-request real-time serving is over-engineered for them.
5. Micro-Batch Inference Architecture

Key components
Kafka: durable buffer that evens out traffic spikes.
Spark Structured Streaming (micro-batch): reads events, aggregates until a size or timeout threshold is hit, and forwards the batch to Triton.
NVIDIA Triton Inference Server (GPU): executes the model with dynamic batching and exposes Prometheus-ready metrics.
Result DB / cache: downstream services fetch predictions from here.
Grafana / OpenTelemetry / Alertmanager: monitor per-batch latency and queue depth.
A minimal sketch of the coalescing step follows the component list.
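The sketch below wires these components together, continuing the `requests` stream from the ingestion sketch in Section 4.3. The model name, tensor names, endpoint, and checkpoint path are hypothetical; it assumes the payload has already been parsed into a `features` array column, and error handling plus the result-DB write are omitted.
```python
import numpy as np
import tritonclient.http as triton

def score_batch(batch_df, batch_id):
    """Invoked once per micro-batch by Structured Streaming (foreachBatch)."""
    rows = batch_df.select("features").toPandas()
    if rows.empty:
        return
    features = np.array(rows["features"].tolist(), dtype=np.float32)

    client = triton.InferenceServerClient(url="triton:8000")          # assumed Triton endpoint
    inp = triton.InferInput("INPUT__0", list(features.shape), "FP32")  # hypothetical tensor name
    inp.set_data_from_numpy(features)
    out = triton.InferRequestedOutput("OUTPUT__0")                     # hypothetical tensor name
    result = client.infer(model_name="ranker", inputs=[inp], outputs=[out])
    scores = result.as_numpy("OUTPUT__0")
    # Write (request_id, score) pairs to the result DB / cache here.

(
    requests.writeStream
    .foreachBatch(score_batch)
    .trigger(processingTime="200 milliseconds")         # the 200 ms trigger from Section 3
    .option("checkpointLocation", "/chk/microbatch")    # assumed checkpoint path
    .start()
)
```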
6. Consequences & Trade-offs
Cost: fewer GPUs, 40-80% cheaper. Trade-off: slight infrastructure overhead for the batch engine. Mitigation: consolidate Spark jobs.
Latency: meets the ≤ 5 s SLO. Trade-off: not viable for sub-second UX. Mitigation: keep a slim real-time tier.
Reliability: easier retries (idempotent batches). Trade-off: larger blast radius per failure. Mitigation: circuit-break and shrink batch size progressively (sketch below).
DevX: single code path for offline and near-online serving. Trade-off: extra schema for batch inputs. Mitigation: generate stubs via protobuf.
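A minimal sketch of the "circuit-break and progressive batch size" mitigation, kept as plain Python so it stays independent of any particular serving library; the thresholds are illustrative.
```python
class ProgressiveBatcher:
    """Shrink the batch cap after failures, grow it back after successes."""

    def __init__(self, max_size=500, min_size=10, trip_after=3):
        self.max_size = max_size      # mirrors the 10-500 request range from Option B
        self.min_size = min_size
        self.trip_after = trip_after  # illustrative circuit-breaker threshold
        self.current = max_size
        self.consecutive_failures = 0

    def on_success(self):
        self.consecutive_failures = 0
        self.current = min(self.max_size, self.current * 2)   # recover gradually

    def on_failure(self):
        self.consecutive_failures += 1
        self.current = max(self.min_size, self.current // 2)  # halve the blast radius

    @property
    def circuit_open(self):
        return self.consecutive_failures >= self.trip_after
```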
7. Future Considerations
Adaptive windowing: dynamically shrink the batch window during low load to cut tail latency. Ref: NVIDIA Autotune API.
Multi-model batching: explore model-ensemble execution within the same GPU pass to further cut cost. Ref: Databricks.
Observability enhancements: per-batch lineage tags to unify with the feature store. Ref: Uber Palette.
References
Upsolver, cheat sheet on batch vs micro-batch processing.
AWS, SageMaker batch vs real-time inference trade-offs.
Google Cloud, Vertex AI batch prediction guide.
Databricks documentation, AI Functions for batch inference.
Databricks, LLM inference performance study.
NVIDIA Developer blog, Triton dynamic batching.
Adaline, batch inference cost analysis.
AWS, Bedrock batch inference pipeline.
USENIX, Sarathi paper on the throughput-latency trade-off (OSDI '24).
Uber Engineering, Michelangelo offline inference experience.
license: CC-BY-SA-4.0