
Designing Production Voice Agent Architecture: From Orchestrator to Tool Server

MARCH 20, 2026
Moon Kim

Tech Lead

A production voice agent isn't built on a single good model. LiveKit recommends 4-core/8GB servers handling 10–25 concurrent jobs as a starting point, and Vapi recommends 2–3 fallback providers for TTS/STT failures. Ultimately, orchestrator separation and recovery design matter more than raw model performance.

The Orchestrator Is a Session OS, Not a Model Wrapper

In real-time voice systems, the Orchestrator isn't just code that chains STT, LLM, and TTS in series. It must be an operational layer that centrally manages session creation, turn detection, barge-in cancellation, timeout budgets, provider health scores, and tool execution order. LiveKit runs agent servers as worker pools with separate processes per job for this reason — clear session boundaries prevent one call's delay or crash from contaminating another call's quality.
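The session lifecycle described above can be sketched in a few lines. This is a minimal illustration, not LiveKit's actual API: the class and field names (`Session`, `Orchestrator`, `turn_deadline`) are assumptions chosen for clarity, and a real worker pool would isolate sessions in separate processes rather than a dict.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum, auto


class TurnState(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


@dataclass
class Session:
    """One call's state, isolated from every other call."""
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: TurnState = TurnState.LISTENING
    turn_deadline: float = 0.0  # monotonic timestamp for the turn's time budget
    cancelled: bool = False     # set on barge-in

    def begin_turn(self, budget_s: float = 8.0) -> None:
        self.state = TurnState.THINKING
        self.cancelled = False
        self.turn_deadline = time.monotonic() + budget_s

    def barge_in(self) -> None:
        # User started talking over TTS: cancel the in-flight turn.
        self.cancelled = True
        self.state = TurnState.LISTENING

    def budget_exceeded(self) -> bool:
        return time.monotonic() > self.turn_deadline


class Orchestrator:
    """Owns sessions; a delay or crash in one session never touches another."""
    def __init__(self) -> None:
        self.sessions = {}  # session_id -> Session

    def create_session(self) -> Session:
        s = Session()
        self.sessions[s.session_id] = s
        return s
```

The point of the sketch is the boundary: barge-in cancels only its own session's turn, and the budget is tracked per session rather than globally.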

STT Provider, LLM Router, and TTS Engine Must Be Swappable

In production, provider outages arrive before model quality issues. STT Provider, LLM Router, and TTS Engine should be abstracted not by identical request/response types, but by streaming event interfaces and common error codes. Vapi supports voice fallback and transcriber fallback separately and recommends 2–3 different providers for each. Beyond the primary route, the Orchestrator must know the degraded-mode path, which stages are retryable, and how much voice quality degradation is acceptable.
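One way to express that abstraction is a streaming protocol plus a common error type, with a fallback chain on top. This is a hedged sketch, not Vapi's or LiveKit's implementation; the names (`TTSEngine`, `FallbackTTS`, `ProviderError`) are illustrative:

```python
from typing import Iterator, Protocol


class TTSEngine(Protocol):
    """Providers are abstracted by a streaming event interface,
    not by provider-specific request/response types."""
    name: str
    def synthesize(self, text: str) -> Iterator[bytes]: ...


class ProviderError(Exception):
    """Common error code so the orchestrator treats all providers uniformly."""


class FallbackTTS:
    """Try providers in order; degrade instead of failing the call."""
    def __init__(self, providers: list) -> None:
        self.providers = providers  # primary first, 2-3 total

    def synthesize(self, text: str) -> Iterator[bytes]:
        last = None
        for p in self.providers:
            try:
                # NOTE: a real implementation must handle mid-stream failure
                # (buffering or re-synthesizing) so partial audio isn't mixed.
                yield from p.synthesize(text)
                return
            except ProviderError as e:
                last = e  # record and move to the next provider
        raise ProviderError(f"all providers failed: {last}")
```

Because every provider raises the same error type, adding a third fallback is a list entry, not a new code path.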

Tool Server: Transaction Boundaries Before Function Calls

From the moment the LLM selects a tool, the problem shifts from natural language to transaction design. The Tool Server must validate function schemas, check permissions, handle idempotency keys, manage timeouts, and determine retry eligibility — returning results as structured JSON. Without separating read-only queries from write operations, you'll immediately face duplicate bookings, incorrect CRM updates, and payment re-invocations. Vapi's custom tool and code tool patterns signal that this boundary should be extracted into an independent service.
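A minimal sketch of that boundary, assuming nothing about Vapi's actual tool API: the `ToolServer`, its `writes` flag, and the idempotency-key cache are illustrative names. The key behaviors are the ones the paragraph names: writes require an idempotency key, duplicates replay the cached result, and every response is structured JSON.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Tool:
    fn: Callable[..., Any]
    writes: bool  # writes need idempotency; reads are safely retryable


class ToolServer:
    """Validates calls, enforces idempotency for writes, returns structured JSON."""
    def __init__(self) -> None:
        self.tools = {}   # name -> Tool
        self._done = {}   # idempotency_key -> cached JSON result

    def register(self, name: str, fn: Callable[..., Any], writes: bool) -> None:
        self.tools[name] = Tool(fn, writes)

    def call(self, name: str, args: dict, idempotency_key: Optional[str] = None) -> str:
        tool = self.tools.get(name)
        if tool is None:
            return json.dumps({"ok": False, "error": "unknown_tool"})
        if tool.writes:
            if idempotency_key is None:
                return json.dumps({"ok": False, "error": "idempotency_key_required"})
            if idempotency_key in self._done:
                # Duplicate invocation (e.g. LLM retry): replay, don't re-execute.
                return self._done[idempotency_key]
        result = json.dumps({"ok": True, "result": tool.fn(**args)})
        if tool.writes and idempotency_key:
            self._done[idempotency_key] = result
        return result
```

The duplicate-booking failure mode disappears structurally: a retried `book` call with the same key returns the first result without touching the backend again.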

State Store: Execution State, Not Conversation Logs

As Pipecat's context aggregator demonstrates, voice agent state extends far beyond simple transcripts. User utterances, the assistant's actual TTS output, tool results, summarized context, incomplete tasks, and interruption flags must be managed together for the next turn to be accurate. The State Store should separate hot session state from durable event logs. Short-term conversational memory belongs in low-latency storage, while long-term history should be loaded asynchronously as summaries to reduce both token costs and recovery time.
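The hot-state/durable-log split can be sketched like this. It is an illustration under stated assumptions, not Pipecat's implementation: `StateStore` keeps hot state in memory and appends every turn to an event log (a list here, standing in for an append-only store), so a crashed session can be rebuilt from the log.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hot, low-latency per-session state needed for the next turn."""
    transcript: list = field(default_factory=list)  # user + assistant turns
    summary: str = ""             # summarized long-term context, loaded async
    pending_tasks: list = field(default_factory=list)
    interrupted: bool = False     # barge-in flag for the last assistant turn


class StateStore:
    """Hot state in memory; every change also appended to a durable event log."""
    def __init__(self) -> None:
        self.hot = {}        # session_id -> SessionState
        self.event_log = []  # stand-in for an append-only durable store

    def append_turn(self, session_id: str, role: str, text: str) -> None:
        state = self.hot.setdefault(session_id, SessionState())
        turn = {"role": role, "text": text, "ts": time.time()}
        state.transcript.append(turn)
        self.event_log.append(json.dumps({"session": session_id, **turn}))

    def recover(self, session_id: str) -> SessionState:
        """Rebuild hot state from the durable log after a crash."""
        state = SessionState()
        for line in self.event_log:
            ev = json.loads(line)
            if ev["session"] == session_id:
                state.transcript.append(
                    {"role": ev["role"], "text": ev["text"], "ts": ev["ts"]})
        return state
```

Recovery time then depends on log replay (or, in a fuller design, on loading a summary checkpoint) rather than on the session surviving in memory.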

Error Handling and Observability Belong in the Same Design Document

Defining retry policies without tracing makes root cause analysis impossible. The Orchestrator, STT Provider, LLM Router, Tool Server, and TTS Engine should be linked by per-turn traces, with each hop logging latency, partial transcripts, tool arguments, and provider error codes. LiveKit assumes log collection, health endpoints, and graceful draining; Vapi Evals enables mock conversations and tool-call verification as pre-deployment quality gates. The key rule: voice output favors fallback over retry, while tool write operations favor idempotency over fallback.
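The per-turn trace and the closing rule can both be made concrete with a small sketch. The names (`TurnTrace`, `recovery_policy`) and the span shape are assumptions for illustration; a production system would emit these spans to a real tracing backend:

```python
import time
from contextlib import contextmanager


class TurnTrace:
    """One trace per turn; every hop records latency, attributes, outcome."""
    def __init__(self, turn_id: str) -> None:
        self.turn_id = turn_id
        self.spans = []

    @contextmanager
    def span(self, hop: str, **attrs):
        start = time.monotonic()
        record = {"hop": hop, **attrs}
        try:
            yield record  # hops attach partial transcripts, tool args, error codes
            record["status"] = "ok"
        except Exception as e:
            record["status"] = "error"
            record["error"] = type(e).__name__
            raise
        finally:
            record["latency_ms"] = (time.monotonic() - start) * 1000
            self.spans.append(record)


def recovery_policy(hop: str, is_write: bool) -> str:
    """Voice output prefers fallback over retry; tool writes prefer
    idempotent retry over switching providers."""
    if hop == "tts":
        return "fallback"
    if hop == "tool" and is_write:
        return "retry_with_idempotency_key"
    return "retry"
```

Encoding the rule as a function rather than tribal knowledge means the same decision is made at 3 a.m. during an outage as in the design review.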

BringTalk Designs Context Injection Within Zero Retention Boundaries

BringTalk's design principle places session decisions in the Orchestrator, injects customer state via Context Injection only when needed, and never pushes sensitive information outside Zero Retention boundaries. In time-critical scenarios like LQA and FUA, CRM events and recent behavioral data are tightly connected, but raw PII and internal identifiers are masked behind the Tool Server or substituted with lookup tokens. This ensures policies persist across model changes, and swapping models doesn't require rewriting the entire voice pipeline.
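The lookup-token pattern mentioned above can be sketched as follows. This is not BringTalk's actual implementation; `ContextInjector`, the token format, and the field names are illustrative assumptions. The invariant is what matters: raw PII stays in a server-side vault, and only tokens cross into the model's context.

```python
import uuid


class ContextInjector:
    """Inject customer state per turn; raw PII never crosses the boundary.
    It is replaced with lookup tokens only the Tool Server can resolve."""
    def __init__(self) -> None:
        self._vault = {}  # token -> raw value, stays inside the boundary

    def tokenize(self, value: str) -> str:
        token = f"tok_{uuid.uuid4().hex[:8]}"
        self._vault[token] = value
        return token

    def resolve(self, token: str) -> str:
        # Called by the Tool Server, never exposed to the model.
        return self._vault[token]

    def build_context(self, customer: dict) -> dict:
        """Only non-sensitive fields and tokens leave the boundary."""
        return {
            "tier": customer["tier"],                   # non-sensitive: as-is
            "phone": self.tokenize(customer["phone"]),  # PII: token only
            "recent_events": customer["recent_events"],
        }
```

Because the model only ever sees `tok_…` values, swapping the model (or its vendor) changes nothing about what data leaves the Zero Retention boundary.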

📌 Key metrics: start with 4-core/8GB handling 10–25 concurrent jobs, begin autoscaling near load_threshold 0.5 rather than 0.7, and default to 2–3 different providers for voice/transcription fallback
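Those defaults can live in one config block. The dictionary keys and the `should_scale_out` helper below are illustrative (not LiveKit's `WorkerOptions` or any real API); only the numbers come from the recommendations above:

```python
# Illustrative capacity defaults following the numbers above; the key
# names are assumptions, not any framework's actual configuration schema.
WORKER_CONFIG = {
    "cpu_cores": 4,
    "memory_gb": 8,
    "max_concurrent_jobs": 25,          # start in the 10-25 range
    "load_threshold": 0.5,              # request capacity earlier than 0.7
    "fallback_providers": {"tts": 3, "stt": 3},  # 2-3 per modality
}


def should_scale_out(current_load: float, cfg: dict = WORKER_CONFIG) -> bool:
    """Mark the worker full and request capacity before saturation."""
    return current_load >= cfg["load_threshold"]
```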
