Once OpenClaw Gateway is up, the next production question is usually how to rein in cost and latency together. modelRouting tiers traffic by estimated context size before the upstream call, instead of always paying top-tier model prices. This guide explains what problem it solves, how it sits beside agents.defaults and fallbacks inside openclaw.json, how to map SLOs to maxTokens ladders, and closes with a six-step rollout checklist plus misconfiguration triage, complementing the install, systemd, and Docker articles on this site.
In production, OpenClaw requests often carry system prompts, chat history, tool outputs, and RAG chunks together. Feeding everything to one flagship model forever blows up bills and tail latency; relying only on post-failure fallbacks means you already burned a huge context before you learn it was the wrong path. modelRouting estimates context token size before the upstream inference and picks a tier so “small questions get small models” by default—not after the fact.
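The pre-call decision is essentially a threshold walk over a tier ladder. A minimal sketch in Python, assuming a crude 4-characters-per-token estimate; the `Tier` type, `pick_model` helper, and the exact ladder are illustrative, not OpenClaw internals (the model IDs mirror the example config later in this article):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tier:
    max_tokens: Optional[int]  # None = unbounded top tier
    model: str

# Hypothetical ladder mirroring the thresholds in the example config below.
TIERS = [
    Tier(4_000, "anthropic/claude-haiku-4-5"),
    Tier(100_000, "anthropic/claude-sonnet-4-5"),
    Tier(None, "anthropic/claude-opus-4-5"),
]

def estimate_tokens(parts: list[str]) -> int:
    """Crude pre-call estimate: roughly 4 characters per token."""
    return sum(len(p) for p in parts) // 4

def pick_model(parts: list[str]) -> str:
    """Walk the ladder and return the first tier that fits the estimate."""
    est = estimate_tokens(parts)
    for tier in TIERS:
        if tier.max_tokens is None or est <= tier.max_tokens:
            return tier.model
    return TIERS[-1].model

print(pick_model(["short question"]))  # small context lands on the light tier
```

The point of the sketch: the decision happens before any upstream call, so a small question never pays for a flagship context window.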
Six pain signals teams see most often—if several hit, put routing on the config review agenda instead of only staring at Grafana:
- Long-tail latency: p95/p99 pulls away from the mean at the same QPS and tracks conversation length—heavy context paths are overused.
- Nonlinear spend: traffic up 30%, bill up 100%—often “every session defaults to the biggest model.”
- Tool calls inflate context: multi-hop tool output in one turn spikes tokens, causing silent truncation or surprise retries.
- Fallback chains too long: users feel nothing, but you chained models on one request—latency and cost stack.
- No routing observability: you only log the final model name, not why that tier was chosen—triage becomes guesswork.
- Weak multi-tenant isolation: heavy sessions on a shared Gateway drag light-session SLOs—needs a hard gate by context shape.
After the site’s OpenClaw install/deploy series you should already have “process stays up, ports/tunnels healthy.” This article covers model selection inside that same long-lived process. It is orthogonal to remote execution (self-hosted runners or dedicated remote Macs): routing picks which brain; the executor layer picks which machine runs the work.
Another myth: modelRouting is “another load balancer.” It is closer to context-shape routing—estimate size, then pick a model—not random round-robin, or you get clever-looking traces with painfully honest invoices.
They are not mutually exclusive, but separate the jobs: fallbacks fit failure semantics—model unavailable, errors, rate limits; modelRouting fits cost/latency semantics—how heavy this turn is. If you blur them, you get “route picked the big model, then failure fell back to the small model”—paying twice for drama.
| Dimension | primary + fallbacks (classic) | modelRouting (context tiers) |
|---|---|---|
| Trigger | Error codes, timeouts, retryable failures | Estimated context token thresholds (e.g., context-size strategy) |
| Main win | Availability: rescue from a bad model | Efficiency: light chats do not pay flagship prices |
| Typical risk | Long chains inflate tail latency and double-bill | Bad thresholds mis-classify heavy vs light |
| Observability | Failure rates, retries, why we switched | Route hit mix, errors near thresholds, token percentiles |
| agents.defaults | Declare primary + fallback list | Add routing block under defaults to split before the call |
Write “swap on failure” and “pick before failure” on two different pages—your on-call will thank you.
Log routing decisions structurally (tier hit, estimated token band, final model ID); otherwise prod only shows the final model and you cannot review thresholds. The six steps below make that a release gate.
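One way to make decisions reviewable is a single structured log line per routed request. A sketch, where the field names (`event`, `token_band`, and so on) are illustrative and not an OpenClaw log schema:

```python
import json

def route_decision_record(request_id: str, estimated_tokens: int,
                          tier: str, model: str, reason: str) -> str:
    """Serialize one routing decision as a JSON log line.

    Field names are illustrative, not an OpenClaw log schema.
    """
    # Bucket the estimate into a coarse band so dashboards can group it.
    band_ceiling = ((estimated_tokens // 1000) + 1) * 1000
    return json.dumps({
        "event": "route_decision",  # distinguishes route hits from failure swaps
        "request_id": request_id,
        "estimated_tokens": estimated_tokens,
        "token_band": f"<={band_ceiling}",
        "tier": tier,
        "model": model,
        "reason": reason,
    })

print(route_decision_record("req-42", 3200, "light",
                            "anthropic/claude-haiku-4-5",
                            "context-size under 4000"))
```

With a dedicated `event` field, threshold reviews can filter route hits apart from failure-driven swaps instead of reverse-engineering them from the final model name.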
For engineers who can already ship config changes—each step has an artifact so modelRouting does not become a one-off JSON doodle.
1. Freeze SLO language: target p95 latency, per-session cost ceiling, and assumed share of “heavy” sessions. No SLO, no serious thresholds.
2. Sample token distributions: use real chats and tool outputs—including tails—not just average session length.
3. Sketch three tiers: light/medium/heavy model IDs and explicit tasks that must never land on the light tier (e.g., multi-hop tools).
4. Wire modelRouting + telemetry: log hits, estimated tokens, final model to structured logs and your metrics stack.
5. Canary with control: dual-run old vs new on a slice, watch cost and latency percentiles, then promote.
6. Rollback switch: keep a snapshot to return to “defaults + short fallback chain” if routing misfires.
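Step 2 can be grounded with a quick percentile pass over sampled sessions before you pick any maxTokens values. A sketch, assuming you have already exported per-turn token estimates (the sample values here are invented):

```python
# Hypothetical per-turn token estimates sampled from real sessions,
# deliberately including tool-output tails.
samples = [900, 1_200, 2_800, 3_500, 4_100, 7_900, 15_000, 62_000, 180_000]

def percentile(data: list[int], p: float) -> int:
    """Nearest-rank percentile over the sorted data."""
    data = sorted(data)
    k = max(0, min(len(data) - 1, round(p / 100 * (len(data) - 1))))
    return data[k]

# Candidate ladder: light cutoff near p50, medium cutoff near p95,
# everything above that goes to the heavy tier.
print("p50:", percentile(samples, 50))
print("p95:", percentile(samples, 95))
```

Deriving cutoffs from measured percentiles, rather than round numbers, is what ties the threshold ladder back to the SLO language frozen in step 1.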
```json
{
  "agents": {
    "defaults": {
      "model": { "primary": "anthropic/claude-sonnet-4-5" },
      "modelRouting": {
        "enabled": true,
        "strategy": "context-size",
        "thresholds": [
          { "maxTokens": 4000, "model": "anthropic/claude-haiku-4-5", "description": "light" },
          { "maxTokens": 100000, "model": "anthropic/claude-sonnet-4-5", "description": "medium" },
          { "maxTokens": null, "model": "anthropic/claude-opus-4-5", "description": "xlarge context" }
        ],
        "fallbackOnOverflow": true
      }
    }
  }
}
```
Note: This shows shape and semantics; real keys/defaults must match your OpenClaw version. Diff configs and run integration fixtures before upgrading Gateway.
A useful mental model: defaults declares the primary model and general fallbacks; modelRouting (per your OpenClaw version) performs context-based splitting in cooperation with defaults; fallbacks still handle upstream failures. In staging, verify three things: routing should not thrash models on healthy paths (if it does, thresholds are too tight); fallbacks after routing still behave; logs separate route hits from failure swaps.
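The “no thrashing on healthy paths” check is cheap to automate against recorded transcripts. A sketch where `route` is a stand-in for the Gateway decision (thresholds mirror the example config) and the per-turn estimates are hypothetical fixture data:

```python
from collections import Counter

def route(estimated_tokens: int) -> str:
    # Stand-in for the Gateway decision; thresholds mirror the example config.
    if estimated_tokens <= 4_000:
        return "light"
    if estimated_tokens <= 100_000:
        return "medium"
    return "heavy"

def tier_mix(transcript_estimates: list[int]) -> Counter:
    """Replay a recorded session's per-turn estimates and count tier hits."""
    return Counter(route(t) for t in transcript_estimates)

# A healthy light session should not oscillate between tiers turn to turn.
mix = tier_mix([800, 1_100, 1_400, 3_900])
assert set(mix) == {"light"}, f"unexpected thrash: {mix}"
```

If a session you believe is light produces a mixed tier count, the thresholds are too tight relative to that session's real token distribution.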
With remote compute, a common topology is Gateway on Linux VPS or containers while heavy toolchains or macOS-only steps go through a queue to dedicated remote Mac executors. modelRouting only tiers inference inside Gateway—it does not replace cross-machine scheduling (still your queue/runner problem).
For multi-tenant agents on one Gateway, give tenants distinct routing profiles or keys—otherwise a heavy tenant’s context estimate raises the waterline for everyone.
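Per-tenant isolation can be as simple as keying the threshold ladder by tenant. A sketch under stated assumptions: the profile shapes, tenant keys, and short model names here are placeholders, not an OpenClaw schema:

```python
# Hypothetical per-tenant ladders as (max_tokens, model) pairs;
# None marks the unbounded top tier. Names are placeholders.
PROFILES = {
    "tenant-light": [(4_000, "haiku"), (None, "sonnet")],
    "tenant-heavy": [(16_000, "sonnet"), (None, "opus")],
}

def pick(tenant: str, est_tokens: int) -> str:
    """Resolve the tenant's own ladder, falling back to the light profile."""
    ladder = PROFILES.get(tenant, PROFILES["tenant-light"])
    for max_tokens, model in ladder:
        if max_tokens is None or est_tokens <= max_tokens:
            return model
    return ladder[-1][1]
```

The design point: a heavy tenant's generous cutoffs live only in that tenant's ladder, so they cannot raise the waterline for everyone sharing the Gateway.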
Warning: Treat fallbackOnOverflow as “context does not fit the model,” not a “save money” knob—misuse invites silent truncation or hidden retries.
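The distinction can be made concrete as a fit check rather than a cost rule. A sketch with hypothetical helper names and a made-up output-headroom default:

```python
def fits(est_tokens: int, context_window: int,
         reserve_for_output: int = 2_000) -> bool:
    """Overflow check: does the estimated context, plus output headroom, fit?"""
    return est_tokens + reserve_for_output <= context_window

def escalate_if_overflow(est_tokens: int, model: str,
                         context_window: int, bigger_model: str) -> str:
    # Escalate only when the context genuinely does not fit the chosen
    # model, never as a cost optimization.
    return model if fits(est_tokens, context_window) else bigger_model
```

If the guard triggers often on a tier, that is a signal to revisit the tier's maxTokens cutoff, not a reason to lean on overflow fallback as routing policy.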
Use this triage rule for fast on-call work: if estimated tokens and provider bills diverge wildly, check whether tool outputs are excluded from estimation or whether routing logs are being sampled away.
Running Gateway on a throwaway laptop or a host without capacity guarantees will wreck p95 even with perfect routing; without an exclusive, always-on macOS execution plane you can contract against, macOS-only toolchains and local build steps resist automation. Teams running OpenClaw alongside iOS/macOS builds, CI, or agents under one long-lived production SLO usually stabilize faster by placing heavy execution on dedicated remote Mac nodes instead of perpetual throwaway environments. Balancing routing policy with executor economics, NodeMini Mac Mini cloud rental fits as a base: tier inference with modelRouting in Gateway, land heavy toolchains on dedicated nodes, and encode keys and capacity in your runbooks.
modelRouting tiers before the upstream call using estimated context for cost/latency; fallbacks usually react to failures. They can coexist—define boundaries. Browse more OpenClaw posts via the category filter.
Replay real transcripts with fixtures in staging, verify route hits, then canary while watching token and latency percentiles. If you need parallel compute, align capacity using the pricing page for remote Mac executor nodes.
Those guides cover daemons and exposure; this article covers in-Gateway routing. Stabilize deployment, then tighten openclaw.json. For connectivity and permissions, see the help center.