Once OpenClaw Gateway is up, the next production question is usually how to rein in cost and latency together. modelRouting tiers traffic by estimated context size before the upstream call, instead of always paying top-tier model prices. This guide explains what problem it solves, how it sits beside agents.defaults and fallbacks inside openclaw.json, how to map SLOs to maxTokens ladders, and closes with a six-step rollout checklist plus misconfiguration triage, complementing the install, systemd, and Docker articles on this site.
In production, OpenClaw requests often carry system prompts, chat history, tool outputs, and RAG chunks together. Feeding everything to one flagship model forever blows up bills and tail latency; relying only on post-failure fallbacks means you already burned a huge context before you learn it was the wrong path. modelRouting estimates context token size before the upstream inference and picks a tier so “small questions get small models” by default—not after the fact.
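The pre-call decision is essentially a threshold walk over a tier ladder. A minimal sketch in Python, assuming a crude 4-characters-per-token estimate; the `Tier` type, `pick_model` helper, and the exact ladder are illustrative, not OpenClaw internals (the model IDs mirror the example config later in this article):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tier:
    max_tokens: Optional[int]  # None = unbounded top tier
    model: str

# Hypothetical ladder mirroring the thresholds in the example config below.
TIERS = [
    Tier(4_000, "anthropic/claude-haiku-4-5"),
    Tier(100_000, "anthropic/claude-sonnet-4-5"),
    Tier(None, "anthropic/claude-opus-4-5"),
]

def estimate_tokens(parts: list[str]) -> int:
    """Crude pre-call estimate: roughly 4 characters per token."""
    return sum(len(p) for p in parts) // 4

def pick_model(parts: list[str]) -> str:
    """Walk the ladder and return the first tier that fits the estimate."""
    est = estimate_tokens(parts)
    for tier in TIERS:
        if tier.max_tokens is None or est <= tier.max_tokens:
            return tier.model
    return TIERS[-1].model

print(pick_model(["short question"]))  # small context lands on the light tier
```

The point of the sketch: the decision happens before any upstream call, so a small question never pays for a flagship context window.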
Six pain signals teams see most often—if several hit, put routing on the config review agenda instead of only staring at Grafana:
- Long-tail latency: p95/p99 pulls away from the mean at the same QPS and tracks conversation length—heavy context paths are overused.
- Nonlinear spend: traffic up 30%, bill up 100%—often “every session defaults to the biggest model.”
- Tool calls inflate context: multi-hop tool output in one turn spikes tokens, causing silent truncation or surprise retries.
- Fallback chains too long: users feel nothing, but you chained models on one request—latency and cost stack.
- No routing observability: you only log the final model name, not why that tier was chosen—triage becomes guesswork.
- Weak multi-tenant isolation: heavy sessions on a shared Gateway drag light-session SLOs—needs a hard gate by context shape.
After the site’s OpenClaw install/deploy series you should already have “process stays up, ports/tunnels healthy.” This article covers model selection inside that same long-lived process. It is orthogonal to remote execution (self-hosted runners or dedicated remote Macs): routing picks which brain; the executor layer picks which machine runs the work.
Another myth: modelRouting is “another load balancer.” It is closer to context-shape routing—estimate size, then pick a model—not random round-robin, or you get clever-looking traces with painfully honest invoices.
They are not mutually exclusive, but separate the jobs: fallbacks fit failure semantics—model unavailable, errors, rate limits; modelRouting fits cost/latency semantics—how heavy this turn is. If you blur them, you get “route picked the big model, then failure fell back to the small model”—paying twice for drama.
| Dimension | primary + fallbacks (classic) | modelRouting (context tiers) |
|---|---|---|
| Trigger | Error codes, timeouts, retryable failures | Estimated context token thresholds (e.g., context-size strategy) |
| Main win | Availability: rescue from a bad model | Efficiency: light chats do not pay flagship prices |
| Typical risk | Long chains inflate tail latency and double-bill | Bad thresholds mis-classify heavy vs light |
| Observability | Failure rates, retries, why we switched | Route hit mix, errors near thresholds, token percentiles |
| agents.defaults | Declare primary + fallback list | Add routing block under defaults to split before the call |
Write “swap on failure” and “pick before failure” on two different pages—your on-call will thank you.
Log routing decisions structurally (tier hit, estimated token band, final model ID); otherwise prod only shows the final model and you cannot review thresholds. The six steps below make that a release gate.
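One way to make decisions reviewable is a single structured log line per routed request. A sketch, where the field names (`event`, `token_band`, and so on) are illustrative and not an OpenClaw log schema:

```python
import json

def route_decision_record(request_id: str, estimated_tokens: int,
                          tier: str, model: str, reason: str) -> str:
    """Serialize one routing decision as a JSON log line.

    Field names are illustrative, not an OpenClaw log schema.
    """
    # Bucket the estimate into a coarse band so dashboards can group it.
    band_ceiling = ((estimated_tokens // 1000) + 1) * 1000
    return json.dumps({
        "event": "route_decision",  # distinguishes route hits from failure swaps
        "request_id": request_id,
        "estimated_tokens": estimated_tokens,
        "token_band": f"<={band_ceiling}",
        "tier": tier,
        "model": model,
        "reason": reason,
    })

print(route_decision_record("req-42", 3200, "light",
                            "anthropic/claude-haiku-4-5",
                            "context-size under 4000"))
```

With a dedicated `event` field, threshold reviews can filter route hits apart from failure-driven swaps instead of reverse-engineering them from the final model name.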
For engineers who can already ship config changes—each step has an artifact so modelRouting does not become a one-off JSON doodle.
1. Freeze SLO language: target p95 latency, per-session cost ceiling, and assumed share of “heavy” sessions. No SLO, no serious thresholds.
2. Sample token distributions: use real chats and tool outputs—including tails—not just average session length.
3. Sketch three tiers: light/medium/heavy model IDs and explicit tasks that must never land on the light tier (e.g., multi-hop tools).
4. Wire modelRouting + telemetry: log hits, estimated tokens, final model to structured logs and your metrics stack.
5. Canary with control: dual-run old vs new on a slice, watch cost and latency percentiles, then promote.
6. Rollback switch: keep a snapshot to return to “defaults + short fallback chain” if routing misfires.
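Step 2 can be grounded with a quick percentile pass over sampled sessions before you pick any maxTokens values. A sketch, assuming you have already exported per-turn token estimates (the sample values here are invented):

```python
# Hypothetical per-turn token estimates sampled from real sessions,
# deliberately including tool-output tails.
samples = [900, 1_200, 2_800, 3_500, 4_100, 7_900, 15_000, 62_000, 180_000]

def percentile(data: list[int], p: float) -> int:
    """Nearest-rank percentile over the sorted data."""
    data = sorted(data)
    k = max(0, min(len(data) - 1, round(p / 100 * (len(data) - 1))))
    return data[k]

# Candidate ladder: light cutoff near p50, medium cutoff near p95,
# everything above that goes to the heavy tier.
print("p50:", percentile(samples, 50))
print("p95:", percentile(samples, 95))
```

Deriving cutoffs from measured percentiles, rather than round numbers, is what ties the threshold ladder back to the SLO language frozen in step 1.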
```json
{
  "agents": {
    "defaults": {
      "model": { "primary": "anthropic/claude-sonnet-4-5" },
      "modelRouting": {
        "enabled": true,
        "strategy": "context-size",
        "thresholds": [
          { "maxTokens": 4000, "model": "anthropic/claude-haiku-4-5", "description": "light" },
          { "maxTokens": 100000, "model": "anthropic/claude-sonnet-4-5", "description": "medium" },
          { "maxTokens": null, "model": "anthropic/claude-opus-4-5", "description": "xlarge context" }
        ],
        "fallbackOnOverflow": true
      }
    }
  }
}
```
Note: This shows shape and semantics; real keys/defaults must match your OpenClaw version. Diff configs and run integration fixtures before upgrading Gateway.
A useful mental model: defaults declares the primary model and general fallbacks; modelRouting (per your OpenClaw version) performs context-based splitting in cooperation with defaults; fallbacks still handle upstream failures. In staging, verify three things: routing should not thrash models on healthy paths (if it does, thresholds are too tight); fallbacks after routing still behave; logs separate route hits from failure swaps.
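The “no thrashing on healthy paths” check is cheap to automate against recorded transcripts. A sketch where `route` is a stand-in for the Gateway decision (thresholds mirror the example config) and the per-turn estimates are hypothetical fixture data:

```python
from collections import Counter

def route(estimated_tokens: int) -> str:
    # Stand-in for the Gateway decision; thresholds mirror the example config.
    if estimated_tokens <= 4_000:
        return "light"
    if estimated_tokens <= 100_000:
        return "medium"
    return "heavy"

def tier_mix(transcript_estimates: list[int]) -> Counter:
    """Replay a recorded session's per-turn estimates and count tier hits."""
    return Counter(route(t) for t in transcript_estimates)

# A healthy light session should not oscillate between tiers turn to turn.
mix = tier_mix([800, 1_100, 1_400, 3_900])
assert set(mix) == {"light"}, f"unexpected thrash: {mix}"
```

If a session you believe is light produces a mixed tier count, the thresholds are too tight relative to that session's real token distribution.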
With remote compute, a common topology is Gateway on Linux VPS or containers while heavy toolchains or macOS-only steps go through a queue to dedicated remote Mac executors. modelRouting only tiers inference inside Gateway—it does not replace cross-machine scheduling (still your queue/runner problem).
For multi-tenant agents on one Gateway, give tenants distinct routing profiles or keys—otherwise a heavy tenant’s context estimate raises the waterline for everyone.
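Per-tenant isolation can be as simple as keying the threshold ladder by tenant. A sketch under stated assumptions: the profile shapes, tenant keys, and short model names here are placeholders, not an OpenClaw schema:

```python
# Hypothetical per-tenant ladders as (max_tokens, model) pairs;
# None marks the unbounded top tier. Names are placeholders.
PROFILES = {
    "tenant-light": [(4_000, "haiku"), (None, "sonnet")],
    "tenant-heavy": [(16_000, "sonnet"), (None, "opus")],
}

def pick(tenant: str, est_tokens: int) -> str:
    """Resolve the tenant's own ladder, falling back to the light profile."""
    ladder = PROFILES.get(tenant, PROFILES["tenant-light"])
    for max_tokens, model in ladder:
        if max_tokens is None or est_tokens <= max_tokens:
            return model
    return ladder[-1][1]
```

The design point: a heavy tenant's generous cutoffs live only in that tenant's ladder, so they cannot raise the waterline for everyone sharing the Gateway.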
Warning: Treat fallbackOnOverflow as “context does not fit the model,” not a “save money” knob—misuse invites silent truncation or hidden retries.
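The distinction can be made concrete as a fit check rather than a cost rule. A sketch with hypothetical helper names and a made-up output-headroom default:

```python
def fits(est_tokens: int, context_window: int,
         reserve_for_output: int = 2_000) -> bool:
    """Overflow check: does the estimated context, plus output headroom, fit?"""
    return est_tokens + reserve_for_output <= context_window

def escalate_if_overflow(est_tokens: int, model: str,
                         context_window: int, bigger_model: str) -> str:
    # Escalate only when the context genuinely does not fit the chosen
    # model, never as a cost optimization.
    return model if fits(est_tokens, context_window) else bigger_model
```

If the guard triggers often on a tier, that is a signal to revisit the tier's maxTokens cutoff, not a reason to lean on overflow fallback as routing policy.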
Use this triage rule for fast on-call work: if estimated tokens and provider bills diverge wildly, check whether tool outputs are excluded from estimation or whether routing logs are being sampled away.
Running Gateway on a throwaway laptop or a host without capacity guarantees will wreck p95 even with perfect routing; without an exclusive, always-on macOS execution plane you can contract against, macOS-only toolchains and local build steps resist automation. Teams running OpenClaw alongside iOS/macOS builds, CI, or agents under one long-lived production SLO usually stabilize faster by placing heavy execution on dedicated remote Mac nodes instead of perpetual throwaway environments. Balancing routing policy with executor economics, NodeMini Mac Mini cloud rental fits as a base: tier inference with modelRouting in Gateway, land heavy toolchains on dedicated nodes, and encode keys and capacity in your runbooks.
modelRouting tiers before the upstream call using estimated context for cost/latency; fallbacks usually react to failures. They can coexist—define boundaries. Browse more OpenClaw posts via the category filter.
Replay real transcripts with fixtures in staging, verify route hits, then canary while watching token and latency percentiles. If you need parallel compute, align capacity using the pricing page for remote Mac executor nodes.
Those guides cover daemons and exposure; this article covers in-Gateway routing. Stabilize deployment, then tighten openclaw.json. For connectivity and permissions, see the help center.