Installing OpenClaw Gateway is only the starting line; in production, on-call time usually goes to misleading health checks, logs you cannot find, and config drift after upgrades. This article is for teams that have finished the Linux systemd + Tunnel, Docker Compose, or three-platform install and now need minimal observability, log routing, upgrade/rollback, and a symptom table; for routing policy, continue to the modelRouting article.
Install guides prove the happy path; production faces long-tail issues such as zombie processes, port clashes, permission drift, and downstream model timeouts. The six items below are the checklist that turns on-call from guessing into inspecting.
- Health checks too loose: only the process exists, with no proof the Gateway actually routes traffic, so a half-dead state is noticed only after traffic shifts.
- Scattered logs: systemd, containers, app stdout, and the reverse proxy each log somewhere different, so a timeline cannot be rebuilt during an incident.
- Upgrades without a baseline: no record of the previous image digest or global npm version, so rollback becomes "reinstall and hope".
- Config mixed with secrets: openclaw.json and env injection fall out of sync, surfacing as intermittent 401s or silent routing failures.
- Observability lags changes: listen addresses or Tunnel targets change, but the probe paths in monitoring do not.
- Treating Gateway as a universal executor: heavy Xcode workloads on the same small VPS max out the CPU and get misread as "the model is slow".
If two or more apply, fix the minimal observability layer before feature churn; otherwise every release pays tuition on the same class of failure.
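The first failure mode above can be closed with a two-stage probe. A minimal sketch, assuming the 18789 port shown in the sanity-check example later in this article; both probe paths (`/health` and `/v1/models`) are placeholders for whatever your deployment actually exposes:

```bash
#!/usr/bin/env bash
# Two-stage probe: "process is up" and "routing works" are separate
# facts. Port 18789 matches the example later in this article; both
# probe paths below are placeholders for your deployment.
GATEWAY="${GATEWAY:-http://127.0.0.1:18789}"

probe() {
  # probe <url> -> prints the HTTP status, succeeds only on 2xx
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || code=000
  echo "$1 -> $code"
  [ "${code:0:1}" = "2" ]
}

check_gateway() {
  probe "$GATEWAY/health" || { echo "liveness FAILED"; return 1; }
  # A request that must traverse routing, not just hit the listener:
  probe "$GATEWAY/v1/models" || { echo "routing FAILED (half-dead)"; return 2; }
  echo "gateway OK"
}

# Wire check_gateway into cron or your monitor's exec probe.
```

Only when both stages pass should the monitor report green; a green liveness with a red routing stage is exactly the half-dead state described above.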
One table splits responsibilities so "we can install" and "we can stay stable" are not reviewed in the same breath.
| Topic | Install / daemon posts (systemd · Docker · three platforms) | This article (production observability and change) |
|---|---|---|
| Process and exposure | unit/Compose, loopback bind, Tunnel or firewall policy | liveness probes, port conflict checks, reprobing paths after change |
| Configuration model | first write of openclaw.json, directory permissions | diff review, backups, canary order and rollback sequence |
| Logs | getting them onto disk or collected by journald/docker first | field meanings, correlation IDs, a catalog of common error patterns |
| Upgrades | provide one copy-paste upgrade command or image pull path | record digest/version, backup point, rollback verification checklist |
| Model routing | optional mention | deep strategy in the dedicated modelRouting article |
Operability comes from the same inspection commands and the same rollback order, not from one person's memory.
The order works for systemd and Docker: confirm the facts (process, port, health endpoint), then the interpretation (logs and downstream). Commands differ slightly by distro, but checkpoints should stay the same.
1. Confirm the main process: systemd uses `systemctl status`; Docker uses `docker compose ps`; watch restart counts and exit codes.
2. Verify listening sockets: `ss -lntp` or container port maps, aligned with Tunnel/reverse-proxy targets.
3. Run health checks: HTTP probes against the documented or custom probe path; separate "process is up" from "routing works".
4. Pull recent logs: `journalctl -u` or `docker compose logs --tail=200`; fix a time window before full-text searching.
5. Validate downstream models: use the smallest possible request fixture to rule out "Gateway fine, upstream API broken".
6. Write a change record: each release notes version/digest, config diff, and probe evidence so the next on-call can continue.
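The last step can be scripted so evidence outlives the on-call shift. A sketch; the log path, config path, and field set are placeholders of my choosing, not an OpenClaw convention:

```bash
# Append one change record per release. Log and config paths are
# placeholders; keep whatever layout your team already audits.
record_change() {
  # record_change <version-or-digest> <note>
  local log="${CHANGELOG:-/var/log/openclaw/changes.log}"
  local cfg="${CONFIG:-/etc/openclaw/openclaw.json}"
  {
    echo "---"
    echo "date:    $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "release: $1"
    echo "note:    $2"
    # Hash the config so a later diff has a fixed reference point
    echo "config:  $(sha256sum "$cfg" 2>/dev/null | cut -d' ' -f1)"
  } >> "$log"
}

# Example: record_change "sha256:ab12" "gateway bump, probes green"
```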
```bash
# Example: quick sanity check (replace with your unit / container name)
systemctl status openclaw-gateway.service --no-pager || true
ss -lntp | grep -E '18789|LISTEN' || true

# Docker path (example)
# docker compose -f /opt/openclaw/docker-compose.yml ps
# docker compose -f /opt/openclaw/docker-compose.yml logs --tail=200 gateway
```
Note: with Cloudflare Tunnel, after changes validate both public probes and loopback probes on the host, to avoid false positives when the edge still caches an old route.
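Both surfaces can be probed in one pass. A sketch, assuming the same 18789 loopback bind as above; `gateway.example.com` is a placeholder for your Tunnel hostname:

```bash
# Probe the loopback bind and the public edge with the same request,
# so a stale Tunnel route cannot hide behind a green local check.
# gateway.example.com and port 18789 are placeholders.
dual_probe() {
  # dual_probe [path] -> prints one status line per surface
  local path="${1:-/health}" base code
  for base in "http://127.0.0.1:18789" "https://gateway.example.com"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$base$path") || code=000
    echo "$base$path -> $code"
  done
}

# dual_probe /health   # run after every Tunnel or listen-address change
```

A 2xx on loopback with a non-2xx on the public hostname points at the edge, not the Gateway, and saves a round of misdirected debugging.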
An upgrade you can roll back needs three things: a snapshot before release, only one change vector during release, and the same probe set after release. On Docker prefer a pinned digest or a private-registry tagging policy; on bare metal/npm lock the global package version and lockfile where applicable.
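The pre-release snapshot can be one function. A sketch; the container name, npm package name, and all paths below are assumptions drawn from typical installs, not fixed by OpenClaw:

```bash
# Record what is running *before* the upgrade so rollback is a
# restore, not a reinstall. Container name, package name, and all
# paths are placeholders.
snapshot() {
  local dir="${SNAP_DIR:-/var/backups/openclaw}/$(date -u +%Y%m%d%H%M%S)"
  mkdir -p "$dir" || return 1
  # Docker path: capture the immutable digest, not the mutable tag
  docker inspect --format '{{index .RepoDigests 0}}' openclaw-gateway \
    > "$dir/image.digest" 2>/dev/null || true
  # Bare-metal/npm path: capture the installed global version instead
  npm ls -g openclaw --depth=0 > "$dir/npm.version" 2>/dev/null || true
  cp -a "${CONFIG:-/etc/openclaw/openclaw.json}" "$dir/" 2>/dev/null || true
  echo "$dir"   # hand the snapshot path to the release record
}
```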
Canary pattern: prove on one staging host or low-traffic replica, then roll forward; if Gateway backs remote executors, use layered rollout—confirm the control plane first, then scale execution.
Warning: do not try ad-hoc routing edits in parallel without backing up openclaw.json and environment injection; production outages often come from half-applied config.
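One way to honor that warning is to never edit the live file directly. A sketch, assuming the config lives at a single known path and that `python3` is available for JSON validation; environment injection needs the same backup discipline separately:

```bash
# Edit a copy, review the diff, validate the JSON, then swap --
# the live file is never left in a half-applied state. The default
# config path is a placeholder.
edit_config_safely() {
  local cfg="${1:-/etc/openclaw/openclaw.json}" work
  work=$(mktemp) || return 1
  cp -a "$cfg" "$cfg.bak.$(date +%s)" || return 1   # restorable backup first
  cp "$cfg" "$work"
  "${EDITOR:-vi}" "$work"                           # make the change on the copy
  diff -u "$cfg" "$work" || true                    # review exactly what changed
  python3 -m json.tool "$work" >/dev/null || {      # refuse broken JSON
    echo "invalid JSON, live config untouched" >&2; return 1; }
  mv "$work" "$cfg"
}
```

Because the validation runs before the `mv`, a typo costs an error message instead of an outage.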
The guidance in the table below is directional, meant to align reviews; real timeout and quota values follow your vendor and contract.
| Symptom | Suspect first | Direction |
|---|---|---|
| Exits right after start | JSON syntax in config, missing env vars, port in use | Reproduce in foreground once, compare with install guide checkpoints |
| Intermittent 401 | Key rotation out of sync, multiple config file paths | Unify injection sources, clean stale shell profile pollution |
| CPU pegged long term | Execution load colocated with Gateway | Move heavy work to dedicated executors or a remote Mac |
| Latency spikes | Upstream throttling, DNS, TLS handshakes | Layer captures and logs; isolate network before touching models |
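For the intermittent-401 row, "unify injection sources" starts with finding every place a key is defined. A sketch; the variable name `OPENCLAW_API_KEY` and the typical file list in the usage comment are illustrative, not documented names:

```bash
# List every file that defines a given variable, so two disagreeing
# injection sources cannot hide from each other. The variable name
# and the file list in the usage comment are illustrative.
audit_key_sources() {
  # audit_key_sources <VAR_NAME> <file...>
  local var="$1"; shift
  local f
  for f in "$@"; do
    [ -f "$f" ] && grep -Hn "$var" "$f"
  done
  return 0
}

# Typical call: shell profiles plus systemd drop-ins, then compare
# with what the running process actually inherited:
# audit_key_sources OPENCLAW_API_KEY ~/.bashrc ~/.profile /etc/environment
# tr '\0' '\n' < /proc/$(pgrep -f openclaw-gateway)/environ | grep OPENCLAW
```

Two hits with different values is the smoking gun; one hit in a stale shell profile is the pollution the table's "Direction" column tells you to clean up.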
Pinning heavy macOS builds, signing, and GUI-dependent work to the same small Linux VPS as Gateway saves effort in the short term, but it drags down both control-plane stability and the signal-to-noise ratio while debugging; a laptop alone rarely provides 24/7 availability or auditable isolation. Teams that need stable iOS CI, automation agents, and contractable compute usually keep Gateway on a general-purpose VPS and place macOS execution on dedicated remote Mac nodes. For clean ops boundaries and elastic scale, NodeMini cloud Mac Mini rental fits that execution tier: pick region and disk, layer it under the OpenClaw control plane, and on-call watches a clear observability surface.
- Use the OpenClaw category filter on the blog index, and read in order: systemd → Docker → observability → modelRouting.
- Start with the rental rates page and compute ordering, and budget Gateway separately from the macOS execution tier.
- See the help center, then cross-check with the health check and logging sections in this article.