OpenClaw MCP stdio subprocess governance (2026): Gateway buildup · memory reclaim · HTTP MCP ops comparison

You already run OpenClaw Gateway with stdio MCP, but in production you see slowly rising child-process counts, climbing RSS, or occasional OOM kills, and you are unsure whether to tune configuration or change architecture. This article divides the work from two companion pieces, MCP stdio/HTTP handshake and stuck-session troubleshooting and Gateway production observability: first a seven-item checklist to draw boundaries, then a stdio vs HTTP ops comparison table to narrow the trade-offs, then a six-step reclaim and throttling runbook, plus environment variables and restart strategy under systemd and Docker. At the end, links to the OpenClaw category and compute scenarios.

01

Scope of this article: seven adjacent problems it does not replace

For stdio MCP connection handshake, stuck sessions, and permission denials, prefer handshake troubleshooting; for allowlists and tool policy, MCP toolchain allowlist troubleshooting; for health checks, logs, and upgrade/rollback, observability. This article only answers: when the Gateway already accepts sessions stably, but child processes and memory curves are still unacceptable, how to govern in layers.

  1. First-time handshake failure: inspect MCP transport and downstream executable paths; not expanded here.

  2. Token / scope and gateway closed (1000): use the dedicated closed(1000) article, not reclaim scripts.

  3. Pure security policy: changes to dmPolicy / networkPolicy belong in the hardening article.

  4. Gateway will not start / not ready: use the not-ready troubleshooting article for ports, memory, and compose order.

  5. Model backend timeouts at the app layer: can run in parallel with MCP subprocess work, but the root cause may be routing, not the process table.

  6. One-off leaks from third-party MCP bugs: need an upstream fix or a pinned version; reclaim only mitigates.

  7. Treating “cleanup” as a cure-all: hard kills without watermarks and version logs hide real leaks.

In open-source discussions, stdio MCP servers run as Gateway child processes under long-lived sessions can show the subprocess pool growing with session count; behavior varies by release, so ops should codify an “acceptable process ceiling + reclaim policy” in the runbook instead of relying on defaults. Below we first build the mental model “session isolation → expected buildup”, then the comparison table.

The dividing line with the handshake article: handshake failures show up as connection/handshake errors in logs; the signals this article covers are monotonically rising process counts, stepwise memory growth, and OOM kills on a fixed cadence. When triaging, confirm parent/child relationships between the Gateway and MCP children (ps / pstree in the container) so you do not count model backends or channel processes as MCP.
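As a sketch of that parent/child check, a child counter can be scripted against a ps snapshot; `count_children` is a hypothetical helper (not OpenClaw tooling), and the piped-in PIDs are illustrative:

```shell
# Count the direct children of a given parent PID from a "pid ppid comm" snapshot.
# In production, feed it real output: ps -axo pid,ppid,comm | count_children <gateway-pid>
count_children() {
  awk -v parent="$1" '$2 == parent { n++ } END { print n + 0 }'
}

# Illustrative snapshot: PID 10 is the Gateway, 11/12 are stdio MCP children.
printf '10 1 openclaw\n11 10 mcp-a\n12 10 mcp-b\n13 2 backend\n' | count_children 10
# prints 2 — only direct children of PID 10 are counted
```

The awk match is on PPID only, which is exactly why the triage advice above matters: anything that is not a child of the Gateway (model backends, channel processes) drops out of the count.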

If you run multiple toolchains (local scripts plus long-lived agents on a remote Mac), sketch which host runs the Gateway and which runs heavy MCP: headroom on the Gateway node caps concurrent stdio sessions. For capacity and access planning, read the rental-rates page together with the help-center network requirements.

From an observability angle, capture at least three series: process count, RSS for Gateway and MCP, and tool-call QPS with session count. Without session dimensions, you cannot tell whether “buildup” is a traffic peak or missing reclaim. Align log fields with observability, then compare week over week in dashboards; otherwise on-call restarts on gut feel.
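A minimal collector for the first two series might look like the following sketch; the process-name pattern and the CSV filename are assumptions, and tool-call QPS plus session count still have to come from Gateway logs or metrics:

```shell
# Append one CSV sample: unix timestamp, matching process count, summed RSS in KiB.
# The pattern 'openclaw|mcp' is an assumption — match your actual binary names.
sample_usage() {
  ps -axo rss=,comm= | awk -v ts="$(date +%s)" '
    $2 ~ /openclaw|mcp/ { n++; rss += $1 }
    END { printf "%s,%d,%d\n", ts, n + 0, rss + 0 }'
}

sample_usage   # e.g. run every minute from a timer: sample_usage >> mcp_usage.csv
```

Graphing these two columns next to session count is what lets you tell a traffic peak (count tracks sessions down) from missing reclaim (count never comes down).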

Another common mistake is equating “many children” with “must switch to HTTP.” If tools are very light and concurrency is low, the issue is more often sessions not closed or a blocking downstream binary; tighten client session lifecycle before you price a transport migration.

02

stdio MCP vs HTTP MCP: ops comparison (when to consider migration)

stdio transport runs the MCP server as a child process tightly coupled to the Gateway lifecycle; HTTP transport behaves more like a standalone endpoint, with different scaling and health-check paths. The table below helps decide “keep tuning stdio” versus “move to HTTP.”

| Dimension | stdio MCP | HTTP MCP |
| --- | --- | --- |
| Process coupling | Children follow the Gateway/session model; buildup is easy to perceive | Often a separate process; the Gateway is a client |
| Horizontal scale | Often requires scaling the Gateway or throttling sessions | MCP service replicas scale independently |
| Health checks | Rely on Gateway logs and the process table | HTTP probes and independent SLOs |
| Blast radius | Child issues can slow the Gateway on the same host | Better isolation, at the cost of an extra network hop |
| Good fit | Light tools, low concurrency, trusted same-host setup | Heavy tools, high concurrency, independent release cadence |

Governance is not infinite hardware; it is a contract on process and memory curves: past the contract, throttle, reclaim, or change transport.

If you already validated configuration in stdio/HTTP handshake troubleshooting but still live near the memory red line, put “HTTP-ify” in architecture review instead of endlessly upsizing the host.
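If you do trial an HTTP MCP endpoint, its liveness can be probed independently of the Gateway, which is the “independent SLO” row in the table above. The URL below is a placeholder and the helper is a sketch, not an official probe:

```shell
# Fail-fast HTTP liveness probe: -f turns HTTP error statuses into a nonzero exit,
# -sS keeps output quiet but still reports errors, and --max-time bounds the whole
# request so a hung endpoint cannot stall the check.
probe_http_mcp() {
  curl -fsS --max-time 2 "$1" > /dev/null
}

if probe_http_mcp "http://mcp.internal:8080/health"; then   # placeholder URL/path
  echo "mcp http endpoint: up"
else
  echo "mcp http endpoint: down"
fi
```

A probe like this is what a stdio child can never give you cheaply: a health signal that does not require reading the Gateway's process table.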

03

Six-step subprocess and memory runbook (paste into the on-call guide)

These steps stress “evidence first, throttle second, architecture last”: align log fields with observability before blind restarts.

  1. Establish a baseline: record process count, RSS, Gateway version, and MCP package versions under stable load in a change log.

  2. Separate spikes from leaks: spikes usually track concurrent sessions; monotonic growth suggests missing reclaim or a hung downstream—capture stacks.

  3. Tighten concurrency and timeouts: within config limits, lower parallel tool calls and shorten idle timeouts; see whether curves track session count down.

  4. Planned reclaim: during a maintenance window, rolling-restart the Gateway or isolate nodes; drain sessions first.

  5. Containers and systemd: verify environment variables reach the real runtime (daemon vs interactive shell is a frequent pitfall).

  6. Evaluate HTTP migration: for heavy MCP or services that need independent scale, deploy separately with health checks and switch the Gateway to HTTP transport.
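Step 2's spike-versus-leak call needs a concrete watermark. The sketch below compares summed RSS against a contract ceiling; the 2 GiB limit and the `openclaw|mcp` pattern are example values, not recommendations:

```shell
# Alert when the summed RSS of matching processes crosses the on-call contract.
LIMIT_KB=2097152   # example ceiling: 2 GiB, in KiB because ps reports RSS in KiB
used_kb=$(ps -axo rss=,comm= | awk '$2 ~ /openclaw|mcp/ { s += $1 } END { print s + 0 }')

if [ "$used_kb" -gt "$LIMIT_KB" ]; then
  echo "over contract: ${used_kb} KiB > ${LIMIT_KB} KiB" >&2
else
  echo "within contract: ${used_kb} KiB"
fi
```

Run it at a fixed cadence and log the result with the Gateway version, so the “over contract” events line up with the baseline from step 1.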

bash · process snapshot (example)

```shell
ps -axo pid,ppid,rss,comm | grep -E 'openclaw|mcp|node' | head -n 50
# In containers, pair with pstree -p 1 for a parent/child view
```

Tip: after editing openclaw.json, run the site-recommended validation path (e.g. config:validate / doctor), then roll restarts so you do not get “config looks applied but processes still use old values.”

Public GitHub issues have reported “stdio MCP children not reclaimed on session rotation”; treat periodic reclaim + version tracking as temporary relief while upstream fixes land. Do not rely long term on undocumented manual kill.

If you use external orchestration (systemd timers or k8s CronJobs) for sidecar reclaim, keep the Gateway user, environment, and cgroup memory limits aligned; otherwise the process tree your script sees diverges from production and triage becomes finger-pointing. On small single-node setups, prefer tuning concurrency and timeouts into contract before adding sidecar complexity.
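If you do add a systemd timer, a minimal service/timer pair might look like the following; the unit names, the script path, the user, and the schedule are all assumptions to adapt, and the point of `User=` is exactly the alignment warning above:

```
# /etc/systemd/system/mcp-reclaim.service  (hypothetical unit)
[Unit]
Description=Planned reclaim of idle MCP subprocesses

[Service]
Type=oneshot
User=openclaw                            # same user as the Gateway, so ps trees match
ExecStart=/opt/openclaw/bin/reclaim.sh   # hypothetical script path

# /etc/systemd/system/mcp-reclaim.timer  (hypothetical unit)
[Unit]
Description=Run mcp-reclaim during the maintenance window

[Timer]
OnCalendar=*-*-* 04:00:00                # off-peak; align with your change calendar

[Install]
WantedBy=timers.target
```

Keeping the reclaim job inside the same cgroup/memory limits as production (for example via `Slice=`) is what prevents the “script sees a different process tree” finger-pointing.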

Finally, tie every reclaim or throttle change to a release version: stdio behavior can shift across minor Gateway releases; curves without version alignment are not statistically meaningful.

04

systemd and Docker: environment variables, restart policy, and reading the wrong logs

systemd services do not inherit interactive ~/.zshrc; under Docker Compose, MCP binary paths and read-only mounts can also cause children to respawn in a loop. Alongside Docker production guidance, check Environment=, WorkingDirectory=, named volumes, and image digest.
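As a concrete checklist for the systemd side, a drop-in can pin the items this paragraph lists; every path and the `OPENCLAW_CONFIG` variable name below are placeholders, not documented OpenClaw settings:

```
# /etc/systemd/system/openclaw-gateway.service.d/override.conf (hypothetical drop-in)
[Service]
# systemd does not read ~/.zshrc: declare what the Gateway needs explicitly.
Environment=PATH=/usr/local/bin:/usr/bin:/bin
Environment=OPENCLAW_CONFIG=/etc/openclaw/openclaw.json   # placeholder variable name
WorkingDirectory=/var/lib/openclaw
```

After editing, `systemctl daemon-reload` then restart; checking `systemctl show <unit> -p Environment` confirms the values actually reached the runtime rather than only your interactive shell.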


Warning: frequent docker compose restart without the prior exit code can mislabel a configuration bug as a memory leak.
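Before restarting, read the exit code (for example `docker inspect --format '{{.State.ExitCode}}' <container>`) and translate it. The mapping below uses the standard 128+signal convention; the helper name is illustrative:

```shell
# Translate a container/process exit code before deciding it was a "leak".
# 137 = 128 + SIGKILL(9), frequently the kernel OOM killer; 139 = 128 + SIGSEGV(11).
explain_exit() {
  case "$1" in
    0)   echo "clean exit - look for a config or dependency error in logs" ;;
    137) echo "SIGKILL - check dmesg/journal for an OOM kill before blaming a leak" ;;
    139) echo "SIGSEGV - a crash, not gradual memory buildup" ;;
    *)   echo "exit code $1 - read the application log first" ;;
  esac
}

explain_exit 137
# prints the SIGKILL line above
```

An exit code of 0 followed by an immediate respawn is the configuration-bug signature the warning describes; 137 on a fixed cadence is the memory-pressure one.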

Pair with not-ready troubleshooting: under memory pressure the Gateway may never reach stable MCP sessions—fix resources and startup order before subprocess tuning.

05

On-call reference lines

The numbers below are common starting points; tune to your load.

  • Memory headroom: on Linux nodes running Gateway plus multiple stdio MCP sessions, reserve well above “CLI probe only” for peak sessions; if OOM is frequent, throttle before you scale up.
  • Reclaim windows: planned reclaim belongs on the change calendar; emergency reclaim needs before/after process tables and log snippets tied to versions.
  • Migration criteria: when the same MCP is operable on HTTP with an independent SLO and stdio curves stay unacceptable, prefer architectural migration over stacking cron jobs.

Stacking the Agent and Gateway on a small, unstable node produces “toolchain and model double jitter”; scaling hardware alone, without changing the session model, only makes cost grow linearly. Teams that need long-lived, predictable compute for macOS builds or automation, and still want headroom for OpenClaw-related toolchains, often find NodeMini Mac Mini cloud rental clearer than laptops or oversubscribed shared hosts, with fixed SSH access and stated specs; review the rental rates and help center for network and access details.

Operationalize by writing “stdio session ceiling + reclaim policy + HTTP migration triggers” on the same ops contract page so engineering and SRE review the same metrics.

For external postmortems, attach process snapshots and log excerpts to separate misconfiguration from upstream defects.

FAQ

Frequently asked questions

Q: Does a rising stdio subprocess count always mean a leak?

Not necessarily. Separate expected growth from session isolation from leak-like buildup, then align RSS trends and the Gateway version with known issues; upgrade to a fixed release instead of only adding cron.

Q: When is it time to migrate from stdio MCP to HTTP MCP?

When subprocess lifetimes are hard to align with the session model long term, or memory headroom stays insufficient without horizontal scale, HTTP MCP is usually easier to scale and health-check independently. More toolchain context is in the OpenClaw category.

Q: How do I size and budget a dedicated node for the Gateway plus MCP?

Start with rental rates to compare tiers, then use help center network and access articles to build a capacity table.