Where do I start when a stdio MCP subprocess “just disappears”?

Check PATH, working directory, and execute permissions as the user that runs both the Gateway and the child process; then look for OOM kills, other OS-initiated terminations, and corrupted npm/npx caches. Persist exit codes and stderr, and record the command line and version pins in the change ticket. For capacity and connectivity baselines, see Mac Mini rental pricing and the cloud Mac help center.

How does this article relate to the MCP allowlist post on the site?

The allowlist article focuses on gateway-side registration, permission boundaries, and first-response connectivity issues; this one focuses on transport choice, subprocess lifecycle, handshake behavior, and stuck workers. Read both and align them in one review table.

Where should I start reading OpenClaw articles?

Open the blog OpenClaw category for install, systemd, Docker, security, and observability posts; when you need a stable macOS execution plane for toolchains, combine rental pricing and the help center to plan remote nodes.

2026 OpenClaw MCP Production Rollout: stdio vs HTTP MCP, Handshake Failures, and Stuck-Worker Triage

Scope and seven typical pain points before you “wire MCP for real”

The install article answers how the Gateway process stays resident; the security article covers listen surfaces, tokens, dmPolicy, and egress; the allowlist article covers tool registration and the first response when permissions are denied. This piece comes after those: it covers the operational differences between stdio subprocesses and remote HTTP MCP, and which log classes matter when handshakes fail, calls time out, or workers get stuck.

If three or more of the bullets below match your environment, add an explicit “MCP runtime” risk line in the review—not a vague “try restarting Gateway again.”

01. Command lines that only work on a dev laptop: npx paths, Node minor versions, and global packages differ under systemd from an interactive shell, producing “SSH works, Gateway launch fails.”
02. Implicit working-directory dependencies: the MCP child assumes a repo root; an empty HOME or read-only mount makes it fail quietly.
03. HTTP MCP configured with a URL but no working TLS: certificate chains, SNI, internal self-signed certs, and networkPolicy combine into symptoms that look like an endless handshake.
04. Stale tool-list caches: after servers add or remove tools, clients still call old schemas and you see seemingly random parameter validation failures.
05. Long calls without timeouts: when a downstream API hangs, Gateway-side threads or connections never drain, and eventually the whole Gateway stalls.
06. Zombie-like subprocesses: with stdio, a half-closed pipe can leave a child alive but doing no useful work, still holding file descriptors and sometimes spinning CPU.
07. Config drift with no paper trail: openclaw.json diverges per host with no validate/doctor record, so triage becomes folklore.
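
Pain points 01 and 02 can be checked mechanically before the Gateway ever spawns the child. The following is a minimal Python sketch, not an OpenClaw feature: the function name `preflight` and its parameters are hypothetical, and it assumes you pass in the PATH, working directory, and HOME that the service unit actually sets, not the values from your interactive shell.

```python
import os
import shutil
from pathlib import Path
from typing import List, Optional


def preflight(command: str, path_env: str, cwd: str, home: Optional[str]) -> List[str]:
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []
    # shutil.which searches the PATH we pass in (not the shell's) and only
    # returns files that the current user is allowed to execute.
    if shutil.which(command, path=path_env) is None:
        problems.append(f"{command!r} not found or not executable on PATH={path_env!r}")
    if not Path(cwd).is_dir():
        problems.append(f"working directory {cwd!r} does not exist")
    elif not os.access(cwd, os.R_OK | os.X_OK):
        problems.append(f"working directory {cwd!r} is not readable by this user")
    if not home:
        # Assumption: tools that cache under HOME may misbehave when it is empty.
        problems.append("HOME is empty")
    return problems
```

Run it as the Gateway user; a result that differs from what you get in an SSH session is usually the “SSH works, Gateway launch fails” gap made visible.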

Once these land in a runbook, MCP can behave like CI: change ticket + pinned rollback. Next, a single table flattens stdio versus HTTP operations cost so a meeting cannot skip TLS and egress governance with “remote is easier.”

In 2026 platform-engineering practice, toolchain governance is tied to who may spawn subprocesses in production: stdio pushes the boundary to OS users and file permissions; HTTP pushes it to network policy and bearer tokens. Neither is universally better; what matters is whether the choice matches how your team observes systems and runs on-call.

stdio subprocess versus HTTP MCP: scenarios, exposure, and operational cost

Use this table with SRE, security, and product: do not compare latency alone—price in identity, egress, upgrades, and failure isolation together.

Dimension	stdio (local subprocess)	HTTP / SSE-style remote MCP
Typical deployment	Same host as Gateway or same container namespace	Standalone service, sidecar, or internal cluster
Identity and trust	OS user, file permissions, optional sandbox	mTLS, bearer tokens, reverse-proxy auth
Upgrade path	Pin image/package versions; roll Gateway or the child package	Independent blue/green; mind protocol version negotiation
What to observe	Exit codes, stderr, fd leaks, OOM	HTTP 5xx/429, connection pools, TLS handshake latency
Failure isolation	Process crash → supervisor restart	Network partitions can slow multiple tools—need circuit breaking
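
The failure-isolation row mentions circuit breaking for remote MCP. A minimal sketch of the idea in Python follows; the class name, thresholds, and injectable clock are assumptions for illustration, and this is not a description of how any particular Gateway implements it.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Should the next call to the remote MCP server be attempted?"""
        if self.opened_at is None:
            return True
        # After the cool-down, trial calls are let through again;
        # a failed trial re-opens the circuit via record(False).
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        """Report the outcome of a call that was attempted."""
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

Keeping one breaker per remote server, not one for the whole Gateway, means a partition that affects one MCP endpoint fails fast without blocking tools served from elsewhere.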

MCP rollout is about turning tool calls into a versioned, bounded, rollback-friendly supply chain; transport only decides whether complexity sits at the kernel edge or the network edge.

If you already tightened networkPolicy per security hardening, revisit egress allowlists when adding HTTP MCP; for stdio, re-check whether the Gateway user can execute the intended binaries—avoid “chmod +x everything to move faster.”

Seven steps to a reproducible MCP rollout (with config validation)

These steps assume a bootable Gateway; if install and the daemon are not done yet, return to cross-platform install and the systemd/Docker production guides.

01. Freeze the runtime: record Node minor, package manager, and MCP server package versions; production and staging must both match that recorded baseline.
02. Minimal stdio probe: start the MCP once non-interactively as the same user as Gateway and confirm PATH and cwd.
03. Write the config snippet: register servers in openclaw.json (or the path your docs specify); use a team prefix on names to avoid collisions.
04. Run validation: openclaw config:validate then openclaw doctor; any differences belong in the change ticket.
05. Wire the allowlist: per the allowlist article, tighten tool names and namespaces to the minimum set.
06. Add observability hooks: thresholds for child CPU/memory and P95 latency to HTTP MCP, fed into your logging pipeline.
07. Practice rollback: keep the last known-good config and image digest so removing one MCP entry restores baseline.
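
Step 02 can be sketched as a small Python helper. The name `probe` and the returned fields are hypothetical. The helper closes stdin, so a stdio server that reads stdin sees EOF at once; the run therefore only shows whether the command resolves and starts under the given environment, and what it writes to stderr.

```python
import subprocess


def probe(argv, env, cwd, timeout=10.0):
    """Run a child once, non-interactively, and return what a change ticket needs."""
    try:
        done = subprocess.run(
            argv, env=env, cwd=cwd, timeout=timeout,
            stdin=subprocess.DEVNULL,   # no terminal, no interactive input
            capture_output=True, text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return {"argv": argv, "exit_code": None, "timed_out": True, "stderr": ""}
    except OSError as exc:
        # Raised when the command cannot be found or executed at all.
        return {"argv": argv, "exit_code": None, "timed_out": False, "stderr": str(exc)}
    return {
        "argv": argv,
        "exit_code": done.returncode,
        "timed_out": False,
        "stderr": done.stderr[-2000:],  # keep only the tail for the ticket
    }
```

Calling it with a deliberately small environment such as `env={"PATH": "/usr/bin:/bin"}` and the Gateway's working directory, while running as the Gateway user, approximates what a supervisor would see, not what your login shell sees.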

openclaw.json snippet (example)

{
  "mcpServers": {
    "corp-files-stdio": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/var/lib/openclaw/mcp-data"],
      "env": { "NODE_OPTIONS": "--max-old-space-size=512" }
    },
    "internal-api-http": {
      "url": "https://mcp.internal.example/sse",
      "headers": { "Authorization": "Bearer ${MCP_SERVICE_TOKEN}" }
    }
  }
}

Note: real key names and nesting follow the docs for your OpenClaw version; the sketch only shows how stdio and HTTP entries can coexist, and it leaves the npx package unpinned for brevity, which production configs should not do. Before a major upgrade, re-run validation and read the release notes for breaking changes.
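
A config shaped like the example above can also be linted before it reaches review. The Python sketch below assumes the key names used in this article's snippet (`mcpServers`, `command`, `args`, `url`), which may differ in your OpenClaw version; the function `lint_servers` is hypothetical and is not a substitute for the product's own validation.

```python
from typing import List
from urllib.parse import urlparse


def lint_servers(config: dict) -> List[str]:
    """Flag entries that mix transports, use plain HTTP, or run unpinned scoped npx packages."""
    findings = []
    for name, entry in config.get("mcpServers", {}).items():
        if "command" in entry and "url" in entry:
            findings.append(f"{name}: has both 'command' and 'url'")
        elif "command" in entry:
            for arg in entry.get("args", []):
                # Only scoped packages are recognised here: "@scope/pkg" is
                # treated as pinned when a second "@" (the version) follows.
                if entry["command"] == "npx" and arg.startswith("@") and "@" not in arg[1:]:
                    findings.append(f"{name}: npx package {arg} has no version pin")
        elif "url" in entry:
            if urlparse(entry["url"]).scheme != "https":
                findings.append(f"{name}: URL is not https")
        else:
            findings.append(f"{name}: neither 'command' nor 'url'")
    return findings
```

Run against the snippet above, this reports the unpinned filesystem server package and nothing for the HTTPS entry.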

Tool discovery, naming collisions, and rolling upgrades

MCP tool names are often namespaced on the gateway; multiple environments on one Gateway invite “same name, different implementation” incidents. Prefer explicit prefixes in config (e.g. prod_ / stg_) and attach a tool-list diff to the release checklist.
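
The tool-list diff and the collision check are both small set operations. A minimal Python sketch follows, assuming you have already captured each server's tools as a mapping from tool name to input schema; how you capture them depends on your Gateway and is not shown, and both function names are hypothetical.

```python
def diff_tools(old: dict, new: dict) -> dict:
    """Compare two {tool_name: input_schema} mappings taken before and after a release."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same name on both sides but a different schema.
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def collisions(tools_by_server: dict) -> dict:
    """Return tool names exposed by more than one server, with the servers that expose them."""
    owners = {}
    for server, names in tools_by_server.items():
        for name in names:
            owners.setdefault(name, []).append(server)
    return {name: sorted(srv) for name, srv in owners.items() if len(srv) > 1}
```

A non-empty `changed` list is the case most likely to surface later as parameter validation failures, so it is worth attaching to the release checklist even when `added` and `removed` are empty.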

When rolling HTTP MCP, keep backward-compatible schemas first; if you must break compatibility, update Gateway allowlists together and canary a slice of session traffic. stdio server upgrades need attention to binary ABI and dynamic library paths, especially in slim images.
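
Sending a slice of session traffic to the canary works best when the slice is deterministic, so that one session does not flip between old and new schemas mid-conversation. One common way to do that is to hash the session identifier into a bucket; the Python sketch below assumes a string session ID and a hypothetical `in_canary` helper, and says nothing about how your Gateway actually routes.

```python
import hashlib


def in_canary(session_id: str, percent: int) -> bool:
    """Deterministically decide whether a session belongs to the canary slice."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # The first four bytes give a stable bucket in the range [0, 100).
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the session ID, raising `percent` from 10 to 50 keeps every session that was already in the canary and only adds new ones.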

Warning: do not let a production Gateway pull “latest” through an unpinned npx -y; pin a digest or use an internal artifact feed, or you lose supply-chain auditability.

Symptom cheat sheet, reference framing, and compute context

Use this matrix on the first on-call screen; details still require Gateway logs and upstream MCP docs.

Symptom	Check first	Common fix
Handshake fails immediately	Version fields, auth headers, TLS chain	Align protocol version; fix certificates or allow SNI
Worked once, never again	Pool exhaustion, stuck subprocess	Restart MCP side; add timeouts and circuit breaking
Missing tools in the list	Cache, canary routing, allowlist	Invalidate cache; reconcile allowlist and routing
Random timeouts	Downstream API, quota, DNS	Layered timeouts; log trace IDs
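
“Layered timeouts” in the last row means each hop gets its own cap while the whole tool call still respects one overall budget. A minimal Python sketch of that bookkeeping follows; the `Deadline` class and its method names are hypothetical.

```python
import time


class Deadline:
    """One overall budget from which each hop derives its own timeout."""

    def __init__(self, total_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + total_seconds

    def remaining(self) -> float:
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires - self._clock())

    def timeout_for(self, hop_cap: float) -> float:
        """Timeout for the next hop: its own cap, but never more than what is left."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("overall deadline exceeded")
        return min(hop_cap, left)
```

Passing the value from `timeout_for` to each downstream call (DNS, the MCP request, the API behind it) keeps a slow first hop from silently consuming the time that later hops assume they still have.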

Subprocess recovery: for stdio MCP, set a default crash-restart cap (e.g. five per ten minutes), then alert for human review, so a crashing child cannot turn into an infinite restart storm.
HTTP concurrency: keep a dedicated connection ceiling for remote MCP, counted separately from model traffic, so they do not fight for file descriptors.
Config audit: every openclaw.json change should attach a snippet of validation output to the ticket for postmortems.
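
The restart cap in the first item is a sliding-window counter. The Python sketch below uses the “five per ten minutes” figure from the text as defaults; the class name `RestartBudget` is hypothetical, and in practice your supervisor may already offer an equivalent setting.

```python
import time
from collections import deque


class RestartBudget:
    """Allow at most `max_restarts` restarts within any `window_seconds` window."""

    def __init__(self, max_restarts=5, window_seconds=600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # times of recent restarts, oldest first

    def try_restart(self) -> bool:
        """Return True if a restart is allowed now; False means stop and alert a human."""
        now = self.clock()
        # Drop restarts that have aged out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False
        self._stamps.append(now)
        return True
```

When `try_restart()` returns False, the child stays down and the alert fires; the budget refills on its own as old restarts leave the window.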

Running all MCP from a developer laptop invites “tools randomly unavailable” under sleep, VPN jitter, and multi-user desktop sessions; exposing HTTP MCP on the public internet without TLS and policy multiplies the Gateway attack surface. If you also need a stable macOS plane for Apple toolchain automation (mobile builds, signing, or agents paired with MCP tools), a contracted remote Mac node usually yields cleaner permission and logging boundaries than personal hardware. Across transport choice, subprocess governance, and on-call design, NodeMini Mac Mini cloud rental can complement the compute layer: plan it with the OpenClaw column’s install, security, and observability posts so that the model gateway, the toolchain, and macOS execution each have clear ownership.
