Troubleshooting

When something isn’t working, run doctor first. It catches the vast majority of issues in one round-trip. Beyond that, here are the most common failure modes and what to do about each.

“No espctl tools available” / “Failed to start MCP server”

Your client can’t even spawn the MCP server.

Check:

  • Is the absolute path to espctl correct in your client config? Run ls -l /path/to/espctl to confirm.
  • Does it have execute permission? chmod +x if not.
  • Run espctl mcp serve in a terminal manually. What does it print to stderr? Common issues:
    • cannot find store at <path>: the store doesn’t exist or has the wrong permissions.
    • dynamic linker errors: the binary was built against a newer libc than your system provides; rebuild from source or use a different release.
  • For Claude Desktop on macOS specifically: GUI apps don’t inherit your shell’s env vars. List every env var explicitly in claude_desktop_config.json rather than relying on ~/.zshrc.

doctor reports control_plane: error

Your MCP server is running fine but can’t reach the build server.

Check:

  • curl ${CONTROL_BASE_URL}/health — does it return 200 with a JSON body?
  • Is CONTROL_BASE_URL actually a URL? Common mistakes: missing http:// or https:// scheme, trailing slash, or pasting an SSH alias instead of a routable hostname.
  • DNS — dig or nslookup the host. If it fails, you may need to use the IP form (http://<your-server-ip>) until DNS resolves.
  • Firewall — outbound port 80/443 must be reachable from your machine.
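The URL mistakes in the list above can be caught before you rerun doctor. Below is a minimal TypeScript sketch, assuming you have the CONTROL_BASE_URL value as a plain string. checkControlBaseUrl is a hypothetical helper written for illustration; it is not part of espctl.

```typescript
// Hypothetical helper (not part of espctl): flags the common
// CONTROL_BASE_URL mistakes described above.
function checkControlBaseUrl(value: string): string[] {
  const problems: string[] = [];
  if (!/^https?:\/\//i.test(value)) {
    // Also catches a pasted SSH alias such as "buildbox".
    problems.push("missing http:// or https:// scheme");
    return problems; // nothing else can be checked reliably without a scheme
  }
  if (value.endsWith("/")) {
    problems.push("trailing slash");
  }
  try {
    new URL(value); // throws on values the URL parser cannot handle
  } catch {
    problems.push("not a parseable URL");
  }
  return problems;
}

console.log(checkControlBaseUrl("buildbox"));
console.log(checkControlBaseUrl("https://build.example.com/"));
```

An empty result only means the value is well-formed. Whether the host resolves and is reachable still needs the DNS and firewall checks.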

doctor reports control_plane: ok but builds still fail

The MCP server can reach the build server, but builds aren’t producing output.

Check:

  • Is MCP_AUTH_SECRET set and correct? Builds need it; doctor only needs the build server to respond to /health. Without the secret, you’ll see “401 Unauthorized” in the response to /grant/request.

  • Get the live secret from the build server host:

    Security note: This command shows sensitive auth tokens. Only operators should run it. Never share the output publicly.

    ssh <control-host> 'grep -E "^(MCP_AUTH_SECRET|AGENT_AUTH_SECRET)=" /etc/aegis/secrets.env /etc/aegis/control.env 2>/dev/null'
    
  • Is at least one build machine online? Operators can check with aegis-control list-agents (or by reading the build server’s metrics).

  • Are the build machine and build server clocks synchronized? Permissions have short TTLs; if either side’s clock is off by more than ~30 seconds, every permission expires before it can be used.
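To see why clock skew is fatal for short-lived permissions, here is a small TypeScript model. It is a sketch under stated assumptions: the 30-second TTL, the function name, and the idea that expiry is stamped by the issuer and checked against the verifier's clock are illustrative and not taken from the aegis source.

```typescript
// Hypothetical model: the TTL value and all names here are assumptions.
// skewMs > 0 means the verifying side's clock runs ahead of the issuer's.
function expiredOnArrival(ttlMs: number, transitMs: number, skewMs: number): boolean {
  const issuedAtMs = 0;                                   // issuer's clock at issue time
  const expiresAtMs = issuedAtMs + ttlMs;                 // expiry stamped by the issuer
  const verifierNowMs = issuedAtMs + transitMs + skewMs;  // verifier's clock on arrival
  return verifierNowMs >= expiresAtMs;
}

console.log(expiredOnArrival(30000, 200, 0));     // false: clocks agree
console.log(expiredOnArrival(30000, 200, 31000)); // true: skew alone exceeds the TTL
```

With the skew larger than the TTL, no amount of retrying helps. The fix is to synchronize the clocks.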

WebRTC connection establishes but immediately closes

on_open fires but the connection drops within seconds, or on_open never fires at all.

Likely causes:

  • Connection negotiation failed. No candidate pair worked. The peer connection state goes to Failed after ~5 seconds and the data channels never open. Cause: network restrictions or firewalls block all UDP and the fallback servers aren’t configured or reachable.
  • Network restrictions on both sides. Direct peer-to-peer is impossible, which forces a relay through the fallback servers. Make sure the build server returns at least one relay entry in ice_servers.
  • Relay credentials expired. Relay credentials rotate per-session; if your client cached one from an earlier session, it’s stale. Open a fresh session.
  • Browser blocked WebRTC. Some corporate browser policies disable WebRTC entirely. Check chrome://webrtc-internals/ (Chrome) for the connection candidate dump.

Fix pattern: Always implement a fast-fail in your client that watches for RTCPeerConnection.connectionState === 'failed' in parallel with waiting for on_open. Wrap connect() in a 3-attempt retry loop with a 2-second delay between attempts.
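The fix pattern above could look roughly like the following TypeScript sketch. It is written against a minimal PeerLike interface so that it also runs outside a browser. A browser RTCPeerConnection should satisfy that interface, but connectOnce and the opened promise are hypothetical stand-ins for your client's own connect() and on_open wiring.

```typescript
// Minimal shape this sketch needs from a peer connection (assumption:
// your RTCPeerConnection object is passed in as-is).
interface PeerLike {
  connectionState: string;
  addEventListener(type: "connectionstatechange", listener: () => void): void;
  close(): void;
}

// Rejects as soon as the peer connection reports "failed".
function failFast(pc: PeerLike): Promise<never> {
  return new Promise<never>((_resolve, reject) => {
    const check = () => {
      if (pc.connectionState === "failed") {
        reject(new Error("peer connection failed"));
      }
    };
    pc.addEventListener("connectionstatechange", check);
    check(); // the state may already be "failed"
  });
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// connectOnce is hypothetical: it starts one attempt and returns the peer
// connection plus a promise that resolves when on_open fires.
async function connectWithRetry<T extends PeerLike>(
  connectOnce: () => { pc: T; opened: Promise<void> },
  attempts = 3,
  delayMs = 2000,
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const { pc, opened } = connectOnce();
    try {
      // Whichever settles first wins: open, or the "failed" state.
      await Promise.race([opened, failFast(pc)]);
      return pc;
    } catch (err) {
      lastError = err;
      pc.close(); // release the failed attempt before retrying
      if (attempt < attempts) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

Racing the two promises means a failed negotiation is reported as soon as the state changes rather than after an open timeout. Each retry creates a fresh connection, which also avoids reusing stale relay credentials.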

Build hangs in pending for a long time

The permission was issued, but no build machine picked up the job.

Check:

  • Are build machines online? The build server can still issue permissions when no build machines are connected, but with nothing to honor them the job sits in pending until it times out.
  • Are build machines capable of running the requested target? If you ask for esp32p4 and no build machine has the toolchain installed, the job will sit unassigned. Operators can see unassigned jobs in the build server log.

Build fails with a compiler error

This is the easy case. Ask your AI assistant:

Run parse_build_errors on the latest build, then run the diagnose-build-error prompt against the result.

You’ll get a structured “what’s wrong, why, here’s the fix” rather than a 500-line log dump.

“Channel pty was rejected”

The build machine refused to open a data channel that wasn’t in the permission’s allowlist.

Cause: Your client’s permission request didn’t include pty in required_channels. This typically happens when a client was upgraded to a version that uses new channels but the operator hasn’t updated the allowed-channel list on the build server yet.

Fix: Either update the build server’s allowlist, or pin your client to a version that doesn’t use the new channel.

Send queue full / firmware download stalls

Throughput drops dramatically partway through a firmware download (only matters for large *.bin files over a relay connection).

Cause: Production build machines cap the send queue at 128 KB. Combined with a 500 ms round-trip relay, this caps throughput at ~256 KB/s, not the multi-MB/s you’d see on a direct peer-to-peer connection.

Fix: This is by design (preventing memory exhaustion when the receiver can’t keep up). If your firmware is large enough that it matters, prefer a direct peer-to-peer connection over a relay. Direct connections aren’t affected as severely because the round-trip time is much lower.
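The ~256 KB/s figure follows from simple arithmetic: at most one full send queue can be in flight per round trip. A short TypeScript sketch of that bound, assuming 1 KB = 1024 bytes; the 20 ms direct-path round trip is a hypothetical value chosen for comparison:

```typescript
// Upper bound on throughput when at most `queueBytes` can be in flight
// per round trip of `rttMs` milliseconds.
function maxThroughputBytesPerSec(queueBytes: number, rttMs: number): number {
  return (queueBytes * 1000) / rttMs;
}

const queueBytes = 128 * 1024; // send-queue cap from the text above

// 500 ms relay round trip, as in the text above.
console.log(maxThroughputBytesPerSec(queueBytes, 500) / 1024); // 256 (KB/s)

// Hypothetical 20 ms direct peer-to-peer round trip with the same cap.
console.log(maxThroughputBytesPerSec(queueBytes, 20) / 1024); // 6400 (KB/s)
```

The cap is the same in both cases; only the round-trip time changes. That is why a direct connection is much less affected.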

“Pre-deploy CORS error”

You’re a build server operator and the browser can’t even reach /grant/request.

Check:

  • Run grep ALLOWED_ORIGINS /etc/aegis/control.env. Is your origin listed?
  • Does the value include the scheme (https://) and exclude trailing slashes? ALLOWED_ORIGINS=https://esphome.cloud is correct; ALLOWED_ORIGINS=esphome.cloud/ is not.
  • Did you restart aegis-control after editing the env file? sudo systemctl restart aegis-control.
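The two formatting rules for an origin (scheme present, no trailing slash) can be checked mechanically. The TypeScript sketch below validates a single entry. isWellFormedOrigin is a hypothetical helper, and how aegis-control actually parses ALLOWED_ORIGINS is not assumed here.

```typescript
// Hypothetical helper: checks one ALLOWED_ORIGINS entry against the rules
// above. Accepts scheme://host[:port] with no path and no trailing slash.
function isWellFormedOrigin(entry: string): boolean {
  return /^https?:\/\/[^\/\s]+$/.test(entry);
}

console.log(isWellFormedOrigin("https://esphome.cloud")); // true
console.log(isWellFormedOrigin("esphome.cloud/"));        // false
```

A well-formed entry must still match the origin the browser sends exactly, including the scheme and any non-default port.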

Still stuck

  • Ask your AI assistant to read the install://overview resource — it returns the same env-var table from inside the MCP server, which lets you cross-reference what the server thinks its config is.
  • Check the live logs: build server (journalctl -u aegis-control -f), build machine (journalctl -u aegis-agent -f).
  • File an issue on the aegis or type-driven-ui repository with the output of doctor attached.

See also