MCP Server Incident Runbook — response playbook for common failure modes

A runbook is the document you wish you had when the alert wakes you at 2 AM. It converts the high-stakes, context-sparse moment of an incident into a sequence of specific, low-ambiguity steps: check this first, if you see X do Y, escalate when Z. For MCP servers, the five failure modes AliveMCP detects — connection refused, protocol handshake failure, tool call timeout, schema drift, and elevated error rate — each have a distinct investigation path and a distinct set of remediation actions. This runbook documents all five, plus the standard escalation decision tree and a postmortem template.

TL;DR

The first check for any MCP server incident: open the AliveMCP dashboard and read the failure reason field. It tells you which failure mode you are in. Connection refused → check process/container health first. Protocol failure → check for SDK version mismatch or recent config change. Timeout → check CPU/memory and external dependency latency. Schema drift → check recent deploys for tool definition changes. Elevated error rate → check application logs for exception patterns. In all cases: if you cannot identify root cause in 15 minutes, escalate and post a status update rather than continuing to investigate alone.

Before the incident: runbook setup

A runbook only helps if it is accessible when you need it. Three preparatory steps:

  1. Link the runbook from your AliveMCP alert payload. In AliveMCP's webhook configuration, add a custom field runbook_url pointing to this document (hosted in your internal wiki, Notion, GitHub, or Confluence). Your PagerDuty or OpsGenie integration should include this URL in the alert — one tap from the incident to the playbook.
  2. Store credentials in a known place. The runbook references access to your VPS, Kubernetes cluster, and log aggregator. Before an incident, document where each credential is stored (1Password, AWS Secrets Manager, Vault). An incident is not the time to search for the SSH key.
  3. Verify access paths in advance. Can you SSH to your VPS from your phone? Can you read your production logs from a mobile browser? Test these paths before the first incident. At 2 AM on a bad network, a VPN that doesn't connect on mobile is not a runbook failure — it is an infrastructure failure that should have been discovered and fixed beforehand.

Failure mode 1: Connection refused

AliveMCP failure reason: ECONNREFUSED or "Connection refused to port N"

What it means: Nothing is listening on the MCP server's port. The process has crashed, been stopped, or was never started. The network path (DNS, Ingress, load balancer) is intact; the port itself has no listener.

| Step | Action | What you are looking for |
|------|--------|--------------------------|
| 1 | `systemctl status mcp-server` (systemd on bare metal) or `pm2 status` (PM2) | Process state (running / stopped / failed), exit code, restart count |
| 2 | `journalctl -u mcp-server -n 100` or `pm2 logs --lines 100` | Last error before the crash: OOM kill, uncaught exception, SIGTERM without restart |
| 3 | `free -m` and `dmesg \| grep -i oom` | OOM killer output. If the process was OOM-killed, the application logs may not contain the error |
| 4 | Restart the process: `systemctl restart mcp-server` or `pm2 restart mcp-server` | Does it start and stay up, or does it crash immediately? |
| 5 | Manually probe: `curl -X POST http://localhost:PORT -H "Content-Type: application/json" -H "Accept: application/json, text/event-stream" -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"runbook-probe","version":"0.0.1"}}}'` | Valid MCP initialize response, or error |
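
The manual probe in step 5 can also be scripted so the result maps straight onto a failure mode. The following is a minimal sketch using only the Python standard library, not AliveMCP's own probe: the function names, the clientInfo values, and the returned labels are illustrative assumptions, and it assumes the server speaks MCP over HTTP at the URL you pass in.

```python
import json
import socket
import urllib.error
import urllib.request


def build_initialize_request(request_id=1):
    """JSON-RPC body for an MCP initialize call (same shape as the curl probe)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # clientInfo values are placeholders chosen for this sketch.
            "clientInfo": {"name": "runbook-probe", "version": "0.0.1"},
        },
    }


def classify_failure(exc):
    """Map a probe exception to a label loosely matching the runbook's failure modes."""
    if isinstance(exc, urllib.error.HTTPError):
        return "protocol_failure"  # the server answered, but with an HTTP error status
    # urllib wraps socket-level errors in URLError; unwrap to the underlying cause.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(reason, ValueError):
        return "protocol_failure"  # body was not valid JSON / not decodable
    return "unknown"


def probe(url, timeout=10.0):
    """Send one initialize request and return "ok" or a failure label."""
    body = json.dumps(build_initialize_request()).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            text = response.read().decode("utf-8")
            if response.headers.get_content_type() == "text/event-stream":
                # SSE replies carry the JSON payload on "data:" lines.
                # This sketch assumes a single event in the reply.
                text = "".join(
                    line[5:].strip()
                    for line in text.splitlines()
                    if line.startswith("data:")
                )
            json.loads(text)
        return "ok"
    except (OSError, ValueError) as exc:
        return classify_failure(exc)
```

Calling `probe("http://localhost:PORT")` with your server's port returns a single label you can compare against the failure reason in the alert.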

Common root causes: OOM kill (server accumulating session state without cleanup), unhandled exception in a tool handler that crashes the process, manual kill during a deploy without a restart, or a Kubernetes liveness probe restart loop (the kubelet keeps restarting the container because the probe fails before the server is ready).

Mitigation: Restart. If the crash recurs, set a memory limit in PM2 (max_memory_restart: '512M') or in Kubernetes resource limits to trigger a controlled restart before OOM. Follow up with a fix for the root cause during business hours.

Failure mode 2: Protocol handshake failure

AliveMCP failure reason: "MCP initialize failed", "Invalid protocol version", or "JSON-RPC parse error"

What it means: The server is accepting connections (no ECONNREFUSED) but failing to complete the MCP initialize handshake. This is almost always a protocol version mismatch, a broken recent deploy, or a misconfiguration that broke the JSON-RPC serialization.

| Step | Action | What you are looking for |
|------|--------|--------------------------|
| 1 | Check recent deploys: `git log --oneline -10` | Any deploy in the past hour? A broken deploy is the most common cause. |
| 2 | Run the manual initialize curl (see failure mode 1) and read the full response | What is the actual response? "Method not found"? Empty body? HTML error page? |
| 3 | Check the MCP SDK version: `npm list @modelcontextprotocol/sdk` | Is the installed version the same as what worked before the incident started? |
| 4 | Read application logs for any startup errors | Environment variable missing, database connection failed during startup, import error |
| 5 | If caused by a bad deploy: roll back with `git revert HEAD` and redeploy | Does the server pass AliveMCP's health check after rollback? |

Common root causes: SDK upgrade that changed the response shape, an environment variable needed by the initialize handler that is missing in production (exists in dev), a configuration file that changed the transport mode (stdio vs HTTP), or a reverse proxy misconfiguration that intercepts the initialize request before it reaches the MCP server.
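
When reading the full response in step 2, it helps to check it against the shape an initialize result is expected to have. The sketch below is one way to do that, assuming you already have the response parsed into a Python dict; the function name and the wording of the problem messages are illustrative, not part of any SDK.

```python
def check_initialize_response(response, request_id=1):
    """Return a list of problems found in a parsed initialize response."""
    if not isinstance(response, dict):
        return ["response is not a JSON object"]

    problems = []
    if response.get("jsonrpc") != "2.0":
        problems.append('missing or wrong "jsonrpc" field')
    if response.get("id") != request_id:
        problems.append("response id does not match request id")

    if "error" in response:
        # A JSON-RPC error object normally carries "code" and "message".
        error = response["error"] if isinstance(response["error"], dict) else {}
        problems.append(
            f"server returned error {error.get('code')}: {error.get('message')}"
        )
        return problems

    result = response.get("result")
    if not isinstance(result, dict):
        problems.append('missing "result" object')
        return problems

    # Fields an MCP initialize result is expected to include.
    for field in ("protocolVersion", "capabilities", "serverInfo"):
        if field not in result:
            problems.append(f'result is missing "{field}"')
    return problems
```

An empty list means the handshake response looks structurally sound. A "Method not found" error usually points at a proxy or routing layer answering instead of the MCP server, or at a transport mode mismatch.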

Failure mode 3: Tool call timeout

AliveMCP failure reason: "Timeout after Nms", where N exceeds AliveMCP's configured timeout (default 10 seconds for initialize)

What it means: The server is accepting connections, but the initialize or tool call response is not arriving within the timeout. Either the server is overloaded or a downstream dependency is slow.

| Step | Action | What to look for |
|------|--------|------------------|
| 1 | `top` or `kubectl top pods` | CPU at 100%? Memory near limit? Node.js at 100% CPU usually means a blocked event loop. |
| 2 | Check database query times: look for slow query logs | Is a DB query the bottleneck? Run `EXPLAIN ANALYZE` on suspected queries. |
| 3 | Check external API response times in your logs | Is an external API your tools call experiencing slowness? |
| 4 | Check connection pool utilization: look for "no connections available" log lines | Pool exhaustion means new requests wait indefinitely for a DB slot. |
| 5 | If overloaded: add a replica (scale out) or shed traffic via the load balancer | Does response time recover after scaling? If yes, root cause is capacity. |

Common root causes: Sudden traffic spike exceeding capacity, a slow external API dependency, connection pool exhaustion (all DB connections held by long-running tool calls), or a memory leak causing long garbage-collection pauses.
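
If the logs do not make the slow dependency obvious, timing each one directly can narrow it down. The helper below is a rough sketch of that idea: the check names and the callables you pass in are your own (for example a trivial DB query or an HTTP GET against an external API), and the budget is an arbitrary threshold you choose.

```python
import time


def time_dependencies(checks, budget_seconds):
    """Run each named check once and return timing results, slowest first.

    `checks` maps a label to a zero-argument callable that exercises one
    dependency. A check that raises is still timed and marked as failed.
    """
    results = []
    for name, check in checks.items():
        start = time.perf_counter()
        ok = True
        try:
            check()
        except Exception:
            ok = False
        elapsed = time.perf_counter() - start
        results.append(
            {
                "name": name,
                "seconds": elapsed,
                "ok": ok,
                "over_budget": elapsed > budget_seconds,
            }
        )
    return sorted(results, key=lambda r: r["seconds"], reverse=True)
```

The first entry in the returned list is the slowest dependency, which is usually the place to start looking.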

Failure mode 4: Schema drift

AliveMCP failure reason: "Schema drift detected: tool X removed", "Tool definition changed: parameter Y type changed from string to number"

What it means: AliveMCP compares the current tools/list response against the last known good schema. A difference means a tool was added, removed, or its parameter schema changed. This is not necessarily a downtime event — the server is responding — but it can silently break all clients that were using the removed tool or expecting the old parameter type.

| Step | Action | What to verify |
|------|--------|----------------|
| 1 | Run tools/list manually and compare against the previous known-good schema | Which tool changed? Was the change intentional? |
| 2 | Check recent deploys: did this deploy intentionally change the tool schema? | If intentional: acknowledge the drift in AliveMCP, update the baseline schema. |
| 3 | If unintentional: check environment variables and config files for accidental differences | Is a feature flag in production disabling a tool that is enabled in dev? |
| 4 | Notify users of any breaking change: post in your community Discord / email list | Clients using the old tool definition will get unexpected errors until they update. |
| 5 | Update AliveMCP's baseline schema to match the new expected tool list | The drift alert should resolve once the baseline is updated. |

Common root causes: A deploy that accidentally included a tool refactoring not yet ready for production, an environment variable controlling tool availability that differs between dev and prod, or a version of the MCP SDK that changed the parameter schema format.
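
The comparison in step 1 can be done by hand, but a small diff makes it harder to miss a changed parameter. This is a minimal sketch, assuming you have saved two tools/list results as parsed JSON; it only compares tool names and top-level parameter types, and it is not how AliveMCP itself computes drift.

```python
def diff_tools(baseline, current):
    """Compare two tools/list results and report added, removed, and changed tools."""
    old = {tool["name"]: tool for tool in baseline.get("tools", [])}
    new = {tool["name"]: tool for tool in current.get("tools", [])}

    changed = []
    for name in sorted(old.keys() & new.keys()):
        old_props = old[name].get("inputSchema", {}).get("properties", {})
        new_props = new[name].get("inputSchema", {}).get("properties", {})
        for param in sorted(old_props.keys() | new_props.keys()):
            if param not in new_props:
                changed.append(f"{name}: parameter {param} removed")
            elif param not in old_props:
                changed.append(f"{name}: parameter {param} added")
            else:
                before = old_props[param].get("type")
                after = new_props[param].get("type")
                if before != after:
                    changed.append(
                        f"{name}: parameter {param} type changed "
                        f"from {before} to {after}"
                    )

    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": changed,
    }
```

Removed tools and changed parameter types are the breaking cases for existing clients. Added tools are normally safe and only need a baseline update.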

Failure mode 5: Elevated error rate

AliveMCP failure reason: "Tool call error rate above threshold", "tools/list returning errors for N% of requests"

What it means: The server is up and passing initialize, but a fraction of tool calls are returning errors. This is a partial failure — some users succeed, some fail — which is often harder to diagnose than a complete outage.

| Step | Action | What to look for |
|------|--------|------------------|
| 1 | Read application error logs for the error patterns | Are the errors from one tool or all tools? One error type or many? |
| 2 | Check if errors correlate with specific request patterns | Errors only for large payloads? Only for authenticated users? Only for a specific tool? |
| 3 | Check external dependency status pages (GitHub status, AWS Health) | Is an API your tools depend on experiencing elevated error rates? |
| 4 | Check if the error rate is increasing, stable, or decreasing over the AliveMCP graph | Increasing: still degrading. Stable: limited blast radius. Decreasing: self-recovering. |
| 5 | Determine impact: which users / which tool calls are affected? | Post a targeted status update to affected users rather than a blanket announcement. |
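
Steps 1 and 4 amount to two small calculations: grouping errors to see whether one tool dominates, and classifying the direction of the error rate. The sketch below illustrates both. The log entry fields (`tool`, `error_type`) are hypothetical and should be adapted to your log format, and the half-versus-half trend comparison with a fixed tolerance is a deliberately simple heuristic.

```python
from collections import Counter


def summarize_errors(entries):
    """Count errors per (tool, error_type) pair, most frequent first."""
    return Counter((e["tool"], e["error_type"]) for e in entries).most_common()


def error_trend(rates, tolerance=0.01):
    """Classify error-rate samples (oldest first) as increasing, stable, or decreasing.

    Compares the mean of the older half of the samples with the mean of
    the newer half; differences within `tolerance` count as stable.
    """
    if len(rates) < 2:
        return "stable"
    half = len(rates) // 2
    earlier = sum(rates[:half]) / half
    later = sum(rates[half:]) / (len(rates) - half)
    if later - earlier > tolerance:
        return "increasing"
    if earlier - later > tolerance:
        return "decreasing"
    return "stable"
```

If a single (tool, error type) pair accounts for most of the errors, the blast radius is probably limited to that tool, which also tells you who needs the targeted status update in step 5.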

Escalation decision tree

Use this decision tree at any point during incident response to decide whether to escalate.
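
As a rough illustration, the one escalation rule this runbook states explicitly (escalate and post a status update if the root cause is not identified within 15 minutes) can be written as a small function. This is a sketch of that single rule only, not a reconstruction of a full decision tree; the function name and return strings are assumptions made for the example.

```python
ESCALATION_THRESHOLD_MINUTES = 15  # from the runbook's TL;DR rule


def escalation_decision(minutes_elapsed, root_cause_identified):
    """Apply the 15-minute rule: escalate when the root cause is still unknown."""
    if root_cause_identified:
        return "continue remediation"
    if minutes_elapsed >= ESCALATION_THRESHOLD_MINUTES:
        return "escalate and post a status update"
    return "keep investigating"
```

The point of encoding the threshold as a constant is that it becomes a team decision made in advance, rather than a judgment call made at 2 AM.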

Postmortem template

Complete a postmortem for every P1 incident and any P2 incident lasting more than 30 minutes. The postmortem is not a blame document — it is a system analysis.

## MCP Server Incident Postmortem

**Date:** YYYY-MM-DD
**Severity:** P1 / P2
**Duration:** N minutes
**Affected server(s):** [server slug(s)]
**Author(s):** [names]

### Incident summary
[1-2 sentences: what happened, when, and what the user impact was]

### Timeline
- HH:MM UTC — AliveMCP detected failure: [failure reason]
- HH:MM UTC — On-call paged via [PagerDuty / OpsGenie / Discord]
- HH:MM UTC — First response: [who, what action]
- HH:MM UTC — Root cause identified: [what it was]
- HH:MM UTC — Fix applied: [what was done]
- HH:MM UTC — AliveMCP confirmed recovery

### Root cause
[Technical description of what caused the failure]

### Contributing factors
[What made this failure mode possible — missing validation, insufficient capacity, no circuit breaker, etc.]

### What went well
[Things that worked: fast detection, good runbook, quick escalation, etc.]

### What went poorly
[Things that didn't work: missing credentials, runbook out of date, no access from mobile, etc.]

### Action items
| Action | Owner | Due date |
|--------|-------|----------|
| [Specific fix] | [Name] | [Date] |
| [Runbook update] | [Name] | [Date] |
| [Monitoring improvement] | [Name] | [Date] |

Link the postmortem from the closed PagerDuty/OpsGenie incident and from the Discord/Slack incident thread. Over time, the collection of postmortems becomes your most valuable operational knowledge base — the real record of how your MCP server has failed and what you learned.

Frequently asked questions

Where should I store the runbook so it is accessible during an incident?

The runbook must be accessible from: your phone on mobile data (in case the VPN is down), a browser on any device (not locked behind a company SSO that might be affected by the same incident), and without authentication if possible (or with very simple authentication). Notion, GitHub Pages, Confluence, and your AliveMCP status page's custom pages are all good hosts. Do not host the runbook on the same infrastructure as the MCP server — if the VPS is down, the runbook should still be accessible. The best practice is to link the runbook URL directly from your PagerDuty/OpsGenie alert payload and from the AliveMCP status page for the server, so the on-call person has one tap from the alert to the playbook.

How often should I update this runbook?

Update the runbook after every incident where you discovered a step was wrong, missing, or out of date. This is the single most important runbook hygiene practice. A runbook that was accurate in January and hasn't been touched since is likely wrong in June. Specifically: update credential locations when they change, update command examples when your deployment tooling changes (PM2 → Kubernetes → Docker Compose), and add new failure modes when you encounter a failure type not in the existing playbook. Treat the runbook as a living document with the same version discipline as your code. Every incident response should end with a check: "Did the runbook have everything we needed? If not, update it."

What should I do if the runbook doesn't cover the failure mode I am seeing?

Proceed with the standard first-principles investigation: (1) check if the server process is running, (2) check application logs for recent errors, (3) check if the failure correlates with any recent deploy or change, (4) check external dependencies. If you cannot resolve it within 15 minutes and it is a P1, escalate; do not wait until you understand it completely before calling for help. Once the incident is resolved, add the new failure mode to the runbook as a new section while the details are fresh. Over time, a good runbook grows from incident experience rather than being designed from theory.

Should I do a postmortem for every MCP server incident?

For P1 incidents (complete outage, internal server down), always. For P2 incidents lasting more than 30 minutes, usually. For P3 and P4, a brief internal note is sufficient — not a full postmortem. The postmortem cost is time and attention; the benefit is organizational learning and prevention of recurring incidents. For a solo indie MCP author, a lightweight postmortem (5 bullet points: what happened, why, what I did, what I'll do differently, action items) is more sustainable than a formal document and still captures the most important learnings. The goal is not process compliance — it is not having the same incident twice.

How does AliveMCP help shorten incident response time?

AliveMCP reduces two of the most expensive phases of incident response: detection lag and context lag. Detection lag is the time between when the server fails and when you know about it. Without monitoring, this is however long it takes for a user to file a complaint — often 10–30 minutes or more. AliveMCP detects failure within 2 minutes. Context lag is the time you spend investigating basics before you can start diagnosing: is it down? Since when? What is the failure reason? The AliveMCP dashboard shows you the failure reason, the exact time it started, the 90-day history (to see if this is a recurring pattern), and the response time trend (to distinguish a sudden crash from a gradual degradation). This context, surfaced immediately in the alert payload, means the on-call person starts at step 3 of the runbook rather than step 1.
