Kubernetes guide · 2026-06-18 · Kubernetes Native Runtime Patterns
MCP Servers in Production: Kubernetes Liveness, Readiness, Scaling, Load Testing, and Capacity Planning
Kubernetes gives you five distinct runtime tools for operating an MCP server in production — liveness probes that restart hung containers, readiness probes that gate traffic until the server is truly ready, horizontal autoscaling that adds and removes pods under load, k6 load testing that stress-tests the MCP protocol before a deploy ships, and capacity planning that sizes resources before the first spike arrives. Each provides a different window into server health. Each has the same structural blind spot: it operates from inside the cluster, over the pod network, bypassing the Ingress, the TLS certificate, and the network path that LLM clients actually traverse. This guide synthesizes all five tools, maps the specific failure class each catches, and explains why an external protocol probe from AliveMCP is the layer that completes the picture — not because Kubernetes tooling is insufficient, but because it answers a fundamentally different question.
Five tools, five windows, one shared gap
The comparison table below maps each Kubernetes runtime tool to its role, the MCP-specific health signal it provides, and the failure class that its architecture cannot detect.
| Tool | Primary role | MCP-specific signal | Structural blind spot |
|---|---|---|---|
| Liveness probe | Restart containers that have entered an unrecoverable hang | /live endpoint exercises event loop: detects hung async queue, OOM-killed worker thread, deadlocked Promise chain | Kubelet probes via pod IP, bypassing Ingress/TLS; cert expiry and Ingress misconfiguration are invisible |
| Readiness probe | Remove overloaded pods from the load balancer without restarting | /ready checks actual MCP dependencies: DB connection pool idle slots, tool registry built, required secrets loaded | Kubelet probes from inside; simultaneous all-pod readiness failure (shared dependency) invisible from inside the cluster |
| HPA / KEDA | Add and remove pods to match load; maintain SSE session continuity during scale events | CPU/memory HPA for Streamable HTTP; KEDA mcp_active_sse_connections metric for SSE; sticky-session Ingress annotations | New pods that pass readiness but serve the wrong MCP protocol version are invisible until a real client connects |
| k6 load testing | Stress-test MCP protocol flows pre-deploy; validate HPA thresholds with synthetic load | Full 4-step VU function (initialize → initialized → tools/list → tools/call); custom mcp_init_errors and mcp_tool_duration metrics; CI deploy gate | k6 runs pre-deploy from the test runner; nothing is watching the production endpoint after the test exits |
| Capacity planning | Size replicas, memory limits, and connection pools before load arrives | Concurrent session formula; memory bucket model; connection pool sizing; HPA threshold tuning | Capacity planning is a pre-launch exercise; degradation at runtime (memory leak, rising P95) requires a continuous signal to detect |
The pattern across all five: each tool operates from a privileged position inside the cluster's own infrastructure. The kubelet fires probes over the pod network. KEDA reads metrics from Prometheus running inside the cluster. k6 connects from a test runner that typically shares infrastructure with the deployment target. Capacity planning is done with spreadsheets and load test results before production traffic arrives. None of these tools run from the same network path — through the public Ingress, through the TLS certificate, through the same DNS resolution — that an LLM client uses when it calls tools/call on your MCP server.
Liveness probes: detecting hangs the process cannot self-recover from
MCP servers are long-lived processes. Unlike a stateless REST API, where each request runs in a fresh context, an MCP server accumulates in-memory session state, holds open SSE connections, and runs tool handlers that call external APIs, query databases, and spawn child processes. Any of these can enter an unrecoverable state: a deadlocked Promise chain where two async operations wait on each other, an OOM-killed worker thread that leaves the main process alive but silently dropping dispatched work, or connection pool saturation where every DB connection is held by a slow tool call and new requests queue indefinitely.
A tcpSocket probe confirms only that the port is open. It cannot detect any of these failure modes, because the port stays open even when the event loop is completely hung. The correct probe for an MCP server is an httpGet probe against a dedicated /live endpoint that exercises the event loop:
// Node.js: /live endpoint that enqueues work on the event loop
// If the loop is hung, the Promise never resolves and the probe times out
app.get('/live', async (req, res) => {
  await new Promise(resolve => setImmediate(resolve));
  // setImmediate fires only after all I/O events in the current iteration complete
  // A hung event loop delays this resolution, so the probe timeout catches the hang
  res.json({ status: 'live' });
});
setImmediate is the right primitive because it runs at the end of the current event loop iteration, after all pending I/O callbacks. A synchronous hang — a blocking JSON parse, a blocking crypto operation, a deep recursive traversal inside a tool handler — will delay the setImmediate callback. If the delay exceeds the probe's timeoutSeconds, the probe fails and Kubernetes restarts the container. The Python equivalent is await asyncio.sleep(0), which yields control to the event loop for one iteration.
The four probe types, ordered by what they detect for MCP servers:
| Probe type | What it checks | Catches event loop hang? | Recommended for MCP? |
|---|---|---|---|
| tcpSocket | TCP port open and accepting connections | No: port stays open even when event loop is hung | No: insufficient |
| httpGet → trivial path | HTTP 200 from a hardcoded response (no async work) | No: a pre-allocated buffer can return 200 without touching the event loop | Partial: better than TCP, misses most hangs |
| httpGet → /live with async probe | HTTP 200 from a handler that awaits a minimal async operation | Yes: event loop must process the handler to respond | Yes: recommended |
| exec | Exit code of a shell script inside the container | No: exec spawns a new process, separate from the event loop | No: wrong tool for event-loop health |
initialDelaySeconds tuning matters for MCP servers specifically because startup times vary dramatically with tool registry size and database connection pool warmup. A server that connects to a PostgreSQL pool and caches tool schemas from an external registry on startup can take 30–90 seconds before the first initialize can succeed. Set initialDelaySeconds generously — a liveness probe that fires before warmup is complete will restart a healthy server on every deploy. For highly variable startup times, use a startupProbe with a high failureThreshold for the startup phase, then switch to a tighter liveness probe for steady state.
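The startup budget that a startupProbe grants can be checked with simple arithmetic before a deploy. The sketch below assumes the documented kubelet behaviour that a container has roughly initialDelaySeconds + failureThreshold × periodSeconds to pass its startup probe. The helper name and the example probe values are illustrative assumptions, not settings from a real deployment.

```javascript
// Hypothetical helper: how long a container may take to start before the
// kubelet gives up on its startupProbe and restarts it.
// Defaults mirror the Kubernetes probe defaults (0 s delay, 10 s period, 3 failures).
function startupBudgetSeconds({ initialDelaySeconds = 0, periodSeconds = 10, failureThreshold = 3 } = {}) {
  return initialDelaySeconds + periodSeconds * failureThreshold;
}

// Assumed example: warmup measured at up to 90 s (the upper bound quoted above).
const worstCaseWarmupSeconds = 90;
const startupProbe = { initialDelaySeconds: 5, periodSeconds: 5, failureThreshold: 24 };

const budget = startupBudgetSeconds(startupProbe); // 5 + 5 * 24 = 125
const coversWarmup = budget >= worstCaseWarmupSeconds;
console.log({ budget, coversWarmup });
```

With the default probe settings the same helper returns 30 seconds, which would not cover a 90-second warmup and is the situation that produces a restart on every deploy.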
The liveness probe blind spot is structural and worth stating precisely: the kubelet sends the probe request directly to the pod IP over the cluster's pod network. It bypasses the Ingress, the load balancer, and the TLS certificate entirely. A TLS certificate that expires at 3am is invisible to the liveness probe — the pod continues passing liveness while every external LLM client gets a TLS handshake error. An Ingress annotation misconfiguration that causes nginx to return 502 is invisible to the liveness probe. A DNS resolution failure on the public domain is invisible. AliveMCP's probe, by contrast, connects through the full public network path — through DNS, through the TLS certificate, through the Ingress — and fails within 60 seconds of any failure in that path, regardless of what the kubelet is reporting.
Readiness probes: gating traffic until MCP dependencies are ready
The liveness/readiness distinction is frequently misunderstood, and for MCP servers the difference is operationally significant. Liveness failure triggers a container restart — appropriate for an unrecoverable hang. Readiness failure removes the pod from the Service's Endpoints list without restarting it — appropriate for a recoverable, temporary not-ready state. For MCP servers with long-lived SSE connections, this distinction determines whether a temporary overload event disconnects all active clients (restart) or gracefully sheds new load until the pod recovers (readiness failure).
A /ready endpoint for an MCP server should check the dependencies that matter for serving tools/call requests — not just whether the process is alive:
app.get('/ready', async (req, res) => {
  const checks = {
    db: false,
    toolRegistry: false,
    secrets: false,
  };

  // DB connection pool: check for spare capacity, not just connectivity
  // A pool where all connections are held is "connected" but not ready
  try {
    const pool = getDbPool();
    // Assumes a node-postgres Pool, which opens connections lazily: a fresh pool
    // has idleCount === 0, so also count connections it has not opened yet
    checks.db = pool.idleCount > 0 || pool.totalCount < pool.options.max;
  } catch { checks.db = false; }

  // Tool registry: was it built successfully at startup?
  checks.toolRegistry = toolRegistry.isReady();

  // Required secrets: were all required env vars loaded?
  checks.secrets = Boolean(process.env.API_KEY && process.env.DB_URL);

  const ready = Object.values(checks).every(Boolean);
  res.status(ready ? 200 : 503).json({ ready, checks });
});
The connection pool check is the most operationally valuable for MCP servers under load. When all DB connections are held by slow tool calls, pool.idleCount === 0 — the pod fails readiness and is removed from the load balancer. New SSE connections route to other pods. The overloaded pod drains its backlog, connection pool slots free up, readiness passes again, and the pod rejoins the load balancer without a single client disconnection. This is a self-regulating feedback loop that Kubernetes provides for free, but only if the readiness probe actually checks pool saturation rather than just HTTP reachability.
For SSE-transport servers, set successThreshold: 2 to prevent the probe from oscillating. A pod under intermittent load may briefly fail readiness (pool full), then pass (one slot freed), then fail again. successThreshold: 2 requires two consecutive successful probes before the pod rejoins the load balancer, preventing rapid in-out-in-out membership changes that cause SSE session assignment churn at the Ingress.
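The effect of successThreshold on membership churn can be seen by replaying a sequence of probe results. The following is a simplified model of the kubelet's consecutive-success and consecutive-failure counting, not its actual implementation; the function names and the flapping sequence are assumptions chosen for illustration.

```javascript
// Replays probe results (true = pass) and returns, for each probe, whether the
// pod would be listed in the Service endpoints. Simplified model: the pod starts
// not ready and flips state only after N consecutive passes or failures.
function simulateReadiness(results, { successThreshold = 1, failureThreshold = 3 } = {}) {
  let ready = false;
  let passes = 0;
  let failures = 0;
  return results.map((passed) => {
    if (passed) {
      passes += 1;
      failures = 0;
      if (!ready && passes >= successThreshold) ready = true;
    } else {
      failures += 1;
      passes = 0;
      if (ready && failures >= failureThreshold) ready = false;
    }
    return ready;
  });
}

// Counts how often the pod enters or leaves the endpoints list.
function countTransitions(states) {
  let transitions = 0;
  for (let i = 1; i < states.length; i += 1) {
    if (states[i] !== states[i - 1]) transitions += 1;
  }
  return transitions;
}

// Assumed probe history for a pod whose pool keeps filling and freeing one slot.
const flapping = [false, true, false, true, false, true, true, true];
const loose = simulateReadiness(flapping, { successThreshold: 1, failureThreshold: 1 });
const strict = simulateReadiness(flapping, { successThreshold: 2, failureThreshold: 1 });
console.log(countTransitions(loose), countTransitions(strict));
```

In this model the pod with successThreshold: 1 changes membership five times over eight probes, while successThreshold: 2 holds it out until the pool has stayed healthy for two probes in a row, giving a single transition.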
The readiness probe blind spot that AliveMCP uniquely catches: simultaneous all-pod readiness failure due to a shared dependency. If your PostgreSQL database goes down, every pod fails readiness simultaneously and the Kubernetes Service has no endpoints. From inside the cluster, this looks like zero ready pods; Kubernetes will not restart them because they are failing readiness, not liveness. From outside the cluster, LLM clients immediately start receiving errors, typically 503 Service Unavailable from the Ingress controller, which has no backend to route to. AliveMCP detects this within 60 seconds and alerts, while the internal view shows nothing that warrants a restart signal.
Horizontal scaling: HPA for Streamable HTTP, KEDA for SSE
The fundamental choice between SSE transport and Streamable HTTP transport determines the entire horizontal scaling story. Streamable HTTP can run statelessly: when the server keeps no per-session state in pod memory, each request is independent, any pod handles any request, and CPU/memory HPA works immediately without additional configuration. SSE is stateful: each client establishes a long-lived connection to a specific pod, and that affinity must be maintained throughout the session. Scaling an SSE-transport MCP server without accounting for this produces broken client experiences: follow-up POST requests that carry the session token arrive at a different pod than the one holding the SSE connection, producing immediate protocol errors.
For Streamable HTTP servers, standard HPA configuration is sufficient, with one MCP-specific tuning: set the CPU target lower than for typical REST APIs. MCP tool calls are often CPU-intensive (JSON parsing of large schemas, cryptographic operations, recursive data traversals). Setting the target to 60% CPU utilization (vs the typical 80%) gives the autoscaler room to add pods before tool call latency degrades noticeably:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
# metadata, scaleTargetRef, minReplicas, and maxReplicas omitted for brevity
spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # Lower than typical REST; MCP tool calls are CPU-intensive
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately during spikes
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before removing pods
For SSE-transport servers, CPU utilization is the wrong metric. An idle SSE connection consumes a file descriptor, a libuv event loop watcher, and 2–10KB of memory — but near-zero CPU. A pod holding 500 idle SSE connections appears to have very low CPU utilization and triggers a scale-down, closing all 500 connections. The correct metric is active SSE connection count, which KEDA can read from a Prometheus metric:
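apiVersion: keda.sh/v1alpha1
kind: ScaledObject
# metadata and scaleTargetRef omitted for brevity
spec:
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: mcp_active_sse_connections
        threshold: "50" # Scale up when the average pod holds >50 active SSE connections
        # Return the total across pods: with KEDA's default AverageValue metric type,
        # the query result is divided by the replica count before comparison
        query: sum(mcp_active_sse_connections{job="mcp-server"})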
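To see why the choice of metric matters, it helps to run both signals through the replica calculation the Horizontal Pod Autoscaler documents: desired = ceil(currentReplicas × currentValue / targetValue). The sketch below applies that formula and ignores the autoscaler's tolerance band, stabilization windows, and min/max bounds (apart from an assumed floor of one replica). The fleet numbers are invented for illustration.

```javascript
// HPA core formula: desired = ceil(currentReplicas * currentValue / targetValue).
// The floor of 1 replica is an assumption standing in for minReplicas.
function desiredReplicas(currentReplicas, currentValue, targetValue) {
  return Math.max(1, Math.ceil((currentReplicas * currentValue) / targetValue));
}

// Assumed fleet: 4 pods, each holding 120 mostly idle SSE connections at 8% CPU.
const currentPods = 4;

// CPU view: 8% average utilization against a 60% target.
const byCpu = desiredReplicas(currentPods, 8, 60); // ceil(0.533...) = 1

// Connection view: 120 connections per pod against a target of 50 per pod.
const byConnections = desiredReplicas(currentPods, 120, 50); // ceil(9.6) = 10

console.log({ byCpu, byConnections });
```

The same fleet is told to shrink to one pod by the CPU signal and to grow to ten by the connection signal, which is the mismatch described above for idle SSE connections.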
SSE stickiness requires Ingress-level session affinity. nginx Ingress supports this via cookie-based routing — the first request from a client sets a cookie that subsequent requests carry, and nginx uses it to route to the same pod. The SSE connection also requires disabling proxy buffering and extending timeouts so nginx does not buffer SSE events or close idle connections before the client has finished a long tool call:
annotations:
  nginx.ingress.kubernetes.io/affinity: "cookie"
  nginx.ingress.kubernetes.io/session-cookie-name: "mcp-affinity"
  nginx.ingress.kubernetes.io/session-cookie-max-age: "86400"
  nginx.ingress.kubernetes.io/proxy-buffering: "off"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
Scale-in for SSE servers requires a graceful termination handler. When KEDA decides to remove a pod, Kubernetes sends SIGTERM. The pod should stop accepting new connections immediately (fail readiness) and wait for existing SSE connections to drain before exiting:
process.on('SIGTERM', () => {
  markUnready(); // Fail /ready immediately, which stops new SSE connections routing here
  setTimeout(() => process.exit(0), 55_000); // 55s drain within 60s terminationGracePeriodSeconds
});
The scaling blind spot that only AliveMCP catches: a rolling deploy that passes readiness but ships the wrong MCP SDK version. The new pods respond to Kubernetes health checks — HTTP 200, DB pool has idle slots, tool registry is built — but the initialize response carries a protocol version that the LLM client rejects. Inside the cluster, readiness is green. From outside, every new client gets a protocol version mismatch. AliveMCP's protocol probe sends the full initialize JSON-RPC sequence and validates the protocolVersion field in the response — it detects this within 60 seconds of the first mismatched pod becoming active, before HPA has scaled the bad version across the fleet.
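The protocol-level check described above can be approximated in a few lines. This is a hedged sketch of what validating an initialize response might look like, not AliveMCP's actual implementation: the function name, the result shape, and the list of accepted versions are assumptions, and it expects the response body as a plain JSON string rather than an SSE-framed stream.

```javascript
// Hypothetical validator for the body of an MCP initialize response.
// Returns { ok: true, version } or { ok: false, reason }.
function checkInitializeResponse(bodyText, acceptedVersions) {
  let message;
  try {
    message = JSON.parse(bodyText);
  } catch {
    return { ok: false, reason: 'body is not valid JSON' };
  }
  if (message === null || typeof message !== 'object') {
    return { ok: false, reason: 'body is not a JSON-RPC object' };
  }
  if (message.jsonrpc !== '2.0') {
    return { ok: false, reason: 'missing jsonrpc 2.0 marker' };
  }
  if (message.error) {
    return { ok: false, reason: `JSON-RPC error: ${message.error.message}` };
  }
  const version = message.result?.protocolVersion;
  if (typeof version !== 'string') {
    return { ok: false, reason: 'missing protocolVersion' };
  }
  if (!acceptedVersions.includes(version)) {
    return { ok: false, reason: `unsupported protocolVersion ${version}` };
  }
  return { ok: true, version };
}

// Example bodies (invented): one healthy pod, one pod shipping another version,
// and one HTML error page of the kind an Ingress returns.
const accepted = ['2024-11-05'];
const healthy = checkInitializeResponse(
  '{"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2024-11-05"}}', accepted);
const mismatched = checkInitializeResponse(
  '{"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"1999-01-01"}}', accepted);
const gatewayError = checkInitializeResponse('<html>502 Bad Gateway</html>', accepted);
console.log(healthy, mismatched, gatewayError);
```

A readiness endpoint that only returns HTTP 200 would pass all three pods; only the content check separates the healthy response from the other two.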
k6 load testing: validating MCP protocol flows before they reach production
k6 is not just a performance tool for MCP servers; it is the primary instrument for validating that the entire MCP protocol sequence holds up under concurrent load. The MCP protocol is stateful within a session: a client must send initialize, receive the server's response, and then send a notifications/initialized notification before tools/list and tools/call are valid. A k6 VU (virtual user) that simulates this sequence exercises the protocol in a way that simple HTTP load generators cannot.
A complete k6 VU function for an MCP server follows four steps:
import http from 'k6/http';
import { check } from 'k6';
import { Counter, Trend } from 'k6/metrics';

// Endpoint under test, supplied at run time: k6 run -e MCP_URL=https://... script.js
const MCP_URL = __ENV.MCP_URL;

const mcpInitErrors = new Counter('mcp_init_errors');
const mcpToolDuration = new Trend('mcp_tool_duration', true); // true = values are times (ms)

// Parse a JSON-RPC body without throwing on an empty or non-JSON response
function parseBody(res) {
  try { return JSON.parse(res.body) || {}; } catch (e) { return {}; }
}

export default function () {
  const headers = { 'Content-Type': 'application/json' };

  // Step 1: initialize (establish protocol version)
  const initRes = http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0', id: 1, method: 'initialize',
    params: {
      protocolVersion: '2024-11-05',
      capabilities: {}, // required by the MCP spec, even when empty
      clientInfo: { name: 'k6', version: '1.0' },
    },
  }), { headers });
  const initOk = check(initRes, {
    'initialize: 200': r => r.status === 200,
    'initialize: protocolVersion matches': r =>
      parseBody(r).result?.protocolVersion === '2024-11-05',
  });
  if (!initOk) { mcpInitErrors.add(1); return; } // Don't proceed with broken session

  // Step 2: initialized notification (required by MCP spec before tools/list)
  http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0', method: 'notifications/initialized'
  }), { headers });

  // Step 3: tools/list (validate tool schema is served correctly)
  const listRes = http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0', id: 2, method: 'tools/list'
  }), { headers });
  check(listRes, { 'tools/list: has tools': r => parseBody(r).result?.tools?.length > 0 });

  // Step 4: tools/call (measure actual tool execution time)
  // A scenario can pick the tool through TOOL_NAME; arguments must match that tool's schema
  const toolName = __ENV.TOOL_NAME || 'ping';
  const startTs = Date.now();
  const callRes = http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0', id: 3, method: 'tools/call',
    params: { name: toolName, arguments: { target: 'example.com' } }
  }), { headers, timeout: '30s' });
  mcpToolDuration.add(Date.now() - startTs);
  check(callRes, { 'tools/call: result present': r => parseBody(r).result != null });
}
The custom metrics are the most operationally useful part. mcp_init_errors counts initialization failures separately from HTTP errors: an MCP server can return HTTP 200 on the initialize request but with a wrong or missing protocolVersion, which the metric catches without needing to inspect every response manually. mcp_tool_duration is a Trend metric (the second constructor argument, true, marks its values as time values so k6 formats them as durations). By default the end-of-test summary shows avg, min, med, max, p(90), and p(95) for a Trend; add p(99) through the summaryTrendStats option if you need it. The percentile distribution tells you whether latency is uniform (a flat fast response) or has a long tail (a shared external dependency with occasional slow responses).
Tool scenario distribution matters for realistic load tests. Most MCP servers have a mix of fast, medium, and slow tools. Weighting the load test to match production distribution produces more accurate HPA threshold calibration:
export const options = {
  scenarios: {
    fast_db_queries: { executor: 'constant-vus', vus: 50, duration: '5m',
      env: { TOOL_NAME: 'query_db' } }, // 50% of VUs
    medium_search: { executor: 'constant-vus', vus: 30, duration: '5m',
      env: { TOOL_NAME: 'search_index' } }, // 30% of VUs
    slow_external: { executor: 'constant-vus', vus: 20, duration: '5m',
      env: { TOOL_NAME: 'fetch_external' } } // 20% of VUs
  },
  thresholds: {
    'mcp_init_errors': ['count<5'], // Less than 5 init failures across entire test
    'mcp_tool_duration{scenario:slow_external}': ['p(95)<8000'], // Slow tool P95 under 8s
    'http_req_failed': ['rate<0.01'], // Less than 1% HTTP errors
  }
};
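When the production mix changes, hand-editing VU counts invites drift between the percentages and the numbers. One option is to derive the scenarios object from a total VU count and a share per tool, as in the sketch below. The helper name, the mix values, and the total of 200 VUs are assumptions; only the scenario fields (executor, vus, duration, env) are standard k6 options.

```javascript
// Hypothetical helper: turn a traffic mix (shares summing to 1) into k6 scenarios.
function buildScenarios(totalVus, duration, mix) {
  const scenarios = {};
  for (const [name, { share, tool }] of Object.entries(mix)) {
    scenarios[name] = {
      executor: 'constant-vus',
      vus: Math.max(1, Math.round(totalVus * share)), // never schedule zero VUs
      duration,
      env: { TOOL_NAME: tool },
    };
  }
  return scenarios;
}

// Assumed mix, mirroring the 50/30/20 split used above.
const productionMix = {
  fast_db_queries: { share: 0.5, tool: 'query_db' },
  medium_search: { share: 0.3, tool: 'search_index' },
  slow_external: { share: 0.2, tool: 'fetch_external' },
};

const scenarios = buildScenarios(200, '5m', productionMix);
console.log(JSON.stringify(scenarios, null, 2));
```

In a k6 script the result would be assigned to options.scenarios, leaving the thresholds block unchanged.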
The CI deploy gate connects k6 to the deployment pipeline. A failed mcp_init_errors threshold or a P95 latency regression blocks the deploy before production traffic hits the new version. A passing threshold does not guarantee production reliability — it guarantees the protocol holds under synthetic load from the test runner's network position. This is the boundary where k6 and AliveMCP's roles separate cleanly: k6 is pre-deploy validation, AliveMCP is post-deploy continuous monitoring. k6 tells you the server is correct under load before users arrive. AliveMCP tells you the server remains correct after users arrive, after certificate renewals, after upstream API changes, after the HPA scaled to a misconfigured pod count at 2am.
Capacity planning: sizing before the spike, using AliveMCP to see the squeeze
Capacity planning for MCP servers differs from typical REST API sizing in three ways: SSE connections consume memory and file descriptors even when idle; tool calls have dramatically wider latency variance than REST handlers (a tool that queries a database, calls an external API, and parses a large response can take 1–15 seconds, vs a few milliseconds for a typical REST endpoint); and the MCP protocol adds initialization overhead on every new client session that REST APIs don't have.
The concurrent session estimate is the foundational number everything else derives from:
concurrent_sessions = (DAU × sessions_per_user × avg_session_duration_min) / 1440 × peak_factor
# Example: 1,000 DAU × 5 sessions/user × 8 min/session / 1440 × 3.0 peak = 83 concurrent sessions
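The estimate is easy to keep next to the load test configuration as a small function. This sketch encodes the formula above directly; the parameter names are chosen here for readability and are not from any library.

```javascript
// Concurrent sessions at peak, from the formula above:
// (DAU * sessions per user * average session minutes) / 1440 * peak factor
function concurrentSessions({ dau, sessionsPerUser, avgSessionMinutes, peakFactor }) {
  const minutesPerDay = 1440;
  return ((dau * sessionsPerUser * avgSessionMinutes) / minutesPerDay) * peakFactor;
}

const estimate = concurrentSessions({
  dau: 1000,
  sessionsPerUser: 5,
  avgSessionMinutes: 8,
  peakFactor: 3.0,
});
console.log(Math.round(estimate)); // 83
```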
Memory sizing uses a bucket model. Each component adds to the per-replica memory floor:
| Component | Typical range | Notes |
|---|---|---|
| Node.js baseline + V8 heap | 50–80 MB | Fixed regardless of load |
| Tool registry (schemas, validators) | 10–50 MB | Scales with number and complexity of tools |
| Per-session state (SSE only) | 2–10 KB/session | 50 active sessions ≈ 0.5 MB — typically negligible |
| In-flight tool call buffers | 100 KB–10 MB per concurrent call | Tools returning large JSON payloads dominate; measure with k6 |
| DB connection pool | 1–5 MB per 10 connections | Shared across all sessions; pool size drives this |
| GC headroom (Node.js) | 2× working set | V8 GC triggers near the heap limit — set limit at 1.5–2× working set |
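Summing the buckets for a concrete replica makes the model usable as a starting memory limit. The sketch below is one way to do that under assumed inputs: every number in the example is a placeholder picked from inside the ranges in the table, and should be replaced by values measured with k6 against your own tools.

```javascript
// Adds up the memory buckets from the table and applies GC headroom.
// All inputs are per replica; sizes in MB unless the name says KB.
function memoryLimitMb({
  baselineMb, registryMb, sessions, perSessionKb,
  concurrentCalls, perCallMb, poolConnections, poolMbPer10, headroomFactor,
}) {
  const sessionMb = (sessions * perSessionKb) / 1024;
  const callMb = concurrentCalls * perCallMb;
  const poolMb = (poolConnections / 10) * poolMbPer10;
  const workingSetMb = baselineMb + registryMb + sessionMb + callMb + poolMb;
  return { workingSetMb, limitMb: Math.ceil(workingSetMb * headroomFactor) };
}

// Assumed example replica (placeholder values, not measurements).
const sizing = memoryLimitMb({
  baselineMb: 80,      // Node.js baseline + V8 heap
  registryMb: 30,      // tool schemas and validators
  sessions: 83,        // concurrent SSE sessions
  perSessionKb: 10,    // upper end of the per-session range
  concurrentCalls: 20, // tool calls in flight
  perCallMb: 2,        // buffered response per call
  poolConnections: 7,  // DB pool size
  poolMbPer10: 5,      // upper end of the pool range
  headroomFactor: 2,   // GC headroom over the working set
});
console.log(sizing); // working set about 154 MB, limit 309 MB
```

With these inputs the in-flight tool call buffers account for 40 MB while all 83 sessions together stay under 1 MB, which matches the table's note that large tool payloads, not session state, dominate.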
The database connection pool formula for MCP servers accounts for tool call concurrency rather than just session count. Not every open session is running a tool call simultaneously:
pool_size = ceil(concurrent_tool_calls × avg_query_duration_ms / 1000) × 1.25 + 2
# Example: 20 concurrent tool calls × 200ms queries / 1000 × 1.25 + 2 = 7 connections
# The +2 is headroom for the readiness probe's db check and housekeeping queries
# The 1.25× is a 25% burst buffer
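The pool formula translates directly into a function. In this sketch the burst buffer and headroom are parameters with the defaults used above; the final rounding up is an addition made here so that inputs which do not divide evenly still yield a whole number of connections.

```javascript
// pool_size = ceil(concurrent_tool_calls * avg_query_duration_ms / 1000) * 1.25 + 2
// The outer Math.ceil is an assumption added so the result is always an integer.
function poolSize(concurrentToolCalls, avgQueryDurationMs, burstFactor = 1.25, headroom = 2) {
  const base = Math.ceil((concurrentToolCalls * avgQueryDurationMs) / 1000);
  return Math.ceil(base * burstFactor + headroom);
}

console.log(poolSize(20, 200)); // 7, matching the worked example
console.log(poolSize(15, 200)); // 6, rounded up from 5.75
```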
HPA thresholds should be set so that the autoscaler adds a pod at 60–65% of the single-replica limit, giving a 35–40% buffer for the new pod's connection pool to warm up before the current pods are saturated. A pod added at 95% CPU utilization does nothing for the existing traffic while its pool warms.
The capacity planning blind spot is the most subtle in this list: capacity planning is a pre-launch calculation. It gives you the right starting point. It does not tell you when you are approaching the limit in production. The signal that capacity is degrading under real traffic — before tool calls start failing, before the event loop starts queuing — is rising P95 response latency. A tool that returns in 300ms at baseline and 800ms at 80% capacity is already signaling exhaustion in the latency trend before the failure rate increases. AliveMCP's response-time history shows exactly this signal: a gradual upward drift in P95 latency is the leading indicator that the current replica count is no longer sufficient and the HPA needs its threshold or minimum replica count adjusted. A step-change in latency immediately after a deploy indicates the new code path introduced a slow operation. A recurring 30–60 second window of high latency indicates a liveness restart loop — the pod restarts, traffic flushes, latency drops, the pod fills up again, the pattern repeats.
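The drift signal described above can be computed from any list of response times, whatever tool collected them. The sketch below uses the nearest-rank definition of P95 and flags degradation when the recent P95 exceeds the baseline by a tolerance; the 1.25 tolerance, the function names, and the sample data are assumptions for illustration and do not describe how AliveMCP computes its history.

```javascript
// Nearest-rank P95: the value at 1-based rank ceil(0.95 * n) in sorted order.
function p95(samples) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// Compares a recent window against a baseline window of latencies (ms).
function latencyDrift(baselineSamples, recentSamples, tolerance = 1.25) {
  const baseline = p95(baselineSamples);
  const recent = p95(recentSamples);
  return { baseline, recent, ratio: recent / baseline, degraded: recent > baseline * tolerance };
}

// Invented samples: 20 probes per window, shaped so the P95 values match the
// 300 ms baseline and 800 ms loaded figures used in the text.
const baselineWindow = Array.from({ length: 20 }, (_, i) => 120 + i * 10);
const recentWindow = Array.from({ length: 20 }, (_, i) => 620 + i * 10);

const drift = latencyDrift(baselineWindow, recentWindow);
console.log(drift); // baseline 300, recent 800, degraded: true
```

Comparing windows a week apart surfaces gradual drift, while comparing the windows immediately before and after a deploy surfaces a step change.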
The shared structural blind spot: inside-cluster ≠ outside-network-path
Stepping back across all five Kubernetes runtime tools, the architectural constraint they share is precise and worth naming explicitly: every tool operates from inside the cluster's own infrastructure. The kubelet fires liveness and readiness probes over the pod network, directly to the pod IP, bypassing the Ingress controller and the TLS certificate. KEDA's Prometheus trigger reads metrics exported by the MCP server process itself. k6 connects from a test runner that is co-located with or adjacent to the deployment infrastructure. Capacity planning uses load test numbers generated from inside the same network perimeter.
The failure classes that fall through this architecture:
| Failure class | Kubernetes internal view | External user experience |
|---|---|---|
| TLS certificate expiry (cert-manager renewal failure) | All probes pass — kubelet bypasses TLS | TLS handshake error; 100% of clients fail to connect |
| Ingress misconfiguration (wrong upstream or broken annotation) | Pod readiness green — Ingress health is not a pod-level signal | 502 Bad Gateway on all requests; clients see HTTP errors |
| DNS resolution failure on the public domain | Cluster-internal DNS unaffected; all internal connectivity works | NXDOMAIN or stale NS delegation; clients cannot reach the server |
| Wrong MCP protocol version in new pods | Readiness probe (HTTP 200 on /ready) passes; protocol content not checked | LLM clients reject the initialize response; tool calls never start |
| Capacity exhaustion showing in P95 latency before error rate rises | CPU and memory within HPA bounds; no autoscale triggered | Tool calls take 3× longer; LLM agent timeouts before capacity alarms fire |
AliveMCP's probe is architecturally positioned to catch all five. It connects through the public domain, through DNS resolution, through the TLS certificate, through the Ingress controller, and sends the full MCP JSON-RPC sequence — initialize, notifications/initialized, tools/list — validating not just HTTP reachability but protocol correctness. Response time is recorded with every check, so the P95 latency trend is available as a production capacity signal the moment the first probe returns.
Kubernetes runtime tooling and AliveMCP are complementary, not redundant. Liveness probes catch the failure classes that happen inside the pod — hung event loops, OOM-killed workers, deadlocked async queues. Readiness probes catch the conditions where a running pod should not receive new traffic — pool saturation, startup warmup, shared dependency failures. HPA and KEDA maintain the right pod count under varying load. k6 validates the protocol under synthetic load before deploys ship. Capacity planning gives the system the right initial sizing. AliveMCP monitors what all five cannot: the user-facing network path, continuously, every 60 seconds, from outside. The question Kubernetes answers is "is the cluster healthy?". The question AliveMCP answers is "is the MCP server reachable and protocol-correct from where my users are?". Both answers are necessary for a production MCP server — neither alone is sufficient.
FAQ
Should I use the same endpoint for liveness and readiness probes?
No. Use separate /live and /ready endpoints with different semantics. /live checks only that the process is not hung — it should be as fast and as minimal as possible (a setImmediate-based check resolves in under 1ms on an unloaded process). /ready checks actual dependencies (DB pool slots, tool registry, secrets). If you combine them, a transient DB connection spike causes a liveness failure and restarts the container, dropping all active SSE sessions — when all the pod needed was to shed new traffic for 30 seconds.
Which transport should I choose for a new MCP server that will need horizontal scaling?
Choose the Streamable HTTP transport. Run statelessly, it lets any pod handle any request without session affinity, sticky sessions, or KEDA custom metrics. SSE transport was the original remote MCP transport and has broader client support today, but it imposes significant operational complexity on scaling (Ingress affinity, SIGTERM drain handlers, per-connection memory, KEDA custom metrics). If you are starting from scratch and horizontal scaling is a requirement, Streamable HTTP eliminates the entire class of SSE scaling problems before they exist.
How do I know if k6 thresholds are calibrated correctly for my MCP server?
Run your k6 test at 50%, 75%, and 100% of your estimated peak concurrent session count. The P95 tool call duration at 100% of expected peak should be under your LLM agent's timeout value minus the LLM inference time — if your LLM timeout is 30 seconds and inference takes 15–20 seconds, your MCP tools should complete in under 10 seconds at peak load. Set the mcp_tool_duration P95 threshold at 70% of that budget to leave a margin for measurement variance. After launch, compare your k6 P95 to AliveMCP's observed P95 under real traffic — a significant gap (k6 shows 500ms, AliveMCP shows 2000ms) indicates network latency in the user-facing path that the load test network position did not simulate.
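The budget arithmetic in this answer can be turned into the threshold string itself. The following sketch assumes the worst-case inference time is used (20 seconds in the example) and that the margin is expressed as a whole percentage; the function name is hypothetical, while the p(95)<N expression is standard k6 threshold syntax.

```javascript
// Derives a k6 threshold for mcp_tool_duration from the agent's timeout budget.
// budget = agent timeout - LLM inference time; threshold = marginPercent of budget.
function toolDurationThreshold({ agentTimeoutMs, inferenceMs, marginPercent = 70 }) {
  const budgetMs = agentTimeoutMs - inferenceMs;
  if (budgetMs <= 0) throw new Error('inference time leaves no budget for tool calls');
  const thresholdMs = Math.floor((budgetMs * marginPercent) / 100);
  return { budgetMs, thresholdMs, expression: `p(95)<${thresholdMs}` };
}

// Example from the answer above: 30 s timeout, worst-case 20 s inference.
const threshold = toolDurationThreshold({ agentTimeoutMs: 30000, inferenceMs: 20000 });
console.log(threshold.expression); // p(95)<7000
```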
My HPA is scaling up too aggressively and scaling down too slowly. What should I tune?
For scale-up aggression: increase the CPU target threshold (60% → 70%) or add a scaleUp.stabilizationWindowSeconds (0 → 60) to require sustained high utilization before scaling. For scale-down slowness: decrease scaleDown.stabilizationWindowSeconds (300 → 120) if the load pattern is genuinely bursty and drops fast. For SSE servers, slow scale-down is usually intentional — you want active SSE connections to drain before removing pods. Watch AliveMCP's response-time graph during scale events: a latency spike during scale-up means your ramp speed is too aggressive relative to connection pool warmup; a latency spike during scale-down means the termination grace period is too short and connections are being dropped.
How do I use AliveMCP's latency data for capacity planning beyond launch?
Track the P95 response time baseline over the first two weeks after launch, at your current replica count and traffic level. When the P95 starts drifting upward (not a step change from a deploy, but a gradual week-over-week increase), you are approaching the capacity ceiling of your current configuration. The correct response is to lower the HPA scale-up threshold (60% → 50%) so the autoscaler adds pods earlier, or to increase the minimum replica count if the drift is consistent. A step change in P95 immediately after a deploy usually points to a new slow code path: check the tool that changed and add timing instrumentation. AliveMCP's 90-day response-time history in the Team and Enterprise plans makes this trend visible without requiring you to instrument and maintain a separate observability stack.
Monitor your MCP server from outside the cluster
AliveMCP sends the full MCP initialize → tools/list probe sequence to your endpoint every 60 seconds from outside the cluster — through DNS, through TLS, through the Ingress — catching the failure classes that Kubernetes runtime tooling cannot see. Start free for public endpoints, or claim your listing for $9/month to add custom alert webhooks and 90-day response-time history.