Horizontal Scaling for MCP Servers — HPA, KEDA, and SSE session state

Horizontally scaling an MCP server is straightforward for the Streamable HTTP transport — each request is stateless, so adding pods immediately distributes load. Scaling an SSE-transport server is harder because SSE connections are long-lived and tied to a specific pod. Understanding this distinction shapes every decision from HPA metric choice to scale-in stabilization windows and PodDisruptionBudget configuration.

TL;DR

For Streamable HTTP MCP servers: use CPU/memory HPA, set scaleDown.stabilizationWindowSeconds to 300, add a PodDisruptionBudget with minAvailable: 1, and monitor externally with AliveMCP. For SSE-transport servers: add cookie-based session affinity (sticky sessions) at the Ingress, use KEDA with a custom connection-count metric instead of CPU, set scale-down stabilization to 600 seconds, and set the termination grace period to at least 60 seconds so SSE connections can drain before the pod terminates. AliveMCP's external protocol probe covers a monitoring gap that exists during every scale event: a new pod that passes readiness but serves an incorrect MCP protocol version is invisible to Kubernetes until a real client hits it.

Transport choice determines your scaling architecture

The fundamental choice between SSE transport and Streamable HTTP transport dominates the horizontal scaling story. This is not a performance decision — both transports can handle high throughput. It is an architectural decision that determines whether your scaling problem is simple or complex.

| Dimension | Streamable HTTP transport | SSE transport |
| --- | --- | --- |
| Connection model | Stateless request-response; each HTTP request is independent | Stateful long-lived SSE connection tied to a specific pod |
| Session state | None in the transport layer; any state stored externally (Redis) | In-process session state; client must reconnect if pod terminates |
| HPA metric | CPU utilization or request rate; both are reliable signals | Active SSE connection count; CPU underestimates load for idle connections |
| Scale-in risk | Low: in-flight requests complete before pod terminates | High: pod termination disconnects all open SSE connections |
| Sticky sessions required? | No: any pod handles any request | Yes: follow-up POST requests must reach the same pod as the SSE stream |

If you are designing a new MCP server and horizontal scaling is a requirement, choose the Streamable HTTP transport. If you are scaling an existing SSE-transport server, the patterns below will work — but add complexity that Streamable HTTP avoids entirely.

Horizontal Pod Autoscaler for Streamable HTTP MCP servers

Streamable HTTP MCP servers are straightforward to autoscale because they are stateless. The standard Kubernetes HPA using CPU utilization as the scaling metric works correctly and requires minimal tuning.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2     # Never scale below 2 to maintain HA during single-pod failures
  maxReplicas: 20    # Cap prevents runaway scaling from a traffic spike or metric anomaly

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Scale when average CPU across pods exceeds 60%
                                  # Lower than the typical 80% because MCP tool calls
                                  # can be CPU-intensive (JSON parsing, crypto, serialization)
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70  # Also scale on memory — Node.js heap can grow during bursts

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # Scale up immediately — don't wait during traffic spikes
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60           # Add at most 4 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes of sustained low load before scaling in
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60           # Remove at most 1 pod per minute
                                      # Slow scale-in prevents oscillation and gives connection
                                      # pool warmup on remaining pods time to complete
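
To make the 60% CPU target above concrete, here is a minimal sketch of the replica calculation as the Kubernetes documentation describes it: desired replicas are the current count scaled by the ratio of observed to target utilization, rounded up, and left unchanged when the ratio is within a tolerance (10% by default). The function name and its parameter object are illustrative, not part of any Kubernetes API, and the sketch ignores the scaleUp/scaleDown policies and stabilization windows that further limit how fast the result is applied.

```javascript
// Sketch of the documented HPA formula:
//   desired = ceil(currentReplicas * currentMetric / targetMetric)
// The result is clamped to [minReplicas, maxReplicas].
function desiredReplicas({ currentReplicas, currentUtilization, targetUtilization,
                           minReplicas, maxReplicas, tolerance = 0.1 }) {
  const ratio = currentUtilization / targetUtilization;
  // Within tolerance: the HPA leaves the replica count unchanged.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const raw = Math.ceil(currentReplicas * ratio);
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Values taken from the manifest above: 60% target, 2..20 replicas.
const hpa = { targetUtilization: 60, minReplicas: 2, maxReplicas: 20 };
console.log(desiredReplicas({ ...hpa, currentReplicas: 4, currentUtilization: 90 })); // 6
console.log(desiredReplicas({ ...hpa, currentReplicas: 4, currentUtilization: 63 })); // 4 (within tolerance)
console.log(desiredReplicas({ ...hpa, currentReplicas: 4, currentUtilization: 15 })); // 2 (clamped to minReplicas)
```

When several metrics are configured, as with CPU and memory above, the HPA evaluates each one and uses the largest resulting replica count.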

PodDisruptionBudget for HA during node maintenance

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-pdb
spec:
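  minAvailable: 1       # At least 1 pod must stay available during voluntary disruptions
                        # that use the eviction API (node drains, cluster upgrades).
                        # Rolling updates are governed by the Deployment's
                        # maxUnavailable setting, not by the PDB.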
  selector:
    matchLabels:
      app: mcp-server

Horizontal Pod Autoscaler for SSE-transport MCP servers

SSE-transport servers need session affinity configured at the Ingress layer so that follow-up HTTP POST requests from MCP clients land on the same pod as their open SSE stream. Without it, the POST can arrive at a different pod that has no knowledge of the SSE session, and the client receives a 404 or session-not-found error. Cookie-based affinity only helps if the MCP client stores the cookie and sends it back on later requests, so confirm that your clients do.
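
The failure mode is easy to reproduce in a toy model. The sketch below is a simulation only: the Pod class, the two routing functions, and the string hash are hypothetical stand-ins, and a real nginx Ingress pins clients with a cookie rather than with this hash. It shows why a POST routed independently of the SSE stream gets a 404, and why any stable mapping from client to pod avoids it.

```javascript
// Hypothetical model: each pod only knows the SSE sessions it opened itself.
class Pod {
  constructor(name) { this.name = name; this.sessions = new Set(); }
  openStream(sessionId) { this.sessions.add(sessionId); }
  handlePost(sessionId) { return this.sessions.has(sessionId) ? 200 : 404; }
}

const pods = [new Pod('pod-a'), new Pod('pod-b')];

// Round-robin: consecutive requests alternate between pods.
let next = 0;
const roundRobin = () => pods[next++ % pods.length];

// Sticky: the same affinity key always maps to the same pod.
const sticky = key => {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return pods[hash % pods.length];
};

function connectAndPost(route, sessionId) {
  route(sessionId).openStream(sessionId);        // GET opens the SSE stream
  return route(sessionId).handlePost(sessionId); // follow-up POST
}

console.log(connectAndPost(roundRobin, 'session-1')); // 404: POST hit the other pod
console.log(connectAndPost(sticky, 'session-2'));     // 200: POST hit the pod holding the stream
```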

nginx Ingress with session affinity

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-server-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"       # Required for SSE — nginx must not buffer events
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"   # 1-hour SSE connection timeout
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "mcp-session"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
    nginx.ingress.kubernetes.io/session-cookie-samesite: "None"
    nginx.ingress.kubernetes.io/session-cookie-secure: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: mcp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mcp-server
                port:
                  number: 3000

HPA metric for SSE-transport servers

CPU utilization is a poor metric for SSE-transport servers because idle SSE connections consume very little CPU despite holding a significant amount of the server's capacity (each connection requires memory for the event queue, a file descriptor, and a Node.js socket object). Scale on a custom metric that reflects actual connection count instead.

// Expose active SSE connection count as a Prometheus metric from your MCP server
// In your Node.js server:
import { register, Gauge } from 'prom-client';

const activeSseConnections = new Gauge({
  name: 'mcp_active_sse_connections',
  help: 'Number of currently open SSE connections',
});

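// Increment when a client connects, decrement on disconnect.
// 'sse_connected' and 'sse_disconnected' are application-defined events:
// your SSE handler must emit them, Express does not emit them itself.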
app.on('sse_connected', () => activeSseConnections.inc());
app.on('sse_disconnected', () => activeSseConnections.dec());

// Expose /metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

With Prometheus metrics available, use KEDA to scale on the custom metric rather than the built-in HPA CPU metric:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-server-scaledobject
spec:
  scaleTargetRef:
    name: mcp-server
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300    # Wait after the last active trigger before scaling to zero;
                         # only relevant for scale-to-zero, so it is inert with minReplicaCount: 2
  pollingInterval: 30    # Check metrics every 30 seconds

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: mcp_active_sse_connections
        threshold: "50"    # Target of 50 active SSE connections per pod
        # Return the cluster-wide total: with KEDA's default AverageValue metric
        # type, the HPA divides this value by the current replica count itself.
        query: sum(mcp_active_sse_connections)

  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600   # 10 minutes — SSE connections are long-lived;
                                            # do not scale in during short quiet periods
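
As a rough illustration of how a per-pod threshold turns into a replica count, the sketch below assumes the trigger uses an AverageValue target (KEDA's default metric type), in which case the desired replica count is the cluster-wide metric total divided by the threshold, rounded up. The function name is made up for this example, and the arithmetic omits tolerance and stabilization.

```javascript
// Sketch: with an AverageValue target, desired replicas = ceil(total / perPodTarget),
// clamped to the ScaledObject's min and max replica counts.
function replicasForConnections(totalConnections, perPodTarget, min, max) {
  const raw = Math.ceil(totalConnections / perPodTarget);
  return Math.min(max, Math.max(min, raw));
}

// threshold: "50", minReplicaCount: 2, maxReplicaCount: 20 (from the manifest above)
console.log(replicasForConnections(420, 50, 2, 20));  // 9
console.log(replicasForConnections(60, 50, 2, 20));   // 2
console.log(replicasForConnections(5000, 50, 2, 20)); // 20 (capped)
```

Under that assumption the query has to report a total. A query that already returns a per-pod average of, say, 60 would be read as 60 connections cluster-wide and would settle at 2 replicas regardless of how many pods are running.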

Graceful scale-in for SSE connections

When Kubernetes decides to scale in (reduce replica count), it sends SIGTERM to the selected pod. Without graceful shutdown handling, this immediately closes all open SSE connections on that pod — every connected MCP client sees an abrupt disconnection. Configure both the pod's termination grace period and an application-level graceful shutdown handler to drain connections over a defined window.

# Pod template spec (Deployment .spec.template.spec): termination grace period for SSE drain
spec:
  terminationGracePeriodSeconds: 90   # SIGKILL deadline; the countdown includes the preStop hook

  containers:
    - name: mcp-server
      lifecycle:
        preStop:
          exec:
            # Signal the app to stop accepting new SSE connections immediately,
            # then wait for existing connections to close or time out.
            # Assumes the Node.js process runs as PID 1 in the container.
            # SIGUSR2 is used because Node.js reserves SIGUSR1 for its inspector.
            command: ["/bin/sh", "-c", "kill -USR2 1 && sleep 85"]

// Node.js graceful shutdown handler
// SIGUSR2: stop accepting new SSE connections (sent by the preStop hook)
// SIGTERM: full shutdown (sent by the kubelet once the preStop hook finishes;
//          SIGKILL follows when terminationGracePeriodSeconds expires)

let acceptingNewConnections = true;
const openSseResponses = new Set();   // open SSE response objects on this pod

process.on('SIGUSR2', () => {
  acceptingNewConnections = false;
  console.log('Draining: no new SSE connections accepted');
  // Existing connections close on their own as clients disconnect or reconnect
});

process.on('SIGTERM', () => {
  console.log('SIGTERM received, closing server');
  // Force-close any remaining SSE connections
  openSseResponses.forEach(res => res.end());
  server.close(() => {
    process.exit(0);
  });
});

// Block new SSE connections during the drain window and track the accepted ones
app.use('/sse', (req, res, next) => {
  if (!acceptingNewConnections) {
    res.set('Retry-After', '5');
    return res.status(503).json({
      error: 'Server draining, reconnect to another instance',
      retryAfter: 5,
    });
  }
  openSseResponses.add(res);
  res.on('close', () => openSseResponses.delete(res));
  next();
});
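
The two numbers in the pod spec interact, so it helps to work out the timeline. The sketch below assumes the documented kubelet ordering: the grace-period countdown starts when termination begins, the preStop hook runs first, SIGTERM is sent when the hook returns, and SIGKILL is sent when the grace period expires. The function and field names are illustrative only, and the sketch ignores the short extension the kubelet grants when a hook overruns.

```javascript
// Sketch of the termination timeline for a pod with a sleeping preStop hook.
function drainTimeline({ gracePeriodSeconds, preStopSeconds }) {
  // SIGTERM cannot arrive later than the grace deadline itself.
  const sigtermAt = Math.min(preStopSeconds, gracePeriodSeconds);
  return {
    drainWindowSeconds: sigtermAt,                        // time clients have to leave on their own
    sigtermAt,                                            // when the app's SIGTERM handler fires
    sigkillAt: gracePeriodSeconds,                        // hard stop
    shutdownBudgetSeconds: gracePeriodSeconds - sigtermAt // time left for the SIGTERM handler
  };
}

// Values from the pod spec above: 90 s grace period, 85 s preStop sleep.
console.log(drainTimeline({ gracePeriodSeconds: 90, preStopSeconds: 85 }));
```

With the values above, clients get 85 seconds to disconnect voluntarily and the SIGTERM handler has 5 seconds to force-close whatever remains. If the sleep were as long as the grace period, that budget would drop to zero and the process would be killed without running its cleanup.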

The monitoring gap during scale events

Kubernetes HPA and readiness probes together prevent scaling events from causing downtime — but they do not verify MCP protocol correctness. A new pod that passes its readiness probe (HTTP /ready returns 200) but has been deployed with an incompatible MCP SDK version will accept connections, complete the TCP and HTTP handshake, and then fail the MCP initialize step with a protocol version mismatch. From Kubernetes' perspective, the pod is healthy. From the MCP client's perspective, the connection is broken.

AliveMCP catches this class of failure. Its external probe runs a real MCP initialize handshake against your public endpoint every minute. During a scale-out event where new pods with a bad SDK version are being added, AliveMCP will detect the protocol mismatch within one minute and alert your team — before a majority of client requests start hitting the bad pods.

| Scale event | HPA + readiness probe detection | AliveMCP detection |
| --- | --- | --- |
| New pod fails readiness (crash on startup) | Detected: pod not added to LB | Not needed: the readiness probe handles this |
| New pod passes readiness but wrong MCP protocol version | Not detected: /ready returns 200 | Detected within 60 s: initialize fails |
| Scale-in disconnects SSE clients (drain failed) | Not detected after pod terminates | Detected if endpoint becomes unreachable |
| All pods scale to zero (misconfigured minReplicas) | HPA sees 0 pods as desired, so no alert | Detected within 60 s: connection refused |
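
For readers who want to see what a protocol-level check involves, here is a minimal sketch of the initialize exchange as defined by the MCP specification: a JSON-RPC request with method "initialize" and a response whose result carries the negotiated protocolVersion. This is not AliveMCP's implementation. The helper names, the client identity, and the version string used in the example are assumptions for illustration, and the HTTP transport that would carry the message is left out.

```javascript
// Build the JSON-RPC initialize request an MCP client sends first.
function buildInitializeRequest(protocolVersion, id = 1) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'initialize',
    params: {
      protocolVersion,
      capabilities: {},
      clientInfo: { name: 'example-probe', version: '0.0.1' }, // hypothetical identity
    },
  };
}

// Decide whether a parsed JSON-RPC response counts as a healthy handshake.
function checkInitializeResponse(response, supportedVersions) {
  if (response.error) {
    return { ok: false, reason: `JSON-RPC error: ${response.error.message}` };
  }
  const version = response.result && response.result.protocolVersion;
  if (!version) return { ok: false, reason: 'missing result.protocolVersion' };
  if (!supportedVersions.includes(version)) {
    return { ok: false, reason: `unsupported protocol version ${version}` };
  }
  return { ok: true, reason: `negotiated ${version}` };
}

// Example: the probe supports one version, the pod answers with another.
const supported = ['2025-03-26']; // example value, use the versions your clients speak
console.log(buildInitializeRequest(supported[0]).method);
console.log(checkInitializeResponse({ result: { protocolVersion: '2024-11-05' } }, supported));
```

A readiness endpoint that only returns 200 never exercises this path, which is why a version mismatch passes unnoticed inside the cluster.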

Multi-region horizontal scaling

For MCP servers that need to serve users in multiple geographic regions, single-cluster HPA reaches its architectural limit. Multi-region deployment requires running separate Kubernetes clusters (or separate node pools in different regions) with a global load balancer routing traffic to the nearest healthy region.

# Namespace per region, separate HPA configs optimized per region
# (metrics and behavior sections omitted for brevity)
---
# us-east-1 cluster
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
  namespace: production-us-east
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 3   # Higher minimum in primary region
  maxReplicas: 30

---
# eu-west-1 cluster
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
  namespace: production-eu-west
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2   # Lower minimum in secondary region
  maxReplicas: 15

AliveMCP probes each regional endpoint independently and provides per-region uptime history. This lets you distinguish a global outage (all regions down — infrastructure or DNS problem) from a regional outage (one region's pods all failing readiness — cluster-level issue). Navigate to alivemcp.com to add separate monitors for each regional endpoint URL.
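
The global-versus-regional distinction can be expressed as a small classification over per-region probe results. The sketch below is illustrative only: the function name and the shape of the input (a map from region name to a boolean "probe succeeded") are assumptions, not an AliveMCP API.

```javascript
// Classify an outage from per-region probe results.
// regionStatus: { [regionName]: true if the last probe succeeded }
function classifyOutage(regionStatus) {
  const regions = Object.keys(regionStatus);
  const down = regions.filter(r => !regionStatus[r]);
  if (down.length === 0) return { kind: 'healthy', down };
  if (down.length === regions.length) return { kind: 'global', down };
  return { kind: 'regional', down };
}

console.log(classifyOutage({ 'us-east-1': true, 'eu-west-1': true }));   // healthy
console.log(classifyOutage({ 'us-east-1': true, 'eu-west-1': false }));  // regional: eu-west-1
console.log(classifyOutage({ 'us-east-1': false, 'eu-west-1': false })); // global
```

A regional result points at one cluster's pods or Ingress, while a global result suggests looking at shared infrastructure such as DNS or the global load balancer first.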

Frequently asked questions

Should I prefer Streamable HTTP or SSE transport for a horizontally scaled MCP server?

Streamable HTTP strongly if you have a choice. Stateless transports are the standard pattern for horizontally scaled services for good reason: no sticky sessions required, no drain complexity, no SSE connection count metric to export, and simpler Ingress configuration. The SSE transport was the original MCP transport mechanism, and it works well for small deployments or development environments, but its statefulness imposes real operational complexity at scale — session affinity introduces a single point of routing failure (if the affinity cookie is lost, the client must reconnect), the KEDA custom metric setup adds Prometheus and a KEDA installation as cluster dependencies, and the graceful drain window extends every scale-in event by 60–90 seconds. If you are designing a new MCP server, the Streamable HTTP transport removes all of this complexity with no loss of protocol capability.

What is a safe minimum replica count for a production MCP server?

Two replicas minimum for any production service. This is the minimum required to survive a single pod failure (due to a node going down, a liveness probe restart, or a voluntary disruption during node maintenance) without a complete outage. With one replica, any pod restart causes a brief complete outage. With two replicas, a pod restart drops capacity to 50%: degraded but not down. For MCP servers with predictable traffic patterns, three replicas is a better baseline: one pod can restart while another is evicted during node maintenance, and a third pod is still serving. Size the pods so that the replicas left running can carry the expected load. Do not set minReplicas: 0 (scale to zero) for any production MCP server endpoint: the 30–60 second cold start when traffic arrives creates a poor user experience and will be detected by AliveMCP as a brief downtime event.
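
The capacity arithmetic behind these replica counts is simple enough to check directly. The sketch below assumes identical, evenly loaded pods, and the function name is made up for this example.

```javascript
// Percentage of total capacity still serving when some replicas are unavailable.
// Assumes identical pods with traffic spread evenly across them.
function remainingCapacityPercent(replicas, unavailable) {
  const serving = Math.max(0, replicas - unavailable);
  return Math.round((serving / replicas) * 100);
}

console.log(remainingCapacityPercent(1, 1)); // 0: complete outage
console.log(remainingCapacityPercent(2, 1)); // 50
console.log(remainingCapacityPercent(3, 1)); // 67
console.log(remainingCapacityPercent(3, 2)); // 33: restart plus eviction at the same time
```

Read together with the HPA target, these numbers indicate how much headroom each pod needs: at a 60% CPU target, losing one of three pods pushes the remaining two to roughly 90% until a replacement is ready.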

How do I prevent the HPA from scaling down too aggressively during quiet periods?

Set scaleDown.stabilizationWindowSeconds to at least 300 seconds (5 minutes) for Streamable HTTP servers and 600 seconds (10 minutes) for SSE-transport servers. The stabilization window means the HPA must observe low resource utilization continuously for the entire window before executing a scale-in. This prevents the saw-tooth pattern where traffic spikes every 5–10 minutes and the HPA oscillates between scale-out and scale-in. Also set policies to remove at most 1 pod per minute during scale-in — this gives each remaining pod's connection pool time to warm up and receive the transferred load before the next pod is removed. If you see the HPA still oscillating after setting these values, check whether your metric source (CPU or Prometheus) has a scrape interval that is too long — a 60-second Prometheus scrape interval means the HPA is acting on data that may be a minute old.
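
The effect of the stabilization window can be sketched as "use the highest recommendation seen within the window", which is how the Kubernetes documentation describes scale-down stabilization. The function name and the shape of the history array are assumptions for illustration, and timestamps are plain seconds.

```javascript
// Sketch: scale-down stabilization keeps the highest replica recommendation
// observed during the window, so one quiet sample cannot trigger a scale-in.
// history: [{ at: secondsTimestamp, replicas: recommendedCount }, ...]
function stabilizedReplicas(history, nowSeconds, windowSeconds) {
  const recent = history.filter(
    h => h.at <= nowSeconds && nowSeconds - h.at <= windowSeconds
  );
  return Math.max(...recent.map(h => h.replicas));
}

const history = [
  { at: 0, replicas: 8 },   // traffic spike
  { at: 120, replicas: 3 }, // load drops
  { at: 240, replicas: 3 },
  { at: 360, replicas: 3 },
];
console.log(stabilizedReplicas(history, 240, 300)); // 8: the spike is still inside the window
console.log(stabilizedReplicas(history, 360, 300)); // 3: the spike has aged out
```

With a 300-second window the deployment stays at 8 replicas until the spike is more than five minutes old. A spike that recurs more often than the window length therefore never produces a scale-in.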

Can I use KEDA with the AliveMCP API as a scale trigger?

Not directly — KEDA's external trigger types (Prometheus, HTTP, AWS SQS, etc.) are designed for queue depth and request rate metrics, not availability monitoring. AliveMCP's value for scaling decisions is indirect: its 90-day uptime graph shows you whether your current replica count is sufficient to absorb traffic without causing protocol errors (which AliveMCP would detect as probe failures). If AliveMCP's response time graph shows latency increasing during peak hours, that is a signal to lower your HPA target utilization percentage (e.g., from 60% to 50% CPU target) so pods scale out earlier. Use AliveMCP as an after-the-fact tuning signal — if you see probe latency increasing or intermittent failures, adjust HPA parameters to maintain more headroom. The KEDA Prometheus trigger with your MCP server's own metrics (connection count, request queue depth) is the right real-time scaling mechanism.

How do I test that my horizontal scaling configuration works correctly before production?

Use a load testing tool (see the k6 guide for MCP-specific load test scripts) to ramp traffic against a staging environment with the same HPA configuration as production. Verify that: (1) the HPA scales out as expected when CPU or connection count crosses the target threshold — watch kubectl get hpa -w; (2) new pods pass readiness before receiving traffic — watch kubectl get endpoints mcp-server -w; (3) scale-in after the load test ends does not cause dropped connections — watch the k6 error rate during the scale-in window; (4) the PodDisruptionBudget prevents simultaneous pod termination — try kubectl drain <node> while the load test is running and confirm the drain waits for the PDB minAvailable constraint. AliveMCP monitoring during the load test gives you the external view — whether the endpoint appeared up or briefly unreachable from outside the cluster during any scale event.
