Guide · Kubernetes Runtime Patterns

MCP Server Capacity Planning — sizing replicas, memory, and connection pools

Capacity planning for an MCP server is different from planning a REST API. MCP servers hold more per-connection state (SSE streams, tool call context), execute longer-running tool handlers (external API calls, database queries, ML inference), and serve clients that establish sessions and maintain them for minutes or hours rather than milliseconds. Undersizing causes tool call timeouts; oversizing wastes money and inflates operational complexity.

TL;DR

Estimate peak concurrent sessions from user count and session length. Size each pod at 256–512 MB memory and 0.25–0.5 CPU cores per 20–50 concurrent Streamable HTTP sessions (or per 30–80 SSE connections for idle-heavy workloads). Size the database connection pool per pod from peak concurrent tool calls and average query duration (Step 3), then check the cluster-wide total against the database's max_connections. Configure HPA to target 60% CPU utilization (measured against the pod's CPU request) with 2× burst headroom above your expected peak. Validate with a k6 load test before every deploy. Use AliveMCP's response-time trend to detect capacity drift in production: a steadily rising P95 latency curve is a leading indicator of a capacity problem, often visible weeks before users notice.

Step 1 — estimate peak concurrent sessions

The fundamental unit of MCP server capacity is the concurrent session: one MCP client connected and actively making tool calls. Estimate this number before sizing anything else.

# Concurrent session estimation formula
# Variables:
#   DAU = daily active users (MCP clients)
#   sessions_per_user_per_day = average number of MCP sessions a user initiates daily
#   session_duration_minutes = average length of an active session
#
# Formula:
#   peak_concurrent_sessions ≈ (DAU × sessions_per_user_per_day × session_duration_minutes) / (24 × 60)
#   ... adjusted for peak-hour factor (peak traffic / average traffic)
#
# Example:
#   DAU = 500 users
#   sessions_per_user_per_day = 3 sessions
#   session_duration_minutes = 20 minutes
#   peak_hour_factor = 3× (peak hour has 3× the daily average traffic)
#
#   average_concurrent = (500 × 3 × 20) / 1440 = 20.8 concurrent sessions
#   peak_concurrent = 20.8 × 3 = ~63 concurrent sessions at peak
#
# Sizing target: support 63 concurrent sessions at peak with 30% headroom
#   → design for 82 concurrent sessions

For a new product with no historical data, use conservative estimates and plan to re-evaluate after the first month of production data. AliveMCP's response-time graph will show you when actual load is approaching your capacity limits — a rising P95 trend is the signal to revisit your estimates.
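The estimation formula in this step can be sketched as a small function. This is a minimal illustration rather than part of any SDK: the function name `estimatePeakSessions` and its parameter names are assumptions chosen here, and the 30% headroom default mirrors the sizing target in the example above.

```javascript
// Hypothetical helper: estimates concurrent MCP sessions from usage figures.
function estimatePeakSessions({ dau, sessionsPerUserPerDay, sessionMinutes, peakHourFactor, headroom = 0.3 }) {
  const minutesPerDay = 24 * 60;
  // Average number of sessions open at any moment across the whole day.
  const averageConcurrent = (dau * sessionsPerUserPerDay * sessionMinutes) / minutesPerDay;
  // Scale by how much busier the peak hour is than the daily average.
  const peakConcurrent = averageConcurrent * peakHourFactor;
  // Design target: peak plus headroom, rounded up to whole sessions.
  const designTarget = Math.ceil(peakConcurrent * (1 + headroom));
  return { averageConcurrent, peakConcurrent, designTarget };
}

const estimate = estimatePeakSessions({
  dau: 500,
  sessionsPerUserPerDay: 3,
  sessionMinutes: 20,
  peakHourFactor: 3,
});
console.log(estimate.averageConcurrent.toFixed(1)); // 20.8
console.log(estimate.peakConcurrent.toFixed(1));    // 62.5
console.log(estimate.designTarget);                 // 82
```

Keeping the inputs as named fields makes it easy to re-run the estimate once real production numbers replace the initial guesses.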

Step 2 — size memory per pod

Node.js MCP servers spend memory in several distinct buckets. Understanding each bucket lets you set accurate pod memory limits rather than guessing.

Memory bucket	Typical size	Scales with
Node.js baseline (V8, libuv, built-ins)	60–80 MB	Fixed — same regardless of load
MCP SDK + tool registry	20–50 MB	Number of registered tools and their schemas
Per-session state (Streamable HTTP)	0.5–2 MB per session	Concurrent active sessions
Per-connection state (SSE transport)	1–5 MB per connection	Open SSE connections + tool call context
In-flight tool call responses	10–100 MB depending on response size	Concurrent tool calls × response payload size
Database connection pool overhead	5–20 MB	Connection pool size
V8 GC headroom (prevents OOM thrash)	20–30% of working set	Fixed ratio of other buckets

Worked memory sizing example

# Scenario: 50 concurrent Streamable HTTP sessions per pod, typical tool response size 50 KB
#
# Fixed baseline:      80 MB
# Tool registry:       30 MB   (20 tools with JSON schemas)
# Session state:       50 sessions × 1 MB = 50 MB
# In-flight responses: assume 10% of sessions have an active tool call at any moment
#                      5 concurrent tool calls × 50 KB payload = ~0.25 MB (negligible)
# Connection pool:     15 connections × 1 MB = 15 MB
# ─────────────────────────────────────────────
# Working set:         ~175 MB
# GC headroom (25%):   ~44 MB
# Total:               ~220 MB
#
# Kubernetes pod memory limit: 256 MB (next power of 2 above 220 MB for clean resource allocation)
# Kubernetes pod memory request: 192 MB (75% of limit)
#
# Node.js heap size flag (prevent OOM before k8s limit):
#   --max-old-space-size=200  (leave 56 MB for OS, libuv, and Buffer allocations)

# Dockerfile: set heap limit explicitly
ENV NODE_OPTIONS="--max-old-space-size=200"

# Kubernetes deployment spec
resources:
  requests:
    memory: "192Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"
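The worked memory example can be reproduced with a short calculation. The sketch below is illustrative only: `sizePodMemory` and its field names are hypothetical, the 75% request ratio and power-of-two rounding follow the worked example, and the 80% heap ratio follows the checklist rule for --max-old-space-size.

```javascript
// Hypothetical helper: sums the memory buckets from the table (all inputs are estimates).
function sizePodMemory({
  baselineMb, registryMb, sessions, perSessionMb,
  activeCallFraction, responseKb, poolSize, perConnectionMb, gcHeadroom = 0.25,
}) {
  // In-flight responses: active tool calls × payload size, converted from KB to MB.
  const inFlightMb = (sessions * activeCallFraction * responseKb) / 1000;
  const workingSetMb =
    baselineMb + registryMb + sessions * perSessionMb + inFlightMb + poolSize * perConnectionMb;
  const totalMb = workingSetMb * (1 + gcHeadroom);
  // Round the limit up to the next power of two, as the worked example does.
  let limitMb = 1;
  while (limitMb < totalMb) limitMb *= 2;
  return {
    workingSetMb,
    totalMb,
    limitMb,
    requestMb: limitMb * 0.75,                // request at 75% of the limit
    maxOldSpaceMb: Math.floor(limitMb * 0.8), // V8 heap cap at 80% of the limit
  };
}

const sizing = sizePodMemory({
  baselineMb: 80, registryMb: 30, sessions: 50, perSessionMb: 1,
  activeCallFraction: 0.1, responseKb: 50, poolSize: 15, perConnectionMb: 1,
});
console.log(sizing.workingSetMb);  // 175.25
console.log(sizing.limitMb);       // 256
console.log(sizing.requestMb);     // 192
console.log(sizing.maxOldSpaceMb); // 204
```

The 80% rule gives 204 MB for the heap cap; the worked example rounds this down to 200 MB, which leaves slightly more room for Buffers and native allocations.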

Step 3 — size the database connection pool

Connection pool sizing is one of the most impactful capacity decisions for MCP servers. Too few connections and tool calls queue; too many and you exhaust the database's connection limit, consume excess per-connection memory on the DB server, or spend more time on connection overhead than on the queries themselves.

The formula that works for most MCP servers:

# Connection pool size formula (per pod)
#
# pool_size = ceil(peak_concurrent_tool_calls_per_pod × avg_query_duration_ms / 1000)
#             × 1.25   (25% headroom for bursts)
#             + 2       (minimum idle connections for warmth)
#
# Example:
#   peak_concurrent_tool_calls_per_pod = 20  (50 sessions × 40% actively calling a tool)
#   avg_query_duration_ms = 50 ms
#
#   pool_size = ceil(20 × 50 / 1000) × 1.25 + 2
#             = ceil(1.0) × 1.25 + 2
#             = 1.25 + 2 = ~3.25
#             → round up to 5 connections per pod (generous for this workload)
#
# For slower queries (e.g., avg 500 ms search operations):
#   pool_size = ceil(20 × 500 / 1000) × 1.25 + 2
#             = ceil(10) × 1.25 + 2
#             = 10 × 1.25 + 2 = 14.5
#             → 15 connections per pod
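The pool formula above translates directly into code. This is a minimal sketch under the same assumptions as the formula; the function name `poolSizePerPod` and the option names are hypothetical, and the result is a starting point to validate under load rather than a guaranteed-correct size.

```javascript
// Hypothetical helper implementing: ceil(calls × avg_query_ms / 1000) × 1.25 + 2
function poolSizePerPod(peakConcurrentToolCalls, avgQueryMs, { headroom = 1.25, minIdle = 2 } = {}) {
  // Connections kept busy by queries, per the heuristic in the formula.
  const busyConnections = Math.ceil((peakConcurrentToolCalls * avgQueryMs) / 1000);
  // Add burst headroom and warm idle connections, then round up to a whole connection.
  return Math.ceil(busyConnections * headroom + minIdle);
}

console.log(poolSizePerPod(20, 50));  // 4
console.log(poolSizePerPod(20, 500)); // 15
```

For the 50 ms case the formula yields 4 connections; the example above deliberately rounds that up to 5 as a generous floor.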

// Node.js / better-sqlite3: one connection per process (calls are synchronous)
// SQLite in WAL mode allows concurrent readers but only one writer at a time,
// so handle write concurrency in the application layer
import Database from 'better-sqlite3';
const db = new Database('./data.db');
db.pragma('journal_mode = WAL');
db.pragma('busy_timeout = 3000');  // Wait up to 3 s when another connection holds the write lock

// Node.js / PostgreSQL with pg-pool
import { Pool } from 'pg';
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 15,                // Maximum connections (from formula above)
  min: 2,                 // Keep up to 2 idle connections past idleTimeoutMillis (recent pg-pool releases; older ones ignore min)
  idleTimeoutMillis: 30000,        // Release idle connections after 30 s
  connectionTimeoutMillis: 2000,   // Throw if pool exhausted for more than 2 s
  // statement_timeout per connection (prevents runaway queries holding connections)
  options: '--statement_timeout=10000',
});

Database connection budget across the cluster

# Total database connections consumed by the MCP server cluster:
#   connections = pool_size_per_pod × max_replicas
#
# Example: pool_size=15, max_replicas=20
#   total connections = 15 × 20 = 300 connections
#
# Check your database's max_connections setting:
#   PostgreSQL default: max_connections = 100
#   Managed PostgreSQL (e.g., AWS RDS db.t3.medium): max_connections ≈ 420
#
# If 300 > database max_connections / 2 (leave half for admin access):
#   → Use PgBouncer in transaction pooling mode as a connection multiplexer
#   → Or reduce pool_size_per_pod
#   → Or increase database instance size
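The cluster-wide budget check can be automated in a deploy script or CI step. The sketch below is an illustration only; `connectionBudget` is a hypothetical name, and the "half of max_connections" rule is taken from the guidance above.

```javascript
// Hypothetical helper: compares worst-case pool usage with the database connection limit.
function connectionBudget(poolSizePerPod, maxReplicas, dbMaxConnections) {
  const totalConnections = poolSizePerPod * maxReplicas;
  // Leave half of max_connections free, as recommended above.
  const budget = Math.floor(dbMaxConnections / 2);
  return {
    totalConnections,
    budget,
    withinBudget: totalConnections <= budget,
    // Largest replica count that still fits inside the budget.
    maxReplicasWithinBudget: Math.floor(budget / poolSizePerPod),
  };
}

console.log(connectionBudget(15, 20, 100));
// { totalConnections: 300, budget: 50, withinBudget: false, maxReplicasWithinBudget: 3 }
```

When `withinBudget` is false, the options listed above apply: add a pooler such as PgBouncer, shrink the per-pod pool, or move to a larger database instance.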

Step 4 — set CPU limits and HPA thresholds

CPU is the trickiest resource to size for MCP servers because tool call CPU profiles are highly variable. A tool that reads a record from SQLite uses microseconds of CPU; a tool that parses a large JSON response or computes a diff over a large dataset uses hundreds of milliseconds. Build your CPU estimate around your most CPU-intensive tool, not the average.

# CPU sizing approach:
# 1. Profile your heaviest tool call (in staging, with realistic data)
# 2. Estimate concurrent instances of that tool at peak
# 3. Add baseline overhead (event loop, HTTP parsing, JSON serialization)
#
# Example:
#   Heaviest tool: diff_documents, ~200 ms CPU per call
#   Peak concurrent calls of this tool: 10 per pod (20% of 50 concurrent sessions)
#   CPU demand: 10 × 200 ms = 2000 ms CPU per second = 2.0 CPU cores for this tool alone
#               (assumes each of those 10 sessions issues about one call per second)
#   Baseline overhead: 0.3 CPU cores
#   Total peak demand: 2.3 CPU cores per pod
#
# → Set pod CPU limit to 2.5 cores (about 10% above peak)
# → Note: one Node.js event loop runs JavaScript on a single core, so CPU-bound handlers
#   only use more than ~1 core if they run in worker threads or child processes
# → Set HPA target to 60% CPU utilization
#   (HPA utilization is a percentage of the CPU request, not the limit:
#    with a 1000m request, scale-out starts at 600m average per pod)

resources:
  requests:
    cpu: "1000m"     # 1 core request (what Kubernetes schedules on)
  limits:
    cpu: "2500m"     # 2.5 core limit (what the container can burst to)
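The CPU arithmetic in this step can be checked with two small functions. This is a sketch with hypothetical names; it assumes the heavy tool's call rate is expressed per second, and it relies on the fact that Kubernetes computes resource utilization for the HPA as a percentage of the container's request.

```javascript
// Hypothetical helper: CPU cores needed for a heavy tool plus baseline overhead.
function cpuDemandCores(heavyCallsPerSecond, cpuMsPerCall, baselineCores) {
  // 1000 ms of CPU time consumed per wall-clock second equals one fully busy core.
  return (heavyCallsPerSecond * cpuMsPerCall) / 1000 + baselineCores;
}

// Hypothetical helper: average per-pod usage at which the HPA begins to scale out.
function hpaScaleOutThresholdMillicores(cpuRequestMillicores, targetUtilizationPercent) {
  return (cpuRequestMillicores * targetUtilizationPercent) / 100;
}

console.log(cpuDemandCores(10, 200, 0.3).toFixed(1));  // 2.3
console.log(hpaScaleOutThresholdMillicores(1000, 60)); // 600
```

With a 1000m request, a 60% target triggers scale-out at 600m average usage per pod, well below the 2500m limit; raise the request if you want pods to run hotter before new replicas are added.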

HPA configuration derived from capacity estimates

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2
  maxReplicas: 20    # Baseline: peak_concurrent_sessions / sessions_per_pod × safety_factor
                     # 82 peak / 50 sessions_per_pod × 1.5 = ~2.5 → 3 pods at design peak; the cap of 20 is a generous ceiling for spikes and growth

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Scale out when average CPU hits 60% of the CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # 70% of the memory request; scale out before pods reach the limit and get OOM-killed
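To reason about how this HPA reacts, it helps to model its core scaling rule. The sketch below is a simplification with hypothetical function names: it applies the documented rule (desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a 10% tolerance) and ignores stabilization windows, unready pods, and multi-metric selection.

```javascript
// Simplified model of the HPA scaling rule for a single utilization metric.
function desiredReplicas(
  currentReplicas, currentUtilization, targetUtilization,
  { minReplicas = 2, maxReplicas = 20, tolerance = 0.1 } = {},
) {
  const ratio = currentUtilization / targetUtilization;
  // Within tolerance of the target: the HPA leaves the replica count unchanged.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const desired = Math.ceil(currentReplicas * ratio);
  // Clamp to the configured bounds.
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

// Pods needed for a session target, using the formula in the maxReplicas comment.
function replicasForSessions(designSessions, sessionsPerPod, safetyFactor = 1.5) {
  return Math.ceil((designSessions / sessionsPerPod) * safetyFactor);
}

console.log(desiredReplicas(2, 90, 60));   // 3
console.log(desiredReplicas(3, 63, 60));   // 3 (within tolerance)
console.log(desiredReplicas(15, 120, 60)); // 20 (clamped to maxReplicas)
console.log(replicasForSessions(82, 50));  // 3
```

Running the numbers this way shows that the design peak of 82 sessions needs about 3 pods, so a maxReplicas of 20 is headroom for growth rather than a requirement of the estimate.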

Step 5 — validate with load testing before launch

Capacity estimates are hypotheses. Validate them with k6 load tests against a staging environment that matches production resource limits. The k6 test in the k6 testing guide provides the scaffolding; configure it with your actual tool names and inputs.

The three key questions to answer in the load test:

Does P95 tool call latency stay below your target (typically 2–3 s) at peak concurrent sessions? If not, the bottleneck is usually one of: CPU (add cores or pods), the database (increase pool size or upgrade the DB), or an external API (add timeouts and circuit breakers).
Does the server remain stable during a 30-minute soak at 80% of peak load? Memory should not grow monotonically; if it does, there is a memory leak in a tool handler or the connection pool is not releasing connections.
Does the HPA scale out correctly during a spike test? Watch kubectl get hpa -w during the spike. The number of replicas should increase within 1–2 minutes of the spike starting. If it does not, check that the metrics-server is running and that kubectl top pods shows correct CPU metrics.
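For the soak-test question, "memory should not grow monotonically" can be turned into a numeric check. The sketch below fits a least-squares line to RSS samples taken at a fixed interval (for example from `process.memoryUsage().rss`, converted to MB). The function names and the 0.5 MB-per-sample threshold are assumptions for illustration; tune the threshold to your sampling interval.

```javascript
// Least-squares slope of RSS samples (MB) taken at a fixed interval.
function rssSlopeMbPerSample(samplesMb) {
  const n = samplesMb.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2; // sample indices are 0..n-1
  const meanY = samplesMb.reduce((sum, y) => sum + y, 0) / n;
  let numerator = 0;
  let denominator = 0;
  samplesMb.forEach((y, x) => {
    numerator += (x - meanX) * (y - meanY);
    denominator += (x - meanX) ** 2;
  });
  return numerator / denominator;
}

// Flags a soak run whose memory keeps climbing instead of stabilizing.
function looksLikeLeak(samplesMb, thresholdMbPerSample = 0.5) {
  return rssSlopeMbPerSample(samplesMb) > thresholdMbPerSample;
}

console.log(rssSlopeMbPerSample([100, 101, 102, 103]));      // 1
console.log(looksLikeLeak([100, 101, 102, 103]));            // true
console.log(looksLikeLeak([180, 182, 181, 180, 182, 181]));  // false
```

A slope test is more robust than comparing first and last samples, because garbage collection makes individual RSS readings noisy.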

Using AliveMCP response-time data for capacity monitoring

Capacity planning does not end at launch. Actual usage patterns differ from estimates, tool handler performance degrades as data grows, and traffic patterns shift. AliveMCP's 90-day response-time history gives you a continuous external signal that is more reliable than internal metrics for detecting capacity problems that affect real users.

What to watch in AliveMCP's latency graph

Pattern in AliveMCP latency graph	What it indicates	Action
Flat baseline, occasional spikes	Normal operation; spikes are traffic bursts or external API latency	None — monitor spike frequency
Flat weekday, lower weekend	Business-hours traffic pattern; HPA working correctly	Consider lower weekend minReplicas to save cost
Gradually increasing baseline over days/weeks	Memory leak, growing data (unindexed queries), or increasing user base approaching capacity	Profile heap, check EXPLAIN on DB queries, re-run capacity calculation
Sudden step-change increase after a deploy	New tool handler is slower; new dependency added without capacity adjustment	Roll back or hotfix; profile the new code path
Brief downtime windows (30–60 s gaps) recurring every few hours	Liveness probe restarting pods due to recurring event loop hang	Fix the hang; AliveMCP downtime windows correlate with pod restart events in `kubectl describe`

Setting AliveMCP response-time alerts as capacity triggers

AliveMCP allows you to set alert thresholds on response time in addition to downtime alerts. Configure a response-time alert at 2× your load-tested P95 latency. If AliveMCP's probe takes longer than this threshold, the server is under capacity pressure even if it has not yet gone down. This alert can fire days or weeks before an actual outage, giving you time to scale up proactively.

Example alert configuration: if your k6 load test showed P95 tool call latency of 800 ms at peak load, set AliveMCP's response-time alert at 1,600 ms. An AliveMCP probe runs a single initialize handshake, which is much lighter than a full tool call, so a 1,600 ms initialize time suggests that tool call latency under the same conditions is several times higher (roughly 5–10× as a rule of thumb).
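The threshold rule and the rising-baseline pattern can both be expressed as simple arithmetic on numbers read off the latency graph. The sketch below does not call any AliveMCP API; the function names are hypothetical, the inputs are values you supply yourself, and the 10% week-over-week threshold is borrowed from the right-sizing FAQ later in this guide.

```javascript
// Alert threshold: a multiple of the P95 latency measured in the load test.
function responseTimeAlertMs(loadTestedP95Ms, multiplier = 2) {
  return loadTestedP95Ms * multiplier;
}

// Relative change between two weekly latency baselines (0.1 means +10%).
function weekOverWeekGrowth(previousWeekMs, currentWeekMs) {
  return (currentWeekMs - previousWeekMs) / previousWeekMs;
}

// True when every consecutive week grew by at least the threshold.
function isCapacityDrift(weeklyBaselinesMs, growthThreshold = 0.1) {
  if (weeklyBaselinesMs.length < 2) return false;
  return weeklyBaselinesMs
    .slice(1)
    .every((current, i) => weekOverWeekGrowth(weeklyBaselinesMs[i], current) >= growthThreshold);
}

console.log(responseTimeAlertMs(800));         // 1600
console.log(isCapacityDrift([400, 450, 500])); // true
console.log(isCapacityDrift([400, 410, 405])); // false
```

Requiring growth in every consecutive week avoids flagging a single noisy week as drift; loosen that rule if your baseline is naturally volatile.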

SSE-transport capacity: the hidden connection cost

SSE-transport MCP servers have a capacity dimension that Streamable HTTP servers do not: the cost of maintaining idle open connections. An SSE connection that is open but not actively making tool calls still consumes:

A file descriptor on the server process
A libuv event loop watcher (one per socket)
~2–10 KB of memory for the socket object
Periodic keepalive writes to prevent Nginx/ALB from timing out the connection

At 200 concurrent SSE connections (200 users idle in an MCP-connected application), a single Node.js process uses approximately 200 file descriptors and sends keepalive events every 15–30 seconds. This is well within Node.js capacity — the default file descriptor limit is 1024 for most Linux processes, configurable up to the kernel limit — but it means your memory and CPU estimates must account for idle connections, not just active tool calls.
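The periodic keepalive write mentioned above can be sketched as follows. This is a minimal illustration, not the MCP SDK's own mechanism: `keepAliveFrame` and `startSseKeepAlive` are hypothetical names, `res` is assumed to be a Node.js `http.ServerResponse` (or anything with `write()` and `on('close')`), and the 15-second default is taken from the interval range quoted above.

```javascript
// SSE comment frame: lines starting with ":" are ignored by SSE clients,
// but the bytes still reset idle timers on proxies and load balancers.
function keepAliveFrame() {
  return ': keepalive\n\n';
}

// Starts periodic keepalive writes on an open SSE response.
function startSseKeepAlive(res, intervalMs = 15000) {
  const timer = setInterval(() => res.write(keepAliveFrame()), intervalMs);
  // Do not let the keepalive timer alone keep the process running.
  if (typeof timer.unref === 'function') timer.unref();
  // Stop writing once the client disconnects so the timer does not leak.
  res.on('close', () => clearInterval(timer));
  return timer;
}
```

In an SSE route handler you would call `startSseKeepAlive(res)` right after sending the response headers. Choose an interval shorter than the idle timeout of whatever proxy sits in front of the pod.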

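# Increase file descriptor limit for Node.js MCP server with many SSE connections
# A Dockerfile RUN step only affects the build shell, so raise the limit in the
# container entrypoint script instead (server.js is a placeholder for your entry file):
ulimit -n 65536   # soft limit; cannot exceed the hard limit set by the container runtime
exec node server.js

# Kubernetes has no pod-spec field for the per-process nofile limit; it is inherited
# from the node's container runtime configuration. fs.file-max is a node-wide,
# non-namespaced sysctl (the system-wide total), so it cannot be set through
# securityContext.sysctls. Raise limits in the runtime or node configuration instead.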

// Monitor active FD count in your liveness/ready endpoint (Linux only):
import { readdirSync } from 'fs';
function getFdCount() {
  try {
    // /proc/self/fd has one entry per open descriptor; reading the directory briefly adds one
    return readdirSync('/proc/self/fd').length;
  } catch { return -1; }  // e.g. hosts without /proc
}

// Expose in /health endpoint for visibility (app is assumed to be an Express app)
app.get('/health', (req, res) => {
  res.json({ fd_count_approx: getFdCount(), uptime: process.uptime() });
});

Capacity planning checklist

Estimate peak concurrent sessions from DAU, session frequency, and session duration
Size pod memory using the per-bucket formula: baseline + tool registry + per-session state + in-flight responses + pool overhead + GC headroom
Set --max-old-space-size to 80% of the pod memory limit
Size connection pool: ceil(concurrent_tool_calls × avg_query_ms / 1000) × 1.25 + 2
Verify total pool connections across all pods does not exceed database max_connections / 2
Set HPA CPU target to 60%, memory target to 70% (both measured against pod requests, not limits)
Set minReplicas ≥ 2; set maxReplicas to cover 3× estimated peak sessions
Run k6 load test at 1.5× estimated peak — verify P95 < target and no memory leak in soak
Configure AliveMCP response-time alert at 2× load-tested P95 latency
Re-run capacity calculation every quarter or after any 50%+ traffic growth

Frequently asked questions

How many concurrent sessions can a single Node.js MCP server pod handle?

A well-tuned Node.js MCP server on a 0.5 CPU / 256 MB pod can typically handle 30–80 concurrent Streamable HTTP sessions or 50–150 concurrent idle SSE connections, depending on tool handler CPU intensity and response payload size. The binding constraint is usually one of: CPU saturation from heavy tool handlers, the memory limit being hit by large tool response buffers, or database connection pool exhaustion. The only reliable way to find your specific server's limit is to run a k6 load test against staging with representative traffic. Start the load test at your estimated capacity, look for the knee in the latency curve (where P95 starts increasing non-linearly with added VUs), and plan each pod for 60–70% of the VU count at that knee so the HPA scales out before pods reach it. Do not rely on rules of thumb for production sizing; your tool handler profile is too specific to generalize.

Should I use vertical scaling (larger pods) or horizontal scaling (more pods) for MCP servers?

Horizontal scaling (more pods, same size) is almost always preferable for production MCP servers because it provides both higher throughput and higher availability — losing one pod in a 10-pod cluster means 10% capacity loss, while losing a single large pod means 100% downtime. The exception is when your workload is CPU-bound from a single long-running tool call (e.g., a tool that spawns a worker process for compute-intensive work). In that case, a larger CPU limit on each pod may be necessary to prevent individual tool calls from timing out, even if you also scale horizontally. For Streamable HTTP servers, horizontal scaling distributes load across as many pods as needed and allows each pod to use modest resources. For SSE servers, horizontal scaling distributes clients across pods (via Ingress sticky sessions), which reduces the SSE connection count per pod and avoids the single-process file descriptor limit.

How do I handle database connection pooling when my MCP server scales horizontally from 2 to 20 pods?

Plan your database connection budget for the maximum replica count, not the current count. With a 15-connection pool per pod and a maximum of 20 pods, you are potentially consuming 300 connections simultaneously at peak scale. If your database's max_connections is 100 (PostgreSQL default), the seventh pod pushes you past the limit (7 × 15 = 105). Solutions, in order of preference: (1) Use PgBouncer in transaction pooling mode. PgBouncer multiplexes thousands of client connections from your pods onto a small pool of real database connections (typically 10–20% of your application-level pool size); (2) Use a managed database with a higher connection limit (limits scale with instance memory: AWS RDS db.t3.medium allows roughly 420 connections, db.r5.large roughly 1,600); (3) Reduce your pool size per pod and increase connection reuse efficiency. Always test at maxReplicas scale in staging, because connection exhaustion typically manifests as tool call timeouts at exactly the wrong moment (during a traffic spike when you have just scaled out).

How do I use AliveMCP's data to right-size my infrastructure costs?

AliveMCP's response-time graph shows you the latency pattern across a 90-day window. Look for two patterns that indicate over-provisioning: (1) consistently flat low latency (well below your target P95) combined with low CPU utilization from kubectl top — you have more capacity than you need; consider reducing minReplicas or downsizing pod resource limits; (2) weekend or overnight latency that is identical to weekday peak latency — your minReplicas is too high for off-peak periods; consider a cron-based scheduled scaling rule that reduces minReplicas to 1 or 2 overnight. On the under-provisioning side, a rising latency baseline (week-over-week increase of 10%+) indicates you are approaching your capacity ceiling and need to scale up before the next traffic growth inflection. The advantage of AliveMCP over internal cluster metrics for this purpose is that it shows you what users actually experience — not just what Kubernetes believes is happening inside the cluster.

What are the most common capacity planning mistakes for first-time MCP server operators?

The four most common mistakes: (1) Sizing for average load, not peak. MCP server traffic is bursty — users interact in waves. Size for 3× average to handle peaks without degradation. (2) Ignoring database connection budget. Horizontal scaling multiplies your connection count. Engineers discover this when the 10th pod starts and the database starts rejecting connections. (3) Setting maxReplicas too low. If maxReplicas equals your expected steady-state replica count, there is no room for the HPA to scale out during traffic spikes. Set maxReplicas at 3–5× your steady-state replica count. (4) Not testing memory under sustained load. A 5-minute k6 test does not reveal memory leaks that only manifest after hours. Run a 60-minute soak test in staging before the first production launch. Monitor RSS (resident set size) trend in the soak test — RSS should stabilize, not grow monotonically. If it grows, profile the heap with node --inspect and Chrome DevTools or the clinic.js suite.