MCP server backpressure
LLM agents can issue tool calls far faster than your backend can process them. A single agent session running parallel tool calls can saturate a database connection pool, overwhelm an external API, or push CPU to 100% — cascading into failures that affect every other client. Backpressure is the mechanism by which a server signals to callers that it is at capacity and they should slow down. For MCP servers, backpressure takes the form of concurrency limits, bounded queues, and explicit rejection — all of which protect downstream resources while giving agents the signal they need to back off.
TL;DR
Wrap every tool handler in a concurrency semaphore (max N in-flight calls) with a small bounded queue. When both are full, reject immediately with HTTP 503 and a Retry-After header rather than queuing indefinitely; reserve 429 for per-client rate limits. Size N to your database connection pool (or whichever resource is the bottleneck). Monitor active concurrency and queue depth as metrics: if active ≥ N consistently, your server needs more capacity or fewer parallel agent sessions.
Why MCP servers need explicit backpressure
Node.js is single-threaded but non-blocking — it can accept thousands of concurrent connections while awaiting I/O. The problem is not the Node event loop but your downstream resources:
- Database connection pools: a pool of 10 connections can service 10 concurrent queries; the 11th blocks waiting for a connection. If 50 tool calls are waiting, each one's response time grows linearly with its position in the queue, and an uncapped wait queue keeps consuming memory.
- External APIs with rate limits: an agent issuing 100 tool calls/minute to a tool that calls an external API with a 60 req/min limit will exceed the limit, triggering 429 errors that propagate back to the agent as failures.
- Memory-intensive operations: tools that load large files into memory can exhaust Node's heap, and CPU-intensive transforms can block the event loop, if called concurrently without limits.
- Cascading failures: without backpressure, an overloaded server slows all operations, causing timeouts, which cause agent retries, which add more load, which makes the server slower still. This positive feedback loop ends in an out-of-memory error or a process crash.
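To make the linear growth in the first bullet concrete, here is a back-of-the-envelope estimate. It assumes every query takes about the same time and that waiting calls drain in waves of one pool's worth; `estimatedWaitMs` and its parameters are illustrative names, not part of any library.

```typescript
// Rough model (an assumption, not a measurement): waiting calls drain in
// "waves" of poolSize, and each wave takes about one mean query time.
function estimatedWaitMs(
  positionInQueue: number, // 1 = first call waiting for a connection
  poolSize: number,
  meanQueryMs: number
): number {
  return Math.ceil(positionInQueue / poolSize) * meanQueryMs;
}

// Pool of 10 connections, 200 ms mean query time:
const firstWaiter = estimatedWaitMs(1, 10, 200); // 200
const fiftiethWaiter = estimatedWaitMs(50, 10, 200); // 1000
console.log({ firstWaiter, fiftiethWaiter });
```

Under this model the 50th waiting call spends about a second queued before its own query starts, which may already exceed a short agent-side timeout.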
Concurrency semaphore pattern
A semaphore limits the number of simultaneous in-flight operations. Use the p-limit package (or a manual implementation) to wrap tool handlers:
import pLimit from 'p-limit';
import { z } from 'zod';

// `server` (an MCP server instance) and `db` (a Postgres client) are assumed to be set up elsewhere.

// One limiter per resource class; don't share limits across unrelated tools
const dbLimit = pLimit(10); // max 10 concurrent database operations
const apiLimit = pLimit(5); // max 5 concurrent external API calls

server.tool(
  'search_records',
  'Full-text search across customer records',
  { query: z.string(), limit: z.number().int().min(1).max(100).default(20) },
  async ({ query, limit }) => {
    // pLimit queues if at capacity, which suits uniform, short-lived operations
    return dbLimit(async () => {
      // plainto_tsquery accepts free-form text; to_tsquery raises a syntax
      // error on unquoted multi-word input such as "two words"
      const rows = await db.query(
        'SELECT * FROM records WHERE content @@ plainto_tsquery($1) LIMIT $2',
        [query, limit]
      );
      return { content: [{ type: 'text', text: JSON.stringify(rows) }] };
    });
  }
);
By default, pLimit queues excess requests. This is acceptable when operations are fast and queue depth is bounded. For slow operations or large bursts, pair with a queue depth check.
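One way to add that queue depth check is to read the limiter's `pendingCount` property, which p-limit exposes, before enqueueing. The sketch below is written against a small structural interface rather than importing p-limit, so `QueueAwareLimiter`, `runWithQueueCap`, and the `maxPending` threshold are illustrative assumptions; a real p-limit limiter is intended to fit the same shape.

```typescript
// Structural subset of a limiter: callable, with a count of queued tasks.
interface QueueAwareLimiter {
  <T>(fn: () => Promise<T>): Promise<T>;
  readonly pendingCount: number;
}

// Rejects up front when too many tasks are already waiting; otherwise
// hands the task to the limiter as usual.
function runWithQueueCap<T>(
  limiter: QueueAwareLimiter,
  maxPending: number,
  fn: () => Promise<T>
): Promise<T> {
  if (limiter.pendingCount >= maxPending) {
    const err = Object.assign(new Error('Server at capacity: queue full'), {
      code: 'BACKPRESSURE_REJECTION',
      retryAfterSeconds: 5,
    });
    return Promise.reject(err);
  }
  return limiter(fn);
}
```

With `dbLimit = pLimit(10)`, a handler would call `runWithQueueCap(dbLimit, 20, () => db.query(...))` instead of `dbLimit(...)` directly; the cap of 20 is an arbitrary starting point, not a recommendation from p-limit.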
Bounded queue with early rejection
Queuing indefinitely is dangerous: the queue grows without bound, consuming memory, and requests at the back of the queue wait so long that the agent has already timed out and retried — meaning the queued work is obsolete when it finally executes.
Instead, reject requests when the queue depth exceeds a threshold:
class BoundedSemaphore {
  private active = 0;
  // FIFO list of resolvers for calls waiting on a slot
  private readonly waiters: Array<() => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly maxQueue: number
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.waiters.length >= this.maxQueue) {
        // Queue is full: reject immediately rather than growing unbounded
        const err = new Error('Server at capacity, retry after backoff');
        (err as any).code = 'BACKPRESSURE_REJECTION';
        (err as any).retryAfterSeconds = 5;
        throw err;
      }
      // Sleep until a finishing call hands its slot over. This avoids
      // polling with setImmediate, which spins the event loop while waiting.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await fn();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next(); // slot passes straight to the next waiter; active is unchanged
      } else {
        this.active--;
      }
    }
  }

  get stats() {
    return { active: this.active, queued: this.waiters.length };
  }
}
// Size maxConcurrent to your database pool size
// Size maxQueue to ~2x maxConcurrent — brief bursts queue, sustained overload rejects
const semaphore = new BoundedSemaphore(10, 20);
HTTP response codes and headers
When rejecting due to backpressure, use the correct HTTP status and signal the retry delay:
| Situation | Status | Headers | Meaning |
|---|---|---|---|
| Server at capacity — transient, retry works | 503 | Retry-After: 5 | Service temporarily unavailable |
| Rate limit per client exceeded | 429 | Retry-After: 60, X-RateLimit-Limit, X-RateLimit-Reset | Too many requests from this client |
| Queue full (server-wide) | 503 | Retry-After: 10 | Load shedding — not caller-specific |
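Express middleware to translate backpressure errors into proper responses. This assumes the rejection is thrown in the HTTP route layer and reaches Express; an error thrown inside a tool handler is usually returned by the MCP SDK as a tool error result instead of reaching this middleware: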
import type { Request, Response, NextFunction } from 'express';

// Error handler middleware; register it after the route handlers
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  if ((err as any).code === 'BACKPRESSURE_REJECTION') {
    const retryAfter = (err as any).retryAfterSeconds ?? 5;
    res
      .status(503)
      .set('Retry-After', String(retryAfter))
      .set('X-Backpressure-Reason', 'queue-full')
      .json({ error: 'server_at_capacity', retryAfter });
    return;
  }
  next(err);
});
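On the calling side, a client can honor the `Retry-After` value instead of retrying immediately. The following is a minimal sketch, not part of any MCP client library: `callWithBackoff`, `AttemptResult`, and the injected `sleep` function are hypothetical names, and the caller is assumed to have already parsed `Retry-After` into seconds.

```typescript
// Outcome of one request attempt, reduced to what the retry loop needs.
interface AttemptResult {
  status: number;
  retryAfterSeconds?: number; // parsed from the Retry-After header, if present
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries only on 503 and 429, waiting for the server-suggested delay
// (or 5 seconds if none was given) before each new attempt.
async function callWithBackoff(
  attempt: () => Promise<AttemptResult>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep
): Promise<AttemptResult> {
  let result = await attempt();
  for (let tries = 1; tries < maxAttempts; tries++) {
    if (result.status !== 503 && result.status !== 429) break;
    await sleep((result.retryAfterSeconds ?? 5) * 1000);
    result = await attempt();
  }
  return result;
}
```

Capping the number of attempts matters as much as the delay: unbounded retries from many agents recreate the feedback loop that backpressure is meant to break.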
Per-client vs global limits
Global concurrency limits protect your backend but do not prevent a single noisy client from consuming all available slots. Combine global and per-client limits:
const globalSemaphore = new BoundedSemaphore(50, 100);
const clientSemaphores = new Map<string, BoundedSemaphore>();

function getClientSemaphore(clientId: string): BoundedSemaphore {
  if (!clientSemaphores.has(clientId)) {
    // Each client gets at most 10 concurrent, queue of 20
    clientSemaphores.set(clientId, new BoundedSemaphore(10, 20));
    // GC stale entries; production code uses an LRU cache here
  }
  return clientSemaphores.get(clientId)!;
}

async function limitedToolCall<T>(clientId: string, fn: () => Promise<T>): Promise<T> {
  // Must acquire both client and global slot
  return getClientSemaphore(clientId).run(() =>
    globalSemaphore.run(fn)
  );
}
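The map above grows by one entry per client ID and never shrinks. Short of a full LRU cache, a periodic sweep can drop entries that have been idle for a while. This is a sketch under assumptions: `IdleRegistry`, `sweep`, and the `canEvict` callback are hypothetical names, and the value type is generic so the example does not depend on `BoundedSemaphore`.

```typescript
// Keeps one lazily created value per key and forgets keys that go unused.
class IdleRegistry<V> {
  private readonly entries = new Map<string, { value: V; lastUsed: number }>();

  constructor(
    private readonly create: () => V,
    private readonly maxIdleMs: number,
    private readonly now: () => number = () => Date.now()
  ) {}

  // Returns the value for key, creating it on first use, and marks it as used.
  get(key: string): V {
    let entry = this.entries.get(key);
    if (!entry) {
      entry = { value: this.create(), lastUsed: this.now() };
      this.entries.set(key, entry);
    }
    entry.lastUsed = this.now();
    return entry.value;
  }

  // Removes entries unused for longer than maxIdleMs and returns how many
  // were removed. canEvict lets the caller veto eviction, e.g. while a
  // semaphore still has active or queued calls.
  sweep(canEvict: (value: V) => boolean = () => true): number {
    const cutoff = this.now() - this.maxIdleMs;
    let removed = 0;
    this.entries.forEach((entry, key) => {
      if (entry.lastUsed < cutoff && canEvict(entry.value)) {
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With the semaphores from this section, the registry could be built as `new IdleRegistry(() => new BoundedSemaphore(10, 20), 600000)` and swept on a timer, with `canEvict` checking that `stats.active` and `stats.queued` are both zero. The ten-minute idle window is an arbitrary example value.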
Monitoring queue depth
Emit queue depth as a metric so you can alert before backpressure starts rejecting requests:
import { Counter, Gauge } from 'prom-client';

const activeCalls = new Gauge({
  name: 'mcp_active_tool_calls',
  help: 'Number of tool calls currently executing',
  labelNames: ['tool'],
});

const queuedCalls = new Gauge({
  name: 'mcp_queued_tool_calls',
  help: 'Number of tool calls waiting in the backpressure queue',
});

const rejectedCalls = new Counter({
  name: 'mcp_backpressure_rejections_total',
  help: 'Number of tool calls rejected due to backpressure',
  labelNames: ['reason'],
});

// In your semaphore: update gauges on each state transition
// Instrument the semaphore.stats fields and export via /metrics
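If updating gauges on every state transition is more intrusive than you want, one simpler option is to poll `stats` on an interval. The sketch below is written against structural interfaces so it stands alone: `GaugeLike` is an assumed minimal shape (anything with a `set(value)` method, which prom-client gauges provide), and `startStatsExport` is a hypothetical helper name.

```typescript
// Minimal shape of a gauge: something that can be set to a number.
interface GaugeLike {
  set(value: number): void;
}

// Anything exposing the same stats getter as the semaphore in this guide.
interface StatsSource {
  readonly stats: { active: number; queued: number };
}

// Publishes the current stats immediately, then again every intervalMs.
// Returns a function that stops the polling timer.
function startStatsExport(
  source: StatsSource,
  activeGauge: GaugeLike,
  queuedGauge: GaugeLike,
  intervalMs = 1000
): () => void {
  const publish = (): void => {
    const { active, queued } = source.stats;
    activeGauge.set(active);
    queuedGauge.set(queued);
  };
  publish();
  const timer = setInterval(publish, intervalMs);
  return () => clearInterval(timer);
}
```

The trade-off is resolution: polling can miss spikes shorter than the interval, so rejections are still best counted at the point where they are thrown.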
Alert when mcp_queued_tool_calls stays above 0 for more than 30 seconds — it means your server is consistently saturated. Alert when mcp_backpressure_rejections_total rate exceeds 1/minute — it means the queue is filling and clients are being turned away.
AliveMCP external probes detect the downstream symptom: probe response time rises, then probe returns 503. Pair external probing with internal queue depth metrics to distinguish "server is overloaded" from "server is down".
Backpressure and the circuit breaker
Backpressure and circuit breakers are complementary. Backpressure limits how much work enters your server from above. Circuit breakers limit how much work your server sends to dependencies below. Use both:
- Backpressure semaphore: controls inbound tool call concurrency
- Circuit breaker on database client: fast-fails when the database is overloaded
- Circuit breaker on external API client: fast-fails when the upstream API is rate-limiting you
When a downstream circuit opens, the operations that would have gone there complete faster (with errors), freeing semaphore slots sooner. This makes the system self-regulating under partial failure.
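For readers who have not used one, a circuit breaker can be sketched in a few lines. This is a deliberately minimal illustration with hypothetical names (`SimpleCircuitBreaker`, `failureThreshold`, `cooldownMs`), not the API of any particular library; production breakers usually also limit how many trial calls run after the cooldown and filter which errors count as failures.

```typescript
class SimpleCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = () => Date.now()
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      // Open: fail fast without touching the dependency
      throw new Error('Circuit open: failing fast');
    }
    // Closed, or cooldown elapsed (a trial call is allowed through)
    try {
      const result = await fn();
      this.consecutiveFailures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.openedAt = this.now(); // open, or re-open after a failed trial
      }
      throw err;
    }
  }
}
```

Inside a semaphore-guarded tool handler, the database call would go through `breaker.call(...)`, so that while the circuit is open each handler returns an error quickly and releases its slot.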
Further reading
- MCP server circuit breaker — fast-fail on known-broken dependencies
- MCP server rate limiting — per-client throttling
- MCP server concurrency — worker threads and parallel tool execution
- MCP server connection pooling — database and HTTP pool sizing
- MCP server timeout — request and tool call deadline enforcement
- MCP server graceful degradation — fallback responses under partial failure
- MCP server metrics — Prometheus instrumentation and alerting
- MCP server load testing — finding your concurrency ceiling
- AliveMCP — uptime monitoring for HTTP-deployed MCP servers