Database guide · 2026-06-20 · MCP Server Database & Event Architecture
MCP Server Data Correctness: Five Ways Your Server Can Be 'Up' While Delivering Wrong Answers
Protocol availability is necessary but not sufficient for MCP server reliability. Your initialize handshake completes. Your tools/list returns the expected schema. Your uptime monitor shows green. And yet every tools/call returns wrong data — stale inventory levels from two hours ago, a background job that will never complete, or a PostgreSQL connection pool so exhausted that tool calls silently return isError: true while the protocol layer stays healthy. This guide covers five database and event architecture patterns — PostgreSQL connection pooling, background job queues, event-driven pub/sub, read replica routing, and CDC data pipelines — and the distinct failure mode each one creates that the external protocol probe cannot detect.
The hidden layer: five failures, five false greens
Each data architecture pattern creates a distinct class of failure that the MCP protocol probe cannot distinguish from correct operation. The probe tests whether the server speaks the protocol — not whether the data the server returns is correct. All five failure modes share the same structural property: the MCP JSON-RPC handshake succeeds, the tool handler executes, and a response is returned, but the response contains wrong data. No JSON-RPC error code is set unless you explicitly add one. No exception propagates unless you write the circuit breaker yourself. The agent receives content that looks valid, acts on it, and produces wrong output.
| Pattern | Silent failure mode | What the protocol probe sees | What data correctness monitoring sees |
|---|---|---|---|
| PostgreSQL connection pool | Pool exhausted: every tool call queues until connectionTimeoutMillis, then returns isError: true | Green: initialize and tools/list never touch the pool | pool.waitingCount > 0 sustained; SELECT 1 times out; canary tool returns isError: true |
| Background job queue | Worker process crashed: jobs enqueue but never complete; agent polls job:{id} resource forever | Green: MCP server still responds to all protocol messages | Canary health_check_job enqueues sentinel and times out waiting for completion |
| Event-driven pub/sub | Subscriber crashed — in-memory Map frozen; tools return data from hours ago without any error | Green — tools still execute and return valid JSON responses | lastEventAt staleness exceeds threshold; /health returns 503 degraded |
| Read replica routing | Replica lag — reads return pre-write state; agent writes record and immediately reads stale data back | Green — writes to primary succeed, reads from replica return valid responses | pg_last_xact_replay_timestamp() lag exceeds threshold; canary write not visible on replica within 5s |
| CDC data pipeline | Consumer lag or replication slot behind — entire materialized view frozen; all reads return wrong data indefinitely | Green — server responds, tools return results from frozen local cache | Per-table tableFreshness age exceeds threshold; consumer lag total exceeds alert level |
Pattern 1: PostgreSQL connection pool exhaustion
PostgreSQL connection pools have a fixed maximum size. When the pool is full and a new tool call arrives, it queues and waits. Without a connection timeout, the tool call waits indefinitely, blocking the handler. With a timeout, it fails with a timeout error and returns isError: true to the caller. Either way, the agent's work fails — but the failure is invisible to the external protocol probe.
The protocol probe stays green because initialize and tools/list are handled entirely within the MCP SDK's session layer. They never touch the database connection pool. A server with a fully exhausted pool — zero idle connections, ten queued waiters — responds to every protocol-level probe message correctly. The probe sees a healthy server. Every real agent call fails silently.
The critical configuration that turns pool exhaustion from an indefinite hang into a bounded 3-second failure:
import pg from 'pg';
const pool = new pg.Pool({
connectionString: process.env.DATABASE_URL,
max: 20,
connectionTimeoutMillis: 3000, // Fail fast — don't queue indefinitely
idleTimeoutMillis: 10000,
allowExitOnIdle: false,
});
// Expose pool health for external monitoring
app.get('/health', async (req, res) => {
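// Busy connections over the configured max; dividing by totalCount yields NaN on a fresh pool with zero clients
const utilization = (pool.totalCount - pool.idleCount) / pool.options.max;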
const isExhausted = pool.waitingCount > 0;
if (isExhausted || utilization > 0.9) {
return res.status(503).json({
status: 'degraded',
pool_total: pool.totalCount,
pool_idle: pool.idleCount,
pool_waiting: pool.waitingCount,
pool_utilization: utilization,
});
}
try {
await pool.query('SELECT 1'); // pool.query checks out a client and releases it even if the query throws
res.json({ status: 'healthy', pool_utilization: utilization });
} catch (err) {
res.status(503).json({ status: 'unhealthy', error: err.message });
}
});
Size the pool with Math.floor((max_connections - reserved_connections) / instance_count). For a dedicated PostgreSQL server with max_connections = 100, 5 reserved system connections, and 5 MCP server instances, that gives a max of 19 per instance (95 / 5). Treat 19 as a hard ceiling rather than a target: with all five instances at 19 the server has no spare connections, so alert at 70–80% pool utilization to give the threshold room to fire before you hit the ceiling.
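As a worked instance of the sizing formula, the sketch below computes the per-instance ceiling together with the number of busy connections at which a utilization alert would fire. The helper name `poolSizing` and the 0.8 alert fraction are assumptions for illustration (the fraction is the upper end of the 70–80% range mentioned above), not values prescribed by any library.

```javascript
// Hypothetical helper: per-instance pool ceiling plus an alert level.
function poolSizing(maxConnections, reservedConnections, instanceCount, alertFraction = 0.8) {
  if (instanceCount < 1) throw new RangeError('instanceCount must be >= 1');
  // Hard ceiling: connections left after the reserved ones, split evenly.
  const perInstanceMax = Math.floor((maxConnections - reservedConnections) / instanceCount);
  // Busy-connection count at which a utilization alert should fire.
  const alertAtBusy = Math.floor(perInstanceMax * alertFraction);
  return { perInstanceMax, alertAtBusy };
}

// The example from the text: max_connections = 100, 5 reserved, 5 instances.
const sizing = poolSizing(100, 5, 5);
console.log(sizing); // { perInstanceMax: 19, alertAtBusy: 15 }
```

Passing `perInstanceMax` as the pool's `max` and alerting once `totalCount - idleCount` reaches `alertAtBusy` keeps the two numbers derived from the same inputs.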
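PgBouncer in transaction mode is the right upgrade when you exceed 5–6 MCP instances. Each MCP tool call maps to one database transaction (short-lived, discrete, requiring no server-side session state), which is exactly the workload transaction pooling mode is designed for. The one incompatibility: server-side prepared statements are not safe in transaction mode on PgBouncer releases before 1.21 (later releases can track them when max_prepared_statements is set). With node-postgres, that means avoiding named queries, that is, query configs that set a name field; unnamed parameterized queries do not leave a persistent prepared statement behind. Settings such as statement_cache_size: 0 and ?pgbouncer=true belong to other clients (asyncpg and Prisma respectively) and have no effect in pg.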
The canary monitoring pattern is a health_check tool that runs SELECT 1 via the connection pool and returns pool.totalCount, pool.idleCount, and pool.waitingCount. AliveMCP's custom health URL at /health catches the silent exhaustion case — the external protocol probe would never see it because initialize and tools/list never interact with the database at all.
Pattern 2: Background job worker failure
Background job queues decouple long-running work from the synchronous MCP tool call. The tool enqueues a job and returns immediately with a job_id and a poll resource URI. The agent polls resources/read?uri=job:{id} until the status reaches completed or failed. This pattern is essential for any operation that might exceed the implicit 30-second timeout most MCP clients enforce — PDF generation, large exports, web scraping, email delivery to a list.
The silent failure occurs when the worker process crashes. The MCP server continues responding to all protocol messages. Jobs enqueue successfully: the tool returns { job_id: "abc123", poll_resource: "job:abc123" }. The agent starts polling. The job stays in the waiting state (or in active, if the worker died mid-job) because no worker is processing it. The agent eventually times out after N polls, hallucinates a result, or loops until a human notices the pipeline has stalled.
// BullMQ: enqueue-immediately pattern
import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';
const connection = new Redis(process.env.REDIS_URL);
const queue = new Queue('mcp-jobs', { connection });
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'run_report') {
const idempotencyKey = request.params.arguments.idempotency_key;
const job = await queue.add('run_report', request.params.arguments, {
jobId: idempotencyKey, // Deduplication: same key = same job
removeOnComplete: { age: 3600 },
removeOnFail: { age: 86400 },
});
return {
content: [{ type: 'text', text: JSON.stringify({
job_id: job.id,
poll_resource: `job:${job.id}`,
status: 'queued',
}) }]
};
}
});
// Canary tool: validates the full worker pipeline end-to-end
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'health_check_job') {
const job = await queue.add('canary', { canary: true }, {
jobId: `canary-${Date.now()}`,
priority: 1, // Highest among prioritized jobs; BullMQ still runs jobs added without a priority first
});
// Poll for 30 seconds
const deadline = Date.now() + 30000;
while (Date.now() < deadline) {
await new Promise(r => setTimeout(r, 1000));
const state = await job.getState();
if (state === 'completed') {
return { content: [{ type: 'text', text: 'worker_healthy' }] };
}
if (state === 'failed') {
throw new Error('canary_job_failed');
}
}
throw new Error('worker_unresponsive: canary job not completed in 30s');
}
});
Call health_check_job via AliveMCP's canary tool feature or register the tool's endpoint as the custom health check. AliveMCP calls it every probe cycle — if the worker is down, the canary times out, and AliveMCP fires an alert within 60 seconds. Without this check, a crashed worker is completely invisible: the MCP server looks healthy on all external metrics while the agent's async work pipeline is dead.
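The snippets in this guide each call setRequestHandler(CallToolRequestSchema, ...) separately for brevity. Assuming the SDK keeps a single handler per request schema, a later registration would replace the earlier one, so in a real server the tools need one shared entry point. The sketch below shows one way to do that with a dispatch table; `registerTool`, `dispatchToolCall`, and `toolHandlers` are hypothetical names, and the returned objects only mimic the tool-result shape used elsewhere in this guide.

```javascript
// Hypothetical dispatch table: one entry per tool name.
const toolHandlers = new Map();

function registerTool(name, handler) {
  // Refuse silent overwrites so two tools cannot shadow each other.
  if (toolHandlers.has(name)) throw new Error(`duplicate tool: ${name}`);
  toolHandlers.set(name, handler);
}

// Single entry point, intended to be passed to one setRequestHandler call.
async function dispatchToolCall(request) {
  const handler = toolHandlers.get(request.params.name);
  if (!handler) {
    // Unknown tools fail explicitly instead of resolving to undefined.
    return {
      isError: true,
      content: [{ type: 'text', text: `unknown_tool: ${request.params.name}` }],
    };
  }
  return handler(request.params.arguments ?? {});
}

// Example registration; the body stands in for the real enqueue logic.
registerTool('run_report', async (args) => ({
  content: [{ type: 'text', text: JSON.stringify({ status: 'queued', key: args.idempotency_key }) }],
}));
```

With this shape, the canary tool becomes just another `registerTool('health_check_job', ...)` entry rather than a second handler registration.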
The worker isolation principle is as important as the canary. The worker process should be separate from the MCP server process. A CPU-intensive worker running in the same event loop as the MCP server blocks every protocol message behind the running job. Use worker_threads for moderately CPU-bound work, or a completely separate process or container that shares only the Redis or PostgreSQL connection string with the MCP server.
The pg-boss alternative routes jobs through PostgreSQL instead of Redis, eliminating the Redis dependency for teams that already operate PostgreSQL. The interface is closely analogous rather than identical: boss.send() and boss.work() play the roles of BullMQ's queue.add() and new Worker(). The tradeoff: pg-boss adds 1–10ms latency vs BullMQ's sub-millisecond enqueue, and jobs persist as rows that survive Redis flushes. For MCP servers already running PostgreSQL, pg-boss is often the right default.
Pattern 3: Event pipeline staleness
Event-driven MCP servers maintain in-memory state updated by an event subscriber — Redis pub/sub, PostgreSQL LISTEN/NOTIFY, or webhook delivery. Tool handlers read from this in-memory state at sub-millisecond latency rather than querying the source database on every call. This produces excellent performance and removes database load from the hot path entirely.
The silent failure mode is the most dangerous across all five patterns, because it produces no signal at all. When the event subscriber crashes or disconnects, the in-memory state freezes at the last-received event. Tool handlers continue executing. They continue reading from the Map. They continue returning responses. The responses are just from one hour ago — or three hours ago, or twelve, depending on how long the subscriber has been down. No error is thrown. No JSON-RPC error code is set. The agent receives valid-looking content that is factually wrong and acts on it accordingly.
import { createClient } from 'redis';
const subscriber = createClient({ url: process.env.REDIS_URL });
const deploymentStatus = new Map();
let lastEventAt = Date.now();
await subscriber.connect();
await subscriber.subscribe('deployment:status', (message) => {
const event = JSON.parse(message);
deploymentStatus.set(event.deployment_id, event);
lastEventAt = Date.now(); // Track freshness
});
// Staleness check in /health
app.get('/health', (req, res) => {
const staleSecs = (Date.now() - lastEventAt) / 1000;
const STALE_THRESHOLD_SECS = 300; // 3-5× expected maximum quiet period
if (!subscriber.isReady) {
return res.status(503).json({ status: 'unhealthy', reason: 'subscriber_disconnected' });
}
if (staleSecs > STALE_THRESHOLD_SECS) {
return res.status(503).json({
status: 'degraded',
reason: 'event_pipeline_stale',
last_event_secs_ago: Math.round(staleSecs),
threshold_secs: STALE_THRESHOLD_SECS,
});
}
res.json({ status: 'healthy', last_event_secs_ago: Math.round(staleSecs) });
});
Staleness threshold formula: 3–5× the longest quiet period you expect from the event source under normal operation. If deployment status events arrive at least once every 60 seconds, a 300-second threshold (5×) is appropriate. Too low and you get false alerts during legitimate quiet periods. Too high and agents receive hours of stale data before the alert fires. To find the quiet period, observe your event arrival times over a week, take the 99th-percentile inter-event gap, and then apply the multiplier.
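The threshold rule can be sketched as a small calculation over recorded event timestamps. This is only an illustration: the helper name `stalenessThresholdSecs`, the nearest-rank percentile method, and the default multiplier of 3 are assumptions, and the input is expected to be a week or so of arrival times in milliseconds.

```javascript
// Hypothetical helper: derive a staleness threshold from observed event timestamps (ms).
function stalenessThresholdSecs(eventTimesMs, multiplier = 3) {
  if (eventTimesMs.length < 2) throw new RangeError('need at least two events');
  const sorted = [...eventTimesMs].sort((a, b) => a - b);

  // Gaps between consecutive events, in seconds.
  const gapsSecs = [];
  for (let i = 1; i < sorted.length; i++) {
    gapsSecs.push((sorted[i] - sorted[i - 1]) / 1000);
  }
  gapsSecs.sort((a, b) => a - b);

  // Nearest-rank 99th percentile of the inter-event gaps.
  const rank = Math.ceil(0.99 * gapsSecs.length);
  const p99 = gapsSecs[rank - 1];

  return Math.ceil(p99 * multiplier);
}

// Gaps of 10s, 10s and 60s: the p99 gap is 60s, so a 5x multiplier gives 300s.
console.log(stalenessThresholdSecs([0, 10_000, 20_000, 80_000], 5)); // 300
```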
PostgreSQL LISTEN/NOTIFY requires one additional consideration: notifications are lost during disconnection. A subscriber that reconnects must perform a full synchronization from the source table on startup before marking itself as ready. Until the sync completes, the /ready endpoint should return 503 — the MCP server can respond to protocol messages, but its tool data is not yet trustworthy:
// LISTEN/NOTIFY with mandatory startup sync
let syncComplete = false;
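async function startSubscriber(pool) {
  const client = await pool.connect();
  const pending = []; // Notifications received while the sync query is running
  let synced = false;
  const apply = (payload) => {
    deploymentStatus.set(payload.id, payload);
    lastEventAt = Date.now();
  };
  // Attach handlers before LISTEN so nothing that arrives during the sync is dropped
  client.on('notification', (msg) => {
    const payload = JSON.parse(msg.payload);
    if (synced) apply(payload);
    else pending.push(payload);
  });
  client.once('error', (err) => {
    syncComplete = false; // Mark not-ready immediately on error
    client.release(err); // Destroy the broken connection instead of returning it to the pool
    // Reconnect and re-sync after a short delay rather than in a tight loop
    setTimeout(() => startSubscriber(pool).catch(console.error), 1000);
  });
  await client.query('LISTEN deployment_status');
  // Load full state before marking ready
  const { rows } = await client.query('SELECT * FROM deployments ORDER BY updated_at DESC');
  rows.forEach(row => deploymentStatus.set(row.id, row));
  pending.forEach(apply); // Replay buffered notifications on top of the snapshot
  synced = true;
  syncComplete = true;
  lastEventAt = Date.now();
}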
// /ready endpoint: 503 until sync is complete
app.get('/ready', (req, res) => {
if (!syncComplete) return res.status(503).json({ status: 'not_ready', reason: 'startup_sync_pending' });
res.json({ status: 'ready' });
});
AliveMCP's custom health URL pointing at /health fires alerts when status === 'degraded' — before agents have consumed hours of stale data from the frozen Map. The external protocol probe would never catch this failure because tool handlers execute and return valid JSON-RPC responses; they're just returning the wrong values inside those responses.
Pattern 4: Read replica lag
Read replica routing scales MCP servers under read-heavy agentic workloads. Research workflows routinely generate 10–50 reads per write. Five concurrent agents each making 20 read tool calls = 100 simultaneous read queries that would exhaust a primary database's connection pool before a single write occurs. Routing reads to replicas distributes this load and extends primary capacity by 5–10× without vertical scaling.
The silent failure is replication lag. PostgreSQL streaming replication is asynchronous: writes committed to the primary propagate to replicas with a delay that is normally under one second but can grow to minutes under high write load, network issues, or replica resource pressure. When lag exceeds your freshness requirement, read-after-write patterns fail silently: the agent calls a write tool to create a deployment record, then immediately calls a read tool to fetch it for display, and receives a not-found or empty result because the write has not replicated yet.
// Explicit WRITE_TOOLS classification — never infer from SQL analysis
const WRITE_TOOLS = new Set([
'create_deployment',
'update_deployment',
'delete_deployment',
'create_alert_rule',
]);
const primaryPool = new pg.Pool({ connectionString: process.env.PRIMARY_URL, max: 20 });
const replicaPool = new pg.Pool({ connectionString: process.env.REPLICA_URL, max: 30 });
// Background lag check every 10 seconds
let replicaLagSeconds = 0;
let replicaHealthy = true;
const LAG_THRESHOLD_SECS = 30;
async function checkReplicaLag() {
try {
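// Report 0 when the replica has replayed everything it received; otherwise an idle primary
// makes now() - pg_last_xact_replay_timestamp() grow even though the replica is caught up
const { rows } = await replicaPool.query(
`SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE extract(epoch FROM (now() - pg_last_xact_replay_timestamp())) END AS lag`
);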
replicaLagSeconds = parseFloat(rows[0].lag) || 0;
replicaHealthy = replicaLagSeconds < LAG_THRESHOLD_SECS;
} catch {
replicaHealthy = false;
}
}
setInterval(checkReplicaLag, 10000);
// Lag-aware pool selection
function getActiveReadPool() {
return replicaHealthy ? replicaPool : primaryPool; // Fallback to primary when lagging
}
async function withPool(toolName, fn) {
const pool = WRITE_TOOLS.has(toolName) ? primaryPool : getActiveReadPool();
return fn(pool);
}
The read-after-write anti-pattern has a simple fix: when a write tool is followed immediately by a read of the same record, both must route to the primary. The write tool can return an explicit hint — { result: ..., read_from_primary: true } — or the read tool can accept an optional pool_override: "primary" argument for scenarios where the agent knows it just wrote.
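A minimal sketch of that routing rule follows. It honours the pool_override: "primary" argument described above and additionally pins a session's reads to the primary for a short window after its last write; that sticky window is an addition of this sketch, not something the guide prescribes. The names `choosePool`, `WRITE_TOOL_NAMES`, `STICKY_PRIMARY_MS`, and the `sessionId` parameter are all hypothetical.

```javascript
// Hypothetical routing helper; returns the name of the pool to use.
const WRITE_TOOL_NAMES = new Set(['create_deployment', 'update_deployment']);
const STICKY_PRIMARY_MS = 5000; // Assumed window; tune to your observed replica lag
const lastWriteAt = new Map(); // sessionId -> timestamp (ms) of the most recent write

function choosePool(toolName, args, sessionId, replicaHealthy, nowMs = Date.now()) {
  if (WRITE_TOOL_NAMES.has(toolName)) {
    lastWriteAt.set(sessionId, nowMs); // Remember the write for later reads
    return 'primary';
  }
  // Explicit hint from the agent: it knows it just wrote.
  if (args.pool_override === 'primary') return 'primary';
  // Implicit read-your-writes: recent writers keep reading from the primary.
  const wroteAt = lastWriteAt.get(sessionId);
  if (wroteAt !== undefined && nowMs - wroteAt < STICKY_PRIMARY_MS) return 'primary';
  // Otherwise use the replica unless the lag check has marked it unhealthy.
  return replicaHealthy ? 'replica' : 'primary';
}
```

The explicit override covers agents that cooperate; the sticky window covers the ones that do not pass the hint, at the cost of a little extra primary load right after each write.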
// Canary: validates replication end-to-end
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'health_check_replication') {
const marker = `canary_${Date.now()}`;
// Write canary row to primary
await primaryPool.query(
`INSERT INTO canary_checks (marker, checked_at)
VALUES ($1, NOW())
ON CONFLICT (marker) DO UPDATE SET checked_at = NOW()`,
[marker]
);
// Poll replica until visible or 5-second timeout
const deadline = Date.now() + 5000;
while (Date.now() < deadline) {
await new Promise(r => setTimeout(r, 500));
const { rows } = await replicaPool.query(
'SELECT 1 FROM canary_checks WHERE marker = $1', [marker]
);
if (rows.length > 0) {
return { content: [{ type: 'text', text: JSON.stringify({
status: 'replication_healthy',
lag_secs: replicaLagSeconds,
}) }] };
}
}
throw new Error(`replication_lag: write not visible on replica after 5s (current: ${replicaLagSeconds}s)`);
}
});
The /ready endpoint should check both pools independently: primary unreachable = 503 unhealthy (full outage, P1 page), replica unreachable = 503 degraded (reads fall back to primary automatically, P2 ticket). AliveMCP's custom health URL at /ready can alert on both conditions at different severity levels — primary failure wakes someone up, replica failure queues a daytime investigation.
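The two-severity readiness rule can be sketched as follows. The helper names `probe`, `classifyReadiness`, and `readinessReport`, the 2-second probe timeout, and the `severity` field are assumptions for illustration; the only thing the pools need is a `query` method that returns a promise, as pg's Pool does.

```javascript
// Hypothetical probe: resolves to true if the pool answers SELECT 1 within timeoutMs.
async function probe(pool, timeoutMs = 2000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('probe_timeout')), timeoutMs);
  });
  try {
    await Promise.race([pool.query('SELECT 1'), timeout]);
    return true;
  } catch {
    return false;
  } finally {
    clearTimeout(timer);
  }
}

// Map the two probe results onto the statuses and severities described above.
function classifyReadiness(primaryOk, replicaOk) {
  if (!primaryOk) return { httpStatus: 503, status: 'unhealthy', severity: 'P1' };
  if (!replicaOk) return { httpStatus: 503, status: 'degraded', severity: 'P2' };
  return { httpStatus: 200, status: 'ready', severity: null };
}

// Check both pools independently and in parallel.
async function readinessReport(primaryPool, replicaPool) {
  const [primaryOk, replicaOk] = await Promise.all([probe(primaryPool), probe(replicaPool)]);
  return classifyReadiness(primaryOk, replicaOk);
}
```

A /ready route would then call `readinessReport` and respond with `res.status(report.httpStatus).json(report)`.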
Pattern 5: CDC data pipeline gap
CDC (Change Data Capture) data pipelines are the most scalable freshness architecture for MCP servers with many concurrent tool calls reading the same underlying data. Instead of querying the source database directly (query-on-demand) or running periodic batch refreshes, CDC streams changes from the source into a local materialized view. Tool calls read from the local view at sub-5ms latency without touching the source database at all.
The silent failure is more systemic than the event-driven case. When the CDC pipeline stops — replication slot falls behind, Kafka consumer lag grows, Debezium connector pauses for a schema change — the entire materialized view freezes. Not one table. Not one channel. All tables simultaneously. Every tool call reads from the frozen view and returns data from the point in time when the pipeline stopped. No error. No warning. Wrong data for every query, indefinitely, until someone notices the consequences downstream.
// Per-table freshness tracking
const tableFreshness = new Map();
// PostgreSQL logical replication handler
import { LogicalReplicationService } from 'pg-logical-replication';
const service = new LogicalReplicationService({ connectionString: process.env.SOURCE_DATABASE_URL });
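// db is assumed to be a better-sqlite3 Database opened elsewhere
service.on('data', (lsn, log) => {
  const tableName = log.relation?.name;
  if (log.tag === 'insert' || log.tag === 'update') {
    db.prepare(`INSERT OR REPLACE INTO local_${tableName} VALUES (?)`).run(JSON.stringify(log.new));
    tableFreshness.set(tableName, new Date());
  } else if (log.tag === 'delete') {
    // Deletes carry no log.new; remove the local row using the key in log.key or log.old (schema-specific)
    tableFreshness.set(tableName, new Date());
  }
});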
// Per-table staleness thresholds (set based on expected change rate)
const FRESHNESS_THRESHOLDS = {
deployments: 60, // Should see changes within 60s
alert_rules: 300, // Config changes: 5 minutes acceptable
server_metrics: 30, // Near-real-time: 30 seconds max
};
app.get('/health', async (req, res) => {
const freshnessReport = {};
let anyStale = false;
for (const [table, threshold] of Object.entries(FRESHNESS_THRESHOLDS)) {
const lastUpdate = tableFreshness.get(table);
const ageSecs = lastUpdate ? (Date.now() - lastUpdate.getTime()) / 1000 : Infinity;
const status = ageSecs < threshold ? 'fresh' : 'stale';
if (status === 'stale') anyStale = true;
freshnessReport[table] = { status, age_secs: Math.round(ageSecs), threshold_secs: threshold };
}
if (anyStale) {
return res.status(503).json({ status: 'degraded', freshness: freshnessReport });
}
res.json({ status: 'healthy', freshness: freshnessReport });
});
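The circuit breaker pattern converts silent wrong-data failures into explicit errors. Rather than returning stale data that the agent will act on incorrectly, the handler checks freshness before serving and throws when the data is too old. Depending on how the handler is registered, the SDK surfaces the throw either as an isError: true result or as a JSON-RPC error; both are visible to the agent: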
// Circuit breaker: fail explicitly rather than return wrong data
function checkDataFreshness(tableName) {
const MAX_STALENESS_SECS = (FRESHNESS_THRESHOLDS[tableName] || 300) * 2;
const lastUpdate = tableFreshness.get(tableName);
if (!lastUpdate) throw new Error(`data_unavailable: ${tableName} never populated`);
const ageSecs = (Date.now() - lastUpdate.getTime()) / 1000;
if (ageSecs > MAX_STALENESS_SECS) {
throw new Error(`data_stale: ${tableName} last updated ${Math.round(ageSecs)}s ago (max: ${MAX_STALENESS_SECS}s)`);
}
}
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'get_deployment_status') {
checkDataFreshness('deployments'); // Throws if pipeline has stalled
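// Read from the same local_ table the CDC handler writes to; return null rather than undefined when the row is missing
const row = db.prepare('SELECT * FROM local_deployments WHERE id = ?').get(request.params.arguments.id);
return { content: [{ type: 'text', text: JSON.stringify(row ?? null) }] };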
}
});
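If you want the breaker to produce an isError: true tool result regardless of how the SDK treats thrown errors, one option is to catch the error and build the result yourself. The wrapper below is a sketch under that assumption: `withFreshnessGuard` is a hypothetical name, and the result shape simply mirrors the content arrays used elsewhere in this guide.

```javascript
// Hypothetical wrapper: run a freshness check, then the handler, and turn any
// thrown error into an explicit isError result instead of letting it propagate.
function withFreshnessGuard(checkFreshness, handler) {
  return async (args) => {
    try {
      checkFreshness(); // Expected to throw when the data is stale
      return await handler(args);
    } catch (err) {
      return {
        isError: true,
        content: [{ type: 'text', text: String(err?.message ?? err) }],
      };
    }
  };
}

// Usage with stand-in functions; a real server would pass
// () => checkDataFreshness('deployments') and the actual query.
const guardedStatus = withFreshnessGuard(
  () => { throw new Error('data_stale: deployments last updated 900s ago'); },
  async () => ({ content: [{ type: 'text', text: '{}' }] }),
);
```

Note that the wrapper also converts errors thrown by the handler itself, which keeps every failure path on the same result shape.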
For Kafka-based pipelines, consumer lag is the leading freshness indicator. Monitor it continuously from the Kafka admin API:
// Kafka consumer lag monitoring
import { Kafka } from 'kafkajs';
const kafka = new Kafka({ brokers: [process.env.KAFKA_BROKER] });
const admin = kafka.admin();
await admin.connect();
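async function getConsumerLag(groupId, topic) {
  const [consumerOffsets, topicOffsets] = await Promise.all([
    admin.fetchOffsets({ groupId, topics: [topic] }),
    admin.fetchTopicOffsets(topic),
  ]);
  // Index committed offsets by partition; '-1' means the group has not committed yet
  const committed = new Map(
    consumerOffsets[0].partitions.map(p => [p.partition, Number(p.offset)])
  );
  // Sum lag across every partition, not just partition 0
  return topicOffsets.reduce((total, { partition, offset, low }) => {
    const latest = Number(offset);
    const current = committed.get(partition) ?? -1;
    const start = current < 0 ? Number(low) : current; // No commit: everything retained is unread
    return total + Math.max(0, latest - start);
  }, 0); // Alert if > 1000 messages
}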
Debezium-managed pipelines add schema change handling at the CDC layer. When the source table adds a column, Debezium emits the updated schema version via the Schema Registry, and consumers receive null for the new field until backfill completes. Make all handlers schema-tolerant by using optional fields with defaults. The source.ts_ms field in the Debezium event envelope provides accurate end-to-end lag measurement — it is the database transaction commit timestamp, so subtracting it from the current time gives true pipeline latency rather than the proxy metric of consumer offset delta.
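Both ideas in this paragraph, lag from source.ts_ms and schema-tolerant handlers, can be sketched in a few lines. The sketch assumes the event has already been unwrapped to the Debezium envelope shape { source: { ts_ms }, after }; the helper names and the `region` column (standing in for a newly added field) are hypothetical.

```javascript
// Hypothetical helper: end-to-end pipeline lag from the envelope's source.ts_ms.
// Returns null when the timestamp is missing rather than guessing.
function pipelineLagMs(event, nowMs = Date.now()) {
  const ts = event?.source?.ts_ms;
  return typeof ts === 'number' ? Math.max(0, nowMs - ts) : null;
}

// Hypothetical schema-tolerant mapper. `??` is used instead of destructuring
// defaults because defaults do not apply to null, which is what a newly added
// column may carry until backfill completes.
function toDeployment(after) {
  const row = after ?? {};
  return {
    id: row.id,
    status: row.status ?? 'unknown',
    region: row.region ?? null, // Assumed new column, absent in older events
  };
}

const sample = { source: { ts_ms: 1_000 }, after: { id: 'd1', status: null } };
console.log(pipelineLagMs(sample, 1_500)); // 500
console.log(toDeployment(sample.after));   // { id: 'd1', status: 'unknown', region: null }
```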
One operational risk specific to PostgreSQL logical replication: the replication slot holds WAL files on the source server until the subscriber acknowledges them. If the MCP server goes down for hours or days, WAL accumulation can exhaust source server disk. Set max_slot_wal_keep_size (PostgreSQL 13+) to limit per-slot WAL retention. A slot that falls too far behind is then invalidated (its wal_status becomes lost) rather than allowed to fill the disk; it has to be dropped and recreated, which means a full re-snapshot on reconnect. That is preferable to a disk-full outage on the source.
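To notice a slot drifting toward that limit before it is invalidated, the wal_status and safe_wal_size columns of the pg_replication_slots view (PostgreSQL 13+) can be turned into an alert level. The sketch below assumes rows fetched with a query such as SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots; the function name `classifySlot`, the level names, and the 1 GiB warning margin are hypothetical.

```javascript
// Hypothetical classifier for one row of pg_replication_slots.
// safe_wal_size is NULL when max_slot_wal_keep_size is unlimited or the slot is
// already lost; pg returns bigint columns as strings, hence Number().
function classifySlot(row, minSafeBytes = 1024 ** 3) {
  if (row.wal_status === 'lost') return 'resnapshot_required'; // Slot already invalidated
  if (row.wal_status === 'unreserved') return 'critical';      // WAL may be removed at the next checkpoint
  if (row.safe_wal_size !== null && Number(row.safe_wal_size) < minSafeBytes) return 'warning';
  return 'ok';
}

console.log(classifySlot({ slot_name: 'mcp_cdc', wal_status: 'extended', safe_wal_size: '52428800' })); // warning
```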
Choosing the right data architecture
The five patterns cover different points on the freshness-versus-load tradeoff curve. The right choice depends on three variables: how frequently the source data changes, how much load the source database can absorb from direct MCP queries, and how stale the data can be before agents make wrong decisions.
| Pattern | Best for | Query latency | Source DB load | Staleness potential | Operational complexity |
|---|---|---|---|---|---|
| Query-on-demand + connection pool | Low-frequency changes; always-current data required; simple first deployment | 10–200ms | High — one query per tool call | Zero — live query | Low |
| Background jobs | Long-running operations; export/generation tasks; idempotent batch work | <1ms (enqueue); minutes (result) | Isolated — worker pool decoupled | N/A — async by design | Medium (Redis or pg-boss) |
| Event-driven pub/sub | Real-time push events; state updated by external systems; low-cardinality in-memory cache | <5ms | None after startup sync | Seconds to minutes on reconnect | Medium (subscriber management) |
| Read replica routing | Read:write ratio > 10:1; horizontal read scaling; primary connection pool pressure | 5–50ms | Writes to primary only; reads to replica | Milliseconds to seconds | Medium (replica + lag monitoring) |
| CDC data pipeline | Many MCP instances reading same data; source DB cannot absorb query load; <10s freshness at <5ms latency | <5ms | Near-zero — replication stream | 50ms–2s pipeline latency | High (replication slot or Kafka) |
The decision path: start with query-on-demand and a connection pool. Add read replicas when the primary shows connection pressure or CPU load correlated with tool call volume. Introduce event-driven or CDC when direct query latency or load becomes unacceptable. Add background jobs for any tool call that might run longer than 10 seconds. Each step up in architecture adds a new failure class — the monitoring requirements grow proportionally with the complexity.
Monitoring the data correctness layer
The external protocol probe — initialize → tools/list → connectivity verification — covers the availability layer. It detects process death, TLS failure, network unreachability, and protocol version mismatches. It covers everything that happens before the tool handler executes. But all five data architecture failure modes occur inside the tool handler, after the protocol layer succeeds. The probe cannot see them.
Closing the data correctness gap requires two complementary mechanisms: a custom health endpoint that validates data-layer invariants, and a canary tool call that exercises the actual data path with a known-good query.
A single /health endpoint can aggregate all four data-layer checks:
app.get('/health', async (req, res) => {
const checks = {};
// 1. Connection pool: pool.waitingCount is the key signal
checks.pool = {
utilization: (pool.totalCount - pool.idleCount) / pool.options.max, // Divide by configured max; totalCount can be 0
waiting: pool.waitingCount,
status: pool.waitingCount > 0 ? 'degraded' : 'ok',
};
// 2. Event pipeline freshness
const EVENT_STALE_THRESHOLD = 300; // Seconds; same value as STALE_THRESHOLD_SECS in Pattern 3
const eventStaleSecs = (Date.now() - lastEventAt) / 1000;
checks.events = {
last_event_secs_ago: Math.round(eventStaleSecs),
status: eventStaleSecs < EVENT_STALE_THRESHOLD ? 'ok' : 'stale',
};
// 3. Replica lag
checks.replica = {
lag_secs: replicaLagSeconds,
routing: replicaHealthy ? 'replica' : 'primary_fallback',
status: replicaHealthy ? 'ok' : 'degraded',
};
// 4. CDC pipeline freshness per table
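// Iterate the configured tables so one that has never received a change counts as stale
const freshnessOk = Object.entries(FRESHNESS_THRESHOLDS).every(([table, threshold]) => {
  const ts = tableFreshness.get(table);
  return ts !== undefined && (Date.now() - ts.getTime()) / 1000 < threshold;
});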
checks.pipeline = { status: freshnessOk ? 'ok' : 'stale' };
const healthy = Object.values(checks).every(c => c.status === 'ok');
res.status(healthy ? 200 : 503).json({
status: healthy ? 'healthy' : 'degraded',
checks,
});
});
Register /health as AliveMCP's custom health URL. AliveMCP polls it every 60 seconds alongside the protocol probe. When the endpoint returns 503, AliveMCP fires an alert through your configured channels — Slack, PagerDuty, email — with the full response body included so the on-call engineer knows immediately which layer is degraded and why, without opening a dashboard.
The canary tool call goes one layer deeper than the health endpoint. It validates the full data path with an actual tool call using a known-good query:
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'health_check') {
const results = {};
// Database connectivity via pool
try {
await pool.query('SELECT 1'); // pool.query releases the client even if the query throws
results.database = 'ok';
} catch (err) {
results.database = `error: ${err.message}`;
}
// Pool saturation metrics
results.pool_utilization = (pool.totalCount - pool.idleCount) / pool.options.max; // Divide by configured max; totalCount can be 0
results.pool_waiting = pool.waitingCount;
// Application-layer: known-good query with expected result
try {
const { rows } = await pool.query('SELECT 1 AS healthcheck');
results.query = rows[0]?.healthcheck === 1 ? 'ok' : 'unexpected_result';
} catch (err) {
results.query = `error: ${err.message}`;
}
const allOk = results.database === 'ok' && results.query === 'ok' && results.pool_waiting === 0;
if (!allOk) throw new Error(`health_check_failed: ${JSON.stringify(results)}`);
return { content: [{ type: 'text', text: JSON.stringify(results) }] };
}
});
The three-layer monitoring stack maps cleanly to three failure classes:
- External protocol probe (AliveMCP built-in): catches availability failures such as process death, TLS expiry, network unreachability, DNS failure, and protocol version mismatch. Fires when initialize or tools/list fails.
- Custom health URL at /health (AliveMCP custom URL): catches infrastructure-layer failures such as pool exhaustion, event pipeline staleness, replica lag, and a CDC pipeline gap. Fires when any data-layer invariant is breached, before agents receive wrong data.
- Canary tool call (per-probe health_check tool or AliveMCP canary feature): catches application-layer failures such as a handler that executes but returns wrong output, a query that runs but returns unexpected results, or a cache that serves from an unexpected state. Fires when the full data path produces a wrong answer.
Each layer catches a distinct failure class that the others miss. The availability probe misses all five data-layer failures. The health endpoint misses application-layer semantic failures. The canary tool call catches what both miss — the case where everything looks healthy but the actual answer coming out of the tool is wrong. Together they give you the monitoring coverage that database-backed MCP servers need, not just the coverage that protocol-based uptime checks provide.
Monitor the data correctness layer, not just availability
AliveMCP probes your MCP endpoint every 60 seconds and polls a custom health URL that validates your data-layer invariants — pool utilization, event pipeline freshness, replica lag, and CDC pipeline staleness. Add your endpoint and connect Slack or PagerDuty in under 5 minutes.