MCP server snapshot testing
Snapshot tests for MCP servers work differently from snapshot tests for React components or REST APIs. The consumer isn't a human reading a browser or a developer eyeing a JSON diff — it's an LLM that parses your tool output and uses it to decide what to do next. That changes what matters: the exact field names in a JSON response, the presence of a content[] array, whether a number is returned as a string or an integer — all of these affect how confidently the LLM can act on your result. A formatting regression that looks cosmetic in a code review can cause the LLM to misread the entire response. Snapshot tests catch those regressions at build time, before they reach production and before any agent goes wrong.
TL;DR
Use toMatchSnapshot() in Vitest to lock down the output of your MCP tool calls. Serialize the full CallToolResult including the content[] array structure. Before snapshotting, sanitize dynamic fields — timestamps, generated IDs, pagination cursors — by replacing them with stable placeholders. Snapshot the response structure and field names (LLM-visible concerns); do not snapshot implementation details like API call counts or internal metadata. Commit snapshot files to git, review them in PRs, and block CI merges on unapproved snapshot changes. AliveMCP checks that your deployed server produces any output at all; snapshot tests verify that the output has the exact shape your LLM consumers expect.
Why output formatting matters for MCP servers
When a REST API changes a field name from created_at to createdAt, the usual symptom is a hard failure in the client code that reads it: a TypeError, a failed validation, or a broken test that a developer notices quickly. When an MCP tool makes the same change, the symptom is subtler: the LLM receives a differently shaped object and tries to infer what to do with it. Depending on context, the model might hallucinate the old field name, silently skip the value, or, worst of all, proceed with a wrong interpretation and take an incorrect action on the user's behalf.
Consider a list_tickets tool that returns a list of support tickets. The original output looks like this:
// Original CallToolResult content[0].text (parsed)
{
  "tickets": [
    { "id": "TKT-001", "subject": "Login broken", "status": "open", "priority": "high" },
    { "id": "TKT-002", "subject": "Slow dashboard", "status": "closed", "priority": "low" }
  ],
  "total": 2
}
A developer refactors the response serializer to use a more detailed envelope format:
// Refactored CallToolResult content[0].text (parsed): looks reasonable in a PR
{
  "data": {
    "items": [
      { "ticketId": "TKT-001", "title": "Login broken", "state": "open", "urgency": "high" },
      { "ticketId": "TKT-002", "title": "Slow dashboard", "state": "closed", "urgency": "low" }
    ],
    "count": 2
  }
}
That is four renamed ticket fields, two renamed envelope keys (tickets to items, total to count), and one extra level of nesting. An LLM agent that previously issued close_ticket(id="TKT-001") after reading tickets[0].id now receives a response where neither tickets nor id exists. The agent may hallucinate a ticket ID, pick up a stale value from conversation history, or fail to act at all. None of these failures produces an obvious error: the MCP tool call succeeded with isError: false, the test suite passed, and the CI pipeline went green.
Snapshot tests prevent this by turning the before/after comparison into an explicit, diff-visible change in the repository. The change is still allowed — you just have to acknowledge it and update the snapshot intentionally, giving the diff a chance to surface in code review.
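To make the shape change concrete, here is a minimal sketch that flattens both payloads from the example above into "path: type" lines and lists what disappeared. The shapeOf helper and the Json type are hypothetical names introduced for illustration; they are not part of Vitest or the MCP SDK. Because each line carries a type, the same comparison would also flag a number that starts arriving as a string.

```typescript
// Hypothetical helper (not from Vitest or the MCP SDK): flattens a parsed
// payload into sorted "path: type" lines so two shapes can be compared.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function shapeOf(value: Json, path = '$'): string[] {
  if (Array.isArray(value)) {
    // All elements share one "[]" segment; duplicates are collapsed.
    const lines = value.flatMap((item) => shapeOf(item, `${path}[]`));
    return lines.length > 0 ? Array.from(new Set(lines)) : [`${path}: empty array`];
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value).sort();
    if (keys.length === 0) return [`${path}: empty object`];
    return keys.flatMap((key) => shapeOf(value[key], `${path}.${key}`));
  }
  return [`${path}: ${value === null ? 'null' : typeof value}`];
}

const original: Json = {
  tickets: [{ id: 'TKT-001', subject: 'Login broken', status: 'open', priority: 'high' }],
  total: 1,
};
const refactored: Json = {
  data: {
    items: [{ ticketId: 'TKT-001', title: 'Login broken', state: 'open', urgency: 'high' }],
    count: 1,
  },
};

const before = shapeOf(original);
const after = shapeOf(refactored);

// Every path the agent used to rely on is gone after the refactor.
const missing = before.filter((line) => !after.includes(line));
console.log(missing);
```

All five original paths, including $.tickets[].id, end up in missing. A snapshot diff surfaces the same information without a custom helper; the sketch only shows why the change is not cosmetic from the consumer's side.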
What to snapshot and what not to snapshot
The biggest mistake teams make with MCP snapshot tests is snapshotting everything, including fields that change every run. The snapshot breaks on every CI run, the team starts reflexively running vitest --update (or -u) without looking at the diff, and the guard that was supposed to catch regressions trains everyone to ignore it.
The rule is: snapshot anything the LLM reads to decide what to do. Sanitize anything that changes between runs without the LLM caring.
| Good snapshot targets | Sanitize before snapshotting |
|---|---|
| Top-level content[] array structure | Timestamps (created_at, updated_at) |
| Field names inside JSON responses | Auto-generated IDs and UUIDs |
| Output format choice: text vs. JSON vs. structured | Pagination cursors and continuation tokens |
| Error message wording for known error cases | API call counts and internal metrics |
| Number of content blocks returned | Server-generated nonces and salts |
| Presence of isError: true/false | Duration and latency values |
Here is a sanitizer function that replaces the most common dynamic values with stable placeholders before snapshotting:
// test/helpers/sanitize-snapshot.ts
import type { CallToolResult } from '@modelcontextprotocol/sdk/types.js';

const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_DATE_RE = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z/g;
// Epoch values starting with 17: 10 digits (seconds) or 13 digits (milliseconds),
// which covers roughly late 2023 to early 2027. Widen the prefix if your fixtures
// fall outside that window. Note: any other 10- or 13-digit number starting with 17 also matches.
const UNIX_TS_RE = /\b17\d{8}(?:\d{3})?\b/g;

// Replace dynamic values in every text block with stable placeholders.
export function sanitizeForSnapshot(result: CallToolResult): CallToolResult {
  return {
    ...result,
    content: result.content.map((block) => {
      if (block.type !== 'text') return block;
      let text = block.text;
      text = text.replace(UUID_RE, '<UUID>');
      text = text.replace(ISO_DATE_RE, '<ISO_DATE>');
      text = text.replace(UNIX_TS_RE, '<TIMESTAMP>');
      return { ...block, text };
    }),
  };
}

// For structured JSON text blocks, sanitize within the parsed object
export function sanitizeJsonBlock(text: string, keysToRedact: string[]): string {
  try {
    const obj = JSON.parse(text);
    for (const key of keysToRedact) {
      redactKey(obj, key);
    }
    return JSON.stringify(obj, null, 2);
  } catch {
    return text; // not JSON: rely on the string-level sanitizer above
  }
}

// Recursively overwrite every occurrence of `key`, at any depth, with a placeholder.
function redactKey(obj: unknown, key: string): void {
  if (!obj || typeof obj !== 'object') return;
  if (Array.isArray(obj)) {
    obj.forEach((item) => redactKey(item, key));
  } else {
    for (const [k, v] of Object.entries(obj as Record<string, unknown>)) {
      if (k === key) {
        (obj as Record<string, unknown>)[k] = `<REDACTED:${key}>`;
      } else {
        redactKey(v, key);
      }
    }
  }
}
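As a quick sanity check of the placeholder idea, the sketch below applies the UUID and ISO date patterns to two hypothetical payloads that differ only in their dynamic values. The scrub helper, the requestId field, and the sample values are made up for illustration; the point is only that two runs collapse to the same text while a real change still shows up.

```typescript
// Same placeholder idea as the sanitizer above, reduced to a plain string helper.
// `scrub` and the sample payloads are hypothetical, for illustration only.
const UUID_PATTERN = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_DATE_PATTERN = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z/g;

function scrub(text: string): string {
  return text.replace(UUID_PATTERN, '<UUID>').replace(ISO_DATE_PATTERN, '<ISO_DATE>');
}

// Two runs of the same tool call: only the request ID and timestamp differ.
const runA = scrub(
  '{"requestId":"3f2b8c1e-9a4d-4e7b-b1c2-0d9e8f7a6b5c","createdAt":"2026-01-01T09:15:00.000Z","status":"open"}',
);
const runB = scrub(
  '{"requestId":"a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d","createdAt":"2026-03-17T22:40:12Z","status":"open"}',
);
// A run where an LLM-visible value changed.
const runC = scrub(
  '{"requestId":"a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d","createdAt":"2026-03-17T22:40:12Z","status":"closed"}',
);

console.log(runA === runB); // dynamic values no longer cause a difference
console.log(runA === runC); // the status change is still visible
```

The first comparison is true and the second is false, which is the property a sanitizer needs: noise removed, signal kept.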
Setting up MCP snapshot tests with Vitest
Vitest ships toMatchSnapshot() and toMatchInlineSnapshot() out of the box. The key for MCP is serializing the CallToolResult correctly — specifically, keeping the content[] array structure visible in the snapshot rather than letting it collapse into [Object].
// test/snapshots/list-tickets.test.ts
import { describe, it, expect, beforeEach, afterEach } from 'vitest';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import type { CallToolResult } from '@modelcontextprotocol/sdk/types.js';
import { createServer } from '../../src/server.js';
import { createTestDeps } from '../helpers/test-deps.js';
import { sanitizeForSnapshot, sanitizeJsonBlock } from '../helpers/sanitize-snapshot.js';

describe('list_tickets — snapshot', () => {
  let client: Client;

  beforeEach(async () => {
    // Seed a deterministic fixture so every run sees the same tickets
    const deps = createTestDeps();
    await deps.db.seed([
      { id: 'TKT-001', subject: 'Login broken', status: 'open', priority: 'high', createdAt: new Date('2026-01-01') },
      { id: 'TKT-002', subject: 'Slow dashboard', status: 'closed', priority: 'low', createdAt: new Date('2026-01-02') },
    ]);
    // Connect client and server in-process; no network involved
    const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
    const server = createServer(deps);
    await server.connect(serverTransport);
    client = new Client({ name: 'snapshot-test', version: '0.0.0' }, { capabilities: {} });
    await client.connect(clientTransport);
  });

  afterEach(async () => {
    await client.close();
  });

  it('response structure matches snapshot', async () => {
    // Cast: callTool's declared return type may be looser than CallToolResult,
    // depending on the SDK version.
    const raw = (await client.callTool({ name: 'list_tickets', arguments: { status: 'all' } })) as CallToolResult;
    // Sanitize: redact dynamic timestamps within the JSON payload
    const sanitized = sanitizeForSnapshot(raw);
    if (sanitized.content[0]?.type === 'text') {
      sanitized.content[0].text = sanitizeJsonBlock(
        sanitized.content[0].text,
        ['createdAt', 'updatedAt', 'cursor']
      );
    }
    // Snapshot the full sanitized CallToolResult; the structure is now stable
    expect(sanitized).toMatchSnapshot();
  });

  it('error response structure matches snapshot', async () => {
    const raw = await client.callTool({ name: 'list_tickets', arguments: { status: 'invalid' } });
    expect(raw.isError).toBe(true);
    // Error messages should be stable, so no sanitization is needed
    expect(raw).toMatchSnapshot();
  });
});
The generated snapshot file (__snapshots__/list-tickets.test.ts.snap) looks like this after the first run:
// Vitest Snapshot v1, https://vitest.dev/guide/snapshot.html

exports[`list_tickets — snapshot > response structure matches snapshot 1`] = `
{
  "content": [
    {
      "text": "{
  \\"tickets\\": [
    {
      \\"id\\": \\"TKT-001\\",
      \\"subject\\": \\"Login broken\\",
      \\"status\\": \\"open\\",
      \\"priority\\": \\"high\\",
      \\"createdAt\\": \\"<REDACTED:createdAt>\\"
    },
    {
      \\"id\\": \\"TKT-002\\",
      \\"subject\\": \\"Slow dashboard\\",
      \\"status\\": \\"closed\\",
      \\"priority\\": \\"low\\",
      \\"createdAt\\": \\"<REDACTED:createdAt>\\"
    }
  ],
  \\"total\\": 2
}",
      "type": "text",
    },
  ],
  "isError": false,
}
`;

exports[`list_tickets — snapshot > error response structure matches snapshot 1`] = `
{
  "content": [
    {
      "text": "Invalid status value. Expected: open | closed | all",
      "type": "text",
    },
  ],
  "isError": true,
}
`;
If you later refactor the response to use the nested data.items structure, the snapshot fails immediately with a clear diff. The field renames from subject to title, status to state, and the structural change from tickets[] to data.items[] are all visible in the snapshot diff in the PR.
For tools with many output variants, toMatchInlineSnapshot() is useful for short responses where you want the expected value visible in the test file itself:
it('get_ticket returns text summary inline', async () => {
  const raw = await client.callTool({ name: 'get_ticket', arguments: { id: 'TKT-001' } });
  const sanitized = sanitizeForSnapshot(raw);
  expect(sanitized.content[0]).toMatchInlineSnapshot(`
    {
      "text": "Ticket TKT-001: Login broken (open, high priority)",
      "type": "text",
    }
  `);
});
Snapshot discipline
Snapshot tests only protect you if you treat snapshot updates as meaningful code changes. The workflow should be:
- Snapshot failure in CI: the build is red. Do not reflexively run vitest --update.
- Ask: was this intentional? Look at the recent commits. Did someone change the tool handler's output format on purpose? If yes, review the diff carefully: does the new shape still make sense for LLM consumption? Then update.
- If unintentional: this is the bug you were looking for. Revert the formatting change, not the snapshot.
Run with --reporter=verbose to list every test by name alongside the snapshot diff, which makes it easier to see which tool's output changed:
vitest run --reporter=verbose
Vitest prints the diff between the stored snapshot and the received value, with - lines for what the snapshot expected and + lines for what the tool actually returned. A diff that shows - "tickets" / + "data" across every line is clearly a structural rename — easy to catch in code review. A diff that shows one field changing from "open" to "active" across a single test is equally clear.
Snapshot files belong in git. Commit __snapshots__/ directories alongside the test files that create them. This means:
- Snapshot updates show up in PRs as diffs, the same way source code changes do.
- Reviewers can see exactly what the tool output used to look like and what it looks like now.
- CI runs on the committed snapshots — a PR that adds a formatting change without updating the snapshots fails the build before it's merged.
- The snapshot history in git lets you bisect when a formatting regression was introduced.
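Never add __snapshots__/ to .gitignore. Without a committed snapshot there is no baseline to compare against: on a developer machine, Vitest writes the missing file on first run and reports success, masking regressions until the snapshot is committed and a later run can compare. (By default, when Vitest detects a CI environment it does not write new snapshots, and a test with a missing snapshot fails instead.)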
Prompt-regression snapshots
Standard snapshot tests verify that the MCP tool returns a stable output for a given input. Prompt-regression snapshots go one step further: they verify that the LLM's downstream response to that tool output doesn't change. This is a more expensive but more complete regression check — it catches cases where the output shape is technically the same but the wording change causes a different LLM interpretation.
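The pattern uses a fixed LLM call pinned as tightly as possible (temperature 0, same model version, same system prompt) and snapshots the model's response to a canned tool result. Even at temperature 0 the output is not guaranteed to be fully deterministic, so snapshot only the fields that encode the decision: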
// test/snapshots/prompt-regression.test.ts
import { describe, it, expect } from 'vitest';
import Anthropic from '@anthropic-ai/sdk';

// Fixed tool result: simulates what list_tickets returns
const TOOL_RESULT_V1 = `{
  "tickets": [
    { "id": "TKT-001", "subject": "Login broken", "status": "open", "priority": "high" }
  ],
  "total": 1
}`;

// The LLM's task: read the ticket list and decide what to do next
const SYSTEM_PROMPT = `You are a support agent assistant. Given a list of tickets, output JSON:
{"action": "escalate" | "defer" | "close", "ticketId": string, "reason": string}`;

describe('prompt regression — list_tickets', () => {
  it('LLM correctly reads ticket list and escalates high-priority open ticket', async () => {
    const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
    const response = await client.messages.create({
      model: 'claude-3-5-haiku-20241022', // pin exact version for snapshot stability
      max_tokens: 256,
      temperature: 0,
      system: SYSTEM_PROMPT,
      messages: [{ role: 'user', content: `Tickets: ${TOOL_RESULT_V1}` }],
    });
    const text = response.content[0].type === 'text' ? response.content[0].text : '';
    // JSON.parse throws if the model returns anything other than bare JSON,
    // which fails the test and is itself a useful signal.
    const { action, ticketId } = JSON.parse(text);
    // Snapshot only the decision fields. The free-text "reason" may vary in
    // wording between runs and would make the snapshot flaky.
    expect({ action, ticketId }).toMatchSnapshot();
  });
});
Prompt-regression snapshots are worth the cost when:
- Your tool output is the primary input to an autonomous agent loop with real-world side effects (sending emails, updating records, triggering workflows).
- You are considering a formatting change that preserves semantics but restructures the JSON significantly.
- You want to verify that a new version of an LLM interprets your tool output the same way as the previous version.
They are not worth the cost for every tool call — LLM API latency and token costs make them slow and expensive at scale. Run prompt-regression snapshots in a separate test suite gated behind a PROMPT_REGRESSION=true environment variable, executed nightly or before major releases rather than on every PR commit.
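One way to implement that gate is a small predicate over the environment, sketched below. The promptRegressionEnabled helper is a hypothetical name; only the PROMPT_REGRESSION variable comes from the text above. In a Vitest file the result could be passed to describe.runIf so the suite is skipped unless the variable is set.

```typescript
// Hypothetical helper: decides whether the prompt-regression suite should run.
// Only the PROMPT_REGRESSION variable name comes from the guide; the rest is a sketch.
type Env = Record<string, string | undefined>;

function promptRegressionEnabled(env: Env): boolean {
  // Require the exact string "true" so that values like "0" or "" do not enable the suite.
  return env.PROMPT_REGRESSION === 'true';
}

// Possible use inside a Vitest test file (shown as a comment to keep this block standalone):
//
//   describe.runIf(promptRegressionEnabled(process.env))('prompt regression', () => {
//     /* LLM-backed snapshot tests */
//   });

console.log(promptRegressionEnabled({ PROMPT_REGRESSION: 'true' }));
console.log(promptRegressionEnabled({}));
```

Keeping the check in one function means the nightly job and a local run agree on what "enabled" means; the nightly workflow then only needs to export PROMPT_REGRESSION=true.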
CI integration
Snapshot files committed to git, combined with a CI job that runs tests in read-only mode (no --update flag), create an automatic gate against unapproved output changes. Here is a complete GitHub Actions workflow:
# .github/workflows/snapshot-tests.yml
name: Snapshot tests
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  snapshot:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'
      - run: npm ci
      - name: Run snapshot tests (no update)
        run: npx vitest run --reporter=verbose
        # Vitest exits non-zero if any snapshot does not match the committed file.
        # --update (-u) is intentionally NOT passed here: CI must never
        # silently accept a changed snapshot without a human reviewing the diff.
      - name: Upload snapshot diff on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: snapshot-diff
          path: |
            **/__snapshots__/*.snap
          retention-days: 7
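The upload-artifact step attaches the snapshot files from the CI workspace to the failed run. Because the job never updates snapshots, these files are the committed baselines rather than new output; the values the code actually produced appear in the verbose test log as the + side of each diff. Having the baselines and the log side by side lets developers inspect a failure without checking out the branch locally. This is especially useful when a snapshot regression is discovered in a dependency update PR where the developer did not write the failing code.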
To update snapshots intentionally, run locally:
npx vitest run --update
git diff -- '*.snap' # review every change before staging
git add -- '*.snap'
git commit -m "snapshot: update list_tickets output after field rename to match new schema"
The commit message convention (snapshot: update ...) makes snapshot updates searchable in the git log. When a production incident is traced back to a formatting change, you can run git log --oneline --all -- '**/__snapshots__/*.snap' to find exactly when the snapshot was last updated and which PR approved it.
Snapshot tests and AliveMCP
Snapshot tests and production monitoring address entirely different failure modes. Understanding the distinction prevents false confidence in either direction.
Snapshot tests run in your development and CI environment against a controlled, deterministic fixture. They verify that, given a specific input, your tool produces a specific output shape. They catch: field renames, structural reorganization, output format changes (text vs. JSON), error message wording changes, and any other formatting regression introduced during development. They do not catch: your deployed server being down, a database being unreachable, a memory leak causing timeouts after 10 hours of uptime, or a cloud provider networking issue making the server unavailable.
AliveMCP runs every 60 seconds against your live deployed server. It verifies that the MCP initialize handshake completes, that tools/list returns a valid response, and that your server is reachable from the public internet (or your private network, on Team plan). It catches: server crashes, failed deployments, network partitions, certificate expiry, and any condition that makes the server not respond at all. It does not catch formatting regressions inside tool responses — it doesn't call individual tools, and it doesn't know what the "correct" output should look like.
| Failure mode | Snapshot tests catch it | AliveMCP catches it |
|---|---|---|
| Field renamed in tool response JSON | Yes | No |
| Response restructured (nested deeper) | Yes | No |
| Error message wording changed | Yes | No |
| Server down / port not bound | No | Yes |
| TLS certificate expired | No | Yes |
| Database unreachable at runtime | No | Only if it breaks initialize or tools/list |
| tools/list returns empty after bad deploy | No | Yes |
Both are required. Snapshot tests are part of the build process: they run before the code ships. AliveMCP is part of the runtime layer: it runs after the code ships and keeps running. A server that passes all snapshot tests can still be down in production. A server that AliveMCP reports as up can still have broken output formatting if a snapshot regression slipped through a careless vitest --update run. Use both.
For teams using AliveMCP health checks alongside snapshot tests, a useful convention is to add an AliveMCP webhook notification to the same Slack channel where CI failures post — so a formatting regression caught in CI and a runtime outage caught by AliveMCP both surface in the same workflow, and neither gets lost.
Related questions
Should I snapshot the entire CallToolResult or just the content?
Snapshot the entire CallToolResult, including isError. The isError flag is part of what the LLM client reads — a tool that previously returned isError: false and now returns isError: true for the same input is a regression, even if the content[] text is otherwise identical. Including the top-level structure in the snapshot means that change is caught automatically.
How do I handle tools that return image or binary content?
For ImageContent blocks (type: 'image'), snapshot the mimeType and the length of the data string rather than the full base64 payload. A 50,000-character base64 string in a snapshot file is unreadable in a PR diff and makes the snapshot file huge. Replace the data with a stable hash: createHash('sha256').update(block.data).digest('hex'). The hash changes if the image changes, but is diffable at a glance.
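A minimal sketch of that replacement follows. The TextBlock and ImageBlock types are local stand-ins that assume the image block carries data and mimeType fields, and summarizeImageBlock, dataLength, and dataSha256 are hypothetical names chosen for this example.

```typescript
import { createHash } from 'node:crypto';

// Local stand-ins for the SDK content block types (assumed shapes, for illustration).
type TextBlock = { type: 'text'; text: string };
type ImageBlock = { type: 'image'; data: string; mimeType: string };
type ImageSummary = { type: 'image'; mimeType: string; dataLength: number; dataSha256: string };

// Hypothetical helper: swaps the base64 payload for its length and SHA-256 digest.
function summarizeImageBlock(block: TextBlock | ImageBlock): TextBlock | ImageSummary {
  if (block.type !== 'image') return block;
  return {
    type: 'image',
    mimeType: block.mimeType,
    dataLength: block.data.length,
    dataSha256: createHash('sha256').update(block.data).digest('hex'),
  };
}

const png: ImageBlock = { type: 'image', mimeType: 'image/png', data: 'iVBORw0KGgoAAAANSUhEUg==' };
const summary = summarizeImageBlock(png);
console.log(summary);
```

The snapshot then holds a 64-character digest instead of the payload. If the image bytes change, the digest line changes in the diff, which is enough to prompt a reviewer to look at the image itself.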
Can I snapshot tools/list as well as tool responses?
Yes — and you should. Snapshot the output of client.listTools() sorted by tool name. This catches tool renames, added/removed tools, description changes, and inputSchema modifications. The integration testing guide covers the SHA-256 hash approach for schema snapshots; using Vitest's toMatchSnapshot() directly on the sorted tool list is an alternative that gives you a human-readable diff in the snapshot file.
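The sorting step can be sketched as a small pure function. ToolInfo is a local stand-in type that assumes each tool has a name, an optional description, and an inputSchema; normalizeToolList is a hypothetical helper name.

```typescript
// Local stand-in for the tool entries returned by listTools (assumed shape).
type ToolInfo = { name: string; description?: string; inputSchema: unknown };

// Hypothetical helper: returns a copy sorted by tool name so that registration
// order does not affect the snapshot.
function normalizeToolList(tools: ToolInfo[]): ToolInfo[] {
  return [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

// Possible use inside a Vitest test (shown as a comment to keep this block standalone):
//
//   const { tools } = await client.listTools();
//   expect(normalizeToolList(tools)).toMatchSnapshot();

const sample: ToolInfo[] = [
  { name: 'list_tickets', description: 'List support tickets', inputSchema: { type: 'object' } },
  { name: 'close_ticket', description: 'Close a ticket', inputSchema: { type: 'object' } },
  { name: 'get_ticket', description: 'Fetch one ticket', inputSchema: { type: 'object' } },
];
console.log(normalizeToolList(sample).map((tool) => tool.name));
```

Sorting on a copy keeps the helper free of side effects, and a plain comparison on name avoids locale-dependent ordering differences between developer machines and CI.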
How many snapshot tests should I have?
One snapshot test per meaningful output variant per tool, plus one for each error case that has a specific message. For a tool with three output shapes (success with results, success with empty results, validation error), that's three snapshot tests. Don't snapshot every possible argument combination — snapshot the cases where the output structure differs, not the cases where only the content values differ.
Further reading
- MCP server unit testing — InMemoryTransport and isolated tool tests
- MCP server integration testing — InMemoryTransport, test clients, and CI gates
- MCP server end-to-end testing — full agent loop tests against a live server
- MCP server mutation testing — finding gaps in your test assertions
- MCP server testing — protocol compliance, schema snapshots, and CI
- MCP server health check — the production complement to snapshot tests
- AliveMCP — uptime and health monitoring for deployed MCP servers