MCP server snapshot testing

Snapshot tests for MCP servers work differently from snapshot tests for React components or REST APIs. The consumer isn't a human reading a browser or a developer eyeing a JSON diff — it's an LLM that parses your tool output and uses it to decide what to do next. That changes what matters: the exact field names in a JSON response, the presence of a content[] array, whether a number is returned as a string or an integer — all of these affect how confidently the LLM can act on your result. A formatting regression that looks cosmetic in a code review can cause the LLM to misread the entire response. Snapshot tests catch those regressions at build time, before they reach production and before any agent goes wrong.

TL;DR

Use toMatchSnapshot() in Vitest to lock down the output of your MCP tool calls. Serialize the full CallToolResult including the content[] array structure. Before snapshotting, sanitize dynamic fields — timestamps, generated IDs, pagination cursors — by replacing them with stable placeholders. Snapshot the response structure and field names (LLM-visible concerns); do not snapshot implementation details like API call counts or internal metadata. Commit snapshot files to git, review them in PRs, and block CI merges on unapproved snapshot changes. AliveMCP checks that your deployed server produces any output at all; snapshot tests verify that the output has the exact shape your LLM consumers expect.

Why output formatting matters for MCP servers

When a REST API changes a field name from created_at to createdAt, the typical symptom is a hard failure in the client code that reads it, such as a TypeError when the code dereferences the now-missing value. A developer sees it quickly. When an MCP tool makes the same change, the symptom is subtler: the LLM receives a response with a differently shaped object and tries to infer what to do with it. Depending on context, the model might hallucinate the old field name, silently skip the value, or, worst of all, proceed with a wrong interpretation and take an incorrect action on the user's behalf.

Consider a list_tickets tool that returns a list of support tickets. The original output looks like this:

// Original CallToolResult content[0].text (parsed)
{
  "tickets": [
    { "id": "TKT-001", "subject": "Login broken", "status": "open", "priority": "high" },
    { "id": "TKT-002", "subject": "Slow dashboard", "status": "closed", "priority": "low" }
  ],
  "total": 2
}

A developer refactors the response serializer to use a more detailed envelope format:

// Refactored CallToolResult content[0].text (parsed) — looks reasonable in a PR
{
  "data": {
    "items": [
      { "ticketId": "TKT-001", "title": "Login broken", "state": "open", "urgency": "high" },
      { "ticketId": "TKT-002", "title": "Slow dashboard", "state": "closed", "urgency": "low" }
    ],
    "count": 2
  }
}

That is four renamed ticket fields (id, subject, status, priority), two renamed container keys (tickets to items, total to count), and one new level of nesting under data. An LLM agent that previously issued close_ticket(id="TKT-001") after reading tickets[0].id now receives a response in which neither tickets nor id exists. The agent may hallucinate a ticket ID, pick up a stale value from conversation history, or fail to act at all. None of these failures produces an obvious error: the MCP tool call succeeded with isError: false, the test suite passed, and the CI pipeline went green.

Snapshot tests prevent this by turning the before/after comparison into an explicit, diff-visible change in the repository. The change is still allowed — you just have to acknowledge it and update the snapshot intentionally, giving the diff a chance to surface in code review.
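
To make the size of that change concrete, here is a minimal sketch that lists the key paths an LLM would see in each payload and reports which ones disappeared or appeared. The helper names (keyPaths, shapeDiff) are hypothetical and are not part of Vitest or the MCP SDK; a snapshot diff surfaces the same information without any custom code.

```typescript
// Collect every key path in a JSON value, e.g. "tickets[].id".
function keyPaths(value: unknown, prefix = ''): string[] {
  const paths: string[] = [];
  if (Array.isArray(value)) {
    for (const item of value) paths.push(...keyPaths(item, prefix + '[]'));
  } else if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      const path = prefix ? prefix + '.' + key : key;
      paths.push(path, ...keyPaths(record[key], path));
    }
  }
  return paths;
}

// De-duplicate (array items repeat the same paths) and sort for stable output.
function uniqueSorted(paths: string[]): string[] {
  return Array.from(new Set(paths)).sort();
}

// Report key paths present only in `before` (removed) or only in `after` (added).
function shapeDiff(before: unknown, after: unknown): { removed: string[]; added: string[] } {
  const a = uniqueSorted(keyPaths(before));
  const b = uniqueSorted(keyPaths(after));
  return {
    removed: a.filter((p) => b.indexOf(p) === -1),
    added: b.filter((p) => a.indexOf(p) === -1),
  };
}

const original = {
  tickets: [{ id: 'TKT-001', subject: 'Login broken', status: 'open', priority: 'high' }],
  total: 2,
};
const refactored = {
  data: {
    items: [{ ticketId: 'TKT-001', title: 'Login broken', state: 'open', urgency: 'high' }],
    count: 2,
  },
};

const diff = shapeDiff(original, refactored);
console.log(diff.removed); // every original path, including "tickets[].id", is gone
console.log(diff.added);   // every path in the new payload is new to the consumer
```

Not a single key path survives the refactor, which is why an agent relying on tickets[].id has nothing to hold on to.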

What to snapshot and what not to snapshot

The biggest mistake teams make with MCP snapshot tests is snapshotting everything, including fields that change every run. The snapshot breaks on every CI run, the team starts reflexively running vitest -u (short for --update) without looking at the diff, and the guard that was supposed to catch regressions instead trains everyone to ignore it.

The rule is: snapshot anything the LLM reads to decide what to do. Sanitize anything that changes between runs without the LLM caring.

Good snapshot targets | Sanitize before snapshotting
Top-level content[] array structure | Timestamps (created_at, updated_at)
Field names inside JSON responses | Auto-generated IDs and UUIDs
Output format choice: text vs. JSON vs. structured | Pagination cursors and continuation tokens
Error message wording for known error cases | API call counts and internal metrics
Number of content blocks returned | Server-generated nonces and salts
Presence of isError: true/false | Duration and latency values

Here is a sanitizer function that replaces the most common dynamic values with stable placeholders before snapshotting:

// test/helpers/sanitize-snapshot.ts
import type { CallToolResult } from '@modelcontextprotocol/sdk/types.js';

const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_DATE_RE = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})/g; // UTC ("Z") or numeric-offset form
const UNIX_TS_RE = /\b1[6-8]\d{8}(\d{3})?\b/g; // unix seconds or milliseconds, roughly Sep 2020 to Mar 2030

export function sanitizeForSnapshot(result: CallToolResult): CallToolResult {
  return {
    ...result,
    content: result.content.map((block) => {
      if (block.type !== 'text') return block;

      let text = block.text;
      text = text.replace(UUID_RE, '<UUID>');
      text = text.replace(ISO_DATE_RE, '<ISO_DATE>');
      text = text.replace(UNIX_TS_RE, '<TIMESTAMP>');

      return { ...block, text };
    }),
  };
}

// For structured JSON text blocks, sanitize within the parsed object
export function sanitizeJsonBlock(text: string, keysToRedact: string[]): string {
  try {
    const obj = JSON.parse(text);
    for (const key of keysToRedact) {
      redactKey(obj, key);
    }
    return JSON.stringify(obj, null, 2);
  } catch {
    return text; // not JSON — sanitize as string above
  }
}

function redactKey(obj: unknown, key: string): void {
  if (!obj || typeof obj !== 'object') return;
  if (Array.isArray(obj)) {
    obj.forEach((item) => redactKey(item, key));
  } else {
    for (const [k, v] of Object.entries(obj as Record<string, unknown>)) {
      if (k === key) {
        (obj as Record<string, unknown>)[k] = `<REDACTED:${key}>`;
      } else {
        redactKey(v, key);
      }
    }
  }
}
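
The two helpers above are usually applied together, and a tool may return more than one text block. The following is a minimal, self-contained sketch of a combined pass that sanitizes every text block without mutating its input. The function name sanitizeAllBlocks and the simplified Block and ToolResult types are assumptions made for illustration; in a real test you would use the SDK's CallToolResult type instead.

```typescript
// Simplified stand-ins for the SDK types (assumption: only these fields matter here).
interface Block { type: string; text?: string; [key: string]: unknown }
interface ToolResult { content: Block[]; isError?: boolean }

const ISO_DATE = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z/g;

// Return a deep copy of `value` with every listed key replaced by a placeholder.
function redactKeys(value: unknown, keys: string[]): unknown {
  if (Array.isArray(value)) return value.map((item) => redactKeys(item, keys));
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = keys.indexOf(key) !== -1 ? `<REDACTED:${key}>` : redactKeys(source[key], keys);
    }
    return out;
  }
  return value;
}

// Sanitize every text block: key-based redaction for JSON payloads,
// then string-level date replacement for whatever remains.
function sanitizeAllBlocks(result: ToolResult, keys: string[]): ToolResult {
  return {
    ...result,
    content: result.content.map((block) => {
      if (block.type !== 'text' || typeof block.text !== 'string') return block;
      let text: string;
      try {
        text = JSON.stringify(redactKeys(JSON.parse(block.text), keys), null, 2);
      } catch {
        text = block.text; // not JSON: only the string-level pass applies
      }
      return { ...block, text: text.replace(ISO_DATE, '<ISO_DATE>') };
    }),
  };
}

const example = sanitizeAllBlocks(
  {
    content: [
      { type: 'text', text: '{"id":"TKT-001","createdAt":"2026-01-01T00:00:00.000Z"}' },
      { type: 'text', text: 'Fetched at 2026-01-01T12:30:00Z' },
    ],
    isError: false,
  },
  ['createdAt'],
);
console.log(example.content[1].text); // "Fetched at <ISO_DATE>"
```

Returning a new object rather than editing blocks in place keeps the raw result available for other assertions in the same test.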

Setting up MCP snapshot tests with Vitest

Vitest ships toMatchSnapshot() and toMatchInlineSnapshot() out of the box. The key for MCP is serializing the CallToolResult correctly — specifically, keeping the content[] array structure visible in the snapshot rather than letting it collapse into [Object].

// test/snapshots/list-tickets.test.ts
import { describe, it, expect, beforeEach, afterEach } from 'vitest';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from '../../src/server.js';
import { createTestDeps } from '../helpers/test-deps.js';
import { sanitizeForSnapshot, sanitizeJsonBlock } from '../helpers/sanitize-snapshot.js';

describe('list_tickets — snapshot', () => {
  let client: Client;

  beforeEach(async () => {
    const deps = createTestDeps();
    await deps.db.seed([
      { id: 'TKT-001', subject: 'Login broken', status: 'open', priority: 'high', createdAt: new Date('2026-01-01') },
      { id: 'TKT-002', subject: 'Slow dashboard', status: 'closed', priority: 'low', createdAt: new Date('2026-01-02') },
    ]);

    const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
    const server = createServer(deps);
    await server.connect(serverTransport);

    client = new Client({ name: 'snapshot-test', version: '0.0.0' }, { capabilities: {} });
    await client.connect(clientTransport);
  });

  afterEach(async () => {
    await client.close();
  });

  it('response structure matches snapshot', async () => {
    const raw = await client.callTool({ name: 'list_tickets', arguments: { status: 'all' } });

    // Sanitize: redact dynamic timestamps within the JSON payload
    const sanitized = sanitizeForSnapshot(raw);
    if (sanitized.content[0]?.type === 'text') {
      sanitized.content[0].text = sanitizeJsonBlock(
        sanitized.content[0].text,
        ['createdAt', 'updatedAt', 'cursor']
      );
    }

    // Snapshot the full sanitized CallToolResult — structure is now stable
    expect(sanitized).toMatchSnapshot();
  });

  it('error response structure matches snapshot', async () => {
    const raw = await client.callTool({ name: 'list_tickets', arguments: { status: 'invalid' } });
    expect(raw.isError).toBe(true);
    // Error messages should be stable — no sanitization needed
    expect(raw).toMatchSnapshot();
  });
});

The generated snapshot file (__snapshots__/list-tickets.test.ts.snap) looks like this after the first run:

// Vitest Snapshot v1, https://vitest.dev/guide/snapshot.html

exports[`list_tickets — snapshot > response structure matches snapshot 1`] = `
{
  "content": [
    {
      "text": "{
  \\"tickets\\": [
    {
      \\"id\\": \\"TKT-001\\",
      \\"subject\\": \\"Login broken\\",
      \\"status\\": \\"open\\",
      \\"priority\\": \\"high\\",
      \\"createdAt\\": \\"<REDACTED:createdAt>\\"
    },
    {
      \\"id\\": \\"TKT-002\\",
      \\"subject\\": \\"Slow dashboard\\",
      \\"status\\": \\"closed\\",
      \\"priority\\": \\"low\\",
      \\"createdAt\\": \\"<REDACTED:createdAt>\\"
    }
  ],
  \\"total\\": 2
}",
      "type": "text",
    },
  ],
  "isError": false,
}
`;

exports[`list_tickets — snapshot > error response structure matches snapshot 1`] = `
{
  "content": [
    {
      "text": "Invalid status value. Expected: open | closed | all",
      "type": "text",
    },
  ],
  "isError": true,
}
`;

If you later refactor the response to use the nested data.items structure, the snapshot fails immediately with a clear diff. The field renames from subject to title, status to state, and the structural change from tickets[] to data.items[] are all visible in the snapshot diff in the PR.

For tools with many output variants, toMatchInlineSnapshot() is useful for short responses where you want the expected value visible in the test file itself:

it('get_ticket returns text summary inline', async () => {
  const raw = await client.callTool({ name: 'get_ticket', arguments: { id: 'TKT-001' } });
  const sanitized = sanitizeForSnapshot(raw);

  expect(sanitized.content[0]).toMatchInlineSnapshot(`
    {
      "text": "Ticket TKT-001: Login broken (open, high priority)",
      "type": "text",
    }
  `);
});

Snapshot discipline

Snapshot tests only protect you if you treat snapshot updates as meaningful code changes. The workflow should be:

  1. Snapshot failure in CI: the build is red. Do not reflexively run vitest -u.
  2. Ask: was this intentional? Look at the recent commits. Did someone change the tool handler's output format on purpose? If yes, review the diff carefully — does the new shape still make sense for LLM consumption? Then update.
  3. If unintentional — this is the bug you were looking for. Revert the formatting change, not the snapshot.

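Vitest prints the snapshot diff for a failing test under its default reporter. Adding --reporter=verbose also lists every test by name, which makes it easier to see which tool and which output variant failed: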

vitest run --reporter=verbose

Vitest prints the diff between the stored snapshot and the received value, with - lines for what the snapshot expected and + lines for what the tool actually returned. A diff that shows - "tickets" / + "data" across every line is clearly a structural rename — easy to catch in code review. A diff that shows one field changing from "open" to "active" across a single test is equally clear.

Snapshot files belong in git. Commit __snapshots__/ directories alongside the test files that create them.

Never add __snapshots__/ to .gitignore. A missing snapshot file causes the test to pass on first run (Vitest writes the file and reports success), masking regressions until the snapshot is committed and a second run can compare.

Prompt-regression snapshots

Standard snapshot tests verify that the MCP tool returns a stable output for a given input. Prompt-regression snapshots go one step further: they verify that the LLM's downstream response to that tool output doesn't change. This is a more expensive but more complete regression check — it catches cases where the output shape is technically the same but the wording change causes a different LLM interpretation.

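The pattern uses a fixed LLM call made as repeatable as the API allows (temperature 0, same pinned model version, same system prompt) and snapshots the model's response to a canned tool result. Temperature 0 reduces run-to-run variation but does not guarantee byte-identical output, so snapshot only the fields that encode the decision: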

// test/snapshots/prompt-regression.test.ts
import { describe, it, expect } from 'vitest';
import Anthropic from '@anthropic-ai/sdk';

// Fixed tool result — simulates what list_tickets returns
const TOOL_RESULT_V1 = `{
  "tickets": [
    { "id": "TKT-001", "subject": "Login broken", "status": "open", "priority": "high" }
  ],
  "total": 1
}`;

// The LLM's task: read the ticket list and decide what to do next
const SYSTEM_PROMPT = `You are a support agent assistant. Given a list of tickets, output JSON:
{"action": "escalate" | "defer" | "close", "ticketId": string, "reason": string}`;

describe('prompt regression — list_tickets', () => {
  it('LLM correctly reads ticket list and escalates high-priority open ticket', async () => {
    const client = new Anthropic();

    const response = await client.messages.create({
      model: 'claude-3-5-haiku-20241022', // pin exact version for snapshot stability
      max_tokens: 256,
      temperature: 0,
      system: SYSTEM_PROMPT,
      messages: [{ role: 'user', content: `Tickets: ${TOOL_RESULT_V1}` }],
    });

    const text = response.content[0].type === 'text' ? response.content[0].text : '';
    const parsed = JSON.parse(text);

    // Snapshot only the decision fields; the free-text "reason" can vary in wording between runs
    expect({ action: parsed.action, ticketId: parsed.ticketId }).toMatchSnapshot();
  });
});
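
One practical wrinkle with this pattern: models sometimes wrap JSON in a Markdown code fence or add free-text fields whose wording drifts. The sketch below shows one way to turn the raw reply into a small, stable object before snapshotting it. The function name parseDecision and the StableDecision type are hypothetical; the action values come from the system prompt above.

```typescript
type Action = 'escalate' | 'defer' | 'close';
interface StableDecision { action: Action; ticketId: string }

const ALLOWED_ACTIONS: string[] = ['escalate', 'defer', 'close'];

// Parse the model's reply and keep only the fields that encode the decision.
function parseDecision(raw: string): StableDecision {
  // Strip an optional Markdown code fence (three backticks, written as \x60) around the JSON.
  const fenced = raw.match(/\x60{3}(?:json)?\s*([\s\S]*?)\x60{3}/);
  const jsonText = (fenced ? fenced[1] : raw).trim();

  const parsed = JSON.parse(jsonText) as { action?: unknown; ticketId?: unknown };
  const action = parsed.action;
  const ticketId = parsed.ticketId;

  if (typeof action !== 'string' || ALLOWED_ACTIONS.indexOf(action) === -1) {
    throw new Error('Unexpected action: ' + String(action));
  }
  if (typeof ticketId !== 'string') {
    throw new Error('Missing ticketId');
  }
  // "reason" is deliberately dropped: it is free text and may change wording between runs.
  return { action: action as Action, ticketId };
}

const decision = parseDecision(
  '{"action": "escalate", "ticketId": "TKT-001", "reason": "High priority and still open"}',
);
console.log(decision); // { action: 'escalate', ticketId: 'TKT-001' }
```

Throwing on an unknown action turns a malformed reply into a clear test failure instead of a confusing snapshot mismatch.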

Prompt-regression snapshots are worth the cost when a wording change in a tool's output could change what the LLM does next, even though the output shape stays the same. They are not worth the cost for every tool call: LLM API latency and token costs make them slow and expensive at scale. Run prompt-regression snapshots in a separate test suite gated behind a PROMPT_REGRESSION=true environment variable, executed nightly or before major releases rather than on every PR commit.
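
A minimal sketch of that gate follows. The helper name promptRegressionEnabled is hypothetical; the environment variable name comes from the paragraph above. The commented usage relies on Vitest's describe.runIf, which registers the suite only when the condition is true.

```typescript
// Decide whether the expensive prompt-regression suite should run.
// The environment is passed in explicitly so the function is easy to test.
function promptRegressionEnabled(env: Record<string, string | undefined>): boolean {
  return env.PROMPT_REGRESSION === 'true';
}

// Usage inside a Vitest test file (shown as a comment to keep this block dependency-free):
//
//   describe.runIf(promptRegressionEnabled(process.env))('prompt regression', () => {
//     it('escalates the high-priority open ticket', async () => { /* ... */ });
//   });

console.log(promptRegressionEnabled({ PROMPT_REGRESSION: 'true' })); // true
console.log(promptRegressionEnabled({}));                            // false
```

A nightly job can then set PROMPT_REGRESSION=true, while ordinary PR runs leave it unset and skip the suite.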

CI integration

Snapshot files committed to git, combined with a CI job that runs tests in read-only mode (without the -u / --update flag), create an automatic gate against unapproved output changes. Here is a complete GitHub Actions workflow:

# .github/workflows/snapshot-tests.yml
name: Snapshot tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  snapshot:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - run: npm ci

      - name: Run snapshot tests (no update)
        run: npx vitest run --reporter=verbose
        # Vitest exits non-zero if any snapshot does not match the committed file.
        # The -u / --update flag is intentionally NOT passed here: CI must never
        # silently accept a changed snapshot without a human reviewing the diff.

      - name: Upload snapshot diff on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: snapshot-diff
          path: |
            **/__snapshots__/*.snap
          retention-days: 7

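Because this job never passes -u, the .snap files on the runner are the committed snapshots (the expected side of the diff), not what the code now produces; the received values appear only in the diff that Vitest prints to the job log. The upload-artifact step makes the expected snapshots available next to that log, so developers can compare the two without checking out the branch locally. This is especially useful when a snapshot regression is discovered in a dependency update PR, where the developer didn't write the failing code themselves.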

To update snapshots intentionally, run locally:

npx vitest run -u
git diff -- '**/__snapshots__/*.snap'   # review every change before staging
git add -- '**/__snapshots__/*.snap'
git commit -m "snapshot: update list_tickets output after field rename to match new schema"

The commit message convention (snapshot: update ...) makes snapshot updates searchable in the git log. When a production incident is traced back to a formatting change, you can run git log --oneline --all -- '**/__snapshots__/*.snap' to find exactly when the snapshot was last updated and which PR approved it.

Snapshot tests and AliveMCP

Snapshot tests and production monitoring address entirely different failure modes. Understanding the distinction prevents false confidence in either direction.

Snapshot tests run in your development and CI environment against a controlled, deterministic fixture. They verify that, given a specific input, your tool produces a specific output shape. They catch: field renames, structural reorganization, output format changes (text vs. JSON), error message wording changes, and any other formatting regression introduced during development. They do not catch: your deployed server being down, a database being unreachable, a memory leak causing timeouts after 10 hours of uptime, or a cloud provider networking issue making the server unavailable.

AliveMCP runs every 60 seconds against your live deployed server. It verifies that the MCP initialize handshake completes, that tools/list returns a valid response, and that your server is reachable from the public internet (or your private network, on the Team plan). It catches server crashes, failed deployments, network partitions, certificate expiry, and any condition that makes the server not respond at all. It does not catch formatting regressions inside tool responses: it doesn't call individual tools, and it doesn't know what the "correct" output should look like.

Failure mode | Snapshot tests catch it | AliveMCP catches it
Field renamed in tool response JSON | Yes | No
Response restructured (nested deeper) | Yes | No
Error message wording changed | Yes | No
Server down / port not bound | No | Yes
TLS certificate expired | No | Yes
Database unreachable at runtime | No | Yes (only if the server itself stops responding)
tools/list returns empty after bad deploy | No | Yes

Both are required. Snapshot tests are part of the build process: they run before the code ships. AliveMCP is part of the runtime layer: it runs after the code ships and keeps running. A server that passes all snapshot tests can still be down in production. A server that AliveMCP reports as up can still have broken output formatting if a snapshot regression slipped through a careless vitest -u run. Use both.

For teams using AliveMCP health checks alongside snapshot tests, a useful convention is to add an AliveMCP webhook notification to the same Slack channel where CI failures post — so a formatting regression caught in CI and a runtime outage caught by AliveMCP both surface in the same workflow, and neither gets lost.

Related questions

Should I snapshot the entire CallToolResult or just the content?

Snapshot the entire CallToolResult, including isError. The isError flag is part of what the LLM client reads — a tool that previously returned isError: false and now returns isError: true for the same input is a regression, even if the content[] text is otherwise identical. Including the top-level structure in the snapshot means that change is caught automatically.

How do I handle tools that return image or binary content?

For ImageContent blocks (type: 'image'), snapshot the mimeType and the length of the data string rather than the full base64 payload. A 50,000-character base64 string in a snapshot file is unreadable in a PR diff and makes the snapshot file huge. Replace the data with a stable hash: createHash('sha256').update(block.data).digest('hex'). The hash changes if the image changes, but is diffable at a glance.
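
A minimal sketch of that replacement, assuming image blocks carry type, mimeType, and a base64 data string as described above. The function name summarizeImageBlock, the simplified Block type, and the extra dataLength field are illustrative choices, not part of the MCP SDK.

```typescript
import { createHash } from 'node:crypto';

// Simplified stand-in for an MCP content block (assumption: only these fields matter here).
interface Block { type: string; data?: string; mimeType?: string; [key: string]: unknown }

// Replace a base64 image payload with a short, diffable summary.
function summarizeImageBlock(block: Block): Block {
  if (block.type !== 'image' || typeof block.data !== 'string') return block;
  return {
    type: block.type,
    mimeType: block.mimeType,
    // Hash of the base64 string: changes whenever the image bytes change.
    data: '<sha256:' + createHash('sha256').update(block.data).digest('hex') + '>',
    dataLength: block.data.length,
  };
}

const summarized = summarizeImageBlock({
  type: 'image',
  mimeType: 'image/png',
  data: 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk',
});
console.log(summarized.mimeType, summarized.dataLength, summarized.data);
```

Text blocks pass through unchanged, so the helper can be mapped over the whole content[] array before calling toMatchSnapshot().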

Can I snapshot tools/list as well as tool responses?

Yes — and you should. Snapshot the output of client.listTools() sorted by tool name. This catches tool renames, added/removed tools, description changes, and inputSchema modifications. The integration testing guide covers the SHA-256 hash approach for schema snapshots; using Vitest's toMatchSnapshot() directly on the sorted tool list is an alternative that gives you a human-readable diff in the snapshot file.
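
A minimal sketch of the sorting step, assuming each tool entry exposes name, description, and inputSchema. The function name stableToolList and the ToolInfo type are hypothetical; in a test you would pass it the tools array returned by client.listTools().

```typescript
// Simplified stand-in for a tools/list entry (assumption: these three fields are what the LLM sees).
interface ToolInfo { name: string; description?: string; inputSchema?: unknown }

// Keep only the LLM-visible fields and sort by name so registration order cannot cause snapshot churn.
function stableToolList(tools: ToolInfo[]): ToolInfo[] {
  return tools
    .map((tool) => ({
      name: tool.name,
      description: tool.description,
      inputSchema: tool.inputSchema,
    }))
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

// Usage inside a Vitest test (shown as a comment to keep this block dependency-free):
//
//   const { tools } = await client.listTools();
//   expect(stableToolList(tools)).toMatchSnapshot();

const sortedTools = stableToolList([
  { name: 'list_tickets', description: 'List support tickets' },
  { name: 'close_ticket', description: 'Close a ticket by id' },
  { name: 'get_ticket', description: 'Fetch one ticket' },
]);
console.log(sortedTools.map((tool) => tool.name)); // [ 'close_ticket', 'get_ticket', 'list_tickets' ]
```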

How many snapshot tests should I have?

One snapshot test per meaningful output variant per tool, plus one for each error case that has a specific message. For a tool with three output shapes (success with results, success with empty results, validation error), that's three snapshot tests. Don't snapshot every possible argument combination — snapshot the cases where the output structure differs, not the cases where only the content values differ.

Snapshot tests catch formatting regressions in dev. AliveMCP catches outages in production.

AliveMCP runs the MCP initialize + tools/list probe against your server every 60 seconds. Know the moment your server stops responding — before your users do.
