MCP Server Context Window Management — token budget, chunking, and session continuity
Tool responses consume the LLM's context window. A search_documents tool that returns 20 chunks at 300 tokens each burns 6,000 tokens before the LLM writes a single word of response, and if the agent has already accumulated 60,000 tokens of conversation history, more than half of a 128K-context model's window is already gone. MCP servers that ignore token budgets cause context overflow errors, LLM retries, and silent response truncation. This guide covers token counting in tool implementations, budget-aware retrieval, truncation strategies, multi-turn deduplication, how a server restart breaks active sessions, and how AliveMCP detects the restart that drops your users' context.
TL;DR
Count tokens before returning from every tool that produces large text output. Use tiktoken (Python) or js-tiktoken (Node.js) to count accurately — character-based estimates are wrong by up to 40% for code and structured data. If the response would exceed a configurable max_tokens parameter (default 2000), truncate from the bottom of the result set (keep the highest-ranked items), add a truncated: true field to the response, and let the agent decide whether to request fewer results. For MCP servers managing multi-turn sessions, track document deduplication across turns to avoid resending the same chunks. Monitor with AliveMCP — a server restart during an active session causes a protocol handshake failure that AliveMCP detects within 60 seconds, letting you alert users before they lose context.
The context window problem in MCP
Every MCP tool call result is appended to the LLM's context window as a tool message. For a long-running agent session with many tool calls, this accumulates quickly:
| Turn | Tool call | Tokens added by tool | Running total (incl. conversation) |
|---|---|---|---|
| 1 | search_documents (10 chunks) | 3,000 | 3,500 |
| 2 | read_file (README.md) | 2,000 | 6,200 |
| 3 | search_documents (10 more chunks) | 3,000 | 10,000 |
| 4 | get_database_schema | 4,000 | 15,000 |
| … | … | … | … |
| 20 | — | — | ~80,000 |
At 80,000 tokens with a 128K-context model, the agent has 48,000 tokens of headroom, which looks like enough for several more turns. But if the next few turns add around 10,000 tokens each, it hits the limit by turn 25. The LLM starts receiving truncated contexts and produces responses that appear to "forget" earlier conversation. If the model provider returns a context overflow error, the entire agent loop breaks.
MCP servers cannot control how agents accumulate context from other sources (conversation history, other tool calls), but they can control how many tokens their own tool responses consume — and they should.
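To make the accumulation arithmetic concrete, here is a minimal sketch of a running ledger like the table above. The helper name `createContextLedger` and the 4,000-token reserve for the model's reply are assumptions made for illustration, not part of any MCP SDK.

```javascript
// Hypothetical helper: tracks how much of a context window a session has used.
function createContextLedger(contextLimit, reservedForResponse = 4000) {
  let used = 0;
  const entries = [];
  return {
    // Record the tokens added by one message or tool result.
    add(label, tokens) {
      used += tokens;
      entries.push({ label, tokens, runningTotal: used });
    },
    // Tokens still available after reserving room for the model's reply.
    headroom() {
      return Math.max(0, contextLimit - reservedForResponse - used);
    },
    // True when a response of `tokens` would still fit.
    fits(tokens) {
      return tokens <= this.headroom();
    },
    // Copy of the per-entry history, useful for logging.
    entries: () => entries.slice(),
  };
}

const ledger = createContextLedger(128000);
ledger.add("conversation so far", 80000);
console.log(ledger.headroom()); // 44000
console.log(ledger.fits(6000)); // true
```

A ledger like this lives on the agent side or in a test harness; the server only ever sees its own share of the total, which is why the rest of this guide focuses on bounding that share.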
Token counting: why character estimates fail
The naive approach is character division: estimate 4 characters per token, so a 2000-character response is ~500 tokens. This is wrong for anything other than simple English prose:
- Source code: identifiers like vectorStore.idleConnectionCount split into several subword tokens, and operators and brackets cost tokens of their own. Code is typically 2–3 characters per token, not 4.
- JSON structures: in {"key": "value"}, the braces, quotes, and colons consume tokens in addition to the words they surround. JSON is expensive per character.
- Non-Latin text: Chinese, Japanese, and Arabic characters often map to one or more tokens each, depending on the tokenizer. A 1000-character Chinese response can be several times larger than the 250 tokens the 4-character rule predicts.
- Whitespace and formatting: markdown headers, code blocks, and newlines add tokens that character counts miss.
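One way to see how far the heuristic drifts on your own corpus is to compare it with a real count. The sketch below assumes a `countTokens` function is passed in (any tokenizer-backed counter works); `estimateDrift` is a hypothetical name.

```javascript
// Compare the 4-characters-per-token heuristic with a real token count.
// `countTokens` is injected so any tokenizer can be plugged in.
function estimateDrift(text, countTokens) {
  const estimated = Math.ceil(text.length / 4);
  const actual = countTokens(text);
  return {
    estimated,
    actual,
    // Positive values mean the heuristic under-counts.
    relativeError: actual === 0 ? 0 : (actual - estimated) / actual,
  };
}
```

Running this over a sample of real tool responses (prose, code, JSON) and logging `relativeError` gives a grounded picture of how much budget a character-based estimate would misjudge.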
// Node.js: accurate token counting with js-tiktoken
import { encodingForModel } from 'js-tiktoken';
const encoder = encodingForModel('gpt-4o'); // use your target model's encoding
function countTokens(text) {
const tokens = encoder.encode(text);
return tokens.length;
}
# Python: accurate token counting with tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # use your target model's encoding

def count_tokens(text: str) -> int:
    return len(enc.encode(text))
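Token counting is fast: encoding a typical tool response costs a few milliseconds at most, which is negligible next to the retrieval call that produced it. There's no reason to estimate when you can count. The count is exact only when the encoding matches the model the agent is actually running; for other model families, treat it as a close approximation and leave some margin.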
Budget-aware retrieval
The simplest budget enforcement: accept a max_tokens parameter in the tool's input schema and truncate results to fit.
async function searchWithBudget({ query, top_k = 20, max_tokens = 2000 }) {
  // Fetch and rerank candidates, then trim the ranked list to the token budget
  const candidates = await hybridSearch(query, top_k);
  const reranked = await crossEncoderRerank(query, candidates);
  const chunks = [];
  let usedTokens = 0;
  // Reserve room for the JSON envelope that wraps the results
  const overhead = countTokens(JSON.stringify({
    results: [], query, total_retrieved: 0, returned: 0, truncated: false, tokens_used: 0,
  }));
  const budget = max_tokens - overhead;
  for (const chunk of reranked) {
    // Return only the fields that were counted, so the payload matches the budget
    const item = { text: chunk.text, source: chunk.source, score: chunk.score };
    const itemTokens = countTokens(JSON.stringify(item));
    if (usedTokens + itemTokens > budget) break;
    chunks.push(item);
    usedTokens += itemTokens;
  }
  return {
    content: [{
      type: "text",
      text: JSON.stringify({
        results: chunks,
        query,
        total_retrieved: reranked.length,
        returned: chunks.length,
        truncated: chunks.length < reranked.length,
        tokens_used: usedTokens,
      })
    }]
  };
}
The truncated: true field in the response is important: it tells the agent that more results exist but were withheld to fit the budget. A well-implemented agent can then request a smaller top_k with a source_filter to narrow the search, or proceed knowing its results are incomplete.
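For the input-schema side of this, a tool definition might declare max_tokens with a default and bounds, and the handler can clamp whatever arrives. This is a sketch: the bounds (200 to 8000 tokens, top_k up to 50) and the helper `normalizeSearchArgs` are illustrative assumptions, not values the MCP specification prescribes.

```javascript
// Hypothetical tool definition; the numeric bounds are illustrative choices.
const searchDocumentsTool = {
  name: "search_documents",
  description: "Search indexed documents. Results are trimmed to max_tokens.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string" },
      top_k: { type: "integer", minimum: 1, maximum: 50, default: 20 },
      max_tokens: { type: "integer", minimum: 200, maximum: 8000, default: 2000 },
    },
    required: ["query"],
  },
};

// Apply schema defaults and clamp values to the declared bounds.
function normalizeSearchArgs(args) {
  const props = searchDocumentsTool.inputSchema.properties;
  const clamp = (value, spec) => {
    const n = Number.isInteger(value) ? value : spec.default;
    return Math.min(spec.maximum, Math.max(spec.minimum, n));
  };
  return {
    query: String(args.query ?? ""),
    top_k: clamp(args.top_k, props.top_k),
    max_tokens: clamp(args.max_tokens, props.max_tokens),
  };
}
```

Clamping on the server keeps a misconfigured agent from requesting an unbounded response, while callers that omit the parameter still get the conservative default.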
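Dynamic budget from a client hint
MCP does not define a standard way for a client to declare its context window. The maxTokens field in the sampling API belongs to sampling/createMessage, a request the server sends to the client, and it caps the length of the completion the server is asking for; it is not a client-declared budget. If you control both sides, you can still agree on a convention, for example a value the client advertises under its experimental capabilities, and use it to inform tool response sizing. If the client signals it's working with a 4K-token budget total, your search_documents tool should return far fewer chunks than if the client has a 128K budget.
// Read a context limit the client advertised at initialization.
// Assumption: experimental.contextBudget.maxTokens is a private convention between
// your client and server, and session.clientCapabilities is wherever your server
// stores the capabilities received during initialize. Neither is defined by MCP.
function getContextBudget(session) {
  const maxTokens = session?.clientCapabilities?.experimental?.contextBudget?.maxTokens;
  if (Number.isFinite(maxTokens) && maxTokens > 0) {
    // Leave most of the window for history and the LLM response; allocate 30% to this tool call
    return Math.floor(maxTokens * 0.30);
  }
  return 2000; // safe default
}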
Truncation strategies
When you must truncate, the strategy matters for answer quality:
| Strategy | How | Best for | Risk |
|---|---|---|---|
| Tail truncation | Return top-N ranked chunks; drop the rest | Ranked retrieval where relevance falls off | May miss a crucial detail in a lower-ranked chunk |
| Within-chunk truncation | Shorten each chunk to fit more chunks | Large chunks with redundant content | Truncated chunks lose their closing sentences — often the conclusion |
| Summarize to fit | LLM-summarize each chunk before returning | Dense technical content | Adds latency; introduces paraphrase errors |
| Deduplicate first | Remove near-duplicate chunks before truncating | Large corpora with repetitive content | Requires similarity computation across results |
Tail truncation — return the highest-ranked results that fit, drop the rest — is the default. It's the most predictable and easiest to reason about. Add chunk deduplication before truncating: if your top-20 candidates include 3 chunks from the same document paragraph (common with overlapping windows), deduplicate them first to use your token budget on diversity, not repetition.
function deduplicateChunks(chunks, similarityThreshold = 0.92) {
const selected = [];
for (const chunk of chunks) {
const isDuplicate = selected.some(s =>
jaccardSimilarity(chunk.text, s.text) > similarityThreshold
);
if (!isDuplicate) {
selected.push(chunk);
}
}
return selected;
}
function jaccardSimilarity(a, b) {
const setA = new Set(a.split(/\s+/));
const setB = new Set(b.split(/\s+/));
const intersection = new Set([...setA].filter(w => setB.has(w)));
const union = new Set([...setA, ...setB]);
return intersection.size / union.size;
}
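The within-chunk truncation row in the table notes that cut chunks tend to lose their closing sentences. If you do shorten chunks, cutting at sentence boundaries at least avoids returning half a sentence. A minimal sketch, assuming a `countTokens` function is passed in and using a deliberately rough sentence splitter; `truncateAtSentence` is a hypothetical name.

```javascript
// Trim `text` to at most `maxTokens`, cutting only at sentence boundaries.
// `countTokens` is injected; any function that returns a token count works.
function truncateAtSentence(text, maxTokens, countTokens) {
  if (countTokens(text) <= maxTokens) return { text, truncated: false };
  // Split after ., !, or ? followed by whitespace (a rough sentence splitter).
  const sentences = text.split(/(?<=[.!?])\s+/);
  const kept = [];
  for (const sentence of sentences) {
    const candidate = [...kept, sentence].join(" ");
    if (countTokens(candidate) > maxTokens) break;
    kept.push(sentence);
  }
  // If even the first sentence is over budget, the result is an empty string.
  return { text: kept.join(" "), truncated: true };
}
```

The loop re-counts the candidate text on every iteration, which is fine for chunk-sized inputs. The splitter will misfire on abbreviations and code, so treat it as a starting point.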
Multi-turn deduplication: don't resend chunks the agent already has
In a multi-turn agent session, the agent may call search_documents multiple times with related queries. The top-ranked results often overlap — the same document chunks appear in multiple search results. If the MCP server returns the same chunk in turn 2 that it returned in turn 1, the agent's context window now contains the same text twice, wasting tokens.
// MCP server maintains session state for seen chunk IDs
const sessionSeenChunks = new Map(); // sessionId → Set of chunk IDs
async function searchWithDeduplication({ query, top_k, sessionId, max_tokens = 2000 }) {
  const seenIds = sessionSeenChunks.get(sessionId) || new Set();
  const candidates = await hybridSearch(query, top_k * 2);
  // Separate already-seen from new
  const newChunks = candidates.filter(c => !seenIds.has(c.id));
  const seenChunks = candidates.filter(c => seenIds.has(c.id));
  // Return new chunks first; selectWithBudget is a helper assumed to keep the
  // highest-ranked chunks that fit the given token budget
  const toReturn = selectWithBudget(newChunks, max_tokens * 0.8);
  // Reference already-seen chunks by ID instead of resending their text
  const seenRefs = seenChunks.slice(0, 3).map(c => ({
    id: c.id,
    source: c.source,
    note: "already in context from earlier search"
  }));
  // Update session state
  const updatedSeen = new Set(seenIds);
  toReturn.forEach(c => updatedSeen.add(c.id));
  sessionSeenChunks.set(sessionId, updatedSeen);
  return {
    content: [{
      type: "text",
      text: JSON.stringify({
        results: toReturn,
        already_in_context: seenRefs,
        new_chunks: toReturn.length,
        deduplicated: seenChunks.length, // candidates withheld because the agent already has them
      })
    }]
  };
}
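The selectWithBudget helper used above is not defined in this guide. One plausible shape for it is greedy tail truncation over an already ranked list. The factory `makeSelectWithBudget` is a hypothetical wrapper that lets the tokenizer be injected.

```javascript
// Build a selectWithBudget(chunks, maxTokens) function around a token counter.
function makeSelectWithBudget(countTokens) {
  return function selectWithBudget(chunks, maxTokens) {
    const selected = [];
    let used = 0;
    for (const chunk of chunks) {
      // Count the serialized chunk, since that is what the agent receives.
      const cost = countTokens(JSON.stringify(chunk));
      // Tail truncation: stop at the first chunk that does not fit.
      if (used + cost > maxTokens) break;
      selected.push(chunk);
      used += cost;
    }
    return selected;
  };
}
```

With the countTokens function from the token counting section, `const selectWithBudget = makeSelectWithBudget(countTokens);` yields the two-argument function the deduplication code calls. Stopping at the first oversized chunk preserves rank order; skipping it and trying smaller lower-ranked chunks is an alternative if filling the budget matters more than strict ordering.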
Session state for deduplication is in-memory, so it does not persist across server restarts. This is a critical detail: if the MCP server restarts mid-session, the new server instance has no memory of which chunks have been sent, and on the agent's next search_documents call it will re-send chunks the agent already holds. AliveMCP detects server restarts within 60 seconds via protocol probe interruption. Use this signal to trigger a session reset notification to active clients if your architecture supports it.
Session continuity across server restarts
A server restart during an active agent session creates three problems:
- Dropped SSE connections: agents connected via SSE receive a connection closed event and must reconnect. Depending on the agent framework's reconnection logic, this may terminate the session or trigger an automatic reconnect that starts a new MCP session without the previous session's context.
- Lost in-memory session state: per-session data (seen chunk IDs, active context summary, in-flight tool calls) disappears. The new server instance starts with a blank slate.
- Mid-operation failures: a tool call in progress at restart time fails with a transport error. The agent may retry the tool call or propagate the error up.
// Persist critical session state to Redis or SQLite for restart recovery
// This example assumes the ioredis client; adapt the calls for other clients
import Redis from 'ioredis';
const redis = new Redis(); // defaults to localhost:6379

async function saveSessionState(sessionId, state) {
  const key = `session:${sessionId}`;
  // Expire after one hour of inactivity so abandoned sessions are cleaned up
  await redis.setex(key, 3600, JSON.stringify({
    seen_chunk_ids: [...state.seenChunkIds],
    turn_count: state.turnCount,
    last_active: Date.now(),
  }));
}

async function loadSessionState(sessionId) {
  const key = `session:${sessionId}`;
  const raw = await redis.get(key);
  // No stored state: start a fresh session
  if (!raw) return { seenChunkIds: new Set(), turnCount: 0 };
  const data = JSON.parse(raw);
  return {
    seenChunkIds: new Set(data.seen_chunk_ids),
    turnCount: data.turn_count,
  };
}
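For the third problem in the list above, mid-operation failures, the agent side usually needs some retry policy. The following is a sketch of exponential backoff around a tool call. `callWithRetry` and the `isTransportError` predicate are hypothetical names; how a transport error is recognized depends on your client library.

```javascript
// Retry an async tool call a few times with exponential backoff.
// `isTransportError` is a hypothetical predicate; adapt it to your client's error types.
async function callWithRetry(
  callTool,
  { retries = 3, baseDelayMs = 500, isTransportError = () => true } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await callTool(attempt);
    } catch (err) {
      lastError = err;
      // Give up on non-transport errors or when attempts are exhausted.
      if (!isTransportError(err) || attempt === retries) break;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Only retry tools that are safe to repeat. A search is naturally idempotent, but a tool that writes data may have completed its side effect before the connection dropped.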
AliveMCP detects the restart within 60 seconds when the MCP protocol probe fails (connection refused during restart, then recovers when the server comes back up). Configure your AliveMCP Author plan to send a webhook on restart events. Your application layer can use this webhook to notify active users: "The documentation assistant is restarting — your session will resume in 30 seconds." This is more informative than the silent error the user would otherwise see when their agent loop breaks.
Frequently asked questions
Should MCP tools expose a max_tokens parameter or manage budget internally?
Expose max_tokens as an optional parameter with a sensible default (1500–2000 tokens). This lets agents that know their context budget pass it through explicitly, while agents that don't know or don't care get a safe default. Don't force the caller to always specify it — most agent frameworks don't pass explicit token budgets. The internal default should be conservative: 2000 tokens fits in the context window of any current model while still returning meaningful content. If your tool regularly hits the default and agents request more, increase the default. If agents are hitting context overflow errors, reduce it.
How do I handle a tool that must return a large response (e.g., a full file read)?
For tools that read files or databases, add pagination instead of truncation. A read_file tool should accept offset (byte position or line number) and limit (lines or bytes to return). The first call returns the start of the file; subsequent calls can request later sections. Return next_offset in the response so the agent knows where to continue. This is preferable to truncating silently: the agent knows there's more content and can decide whether to retrieve it based on what it's found so far. For file reads specifically, also return total_lines and total_tokens_estimated so the agent can make an informed decision about whether to read the entire file.
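A line-based version of that pagination could look like the sketch below. It works on an in-memory string so it stays independent of the file system; `readPage` and the 200-line default are assumptions, and the field names follow the ones described above.

```javascript
// Return one page of a file's lines plus the metadata the agent needs to continue.
function readPage(content, { offset = 0, limit = 200 } = {}) {
  const lines = content.split("\n");
  const start = Math.min(Math.max(0, offset), lines.length);
  const end = Math.min(start + limit, lines.length);
  return {
    text: lines.slice(start, end).join("\n"),
    offset: start,
    next_offset: end < lines.length ? end : null, // null means no more content
    total_lines: lines.length,
    // Rough estimate only; replace with a tokenizer-backed count where available.
    total_tokens_estimated: Math.ceil(content.length / 4),
  };
}
```

The agent keeps calling with `offset` set to the previous `next_offset` until it receives `null` or decides it has seen enough.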
What causes sudden P95 latency spikes in MCP servers handling context-heavy requests?
The most common cause is LLM retries triggered by context overflow: the agent calls your tool, receives a large response, the LLM hits its context limit, the framework retries with a truncated context, and the retry includes another tool call to your server. You see doubled or tripled tool call volume from the same session. The second cause is context serialization at the MCP transport layer: very large tool responses (>100KB of JSON) take measurable time to serialize and send over SSE. Add response size logging to your tool implementations to distinguish between retrieval latency and serialization latency. AliveMCP's P95 tracking catches the aggregate effect — use it as a signal to investigate, then your server-side logs to diagnose the cause.
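To separate retrieval latency from serialization latency, a handler can be wrapped so both are timed and the payload size is logged. A minimal sketch for Node.js; `withResponseMetrics` is a hypothetical name and the log fields are illustrative.

```javascript
// Wrap a tool handler to log retrieval time, serialization time, and payload size.
// `log` is injectable; it defaults to console.log.
function withResponseMetrics(toolName, handler, log = console.log) {
  return async function instrumented(args) {
    const t0 = performance.now();
    const result = await handler(args); // retrieval and ranking
    const t1 = performance.now();
    const text = JSON.stringify(result); // serialization
    const t2 = performance.now();
    log({
      tool: toolName,
      retrieval_ms: Math.round(t1 - t0),
      serialize_ms: Math.round(t2 - t1),
      response_bytes: Buffer.byteLength(text, "utf8"),
    });
    return { content: [{ type: "text", text }] };
  };
}
```

This measures JSON serialization inside the handler only. Time spent writing the response to the transport happens afterwards and would need instrumentation at that layer.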
How should I design tool responses for agents with very large context windows (200K+ tokens)?
Don't change the default response size for large-context models. The right response size is the amount of information that's actually useful for the task — which is determined by the query specificity and corpus relevance, not by what the context window can hold. Returning 50 chunks when 5 would answer the question wastes the agent's context budget on noise and makes the LLM's retrieval augmentation less effective (the model must synthesize a larger, lower-quality result set). Design tools to return the minimum useful information by default. Let the agent request more explicitly via top_k or max_tokens parameters when it determines more context is genuinely needed.
How does AliveMCP help with context window management specifically?
AliveMCP monitors the MCP server's availability and response time, not the content of tool responses — it cannot detect context overflow inside the agent loop. Its value for context window management is indirect but important: AliveMCP detects MCP server restarts within 60 seconds via protocol probe failure. A restart during an active session drops all in-memory session state (seen chunk deduplication, active context) and forces the agent to rebuild context from scratch — consuming extra context window tokens. Fast restart detection via AliveMCP alert lets you minimize restart duration and notify users before they experience unexplained "forgetting" from their agent assistant. Configure AliveMCP's webhook to trigger a graceful session recovery flow in your application layer.