RAG with MCP Servers — retrieval-augmented generation tool patterns
Retrieval-augmented generation turns an MCP server into the memory layer for AI agents — your tools fetch documents, embed queries, and assemble context so the LLM answers from facts, not hallucination. The architecture looks clean until a vector store connection pool saturates at 3 AM and the search_documents tool silently returns empty arrays that the LLM confidently describes as "no relevant results found." This guide covers the complete RAG-over-MCP implementation: chunking strategy, hybrid retrieval, reranking, context assembly — and how AliveMCP monitors RAG servers that degrade without dying.
TL;DR
Expose three MCP tools: search_documents (embed query → cosine search → return top-k chunks), index_document (chunk → embed → upsert to vector store), and list_sources (enumerate indexed document sources). For retrieval, combine BM25 keyword search with vector similarity and use a cross-encoder to rerank the merged result before returning to the LLM. Monitor with AliveMCP because RAG server failures produce wrong answers, not error responses — the tool returns HTTP 200 with empty results, the LLM fills in the gap with confabulation, and no alarm fires without proactive protocol-level monitoring.
Why MCP is the right layer for RAG
Before MCP, adding retrieval to an LLM application meant coupling the retrieval logic into the agent code: the application had to manage vector store credentials, embedding model API keys, chunking parameters, and the assembly of retrieved context into the prompt. Every agent that needed document search duplicated this logic.
MCP externalizes retrieval behind a tool boundary. The agent calls search_documents({ query: "how do I configure rate limiting?", top_k: 5 }) and receives structured chunks with metadata. The agent does not know or care whether retrieval runs on pgvector, Chroma, or Pinecone — it knows only the tool contract. This separation makes the retrieval layer independently deployable, independently testable, and independently monitored.
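On the wire, that call is a JSON-RPC 2.0 `tools/call` request. The sketch below shows roughly what the client sends and how an agent-side helper might unpack the answer. The `buildToolCall` and `parseSearchResult` helpers and the sample payload are illustrative assumptions, not part of any SDK.

```javascript
// Build the JSON-RPC 2.0 request an MCP client sends for a tool call.
function buildToolCall(id, name, args) {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Hypothetical helper: extract the chunks from a search_documents result,
// assuming the server returns one text content item holding JSON.
function parseSearchResult(result) {
  if (result.isError) throw new Error("search_documents reported an error");
  const item = result.content.find(c => c.type === "text");
  if (!item) throw new Error("no text content in tool result");
  return JSON.parse(item.text).results;
}

const request = buildToolCall(1, "search_documents", {
  query: "how do I configure rate limiting?",
  top_k: 5,
});

// Illustrative response payload, not real server output
const sampleResult = {
  content: [{
    type: "text",
    text: JSON.stringify({
      results: [{ text: "Set rate_limit.rps in config.", source: "docs/config", score: 0.91 }],
      total_results: 1,
    }),
  }],
};
const chunks = parseSearchResult(sampleResult);
```

Nothing in the request names a vector store or an embedding model, which is the point of the tool boundary: the backend can change without touching agent code.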
The practical benefits compound at scale. A single MCP RAG server can serve multiple agents — a support bot, a documentation assistant, a code review helper — all sharing the same indexed document corpus without each embedding its own retrieval logic. When the corpus changes (new documents, updated policies, deprecated API references), re-indexing happens in one place and all agents see the update immediately.
The three core MCP tools for a RAG server
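A full RAG MCP server exposes three tools. Only search_documents is strictly required; index_document and list_sources are worth adding when agents need them, and every tool beyond these three is optional.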
search_documents
The retrieval tool. Takes a natural-language query, embeds it, performs similarity search, and returns ranked chunks with source metadata. This is the tool agents call at inference time.
// Tool schema
{
  name: "search_documents",
  description: "Search the indexed document corpus for content relevant to a query. Returns ranked text chunks with source attribution.",
  inputSchema: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Natural-language search query"
      },
      top_k: {
        type: "integer",
        default: 5,
        minimum: 1,
        maximum: 20, // enforce the documented limit in the schema as well
        description: "Number of chunks to return (max 20)"
      },
      source_filter: {
        type: "string",
        description: "Optional: restrict search to a specific document source (e.g. 'docs/api-reference')"
      }
    },
    required: ["query"]
  }
}
// Implementation
async function searchDocuments({ query, top_k = 5, source_filter }) {
  const limit = Math.min(Math.max(top_k, 1), 20); // clamp to the documented maximum of 20
  const queryEmbedding = await embedText(query);
  const results = await vectorStore.query({
    vector: queryEmbedding,
    topK: limit * 2, // over-fetch for reranking
    namespace: "docs", // must match the namespace index_document upserts into
    filter: source_filter ? { source: source_filter } : undefined,
    includeMetadata: true,
    includeValues: false,
  });
  const reranked = await crossEncoderRerank(query, results.matches);
  const topK = reranked.slice(0, limit);
  return {
    content: [{
      type: "text",
      text: JSON.stringify({
        results: topK.map(m => ({
          text: m.metadata.text,
          source: m.metadata.source,
          score: m.score,
        })),
        query,
        total_results: topK.length,
      })
    }]
  };
}
index_document
The ingestion tool. Takes a document URL or raw text, chunks it, embeds each chunk, and upserts into the vector store. Used at setup time and when corpus content changes. Agents typically don't call this directly — an ingestion pipeline does. But exposing it as a tool lets authorized agents add documents dynamically.
async function indexDocument({ url, text, source_id, max_tokens = 200, overlap_sentences = 1 }) {
  if (!url && !text) throw new Error("index_document requires either url or text");
  const rawText = url ? await fetchUrl(url) : text;
  // chunkText takes a token budget and a sentence overlap, not character counts
  const chunks = chunkText(rawText, max_tokens, overlap_sentences);
  const embeddings = await embedBatch(chunks.map(c => c.text)); // batch API call
  const vectors = chunks.map((chunk, i) => ({
    id: `${source_id}-${i}`,
    values: embeddings[i],
    metadata: {
      text: chunk.text,
      source: source_id,
      chunk_index: i,
      char_start: chunk.start,
      char_end: chunk.end,
    }
  }));
  await vectorStore.upsert({ vectors, namespace: "docs" });
  return {
    content: [{
      type: "text",
      text: JSON.stringify({ indexed_chunks: chunks.length, source_id })
    }]
  };
}
list_sources
Enumerates the indexed document sources with chunk counts and last-indexed timestamps. Lets agents verify whether a specific document is in the corpus before issuing a targeted query. Also surfaces index staleness: if a source was last indexed 30 days ago and the underlying document changes frequently, the agent can warn users that results may be outdated.
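The guide shows no handler for list_sources, so here is a minimal sketch of one. It assumes each chunk's metadata carries an `indexed_at` millisecond timestamp (a field the index_document example does not write) and aggregates over an in-memory array; a real server would more likely run an aggregate query against its store or a side table. The `stale` flag and the 30-day threshold are illustrative.

```javascript
// Hypothetical in-memory view of chunk metadata: [{ source, indexed_at }]
function listSources(chunkMetadata, now = Date.now(), staleAfterDays = 30) {
  const bySource = new Map();
  for (const { source, indexed_at } of chunkMetadata) {
    const entry = bySource.get(source) || { source, chunk_count: 0, last_indexed: 0 };
    entry.chunk_count += 1;
    entry.last_indexed = Math.max(entry.last_indexed, indexed_at);
    bySource.set(source, entry);
  }
  const dayMs = 24 * 60 * 60 * 1000;
  const sources = [...bySource.values()].map(s => ({
    source: s.source,
    chunk_count: s.chunk_count,
    last_indexed: new Date(s.last_indexed).toISOString(),
    // Flag sources whose newest chunk is older than the threshold
    stale: now - s.last_indexed > staleAfterDays * dayMs,
  }));
  return { content: [{ type: "text", text: JSON.stringify({ sources }) }] };
}

// Example with fixed timestamps so the output is deterministic
const now = Date.UTC(2024, 5, 1);
const day = 24 * 60 * 60 * 1000;
const response = listSources([
  { source: "docs/api-reference", indexed_at: now - 2 * day },
  { source: "docs/api-reference", indexed_at: now - 1 * day },
  { source: "docs/policies", indexed_at: now - 45 * day },
], now);
const { sources } = JSON.parse(response.content[0].text);
```

Returning a precomputed `stale` flag saves the agent from doing date arithmetic, though the right threshold depends on how often each source actually changes.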
Chunking strategy
Chunking determines retrieval quality more than almost any other factor. Chunks too large dilute signal (the relevant sentence is buried in 2000 tokens of surrounding context). Chunks too small lose context (a sentence about "the retry count" doesn't know it refers to Redis connection retries).
| Strategy | Chunk size | Best for | Weakness |
|---|---|---|---|
| Fixed-character | 512 chars + 64 overlap | Any text, simple to implement | Splits mid-sentence, losing coherence |
| Sentence-boundary | 3–5 sentences per chunk | Prose documents, articles, policies | Variable chunk size complicates token budgets |
| Semantic | Topic-coherent paragraphs | Technical documentation with clear sections | Requires LLM or heuristic to detect topic shifts |
| Hierarchical | Parent doc + child chunks | Long documents where context matters | Double the storage; complex to implement |
For most MCP RAG servers serving technical documentation, the sentence-boundary strategy with 3–5 sentences per chunk and 1-sentence overlap works well. The implementation uses a sentence tokenizer (natural.SentenceTokenizer in Node.js, nltk.sent_tokenize in Python) to split at sentence boundaries, then groups sentences into chunks that stay under a max-token limit checked with tiktoken.
function chunkText(text, maxTokens = 200, overlapSentences = 1) {
  // Assumes tokenizeSentences returns [{ text, start, end }] with character offsets
  const sentences = tokenizeSentences(text);
  const chunks = [];
  let buffer = [];
  let bufferTokens = 0;
  // Emit the buffered sentences as one chunk, keeping the character span
  const flush = () => chunks.push({
    text: buffer.map(s => s.text).join(' '),
    start: buffer[0].start,
    end: buffer[buffer.length - 1].end,
  });
  for (const sentence of sentences) {
    const sentTokens = countTokens(sentence.text);
    if (bufferTokens + sentTokens > maxTokens && buffer.length > 0) {
      flush();
      // Keep last `overlapSentences` sentences for context continuity
      // (slice(-0) would keep the whole buffer, so handle zero explicitly)
      buffer = overlapSentences > 0 ? buffer.slice(-overlapSentences) : [];
      bufferTokens = buffer.reduce((sum, s) => sum + countTokens(s.text), 0);
    }
    buffer.push(sentence);
    bufferTokens += sentTokens;
  }
  if (buffer.length > 0) flush();
  return chunks;
}
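chunkText depends on two helpers the guide does not show: tokenizeSentences and countTokens. The sketch below is one rough way to fill them in, assuming sentences carry character offsets (as the start and end fields suggest). The regex splitter is naive and will mis-split abbreviations such as "e.g.", and the 4-characters-per-token estimate is only an approximation. For real token budgets, swap in a proper sentence tokenizer and tiktoken.

```javascript
// Naive sentence splitter: a sentence ends at ., !, or ? followed by
// whitespace or end of text. Returns [{ text, start, end }] with offsets
// into the original string.
function tokenizeSentences(text) {
  const sentences = [];
  const boundary = /[.!?]+(?=\s|$)/g;
  const pushSentence = (from, to) => {
    const raw = text.slice(from, to);
    const trimmed = raw.trim();
    if (!trimmed) return;
    const start = from + raw.indexOf(trimmed);
    sentences.push({ text: trimmed, start, end: start + trimmed.length });
  };
  let cursor = 0;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    const to = match.index + match[0].length;
    pushSentence(cursor, to);
    cursor = to;
  }
  pushSentence(cursor, text.length); // trailing text without a terminator
  return sentences;
}

// Rough token estimate (about 4 characters per token for English text).
// Accepts a string or a sentence object with a `text` field.
function countTokens(input) {
  const text = typeof input === "string" ? input : input.text;
  return Math.ceil(text.length / 4);
}

const sample = "Set the retry count to 3.5 seconds. It applies to Redis!  Done";
const sampleSentences = tokenizeSentences(sample);
```

The decimal point in "3.5" is not treated as a boundary because no whitespace follows it.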
Hybrid retrieval: BM25 + vector similarity
Pure vector search misses exact keyword matches. If a user queries "MCP initialize method timeout" and your documents contain the exact phrase "initialize() timeout", a keyword search finds it immediately — but a vector search may rank it lower than semantically related documents about general timeout handling. Hybrid retrieval combines both signals.
The standard implementation uses Reciprocal Rank Fusion (RRF) to merge two ranked lists: the vector search results (by cosine distance) and the keyword search results (by BM25 score).
async function hybridSearch(query, top_k) {
  // Run embedding + vector search and keyword search concurrently
  const [vectorResults, keywordResults] = await Promise.all([
    embedText(query).then(vector => vectorStore.query({ vector, topK: top_k * 2 })),
    fullTextIndex.search(query, { limit: top_k * 2 }), // SQLite FTS5 or Elasticsearch
  ]);
  // Reciprocal Rank Fusion: score = sum over lists of 1 / (k + rank), rank starting at 1
  const scores = new Map();
  const k = 60;
  const addRanks = (items) => items.forEach((item, index) => {
    const prev = scores.get(item.id) || 0;
    scores.set(item.id, prev + 1 / (k + index + 1));
  });
  addRanks(vectorResults.matches);
  addRanks(keywordResults);
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .slice(0, top_k * 2)
    .map(([id]) => getChunkById(id));
}
The k = 60 constant was empirically determined in the original RRF paper to reduce sensitivity to high-ranked outliers. Adjust it based on your corpus: lower values weight top ranks more heavily, higher values produce more balanced fusion.
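To see the effect of k concretely, the sketch below fuses two made-up rankings with k = 60 and with k = 1, using 1-based ranks as in the usual RRF formulation. The chunk IDs are illustrative. Chunk "a" is first in the vector list only, while chunk "b" is fourth in both lists.

```javascript
// Fuse any number of ranked ID lists with Reciprocal Rank Fusion.
function rrf(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()].sort(([, a], [, b]) => b - a);
}

const vectorRanking = ["a", "x", "y", "b"];
const keywordRanking = ["d", "p", "q", "b"];

// k = 60: "b" scores 2/64 ≈ 0.031, "a" scores 1/61 ≈ 0.016, so agreement wins
const balanced = rrf([vectorRanking, keywordRanking], 60);

// k = 1: "a" scores 1/2 = 0.5, "b" scores 2/5 = 0.4, so a single top rank wins
const topHeavy = rrf([vectorRanking, keywordRanking], 1);
```

With the larger constant, a chunk both retrievers agree on beats a chunk that only one retriever ranks first. With the small constant the ordering flips.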
Cross-encoder reranking
Vector similarity comes from a bi-encoder: the query and each document are embedded independently and the two vectors are compared afterwards. BM25 only weighs term overlap. A cross-encoder jointly encodes the query and each candidate document in one forward pass, producing a relevance score that captures token-level interaction between them, which is generally a stronger signal than bi-encoder similarity.
The cost is latency: a cross-encoder needs a full transformer forward pass for every (query, document) pair at query time, while vector search queries a precomputed index. The practical pattern is to over-fetch (top-20 from hybrid retrieval) and rerank to top-5 using the cross-encoder, paying the cross-encoder cost only on a small candidate set.
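// Cross-encoder scoring in Node.js with @xenova/transformers.
// Assumes the ONNX export Xenova/ms-marco-MiniLM-L-6-v2. The model is called directly because
// the text-classification pipeline softmaxes this model's single logit, which flattens every score to 1.
import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-MiniLM-L-6-v2');
const model = await AutoModelForSequenceClassification.from_pretrained('Xenova/ms-marco-MiniLM-L-6-v2');
async function crossEncoderRerank(query, candidates) {
  if (candidates.length === 0) return [];
  const passages = candidates.map(c => c.metadata.text);
  // Tokenize (query, passage) pairs as one batch
  const inputs = tokenizer(passages.map(() => query), { text_pair: passages, padding: true, truncation: true });
  const { logits } = await model(inputs);
  const scores = Array.from(logits.data); // one raw relevance logit per pair; higher is more relevant
  return candidates
    .map((c, i) => ({ ...c, rerank_score: scores[i] }))
    .sort((a, b) => b.rerank_score - a.rerank_score);
}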
For production use, ms-marco-MiniLM-L-6-v2 runs inference in ~10–30ms per pair on CPU for short passages. A reranking pass over 20 candidates takes 200–600ms total. If this latency is too high for your use case, skip cross-encoding and rely on RRF — the quality difference is measurable but not dramatic for typical documentation retrieval tasks.
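One way to act on that trade-off without removing the reranker entirely is to give it a latency budget and fall back to the incoming order (for example the RRF order) when the budget is exceeded. The sketch below is an assumption-laden illustration: `rerankWithBudget` and its `budgetMs` parameter are hypothetical, and `rerankFn` stands in for a function shaped like crossEncoderRerank.

```javascript
// Race the reranker against a timer. On timeout or error, keep the order
// the candidates arrived in and report that no reranking happened.
async function rerankWithBudget(query, candidates, rerankFn, budgetMs = 500) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve(null), budgetMs);
  });
  try {
    const reranked = await Promise.race([rerankFn(query, candidates), timeout]);
    return reranked
      ? { results: reranked, reranked: true }
      : { results: candidates, reranked: false };
  } catch (err) {
    // Reranker failure (for example an out-of-memory error) degrades to the fused order
    return { results: candidates, reranked: false };
  } finally {
    clearTimeout(timer);
  }
}
```

Note that the timed-out reranker call is not cancelled and keeps consuming CPU until it finishes. Surfacing the `reranked` flag in the tool response or in metrics makes the fallback visible instead of silent.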
Context assembly and token budget
The retrieved chunks become the text field in the tool response. How you format them determines how much of the LLM's context window they consume and how coherently the model uses them.
function assembleContext(chunks, tokenBudget = 2048) {
  const parts = [];
  let usedTokens = 0;
  for (const chunk of chunks) {
    const formatted = `[Source: ${chunk.source}]\n${chunk.text}\n`;
    const tokens = countTokens(formatted);
    if (usedTokens + tokens > tokenBudget) break;
    parts.push(formatted);
    usedTokens += tokens;
  }
  return {
    context: parts.join('\n---\n'),
    chunks_included: parts.length,
    tokens_used: usedTokens,
    chunks_truncated: chunks.length - parts.length,
  };
}
Include the source citation inline so the LLM can reference it in its response. A chunk that begins with [Source: docs/api-reference/initialize.md] gives the model accurate attribution material without requiring a separate citation tool call.
Why RAG servers fail silently — and what AliveMCP detects
A RAG MCP server's failure modes are different from a typical API server. When a REST API fails, it returns a 4xx or 5xx status code and the client knows something is wrong. When a RAG MCP tool fails softly, it returns HTTP 200 with results: [] and a total_results: 0 payload. The MCP client's agent sees a successful tool call. The LLM receives empty results and produces a response like "I couldn't find any documentation on that topic" — or worse, fills in the gap with confident confabulation.
| Failure mode | What the tool returns | What the LLM does | AliveMCP detects via |
|---|---|---|---|
| Vector store connection pool exhausted | HTTP 200, results: [] | Hallucinated answer with false confidence | P95 response time spike → alert |
| Embedding API (OpenAI) rate-limited | HTTP 200, results: [] (if error swallowed) | No retrieval; model works from training data | failure_reason: external_api_failure |
| MCP server process dead | Connection refused | Tool call fails, agent loop breaks | failure_reason: connection_refused → P1 |
| Index stale (documents changed, not re-indexed) | HTTP 200, results: [outdated chunks] | Answers from stale facts | Application-level health probe (custom check URL) |
| Reranker model OOM | HTTP 500 or timeout | Tool call error, agent retries | failure_reason: timeout or protocol_error |
AliveMCP's external protocol probe sends a valid MCP initialize handshake followed by a tools/list call every 60 seconds. This detects process death, protocol errors, and response time degradation. It does not detect semantic quality degradation (stale index, empty results for valid queries) — that requires application-level monitoring, typically a canary query with a known good answer that you verify periodically.
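For reference, the handshake described above corresponds to a short sequence of JSON-RPC 2.0 messages. The sketch below builds that sequence and classifies a tools/list response. It illustrates the MCP message shapes only and makes no claim about how AliveMCP implements its probe. The client name, the protocol version string, and the `missing_tools` reason are assumptions.

```javascript
// JSON-RPC 2.0 messages for a minimal MCP liveness probe.
function buildProbeMessages(protocolVersion = "2024-11-05") {
  return [
    {
      jsonrpc: "2.0", id: 1, method: "initialize",
      params: {
        protocolVersion,
        capabilities: {},
        clientInfo: { name: "example-probe", version: "0.1.0" }, // illustrative values
      },
    },
    // Notifications carry no id and expect no response
    { jsonrpc: "2.0", method: "notifications/initialized" },
    { jsonrpc: "2.0", id: 2, method: "tools/list" },
  ];
}

// Healthy only if the server answers without error and advertises every expected tool.
function checkToolsList(response, expectedTools) {
  if (response.error) return { ok: false, reason: "protocol_error" };
  const names = (response.result?.tools || []).map(t => t.name);
  const missing = expectedTools.filter(name => !names.includes(name));
  return missing.length === 0
    ? { ok: true }
    : { ok: false, reason: "missing_tools", missing }; // hypothetical reason code
}

const messages = buildProbeMessages();
const healthy = checkToolsList(
  { jsonrpc: "2.0", id: 2, result: { tools: [{ name: "search_documents" }] } },
  ["search_documents"]
);
```

A probe at this level confirms that the server speaks the protocol and still advertises search_documents. It says nothing about whether that tool returns useful chunks.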
Configure a custom health check URL in AliveMCP that calls your search_documents tool with a query you know should return at least one result. If total_results is 0, return HTTP 503 from your /health endpoint — AliveMCP will detect this as a failure and page you. This turns semantic degradation into a detectable failure mode.
// Health endpoint that checks retrieval quality
app.get('/health', async (req, res) => {
  try {
    // Process liveness
    const storeOk = await vectorStore.ping();
    if (!storeOk) return res.status(503).json({ status: 'unhealthy', reason: 'vector_store_unreachable' });
    // Semantic health: canary query must return results
    const canaryResult = await searchDocuments({ query: 'MCP server health check', top_k: 1 });
    const results = JSON.parse(canaryResult.content[0].text);
    if (results.total_results === 0) {
      return res.status(503).json({ status: 'degraded', reason: 'index_empty_or_stale' });
    }
    res.json({ status: 'ok', chunks_returned: results.total_results });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', reason: err.message });
  }
});
Python implementation with Chroma
For Python MCP servers using the mcp SDK with Chroma as the vector store:
import asyncio
import json
import chromadb
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)
server = Server("rag-server")
@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="search_documents",
            description="Search indexed documents by semantic similarity",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "top_k": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        )
    ]
@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name != "search_documents":
        raise ValueError(f"Unknown tool: {name}")
    query = arguments["query"]
    top_k = arguments.get("top_k", 5)
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "distances", "metadatas"]
    )
    chunks = []
    for doc, dist, meta in zip(
        results["documents"][0],
        results["distances"][0],
        results["metadatas"][0]
    ):
        chunks.append({
            "text": doc,
            "source": (meta or {}).get("source", "unknown"),
            "score": 1 - dist,  # Chroma cosine returns distance, not similarity
        })
    # json.dumps rather than str(): str() emits a Python repr that is not valid JSON
    payload = {"results": chunks, "total_results": len(chunks)}
    return [TextContent(type="text", text=json.dumps(payload))]
async def main():
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())
if __name__ == "__main__":
    asyncio.run(main())
Frequently asked questions
How many MCP tools should a RAG server expose?
Start with one: search_documents. Add index_document only if agents need to add documents dynamically at runtime — otherwise run ingestion out-of-band via a script. Add list_sources if agents need to enumerate the corpus to choose which source to query. Resist the temptation to expose individual steps like embed_query or rerank_results as separate tools — this forces the agent to orchestrate retrieval steps, adding latency and context consumption. The agent should call one tool and receive ranked text chunks; the MCP server handles the rest internally.
What embedding model should I use for a new RAG MCP server?
For hosted deployment, start with OpenAI text-embedding-3-small (1536 dimensions, $0.02 per million tokens). It outperforms text-embedding-ada-002 on most benchmarks at lower cost. For local/offline deployment, BAAI/bge-small-en-v1.5 (384 dimensions) runs efficiently on CPU and scores competitively on the MTEB retrieval benchmark. The critical constraint is consistency: once you embed your corpus with a specific model, all queries must use the same model — switching models requires re-embedding the entire corpus.
How do I handle documents that are too long to chunk meaningfully?
For documents over 50,000 characters (roughly 40 pages), use a hierarchical chunking strategy: create one summary chunk for the whole document (generated by a summarization call) and smaller detail chunks for each section. The agent first queries the summary chunks to identify relevant documents, then queries detail chunks within those documents. This two-stage retrieval pattern reduces noise from large corpora and avoids retrieving detail chunks from irrelevant documents. Store the document-level summary chunk with a type: "summary" metadata field and restrict summary-level queries with a metadata filter on that field. The source_filter parameter shown earlier matches only the source field, so this needs an additional filter parameter or a separate summary collection.
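The two-stage pattern can be sketched as a small wrapper around whatever search backend the server uses. Here `search` is a hypothetical function `(query, filter, topK)` that resolves to chunks shaped like `{ text, source, type, score }`. The `type: "detail"` label and the option names are assumptions made for illustration.

```javascript
// Stage 1: find relevant documents via their summary chunks.
// Stage 2: search detail chunks only inside those documents.
async function twoStageSearch(search, query, { topDocs = 3, topK = 5 } = {}) {
  const summaries = await search(query, { type: "summary" }, topDocs);
  const sources = [...new Set(summaries.map(s => s.source))];
  if (sources.length === 0) return [];
  const perSource = await Promise.all(
    sources.map(source => search(query, { type: "detail", source }, topK))
  );
  // Merge the per-document results and keep the best-scoring chunks overall
  return perSource.flat().sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Merging by score assumes every stage-two query uses the same embedding model and distance metric, so that scores are comparable across documents.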
Why does my RAG server return good results in testing but poor results in production?
The most common cause is query distribution shift: your test queries share the vocabulary of the indexed documents (same technical terms, same phrasing), while production queries use the user's vocabulary (paraphrases, abbreviations, different conceptual framing). Hybrid retrieval (BM25 + vector) is more robust here than either method alone: vector similarity bridges paraphrases, while BM25 catches exact identifiers and keywords that vector similarity can rank too low. A second cause is corpus staleness: production documents update faster than the indexing pipeline runs. Add a last_indexed timestamp to each source's metadata and surface it in list_sources so agents can warn users when results may be outdated.
How does AliveMCP monitor a RAG server differently from a regular MCP server?
The baseline monitoring is the same: AliveMCP probes the MCP protocol layer (initialize handshake + tools/list) every 60 seconds and alerts on connection refused, protocol errors, and response time P95 spikes. What's different for RAG servers is the application-layer check: configure your /health endpoint to run a canary query against your actual vector store and return HTTP 503 if total_results is 0. Point AliveMCP's custom health check URL at this endpoint. This bridges the gap between "MCP server is running" (protocol layer) and "MCP server is retrieving correctly" (semantic layer) — the failure mode that produces wrong answers instead of error responses.
Further reading
- Vector Search MCP Tools — pgvector, Chroma, and Pinecone integration
- Embedding Tools in MCP Servers — generate and store vectors via MCP
- MCP Server Context Window Management — token budget and chunking
- Semantic Caching for MCP Servers — reduce latency for similar queries
- MCP Server Observability — metrics, tracing, and structured logging
- MCP Server Health Checks — liveness, readiness, and custom probes