MCP server tools with Ollama
Ollama runs local LLMs (Llama 3.x, Qwen 2.5, Mistral, Phi-4, and others) and serves them through a REST API at http://localhost:11434, with OpenAI-compatible endpoints under /v1. Ollama doesn't natively speak MCP, but it does speak OpenAI's function-calling format. You can therefore connect MCP servers to Ollama with the same adapter pattern as any OpenAI-compatible API: fetch MCP tools via tools/list, convert each tool's inputSchema to OpenAI's function format, point the OpenAI Python client at localhost:11434/v1, and dispatch function calls to the MCP server in a loop. The pattern works well, but tool-calling quality depends heavily on model choice: not all Ollama models are equally capable, and some silently ignore the tools they are offered. Choose a model and verify its tool calling before building any MCP integration on top of Ollama.
TL;DR
Use openai.AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama"). Convert MCP tool inputSchema to OpenAI's {"type": "function", "function": {"name": ..., "description": ..., "parameters": ...}} format. Run the standard OpenAI tool-calling loop. Recommended models: llama3.1:8b, llama3.2:3b, qwen2.5:7b, mistral-nemo, command-r. Verify tool support before building: models that claim tool support sometimes silently return plain text instead of calling tools. Monitor the MCP servers your Ollama setup calls with AliveMCP — local LLM deployments often run unattended and MCP server failures surface as silent agent degradation.
Ollama's OpenAI-compatible API
Ollama exposes its API at http://localhost:11434/v1 with OpenAI-compatible endpoints. The openai Python package works out-of-the-box by pointing base_url at Ollama:
import asyncio

import openai

# Client for Ollama's OpenAI-compatible endpoint
ollama_client = openai.AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Placeholder: Ollama doesn't require a real API key
)

# Test the connection
async def test_ollama():
    response = await ollama_client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": "Hello from MCP!"}],
    )
    print(response.choices[0].message.content)

asyncio.run(test_ollama())
Ollama also exposes its own native API at http://localhost:11434/api with a slightly different interface. Use the /v1 OpenAI-compatible endpoint for MCP integration: it is the format that OpenAI-style MCP adapter code targets, and it lets the same code run against both Ollama and cloud-hosted OpenAI-compatible APIs (Groq, Mistral, Together, etc.) by changing only the base URL, API key, and model name.
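Because these providers share the /v1 request format, switching between them can be reduced to configuration. The following is a minimal sketch, not a prescribed layout: the PROVIDERS table, the client_kwargs helper, and the GROQ_API_KEY variable name are hypothetical, and the Groq base URL is an assumption you should confirm against that provider's documentation.

```python
import os
from typing import Mapping, Optional

# Hypothetical provider table. The Ollama entry comes from this guide;
# the Groq URL is an assumption to verify against Groq's own docs.
PROVIDERS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key_env": None},
    "groq": {"base_url": "https://api.groq.com/openai/v1", "api_key_env": "GROQ_API_KEY"},
}


def client_kwargs(provider: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments for openai.AsyncOpenAI for the given provider."""
    env = os.environ if env is None else env
    cfg = PROVIDERS[provider]
    if cfg["api_key_env"] is None:
        api_key = "ollama"  # placeholder value; Ollama does not check it
    else:
        api_key = env[cfg["api_key_env"]]  # raises KeyError if the key is unset
    return {"base_url": cfg["base_url"], "api_key": api_key}
```

With the openai package installed, openai.AsyncOpenAI(**client_kwargs("ollama")) would then build the same client as the snippet above, and only the provider name changes when moving to a hosted API.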
Tool-capable Ollama models
Not all Ollama models support function calling. The model's Modelfile and the underlying model weights determine whether tool calling works. Some models include tool-calling training in their weights but produce unreliable results with MCP's JSON schemas.
| Model | Tool calling | Notes |
|---|---|---|
| llama3.1:8b | Reliable | Best general-purpose tool caller for Ollama; 8B fits on most hardware |
| llama3.1:70b | Reliable | Stronger reasoning; requires 48 GB+ VRAM |
| llama3.2:3b | Good | Fast on CPU; tool support is lighter than 3.1 series |
| qwen2.5:7b | Reliable | Strong on structured output; good tool calling fidelity |
| qwen2.5:72b | Excellent | Near-GPT-4o quality tool calling; large hardware requirement |
| mistral-nemo | Good | Mistral's small model; solid tool calling for constrained tasks |
| command-r | Good | Cohere's retrieval-focused model; built for tool-augmented RAG |
| phi4 | Variable | Strong reasoning but inconsistent tool format compliance |
| gemma2:9b | Limited | Reports tool support; often returns plain text instead |
Always verify tool calling with a simple test before building an MCP integration on a new model. The verification test in the next section catches silent failures early.
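Before running a behavioral test, it can help to check what a model advertises. The sketch below assumes a recent Ollama version whose native /api/show response includes a "capabilities" list (for example ["completion", "tools"]); older versions may omit the field, in which case the check simply returns False. The helper names are hypothetical.

```python
import json
import urllib.request


def has_tools_capability(show_response: dict) -> bool:
    """True if an /api/show response lists "tools" among its capabilities."""
    return "tools" in show_response.get("capabilities", [])


def fetch_show(model: str, host: str = "http://localhost:11434") -> dict:
    """POST to Ollama's native /api/show endpoint and return the parsed JSON.

    Assumes a running Ollama server; supplying `data` makes urllib send a POST.
    """
    request = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A call such as has_tools_capability(fetch_show("llama3.1:8b")) is only a pre-filter. As the table notes, a model can report tool support and still return plain text, so a behavioral test remains necessary.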
Verifying tool calling before building
import asyncio

import openai

ollama_client = openai.AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Minimal test tool to verify function calling works
TEST_TOOL = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Returns the current time",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

async def verify_tool_calling(model: str) -> bool:
    response = await ollama_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What time is it right now?"}],
        tools=TEST_TOOL,
        tool_choice="required",  # Ask the server to force a tool call
    )
    choice = response.choices[0]
    has_tool_calls = bool(choice.message.tool_calls)
    print(f"{model}: {'PASS' if has_tool_calls else 'FAIL (returned plain text)'}")
    return has_tool_calls

asyncio.run(verify_tool_calling("llama3.1:8b"))
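Use tool_choice="required" in the verification test to force the model to call a tool. If the model still returns plain text, its tool calling is broken or was never trained into the weights; do not build an MCP integration on top of it. With tool_choice="auto" (the default), models that don't support tools silently skip tool calls, so failures stay invisible until you are debugging production behavior. One caveat: Ollama's OpenAI-compatibility documentation has listed tool_choice as an unsupported request field, so some versions may accept the parameter and ignore it. The test remains useful in that case, because the prompt cannot be answered without the tool, but a FAIL then reflects the model's own choice rather than a refusal to honor a forced call.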
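Since a model that passes once can still fall back to plain text on another run, a single PASS is weak evidence. One way to firm it up, sketched here under the assumption that the check is an async function taking a model name and returning a bool (as verify_tool_calling does), is to repeat it and report a pass rate. The helper name and the default of five attempts are arbitrary choices for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def tool_call_pass_rate(
    check: Callable[[str], Awaitable[bool]],
    model: str,
    attempts: int = 5,
) -> float:
    """Run check(model) several times and return the fraction that passed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    passes = 0
    for _ in range(attempts):
        try:
            if await check(model):
                passes += 1
        except Exception:
            # Count API or transport errors as failed attempts
            pass
    return passes / attempts
```

For example, asyncio.run(tool_call_pass_rate(verify_tool_calling, "llama3.1:8b")) would return a value between 0.0 and 1.0. What threshold counts as acceptable is a judgment call for your workload.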
MCP to OpenAI function format conversion
from mcp import ClientSession
# Streamable HTTP transport from the MCP Python SDK
from mcp.client.streamable_http import streamablehttp_client

async def get_ollama_tools(mcp_url: str) -> tuple[list, dict]:
    """Fetch MCP tools and convert to OpenAI function format.

    Returns (tool_specs, tool_map) where tool_map maps name → MCP tool.
    """
    # streamablehttp_client yields (read, write, get_session_id)
    async with streamablehttp_client(mcp_url) as (read, write, _get_session_id):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()

    tool_specs = []
    tool_map = {}
    for tool in result.tools:
        spec = {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description or f"Call {tool.name}",
                # MCP inputSchema is already JSON Schema, so use it directly
                "parameters": tool.inputSchema or {
                    "type": "object",
                    "properties": {},
                },
            },
        }
        tool_specs.append(spec)
        tool_map[tool.name] = tool
    return tool_specs, tool_map
Small local models are more sensitive to tool schema quality than large cloud models. Keep inputSchema flat: avoid nesting objects more than one level deep. Write description fields that are clear and specific: "Search the web for recent information on a topic" outperforms "Search tool". Vague descriptions and deeply nested schemas cause small Ollama models to pick the wrong tool or hallucinate tool arguments more often.
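These guidelines can be checked mechanically before a tool list is sent to a model. The sketch below is one possible lint pass, not part of any SDK: schema_depth and lint_tool are hypothetical helpers, and the thresholds (at most two object levels, at least four words of description) are assumptions chosen to match the advice above.

```python
def schema_depth(schema) -> int:
    """Count levels of nested objects in a JSON Schema.

    A flat object whose properties are all scalars has depth 1;
    an object with one nested object property has depth 2.
    """
    if not isinstance(schema, dict):
        return 0
    if isinstance(schema.get("items"), dict):
        # Arrays add no level themselves; look at what they contain
        return schema_depth(schema["items"])
    properties = schema.get("properties")
    if not isinstance(properties, dict):
        return 0
    return 1 + max((schema_depth(p) for p in properties.values()), default=0)


def lint_tool(name: str, description: str, input_schema: dict,
              max_depth: int = 2, min_description_words: int = 4) -> list:
    """Return a list of human-readable warnings for one tool definition."""
    warnings = []
    depth = schema_depth(input_schema)
    if depth > max_depth:
        warnings.append(f"{name}: schema nests {depth} object levels (max {max_depth})")
    if len((description or "").split()) < min_description_words:
        warnings.append(f"{name}: description is too short to guide tool choice")
    return warnings
```

Running lint_tool over each entry of result.tools (passing tool.name, tool.description, and tool.inputSchema) gives a quick list of tools worth simplifying before testing them against a small model.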
The MCP tool-calling loop with Ollama
import asyncio
import json

import openai
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_URL = "https://search.internal/mcp"

ollama_client = openai.AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

async def run_ollama_agent(question: str, model: str = "llama3.1:8b") -> str:
    # streamablehttp_client yields (read, write, get_session_id)
    async with streamablehttp_client(MCP_URL) as (read, write, _get_session_id):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List tools on the open session and convert them to OpenAI format
            listed = await session.list_tools()
            tool_specs = [
                {
                    "type": "function",
                    "function": {
                        "name": tool.name,
                        "description": tool.description or f"Call {tool.name}",
                        "parameters": tool.inputSchema
                        or {"type": "object", "properties": {}},
                    },
                }
                for tool in listed.tools
            ]
            # Only send tool parameters when the server exposes tools
            tool_kwargs = (
                {"tools": tool_specs, "tool_choice": "auto"} if tool_specs else {}
            )

            messages = [{"role": "user", "content": question}]
            max_iterations = 10
            for _ in range(max_iterations):
                response = await ollama_client.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=0,
                    **tool_kwargs,
                )
                message = response.choices[0].message

                # Record the assistant turn as plain dicts the API can serialize
                assistant_turn = {"role": "assistant", "content": message.content or ""}
                if message.tool_calls:
                    assistant_turn["tool_calls"] = [
                        tc.model_dump() for tc in message.tool_calls
                    ]
                messages.append(assistant_turn)

                if not message.tool_calls:
                    # No tool calls: this is the final answer
                    return message.content or ""

                # Dispatch tool calls to the MCP server sequentially;
                # Ollama models rarely request multiple tools per turn
                for tc in message.tool_calls:
                    fn = tc.function
                    args = json.loads(fn.arguments) if fn.arguments else {}
                    mcp_result = await session.call_tool(fn.name, arguments=args)
                    result_text = "\n".join(
                        c.text for c in mcp_result.content if hasattr(c, "text")
                    )
                    if mcp_result.isError:
                        result_text = f"Error: {result_text}"
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": result_text,
                    })

            return "Agent reached maximum iterations without a final answer."

print(asyncio.run(run_ollama_agent("Which MCP endpoints have been down this week?")))
Unlike Gemini, Ollama models typically generate one tool call per turn, so the loop processes tool calls sequentially. This is simpler and rarely a bottleneck, because local inference time (roughly 0.5–5 seconds per turn on a GPU, and much longer on CPU) dwarfs the MCP round-trip. The max_iterations guard is important: small local models occasionally loop on tool errors rather than admitting they can't answer.
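Small models also sometimes emit malformed argument JSON or name a tool that does not exist, and an unguarded json.loads would crash the loop in that case. A minimal sketch of a defensive parser follows; parse_tool_arguments is a hypothetical helper, and returning the error text to the model as the tool result is one design option among several.

```python
import json


def parse_tool_arguments(raw, known_tools, name):
    """Validate a model-issued tool call.

    Returns (args, None) on success or (None, error_text) on failure.
    """
    if name not in known_tools:
        return None, f"Error: unknown tool '{name}'. Available: {sorted(known_tools)}"
    if not raw:
        return {}, None  # no arguments supplied is valid for zero-argument tools
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Error: arguments were not valid JSON ({exc.msg})"
    if not isinstance(args, dict):
        return None, "Error: arguments must be a JSON object"
    return args, None
```

Inside the dispatch loop, a non-None error can be appended as the content of the "tool" message instead of calling the MCP server, which gives the model a chance to correct itself while max_iterations still bounds the retries.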
Latency profile: local inference vs MCP round-trip
Ollama's inference time depends on hardware. The MCP server round-trip behaves like any other network call: about 5–20 ms for a local server and up to roughly 200 ms over a slower network. The ratio of inference time to tool call time determines where to optimize:
| Hardware | Inference (llama3.1:8b, ~500 tokens) | MCP call (local server) | Bottleneck |
|---|---|---|---|
| Apple M3 Pro (GPU) | 1–2 s | 5–20 ms | Inference |
| CPU only (8-core) | 10–30 s | 5–20 ms | Inference |
| NVIDIA RTX 4090 (GPU) | 0.5–1 s | 5–20 ms | Inference |
| CPU only (cheap VPS) | 60–120 s | 5–200 ms | Inference (heavily) |
In local Ollama setups, inference is almost always the bottleneck and MCP tool calls are fast. The implication: tool-call parallelism (like Gemini's multi-call architecture) provides minimal benefit for Ollama, because dispatching calls concurrently only shortens the MCP portion of each turn, which is already a tiny fraction of the total. Optimize inference first (quantization, GPU offloading, a smaller model) before tuning MCP call patterns.
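The arithmetic behind that conclusion is simple enough to sketch. The function below is illustrative only; the sample figures are taken from the ranges in the table above, and the turn and tool-call counts are assumptions.

```python
def latency_breakdown(inference_s: float, mcp_call_s: float,
                      turns: int, tool_calls: int) -> dict:
    """Split an agent run's wall-clock time into inference and MCP shares."""
    total_inference = inference_s * turns
    total_mcp = mcp_call_s * tool_calls
    total = total_inference + total_mcp
    if total == 0:
        return {"total_s": 0.0, "inference_share": 0.0, "mcp_share": 0.0}
    return {
        "total_s": total,
        "inference_share": total_inference / total,
        "mcp_share": total_mcp / total,
    }


# Example: M3 Pro-class hardware, 3 model turns, 2 tool calls at 20 ms each
breakdown = latency_breakdown(inference_s=1.5, mcp_call_s=0.02, turns=3, tool_calls=2)
print(f"total {breakdown['total_s']:.2f} s, MCP share {breakdown['mcp_share']:.1%}")
```

With these inputs the MCP calls account for under 1% of the run, so even eliminating them entirely would not change the perceived latency.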
Monitoring MCP servers in Ollama deployments
Local Ollama deployments often run unattended — a developer sets up a local agent, and it runs periodically without anyone actively watching it. When an MCP server the agent depends on goes down, the failure is silent: the agent runs, finds no tools available or gets errors, and either returns a degraded response or stops working entirely. Without external monitoring, the failure isn't noticed until someone manually checks the agent's output.
This failure pattern is common even when the MCP server is remote and the Ollama instance is local. The MCP server might be a cloud-hosted SaaS tool (a search API, a database connector, a webhook service), and any of these can go down independently of the local Ollama instance. AliveMCP monitors the MCP server's initialize endpoint continuously and alerts on failure — so even in a local development setup or an unattended automated agent, you know when the MCP server your Ollama agent depends on becomes unavailable. Set alerts to a Slack channel or email so you're notified before investigating why the agent is producing degraded output.
Frequently asked questions
Can I use Ollama with MCP servers over stdio transport?
Yes. The transport (HTTP/SSE vs stdio) is entirely on the MCP client side — Ollama doesn't care how you call your MCP server. Use the MCP Python SDK's stdio_client() to manage a stdio subprocess, then convert its tools to OpenAI format and use the Ollama client the same way. Stdio MCP servers are common for local tools (filesystem access, local database queries, running shell commands) that don't need to be network-accessible.
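A minimal sketch of the stdio variant, assuming the MCP Python SDK is installed and using its stdio_client and StdioServerParameters. The helper names are hypothetical, and the SDK imports are deferred into the function so that the converter can be used on its own.

```python
def to_openai_spec(tool) -> dict:
    """Convert one MCP tool object to OpenAI's function-tool format."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or f"Call {tool.name}",
            "parameters": tool.inputSchema or {"type": "object", "properties": {}},
        },
    }


async def list_stdio_tools(command: str, args: list) -> list:
    """Launch an MCP server as a subprocess and return its tools as OpenAI specs."""
    # Imported here so to_openai_spec works without the MCP SDK installed
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()
            return [to_openai_spec(tool) for tool in result.tools]
```

For example, asyncio.run(list_stdio_tools("python", ["my_server.py"])) would list the tools of a hypothetical local server script. The subprocess exits when the context managers close, so an agent loop that needs to call tools should run inside the same session rather than after this function returns.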
Does Ollama support parallel tool calls like GPT-4o or Gemini?
Ollama's OpenAI-compatible API can return multiple tool calls per response turn (the tool_calls array can have more than one entry). Whether a specific model actually generates multiple tool calls depends on its training. In practice, Llama 3.1, Qwen 2.5, and Mistral models rarely generate multiple parallel tool calls — they tend to work sequentially. This is fine for most Ollama use cases where inference speed is the dominant latency factor.
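If a model does return several entries in tool_calls, they can be dispatched concurrently with asyncio.gather. This is a sketch under stated assumptions: call_tool is a hypothetical async callable taking (name, args) and returning text, and the MCP server and session are assumed to tolerate overlapping requests.

```python
import asyncio
import json


async def dispatch_concurrently(call_tool, tool_calls) -> list:
    """Run every tool call at once and return "tool" messages in request order."""

    async def run_one(tc):
        try:
            args = json.loads(tc.function.arguments) if tc.function.arguments else {}
            content = await call_tool(tc.function.name, args)
        except Exception as exc:
            # Report the failure to the model instead of aborting the whole turn
            content = f"Error: {exc}"
        return {"role": "tool", "tool_call_id": tc.id, "content": content}

    # gather preserves the order of its inputs regardless of completion order
    return list(await asyncio.gather(*(run_one(tc) for tc in tool_calls)))
```

The returned messages can be appended to the conversation in one step with messages.extend(...). Given that inference dominates latency in local setups, the gain is usually small, so the sequential version remains a reasonable default.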
Why does my Ollama model sometimes return plain text instead of calling tools?
This happens for several reasons: (1) the model's tool calling training doesn't fire for the specific prompt phrasing — try making the user message more directly actionable ("Search for X" rather than "I wonder about X"); (2) the tool schema is complex and confuses the model — simplify the inputSchema; (3) the model is too small for reliable tool calling — switch to a larger model; (4) tool_choice is set to "auto" and the model decides it doesn't need tools — use "required" to force tool use when debugging.
Can I run Ollama on a server and connect remote MCP servers to it?
Yes. Start Ollama with OLLAMA_HOST=0.0.0.0:11434 to accept external connections, or use a reverse proxy for HTTPS. Your MCP integration code runs wherever it can reach both ollama_host:11434 and the MCP server's endpoint — typically on the same server as Ollama, or on a client machine. For remote deployments, ensure the Ollama host is not publicly accessible without authentication, as the API has no built-in auth.
How do I choose between Ollama and a cloud API (OpenAI, Groq) for MCP integration?
Use Ollama when: data privacy requires local processing; you have hardware and want zero per-token costs; offline operation is required; or you're prototyping and want to avoid API spend. Use cloud APIs when: inference quality is paramount; you need reliable tool calling without model tuning; fast inference speed is required (Groq is 5–10x faster than local Ollama on good hardware); or you can't provision the GPU hardware that large models require. A common pattern: develop with Ollama locally, deploy with Groq or OpenAI in production.
Further reading
- MCP servers with Groq — OpenAI-compatible ultra-fast inference
- MCP servers with OpenAI Agents SDK — native MCPServerHTTP integration
- MCP server tool design — flat schemas for better LLM accuracy
- MCP server stdio transport — local process tool servers
- Open-source MCP monitoring — AliveMCP public dashboard and self-hosted options
- AliveMCP — continuous protocol monitoring for MCP servers