MCP server tools with Ollama

Ollama runs local LLMs — Llama 3.x, Qwen 2.5, Mistral, Phi-4, and others — through an OpenAI-compatible REST API at http://localhost:11434. Ollama doesn't natively speak MCP, but it speaks OpenAI's function-calling format. This means you can connect MCP servers to Ollama using the same adapter pattern as any OpenAI-compatible API: fetch MCP tools via tools/list, convert their inputSchema to OpenAI's function format, point the OpenAI Python client at localhost:11434, and dispatch function calls to the MCP server in a loop. The pattern works well, but tool calling quality depends heavily on model choice — not all Ollama models are equally capable, and some silently ignore tool calls entirely. Model selection and capability verification are essential first steps before building any MCP integration on top of Ollama.

TL;DR

Use openai.AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama"). Convert MCP tool inputSchema to OpenAI's {"type": "function", "function": {"name": ..., "description": ..., "parameters": ...}} format. Run the standard OpenAI tool-calling loop. Recommended models: llama3.1:8b, llama3.2:3b, qwen2.5:7b, mistral-nemo, command-r. Verify tool support before building: models that claim tool support sometimes silently return plain text instead of calling tools. Monitor the MCP servers your Ollama setup calls with AliveMCP — local LLM deployments often run unattended and MCP server failures surface as silent agent degradation.

Ollama's OpenAI-compatible API

Ollama exposes its API at http://localhost:11434/v1 with OpenAI-compatible endpoints. The openai Python package works out-of-the-box by pointing base_url at Ollama:

import asyncio
import openai

# Ollama's OpenAI-compatible client
ollama_client = openai.AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't require a real API key
)

# Test the connection
async def test_ollama():
    response = await ollama_client.chat.completions.create(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": "Hello from MCP!"}],
    )
    print(response.choices[0].message.content)

asyncio.run(test_ollama())

Ollama also exposes its own native API at http://localhost:11434/api with a slightly different interface. Use the /v1 OpenAI-compatible endpoint for MCP integration: it is the interface that most existing MCP adapter code targets, and it lets the same code run against both Ollama and cloud-hosted OpenAI-compatible APIs (Groq, Mistral, Together, etc.) with only the base URL, API key, and model name changed.
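Because only the connection settings differ between providers, one way to keep the integration portable is to read them from the environment. The following is a minimal sketch rather than a prescribed setup: the variable names LLM_BASE_URL and LLM_API_KEY are hypothetical and chosen only for this example, and the defaults are the Ollama values used throughout this guide.

```python
import os

DEFAULT_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint
DEFAULT_API_KEY = "ollama"  # placeholder; Ollama does not require a real key


def client_settings(env=None) -> dict:
    """Build keyword arguments for openai.AsyncOpenAI from environment variables.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names used only in this
    sketch; neither Ollama nor the openai package reads them on its own.
    """
    env = os.environ if env is None else env
    # Strip a trailing slash so both ".../v1" and ".../v1/" give the same URL
    base_url = env.get("LLM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return {
        "base_url": base_url,
        "api_key": env.get("LLM_API_KEY", DEFAULT_API_KEY),
    }


# Usage (requires the openai package):
#   client = openai.AsyncOpenAI(**client_settings())
```

With no variables set, the client talks to local Ollama; setting both variables points the same code at a hosted OpenAI-compatible provider.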

Tool-capable Ollama models

Not all Ollama models support function calling. Whether tool calling works depends on the underlying model weights and on the prompt template in the model's Modelfile. Some models include tool-calling training in their weights but still produce unreliable results with the JSON schemas that MCP tools expose.

Model	Tool calling	Notes
`llama3.1:8b`	Reliable	Best general-purpose tool caller for Ollama; 8B fits on most hardware
`llama3.1:70b`	Reliable	Stronger reasoning; requires 48 GB+ VRAM
`llama3.2:3b`	Good	Fast on CPU; tool calling is less reliable than in the 3.1 series
`qwen2.5:7b`	Reliable	Strong on structured output; good tool calling fidelity
`qwen2.5:72b`	Excellent	Near-GPT-4o quality tool calling; large hardware requirement
`mistral-nemo`	Good	Mistral's small model; solid tool calling for constrained tasks
`command-r`	Good	Cohere's retrieval-focused model; built for tool-augmented RAG
`phi4`	Variable	Strong reasoning but inconsistent tool format compliance
`gemma2:9b`	Limited	Reports tool support; often returns plain text instead

Always verify tool calling with a simple test before building an MCP integration on a new model. The verification test in the next section catches silent failures early.

Verifying tool calling before building

import asyncio
import json
import openai

ollama_client = openai.AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Minimal test tool to verify function calling works
TEST_TOOL = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Returns the current time",
        "parameters": {"type": "object", "properties": {}, "required": []},
    }
}]

async def verify_tool_calling(model: str) -> bool:
    response = await ollama_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What time is it right now?"}],
        tools=TEST_TOOL,
        tool_choice="required",  # Force tool use
    )
    choice = response.choices[0]
    has_tool_calls = bool(choice.message.tool_calls)
    print(f"{model}: {'PASS' if has_tool_calls else 'FAIL — returned plain text'}")
    return has_tool_calls

asyncio.run(verify_tool_calling("llama3.1:8b"))

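The test passes tool_choice="required" to ask the model to call a tool, but do not rely on that parameter alone: Ollama's OpenAI-compatibility documentation has listed tool_choice as an unsupported field, so depending on your Ollama version it may be ignored and the request treated as "auto". That is why the test prompt is one that clearly needs the tool. If a model returns plain text for a prompt this direct, its tool calling is unreliable or not trained into the weights, and you should not build an MCP integration on top of it. In normal "auto" operation, a model that does not call tools simply answers in plain text, so failures stay invisible until you are debugging production behavior.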
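A single passing call says little about reliability, since the same model can call the tool on one attempt and answer in plain text on the next. The sketch below repeats a check several times per model and reports a pass rate. It assumes a coroutine such as verify_tool_calling from the test above is passed in as `check`; the function name screen_models and the default of three trials are illustrative choices, not part of any library.

```python
import asyncio
from typing import Awaitable, Callable


async def screen_models(
    models: list[str],
    check: Callable[[str], Awaitable[bool]],
    trials: int = 3,
) -> dict[str, float]:
    """Run `check` several times per model and return each model's pass rate."""
    rates = {}
    for model in models:
        passes = 0
        for _ in range(trials):
            try:
                passes += bool(await check(model))
            except Exception:
                # Count request errors (for example, a model that has not
                # been pulled) as a failed trial instead of aborting the run.
                pass
        rates[model] = passes / trials
    return rates


# Usage, assuming verify_tool_calling is defined as in the test above:
#   asyncio.run(screen_models(["llama3.1:8b", "qwen2.5:7b"], verify_tool_calling))
```

Models are checked one after another on purpose, so that trials do not compete for the same local hardware.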

MCP to OpenAI function format conversion

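from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def get_ollama_tools(mcp_url: str) -> tuple[list, dict]:
    """Fetch MCP tools and convert to OpenAI function format.

    Returns (tool_specs, tool_map) where tool_map maps name → MCP tool.
    """
    # streamablehttp_client yields (read, write, get_session_id); the third
    # value is not needed here. For servers that only speak the older SSE
    # transport, mcp.client.sse.sse_client yields (read, write) instead.
    async with streamablehttp_client(mcp_url) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()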

    tool_specs = []
    tool_map = {}
    for tool in result.tools:
        spec = {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description or f"Call {tool.name}",
                # MCP inputSchema is already JSON Schema — use directly
                "parameters": tool.inputSchema or {
                    "type": "object",
                    "properties": {},
                },
            }
        }
        tool_specs.append(spec)
        tool_map[tool.name] = tool
    return tool_specs, tool_map

Ollama models are more sensitive to tool schema quality than cloud models. Keep inputSchema flat: avoid nesting objects more than one level deep. Write description fields that are clear and specific: "Search the web for recent information on a topic" outperforms "Search tool". Vague descriptions and deeply nested schemas cause small Ollama models to pick the wrong tool or hallucinate tool arguments more often.
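These guidelines can be checked mechanically before a tool list is handed to a small model. The following is a rough sketch of such a lint pass over specs in the OpenAI function format. The function names and both thresholds (a nesting limit of 2 and a four-word minimum for descriptions) are assumptions made for illustration, not values taken from Ollama or MCP.

```python
def schema_depth(schema) -> int:
    """Count levels of nested objects in a JSON Schema (a flat object is 1)."""
    if not isinstance(schema, dict):
        return 0
    if schema.get("type") == "array":
        # An array adds no level itself; look at what it contains
        return schema_depth(schema.get("items"))
    if schema.get("type") != "object" and "properties" not in schema:
        return 0  # scalar property
    children = (schema.get("properties") or {}).values()
    return 1 + max((schema_depth(child) for child in children), default=0)


def lint_tool_spec(spec: dict, max_depth: int = 2, min_description_words: int = 4) -> list[str]:
    """Return human-readable warnings for one OpenAI-format tool spec."""
    fn = spec["function"]
    problems = []
    description = fn.get("description") or ""
    if len(description.split()) < min_description_words:
        problems.append(f"{fn['name']}: description is too short to guide tool choice")
    depth = schema_depth(fn.get("parameters") or {})
    if depth > max_depth:
        problems.append(f"{fn['name']}: parameters nest {depth} levels deep (limit {max_depth})")
    return problems
```

Running lint_tool_spec over every converted spec and logging the warnings gives an early signal about which MCP tools a small model is likely to misuse.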

The MCP tool-calling loop with Ollama

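import asyncio
import json
import openai
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_URL = "https://search.internal/mcp"

ollama_client = openai.AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

async def get_ollama_tools_from_session(session: ClientSession) -> tuple[list, dict]:
    """Session-based variant of get_ollama_tools: reuses an already open MCP session."""
    result = await session.list_tools()
    tool_specs = [
        {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description or f"Call {tool.name}",
                "parameters": tool.inputSchema or {"type": "object", "properties": {}},
            },
        }
        for tool in result.tools
    ]
    return tool_specs, {tool.name: tool for tool in result.tools}

async def run_ollama_agent(question: str, model: str = "llama3.1:8b") -> str:
    # streamablehttp_client yields (read, write, get_session_id)
    async with streamablehttp_client(MCP_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            tool_specs, _ = await get_ollama_tools_from_session(session)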

            messages = [{"role": "user", "content": question}]
            max_iterations = 10

            for _ in range(max_iterations):
                response = await ollama_client.chat.completions.create(
                    model=model,
                    messages=messages,
                    tools=tool_specs if tool_specs else None,
                    tool_choice="auto",
                    temperature=0,
                )

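                choice = response.choices[0]
                msg = choice.message

                if not msg.tool_calls:
                    # No tool calls: this is the final answer
                    return msg.content or ""

                # Record the assistant turn as plain dicts so the tool calls
                # serialize cleanly on the next request
                messages.append({
                    "role": "assistant",
                    "content": msg.content or "",
                    "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
                })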

                # Dispatch tool calls to MCP server (sequentially — Ollama models
                # rarely request multiple tools per turn)
                for tc in choice.message.tool_calls:
                    fn = tc.function
                    args = json.loads(fn.arguments) if fn.arguments else {}

                    mcp_result = await session.call_tool(fn.name, arguments=args)
                    result_text = "\n".join(
                        c.text for c in mcp_result.content if hasattr(c, "text")
                    )
                    if mcp_result.isError:
                        result_text = f"Error: {result_text}"

                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": result_text,
                    })

            return "Agent reached maximum iterations without a final answer."

print(asyncio.run(run_ollama_agent("Which MCP endpoints have been down this week?")))

Unlike Gemini, Ollama models typically generate one tool call per turn, so the loop processes tool calls sequentially. This is simpler and is rarely a bottleneck, because local inference time (from about half a second on a fast GPU to a minute or more on CPU, depending on hardware and model size) dwarfs the MCP round-trip. The max_iterations guard is important: small local models occasionally loop on tool errors rather than admitting they can't answer.
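One practical source of tool errors with small models is malformed argument JSON, which makes a bare json.loads call raise and end the whole run. A minimal sketch of a more forgiving parser follows; parse_tool_arguments is a hypothetical helper name, and the error strings are only suggestions for what to send back to the model.

```python
import json
from typing import Optional


def parse_tool_arguments(raw: Optional[str]) -> tuple[dict, Optional[str]]:
    """Parse a tool call's JSON arguments.

    Returns (arguments, error). On failure the arguments are empty and
    error holds a message that can be sent back to the model.
    """
    if not raw:
        return {}, None  # tools without parameters may send an empty string
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, f"Error: arguments were not valid JSON ({exc.msg})"
    if not isinstance(parsed, dict):
        return {}, "Error: arguments must be a JSON object"
    return parsed, None
```

Inside the loop, a non-empty error can be appended as the "tool" message for that tool_call_id instead of calling the MCP server, which gives the model a chance to correct its arguments on the next turn while max_iterations still bounds the retries.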

Latency profile: local inference vs MCP round-trip

Ollama's inference time depends on hardware. An MCP round-trip typically takes 5–20 ms for a server on the same machine or network and up to about 200 ms for a remote one, the same as any network call. The ratio of inference time to tool call time determines where to optimize:

Hardware	Inference (llama3.1:8b, ~500 tokens)	MCP call (local server)	Bottleneck
Apple M3 Pro (GPU)	1–2 s	5–20 ms	Inference
CPU only (8-core)	10–30 s	5–20 ms	Inference
NVIDIA RTX 4090 (GPU)	0.5–1 s	5–20 ms	Inference
CPU only (cheap VPS)	60–120 s	5–200 ms	Inference (heavily)

In local Ollama setups, inference is almost always the bottleneck; MCP tool calls are fast by comparison. The implication: tool call parallelism (like Gemini's multi-call architecture) provides minimal benefit for Ollama, because running tool calls concurrently saves only milliseconds on a turn that spends seconds in inference. Optimize inference first (quantization, GPU offloading, a smaller model) before tuning MCP call patterns.
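The size of that saving can be estimated with simple arithmetic. The sketch below models one agent turn as a single inference pass plus its tool calls, and assumes every tool call takes the same time; both function names are made up for this example.

```python
def turn_time(inference_s: float, tool_ms: float, n_calls: int, parallel: bool = False) -> float:
    """Estimated seconds for one turn: one inference pass plus its tool calls."""
    tool_s = tool_ms / 1000
    # Concurrent calls cost roughly one round-trip; sequential calls add up
    tools_total = tool_s if (parallel and n_calls > 0) else tool_s * n_calls
    return inference_s + tools_total


def parallel_saving(inference_s: float, tool_ms: float, n_calls: int) -> float:
    """Fraction of turn time saved by running the tool calls concurrently."""
    sequential = turn_time(inference_s, tool_ms, n_calls)
    concurrent = turn_time(inference_s, tool_ms, n_calls, parallel=True)
    return (sequential - concurrent) / sequential
```

Using figures from the table, a turn with 1.5 s of inference and three 20 ms tool calls takes about 1.56 s sequentially and 1.52 s concurrently, a saving of roughly 2.6%. With 20 s of CPU inference the saving drops to about 0.2%.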

Monitoring MCP servers in Ollama deployments

Local Ollama deployments often run unattended — a developer sets up a local agent, and it runs periodically without anyone actively watching it. When an MCP server the agent depends on goes down, the failure is silent: the agent runs, finds no tools available or gets errors, and either returns a degraded response or stops working entirely. Without external monitoring, the failure isn't noticed until someone manually checks the agent's output.

This failure pattern is common even when the MCP server is remote and the Ollama instance is local. The MCP server might be a cloud-hosted SaaS tool (a search API, a database connector, a webhook service), and any of these can go down independently of the local Ollama instance. AliveMCP continuously checks that the MCP server still completes the initialize handshake and alerts on failure, so even in a local development setup or an unattended automated agent, you know when the MCP server your Ollama agent depends on becomes unavailable. Send alerts to a Slack channel or email so you hear about an outage directly, instead of discovering it while investigating why the agent is producing degraded output.
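Alongside external monitoring, the agent itself can run a quick preflight check at startup so that an unreachable server shows up as an explicit log line instead of a vague answer. The sketch below is one possible shape for such a check and is not how any monitoring service works internally: preflight is a hypothetical helper, and `check` is any coroutine you supply, for example one that opens an MCP session and awaits session.initialize().

```python
import asyncio
import time
from typing import Awaitable, Callable


async def preflight(check: Callable[[], Awaitable[object]], timeout_s: float = 5.0) -> dict:
    """Run a health check with a timeout and report the outcome instead of raising."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"ok": False, "reason": f"timed out after {timeout_s}s"}
    except Exception as exc:
        # Connection refused, DNS failure, protocol errors, and so on
        return {"ok": False, "reason": f"{type(exc).__name__}: {exc}"}
    return {"ok": True, "latency_ms": round((time.monotonic() - start) * 1000)}
```

A check like this only runs when the agent runs, so it complements continuous external monitoring and does not replace it.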

Frequently asked questions

Can I use Ollama with MCP servers over stdio transport?

Yes. The transport (HTTP/SSE vs stdio) is entirely on the MCP client side — Ollama doesn't care how you call your MCP server. Use the MCP Python SDK's stdio_client() to manage a stdio subprocess, then convert its tools to OpenAI format and use the Ollama client the same way. Stdio MCP servers are common for local tools (filesystem access, local database queries, running shell commands) that don't need to be network-accessible.
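As a rough sketch of the stdio variant, the code below starts a server as a subprocess with the MCP Python SDK's stdio_client and converts its tools to the OpenAI format. The helper names to_openai_spec and list_stdio_tools are made up for this example, and the server command in the usage comment is a placeholder for whatever launches your own server.

```python
def to_openai_spec(name: str, description, input_schema) -> dict:
    """Convert one MCP tool's fields to OpenAI's function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description or f"Call {name}",
            "parameters": input_schema or {"type": "object", "properties": {}},
        },
    }


async def list_stdio_tools(command: str, args: list[str]) -> list[dict]:
    """Launch a stdio MCP server and return its tools as OpenAI-format specs."""
    # Imported here so the pure helper above works without the MCP SDK installed
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()
    return [to_openai_spec(t.name, t.description, t.inputSchema) for t in result.tools]


# Usage (my_server.py is a placeholder for your own stdio server script):
#   import asyncio
#   specs = asyncio.run(list_stdio_tools("python", ["my_server.py"]))
```

This helper only lists tools and then lets the subprocess exit. An agent loop needs to keep the session open for as long as it dispatches tool calls, exactly as with the HTTP transport.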

Does Ollama support parallel tool calls like GPT-4o or Gemini?

Ollama's OpenAI-compatible API can return multiple tool calls per response turn (the tool_calls array can have more than one entry). Whether a specific model actually generates multiple tool calls depends on its training. In practice, Llama 3.1, Qwen 2.5, and Mistral models rarely generate multiple parallel tool calls — they tend to work sequentially. This is fine for most Ollama use cases where inference speed is the dominant latency factor.
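If a model does return several tool calls in one turn, they can be dispatched concurrently with asyncio.gather. The sketch below takes the tool-calling function as a parameter so it stays independent of any particular MCP session. dispatch_tool_calls is a hypothetical helper, and it assumes the underlying MCP session tolerates concurrent requests.

```python
import asyncio
from typing import Awaitable, Callable


async def dispatch_tool_calls(
    calls: list[tuple[str, str, dict]],
    call_tool: Callable[[str, dict], Awaitable[str]],
) -> list[dict]:
    """Run (tool_call_id, name, arguments) triples concurrently.

    Returns one "tool" message per call, in the same order as `calls`.
    """

    async def run_one(call_id: str, name: str, args: dict) -> dict:
        try:
            content = await call_tool(name, args)
        except Exception as exc:
            # Report the failure to the model instead of aborting the turn
            content = f"Error: {exc}"
        return {"role": "tool", "tool_call_id": call_id, "content": content}

    # gather preserves input order regardless of which call finishes first
    return await asyncio.gather(*(run_one(*call) for call in calls))
```

The returned messages can be appended to the conversation as they are. Given the latency profile of local inference, treat this as a convenience and not as a meaningful speedup.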

Why does my Ollama model sometimes return plain text instead of calling tools?

This happens for several reasons: (1) the model's tool calling training doesn't fire for the specific prompt phrasing, so try making the user message more directly actionable ("Search for X" rather than "I wonder about X"); (2) the tool schema is complex and confuses the model, so simplify the inputSchema; (3) the model is too small for reliable tool calling, so switch to a larger model; (4) tool_choice is "auto" and the model decides it doesn't need tools. When debugging, you can try "required" to force tool use, but Ollama versions that do not support tool_choice will ignore it, so an unambiguous prompt remains the more dependable test.

Can I run Ollama on a server and connect remote MCP servers to it?

Yes. Start Ollama with OLLAMA_HOST=0.0.0.0:11434 to accept external connections, or use a reverse proxy for HTTPS. Your MCP integration code runs wherever it can reach both ollama_host:11434 and the MCP server's endpoint — typically on the same server as Ollama, or on a client machine. For remote deployments, ensure the Ollama host is not publicly accessible without authentication, as the API has no built-in auth.

How do I choose between Ollama and a cloud API (OpenAI, Groq) for MCP integration?

Use Ollama when: data privacy requires local processing; you have hardware and want zero per-token costs; offline operation is required; or you're prototyping and want to avoid API spend. Use cloud APIs when: inference quality is paramount; you need reliable tool calling without model tuning; fast inference speed is required (Groq is 5–10x faster than local Ollama on good hardware); or you can't provision the GPU hardware that large models require. A common pattern: develop with Ollama locally, deploy with Groq or OpenAI in production.