MCP server tools with Groq
Groq runs LLMs at roughly 500–800 tokens per second on custom Language Processing Units (LPUs), which places it among the fastest hosted inference APIs as of mid-2026. Groq's API is OpenAI-compatible, so the MCP adapter pattern used with OpenAI (convert tools/list to OpenAI function specs, run the chat completions loop, dispatch tool_calls to the MCP server) works with Groq by changing only the client configuration. The speed difference is significant for MCP-backed agents: where GPT-4o might take 3–8 seconds per inference call in a multi-tool agent, Groq with Llama 3.1 70B typically completes in 0.5–1.5 seconds. With 4–6 tool calls per run, the savings compound. The tradeoff: Groq's model selection is narrower than OpenAI's, and rate limits on the free tier are stricter. For production agents where inference speed matters, Groq is a compelling choice, and the MCP servers your agents call need monitoring regardless of which inference provider you use.
TL;DR
Use the groq Python package or point the openai client at https://api.groq.com/openai/v1. Best models for tool calling: llama-3.3-70b-versatile or llama-3.1-70b-versatile (quality) and llama-3.1-8b-instant (speed). Dispatch multiple tool calls in parallel with asyncio.gather(): MCP round-trips become a large share of latency when inference is fast. Watch for context window limits in long agent runs: Groq's Llama 3.1 70B has a 128k context, but tool-heavy conversations accumulate tokens fast. Monitor MCP servers with AliveMCP: Groq's speed makes a slow MCP server stand out in per-run latency, but you still need proactive alerts before the slowdown affects users.
Why Groq changes the MCP latency profile
In a typical MCP-backed agent run with 5 tool calls, the total time is: (inference time × turns) + (MCP call time × tool calls). When inference is slow (OpenAI GPT-4o at 3–8 s/call), the inference component dominates. When inference is fast (Groq at 0.5–1.5 s/call), MCP round-trips become a larger fraction of total time.
| Provider | Inference per turn | MCP calls (5 × 100 ms each) | Total (5-tool run) |
|---|---|---|---|
| GPT-4o | 3–8 s | 500 ms | 15–40 s |
| Claude Sonnet 4.6 | 2–5 s | 500 ms | 10–25 s |
| Groq Llama 3.1 70B | 0.5–1.5 s | 500 ms | 3–8 s |
| Groq Llama 3.1 8B | 0.2–0.5 s | 500 ms | 1.5–3 s |
With Groq, the MCP call time (500 ms for 5 sequential 100 ms calls) is roughly 6–17% of total run time on Llama 3.1 70B and 17–33% on Llama 3.1 8B, compared to under 5% with GPT-4o. This is why parallel MCP tool dispatch matters more with Groq: if the 5 tool calls run in parallel (using asyncio.gather()), MCP time drops to ~100 ms (the slowest of the parallel calls). Because tool calls requested in the same turn also share one inference round-trip, the run needs fewer turns, and total run time for a 5-tool run falls to about 1.5–4 s.
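The arithmetic behind the table can be checked with a short script. This is a sketch under the table's own assumptions (five inference turns, 100 ms per MCP call); the function names `run_time` and `mcp_share` are illustrative, and the numbers are estimates rather than measurements.

```python
def run_time(inference_s: float, turns: int, mcp_s: float, tool_calls: int,
             parallel: bool = False) -> float:
    """Estimate end-to-end seconds for one agent run.

    Sequential dispatch pays for every MCP call. Parallel dispatch pays only
    for the slowest call in the batch (all calls are assumed equal here).
    """
    mcp_total = mcp_s if parallel else mcp_s * tool_calls
    return inference_s * turns + mcp_total


def mcp_share(inference_s: float, turns: int, mcp_s: float, tool_calls: int) -> float:
    """Fraction of a sequential run spent waiting on MCP calls."""
    total = run_time(inference_s, turns, mcp_s, tool_calls)
    return (mcp_s * tool_calls) / total


if __name__ == "__main__":
    providers = [
        ("GPT-4o", 3.0, 8.0),
        ("Groq Llama 3.1 70B", 0.5, 1.5),
        ("Groq Llama 3.1 8B", 0.2, 0.5),
    ]
    for label, fast, slow in providers:
        low_share = mcp_share(slow, 5, 0.1, 5)
        high_share = mcp_share(fast, 5, 0.1, 5)
        print(f"{label}: MCP share {low_share:.0%} to {high_share:.0%}")
```

Running it shows how the MCP share grows as inference gets faster, which is the case where parallel dispatch pays off most.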
Setting up the Groq client
import os
from groq import AsyncGroq
# Official Groq client — mirrors the OpenAI SDK interface
groq_client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])
# Alternative: use the openai client with Groq's base URL
# import openai
# groq_client = openai.AsyncOpenAI(
#     base_url="https://api.groq.com/openai/v1",
#     api_key=os.environ["GROQ_API_KEY"],
# )

# Available tool-capable models (as of mid-2026). Model IDs are retired over
# time, so confirm them against the live list (await groq_client.models.list()).
# - "llama-3.1-70b-versatile": best quality, 128k context
# - "llama-3.1-8b-instant": fastest, 128k context, lighter reasoning
# - "llama-3.3-70b-versatile": newer, improved tool calling
# - "mixtral-8x7b-32768": MoE model, 32k context
# - "gemma2-9b-it": Google Gemma, 8k context, limited tool support
The groq package's AsyncGroq class closely mirrors openai.AsyncOpenAI for chat completions: the same chat.completions.create() call, the same response shape, the same patterns. MCP adapter code written against OpenAI's chat completions API generally works with Groq after changing AsyncOpenAI to AsyncGroq and the model name. This is the key advantage of the OpenAI-compatible API standard: your MCP integration is provider-portable.
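One way to keep that portability explicit is to hold the per-provider differences in a small table and build the client arguments from it. The sketch below is an illustration, not part of either SDK: `ProviderConfig`, `PROVIDERS`, and `client_kwargs` are hypothetical names, and the default models are only examples.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    api_key_env: str
    default_model: str


# Hypothetical registry: only these fields differ between providers.
PROVIDERS = {
    "groq": ProviderConfig(
        "https://api.groq.com/openai/v1", "GROQ_API_KEY", "llama-3.3-70b-versatile"
    ),
    "openai": ProviderConfig(
        "https://api.openai.com/v1", "OPENAI_API_KEY", "gpt-4o"
    ),
}


def client_kwargs(provider: str, env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments suitable for openai.AsyncOpenAI(**kwargs).

    Raises KeyError if the provider is unknown or its API key is not set.
    """
    cfg = PROVIDERS[provider]
    return {"base_url": cfg.base_url, "api_key": env[cfg.api_key_env]}
```

With this in place, `openai.AsyncOpenAI(**client_kwargs("groq"))` and `PROVIDERS["groq"].default_model` are the only provider-specific inputs the agent loop needs.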
MCP integration with parallel tool dispatch
import asyncio
import json
import os

from groq import AsyncGroq
from mcp import ClientSession
# Streamable HTTP transport from the MCP Python SDK. It yields a third value
# (a session-ID getter) that this example ignores.
from mcp.client.streamable_http import streamablehttp_client

groq_client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])
MCP_URL = "https://search.internal/mcp"


def mcp_to_openai_tools(tools) -> list[dict]:
    """Convert MCP tool definitions to OpenAI-style function specs."""
    return [{
        "type": "function",
        "function": {
            "name": t.name,
            "description": t.description or f"Call {t.name}",
            "parameters": t.inputSchema or {"type": "object", "properties": {}},
        }
    } for t in tools]


async def run_groq_agent(question: str, model: str = "llama-3.1-70b-versatile") -> str:
    async with streamablehttp_client(MCP_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools_result = await session.list_tools()
            tool_specs = mcp_to_openai_tools(tools_result.tools)
            messages = [{"role": "user", "content": question}]
            max_iterations = 12

            for _ in range(max_iterations):
                response = await groq_client.chat.completions.create(
                    model=model,
                    messages=messages,
                    tools=tool_specs,
                    tool_choice="auto",
                    temperature=0,
                    max_tokens=4096,
                )
                choice = response.choices[0]
                # exclude_none drops null fields that the API may reject when
                # the message is sent back on the next request
                messages.append(choice.message.model_dump(exclude_none=True))

                if choice.finish_reason == "stop":
                    return choice.message.content or ""
                if choice.finish_reason != "tool_calls":
                    return f"Unexpected stop reason: {choice.finish_reason}"

                tool_calls = choice.message.tool_calls
                if not tool_calls:
                    return choice.message.content or ""

                # Dispatch ALL tool calls in parallel; critical for Groq's speed
                async def call_mcp_tool(tc):
                    args = json.loads(tc.function.arguments or "{}")
                    result = await session.call_tool(tc.function.name, arguments=args)
                    content = "\n".join(c.text for c in result.content if hasattr(c, "text"))
                    return tc.id, content, result.isError

                results = await asyncio.gather(
                    *[call_mcp_tool(tc) for tc in tool_calls],
                    return_exceptions=True,
                )

                # One "tool" message per tool call, in the order the model requested them
                for tc, outcome in zip(tool_calls, results):
                    if isinstance(outcome, Exception):
                        content = f"Tool error: {outcome}"
                        is_error = True
                    else:
                        _, content, is_error = outcome
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": f"Error: {content}" if is_error else content,
                    })

            return "Agent reached iteration limit."


if __name__ == "__main__":
    print(asyncio.run(run_groq_agent("What are the top 5 most-monitored MCP servers right now?")))
Parallel dispatch with asyncio.gather() matters most on Groq. Groq models frequently return 2–4 tool calls per turn (especially for research-style tasks), and dispatching them sequentially gives back much of the inference speed advantage. Always pass return_exceptions=True to asyncio.gather(): if one MCP tool call fails, the others still complete and the agent can work with partial results.
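The effect of return_exceptions=True is easy to see in isolation. The sketch below uses a stand-in coroutine (`fake_tool`, a hypothetical helper, not an MCP call) so it runs without a server: the failing call comes back as an exception object in its own slot while the other results arrive intact and in order.

```python
import asyncio


async def fake_tool(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for an MCP tool call; sleeps, then succeeds or raises."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} unavailable")
    return f"{name} ok"


async def dispatch_all() -> list:
    # With return_exceptions=True, a failure is returned in place of a result
    # instead of propagating and discarding the successful calls.
    return await asyncio.gather(
        fake_tool("search", 0.01),
        fake_tool("fetch", 0.01, fail=True),
        fake_tool("summarize", 0.01),
        return_exceptions=True,
    )


if __name__ == "__main__":
    for outcome in asyncio.run(dispatch_all()):
        kind = "error" if isinstance(outcome, Exception) else "result"
        print(f"{kind}: {outcome}")
```

Without the flag, the first exception would propagate out of gather() and the caller would never see the two successful results.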
Groq model selection for MCP tool calling
Tool calling quality varies by model on Groq. The key dimensions for MCP-backed agents are: argument accuracy (does the model generate valid JSON arguments matching the inputSchema?), parallel tool request rate (how often does the model call multiple tools in one turn?), and context efficiency (how many tokens does the model use per turn?).
| Model | Speed (tok/s) | Tool accuracy | Context | Best for |
|---|---|---|---|---|
| llama-3.3-70b-versatile | ~450 | Excellent | 128k | Production agents, complex tool schemas |
| llama-3.1-70b-versatile | ~500 | Very good | 128k | Production agents, most use cases |
| llama-3.1-8b-instant | ~750 | Good | 128k | High-volume, simple tool schemas |
| mixtral-8x7b-32768 | ~550 | Good | 32k | Cost-sensitive, limited context |
For production MCP integrations, start with llama-3.3-70b-versatile. Drop to llama-3.1-8b-instant only after verifying that your specific tool schemas don't require deeper reasoning — small models struggle with schemas that have many optional fields or ambiguous descriptions.
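Argument accuracy can be measured before switching models by checking each generated tool call against the tool's inputSchema. The following is a minimal sketch using only the standard library: `check_arguments` is a hypothetical helper that covers required fields and primitive types, not full JSON Schema (a dedicated validator such as the jsonschema package would cover the rest).

```python
import json

_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_arguments(raw_arguments: str, input_schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    properties = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")

    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            continue  # unknown fields are allowed by default in JSON Schema
        type_name = spec.get("type")
        expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if expected is None:
            continue  # union or unsupported type: out of scope for this sketch
        # bool is a subclass of int in Python, so reject it for numeric types
        wrong_bool = isinstance(value, bool) and expected is not bool
        if not isinstance(value, expected) or wrong_bool:
            problems.append(f"{name}: expected {type_name}, got {type(value).__name__}")
    return problems
```

Running a fixed set of prompts through each candidate model and counting calls with a non-empty problem list gives a rough, comparable accuracy figure for your own schemas.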
Rate limits and context budgeting
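Groq's free tier has tight rate limits, on the order of 14,400 tokens per minute on Llama 3.1 70B (limits vary by model and change over time, so confirm current values in the Groq console). For MCP-heavy agents where each run consumes 2,000–8,000 tokens (tools list + conversation + tool results), that budget covers only about 2–7 runs per minute, so free tier limits are hit quickly in development. The paid Developer and Production tiers have much higher limits.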
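A client-side budget can keep development runs under a tokens-per-minute cap instead of discovering the limit through rejected requests. The sketch below is one possible approach, not a Groq SDK feature: `TokenRateBudget` is a hypothetical class that tracks usage over a trailing 60-second window, and the injectable `clock` exists only to make it testable.

```python
import time
from collections import deque


class TokenRateBudget:
    """Track tokens spent in the trailing 60 seconds against a per-minute cap."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock
        self.events = deque()  # (timestamp, tokens), oldest first

    def record(self, tokens: int) -> None:
        """Call after each response, e.g. with response.usage.total_tokens."""
        self.events.append((self.clock(), tokens))

    def used(self) -> int:
        """Tokens spent inside the current 60-second window."""
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def wait_time(self, tokens_needed: int) -> float:
        """Seconds until tokens_needed fits under the cap (0.0 if it fits now)."""
        if tokens_needed > self.limit:
            raise ValueError("request is larger than the per-minute cap")
        now = self.clock()
        used = self.used()
        if used + tokens_needed <= self.limit:
            return 0.0
        # Walk forward through the oldest events until enough tokens expire
        for timestamp, tokens in self.events:
            used -= tokens
            if used + tokens_needed <= self.limit:
                return timestamp + 60.0 - now
        return 0.0  # not reached: used is 0 after the last event
```

Before each inference call, `await asyncio.sleep(budget.wait_time(estimated_tokens))` delays the request just long enough for older usage to age out of the window.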
Context window management is critical for long agent runs. Each tool call adds to the conversation: the model's function call JSON (~50–200 tokens) plus the tool result (potentially 1,000+ tokens if a tool returns a long document). After 6–10 tool calls with large results, the conversation can consume a substantial share of the model's window, and answer quality tends to degrade as older details receive less attention. Implement a rolling context window:
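def trim_conversation_to_budget(messages: list, max_tokens: int = 100_000, keep_last: int = 8) -> list:
    """Keep the system message plus the most recent messages within the token budget.

    Rough token estimate: 1 token ≈ 4 chars (English). Each whole message is
    stringified so that tool-call JSON is counted, not only the "content" field.
    """
    estimated_tokens = sum(len(str(m)) for m in messages) // 4
    if estimated_tokens <= max_tokens:
        return messages
    # Always keep the system message (index 0) if present
    has_system = bool(messages) and messages[0]["role"] == "system"
    kept = [messages[0]] if has_system else []
    body = messages[1:] if has_system else messages
    recent = body[-keep_last:]  # Keep the last 8 messages by default
    # Drop leading tool results whose assistant tool_calls message was trimmed;
    # a tool message must follow the tool call it answers.
    while recent and recent[0]["role"] == "tool":
        recent = recent[1:]
    return kept + recent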
Trim to the last 8 messages when approaching the context limit, and make sure the trimmed history does not begin with an orphaned tool result. The cost is that the model loses access to early tool results, but for most agents, recent context matters more than the full history. For agents that need full history (audit trails, research summaries), use RAG over the tool results instead of keeping everything in context.
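The retrieval alternative can be illustrated with a store that keeps full tool outputs outside the prompt and pulls back only the relevant ones. This sketch is deliberately naive: `ToolResultStore` is a hypothetical class, and word overlap stands in for the embedding similarity a real RAG setup would use.

```python
import re
from dataclasses import dataclass, field


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ToolResultStore:
    """Holds (tool_name, content) records and retrieves the most relevant ones."""
    records: list[tuple[str, str]] = field(default_factory=list)

    def add(self, tool_name: str, content: str) -> None:
        self.records.append((tool_name, content))

    def search(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        query_words = _words(query)
        scored = []
        for index, (tool_name, content) in enumerate(self.records):
            overlap = len(query_words & _words(content))
            if overlap:
                scored.append((overlap, index, tool_name, content))
        # Highest overlap first; newer records win ties
        scored.sort(key=lambda item: (-item[0], -item[1]))
        return [(tool_name, content) for _, _, tool_name, content in scored[:k]]
```

Each tool result is added to the store as it arrives; when older messages are trimmed, the top matches for the current question can be re-inserted as a short context message.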
Monitoring MCP servers in Groq pipelines
Groq's speed makes slow MCP servers immediately visible in latency measurements: if a typical agent run takes 2 seconds and suddenly takes 12, the culprit is usually an MCP server responding slowly rather than Groq inference, which tends to be consistently fast. Add timing around each call_mcp_tool() call and log results above a threshold:
import json
import time


async def call_mcp_tool_timed(session, tc) -> tuple:
    """Call one MCP tool and log the call if it takes longer than 1 second."""
    args = json.loads(tc.function.arguments or "{}")
    start = time.monotonic()
    result = await session.call_tool(tc.function.name, arguments=args)
    elapsed = time.monotonic() - start
    if elapsed > 1.0:
        print(f"SLOW MCP tool: {tc.function.name} took {elapsed:.2f}s")
    content = "\n".join(c.text for c in result.content if hasattr(c, "text"))
    return tc.id, content, result.isError
Logging slow tools identifies which specific MCP server is the latency source — in a multi-server setup, this narrows the investigation immediately. AliveMCP provides the proactive side: rather than waiting for a slow or failing tool call to alert you, AliveMCP pings the MCP server every 60 seconds and fires an alert when response time degrades or the server becomes unreachable. In Groq pipelines where end-to-end latency is the value proposition, AliveMCP's advance warning ensures a slow MCP server doesn't quietly eliminate the inference speed advantage you're paying for.
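A fixed threshold misses tools that are normally fast and become merely slower. One option, sketched below, is to compare each call against that tool's own recent history. `ToolLatencyTracker` and its defaults (a 50-sample window, a 3x-median trigger, a 5-sample warm-up) are assumptions chosen for illustration, not recommendations from any monitoring product.

```python
from collections import defaultdict, deque
from statistics import median


class ToolLatencyTracker:
    """Flags calls that are much slower than a tool's recent median latency."""

    def __init__(self, window: int = 50, factor: float = 3.0, min_samples: int = 5):
        self.factor = factor
        self.min_samples = min_samples
        # One bounded history per tool name
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def observe(self, tool_name: str, elapsed_s: float) -> bool:
        """Record a sample; return True if it is anomalously slow for this tool."""
        history = self.samples[tool_name]
        is_slow = (
            len(history) >= self.min_samples
            and elapsed_s > self.factor * median(history)
        )
        history.append(elapsed_s)
        return is_slow
```

Calling `tracker.observe(tc.function.name, elapsed)` after each timed tool call gives a per-tool signal that adapts to each server's normal speed.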
Frequently asked questions
Does Groq support streaming with tool calls?
Yes. Use groq_client.chat.completions.create(stream=True) for streaming. With tool calls, Groq streams the text portions of the response but delivers function call blocks in chunks that must be accumulated before dispatch. Accumulate delta.tool_calls chunks until the stream completes (finish_reason is "tool_calls"), then dispatch to MCP. Streaming is most valuable for the final response to the user — intermediate tool-call turns don't benefit from streaming because the user isn't seeing them.
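The accumulation step can be sketched as follows, assuming the OpenAI-style chunk shape in which each fragment carries an index, an optional id, and partial function.name / function.arguments strings. `accumulate_tool_calls` is a hypothetical helper, and `fake_chunk` / `fake_fragment` only build stand-in data so the example runs without an API key.

```python
from types import SimpleNamespace


def accumulate_tool_calls(chunks) -> list[dict]:
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls: dict[int, dict] = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        for fragment in getattr(delta, "tool_calls", None) or []:
            call = calls.setdefault(
                fragment.index, {"id": "", "name": "", "arguments": ""}
            )
            if fragment.id:
                call["id"] = fragment.id
            if fragment.function and fragment.function.name:
                call["name"] = fragment.function.name
            if fragment.function and fragment.function.arguments:
                # Argument JSON may arrive in pieces; concatenate in arrival order
                call["arguments"] += fragment.function.arguments
    return [calls[i] for i in sorted(calls)]


def fake_fragment(index, call_id=None, name=None, arguments=None):
    """Stand-in for one tool-call delta (test data only, not SDK output)."""
    function = SimpleNamespace(name=name, arguments=arguments)
    return SimpleNamespace(index=index, id=call_id, function=function)


def fake_chunk(*fragments):
    """Stand-in for one streamed chunk (test data only, not SDK output)."""
    delta = SimpleNamespace(tool_calls=list(fragments) or None)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

With a real stream, collect the chunks with `async for chunk in stream` and pass them to the same function once the stream ends, then parse each `arguments` string with json.loads before dispatching to MCP.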
Can I use Groq's parallel tool calling with MCP servers that have rate limits?
Yes, but manage your Groq and MCP rate limits separately. If you dispatch 5 MCP tool calls in parallel and the MCP server allows 2 requests per second, 3 of them may be rejected. Bound the dispatch with a single shared semaphore: create sem = asyncio.Semaphore(2) once per batch (or per server), then wrap each call in async with sem: await call_mcp_tool(...). Creating a new Semaphore inside each call limits nothing. A semaphore caps concurrent calls rather than calls per second, so add a delay or a token-bucket limiter if the server enforces a strict per-second quota. None of this changes the Groq-side behavior: Groq still returns all tool calls at once, but your dispatch logic respects the MCP server's limits.
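A bounded dispatch might look like the sketch below. `dispatch_bounded` is a hypothetical helper, and the demo uses a stand-in coroutine instead of a real MCP call so that peak concurrency can be observed directly; the important detail is that all calls share one semaphore instance.

```python
import asyncio


async def dispatch_bounded(coroutine_factories, max_concurrent: int = 2) -> list:
    """Run all calls, but never more than max_concurrent at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)  # one shared instance per batch

    async def guarded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(
        *(guarded(factory) for factory in coroutine_factories),
        return_exceptions=True,
    )


async def demo() -> tuple[list, int]:
    """Dispatch five fake calls and report the peak number running at once."""
    active = 0
    peak = 0

    async def fake_call(name: str) -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return name

    factories = [lambda n=n: fake_call(f"tool-{n}") for n in range(5)]
    results = await dispatch_bounded(factories, max_concurrent=2)
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Passing zero-argument factories rather than coroutine objects means no call starts until it has acquired the semaphore. Results still come back in request order, so they can be zipped with the model's tool_calls as before.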
Why does Groq have shorter context windows than OpenAI for some models?
Groq's LPU architecture prioritizes generation speed over context length. Serving a 128k context at Groq's token generation speeds requires holding more state in fast memory, so some models on Groq ship with medium context windows (8k–32k tokens), while the newer models extend to 128k. For MCP-backed agents, this is rarely a limitation: most agent runs complete within 10–30k tokens of context even with multiple tool calls. Monitor context usage by logging response.usage.total_tokens per run.
How does Groq handle tool call errors — should I return error strings or raise exceptions?
Follow the same rule as AutoGen: return error strings, don't raise exceptions. When a tool call fails, add a role: "tool" message with the error text as content. The model then sees the error and can decide to retry, use a different tool, or ask the user for clarification. Raising an uncaught exception aborts the entire agent run and returns no response. The exception to this: fatal errors (the MCP server is down, auth is invalid) should propagate up to the caller so the application can log them and return a proper error to the user.
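The split between recoverable and fatal errors can be made explicit when converting gather() outcomes into tool messages. In this sketch, `to_tool_messages` and the choice of `FATAL_ERRORS` are assumptions for illustration; which exception types count as fatal depends on your transport and auth setup.

```python
# Assumption: these exception types mean the server is unreachable or the
# credentials are invalid, so retrying inside the agent loop cannot help.
FATAL_ERRORS = (ConnectionError, PermissionError)


def to_tool_messages(tool_call_ids: list[str], outcomes: list) -> list[dict]:
    """Turn gather(return_exceptions=True) outcomes into role="tool" messages.

    Recoverable failures become error strings the model can react to.
    Fatal failures are re-raised so the application can log and report them.
    """
    messages = []
    for call_id, outcome in zip(tool_call_ids, outcomes):
        if isinstance(outcome, FATAL_ERRORS):
            raise outcome
        if isinstance(outcome, Exception):
            content = f"Error: {outcome}"
        else:
            content = str(outcome)
        messages.append({"role": "tool", "tool_call_id": call_id, "content": content})
    return messages
```

The caller wraps the agent run in a try/except for the fatal types and returns a proper error to the user, while everything else stays inside the conversation for the model to handle.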
Can I use Groq for the orchestration model and Claude or GPT-4o for individual tool calls?
Groq is an inference API — it runs the LLM that decides when to call tools. The MCP server that backs each tool can use any model internally; the LLM choice on the MCP server side is completely independent of Groq. A common pattern: use Groq Llama 3.1 70B as the fast orchestrator that decides which tools to call, with MCP servers that internally use Claude Sonnet for complex reasoning tasks. This hybrid gives you Groq's fast orchestration loop plus Claude's reasoning quality for the heavy-lifting tool handlers.
Further reading
- MCP server tools with Ollama — local LLMs and OpenAI-compatible adapter
- MCP servers with OpenAI Agents SDK — native MCPServerHTTP integration
- MCP servers with Google Gemini — function calling and ADK integration
- MCP server tool design — flat schemas and LLM-friendly descriptions
- MCP server timeout — preventing agentic loop stalls
- AliveMCP — continuous protocol monitoring for MCP servers