Guide · Enterprise Security
MCP server SLA
A Service Level Agreement for an MCP server is a contractual commitment to your customers about how available the server will be — typically expressed as a monthly availability percentage and backed by a credit schedule when you fall short. Getting an MCP server SLA right involves three distinct decisions: what to measure (the Service Level Indicator), what percentage to commit to (the Service Level Objective), and what the contractual consequence is when you miss it (the SLA itself). This guide covers the full stack — from choosing the right probe to generating the monthly PDF report that satisfies enterprise procurement SLA evidence requirements.
TL;DR
Measure MCP server availability from an external probe that performs a real initialize handshake, not from your server's own metrics, which cannot count requests the server never received. Common tiers: 99.9% (≈43 min/month allowed downtime), 99.95% (≈22 min), 99.99% (≈4 min). Credit schedule: 10–50% of the monthly fee for a single long incident, or 25–100% when monthly availability falls below the committed threshold, whichever is higher. AliveMCP's Team and Enterprise plans export monthly SLA PDF reports with availability percentage, incident list, and response-time percentiles: the evidence your customers' procurement teams request.
SLI, SLO, and SLA: the three-tier framework
These three terms are often used interchangeably but represent distinct layers of the availability framework:
| Term | What it is | Example for MCP server | Who owns it |
|---|---|---|---|
| SLI (Service Level Indicator) | The raw measurement — what you actually observe | % of 60-second probe intervals where MCP initialize succeeds in <5 seconds | Engineering (what to measure) |
| SLO (Service Level Objective) | Your internal target for the SLI | SLI ≥ 99.95% measured over a rolling 30-day window | Engineering + Product (what to aim for) |
| SLA (Service Level Agreement) | The contractual commitment, typically less strict than the SLO, with credits if breached | 99.9% monthly availability; 25% credit for 99.0–99.9%, 50% for 98.0–99.0%, 100% below 98.0% | Legal + Business (what's promised and what's owed) |
The gap between SLO and SLA is intentional: your SLO should be more aggressive than your SLA so that when the SLO fires (internal alert), you still have room to recover before breaching the SLA (contractual trigger). If SLO = SLA, every internal alert is also a credit event — engineering has no runway.
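The two-tier relationship can be sketched as a single check over the probe counts for a window. This is a minimal illustration, not a definitive implementation: the function name `evaluate_window` and the hard-coded 99.95% / 99.9% thresholds are assumptions taken from the example table above. It compares counts with integer arithmetic so that no floating-point rounding can flip a borderline result.

```bash
#!/bin/bash
# evaluate_window GOOD TOTAL   (hypothetical helper, for illustration only)
# Prints: ok | slo_alert | sla_breach
evaluate_window() {
  local good="$1" total="$2"
  local slo_bp=9995   # internal SLO: 99.95%, in hundredths of a percent
  local sla_bp=9990   # contractual SLA: 99.9%, in hundredths of a percent
  if [ "$total" -le 0 ]; then
    echo "no_data"
    return 1
  fi
  # good/total >= threshold  <=>  good * 10000 >= threshold_bp * total
  if [ $((good * 10000)) -ge $((slo_bp * total)) ]; then
    echo ok
  elif [ $((good * 10000)) -ge $((sla_bp * total)) ]; then
    echo slo_alert      # internal alert: still inside the contractual SLA
  else
    echo sla_breach     # contractual trigger: credits are owed
  fi
}
```

The `slo_alert` band is the runway: in a 30-day month of 60-second probes it spans the window between the 22nd and the 43rd failed probe.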
What to measure: the MCP server SLI
The SLI for an MCP server should reflect what your customers actually depend on: the ability to successfully initiate an MCP session and call tools. This means the probe must do more than check if a TCP port is open.
#!/bin/bash
# mcp-sli-probe.sh: measure one probe interval for SLI calculation
SERVER_URL="${1:-https://mcp-search.internal/mcp}"
TIMEOUT_SECS=5
THRESHOLD_MS=5000

# Send a real MCP initialize request. curl appends the HTTP status code and
# the total request time as the last two lines of output, after the body.
# The initialize params include "capabilities", which the MCP spec requires,
# and the Accept header lists both content types an HTTP MCP server may return.
RESPONSE=$(curl \
  --silent \
  --max-time "$TIMEOUT_SECS" \
  --write-out "\n%{http_code}\n%{time_total}" \
  --request POST "$SERVER_URL" \
  --header "Content-Type: application/json" \
  --header "Accept: application/json, text/event-stream" \
  --data '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"sli-probe","version":"1.0"}}}')

HTTP_CODE=$(printf '%s\n' "$RESPONSE" | tail -n 2 | head -n 1)
ELAPSED_MS=$(printf '%s\n' "$RESPONSE" | tail -n 1 | awk '{printf "%d", $1*1000}')
# Everything except the last two lines is the body, which may span several lines
BODY=$(printf '%s\n' "$RESPONSE" | sed '$d' | sed '$d')

# SLI = "good" if:
# 1. HTTP 200
# 2. Response contains protocolVersion (valid MCP response)
# 3. Elapsed time < SLI threshold (5000ms)
if [ "$HTTP_CODE" = "200" ] \
  && printf '%s\n' "$BODY" | grep -q '"protocolVersion"' \
  && [ "$ELAPSED_MS" -lt "$THRESHOLD_MS" ]; then
  echo "GOOD latency=${ELAPSED_MS}ms"
  exit 0
else
  echo "BAD http_code=$HTTP_CODE latency=${ELAPSED_MS}ms"
  exit 1
fi
Log each probe result (timestamp, good/bad, latency) to calculate the availability percentage at the end of each calendar month.
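As a rough sketch of that end-of-month calculation, the helper below reads a probe log and prints the totals. The log format is an assumption made for this example (one line per probe: `<epoch_seconds> <GOOD|BAD> <latency_ms>`), as is the function name `availability_from_log`; adapt both to however you actually store results.

```bash
#!/bin/bash
# availability_from_log LOG_FILE   (hypothetical helper)
# Assumed log line format: "<epoch_seconds> <GOOD|BAD> <latency_ms>"
availability_from_log() {
  awk '
    $2 == "GOOD"                { good++ }
    $2 == "GOOD" || $2 == "BAD" { total++ }
    END {
      if (total == 0) { print "no-data"; exit 1 }
      printf "total=%d failed=%d availability_pct=%.3f\n",
             total, total - good, good / total * 100
    }
  ' "$1"
}
```

The printed percentage is rounded for display. For the contractual met/missed decision it is safer to compare the raw counts against the failed-probe budget rather than the rounded figure.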
Latency SLI: beyond binary availability
A server that responds to every initialize in 4,900 ms is technically "up" but is causing meaningful latency in agent pipelines. Consider adding a latency SLI alongside the availability SLI:
# Latency SLI (separate from availability SLI)
# "good" = p99 response time for tools/call < 500ms over the measurement window
# Measured via AliveMCP response-time histogram:
# curl -H "Authorization: Bearer $TOKEN" \
# "https://alivemcp.com/api/v1/servers/mcp-search/metrics?period=2026-05" \
# | jq '.response_time_p99_ms'
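If you keep your own probe log, a percentile can also be computed locally. The sketch below uses the nearest-rank method over the latency column of an assumed log format (`<epoch_seconds> <GOOD|BAD> <latency_ms>`); the function name `latency_percentile` is hypothetical. Note that it reports on whatever latency you logged: if the log only holds initialize probes, the result describes handshake latency, not tools/call latency.

```bash
#!/bin/bash
# latency_percentile PCT LOG_FILE   (hypothetical helper)
# PCT is an integer such as 50, 95 or 99.
# Assumed log line format: "<epoch_seconds> <GOOD|BAD> <latency_ms>"
latency_percentile() {
  local pct="$1" log_file="$2"
  awk '{ print $3 }' "$log_file" | sort -n | awk -v p="$pct" '
    { v[NR] = $1 }                       # sorted latencies, 1-based
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)    # ceil(p/100 * N), nearest-rank method
      if (rank < 1) rank = 1
      print v[rank]
    }
  '
}
```

A latency SLI check then reduces to comparing the printed value with the threshold, for example `[ "$(latency_percentile 99 probes.log)" -lt 500 ]`.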
Availability percentage and downtime budget
Availability percentage = (total probe intervals − failed intervals) / total probe intervals × 100. A probe interval that times out counts as failed. A probe interval that returns a 5xx error counts as failed. A probe interval that returns a 200 with a malformed (non-MCP) response counts as failed.
| SLA target | Maximum downtime per month (30d) | Maximum downtime per year | Failed probe budget (60s interval) |
|---|---|---|---|
| 99.0% | 7h 12m | 3d 15h | ≤ 432 failed probes/month |
| 99.5% | 3h 36m | 1d 20h | ≤ 216 failed probes/month |
| 99.9% | 43m 12s | 8h 45m | ≤ 43 failed probes/month |
| 99.95% | 21m 36s | 4h 22m | ≤ 21 failed probes/month |
| 99.99% | 4m 19s | 52m | ≤ 4 failed probes/month |
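The budget columns follow directly from the target percentage. A minimal sketch of that arithmetic, assuming a 30-day month and 60-second probes by default (the function name `downtime_budget` and its argument order are made up for this example):

```bash
#!/bin/bash
# downtime_budget TARGET_PCT [DAYS] [INTERVAL_SECS]   (hypothetical helper)
# Prints the allowed downtime in seconds and the failed-probe budget.
downtime_budget() {
  local target_pct="$1" days="${2:-30}" interval_secs="${3:-60}"
  awk -v t="$target_pct" -v d="$days" -v i="$interval_secs" 'BEGIN {
    bp = int(t * 100 + 0.5)              # target in hundredths of a percent, e.g. 99.95 -> 9995
    period_secs = d * 86400
    probes = int(period_secs / i)        # probe intervals in the period
    downtime_secs = int((10000 - bp) * period_secs / 10000)
    failed_budget = int((10000 - bp) * probes / 10000)   # rounded down
    printf "downtime_secs=%d failed_probes=%d\n", downtime_secs, failed_budget
  }'
}
```

For example, `downtime_budget 99.9` prints `downtime_secs=2592 failed_probes=43`, which is the 43m 12s row. Both budgets are rounded down, so a fractional probe never counts in your favor.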
Choose the SLA tier based on your current measured performance, with headroom. If your MCP server has achieved 99.97% availability over the last 90 days, its measured unavailability is 0.03%. Committing to 99.9% (0.1% allowed) gives you roughly a 3.3× safety margin on the failure budget. Committing to 99.95% (0.05% allowed) leaves about 1.7×, which is workable but demands rigorous on-call coverage. Committing to 99.99% (0.01% allowed) would already be in breach at that track record: the entire monthly budget is about 4 minutes of downtime.
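The safety margin is the ratio of the unavailability a tier allows to the unavailability you have actually measured. A small sketch of that ratio, with a hypothetical helper name `headroom`:

```bash
#!/bin/bash
# headroom MEASURED_PCT TARGET_PCT   (hypothetical helper)
# Prints allowed unavailability divided by measured unavailability.
headroom() {
  local measured_pct="$1" target_pct="$2"
  awk -v m="$measured_pct" -v t="$target_pct" 'BEGIN {
    used = 100 - m       # measured unavailability, in percent
    allowed = 100 - t    # unavailability the SLA tier permits, in percent
    if (used <= 0) { print "inf"; exit }   # no measured failures at all
    printf "%.1f\n", allowed / used
  }'
}
```

A result below 1.0 means the measured track record already exceeds the tier's budget, so that tier would be in breach from day one.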
What counts as downtime
Define downtime precisely in the SLA. Ambiguity here is where enterprise procurement disputes start. A robust definition for MCP servers:
# SLA downtime definition (include verbatim in customer agreements)
"Downtime" means any period during which AliveMCP's external monitoring probe
is unable to successfully complete a Model Context Protocol (MCP) initialize
handshake with the Service endpoint within 5,000 milliseconds. A period of
downtime begins when three consecutive probe intervals fail and ends when three
consecutive probe intervals succeed.
Excluded from downtime calculation:
- Scheduled maintenance windows (notified ≥72 hours in advance via status page)
- Probe failures caused by customer-side network issues or client misconfigurations
- Periods of force majeure (as defined in the Master Services Agreement)
- The first 5 minutes of any incident (allowing for transient network noise)
The "three consecutive failures to start, three consecutive successes to end" rule prevents a single probe timeout (which could be a transient network blip) from triggering an SLA event. With 60-second probe intervals, this means a genuine outage is detected within 3 minutes of starting.
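The rule amounts to a two-state machine over the probe sequence. The sketch below is one possible reading, not the only one: it assumes an incident is backdated to the first probe of the failing run and ends at the first probe of the recovering run. The function name `detect_incidents` and the index-based output are illustrative choices.

```bash
#!/bin/bash
# detect_incidents RESULT...   (hypothetical helper)
# Each argument is GOOD or BAD, one per probe interval, in time order.
# Prints "start=<i> end=<j>" per incident using 0-based probe indexes;
# the incident covers probes i through j-1. An unrecovered incident
# prints end=open.
detect_incidents() {
  local threshold=3 state=up fails=0 goods=0 start=0 i=0 r
  for r in "$@"; do
    if [ "$r" = "BAD" ]; then
      fails=$((fails + 1)); goods=0
    else
      goods=$((goods + 1)); fails=0
    fi
    if [ "$state" = up ] && [ "$fails" -ge "$threshold" ]; then
      state=down
      start=$((i - threshold + 1))       # backdate to the first failing probe
    elif [ "$state" = down ] && [ "$goods" -ge "$threshold" ]; then
      state=up
      echo "start=$start end=$((i - threshold + 1))"
    fi
    i=$((i + 1))
  done
  if [ "$state" = down ]; then
    echo "start=$start end=open"
  fi
}
```

For instance, `detect_incidents GOOD BAD BAD BAD BAD GOOD GOOD GOOD` prints `start=1 end=5`, while a lone `BAD` between `GOOD` results prints nothing. With 60-second probes, multiply the index difference by 60 to get the incident duration in seconds.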
| Error condition | Counts as downtime? | Rationale |
|---|---|---|
| HTTP 5xx on MCP endpoint | Yes | Server-side failure |
| HTTP 4xx on MCP endpoint | No (investigate separately) | May indicate auth regression on customer side |
| Connection timeout (>5s) | Yes | Network or server failure; unusable from the agent's perspective |
| Valid HTTP 200 but invalid MCP response | Yes | Schema regression — tools/call would fail even if connection succeeds |
| Scheduled maintenance (pre-announced) | No | Excluded per SLA terms |
| DNS resolution failure | Yes | Customer perspective: service unavailable |
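One way to apply this table in a probe is to classify each result into three outcomes instead of two. The following is a sketch under assumptions: the function name `classify_probe` and its three arguments (curl's exit status, the HTTP status code, and the response body) are invented for illustration, and any status the table does not list is treated as downtime. Scheduled maintenance is not handled here, since it depends on timestamps rather than on the response.

```bash
#!/bin/bash
# classify_probe CURL_EXIT HTTP_CODE BODY   (hypothetical helper)
# Prints one of: ok | downtime | investigate
classify_probe() {
  local curl_exit="$1" http_code="$2" body="$3"
  # Any curl-level failure means the agent could not reach the service,
  # e.g. exit 6 (could not resolve host) or exit 28 (operation timed out).
  if [ "$curl_exit" -ne 0 ]; then
    echo downtime
    return
  fi
  case "$http_code" in
    5??) echo downtime ;;                 # server-side failure
    4??) echo investigate ;;              # possibly an auth regression; not downtime
    200)
      case "$body" in
        *'"protocolVersion"'*) echo ok ;;
        *) echo downtime ;;               # HTTP 200 but not a valid MCP response
      esac
      ;;
    *) echo downtime ;;                   # assumption: unlisted statuses count as downtime
  esac
}
```

Keeping `investigate` separate from `downtime` lets you alert on 4xx responses without letting them consume the SLA failure budget.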
Credit schedule design
Credit schedules should scale with severity — a minor SLA miss (99.88% vs 99.9% target) earns a smaller credit than a catastrophic month (95% availability). Standard enterprise SaaS credit schedules adapted for MCP servers:
# Credit schedule (include in customer SLA addendum)
Monthly availability       Credit (% of monthly fee)
─────────────────────      ──────────────────────────
≥ 99.9%                    No credit (within SLA)
≥ 99.0% and < 99.9%        25%
≥ 98.0% and < 99.0%        50%
< 98.0%                    100%

Credit for individual incidents:
─────────────────────────────────
Incident duration < 30 min:             No credit
Incident duration ≥ 30 and < 60 min:    10% of monthly fee
Incident duration 60–120 min:           25% of monthly fee
Incident duration > 120 min:            50% of monthly fee

(Credits from monthly availability and individual incidents do not stack.
The higher of the two applies for any given month.)
Cap total credits at 100% of the monthly fee. Enterprise customers sometimes ask for remedies that exceed the monthly fee, such as consequential damages. That belongs in the limitation of liability clauses of the Master Services Agreement, handled by your legal team, not in the SLA addendum.
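The schedule above can be expressed as a small lookup. This sketch makes two assumptions that the contract text should state explicitly: each tier includes its lower bound and excludes its upper bound, and a 60-minute incident falls into the 25% tier. The function name `credit_pct` is hypothetical. It takes probe counts rather than a percentage so the tier comparison uses exact integer arithmetic.

```bash
#!/bin/bash
# credit_pct GOOD_PROBES TOTAL_PROBES LONGEST_INCIDENT_MIN   (hypothetical helper)
# Prints the credit owed as a percentage of the monthly fee.
credit_pct() {
  local good="$1" total="$2" longest_min="$3"
  local monthly incident credit

  # Monthly availability tier, via integer cross-multiplication
  if   [ $((good * 1000)) -ge $((total * 999)) ]; then monthly=0     # >= 99.9%
  elif [ $((good * 100))  -ge $((total * 99))  ]; then monthly=25    # 99.0% to < 99.9%
  elif [ $((good * 100))  -ge $((total * 98))  ]; then monthly=50    # 98.0% to < 99.0%
  else monthly=100                                                   # < 98.0%
  fi

  # Individual incident tier, based on the longest incident of the month
  if   [ "$longest_min" -gt 120 ]; then incident=50
  elif [ "$longest_min" -ge 60 ];  then incident=25
  elif [ "$longest_min" -ge 30 ];  then incident=10
  else incident=0
  fi

  # Credits do not stack: the higher of the two applies, capped at 100%
  credit=$(( monthly > incident ? monthly : incident ))
  if [ "$credit" -gt 100 ]; then credit=100; fi
  echo "$credit"
}
```

For example, `credit_pct 43170 43200 30` prints `10`: monthly availability is still within the SLA, but the 30-minute incident earns the incident-level credit.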
Monthly SLA report format
Enterprise procurement teams require monthly SLA reports as evidence for their own SOC 2 audits and vendor management processes. A complete MCP server SLA report contains:
- Reporting period: calendar month start and end date (UTC)
- Availability percentage: calculated from external probe data (not self-reported)
- Total probe intervals: number of probe intervals in the period
- Failed intervals: count and percentage
- SLA commitment met: yes/no
- Credit owed: dollar amount if applicable, or $0 if SLA met
- Incident list: each incident with start time, end time, duration, and root cause classification
- Response time percentiles: p50, p95, p99 for the period
- Uptime chart: 30-day availability timeline, 1-day resolution
- Maintenance windows: list of scheduled windows and whether they were used
# Generate SLA report from AliveMCP API (Team/Enterprise tier)
curl -H "Authorization: Bearer $ALIVEMCP_TOKEN" \
"https://alivemcp.com/api/v1/servers/mcp-search/sla-report?month=2026-05" \
--output "sla-report-mcp-search-2026-05.pdf"
# Or JSON for programmatic use
curl -H "Authorization: Bearer $ALIVEMCP_TOKEN" \
"https://alivemcp.com/api/v1/servers/mcp-search/sla-report?month=2026-05&format=json" \
| jq '{
period: .period,
availability_pct: .availability_pct,
sla_target_pct: .sla_target_pct,
sla_met: .sla_met,
credit_owed_usd: .credit_owed_usd,
incidents: .incidents | length,
p99_response_ms: .response_time_p99_ms
}'
Measuring SLA from external probes vs internal metrics
Self-reported availability (measuring from your own server's metrics) has a fundamental problem: it only counts requests the server actually received and processed. It cannot measure downtime caused by:
- Network failures between the load balancer and the internet
- DNS resolution failures
- TLS certificate expiry at the edge
- Infrastructure-level failures that prevent requests from reaching the server
From the customer's perspective, all of these are downtime — their agent couldn't connect. From your internal metrics, they appear as zero requests (not as errors). The gap between "external availability" and "server-side request success rate" is often 0.05–0.1% over a year — enough to push a 99.95% self-reported number below 99.9% when measured externally.
Enterprise SLA agreements should always specify that availability is measured by an independent external probe, not by the server's own metrics. AliveMCP's probe is the canonical source for this measurement in AliveMCP Team and Enterprise accounts — the monthly report timestamps map directly to the data your customers' auditors want.
Frequently asked questions
What's the right SLA target for an enterprise MCP server?
Match your SLA commitment to your measured track record, with headroom. If you've been running an MCP server for 90 days and measured 99.97% external availability, committing to 99.9% (roughly 3.3× headroom on the failure budget) is reasonable. Committing to 99.95% (about 1.7× headroom) requires confidence in your infrastructure stability. Committing to 99.99% requires a very mature HA setup with sub-minute failover and a better track record than 99.97%; don't commit to it unless you've actually measured the failover working. New MCP server deployments without a track record should start at 99.5% and ratchet up as the track record accumulates.
How should planned maintenance be handled in the SLA?
Require advance notification (72 hours is standard for most enterprise SaaS SLAs) posted to a public status page. Exclude the maintenance window from the monthly availability calculation. Cap maintenance windows at 4 hours per month — customers who need zero-maintenance uptime require active-active HA where updates are rolled out without windows. For MCP servers, zero-downtime updates are achievable with blue-green deployment — you may eventually be able to remove maintenance windows from your SLA entirely, which is a competitive advantage.
Can I use an internal monitoring system for SLA measurement instead of AliveMCP?
You can, but enterprise procurement teams will ask whether the monitoring system is independent of the service being measured. An internal system that goes down when the MCP server goes down is measuring its own availability, not the MCP server's. The credibility of your SLA report depends partly on who measured it. Third-party monitoring (AliveMCP, or any external synthetic monitoring service) removes the conflict of interest. If you use internal monitoring for SLA reporting, document the independence (runs on a separate VPS, different DNS resolver, different network path) to satisfy auditor questions.
What happens if my external probe provider has an outage — do probe failures count against my SLA?
No — SLAs should exclude periods where the monitoring infrastructure itself was unavailable, provided you can document the probe outage separately. Most external monitoring services (including AliveMCP) maintain their own status pages. If AliveMCP shows an outage on the same interval as your MCP server shows failed probes, the probe data for that window is excluded from SLA calculation. Document this exclusion policy in your SLA terms to prevent disputes.
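If you compute availability from your own probe log, the exclusion can be applied as a filter before counting. The sketch below assumes two plain-text files whose formats are invented for this example: a probe log with lines of `<epoch_seconds> <GOOD|BAD> <latency_ms>`, and a windows file with one `<start_epoch> <end_epoch>` pair per documented monitoring outage. The function name `exclude_windows` is hypothetical.

```bash
#!/bin/bash
# exclude_windows LOG_FILE WINDOWS_FILE   (hypothetical helper)
# Prints the log lines whose timestamp falls outside every excluded window.
# A window covers start_epoch (inclusive) up to end_epoch (exclusive).
exclude_windows() {
  local log_file="$1" windows_file="$2"
  awk '
    BEGIN { n = 0 }
    FILENAME == ARGV[1] {                 # first file: the exclusion windows
      if (NF >= 2) { w_start[n] = $1 + 0; w_end[n] = $2 + 0; n++ }
      next
    }
    {                                     # second file: the probe log
      ts = $1 + 0
      for (k = 0; k < n; k++)
        if (ts >= w_start[k] && ts < w_end[k]) next
      print
    }
  ' "$windows_file" "$log_file"
}
```

The same filter works for pre-announced maintenance windows. Keep the windows file under version control alongside the provider's status-page evidence so every exclusion can be justified to an auditor.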
Should SLAs cover individual tools or just MCP server availability overall?
Cover MCP server availability overall (ability to complete the initialize handshake and call any tool). Per-tool SLAs (committing to "the query_database tool will respond within 200ms in 99.9% of calls") create audit complexity and expose you to SLA claims from tool-level degradation outside your control (a downstream database being slow). Keep the SLA at the protocol level. If specific tools have special latency requirements, address those in a separate Technical Requirements Document rather than the SLA — that way, tool latency becomes an engineering objective, not a contractual credit trigger.
Further reading
- MCP server SLO — internal objectives, error budgets, and burn rate alerts
- MCP server SOC 2 — Availability criterion and audit evidence
- Enterprise MCP deployment — HA, blue-green, and change management
- MCP server incident response — detection, triage, and post-mortem
- MCP server uptime monitoring — probe types, intervals, and alert tiers
- MCP server status page — public and private status communication
- AliveMCP — continuous protocol monitoring for MCP servers