Guide · Enterprise Security

MCP server SLA

A Service Level Agreement for an MCP server is a contractual commitment to your customers about how available the server will be — typically expressed as a monthly availability percentage and backed by a credit schedule when you fall short. Getting an MCP server SLA right involves three distinct decisions: what to measure (the Service Level Indicator), what percentage to commit to (the Service Level Objective), and what the contractual consequence is when you miss it (the SLA itself). This guide covers the full stack — from choosing the right probe to generating the monthly PDF report that satisfies enterprise procurement SLA evidence requirements.

TL;DR

Measure MCP server availability from an external probe that performs a real initialize handshake, not from your server's own metrics, which can't count failed requests the server never received. Common tiers: 99.9% (≈43 min/month allowed downtime), 99.95% (≈22 min), 99.99% (≈4 min). Credit schedule: 25–100% of the monthly fee depending on how far monthly availability falls below target, or 10–50% for a single long incident, whichever is higher. AliveMCP's Team and Enterprise plans export monthly SLA PDF reports with availability percentage, incident list, and response-time percentiles: the exact evidence your customers' procurement teams request.

SLI, SLO, and SLA: the three-tier framework

These three terms are often used interchangeably but represent distinct layers of the availability framework:

Term	What it is	Example for MCP server	Who owns it
SLI (Service Level Indicator)	The raw measurement — what you actually observe	% of 60-second probe intervals where MCP initialize succeeds in <5 seconds	Engineering (what to measure)
SLO (Service Level Objective)	Your internal target for the SLI	SLI ≥ 99.95% measured over a rolling 30-day window	Engineering + Product (what to aim for)
SLA (Service Level Agreement)	The contractual commitment, typically less strict than the SLO, with credits if breached	99.9% monthly availability; 25% credit for 99.0–99.9%, 50% for 98.0–99.0%, 100% below 98.0%	Legal + Business (what's promised and what's owed)

The gap between SLO and SLA is intentional: your SLO should be more aggressive than your SLA so that when the SLO fires (internal alert), you still have room to recover before breaching the SLA (contractual trigger). If SLO = SLA, every internal alert is also a credit event — engineering has no runway.

What to measure: the MCP server SLI

The SLI for an MCP server should reflect what your customers actually depend on: the ability to successfully initiate an MCP session and call tools. This means the probe must do more than check if a TCP port is open.

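#!/bin/bash
# mcp-sli-probe.sh: measure one probe interval for SLI calculation.
# Prints GOOD or BAD with the observed latency; exit status 0 = good, 1 = bad.

SERVER_URL="${1:-https://mcp-search.internal/mcp}"
TIMEOUT_SECS=5
THRESHOLD_MS=$((TIMEOUT_SECS * 1000))

# Send a real MCP initialize request. The initialize params include the
# required "capabilities" object, and the Accept header lists both content
# types that a Streamable HTTP server may answer with.
# curl appends the HTTP status and total time as two extra lines after the body.
RESPONSE=$(curl \
    --silent \
    --max-time "$TIMEOUT_SECS" \
    --write-out "\n%{http_code}\n%{time_total}" \
    --request POST "$SERVER_URL" \
    --header "Content-Type: application/json" \
    --header "Accept: application/json, text/event-stream" \
    --data '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"sli-probe","version":"1.0"}}}')

# The last two lines are the write-out values. Everything before them is the
# body, which may span several lines (pretty-printed JSON or an SSE stream).
HTTP_CODE=$(printf '%s\n' "$RESPONSE" | tail -n 2 | head -n 1)
ELAPSED_MS=$(printf '%s\n' "$RESPONSE" | tail -n 1 | awk '{printf "%d", $1 * 1000}')
BODY=$(printf '%s\n' "$RESPONSE" | sed '$d' | sed '$d')

# SLI = "good" if:
# 1. HTTP 200
# 2. Response contains protocolVersion (valid MCP response)
# 3. Elapsed time < SLI threshold (5000ms)
if [ "$HTTP_CODE" = "200" ] \
    && printf '%s\n' "$BODY" | grep -q '"protocolVersion"' \
    && [ "$ELAPSED_MS" -lt "$THRESHOLD_MS" ]; then
    echo "GOOD latency=${ELAPSED_MS}ms"
    exit 0
else
    echo "BAD http_code=$HTTP_CODE latency=${ELAPSED_MS}ms"
    exit 1
fi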

Log each probe result (timestamp, good/bad, latency) to calculate the availability percentage at the end of each calendar month.
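
As a rough illustration of that monthly calculation, the sketch below assumes each probe result is appended to a log as a UTC timestamp followed by the probe's output line. The log layout and the function name `availability_pct` are assumptions made for this example, not part of any tool described here.

```bash
# Assumed log format, one line per probe interval:
#   2026-05-01T00:00:00Z GOOD latency=212ms
#   2026-05-01T00:01:00Z BAD http_code=503 latency=40ms

# availability_pct LOGFILE MONTH
# Prints the percentage of GOOD probe intervals whose timestamp starts with
# MONTH (YYYY-MM), to four decimal places.
availability_pct() {
    awk -v month="$2" '
        index($1, month) == 1 {      # timestamp begins with YYYY-MM
            total++
            if ($2 == "GOOD") good++
        }
        END {
            if (total == 0) { print "no-data"; exit 1 }
            printf "%.4f\n", good / total * 100
        }
    ' "$1"
}
```

A cron entry could produce such a log with something like `printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(./mcp-sli-probe.sh)" >> probe.log`, after which `availability_pct probe.log 2026-05` reports the month's figure.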

Latency SLI: beyond binary availability

A server that responds to every initialize in 4,900 ms is technically "up", but it adds nearly five seconds to every agent session that connects to it. Consider adding a latency SLI alongside the availability SLI:

# Latency SLI (separate from availability SLI)
# "good" = p99 response time for tools/call < 500ms over the measurement window

# Measured via AliveMCP response-time histogram:
# curl -H "Authorization: Bearer $TOKEN" \
#   "https://alivemcp.com/api/v1/servers/mcp-search/metrics?period=2026-05" \
#   | jq '.response_time_p99_ms'
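
If you keep raw latencies yourself rather than reading them from a monitoring API, a nearest-rank percentile is enough to evaluate a latency SLI like this one. The sketch assumes a file with one latency in milliseconds per line; the function name `percentile` and the file layout are illustrative assumptions.

```bash
# percentile P FILE
# Nearest-rank percentile of the numbers in FILE (one value per line).
# P must be a whole number from 1 to 100.
percentile() {
    sort -n "$2" | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # Integer form of ceil(p/100 * NR), avoiding floating-point drift.
            rank = int((p * NR + 99) / 100)
            if (rank < 1) rank = 1
            print v[rank]
        }
    '
}
```

A check against the 500 ms target could then read `[ "$(percentile 99 latencies.txt)" -lt 500 ] && echo "latency SLI met"`, where `latencies.txt` is a hypothetical file name.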

Availability percentage and downtime budget

Availability percentage = (total probe intervals − failed intervals) / total probe intervals × 100. A probe interval that times out counts as failed. A probe interval that returns a 5xx error counts as failed. A probe interval that returns a 200 with a malformed (non-MCP) response counts as failed.

SLA target	Maximum downtime per month (30d)	Maximum downtime per year	Failed probe budget (60s interval)
99.0%	7h 12m	3d 15h	≤ 432 failed probes/month
99.5%	3h 36m	1d 20h	≤ 216 failed probes/month
99.9%	43m 12s	8h 45m	≤ 43 failed probes/month
99.95%	21m 36s	4h 22m	≤ 21 failed probes/month
99.99%	4m 19s	52m	≤ 4 failed probes/month

Choose the SLA tier based on your current measured performance with headroom. If your MCP server has achieved 99.97% availability over the last 90 days, its observed failure rate is 0.03%. Committing to 99.9% (a 0.1% failure budget) gives you roughly a 3.3× safety margin. Committing to 99.95% (a 0.05% budget) leaves about 1.7×: workable, but it requires rigorous on-call coverage. Committing to 99.99% (a 0.01% budget, about 4 minutes per month) is not supportable on that record, because the observed failure rate is already three times the budget.
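
The budget figures and a headroom ratio can be derived with a little arithmetic. The sketch below assumes a 30-day month and a 60-second probe interval by default; the function names `downtime_budget` and `headroom` are illustrative. Headroom is defined here as the SLA failure budget divided by the observed failure rate, so a value below 1 means the measured record already misses the target.

```bash
# downtime_budget TARGET_PCT [DAYS] [PROBE_INTERVAL_SECS]
# Prints the allowed downtime in seconds and the number of whole probe
# intervals that may fail within the period.
downtime_budget() {
    awk -v target="$1" -v days="${2:-30}" -v interval="${3:-60}" 'BEGIN {
        raw = days * 86400 * (100 - target) / 100
        # Round seconds to the nearest integer; the small epsilon keeps
        # floating-point error from dropping a whole probe from the count.
        printf "budget_secs=%d max_failed_probes=%d\n", int(raw + 0.5), int(raw / interval + 1e-9)
    }'
}

# headroom MEASURED_PCT TARGET_PCT
# Prints (failure budget) / (observed failure rate) to one decimal place.
headroom() {
    awk -v measured="$1" -v target="$2" 'BEGIN {
        if (measured >= 100) { print "inf"; exit }
        printf "%.1f\n", (100 - target) / (100 - measured)
    }'
}
```

For example, `downtime_budget 99.9` prints `budget_secs=2592 max_failed_probes=43`, which corresponds to the 43m 12s row of the table.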

What counts as downtime

Define downtime precisely in the SLA. Ambiguity here is where enterprise procurement disputes start. A robust definition for MCP servers:

# SLA downtime definition (include verbatim in customer agreements)

"Downtime" means any period during which AliveMCP's external monitoring probe
is unable to successfully complete a Model Context Protocol (MCP) initialize
handshake with the Service endpoint within 5,000 milliseconds. A period of
downtime begins when three consecutive probe intervals fail and ends when three
consecutive probe intervals succeed.

Excluded from downtime calculation:
- Scheduled maintenance windows (notified ≥72 hours in advance via status page)
- Probe failures caused by customer-side network issues or client misconfigurations
- Periods of force majeure (as defined in the Master Services Agreement)
- The first 5 minutes of any incident (allowing for transient network noise)

The "three consecutive failures to start, three consecutive successes to end" rule prevents a single probe timeout (which could be a transient network blip) from triggering an SLA event. With 60-second probe intervals, this means a genuine outage is detected within 3 minutes of starting.
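
The three-in, three-out rule is a small state machine, sketched below over a probe log whose lines start with a UTC timestamp and GOOD or BAD. Two points are assumptions of this sketch rather than something the definition fixes: an incident is taken to start at the first of the three failing probes and to end at the first of the three recovering probes. The function name `detect_incidents` is hypothetical.

```bash
# detect_incidents LOGFILE
# Prints one line per incident: "incident start=<ts> end=<ts|ongoing>".
detect_incidents() {
    awk '
        {
            ts = $1
            ok = ($2 == "GOOD")
            if (!down) {
                if (ok) {
                    fails = 0                      # a lone blip is forgotten
                } else {
                    if (fails == 0) first_fail = ts
                    fails++
                    if (fails == 3) { down = 1; start = first_fail; succ = 0 }
                }
            } else {
                if (ok) {
                    if (succ == 0) first_ok = ts
                    succ++
                    if (succ == 3) {
                        print "incident start=" start " end=" first_ok
                        down = 0; fails = 0
                    }
                } else {
                    succ = 0                       # recovery streak broken
                }
            }
        }
        END { if (down) print "incident start=" start " end=ongoing" }
    ' "$1"
}
```

A single failed probe between two good ones produces no output, while a failure streak interrupted by one good probe stays a single incident until three good probes arrive in a row.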

Error condition	Counts as downtime?	Rationale
HTTP 5xx on MCP endpoint	Yes	Server-side failure
HTTP 4xx on MCP endpoint	No (investigate separately)	May indicate auth regression on customer side
Connection timeout (>5s)	Yes	Network or server failure, agent perspective: unusable
Valid HTTP 200 but invalid MCP response	Yes	Schema regression — tools/call would fail even if connection succeeds
Scheduled maintenance (pre-announced)	No	Excluded per SLA terms
DNS resolution failure	Yes	Customer perspective: service unavailable
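
The table can be encoded as a small classifier so that every probe result is judged the same way. This is a sketch under assumptions: the function name `counts_as_downtime` and its three arguments are hypothetical, any non-zero curl exit status (for example 6 for a DNS failure or 28 for a timeout) is treated as a transport failure, and HTTP codes the table does not mention are reported as unclassified rather than guessed. Scheduled maintenance is expected to be filtered out before this step.

```bash
# counts_as_downtime CURL_EXIT HTTP_CODE MCP_VALID
#   CURL_EXIT  exit status of curl (0 = transport succeeded)
#   HTTP_CODE  three-digit status, "000" when no response arrived
#   MCP_VALID  1 if the body was a valid MCP initialize response, else 0
# Prints yes, no, or unclassified.
counts_as_downtime() {
    local curl_exit="$1" http_code="$2" mcp_valid="$3"

    # Timeouts, DNS failures and other transport errors: service unreachable.
    if [ "$curl_exit" -ne 0 ]; then
        echo yes
        return
    fi

    case "$http_code" in
        5??) echo yes ;;                       # server-side failure
        4??) echo no ;;                        # investigate separately
        200)
            if [ "$mcp_valid" = "1" ]; then
                echo no                        # healthy probe
            else
                echo yes                       # HTTP OK but invalid MCP response
            fi
            ;;
        *)   echo unclassified ;;              # not covered by the table
    esac
}
```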

Credit schedule design

Credit schedules should scale with severity — a minor SLA miss (99.88% vs 99.9% target) earns a smaller credit than a catastrophic month (95% availability). Standard enterprise SaaS credit schedules adapted for MCP servers:

# Credit schedule (include in customer SLA addendum)

Monthly Availability    Credit (% of monthly fee)
─────────────────────   ──────────────────────────
≥ 99.9%                 No credit (within SLA)
≥ 99.0% and < 99.9%     25%
≥ 98.0% and < 99.0%     50%
< 98.0%                 100%

Credit for individual incidents:
─────────────────────────────────
Incident duration < 30 min:             No credit
Incident duration ≥ 30 and < 60 min:    10% of monthly fee
Incident duration ≥ 60 and ≤ 120 min:   25% of monthly fee
Incident duration > 120 min:            50% of monthly fee

(Credits from monthly availability and individual incidents do not stack.
 The higher of the two applies for any given month.)

Cap total credits at 100% of monthly fee. Enterprise customers sometimes negotiate for credits that exceed the monthly fee ("consequential damages"), but this is standard liability waiver territory — your legal team should handle this via limitation of liability clauses in the Master Services Agreement, not in the SLA addendum.
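
The schedule can be expressed as a lookup plus a "higher of the two" rule, sketched below. The function names are hypothetical, and two choices are assumptions of this sketch: availability exactly on a tier boundary falls into the tier with the smaller credit, and only the longest single incident of the month is considered for the incident credit. Because the result is the maximum of the two percentages, it never exceeds the 100% cap.

```bash
# monthly_credit_pct AVAILABILITY_PCT
monthly_credit_pct() {
    awk -v a="$1" 'BEGIN {
        if      (a >= 99.9) print 0
        else if (a >= 99.0) print 25
        else if (a >= 98.0) print 50
        else                print 100
    }'
}

# incident_credit_pct LONGEST_INCIDENT_MINUTES
incident_credit_pct() {
    awk -v m="$1" 'BEGIN {
        if      (m < 30)   print 0
        else if (m < 60)   print 10
        else if (m <= 120) print 25
        else               print 50
    }'
}

# credit_owed_pct AVAILABILITY_PCT LONGEST_INCIDENT_MINUTES
# The two credits do not stack; the higher one applies.
credit_owed_pct() {
    local monthly incident
    monthly=$(monthly_credit_pct "$1")
    incident=$(incident_credit_pct "$2")
    if [ "$incident" -gt "$monthly" ]; then
        echo "$incident"
    else
        echo "$monthly"
    fi
}
```

For instance, `credit_owed_pct 99.92 35` prints `10`: the month stayed within the 99.9% commitment, but a single 35-minute incident still earns the 10% incident credit.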

Monthly SLA report format

Enterprise procurement teams require monthly SLA reports as evidence for their own SOC 2 audits and vendor management processes. A complete MCP server SLA report contains:

Reporting period: calendar month start and end date (UTC)
Availability percentage: calculated from external probe data (not self-reported)
Total probe intervals: number of probe intervals in the period
Failed intervals: count and percentage
SLA commitment met: yes/no
Credit owed: dollar amount if applicable, or $0 if SLA met
Incident list: each incident with start time, end time, duration, and root cause classification
Response time percentiles: p50, p95, p99 for the period
Uptime chart: 30-day availability timeline, 1-day resolution
Maintenance windows: list of scheduled windows and whether they were used

# Generate SLA report from AliveMCP API (Team/Enterprise tier)
curl -H "Authorization: Bearer $ALIVEMCP_TOKEN" \
    "https://alivemcp.com/api/v1/servers/mcp-search/sla-report?month=2026-05" \
    --output "sla-report-mcp-search-2026-05.pdf"

# Or JSON for programmatic use
curl -H "Authorization: Bearer $ALIVEMCP_TOKEN" \
    "https://alivemcp.com/api/v1/servers/mcp-search/sla-report?month=2026-05&format=json" \
    | jq '{
        period: .period,
        availability_pct: .availability_pct,
        sla_target_pct: .sla_target_pct,
        sla_met: .sla_met,
        credit_owed_usd: .credit_owed_usd,
        incidents: .incidents | length,
        p99_response_ms: .response_time_p99_ms
    }'

Measuring SLA from external probes vs internal metrics

Self-reported availability (measuring from your own server's metrics) has a fundamental problem: it only counts requests the server actually received and processed. It cannot measure downtime caused by:

Network failures between the load balancer and the internet
DNS resolution failures
TLS certificate expiry at the edge
Infrastructure-level failures that prevent requests from reaching the server

From the customer's perspective, all of these are downtime: their agent couldn't connect. From your internal metrics, they appear as zero requests (not as errors). The gap between "external availability" and "server-side request success rate" is often 0.05–0.1% over a year, enough to pull a 99.95% self-reported number down to 99.9% or lower when measured externally.

Enterprise SLA agreements should always specify that availability is measured by an independent external probe, not by the server's own metrics. AliveMCP's probe is the canonical source for this measurement in AliveMCP Team and Enterprise accounts — the monthly report timestamps map directly to the data your customers' auditors want.

Frequently asked questions

What's the right SLA target for an enterprise MCP server?

Match your SLA commitment to your measured track record with headroom. If you've been running an MCP server for 90 days and measured 99.97% external availability, committing to 99.9% (about 3.3× headroom on the failure budget) is reasonable. Committing to 99.95% (about 1.7× headroom) requires confidence in your infrastructure stability. Committing to 99.99% requires a measured record better than 99.99% and a very mature HA setup with sub-minute failover; don't commit to it unless you've actually measured the failover working. New MCP server deployments without a track record should start at 99.5% and ratchet up as the track record accumulates.

How should planned maintenance be handled in the SLA?

Require advance notification (72 hours is standard for most enterprise SaaS SLAs) posted to a public status page. Exclude the maintenance window from the monthly availability calculation. Cap maintenance windows at 4 hours per month — customers who need zero-maintenance uptime require active-active HA where updates are rolled out without windows. For MCP servers, zero-downtime updates are achievable with blue-green deployment — you may eventually be able to remove maintenance windows from your SLA entirely, which is a competitive advantage.

Can I use an internal monitoring system for SLA measurement instead of AliveMCP?

You can, but enterprise procurement teams will ask whether the monitoring system is independent of the service being measured. An internal system that goes down when the MCP server goes down is measuring its own availability, not the MCP server's. The credibility of your SLA report depends partly on who measured it. Third-party monitoring (AliveMCP, or any external synthetic monitoring service) removes the conflict of interest. If you use internal monitoring for SLA reporting, document the independence (runs on a separate VPS, different DNS resolver, different network path) to satisfy auditor questions.

What happens if my external probe provider has an outage — do probe failures count against my SLA?

No — SLAs should exclude periods where the monitoring infrastructure itself was unavailable, provided you can document the probe outage separately. Most external monitoring services (including AliveMCP) maintain their own status pages. If AliveMCP shows an outage on the same interval as your MCP server shows failed probes, the probe data for that window is excluded from SLA calculation. Document this exclusion policy in your SLA terms to prevent disputes.
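
Applying such an exclusion can be as simple as dropping the affected probe lines before availability is calculated. The sketch below assumes a probe log whose lines start with a UTC ISO-8601 timestamp and a separate file listing one excluded window per line as a start and end timestamp in the same format. Both file layouts and the function name `exclude_windows` are assumptions for illustration; windows are treated as half-open, so a probe exactly at the end timestamp is kept.

```bash
# exclude_windows PROBE_LOG WINDOWS_FILE
# Prints the probe log without lines whose timestamp falls in [START, END)
# of any window. Timestamps in one fixed UTC format sort correctly as strings.
exclude_windows() {
    awk '
        FILENAME == ARGV[1] {            # first file: load the windows
            n++
            w_start[n] = $1
            w_end[n]   = $2
            next
        }
        {
            for (i = 1; i <= n; i++)
                if (($1 "") >= (w_start[i] "") && ($1 "") < (w_end[i] ""))
                    next                 # inside an excluded window: drop
            print
        }
    ' "$2" "$1"
}
```

The same filter works for pre-announced maintenance windows, provided they are recorded in the same start/end format.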

Should SLAs cover individual tools or just MCP server availability overall?

Cover MCP server availability overall (ability to complete the initialize handshake and call any tool). Per-tool SLAs (committing to "the query_database tool will respond within 200ms in 99.9% of calls") create audit complexity and expose you to SLA claims from tool-level degradation outside your control (a downstream database being slow). Keep the SLA at the protocol level. If specific tools have special latency requirements, address those in a separate Technical Requirements Document rather than the SLA — that way, tool latency becomes an engineering objective, not a contractual credit trigger.