IaC guide · 2026-06-18 · Infrastructure as Code & GitOps

MCP Servers in Production: Terraform, Helm, GitHub Actions, GitOps, and Ansible

Every IaC and automation tool in the modern deployment stack (Terraform, Helm, GitHub Actions, ArgoCD and Flux, Ansible) offers a natural place to embed an MCP protocol verification checkpoint. Terraform's null_resource provisioner fires once the infrastructure it depends on has been provisioned. Helm's test hook runs whenever helm test is invoked, typically as the pipeline step right after helm upgrade. GitHub Actions sends a live initialize probe as a workflow step. ArgoCD's PostSync hook marks a sync as Failed if the protocol check doesn't pass. Ansible's uri module can halt a rolling update the moment a bad server enters the fleet. This guide synthesizes all five deployment patterns, maps the specific MCP protocol failure each checkpoint catches, and explains the structural blind spot every one of them shares: the gap that makes continuous external monitoring essential, not optional.

Five tools, five checkpoints, one shared gap

The comparison table below captures how each tool deploys an MCP server, what kind of MCP-specific verification it can embed, and the failure class that its checkpoint architecture cannot catch.

Tool	Primary role	MCP verification mechanism	Structural blind spot
Terraform	Provision cloud infrastructure (VMs, ECS, ALB, IAM)	`null_resource` local-exec `initialize` probe; a failed probe taints the resource, so the next apply re-runs it	Only fires on infrastructure changes; silent after steady state
Helm	Package and deploy versioned Kubernetes manifests	Test hook Job sending `initialize` inside the cluster; `helm test` exits non-zero on probe failure	Runs inside the cluster, so it bypasses Ingress and TLS; cert expiry invisible
GitHub Actions	Automate test → build → deploy pipeline	Post-deploy step: `curl | jq -e '.result.protocolVersion == ...'`; workflow fails on mismatch	Probe runs from the CI runner's network, not from user geography
GitOps (ArgoCD/Flux)	Git as source of truth; controller syncs cluster to repo	ArgoCD PostSync hook; Flux `healthChecks` + Kustomization readiness	Hook runs from inside cluster; drift detection doesn't catch runtime failures
Ansible	Agentless configuration management for VPS fleets	`uri` module `initialize` probe with `serial: 1` halts rolling update on failure	Control machine runs the probe; nothing watching after the playbook exits

The pattern is consistent: each tool's verification mechanism is a point-in-time check executed from a privileged position inside the provisioning network. The Terraform runner, the Helm test pod, the GitHub Actions runner, the ArgoCD probe Job, the Ansible control machine — none of these are positioned on the same network path that LLM clients actually traverse when they call your MCP server. And none of them keep running after the deployment is done.

Terraform: infrastructure-layer verification with a hard deploy gate

Terraform's value for MCP servers is stronger than for typical web services, because MCP servers have infrastructure requirements that compound quickly: sticky sessions or Streamable HTTP transport configuration at the ALB, IAM task roles to pull secrets at startup, security groups wide enough to admit external monitoring probes, EIP or fixed-DNS addresses for deterministic monitoring configuration. Encoding all of this in HCL puts every environment in version control and makes terraform plan the diff tool for your infrastructure rather than a mental model someone carries around in their head.

The MCP-specific innovation in Terraform deployments is the null_resource with a local-exec provisioner that sends a real initialize JSON-RPC request after apply completes:

resource "null_resource" "mcp_health_probe" {
  triggers = {
    instance_id = aws_instance.mcp.id
    elastic_ip  = aws_eip.mcp.public_ip
  }
  provisioner "local-exec" {
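    command = <<-EOT
      # Assumes the TLS certificate is valid for this address; if it is issued
      # for a hostname, probe the DNS name instead of the raw Elastic IP.
      # capabilities is a required initialize param; an empty object advertises none.
      # --max-time keeps a hung server from blocking the apply indefinitely.
      sleep 30
      curl -sf --max-time 20 -X POST https://${aws_eip.mcp.public_ip}/ \
        -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"terraform-probe","version":"1.0"}}}' \
        | jq -e '.result.protocolVersion == "2024-11-05"' \
        || (echo "MCP protocol probe failed" && exit 1)
    EOT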
  }
  depends_on = [aws_instance.mcp, aws_eip_association.mcp]
}

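The hard property of this pattern is in the taint mechanics: if the provisioner exits non-zero, the apply fails and Terraform marks the null_resource as tainted. The next terraform apply destroys and re-creates it, which re-runs the probe. The triggers map works in the other direction: when the instance ID or Elastic IP changes, the null_resource is replaced and the probe runs against the new infrastructure. Tainting the null_resource does not replace the instance itself; to rebuild a broken instance, replace it explicitly (for example, terraform apply -replace=aws_instance.mcp). What you get without human intervention is an apply that keeps failing until the probe passes, so a broken deploy cannot quietly be recorded as a successful one.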
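The jq -e filter in the probe reduces to a small predicate on the JSON-RPC response. The following is a minimal Python sketch of that predicate, useful for unit-testing probe logic without a live server. The function name and the sample payloads are illustrative assumptions, not part of any MCP SDK.

```python
import json

EXPECTED_VERSION = "2024-11-05"  # same version string the curl probe sends


def initialize_ok(raw_body, expected_version=EXPECTED_VERSION):
    """Return True only for a JSON-RPC result carrying the expected protocolVersion."""
    try:
        body = json.loads(raw_body)
    except (json.JSONDecodeError, TypeError):
        return False  # empty body, HTML error page, truncated response
    if not isinstance(body, dict):
        return False
    result = body.get("result")
    if not isinstance(result, dict):
        return False  # JSON-RPC error responses carry "error" instead of "result"
    return result.get("protocolVersion") == expected_version


if __name__ == "__main__":
    good = '{"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2024-11-05"}}'
    print(initialize_ok(good))
    print(initialize_ok("<html>502 Bad Gateway</html>"))
```

Like the jq expression, the predicate rejects an HTTP 200 that carries a JSON-RPC error object or a different protocol version, which is the case a plain status-code health check would miss.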

The blind spot is equally clear: the null_resource triggers only change when the infrastructure changes. A memory leak that causes the MCP server to crash six hours after a clean deploy, a certificate renewal failure at 3am, or an IAM credential rotation that revokes access to a secret at runtime: none of these cause the triggers map to change, so Terraform never re-runs the probe. The infrastructure is unchanged; only the application state has degraded. Terraform's state file still shows a healthy deployment. To the person looking at Terraform, everything is green.
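The re-run condition can be written down as a toy model. This sketch is a deliberate simplification of Terraform's planning (it models only the trigger comparison and the tainted flag discussed above), and the function name and sample values are hypothetical.

```python
def probe_reruns(previous_triggers, current_triggers, tainted=False):
    """Model when a null_resource is re-created, which is the only time its provisioner runs."""
    return tainted or previous_triggers != current_triggers


if __name__ == "__main__":
    # Hypothetical trigger values recorded at the last successful apply
    deployed = {"instance_id": "i-0abc", "elastic_ip": "203.0.113.10"}

    # Six hours later the process has crashed, but the infrastructure is unchanged
    print(probe_reruns(deployed, dict(deployed)))

    # The instance was replaced, so the probe runs again
    print(probe_reruns(deployed, {"instance_id": "i-0def", "elastic_ip": "203.0.113.10"}))
```

Nothing in the inputs describes application health, which is the point: a runtime failure has no way to enter the decision.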

ECS/Fargate deployments add a wrinkle: the Terraform post-apply probe first runs aws ecs wait services-stable to block until ECS reports the service at its desired count, then sends the initialize handshake to the ALB domain. This combines infrastructure-level readiness (ECS task is running and the ALB health check passes) with protocol-level correctness (the handshake succeeds end to end). It catches broken container images and broken MCP startup. It does not catch a task that passes both checks at deploy time and later keeps answering the health check (HTTP 200 on /health) while hanging on JSON-RPC requests to the actual / path.

Helm: versioned Kubernetes packaging with a post-deploy protocol gate

Helm solves a different problem from Terraform. Where Terraform manages what cloud infrastructure exists, Helm manages how application workloads run inside Kubernetes: the Deployment spec, the Service, the Ingress annotations, the HPA policy, the PodDisruptionBudget. A Helm chart makes all of these parameterizable across environments — a single helm upgrade --install mcp-server-production ./mcp-server -f values-production.yaml --set image.tag=$SHA command handles both first-time installs and subsequent upgrades, idempotently.

Two MCP-specific details that generic Kubernetes guides miss. First, the Ingress must disable proxy buffering and extend read timeouts for SSE transport:

annotations:
  nginx.ingress.kubernetes.io/proxy-buffering: "off"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
  nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

Without proxy-buffering: "off", nginx accumulates SSE events in its buffer before delivering them — breaking real-time streaming. Without a long read timeout (the default is 60 seconds), nginx closes idle SSE connections before the client has finished receiving a long tool response. Second, the Deployment needs terminationGracePeriodSeconds: 60 so that running SSE sessions can drain before Kubernetes sends SIGKILL during rolling updates or HPA scale-in.

The MCP-specific test hook sends a real initialize request from inside the cluster each time helm test runs, which the deploy pipeline should do after every upgrade:

annotations:
  "helm.sh/hook": test
  "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded

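If the hook fails (wrong protocol version, HTTP 5xx, no JSON response), helm test exits non-zero. Test hooks run only when helm test is invoked, not as part of helm upgrade, so --atomic does not react to them: --atomic rolls back only when the upgrade itself fails, for example when --wait times out before pods become Ready. To get a rollback on a failed protocol check, either have the pipeline run helm rollback when helm test fails, or attach the probe Job as a post-install,post-upgrade hook so that its failure fails the upgrade. The combination of --wait (blocks until pods are Ready) + --atomic (rolls back a failed upgrade) + the explicit test hook (validates MCP protocol after pods are ready) creates a three-layer deployment gate.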
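One way a pipeline might sequence the gate is sketched below in Python, assuming helm test is run as an explicit step after the upgrade. The helper names are hypothetical, namespace and kube-context flags are omitted, and the runner is injectable so the sequencing can be exercised without a cluster.

```python
import subprocess


def helm_gate_commands(release, chart, values_file, image_tag):
    """Build argv lists for the two pipeline steps: atomic upgrade, then helm test."""
    upgrade = [
        "helm", "upgrade", "--install", release, chart,
        "-f", values_file,
        "--set", f"image.tag={image_tag}",
        "--wait", "--atomic",
    ]
    test = ["helm", "test", release]
    return [upgrade, test]


def run_gate(release, chart, values_file, image_tag, run=subprocess.run):
    """Run upgrade then test; if the test fails, roll back to the previous revision."""
    upgrade, test = helm_gate_commands(release, chart, values_file, image_tag)
    if run(upgrade).returncode != 0:
        return False  # --atomic has already rolled back the failed upgrade
    if run(test).returncode != 0:
        run(["helm", "rollback", release])  # no revision given: previous release
        return False
    return True
```

A pipeline step could call run_gate("mcp-server-production", "./mcp-server", "values-production.yaml", sha) and fail the job when it returns False. Keeping the rollback in the pipeline rather than in the hook makes the failure visible as a separate, named step.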

The Helm blind spot is structural: the test pod runs inside the cluster and connects to the MCP server through the ClusterIP Service, bypassing the Ingress entirely. This means the test succeeds even when the Ingress is misconfigured, the TLS certificate is invalid, or the cert-manager renewal has silently failed. A cert-manager certificate that expires on a Sunday — because the ACME DNS-01 challenge timed out — leaves the cluster pods healthy and the Helm test green, while every external user gets a TLS handshake error. AliveMCP probes from outside the cluster through the full network path and catches this within one minute.

The HPA configuration also needs MCP-specific tuning. Because SSE connections are long-lived, setting behavior.scaleDown.stabilizationWindowSeconds: 300 (the Kubernetes default for scale-down, worth stating explicitly in the chart) prevents flapping: the HPA waits five minutes after utilization drops before removing pods, giving active sessions time to complete before the pod is terminated. MCP servers that accumulate per-session memory (tool call history, cached embeddings) scale better on memory utilization than CPU; the HPA template should include both metrics.
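The effect of the scale-down stabilization window can be illustrated with a small simulation: during scale-down, the autoscaler acts on the highest replica recommendation seen inside the window, not the latest one. The sketch below models only that rule; the function name and the sample history are assumptions made for illustration.

```python
def stabilized_scale_down(recommendations, now, window_seconds=300):
    """Return the replica count a scale-down stabilization window would allow.

    recommendations: list of (timestamp_seconds, desired_replicas) tuples.
    The result is the highest recommendation inside the window ending at `now`.
    """
    recent = [
        replicas
        for ts, replicas in recommendations
        if now - window_seconds <= ts <= now
    ]
    if not recent:
        raise ValueError("no recommendations inside the stabilization window")
    return max(recent)


if __name__ == "__main__":
    # Hypothetical history: load drops sharply after t=0
    history = [(0, 6), (120, 3), (240, 2), (420, 2)]
    print(stabilized_scale_down(history, now=240))  # 6: the t=0 peak is still in the window
    print(stabilized_scale_down(history, now=421))  # 2: every higher recommendation has aged out
```

With a 300-second window, pods holding SSE sessions stay up for five minutes after the load that justified them has gone.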

GitHub Actions: pipeline-layer verification with SHA traceability

GitHub Actions embeds the MCP protocol probe at the most visible layer of the development workflow: the CI/CD pipeline itself. A failed probe makes the deployment workflow red, blocks the same engineer who triggered it, and leaves an annotated error in the GitHub UI before any user experiences the failure.

The recommended pattern is a three-job workflow — test → build → deploy — where each job has a single responsibility:

test: runs on every pull request. No secrets, no production access, no Docker images. Pure correctness gate — the code is wrong, fix it before merging.
build: runs on push to main, after test passes. Pushes a Docker image tagged with both the commit SHA and latest. The SHA tag is the key: it creates an immutable reference between the running container and the exact Git commit that produced it. Rollback = redeploy the previous SHA.
deploy: depends on build, uses environment: production for environment-level secrets isolation (repository-level secrets are available to every workflow in the repository, including ones triggered by pull requests from branches of the same repository, which makes them an attack vector for credential exfiltration). Deploys the SHA-tagged image, then runs the protocol probe:

- name: Verify MCP protocol endpoint
  run: |
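    # capabilities is a required initialize param; an empty object advertises none.
    # --max-time fails the step if the server accepts the connection but never answers.
    sleep 15
    curl -sf --max-time 20 -X POST https://${{ secrets.MCP_ENDPOINT }}/ \
      -H 'Content-Type: application/json' \
      -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"ci-probe","version":"1.0"}}}' \
      | jq -e '.result.protocolVersion == "2024-11-05"' \
      || (echo "::error::MCP protocol verification failed"; exit 1)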

The echo "::error::..." prefix uses GitHub's workflow command format to emit a visible error annotation in the workflow run summary, not just a red step. A final step calls the AliveMCP API to register or update the monitor, so continuous monitoring is always current with the latest deployment without any manual configuration.

The GitHub Actions blind spot is geographic and temporal. The probe runs from a GitHub-hosted runner sitting in Microsoft Azure's network infrastructure. A failure mode specific to your deployment region — a degraded Availability Zone in AWS us-east-1, an ISP routing issue between specific geographic user clusters and your load balancer, an anycast IP that resolves differently for European users — may not manifest from the Azure-hosted runner's perspective. The probe is checking that the endpoint is reachable from one data center; AliveMCP checks from multiple geographic regions, matching the diversity of where your actual LLM clients are running. And after the workflow completes, the pipeline never runs again until the next commit — everything between deploys is invisible.

GitOps: continuous reconciliation with drift prevention

GitOps inverts the deployment model: instead of a CI/CD pipeline pushing changes to the cluster, a controller running inside the cluster watches the Git repository and pulls changes in. For MCP servers, this has a specific operational value. Tool definitions, allowed origins, rate-limit parameters, and MCP-specific ConfigMap values are all configuration that an engineer might "just quickly" tweak directly in the production cluster to test something. With GitOps and selfHeal: true, that drift is reverted automatically within ArgoCD's reconciliation cycle — typically under two minutes. The engineer gets a clear signal: commit it to Git, or the cluster will revert it.

The ArgoCD PostSync hook is the GitOps equivalent of Terraform's null_resource — it runs an MCP protocol probe after every sync completes:

annotations:
  argocd.argoproj.io/hook: PostSync
  argocd.argoproj.io/hook-delete-policy: BeforeHookCreation

If the Job exits non-zero, ArgoCD marks the sync as Failed and shows the application as degraded in the UI. The previous running Pods are not affected — the rolling update already completed. What the failure signals is precise: infrastructure synced correctly, but MCP protocol is not responding. That distinction — infrastructure-level success, application-level failure — directs the on-call engineer to application logs and MCP initialization errors rather than Kubernetes events and node health.

Flux CD takes a different architectural approach: instead of a PostSync hook Job, the Kustomization resource's healthChecks field instructs the kustomize-controller to wait for Deployment rollout completion before marking the Kustomization as Ready. Flux evaluates health using the same logic as kubectl rollout status: desired pods running, readiness probes passing, observed generation matching spec generation. A crash-looping pod or a failed image pull leaves the Kustomization not Ready, with a descriptive error that the notification controller forwards to Slack or PagerDuty.
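The three rollout conditions named above can be expressed as a short check against a Deployment object. This is an illustrative approximation in Python, operating on the Deployment as a plain dict; it is not Flux's or kubectl's actual implementation and it ignores progress-deadline conditions.

```python
def deployment_rolled_out(deployment):
    """Approximate a 'rollout complete' test on a Deployment represented as a dict."""
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})

    desired = spec.get("replicas", 1)  # Kubernetes defaults replicas to 1
    updated = status.get("updatedReplicas", 0)

    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False  # the controller has not yet observed the latest spec
    if updated < desired:
        return False  # new pods are still being created
    if status.get("replicas", 0) > updated:
        return False  # old pods are still awaiting termination
    return status.get("availableReplicas", 0) >= updated  # readiness probes passing
```

Every field inspected here describes pods and generations. Nothing in the check touches the Ingress, the certificate, or the MCP handshake, which is why a Ready Kustomization says little about what external clients see.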

The GitOps blind spot mirrors the Helm blind spot: both ArgoCD's PostSync hook and Flux's health checks run from inside the cluster. The PostSync probe connects to the MCP server through the cluster-internal DNS (mcp-server.production.svc.cluster.local), bypassing the Ingress, the load balancer, and TLS. A certificate managed by cert-manager that expires between syncs — ArgoCD syncs on Git change, not on a timer — produces the same invisible failure: green ArgoCD application, red user experience.

Ansible: agentless automation with rolling-update safety

Ansible occupies a specific niche: VPS-hosted MCP servers that don't warrant Kubernetes overhead but need repeatable, reviewable, automated configuration management. A single Ansible playbook installs Node.js, creates a system user, clones the application, installs npm dependencies, deploys a systemd service, configures nginx with SSE-safe settings, provisions a TLS certificate, and opens only the ports actually needed. Re-run the same playbook a month later and it converges to the same state without touching anything that hasn't changed.

The MCP-specific addition to the playbook is a uri module task at the end of the mcp_app role:

- name: Verify MCP protocol handshake
  uri:
    url: "https://{{ mcp_domain }}/"
    method: POST
    body_format: json
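    body:
      jsonrpc: "2.0"
      id: 1
      method: initialize
      params:
        protocolVersion: "2024-11-05"
        capabilities: {}  # required initialize param; empty object advertises none
        clientInfo:
          name: ansible-probe
          version: "1.0"
    return_content: true
    status_code: 200
  register: mcp_probe
  # default('') keeps the condition evaluable when the request failed and returned no content
  until: "'protocolVersion' in (mcp_probe.content | default(''))"
  retries: 3
  delay: 10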

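The until / retries / delay combination is important for MCP servers: the first probe attempt may land before the Node.js process has finished initializing its tool registry. Three retries with 10-second delays give roughly a 30-second window for startup while still failing fast on genuinely broken deployments. The until condition is a substring match, which is looser than the jq equality check used in the other probes: any 200 response whose body contains the text protocolVersion satisfies it.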

The rolling-update safety comes from the playbook's serial directive. Setting serial: 1 with max_fail_percentage: 0 means Ansible deploys to one host at a time and halts the entire playbook if any host fails its post-deploy probe. A broken build that passes tests but fails the MCP handshake reaches exactly one host in your fleet before the rollout stops:

# deploy.yml
- name: Rolling deploy
  hosts: mcp_servers
  serial: 1
  max_fail_percentage: 0
  roles:
    - mcp_app
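The halting behavior of the play above can be simulated in a few lines. This Python sketch models only the batching rule (one host at a time, stop at the first failure); the host names and the deploy_and_probe callback are hypothetical stand-ins for the mcp_app role and its uri probe.

```python
def rolling_deploy(hosts, deploy_and_probe):
    """Simulate serial: 1 with max_fail_percentage: 0 over a list of hosts.

    deploy_and_probe(host) returns True when the host passes its post-deploy probe.
    """
    pending = list(hosts)
    updated, failed, untouched = [], None, []
    while pending:
        host = pending.pop(0)
        if deploy_and_probe(host):
            updated.append(host)
        else:
            failed = host
            untouched = pending  # remaining hosts never receive the new build
            break
    return {"updated": updated, "failed": failed, "untouched": untouched}


if __name__ == "__main__":
    fleet = ["mcp-1", "mcp-2", "mcp-3"]
    # A build that fails the handshake everywhere stops after the first host
    print(rolling_deploy(fleet, lambda host: False))
```

The host that failed is left running the broken build, so the halt limits the blast radius but does not repair anything; recovery is still a separate redeploy of the previous version.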

The Ansible blind spot is temporal, not geographic: the uri probe runs from the Ansible control machine (often a CI server or a developer's laptop), and after the playbook exits, nothing is left watching. The systemd service may crash at 4am due to a memory leak in a new code path; the nginx reverse proxy may start returning 502 errors because the MCP server process exited under OOM pressure. Ansible will not run again until the next deploy. The combination of Ansible for automated provisioning and AliveMCP for continuous external monitoring gives you both the automation and the runtime visibility — neither one alone is sufficient for production MCP servers.

The shared structural blind spot: deploy-time ≠ continuous, internal ≠ external

Stepping back across all five tools, the pattern is identical at the architectural level. Each tool creates a verification checkpoint. Each checkpoint fires at deploy time. Each checkpoint runs from a position inside the provisioning network — the Terraform runner, the Helm test pod, the GitHub Actions runner, the ArgoCD hook Job, the Ansible control machine. After the checkpoint passes, the tool stops watching.

The failures this architecture cannot catch fall into four categories:

Failure class	IaC checkpoint result	What users experience
Memory leak on new code path	Deploy-time probe passes (clean startup)	MCP server crashes 4 hours post-deploy; tool calls time out
TLS certificate expiry	In-cluster probes bypass Ingress/TLS and always pass; external probes saw a valid certificate at deploy time	Every external client gets TLS handshake error
Out-of-band infrastructure change	No deploy triggered; no probe runs	Security group rule blocks port; client connections time out
Upstream API rate limit	Probe sends `initialize` (no upstream calls); passes	Tool calls that reach upstream API fail silently

The first three are universal to all five IaC tools. The fourth is specific to MCP servers: the initialize handshake does not call any upstream APIs. It succeeds even when the tool registry is broken, the database is unreachable, or the API key has been rotated without updating the server's configuration. A probe that only sends initialize passes on a server that will fail every tool call.
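A deploy-time probe can go one step past initialize by also inspecting the tools/list response. The sketch below checks a parsed response for a non-empty tools array and, optionally, for tool names the deployment is expected to expose. The function name and the example tool names are hypothetical; the response shape (result.tools as a list of objects with a name) follows the MCP tools/list result.

```python
def tools_list_ok(body, required_tools=()):
    """Check a parsed tools/list response for a usable tool registry."""
    result = body.get("result") if isinstance(body, dict) else None
    if not isinstance(result, dict):
        return False  # JSON-RPC error or malformed response
    tools = result.get("tools")
    if not isinstance(tools, list) or not tools:
        return False  # registry failed to load or is empty
    names = {tool.get("name") for tool in tools if isinstance(tool, dict)}
    return set(required_tools) <= names


if __name__ == "__main__":
    # Hypothetical response from a server exposing two tools
    response = {
        "jsonrpc": "2.0",
        "id": 2,
        "result": {"tools": [{"name": "search_docs"}, {"name": "get_ticket"}]},
    }
    print(tools_list_ok(response, required_tools=["search_docs"]))
```

This catches a broken tool registry but still makes no upstream calls, so an expired API key or an unreachable database stays invisible until an actual tools/call is exercised.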

This is why the monitoring strategy for a production MCP server must pair the IaC-embedded deploy-time checkpoint with continuous external monitoring. The checkpoint tells you the deployment was correct at the moment it happened. AliveMCP tells you the server is correct right now — every minute, from multiple geographic regions, using the same three-message JSON-RPC sequence (initialize, tools/list, tools/call) that LLM clients send in production.

Choosing the right combination for your deployment model

The five tools are not mutually exclusive. Most production MCP server deployments use two or three of them together, each handling a layer of the stack. The common combinations:

Deployment model	Tool combination	MCP verification layer
Small team, VPS	Terraform (VM) + Ansible (configuration)	Terraform `null_resource` + Ansible `uri` module
Container-first, no existing cluster	GitHub Actions (CI/CD) + ECS via Terraform	GitHub Actions post-deploy probe + Terraform ECS post-apply probe
Kubernetes, fast-moving team	GitHub Actions (build) + Helm (deploy)	GitHub Actions probe + Helm test hook
Kubernetes, compliance-focused team	GitHub Actions (build) + ArgoCD (deploy)	GitHub Actions probe + ArgoCD PostSync hook
Multi-cluster, platform team	Helm (chart) + Flux (GitOps across clusters)	Helm test hook + Flux Kustomization healthChecks

In each combination, the deploy-time checkpoints are complementary: a GitHub Actions probe catches failures visible from the CI runner's network before the GitOps controller ever sees the image; the ArgoCD PostSync hook catches failures that the CI probe missed because the pod wasn't fully initialized yet when the CI probe ran. Stacking multiple checkpoints at different points in the pipeline increases the probability of catching a bad deploy before it reaches steady-state production traffic.
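The stacking argument can be made concrete with a little arithmetic. If each checkpoint independently catches a bad deploy with some probability, the chance that at least one catches it is one minus the product of the miss rates. The Python sketch below computes that; the rates used are made-up illustrations, and the independence assumption is optimistic because these checkpoints share blind spots.

```python
from math import prod


def combined_catch_rate(catch_rates):
    """Probability that at least one of several independent checkpoints catches a bad deploy."""
    rates = list(catch_rates)
    for p in rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each catch rate must be between 0 and 1")
    return 1.0 - prod(1.0 - p for p in rates)


if __name__ == "__main__":
    # Hypothetical rates for a CI probe and a PostSync hook
    print(round(combined_catch_rate([0.7, 0.5]), 2))  # 0.85
```

For a failure class that every checkpoint misses, such as a certificate expiring days after the deploy, each rate is zero and so is the combined rate, no matter how many checkpoints are stacked.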

But the checkpoints are all pre-runtime. After the last pipeline step completes, there is no automated verification running until the next deployment. For most MCP servers, especially those used by LLM agents running autonomously, often overnight, that window is where most outages actually occur. A deployment that is clean at noon can be down by midnight for reasons that have nothing to do with the code that was deployed: a certificate renewal ran at 11pm and failed because the DNS-01 challenge token wasn't cleaned up from the previous month; an upstream API rotated its keys or tightened its rate limits; a long-running tool call triggered a memory allocation that crossed the OOM threshold. None of the five IaC tools catch these failures. AliveMCP does.

Frequently asked questions

Which IaC tool should I use first for an MCP server if I'm starting from scratch?

It depends on where your MCP server lives. If you're provisioning a VPS and setting up the server from scratch, start with Terraform to create the cloud resources (VM, Elastic IP, security groups, DNS records) and Ansible to configure the server (Node.js, nginx, systemd, TLS). This combination requires no Kubernetes knowledge and produces a production-grade server with full audit trails. If you're already containerizing your MCP server, GitHub Actions for CI/CD plus Helm for Kubernetes packaging is the most common path for teams new to Kubernetes — Helm's chart structure is learnable in a day, and the helm upgrade --install + --atomic pattern gives you automatic rollback without any additional tooling. Add ArgoCD or Flux when you have multiple clusters or need the audit trail and drift prevention that GitOps provides — usually when a second environment or a second team gets involved.

Can I use Terraform to manage my Helm releases, or should they be separate?

You can — Terraform has a helm provider that can manage Helm releases as Terraform resources. For small teams with a single cluster this is convenient: one terraform apply provisions the cloud infrastructure and deploys the Helm chart. The tradeoff is that Helm releases managed by Terraform don't benefit from Helm's native rollback history (helm history, helm rollback) because Terraform tracks the desired state, not the Helm revision chain. For MCP server deployments where rapid rollback on a bad protocol probe failure matters, maintaining Helm releases separately — applied by a GitHub Actions workflow or ArgoCD — preserves the full rollback capability. Use Terraform for the infrastructure boundary (cluster, node pools, load balancers, IAM) and let Helm manage everything inside the cluster boundary.

How do I prevent the ArgoCD PostSync hook from probing the MCP endpoint before the pods are ready?

The PostSync hook runs after ArgoCD confirms that all resources in the sync wave have reached a healthy state — which, for a Deployment, means all pods are running and their readiness probes have passed. However, a pod can pass its HTTP readiness probe (/health returns 200) while the MCP server is still initializing its tool registry, loading configuration files, or establishing connections to upstream APIs. Adding a sleep 10 at the start of the probe command in the PostSync hook Job is the simplest mitigation — it gives the MCP server process ten additional seconds to complete initialization after the readiness probe has passed. If your server has a longer startup path (loading large models, populating caches from a database), increase the sleep value or replace it with a retry loop: until curl -sf ... | grep -q '"protocolVersion"'; do sleep 5; done with a timeout to prevent infinite waiting. The retry approach is more robust and doesn't over-sleep on healthy servers.
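The retry-with-timeout idea can also be written as a small helper if the hook Job runs a script rather than a shell one-liner. The following is a minimal Python sketch under that assumption; the function name is hypothetical, probe is any callable that returns True once the handshake succeeds, and the clock and sleep parameters are injectable so the loop can be tested without waiting.

```python
import time


def wait_for_handshake(probe, timeout_seconds=60.0, interval_seconds=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or the timeout would be exceeded."""
    deadline = clock() + timeout_seconds
    while True:
        if probe():
            return True  # healthy servers return on the first attempt, with no sleep
        if clock() + interval_seconds > deadline:
            return False  # another sleep would overrun the deadline
        sleep(interval_seconds)
```

A hook script would call wait_for_handshake with a probe that posts the initialize request and exit non-zero when the helper returns False, so ArgoCD records the sync as Failed.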

How do I wire AliveMCP monitoring into my IaC pipeline so it's always current after a deploy?

Add a final step to each tool's deploy phase that calls the AliveMCP API. In Terraform, a second null_resource that depends on the health probe resource registers the endpoint and runs only when the probe passes. In GitHub Actions, a final workflow step runs curl -X POST https://alivemcp.com/api/v1/monitors ... after the protocol probe succeeds. In Ansible, a uri task at the end of the playbook registers the monitor URL. In Helm's CI pipeline, add the AliveMCP registration curl call to the same step that runs helm test. The monitor upsert endpoint is idempotent — re-running it on the same URL updates the monitor configuration rather than creating a duplicate. Store the AliveMCP API key in the same secret store as your other deployment credentials (GitHub Environment secrets, Ansible Vault, Terraform variable input from CI), so it's never committed to version control and rotates consistently with your other credentials.

What's the difference between the Helm test hook and a Kubernetes readiness probe for MCP servers?

Kubernetes readiness probes run continuously throughout the pod's lifetime and determine whether the pod should receive traffic from a Service. Helm test hooks run only when helm test is invoked, which in practice means once per deploy, as the pipeline step after helm install or helm upgrade. For MCP servers, a readiness probe should check something lightweight and synchronous, typically an HTTP /health endpoint that returns 200 quickly. This probe determines routing decisions and is called every 10 seconds by default; it must not involve tool calls, upstream API calls, or database queries. A Helm test hook can be much more thorough because it runs once at deploy time and doesn't affect production routing: it can send a full initialize request, validate the protocol version, and optionally call tools/list to verify the tool registry loaded correctly. The division is: readiness probe = is this pod ready for traffic? Helm test hook = does this pod speak correct MCP? AliveMCP = is this endpoint reachable and correct right now, from the outside, from multiple regions?