Deploying MCP Servers with GitHub Actions

GitHub Actions can automate the full MCP server lifecycle: run the test suite on every pull request, build and push a Docker image to GHCR on merge, deploy the image to your infrastructure, probe the live endpoint to verify the MCP protocol actually works, then register the new endpoint with AliveMCP so uptime monitoring is automatically current after every deployment. This guide shows the concrete workflow YAML, the environment secrets pattern, matrix Node version testing, Kubernetes and ECS deployment alternatives, and the two-layer monitoring strategy that catches both deploy-time and post-deploy failures.

TL;DR

Use a three-job workflow: test runs on every pull request and catches regressions before merge; build runs on push to main and produces a tagged Docker image in GHCR; deploy depends on build, uses GitHub Environments for manual approval gating, and ends with a curl step that sends a live initialize request to the deployed endpoint. If the JSON-RPC response does not contain the expected protocolVersion, the workflow fails. By the time the probe runs the new version is already serving traffic, so a failed probe does not restore the previous version by itself; it is the signal to roll back to the previous image tag. A final step calls the AliveMCP API to register or update the check, so continuous monitoring reflects the new endpoint without any manual configuration.

Workflow structure for an MCP server

Most application deployment workflows run a single job that mixes testing, building, and shipping. That works for simple apps, but MCP servers have a hard requirement that separates them from typical web services: the deployed server must respond correctly to a specific JSON-RPC handshake — the initialize method — before it is considered healthy. A flat single-job pipeline makes it awkward to enforce that gate. The three-job pattern below keeps each responsibility distinct and makes the dependency chain explicit in the workflow graph.

The test job runs on every pull_request event, and again on the push to main that follows a merge. Its job is purely preventive: run npm ci to get a clean install, run the unit and integration test suite, and run npm run build to confirm the TypeScript compiles without errors. If any step fails and the check is marked as required in the branch protection rules for main, GitHub blocks the merge and the code never reaches main. This job does not need any secrets, does not produce any artifacts, and runs in complete isolation from production infrastructure. That isolation is intentional: a compromised PR has no credentials to exfiltrate.

The build job runs on push to main and needs the packages: write permission to push to GHCR. It builds a Docker image tagged with both the commit SHA and latest. Using the commit SHA as a tag is important: it creates an immutable reference that the deploy job can use, and it makes rollbacks trivial, because the image tag is the commit hash and you always know exactly which commit is running in production.

The deploy job depends on build via needs: build and runs with environment: production, which activates GitHub's deployment protection rules. It pulls the image by SHA, runs it, waits 15 seconds for the server to accept connections, then issues a live MCP protocol probe. The probe is not optional and is not a health check endpoint — it sends the actual initialize JSON-RPC request that any MCP client would send. If the server responds with a different protocol version, or times out, or returns an HTTP error, the workflow fails.

The table below shows how the deploy job pattern changes across common deployment targets.

Target	Deploy mechanism	Actions integration	Rollback method
VPS via SSH	`docker pull` + `docker run` on remote host	`appleboy/ssh-action`	SSH and run previous image tag
Kubernetes	`kubectl set image` + `kubectl rollout status`	Write kubeconfig from secret, run `kubectl`	`kubectl rollout undo`
AWS ECS	Update task definition, force new deployment	`aws-actions/amazon-ecs-deploy-task-definition`	Point the service at the previous task definition revision
Railway	Push to linked GitHub branch triggers Railway deploy	No extra step; Railway webhooks on `main`	Railway dashboard rollback
Fly.io	`flyctl deploy --image ghcr.io/...`	`superfly/flyctl-actions`	`flyctl deploy --image` with the previous image tag

Complete workflow file for VPS SSH deployment

The following file goes in .github/workflows/deploy.yml, relative to your repository root. It implements the three-job pattern described above. The secrets it references (VPS_HOST, VPS_USER, VPS_SSH_KEY, MCP_ENDPOINT, and ALIVEMCP_API_KEY) should be stored in the production GitHub Environment, not at the repository level; that difference matters and is explained in the next section.

name: Deploy MCP Server
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm test
      - run: npm run build

  build:
    needs: test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: |
            ghcr.io/${{ github.repository }}:${{ github.sha }}
            ghcr.io/${{ github.repository }}:latest

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.VPS_HOST }}
          username: ${{ secrets.VPS_USER }}
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            docker pull ghcr.io/${{ github.repository }}:${{ github.sha }}
            docker stop mcp-server || true
            docker rm mcp-server || true
            docker run -d --name mcp-server \
              --restart unless-stopped \
              -p 3000:3000 \
              -e NODE_ENV=production \
              ghcr.io/${{ github.repository }}:${{ github.sha }}

      - name: Verify MCP protocol endpoint
        run: |
          sleep 15
          curl -sf -X POST https://${{ secrets.MCP_ENDPOINT }}/ \
            -H 'Content-Type: application/json' \
            -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"ci-probe","version":"1.0"}}}' \
            | jq -e '.result.protocolVersion == "2024-11-05"' \
            || (echo "::error::MCP protocol verification failed"; exit 1)

      - name: Register deployment with AliveMCP
        run: |
          curl -sf -X POST https://alivemcp.com/api/v1/checks \
            -H 'Authorization: Bearer ${{ secrets.ALIVEMCP_API_KEY }}' \
            -H 'Content-Type: application/json' \
            -d '{
              "name": "MCP Server Production",
              "url": "https://${{ secrets.MCP_ENDPOINT }}/",
              "interval_seconds": 60,
              "check_type": "mcp_protocol"
            }'

A few implementation details are worth calling out. The if: github.event_name == 'push' guard on the build job prevents the build from running on pull request events: the test job still runs on PRs, but there is no reason to push a Docker image for an unmerged branch. Because deploy needs build, a skipped build also skips deploy. The docker stop mcp-server || true pattern suppresses the error that would occur if no container named mcp-server exists yet on a fresh server. The sleep 15 before the protocol probe is a pragmatic wait for the Node.js process inside the container to bind to port 3000 and start accepting requests; for servers with heavy initialization (loading large models, populating caches) you may need to increase this value or replace the sleep with an until curl -sf ...; do sleep 2; done retry loop.
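The retry loop mentioned above could look roughly like the step below. It is a sketch rather than a drop-in replacement: the cap of 30 attempts and the 2-second interval are arbitrary assumptions, and the request body is the same initialize call the verification step sends, with the empty capabilities object that the MCP schema requires.

```yaml
      - name: Wait for MCP endpoint (retry instead of a fixed sleep)
        run: |
          # Assumed limits: 30 attempts, 2 seconds apart (about one minute in total).
          attempts=0
          until curl -sf -o /dev/null --max-time 5 -X POST "https://${{ secrets.MCP_ENDPOINT }}/" \
              -H 'Content-Type: application/json' \
              -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"ci-probe","version":"1.0"}}}'
          do
            attempts=$((attempts + 1))
            if [ "$attempts" -ge 30 ]; then
              echo "::error::MCP endpoint did not become ready in time"
              exit 1
            fi
            sleep 2
          done
```

The loop only establishes that the endpoint answers successfully; the protocolVersion assertion with jq still belongs in the verification step that follows it.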

The echo "::error::MCP protocol verification failed" syntax uses GitHub Actions' workflow command format to emit a visible error annotation in the workflow run summary and in the Actions run log. Without that prefix, a failing exit code would still mark the step red, but the only annotation would be the generic "Process completed with exit code 1" message, which says nothing about what failed.

Environment secrets and protection rules

GitHub Actions has two places to store secrets: at the repository level (available to all workflows and all branches) and at the environment level (available only to jobs that declare environment: <name> and only after any protection rules pass). For MCP server deployments, you want the second kind for anything related to production infrastructure.

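Repository-level secrets are available to every workflow run that starts from a branch in the repository, including pull requests opened by anyone with write access (GitHub withholds them only from pull_request runs that originate in forks). That means a collaborator, or an attacker with a compromised collaborator account, can exfiltrate your VPS_SSH_KEY by pushing a branch whose workflow sends the key to an external host or prints it in an encoded form that slips past log masking. Environment-level secrets block that path: the production environment can be configured to require a reviewer's manual approval before the deploy job starts, and the secrets are never injected into a job that runs without that approval. Put VPS_SSH_KEY, MCP_ENDPOINT, ALIVEMCP_API_KEY, and any database connection strings in the production environment, not at the repo level.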

To configure protection rules, navigate to your repository's Settings > Environments > production. The relevant options are:

Required reviewers: adds up to six GitHub users or teams who must approve a deployment before the deploy job is unblocked. The job sits in a "waiting" state for up to 30 days. This is the right choice for teams where production deployments should require a second set of eyes even after automated tests pass.
Deployment branches and tags: a branch filter that restricts which branches may deploy to this environment. Setting this to main prevents someone from creating a hotfix/secret-exfil branch, pushing a malicious workflow, and deploying it to production. Set this filter; it costs nothing and prevents a class of supply-chain attack.
Wait timer: introduces a mandatory delay (up to 43,200 minutes) between when deployment is triggered and when it starts. Useful if you want a cooling-off period for catching bad deployments based on monitoring alerts before the next one lands.

For teams that do not need manual approval — for example, a solo developer who trusts the test suite — environment secrets still provide value because the branch filter blocks deployments from untrusted branches. You get the secret isolation benefit without the approval step friction.

One pattern that works well for MCP servers is the staging-first deployment: configure two GitHub Environments, staging and production. Because a job can reference only one environment, split the deployment into two jobs. A deploy-staging job deploys to staging and verifies the MCP protocol probe there. If the probe passes, a separate deploy-prod job (with needs: deploy-staging and environment: production) deploys the same image SHA to production. Because both environments use the same image tag, you know the artifact that passed staging verification is the same artifact running in production: no re-build, no drift.
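The staging-first pattern can be sketched as two jobs. This is only an outline under stated assumptions: the job names deploy-staging and deploy-prod follow the text above, each environment is assumed to define its own MCP_ENDPOINT secret, and the deploy steps are placeholders to be replaced with the SSH, kubectl, or ECS steps from this guide.

```yaml
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging            # assumed: no required reviewers
    steps:
      - name: Deploy image to staging
        # Placeholder: substitute the real deploy step for your target.
        run: echo "Deploying ghcr.io/${{ github.repository }}:${{ github.sha }} to staging"
      - name: Verify MCP protocol on staging
        run: |
          curl -sf -X POST "https://${{ secrets.MCP_ENDPOINT }}/" \
            -H 'Content-Type: application/json' \
            -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"ci-probe","version":"1.0"}}}' \
            | jq -e '.result.protocolVersion == "2024-11-05"'

  deploy-prod:
    needs: deploy-staging           # runs only if the staging probe passed
    runs-on: ubuntu-latest
    environment: production         # required reviewers gate this job
    steps:
      - name: Deploy the same image to production
        # Placeholder: same image SHA, production secrets.
        run: echo "Deploying ghcr.io/${{ github.repository }}:${{ github.sha }} to production"
```

Both jobs resolve secrets.MCP_ENDPOINT from their own environment, so the same probe command targets staging in the first job and would target production in the second.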

Matrix builds and TypeScript type safety

MCP servers written in TypeScript have a specific failure mode that matrix testing catches early: the @modelcontextprotocol/sdk package and the TypeScript compiler are updated frequently, and new versions occasionally drop support for older Node.js runtimes or change the shape of types that your tool definitions depend on. A build that compiles cleanly and passes its tests on Node 20 may still fail on Node 18 with a runtime error referencing an API the older runtime lacks, for example Array.prototype.toSorted (available from Node 20) or the global crypto object, which Node 18 exposes only behind the --experimental-global-webcrypto flag. Even the fetch global, present since Node 18, was marked experimental until Node 21.

Add a matrix strategy to the test job to catch these regressions before they reach production:

  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: ['18', '20', '22']
      fail-fast: false
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
      - run: npm ci
      - name: Type check
        run: npx tsc --noEmit
      - run: npm test
      - run: npm run build

The fail-fast: false option tells GitHub Actions to keep running all matrix variants even if one fails. Without it, the first failing Node version cancels the remaining jobs, and you lose information about which versions are broken. With it, a single run tells you "passes on 18 and 20, fails on 22" — enough information to triage the issue immediately.

The explicit npx tsc --noEmit step runs the TypeScript compiler in type-check-only mode before the test runner. This matters because many test setups (Vitest, which transpiles with esbuild; Jest with Babel or @swc/jest; ts-jest with diagnostics disabled or isolatedModules enabled) transpile TypeScript without running the type checker, meaning you can have a green test run and still have type errors that only surface when you do a proper tsc build. For MCP servers, type errors in tool definitions are especially dangerous: the TypeScript SDK builds the JSON Schema for tool parameters from the schema objects (typically Zod) that you register, and the handler's argument types are inferred from those schemas, so a type error there usually means the handler and the advertised schema have drifted apart, which surfaces at runtime as rejected or malformed tool calls that confuse the MCP client.

For projects that use strict: true in tsconfig.json (which you should), add a dedicated linting step as well:

      - name: Lint
        run: npx eslint src --ext .ts --max-warnings 0

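The --max-warnings 0 flag makes ESLint exit with a non-zero status if any warning is reported, which effectively promotes warnings to errors in CI and prevents the gradual accumulation of ignored warnings that eventually become real bugs. This is a low-cost gate that pays for itself the first time it catches a nullable dereference in a tool handler before it reaches production. If your project uses ESLint's flat config (eslint.config.js), the --ext flag may be rejected depending on the ESLint version; select the TypeScript files in the config file instead.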

Kubernetes and ECS deployment variants

Teams running MCP servers on Kubernetes or AWS ECS replace the appleboy/ssh-action step with platform-specific tooling, but the surrounding job structure — depends on build, uses the production environment, ends with a protocol probe — stays identical. The protocol verification step is infrastructure-agnostic and should appear in every variant.

For Kubernetes, the deploy job writes the kubeconfig from a secret, updates the deployment image, and waits for the rollout to complete before the protocol probe runs:

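      - name: Deploy to Kubernetes
        env:
          # Pass the secret through the environment so the shell does not
          # re-interpret quotes or $ characters inside the kubeconfig.
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}
        run: |
          export KUBECONFIG="$RUNNER_TEMP/kubeconfig"
          printf '%s' "$KUBECONFIG_DATA" > "$KUBECONFIG"
          chmod 600 "$KUBECONFIG"
          kubectl set image deployment/mcp-server \
            mcp-server=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            --namespace production
          kubectl rollout status deployment/mcp-server \
            --namespace production --timeout=5m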

The kubectl rollout status --timeout=5m call blocks until Kubernetes reports that the new pods are running and passing their readiness checks, or until the timeout expires. If the new pods crash-loop because the MCP server fails to start — missing environment variable, bad configuration, corrupt image — rollout status exits non-zero and the workflow fails before the protocol probe even runs. Kubernetes will automatically keep the old ReplicaSet running (the default rolling update strategy does not terminate old pods until new ones are ready), so users see no downtime.

For AWS ECS, the official aws-actions/amazon-ecs-deploy-task-definition action handles the task definition update and waits for service stability:

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Update ECS task definition with new image
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ecs-task-def.json
          container-name: mcp-server
          image: ghcr.io/${{ github.repository }}:${{ github.sha }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: mcp-server-service
          cluster: production
          wait-for-service-stability: true

The wait-for-service-stability: true option is the ECS equivalent of kubectl rollout status: it blocks the workflow until ECS confirms the desired count of healthy tasks is running. Like Kubernetes rolling updates, ECS keeps old tasks alive until new tasks pass their load balancer health checks, so users experience no interruption during the update. Load balancer health checks for HTTP targets issue plain GET requests and cannot send the initialize POST body, so point them at a simple GET health endpoint. That is fine for load balancer purposes, but keep the actual MCP protocol probe in CI, where you control the check logic.

Both variants should end with the same MCP protocol verification step from the VPS workflow. The protocol probe is independent of the deployment mechanism: it hits the public endpoint over HTTPS and checks the JSON-RPC response. It is the only step in the entire pipeline that tests the server from the perspective of an MCP client, which is exactly the perspective that matters for end-user reliability.

Post-deploy MCP protocol verification and AliveMCP integration

The protocol probe step in the GitHub Actions workflow catches one category of failure: the new version of the server does not work correctly immediately after deploy. But there is a second, larger category that CI cannot catch by definition: failures that happen after the deployment succeeds. These include infrastructure failures (the VPS host reboots, the Kubernetes node runs out of memory and evicts the pod), certificate expiry (the TLS certificate for your MCP endpoint expires silently at 2 AM), memory leaks that cause gradual performance degradation and eventual OOM crashes, and transient network conditions that make the endpoint unreachable from specific regions.

This is the distinction between CI verification and continuous uptime monitoring. They answer different questions. CI verification asks: "did this specific deployment succeed?" Uptime monitoring asks: "is the server reachable and protocol-compliant right now, and was it reachable every minute for the past week?" You need both.

The Register deployment with AliveMCP step in the workflow above calls the AliveMCP API to create or update a monitoring check for the deployed endpoint. The API call is idempotent: if a check with the same name already exists, it updates the URL and interval; if it does not exist, it creates it. This means the monitoring configuration is automatically correct after every deployment even if the endpoint URL changes — for example, when you deploy to a new subdomain or migrate to a different infrastructure provider.

      - name: Register deployment with AliveMCP
        run: |
          curl -sf -X POST https://alivemcp.com/api/v1/checks \
            -H 'Authorization: Bearer ${{ secrets.ALIVEMCP_API_KEY }}' \
            -H 'Content-Type: application/json' \
            -d '{
              "name": "MCP Server Production",
              "url": "https://${{ secrets.MCP_ENDPOINT }}/",
              "interval_seconds": 60,
              "check_type": "mcp_protocol"
            }'

Store ALIVEMCP_API_KEY in the production environment alongside your other production secrets. It does not need to go in the staging environment unless you also want AliveMCP to monitor your staging instance, which is a reasonable choice for teams who want to catch regressions on staging before they are promoted to production.

The check_type: mcp_protocol field tells AliveMCP to probe the endpoint using the same initialize handshake that the CI workflow uses, rather than a simple HTTP 200 check. This matters because an MCP server can return HTTP 200 from a generic health check endpoint while the actual JSON-RPC handler is broken — misconfigured routing, a crashed worker process that is not properly tracked by the process manager, or a middleware that intercepts requests and returns a maintenance page. An HTTP check passes; an MCP protocol check fails. You want the protocol check.

The two-layer monitoring strategy looks like this in practice: your README includes a GitHub Actions status badge (![CI](https://github.com/org/repo/actions/workflows/deploy.yml/badge.svg)) and an AliveMCP status badge. The CI badge answers "did the last deployment succeed?" The AliveMCP badge answers "is the server up right now?" They are different indicators, and a green CI badge with a red AliveMCP badge is a meaningful signal — it tells you the deployment itself worked but something went wrong afterward, which is information that focuses your investigation immediately on post-deploy infrastructure issues rather than code bugs.

AliveMCP probes from multiple regions so you also get information about geographic availability. An MCP server that is reachable from Europe but not from Asia due to a routing problem will not show up in a single-origin monitoring setup, but it will show up in AliveMCP's regional probe results. For MCP servers used by AI agents or integrated into products with global user bases, geographic probe coverage is not optional.

Frequently asked questions

How do I prevent a bad deployment from reaching production automatically?

Use GitHub Environments with required reviewers on the production environment. Navigate to Settings > Environments > production, add yourself and any teammates as required reviewers, and set the deployment branch filter to main. When a push to main triggers the workflow, the test and build jobs run automatically, but the deploy job pauses and sends a notification asking for approval before it starts. This gives you — or a teammate — a chance to review the build output, check the staging environment, and decide whether the deployment should proceed. Pair this with a staging environment that deploys automatically (no required reviewers) so you can verify behavior on staging before approving the production deployment. The same Docker image SHA deployed to staging is what goes to production after approval, so there is no risk of a diverging artifact.

How do I roll back an MCP server deployment in GitHub Actions?

Add a separate rollback.yml workflow triggered via workflow_dispatch with an input for the image tag to roll back to. The workflow_dispatch event type lets you trigger a workflow manually from the Actions tab in the GitHub UI, optionally with input parameters. Define a required input named image_tag and supply the SHA of the last known-good commit when you trigger the run; input defaults are static strings, so the workflow cannot default to "the previous commit" on its own. The rollback workflow re-runs the deploy job steps (SSH pull and run, or kubectl set image) with the specified tag rather than github.sha. For Kubernetes, kubectl rollout undo deployment/mcp-server --namespace production rolls back to the previous ReplicaSet without needing an explicit image tag, because Kubernetes keeps the rollout history. For VPS deployments, you need the explicit SHA, which is why storing images by commit SHA in GHCR rather than just latest is important: latest gets overwritten on every push, but SHA-tagged images remain addressable until you delete them.
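A minimal rollback.yml for the VPS variant might look like the sketch below. It assumes the same production environment secrets and container name as the deploy workflow, and it takes the tag to redeploy from the image_tag input. Treat it as a starting point, not a finished workflow.

```yaml
name: Roll back MCP Server
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: Commit SHA of the image to redeploy
        required: true
        type: string

jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: production   # same secrets and approval rules as a normal deploy
    steps:
      - name: Redeploy the selected image via SSH
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.VPS_HOST }}
          username: ${{ secrets.VPS_USER }}
          key: ${{ secrets.VPS_SSH_KEY }}
          script: |
            docker pull ghcr.io/${{ github.repository }}:${{ inputs.image_tag }}
            docker stop mcp-server || true
            docker rm mcp-server || true
            docker run -d --name mcp-server \
              --restart unless-stopped \
              -p 3000:3000 \
              -e NODE_ENV=production \
              ghcr.io/${{ github.repository }}:${{ inputs.image_tag }}
```

Because the input is interpolated directly into the remote script, only people you would trust with a production deploy should be able to trigger this workflow. Following the rollback with the same MCP protocol verification step used in the deploy job confirms that the restored version actually answers.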

Can I run MCP integration tests in GitHub Actions?

Yes, and you should. Start your MCP server using Docker Compose in the test job before running integration tests, then tear it down afterward. In the test job, add a step that runs docker compose up -d to start the server and any dependencies (database, Redis, whatever your server needs). Wait for the server to become healthy with a retry loop, then run your integration test suite using the MCP TypeScript SDK's client to make real tool calls against the running server. Finish with docker compose down to clean up. Integration tests catch an important category of bugs that unit tests miss: JSON-RPC serialization mismatches (where the server encodes a response in a way that the client cannot deserialize), transport-level issues (the server handles the first request correctly but drops the SSE connection under load), and tool execution errors that only surface when the tool runs against real data. GitHub-hosted Ubuntu runners have Docker and Docker Compose pre-installed, so no extra setup is required.
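The steps described above could be added to the test job along these lines. The sketch assumes a compose file at the repository root that publishes the server on localhost:3000, and a hypothetical npm script named test:integration; adjust both to your project.

```yaml
      - name: Start server and dependencies
        run: docker compose up -d

      - name: Wait for the server to accept connections
        run: |
          # Any HTTP response (even an error status) means the port is open;
          # curl fails only while the connection is refused.
          for i in $(seq 1 30); do
            if curl -s -o /dev/null http://localhost:3000/; then
              exit 0
            fi
            sleep 2
          done
          echo "::error::MCP server did not start"
          docker compose logs
          exit 1

      - name: Run integration tests
        run: npm run test:integration   # hypothetical script name

      - name: Tear down
        if: always()                    # clean up even when tests fail
        run: docker compose down
```

Dumping docker compose logs before failing makes startup errors visible in the run log, which is usually the first thing you need when the wait loop times out.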

How do I cache npm dependencies in GitHub Actions to speed up the test job?

The actions/setup-node@v4 action has built-in caching support. When you set cache: 'npm' in the with block, the action caches npm's download cache (the ~/.npm directory on Linux runners) between runs using a cache key derived from the hash of your package-lock.json file. If package-lock.json has not changed since the last run, the cached ~/.npm directory is restored before npm ci runs, and npm ci populates node_modules from the local cache rather than downloading packages from the registry. This typically reduces the install step from three to four minutes down to under one minute for a moderately sized MCP server project. The cache key is built from the runner platform, the package manager, and the lock file hash, not the Node version; because only the download cache is stored (not node_modules), matrix jobs on different Node versions can safely share one entry. For monorepos with multiple package-lock.json files, specify the cache-dependency-path option to point at the correct lock file for the package being tested.
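For the monorepo case, the configuration might look like this. The packages/mcp-server path is a hypothetical layout used only for illustration.

```yaml
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
          # Hypothetical layout: the server package lives in packages/mcp-server.
          cache-dependency-path: packages/mcp-server/package-lock.json
      - run: npm ci
        working-directory: packages/mcp-server
```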

What is the correct way to handle staging versus production deploys in the same workflow?

Use branch-based triggering combined with GitHub Environments to keep staging and production deploys in a single workflow file without conflating their behavior. Configure the workflow to trigger on push to main for staging and on push to release/** branches (or version tags) for production. In the deploy job, use a conditional expression to set the environment: environment: ${{ github.ref == 'refs/heads/main' && 'staging' || 'production' }}. The staging environment gets no required reviewers and deploys automatically. The production environment requires manual approval. Both environments have their own sets of secrets (staging has its own VPS host, its own database URL, its own MCP endpoint), so secrets are never shared between environments and there is no risk of a staging deployment accidentally hitting production infrastructure. The AliveMCP registration step at the end of the deploy job uses the appropriate endpoint URL from each environment's secrets, so both staging and production are independently monitored.
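As a rough sketch, the trigger and environment selection could be wired up as shown below. It assumes production releases are cut from branches named release/...; the test and build jobs are omitted, and the single step is a placeholder for the real deploy steps.

```yaml
on:
  push:
    branches:
      - main            # deploys to staging
      - 'release/**'    # deploys to production

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Any ref other than main resolves to production, so keep the trigger
    # list above limited to the branches you intend to deploy from.
    environment: ${{ github.ref == 'refs/heads/main' && 'staging' || 'production' }}
    steps:
      - name: Show deployment target
        # Placeholder: replace with the deploy, probe, and registration steps.
        run: echo "Deploying ${{ github.sha }} from ${{ github.ref_name }}"
```

Setting the deployment branch filter on each environment (main for staging, release/** for production) adds a second guard in case the expression or the trigger list is edited carelessly later.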