Guide · MCP Tool Implementation

MCP server code execution tools

Code execution is the most powerful — and most dangerous — MCP tool category. When an LLM can run arbitrary code, it can process data, run calculations, test hypotheses, and automate tasks in ways no static tool can match. It can also execute rm -rf /, open reverse shells, and exhaust your server's CPU and memory. This guide covers how to build a safe execute_code tool using Docker container isolation, resource limits, network blocking, timeout enforcement, and output capture.

TL;DR

Never execute LLM-generated code with eval(), child_process.exec(), or a bare subprocess.run() on your MCP server host. Always run untrusted code inside a Docker container with --network none, --memory 256m, --cpus 0.5, --read-only, and --security-opt no-new-privileges. Set a hard wall-clock timeout on the docker run call, and kill the container itself when that timeout fires. Capture stdout and stderr separately, truncate large outputs, and never return raw process output without length limits.

Why eval() and child_process are not enough

The naive approach, passing code to eval() or child_process.exec(), runs with the same privileges as your MCP server process. In production, that means filesystem access to secrets, network access to internal services, and the ability to crash the server process. A restricted Node.js VM (new vm.Script()) does not fix this: the Node.js documentation states that the vm module is not a security mechanism, and a skilled attacker can escape through the prototype chain (for example, by following a host object's constructor property back to the host's Function constructor) or by exploiting native module boundaries.
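As a minimal sketch of how a vm escape can work, the snippet below runs a one-line string inside a fresh vm context. The string and the variable names are illustrative only. The context object is created by host code, so its constructor chain leads back to the host's Function constructor, which can then return the real process object.

```typescript
import { runInNewContext } from 'node:vm';

// Illustrative "untrusted" input: walk from the context object to the
// host realm's Function constructor and ask it for the host's `process`.
const untrusted = 'this.constructor.constructor("return process")()';

// The empty object is created on the host side, so its prototype chain
// belongs to the host realm, not to the new context.
const leaked = runInNewContext(untrusted, {}) as typeof process;

// If the sandbox held, `leaked` could not be the host's process object.
console.log('same object as host process:', leaked === process);
console.log('can call process.exit:', typeof leaked.exit === 'function');
```

Once code holds the host process object it can read environment variables, load modules, or terminate the server, which is why the isolation boundary has to sit outside the Node.js process.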

| Isolation level | Escapes sandbox? | Network access? | Filesystem access? |
| --- | --- | --- | --- |
| Node.js eval() | Yes (full process) | Yes | Yes |
| Node.js vm.Script | Yes (prototype-chain escape) | Yes (via require) | Yes (via require) |
| Worker Thread with allowedModules: [] | Partial | No (limited) | Limited |
| Docker container (default) | No (container boundary) | Yes | Container only |
| Docker + --network none + --read-only | No | No | No |
| gVisor / Firecracker microVM | No (kernel boundary) | Configurable | Configurable |

For production deployments, Docker with restrictive flags is the practical minimum. For multi-tenant or high-security scenarios, consider gVisor (runsc) or Firecracker microVMs that provide kernel-level isolation.

Building the execute_code tool

The tool writes code to a temp file, passes it to a sandboxed Docker container, captures output, and cleans up. The container is ephemeral — created and destroyed per execution:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';
import { execFile } from 'child_process';
import { promisify } from 'util';
import { randomUUID } from 'crypto';
import fs from 'fs/promises';
import path from 'path';
import os from 'os';

const execFileAsync = promisify(execFile);
const server = new McpServer({ name: 'code-runner', version: '1.0.0' });

const RUNTIME_IMAGES: Record<string, string> = {
  python:     'python:3.12-slim',
  javascript: 'node:22-alpine',
  typescript: 'tsx:latest',   // placeholder name: build a custom image with ts-node/tsx installed
  bash:       'bash:5',       // official bash image (Alpine-based)
};

const EXECUTION_TIMEOUT_MS = 15_000;   // 15 seconds wall-clock
const MAX_OUTPUT_CHARS     = 20_000;   // truncate stdout+stderr beyond this

server.tool(
  'execute_code',
  'Execute code in an isolated sandbox and return stdout/stderr output',
  {
    language: z.enum(['python', 'javascript', 'typescript', 'bash']),
    code: z.string().max(50_000).describe('Code to execute'),
    stdin_input: z.string().max(10_000).default('').describe('Optional stdin to pass to the program'),
  },
  async ({ language, code, stdin_input }) => {
    const image = RUNTIME_IMAGES[language];
    const tmpDir = await fs.mkdtemp(path.join(os.tmpdir(), 'mcp-exec-'));
    const ext = { python: 'py', javascript: 'js', typescript: 'ts', bash: 'sh' }[language];
    const codeFile = path.join(tmpDir, `code.${ext}`);
    // Unique name so the container can be killed explicitly on timeout
    const containerName = `mcp-exec-${randomUUID()}`;

    try {
      await fs.writeFile(codeFile, code, 'utf8');

      const entrypoint = {
        python:     ['python', `/sandbox/code.py`],
        javascript: ['node', `/sandbox/code.js`],
        typescript: ['tsx', `/sandbox/code.ts`],
        bash:       ['bash', `/sandbox/code.sh`],
      }[language];

      const dockerArgs = [
        'run', '--rm',
        '--name', containerName,
        '--network', 'none',           // no network access
        '--memory', '256m',             // 256 MB RAM limit
        '--memory-swap', '256m',        // disable swap (same as RAM limit)
        '--cpus', '0.5',               // half a CPU core
        '--read-only',                  // read-only root filesystem
        '--security-opt', 'no-new-privileges',  // block privilege escalation
        '--tmpfs', '/tmp:size=64m',    // writable temp space (64 MB)
        '-v', `${tmpDir}:/sandbox:ro`, // mount code as read-only
        '-i',                           // keep stdin open
        image,
        ...entrypoint,
      ];

      // The async execFile has no `input` option (only the sync variants do).
      // promisify(execFile) attaches the ChildProcess as `.child`, so stdin is
      // written through that pipe instead.
      const run = execFileAsync('docker', dockerArgs, {
        timeout: EXECUTION_TIMEOUT_MS,
        maxBuffer: 1024 * 1024,        // 1 MB limit per stream (stdout, stderr)
      });
      run.child.stdin?.on('error', () => {});   // ignore EPIPE if the program exits without reading stdin
      run.child.stdin?.end(stdin_input);
      const { stdout, stderr } = await run;

      const combined = [
        stdout ? `STDOUT:\n${stdout}` : '',
        stderr ? `STDERR:\n${stderr}` : '',
      ].filter(Boolean).join('\n\n') || '(no output)';

      return {
        content: [{
          type: 'text',
          text: combined.length > MAX_OUTPUT_CHARS
            ? combined.slice(0, MAX_OUTPUT_CHARS) + `\n\n[truncated, ${combined.length} chars total]`
            : combined,
        }]
      };
    } catch (e) {
      const err = e as NodeJS.ErrnoException & { killed?: boolean; stdout?: string; stderr?: string };
      const overflow = err.code === 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER';
      if (err.killed || overflow || err.code === 'ETIMEDOUT') {
        // Node only killed the docker CLI; the container may still be running,
        // so stop it by name. Errors are ignored in case it already exited.
        await execFileAsync('docker', ['kill', containerName]).catch(() => {});
        const reason = overflow
          ? 'Execution aborted: output exceeded the 1 MB buffer limit'
          : `Execution timed out after ${EXECUTION_TIMEOUT_MS / 1000}s`;
        return { isError: true, content: [{ type: 'text', text: reason }] };
      }
      return {
        isError: true,
        content: [{ type: 'text', text: [
          `Execution failed (exit code: ${err.code ?? 'unknown'})`,
          err.stderr ? `STDERR:\n${String(err.stderr).slice(0, 5_000)}` : '',
        ].filter(Boolean).join('\n\n') }]
      };
    } finally {
      // Remove the temp directory whether the run succeeded or failed
      await fs.rm(tmpDir, { recursive: true, force: true });
    }
  }
);
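The handler above truncates the combined string, so a very long stdout can push stderr out of the result entirely. One possible alternative, sketched here with hypothetical helper names (clip, formatOutput) and an assumed per-stream limit, is to cap each stream on its own before joining them:

```typescript
interface CapturedOutput {
  stdout: string;
  stderr: string;
}

// Hypothetical helper: label one stream and cap it at `limit` characters.
function clip(label: string, text: string, limit: number): string {
  if (!text) return '';
  if (text.length <= limit) return `${label}:\n${text}`;
  return `${label}:\n${text.slice(0, limit)}\n[${label} truncated, ${text.length} chars total]`;
}

// Hypothetical helper: each stream gets its own budget, so a noisy stdout
// cannot hide the error message on stderr.
function formatOutput(out: CapturedOutput, perStreamLimit = 10_000): string {
  const parts = [
    clip('STDOUT', out.stdout, perStreamLimit),
    clip('STDERR', out.stderr, perStreamLimit),
  ].filter(Boolean);
  return parts.join('\n\n') || '(no output)';
}

// Example: 25,000 characters of stdout are cut to 100, stderr survives intact.
console.log(formatOutput({ stdout: 'x'.repeat(25_000), stderr: 'Traceback: boom' }, 100));
```

The per-stream limit of 10,000 characters is an arbitrary choice that keeps the total near the 20,000-character cap used above; tune it to the context budget of the client model.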

Container resource limits explained

| Flag | What it prevents | Recommended value |
| --- | --- | --- |
| --network none | Outbound HTTP, lateral movement to internal services | Always set for untrusted code |
| --memory 256m --memory-swap 256m | Memory exhaustion that triggers the host's OOM killer | 64–512 MB depending on workload |
| --cpus 0.5 | CPU saturation from busy loops or runaway computation | 0.25–1.0 CPU |
| --read-only | Persistent filesystem writes in the container layer | Always set; add --tmpfs for temp writes |
| --security-opt no-new-privileges | Privilege escalation via setuid binaries | Always set |
| --pids-limit 64 | Fork bombs that spawn unlimited child processes | 32–128 PIDs |
| --ulimit nofile=256 | File descriptor exhaustion | 64–256 open files |

Add --pids-limit 64 and --ulimit nofile=256 to the dockerArgs array for defense in depth against fork bombs and file descriptor exhaustion attacks.
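One way to keep these flags in a single place is a small builder function. The sketch below is an assumption about how you might structure it, not part of any SDK: SandboxLimits, DEFAULT_LIMITS, and buildDockerArgs are hypothetical names, and the optional runtime field only makes sense on hosts where an alternative runtime such as gVisor's runsc is installed and registered with Docker.

```typescript
interface SandboxLimits {
  memoryMb: number;
  cpus: number;
  pidsLimit: number;
  openFiles: number;
  runtime?: string; // e.g. 'runsc', only if that runtime is registered with Docker
}

const DEFAULT_LIMITS: SandboxLimits = { memoryMb: 256, cpus: 0.5, pidsLimit: 64, openFiles: 256 };

// Hypothetical helper: builds the argument list for `docker run`.
function buildDockerArgs(
  hostDir: string,
  image: string,
  command: string[],
  limits: SandboxLimits = DEFAULT_LIMITS,
): string[] {
  return [
    'run', '--rm', '-i',
    '--network', 'none',
    '--memory', `${limits.memoryMb}m`,
    '--memory-swap', `${limits.memoryMb}m`,   // equal to --memory, so no swap
    '--cpus', String(limits.cpus),
    '--pids-limit', String(limits.pidsLimit), // caps processes and threads
    '--ulimit', `nofile=${limits.openFiles}`, // caps open file descriptors
    '--read-only',
    '--security-opt', 'no-new-privileges',
    '--tmpfs', '/tmp:size=64m',
    ...(limits.runtime ? ['--runtime', limits.runtime] : []),
    '-v', `${hostDir}:/sandbox:ro`,
    image,
    ...command,
  ];
}

const args = buildDockerArgs('/tmp/mcp-exec-abc', 'python:3.12-slim', ['python', '/sandbox/code.py']);
console.log(args.join(' '));
```

Passing the result to execFile (rather than building a shell string) keeps user-controlled values such as the temp directory path out of shell parsing.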

Pre-pulling images to avoid cold-start latency

The first execution of a language pulls the Docker image — potentially 50–200 MB of download that adds 30+ seconds to the first tool call. Pre-pull all runtime images at server startup:

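async function prePullImages(): Promise<void> {
  for (const [lang, image] of Object.entries(RUNTIME_IMAGES)) {
    try {
      // `docker image inspect` exits non-zero when the image is not present locally
      await execFileAsync('docker', ['image', 'inspect', image], { timeout: 5_000 });
      // Log to stderr so stdout stays clean if the server uses the stdio transport
      console.error(`[executor] ${lang} image present: ${image}`);
      continue;
    } catch {
      console.error(`[executor] pulling ${lang} image: ${image}`);
    }
    try {
      await execFileAsync('docker', ['pull', image], { timeout: 120_000 });
    } catch (e) {
      // One failed pull should not block startup or the other languages
      console.error(`[executor] failed to pull ${image}: ${(e as Error).message}`);
    }
  }
}

// Call at startup, before registering the server transport
await prePullImages();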

In Kubernetes, use an init container or DaemonSet to warm images on every node. In a PM2 setup, add a pre-start script to ecosystem.config.js.

Output from long-running computations

Some computations produce output incrementally, such as a data processing script that prints progress every few seconds. The execFile pattern above buffers all output and returns it at completion. To surface output earlier, send MCP progress notifications while the process runs, or structure the tool to accept a time-budget parameter and return partial results on timeout.

// Partial-result pattern: run up to budget_seconds, return whatever completed
server.tool(
  'execute_code_partial',
  'Run code with a time budget; returns partial output on timeout',
  {
    language: z.enum(['python', 'javascript']),
    code: z.string().max(50_000),
    budget_seconds: z.number().min(1).max(30).default(10),
  },
  async ({ language, code, budget_seconds }) => {
    // ... same Docker setup as above ...
    // On ETIMEDOUT, return whatever stdout/stderr was captured before timeout
    // (requires using spawn() instead of execFile() to capture incremental output)
    return { content: [{ type: 'text', text: '...' }] };
  }
);
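The stub above leaves out the spawn() part. A minimal sketch of that piece follows, assuming a hypothetical helper named runWithBudget that takes any command, collects output as it arrives, and resolves with whatever was captured when the budget expires. The maxChars cap and the field names are illustrative choices.

```typescript
import { spawn } from 'node:child_process';

interface BudgetResult {
  stdout: string;
  stderr: string;
  timedOut: boolean;
  exitCode: number | null; // null when the process was killed by a signal
}

// Hypothetical helper: run a command with a wall-clock budget and keep
// whatever output arrived before the budget ran out.
function runWithBudget(cmd: string, args: string[], budgetMs: number, maxChars = 20_000): Promise<BudgetResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    let stdout = '';
    let stderr = '';
    let timedOut = false;

    child.stdin.end(); // no stdin in this sketch
    child.stdout.setEncoding('utf8');
    child.stderr.setEncoding('utf8');
    // Stop accumulating once the cap is reached so memory stays bounded.
    child.stdout.on('data', (chunk: string) => { if (stdout.length < maxChars) stdout += chunk; });
    child.stderr.on('data', (chunk: string) => { if (stderr.length < maxChars) stderr += chunk; });

    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGKILL');
    }, budgetMs);

    child.on('error', (err) => { clearTimeout(timer); reject(err); });
    child.on('close', (exitCode) => {
      clearTimeout(timer);
      resolve({ stdout: stdout.slice(0, maxChars), stderr: stderr.slice(0, maxChars), timedOut, exitCode });
    });
  });
}

// Example: the child prints one line, then would print another after 10 s.
const script = 'console.log("step 1"); setTimeout(() => console.log("step 2"), 10000);';
runWithBudget(process.execPath, ['-e', script], 1500).then((r) => {
  console.log('timed out:', r.timedOut, 'partial stdout:', JSON.stringify(r.stdout));
});
```

When the command is docker run, killing the child only stops the Docker CLI. The container should also be stopped by name (for example with docker kill) in the timeout branch, otherwise it can keep running after the tool has returned.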

Monitoring code-execution MCP servers

Code execution servers fail in ways that differ from typical MCP tool failures. A Docker daemon crash silently takes down all execution capability: the MCP transport still responds normally to initialize and tools/list, but every execute_code call fails with "Cannot connect to the Docker daemon." A disk-full condition prevents temp file creation. An image pull failure makes one language unavailable while the others keep working.

Add a canary execution check to your health check endpoint: run a trivial code snippet (print("ok")) and verify output contains the expected string. This end-to-end check catches Docker daemon failures, image issues, and resource limits that a transport-only check misses. AliveMCP probes your full MCP endpoint every 60 seconds, catching protocol-level and handler-level failures before users encounter broken code execution in their AI workflows.
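A canary check of this kind might look like the sketch below. It is deliberately decoupled from Docker: the run parameter stands in for whatever function your server uses to execute code in its sandbox, and Runner, CanaryResult, and canaryCheck are hypothetical names introduced for illustration.

```typescript
// Stand-in for the server's own sandbox execution function.
type Runner = (language: string, code: string) => Promise<string>;

interface CanaryResult {
  healthy: boolean;
  detail: string;
  latencyMs: number;
}

// Hypothetical health probe: run a trivial snippet end to end and compare
// the output, with its own deadline so a hung daemon cannot hang the probe.
async function canaryCheck(run: Runner, timeoutMs = 10_000): Promise<CanaryResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`canary exceeded ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    const output = await Promise.race([run('python', 'print("ok")'), deadline]);
    const healthy = output.trim() === 'ok';
    const detail = healthy ? 'ok' : `unexpected output: ${output.slice(0, 200)}`;
    return { healthy, detail, latencyMs: Date.now() - started };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { healthy: false, detail, latencyMs: Date.now() - started };
  } finally {
    clearTimeout(timer);
  }
}

// Example with a fake runner; a real one would call into the Docker sandbox.
canaryCheck(async () => 'ok\n').then((r) => console.log(r.healthy, r.detail));
```

Wiring the result into a health endpoint lets an external probe distinguish "transport is up" from "code can actually run". Running the canary on a schedule and caching the last result avoids starting a container on every health request.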
