Guide · Multi-modal & Media Integration

MCP Server FFmpeg — Audio Transcription, Video Processing, and Media Tools

Audio and video processing are increasingly common in MCP servers — transcribing meeting recordings, extracting frames from training videos, converting audio formats for downstream APIs, and analyzing media metadata. This guide covers integrating FFmpeg into a TypeScript MCP server using fluent-ffmpeg, building audio transcription tools with the OpenAI Whisper API, implementing video frame extraction, managing temporary files safely, and wiring a /health/media endpoint so AliveMCP detects when FFmpeg is missing or misconfigured before callers encounter silent failures.

TL;DR

Use fluent-ffmpeg (a Node.js wrapper around the FFmpeg binary) rather than spawning ffmpeg directly: the wrapper handles argument quoting, stream piping, and error parsing. Give each request its own working directory under the OS temp directory (/tmp on Linux) and delete it in a finally block; skipped cleanup is a common source of disk exhaustion in media MCP servers. For transcription, convert audio to a compressed format such as mp3 or flac before calling Whisper, since uncompressed WAV recordings are often 10× or more the size of a speech-quality MP3. Set a hard timeout on all FFmpeg operations (60 seconds for audio, 120 for video) and call command.kill('SIGKILL') when it fires to prevent runaway processes. Wire /health/media to run ffprobe on a test file rather than just checking that the process is alive.

FFmpeg setup and installation

import ffmpeg from 'fluent-ffmpeg';
import ffmpegStatic from 'ffmpeg-static';
import ffprobeStatic from '@ffprobe-installer/ffprobe';
import fs from 'node:fs/promises';
import path from 'node:path';
import os from 'node:os';
import crypto from 'node:crypto';

// Point fluent-ffmpeg to the static binaries (no system FFmpeg required)
ffmpeg.setFfmpegPath(ffmpegStatic!);
ffmpeg.setFfprobePath(ffprobeStatic.path);

// Create a temporary working directory for media operations
export async function makeTempDir(): Promise<string> {
  const tmpDir = path.join(os.tmpdir(), `mcp-media-${crypto.randomBytes(8).toString('hex')}`);
  await fs.mkdir(tmpDir, { recursive: true });
  return tmpDir;
}

// Always call this in a finally block
export async function cleanupTempDir(dir: string): Promise<void> {
  await fs.rm(dir, { recursive: true, force: true });
}

// Promisify an FFmpeg command with a hard timeout.
// The command must already have its input and output configured; this starts it.
export function runFfmpeg(
  command: ReturnType<typeof ffmpeg>,
  timeoutMs = 60_000
): Promise<void> {
  return new Promise((resolve, reject) => {
    const timeout = setTimeout(() => {
      command.kill('SIGKILL');
      reject(new Error(`FFmpeg timed out after ${timeoutMs}ms`));
    }, timeoutMs);

    command
      .on('end', () => { clearTimeout(timeout); resolve(); })
      .on('error', (err) => { clearTimeout(timeout); reject(err); })
      .run(); // fluent-ffmpeg does not spawn the process until run() is called
  });
}

The ffmpeg-static and @ffprobe-installer/ffprobe packages bundle platform-specific FFmpeg binaries as npm packages — you don't need FFmpeg installed on the host or in the Docker image. This simplifies deployment significantly. The downside is bundle size (~50 MB for both packages); if that's a concern, install FFmpeg in the Docker base image and point to /usr/bin/ffmpeg instead.
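
The choice between the bundled binary and a system install can also be made at startup. The following is a minimal sketch, not part of the original setup: pickExecutable is a hypothetical helper that returns the first candidate path that exists and is marked executable, using only Node's built-in fs module.

```typescript
import { constants } from 'node:fs';
import { access } from 'node:fs/promises';

// Hypothetical helper: return the first candidate path that exists and is
// executable by the current user, or null if none qualifies.
export async function pickExecutable(
  candidates: Array<string | null | undefined>
): Promise<string | null> {
  for (const candidate of candidates) {
    if (!candidate) continue;
    try {
      await access(candidate, constants.X_OK);
      return candidate;
    } catch {
      // Missing or not executable: fall through to the next candidate.
    }
  }
  return null;
}
```

A call such as pickExecutable([ffmpegStatic, '/usr/bin/ffmpeg']) could feed ffmpeg.setFfmpegPath(). Note that an execute-permission check does not catch a binary built for the wrong architecture, so a probe that actually runs the binary is still needed.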

Audio format conversion

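import fs from 'node:fs/promises';
import path from 'node:path';
import ffmpeg from 'fluent-ffmpeg';
import { z } from 'zod';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';
import { makeTempDir, cleanupTempDir, runFfmpeg } from './media.js';

// `server` is assumed to be an McpServer instance created elsewhere in the app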

const MAX_AUDIO_BYTES = 200 * 1024 * 1024; // 200 MB

const AUDIO_INPUT_FORMATS = ['audio/mpeg', 'audio/wav', 'audio/ogg', 'audio/aac', 'audio/flac', 'audio/mp4', 'video/mp4'] as const;
const AUDIO_OUTPUT_FORMATS = ['mp3', 'flac', 'wav', 'ogg', 'aac'] as const;

server.tool(
  'convert_audio',
  {
    audio_base64: z.string().min(1),
    input_format: z.enum(AUDIO_INPUT_FORMATS),
    output_format: z.enum(AUDIO_OUTPUT_FORMATS).default('mp3'),
    sample_rate: z.number().int().optional().describe('Output sample rate in Hz, e.g. 16000 for Whisper'),
    channels: z.number().int().min(1).max(2).optional().describe('1=mono, 2=stereo'),
    bitrate_kbps: z.number().int().min(32).max(320).default(128)
  },
  async ({ audio_base64, input_format, output_format, sample_rate, channels, bitrate_kbps }) => {
    const inputBuffer = Buffer.from(audio_base64, 'base64');
    if (inputBuffer.length > MAX_AUDIO_BYTES) {
      throw new McpError(ErrorCode.InvalidParams, `Audio too large: ${(inputBuffer.length / 1e6).toFixed(1)} MB`);
    }

    // Determine input file extension from MIME type
    const extMap: Record<string, string> = {
      'audio/mpeg': 'mp3', 'audio/wav': 'wav', 'audio/ogg': 'ogg',
      'audio/aac': 'aac', 'audio/flac': 'flac', 'audio/mp4': 'm4a', 'video/mp4': 'mp4'
    };
    const inputExt = extMap[input_format] ?? 'audio';

    const tmpDir = await makeTempDir();
    try {
      const inputPath = path.join(tmpDir, `input.${inputExt}`);
      const outputPath = path.join(tmpDir, `output.${output_format}`);
      await fs.writeFile(inputPath, inputBuffer);

      const cmd = ffmpeg(inputPath).output(outputPath);

      if (sample_rate) cmd.audioFrequency(sample_rate);
      if (channels) cmd.audioChannels(channels);
      if (output_format === 'mp3') cmd.audioBitrate(bitrate_kbps);

      await runFfmpeg(cmd);

      const outputBuffer = await fs.readFile(outputPath);
      const stat = await fs.stat(outputPath);

      return {
        content: [{
          type: 'text',
          text: JSON.stringify({
            output_format,
            size_bytes: stat.size,
            compression_ratio: (inputBuffer.length / outputBuffer.length).toFixed(2),
            audio_base64: outputBuffer.toString('base64')
          })
        }]
      };
    } finally {
      await cleanupTempDir(tmpDir);
    }
  }
);

Audio transcription with Whisper

The OpenAI Whisper API accepts audio files up to 25 MB in common formats. A practical workflow is: (1) convert the input to mono 16 kHz MP3 with FFmpeg, which reduces file size dramatically, then (2) send the result to Whisper. A 60-minute CD-quality WAV recording (about 635 MB at 44.1 kHz, 16-bit stereo) converts to a roughly 29 MB MP3 at 64 kbps. That is still over the 25 MB limit, so long recordings need to be split into chunks.
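
The size arithmetic behind that limit can be checked with a few lines. This is an illustrative sketch: estimateCbrBytes and maxSecondsUnderLimit are hypothetical helper names, and the estimate assumes constant-bitrate audio and ignores container overhead such as ID3 tags, so leave some margin in practice.

```typescript
const WHISPER_LIMIT_BYTES = 25 * 1024 * 1024; // 26,214,400 bytes

// Bytes for a constant-bitrate stream: kbps * 1000 bits per second / 8.
export function estimateCbrBytes(durationSeconds: number, bitrateKbps: number): number {
  return Math.ceil((durationSeconds * bitrateKbps * 1000) / 8);
}

// Longest whole-second duration that stays at or under a byte limit.
export function maxSecondsUnderLimit(limitBytes: number, bitrateKbps: number): number {
  return Math.floor((limitBytes * 8) / (bitrateKbps * 1000));
}

const oneHourAt64k = estimateCbrBytes(3600, 64);                    // 28,800,000 bytes
const oneHourFits = oneHourAt64k <= WHISPER_LIMIT_BYTES;            // false
const maxSeconds = maxSecondsUnderLimit(WHISPER_LIMIT_BYTES, 64);   // 3276 (about 54 minutes)
```

At 64 kbps, anything much longer than about 54 minutes exceeds the limit, which is why an hour-long recording needs chunking even after compression.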

import OpenAI from 'openai';
import { createReadStream } from 'node:fs';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const WHISPER_MAX_BYTES = 25 * 1024 * 1024; // 25 MB Whisper API limit

server.tool(
  'transcribe_audio',
  {
    audio_base64: z.string().min(1),
    input_format: z.enum(AUDIO_INPUT_FORMATS),
    language: z.string().length(2).optional().describe('ISO 639-1 language code, e.g. "en". Auto-detected if omitted.'),
    response_format: z.enum(['text', 'json', 'verbose_json', 'srt', 'vtt']).default('json'),
    temperature: z.number().min(0).max(1).default(0)
  },
  async ({ audio_base64, input_format, language, response_format, temperature }) => {
    const inputBuffer = Buffer.from(audio_base64, 'base64');
    if (inputBuffer.length > MAX_AUDIO_BYTES) {
      throw new McpError(ErrorCode.InvalidParams, `Audio too large: ${(inputBuffer.length / 1e6).toFixed(1)} MB`);
    }

    const extMap: Record<string, string> = {
      'audio/mpeg': 'mp3', 'audio/wav': 'wav', 'audio/ogg': 'ogg',
      'audio/aac': 'aac', 'audio/flac': 'flac', 'audio/mp4': 'm4a', 'video/mp4': 'mp4'
    };
    const inputExt = extMap[input_format] ?? 'audio';

    const tmpDir = await makeTempDir();
    try {
      const inputPath = path.join(tmpDir, `input.${inputExt}`);
      const whisperPath = path.join(tmpDir, 'whisper.mp3');
      await fs.writeFile(inputPath, inputBuffer);

      // Convert to mono 16 kHz MP3 at 64 kbps before upload.
      // Versus CD-quality WAV (about 1,411 kbps) this is roughly 22× smaller.
      await runFfmpeg(
        ffmpeg(inputPath)
          .output(whisperPath)
          .audioChannels(1)       // mono
          .audioFrequency(16000)  // 16 kHz — Whisper's native sample rate
          .audioBitrate(64)       // 64 kbps sufficient for speech
      );

      const whisperStat = await fs.stat(whisperPath);
      if (whisperStat.size > WHISPER_MAX_BYTES) {
        throw new McpError(
          ErrorCode.InvalidParams,
          `Compressed audio is ${(whisperStat.size / 1e6).toFixed(1)} MB — exceeds Whisper's 25 MB limit. ` +
          `Split the audio into shorter segments.`
        );
      }

      const transcription = await openai.audio.transcriptions.create({
        // The OpenAI Node SDK accepts an fs.ReadStream directly; no cast is needed
        file: createReadStream(whisperPath),
        model: 'whisper-1',
        language,
        response_format,
        temperature
      });

      return {
        content: [{
          type: 'text',
          text: typeof transcription === 'string'
            ? transcription
            : JSON.stringify(transcription, null, 2)
        }]
      };
    } finally {
      await cleanupTempDir(tmpDir);
    }
  }
);

Video frame extraction

server.tool(
  'extract_video_frames',
  {
    video_base64: z.string().min(1),
    timestamps_seconds: z.array(z.number().min(0)).max(20).describe('List of timestamps to extract frames at'),
    output_format: z.enum(['jpeg', 'png']).default('jpeg'),
    width: z.number().int().min(1).max(3840).optional().describe('Output frame width (maintains aspect ratio if height omitted)')
  },
  async ({ video_base64, timestamps_seconds, output_format, width }) => {
    const inputBuffer = Buffer.from(video_base64, 'base64');
    if (inputBuffer.length > 500 * 1024 * 1024) { // 500 MB video limit
      throw new McpError(ErrorCode.InvalidParams, 'Video too large (max 500 MB)');
    }

    const tmpDir = await makeTempDir();
    try {
      const inputPath = path.join(tmpDir, 'input.mp4');
      await fs.writeFile(inputPath, inputBuffer);

      const frames: Array<{ timestamp_seconds: number; image_base64: string; mimeType: string }> = [];

      for (const ts of timestamps_seconds) {
        const framePath = path.join(tmpDir, `frame_${ts}.${output_format}`);
        const cmd = ffmpeg(inputPath)
          .seekInput(ts)
          .frames(1)
          .output(framePath);

        if (width) cmd.size(`${width}x?`);

        await runFfmpeg(cmd, 30_000);

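        // A timestamp past the end of the video yields no output file.
        // Such frames are skipped here, so the result may contain fewer frames than requested.
        const frameBuffer = await fs.readFile(framePath).catch(() => null);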
        if (frameBuffer) {
          frames.push({
            timestamp_seconds: ts,
            image_base64: frameBuffer.toString('base64'),
            mimeType: `image/${output_format}`
          });
        }
      }

      // Return frames as a mix of image content blocks and metadata
      const content = frames.flatMap(f => [
        { type: 'text' as const, text: `Frame at ${f.timestamp_seconds}s:` },
        { type: 'image' as const, data: f.image_base64, mimeType: f.mimeType as 'image/jpeg' | 'image/png' }
      ]);

      return { content };
    } finally {
      await cleanupTempDir(tmpDir);
    }
  }
);

Operation	Typical latency	Recommended timeout	Bottleneck
Audio format conversion (60s audio)	2–8 s	60 s	CPU (codec encoding)
Whisper transcription (60s audio)	5–20 s	60 s	OpenAI API latency
Video frame extraction (1 frame)	0.5–2 s	30 s	Seek + decode
Video frame extraction (20 frames)	5–30 s	120 s	Multiple seeks through large file
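
One detail the table implies: the extract_video_frames loop applies a 30-second timeout per frame, so 20 slow frames could in the worst case take 600 seconds, well beyond the 120-second figure for the whole batch. A small sketch of one way to enforce an overall budget follows; nextStepTimeout is a hypothetical helper, not part of fluent-ffmpeg.

```typescript
// Hypothetical helper: how many milliseconds the next step may use.
// It is the per-step cap, or whatever remains of the overall budget if that
// is smaller. Returns 0 once the deadline has passed.
export function nextStepTimeout(
  deadlineMs: number,
  perStepCapMs: number,
  nowMs: number = Date.now()
): number {
  return Math.max(0, Math.min(perStepCapMs, deadlineMs - nowMs));
}

// Intended use inside the frame loop (shown as comments to stay self-contained):
//   const deadline = Date.now() + 120_000;
//   for (const ts of timestamps_seconds) {
//     const budget = nextStepTimeout(deadline, 30_000);
//     if (budget === 0) throw new Error('Frame extraction exceeded the 120 s budget');
//     await runFfmpeg(cmd, budget);
//   }
```

Passing the result as the timeout argument of runFfmpeg keeps each frame capped at 30 seconds while the batch as a whole cannot run past the deadline.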

Health endpoint for media monitoring

// Assumes `http` is a Fastify instance (reply.code and reply.send are Fastify APIs)
http.get('/health/media', async (req, reply) => {
  const start = Date.now();
  const tmpDir = await makeTempDir();
  try {
    // Step 1: have FFmpeg synthesize one second of silence as a WAV file.
    // This exercises the ffmpeg binary and avoids embedding a sample file.
    // (Assumes the build includes the lavfi input device, as standard builds do.)
    const probePath = path.join(tmpDir, 'probe.wav');
    await new Promise<void>((resolve, reject) => {
      ffmpeg()
        .input('anullsrc=r=16000:cl=mono')
        .inputOptions(['-f', 'lavfi'])
        .duration(1)
        .output(probePath)
        .on('end', () => resolve())
        .on('error', reject)
        .run();
    });

    // Step 2: read the file back with ffprobe to exercise the ffprobe binary
    await new Promise<void>((resolve, reject) => {
      ffmpeg.ffprobe(probePath, (err) => {
        if (err) reject(err);
        else resolve();
      });
    });

    return reply.send({
      status: 'ok',
      latency_ms: Date.now() - start,
      ffmpeg_path: ffmpegStatic,
      ffprobe_path: ffprobeStatic.path
    });
  } catch (err) {
    return reply.code(503).send({
      status: 'error',
      detail: err instanceof Error ? err.message : String(err),
      latency_ms: Date.now() - start
    });
  } finally {
    await cleanupTempDir(tmpDir);
  }
});

The health probe has FFmpeg generate one second of silence and then inspects the result with ffprobe. This validates that both the FFmpeg binary (ffmpeg-static) and the ffprobe binary (@ffprobe-installer/ffprobe) are present and executable. It catches a common failure mode after botched deployments: native binaries that lack execute permission or were built for the wrong architecture (e.g., an x86 binary in an ARM container).

Silent failure modes

Failure	Symptom	Caught by process ping?	Detection
FFmpeg binary missing, not executable, or wrong architecture	All media tools fail at spawn with ENOENT, EACCES, or ENOEXEC	No	`/health/media` probe that actually runs the binaries
Temp directory disk full	FFmpeg writes fail; ENOSPC error	No	Add disk-space check to health endpoint; monitor /tmp usage
Temp file cleanup skipped (missing finally)	Disk fills gradually; tools start failing after hours/days	No — until disk full	Always use `finally { await cleanupTempDir(tmpDir) }`
OpenAI API key expired	Transcription tool throws 401	No	Add `/health/whisper` that tests the API key with a tiny probe
Runaway FFmpeg process	CPU pegged; other tool calls time out	No	Hard timeout in `runFfmpeg()` with SIGKILL; monitor CPU
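
The disk-space check suggested in the table can be sketched with Node's built-in statfs (available in recent Node versions, 18.15 and later). This is an illustration under assumptions: checkTempDiskSpace is a hypothetical helper, the 1 GiB default threshold is arbitrary, and free space is approximated as available blocks times block size.

```typescript
import { statfs } from 'node:fs/promises';
import { tmpdir } from 'node:os';

export interface DiskCheck {
  dir: string;
  free_bytes: number;
  ok: boolean;
}

// Hypothetical helper: report free space on the filesystem holding `dir`
// and whether it meets a minimum threshold (default 1 GiB, an assumption).
export async function checkTempDiskSpace(
  minFreeBytes: number = 1024 ** 3,
  dir: string = tmpdir()
): Promise<DiskCheck> {
  const stats = await statfs(dir);
  // bavail = blocks available to unprivileged users; bsize = block size in bytes
  const freeBytes = stats.bavail * stats.bsize;
  return { dir, free_bytes: freeBytes, ok: freeBytes >= minFreeBytes };
}
```

A health handler could call this alongside the media probe and return 503 when ok is false, so a filling disk shows up before FFmpeg starts failing with ENOSPC.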

Frequently asked questions

Can I run FFmpeg in a serverless environment like Lambda?

Yes. Lambda allows writing to /tmp (512 MB by default, configurable up to 10 GB). Use the static FFmpeg binary from ffmpeg-static so there's no dependency on the Lambda runtime environment. Set the Lambda timeout to at least 60 seconds for audio processing. Lambda's /tmp is ephemeral, but it is reused across warm invocations of the same execution environment, so cleanup still matters; leftover files disappear only when the environment is recycled. CPU scales with the configured memory (roughly one full vCPU at about 1.8 GB), so small functions encode slowly, and very long recordings can hit Lambda's 15-minute maximum timeout. For long recordings on Lambda, split the audio into 5-minute chunks and process them in parallel using multiple Lambda invocations.

What's the best way to handle audio longer than Whisper's 25 MB limit?

Split the audio into overlapping segments. FFmpeg's segment muxer (-f segment -segment_time 600) splits at fixed intervals but has no overlap option, so cut overlapping chunks explicitly with -ss and -t instead. For 10-minute chunks with a 5-second overlap: ffmpeg -ss 0 -t 605 -i input.mp3 -c copy chunk_000.mp3, then ffmpeg -ss 600 -t 605 -i input.mp3 -c copy chunk_001.mp3, and so on. The overlap ensures words at chunk boundaries are captured. Transcribe each chunk and concatenate the results, deduplicating the overlapping region (compare the last few words of segment N with the first few words of segment N+1 and merge). Return the full transcript with segment boundaries marked so the caller can trace which part of the recording each sentence came from.
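
The overlap-and-merge idea can be sketched as two pure helpers. Both names are hypothetical: planChunks computes a start offset and duration for each chunk (values that could be passed to FFmpeg's -ss and -t options), and mergeTranscripts drops the words at the start of a segment that repeat the end of the previous one. The word matching is deliberately naive (case-insensitive, basic punctuation stripped) and would need tuning for real transcripts.

```typescript
export interface Chunk {
  startSeconds: number;
  durationSeconds: number;
}

// Plan chunks of `chunkSeconds`, each extended by `overlapSeconds` into the next.
export function planChunks(totalSeconds: number, chunkSeconds: number, overlapSeconds: number): Chunk[] {
  if (chunkSeconds <= 0 || overlapSeconds < 0) {
    throw new RangeError('chunkSeconds must be > 0 and overlapSeconds >= 0');
  }
  const chunks: Chunk[] = [];
  for (let start = 0; start < totalSeconds; start += chunkSeconds) {
    const end = Math.min(totalSeconds, start + chunkSeconds + overlapSeconds);
    chunks.push({ startSeconds: start, durationSeconds: end - start });
  }
  return chunks;
}

const normalizeWord = (word: string): string =>
  word.toLowerCase().replace(/[.,!?;:"']/g, '');

// Join two transcripts, removing the longest run of words (up to
// maxOverlapWords) that both ends `previous` and starts `next`.
export function mergeTranscripts(previous: string, next: string, maxOverlapWords = 30): string {
  const a = previous.trim().split(/\s+/).filter(Boolean);
  const b = next.trim().split(/\s+/).filter(Boolean);
  const limit = Math.min(maxOverlapWords, a.length, b.length);
  for (let n = limit; n > 0; n--) {
    const tail = a.slice(a.length - n).map(normalizeWord);
    const head = b.slice(0, n).map(normalizeWord);
    if (tail.every((word, i) => word === head[i])) {
      return [...a, ...b.slice(n)].join(' ');
    }
  }
  return [...a, ...b].join(' ');
}
```

For a 25-minute recording, planChunks(1500, 600, 5) yields three chunks starting at 0, 600, and 1200 seconds, the first two lasting 605 seconds and the last 300.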

How do I extract audio from a video file for transcription?

Extract the audio stream from the video before sending to Whisper: ffmpeg(videoPath).noVideo().audioCodec('libmp3lame').audioBitrate(64).audioChannels(1).audioFrequency(16000).output(audioPath), then start the command (for example with runFfmpeg). This strips the video track entirely and produces a small MP3. For a 10-minute MP4 video at 1080p (~200 MB), the extracted 64 kbps mono MP3 is typically under 5 MB, well within Whisper's limit. The transcribe_audio tool already accepts input_format: 'video/mp4'. It does not branch on the MIME type; its conversion step simply writes an audio-only MP3, so FFmpeg drops the video stream. Adding .noVideo() makes that intent explicit.

How do I return metadata about an audio or video file?

Use ffprobe via fluent-ffmpeg's ffprobe(path, callback) to extract container and stream metadata: duration, bitrate, codec, sample rate, and dimensions. Return this as a JSON TextContent block alongside the processed media. For MCP tools that accept user-uploaded files, running ffprobe before processing is a good security practice: it validates that the file is actually a media file (not a renamed executable or ZIP archive) and gives you the duration, which you can use to estimate processing time and enforce duration limits.
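
A pre-processing check along those lines might look like the sketch below. summarizeProbe is a hypothetical helper that takes the metadata object ffprobe produces and either returns a compact summary or throws. The interfaces cover only the fields used here (codec_type, codec_name, width, height, sample_rate, and format.duration are standard ffprobe fields); the duration limit is an assumption supplied by the caller.

```typescript
// Minimal shape of the ffprobe fields used below; real output has many more.
export interface ProbeStream {
  codec_type?: string;
  codec_name?: string;
  width?: number;
  height?: number;
  sample_rate?: number | string;
}
export interface ProbeData {
  format: { duration?: number | string; format_name?: string };
  streams: ProbeStream[];
}

export interface MediaSummary {
  container: string | null;
  duration_seconds: number;
  audio: { codec: string | null; sample_rate_hz: number | null } | null;
  video: { codec: string | null; width: number | null; height: number | null } | null;
}

// Reject files without audio/video streams or over the duration limit,
// otherwise return a compact summary suitable for a JSON text block.
export function summarizeProbe(data: ProbeData, maxDurationSeconds: number): MediaSummary {
  const audio = data.streams.find((s) => s.codec_type === 'audio');
  const video = data.streams.find((s) => s.codec_type === 'video');
  if (!audio && !video) {
    throw new Error('Not a media file: no audio or video stream found');
  }
  const duration = Number(data.format.duration);
  if (!Number.isFinite(duration) || duration <= 0) {
    throw new Error('Could not determine media duration');
  }
  if (duration > maxDurationSeconds) {
    throw new Error(`Duration ${duration}s exceeds the ${maxDurationSeconds}s limit`);
  }
  return {
    container: data.format.format_name ?? null,
    duration_seconds: duration,
    audio: audio
      ? {
          codec: audio.codec_name ?? null,
          sample_rate_hz: audio.sample_rate != null ? Number(audio.sample_rate) : null
        }
      : null,
    video: video
      ? { codec: video.codec_name ?? null, width: video.width ?? null, height: video.height ?? null }
      : null
  };
}
```

The summary can be serialized with JSON.stringify and returned as a text content block, and the thrown errors can be mapped to McpError with ErrorCode.InvalidParams.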