Guide · Multi-Tenant SaaS

Tenant Onboarding Automation for MCP Servers — schema provisioning, connection pools, and health verification

In a shared-nothing multi-tenant MCP deployment, adding a new tenant is not just a database row insert — it is a provisioning workflow: create the tenant's database schema or isolated table partition, initialize a connection pool scoped to that tenant, seed any default configuration, register health monitoring, and verify the tenant can actually execute a tool call end-to-end before marking them as active. Doing this manually does not scale past a handful of tenants. This guide covers idempotent provisioning pipelines triggered by Stripe webhooks, schema-per-tenant PostgreSQL provisioning, lazy vs eager pool initialization, and automated canary tool calls that verify each new tenant is fully operational before their first real agent session.

TL;DR

Build a provisionTenant(tenantId) function that is fully idempotent (safe to call multiple times), runs all setup steps in a transaction where possible, and ends with a canary tool call that proves the tenant's data path works end-to-end. Trigger it from a Stripe checkout.session.completed webhook. Wire AliveMCP to a per-tenant health endpoint (/health?tenant=:id) so you know the moment a new tenant's infrastructure degrades after onboarding.

Provisioning pipeline overview

A complete tenant onboarding flow for a schema-per-tenant deployment has five sequential steps, each of which can fail independently:

Step What happens Failure impact
1. Tenant row Insert tenant record with status provisioning Tenant cannot be found; all subsequent steps fail
2. Schema creation CREATE SCHEMA tenant_{id} + run migrations in that schema No data isolation; tool calls hit wrong schema
3. Pool initialization Create and warm a connection pool for the tenant's schema First tool calls are slow (cold pool) or fail if pool not initialized
4. Seed data Insert default config rows, roles, and initial state Tool calls that depend on config rows fail or return empty results
5. Health verification Run canary tool call; set status active on success Tenant flagged as active while broken; first agent session fails

Idempotent provisioning function

The provisioning function must be safe to call multiple times. Webhook delivery is at-least-once: Stripe may retry a checkout.session.completed webhook if your endpoint returns a non-200 response. If step 2 (schema creation) succeeded but step 3 (pool initialization) failed and caused a 500, Stripe retries — you must not fail on CREATE SCHEMA that already exists.

// tenant-provisioner.ts
import { getAdminPool } from './db.js';
import { TenantPoolRegistry } from './pool-registry.js';
import { runCanaryToolCall } from './canary.js';

export async function provisionTenant(tenantId: string): Promise<void> {
  const db = getAdminPool();

  // Step 1: Upsert tenant row (idempotent)
  await db.query(`
    INSERT INTO tenants (id, status, created_at)
    VALUES ($1, 'provisioning', NOW())
    ON CONFLICT (id) DO UPDATE
      SET status = CASE WHEN tenants.status = 'active' THEN 'active' ELSE 'provisioning' END
  `, [tenantId]);

  // Step 2: Create schema (idempotent — IF NOT EXISTS)
  const schemaName = `tenant_${tenantId.replace(/-/g, '_')}`;

  await db.query(`CREATE SCHEMA IF NOT EXISTS ${schemaName}`);

  // Run migrations within the tenant schema
  await runMigrationsInSchema(db, schemaName);

  // Step 3: Initialize connection pool
  await TenantPoolRegistry.initPool(tenantId, schemaName);

  // Step 4: Seed default data (idempotent — ON CONFLICT DO NOTHING)
  const tenantPool = TenantPoolRegistry.getPool(tenantId);
  await seedDefaultData(tenantPool, tenantId);

  // Step 5: Verify with canary tool call
  const canaryResult = await runCanaryToolCall(tenantId);
  if (!canaryResult.success) {
    throw new Error(`Canary tool call failed for tenant ${tenantId}: ${canaryResult.error}`);
  }

  // Mark as active only after successful canary
  await db.query(
    `UPDATE tenants SET status = 'active', provisioned_at = NOW() WHERE id = $1`,
    [tenantId]
  );
}

async function runMigrationsInSchema(db: Pool, schemaName: string): Promise<void> {
  // Set search_path to run migrations in the correct schema
  const client = await db.connect();
  try {
    await client.query(`SET search_path TO ${schemaName}, public`);
    await client.query(`
      CREATE TABLE IF NOT EXISTS tool_calls (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        tool_name TEXT NOT NULL,
        called_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        duration_ms INTEGER,
        success BOOLEAN NOT NULL
      )
    `);
    await client.query(`
      CREATE TABLE IF NOT EXISTS tenant_config (
        key TEXT PRIMARY KEY,
        value JSONB NOT NULL,
        updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
      )
    `);
    // Add more migration steps as DDL evolves
  } finally {
    client.release();
  }
}

async function seedDefaultData(pool: Pool, tenantId: string): Promise<void> {
  await pool.query(`
    INSERT INTO tenant_config (key, value)
    VALUES
      ('alert_email', '"null"'),
      ('webhook_url', '"null"'),
      ('plan', '"free"'),
      ('max_tool_calls_per_hour', '100')
    ON CONFLICT (key) DO NOTHING
  `);
}

Connection pool registry

In schema-per-tenant mode, each tenant needs a pool configured with their schema in search_path. A registry manages the lifecycle of per-tenant pools:

// pool-registry.ts
import { Pool } from 'pg';

const pools = new Map<string, Pool>();

export const TenantPoolRegistry = {
  async initPool(tenantId: string, schemaName: string): Promise<Pool> {
    if (pools.has(tenantId)) {
      return pools.get(tenantId)!;
    }

    const pool = new Pool({
      connectionString: process.env.DATABASE_URL,
      max: 5,                         // 5 connections per tenant
      idleTimeoutMillis: 30_000,
      connectionTimeoutMillis: 3_000,
    });

    // Set search_path on every new connection
    pool.on('connect', (client) => {
      client.query(`SET search_path TO ${schemaName}, public`);
    });

    // Warm: establish one connection now to fail fast if DB is unreachable
    const warmupClient = await pool.connect();
    await warmupClient.query('SELECT 1');
    warmupClient.release();

    pools.set(tenantId, pool);
    return pool;
  },

  getPool(tenantId: string): Pool {
    const pool = pools.get(tenantId);
    if (!pool) throw new Error(`Pool not initialized for tenant ${tenantId}`);
    return pool;
  },

  async drainPool(tenantId: string): Promise<void> {
    const pool = pools.get(tenantId);
    if (!pool) return;
    pools.delete(tenantId);
    await pool.end();
  },

  getPoolCount(): number {
    return pools.size;
  },
};

Lazy initialization for large tenant counts

Eagerly initializing pools for all tenants at server startup is impractical when tenant count exceeds a few hundred — each pool holds up to 5 connections, so 1,000 tenants × 5 connections = 5,000 PostgreSQL connections at server startup. Use lazy initialization: create a pool on the first tool call for each tenant, cache it, and evict idle pools after a period of inactivity:

// Lazy pool with LRU eviction
import LRU from 'lru-cache';

const poolCache = new LRU<string, Pool>({
  max: 500,                           // max 500 active tenant pools
  dispose: async (pool, tenantId) => {
    console.log(`Evicting idle pool for tenant ${tenantId}`);
    await pool.end();
  },
  ttl: 30 * 60 * 1000,               // evict pools idle for 30 minutes
});

export async function getOrCreatePool(tenantId: string): Promise<Pool> {
  if (poolCache.has(tenantId)) {
    return poolCache.get(tenantId)!;
  }

  const schemaName = await getSchemaName(tenantId);  // lookup from tenants table
  const pool = await TenantPoolRegistry.initPool(tenantId, schemaName);
  poolCache.set(tenantId, pool);
  return pool;
}

Canary tool call verification

The final onboarding step runs a real tool call through the MCP server's dispatch path for the new tenant. This catches issues that schema creation cannot: missing seed data, misconfigured search_path, pool exhaustion under load, and permission errors.

// canary.ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';

export interface CanaryResult {
  success: boolean;
  durationMs: number;
  error?: string;
}

export async function runCanaryToolCall(tenantId: string): Promise<CanaryResult> {
  const start = Date.now();

  try {
    // Connect to the MCP server as the tenant (using a provisioning API key)
    const transport = new StreamableHTTPClientTransport(
      new URL(`${process.env.MCP_SERVER_URL}/mcp`),
      {
        requestInit: {
          headers: {
            'X-Tenant-Id': tenantId,
            'X-Api-Key': process.env.PROVISIONING_API_KEY!,
          },
        },
      }
    );

    const client = new Client({ name: 'provisioner-canary', version: '1.0.0' });
    await client.connect(transport);

    // Run the canary tool — a lightweight read that touches the tenant's schema
    const result = await client.callTool({
      name: 'health_check',
      arguments: {},
    });

    await client.close();

    if (result.isError) {
      return { success: false, durationMs: Date.now() - start, error: String(result.content) };
    }

    return { success: true, durationMs: Date.now() - start };
  } catch (err) {
    return { success: false, durationMs: Date.now() - start, error: String(err) };
  }
}

Webhook trigger from Stripe

Wire the provisioning pipeline to a Stripe checkout.session.completed webhook. Stripe delivers this event when a customer completes checkout — exactly when you want to provision their tenant:

// webhook.ts (Express handler)
import Stripe from 'stripe';
import { provisionTenant } from './tenant-provisioner.js';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  const sig = req.headers['stripe-signature'] as string;
  let event: Stripe.Event;

  try {
    event = stripe.webhooks.constructEvent(req.body, sig, process.env.STRIPE_WEBHOOK_SECRET!);
  } catch (err) {
    return res.status(400).send(`Webhook signature verification failed`);
  }

  // Acknowledge immediately — provisioning is async
  res.status(200).json({ received: true });

  if (event.type === 'checkout.session.completed') {
    const session = event.data.object as Stripe.Checkout.Session;
    const tenantId = session.metadata?.tenant_id;

    if (!tenantId) {
      console.error('checkout.session.completed missing tenant_id metadata', session.id);
      return;
    }

    provisionTenant(tenantId)
      .then(() => console.log(`Tenant ${tenantId} provisioned successfully`))
      .catch((err) => {
        console.error(`Tenant ${tenantId} provisioning failed`, err);
        // Notify ops via alert, not a re-throw — webhook already returned 200
      });
  }
});

Return 200 before provisionTenant completes. Provisioning can take 2–5 seconds; Stripe has a 30-second webhook timeout, but returning promptly prevents Stripe from marking the delivery as failed if provisioning is slow. The tenant's UI should poll a /api/tenants/:id/status endpoint that reads the status field from the tenants table — transitioning from provisioning to active as the pipeline completes.

Monitoring new tenant health after onboarding

Provisioning success does not guarantee ongoing health. A tenant whose connection pool was healthy at provisioning time may degrade later due to idle connection eviction, schema migration failures, or configuration changes. Add per-tenant health checks to your monitoring stack:

// /health endpoint with per-tenant probe support
app.get('/health', async (req, res) => {
  const tenantId = req.query.tenant as string | undefined;

  if (tenantId) {
    // Per-tenant health check
    try {
      const pool = TenantPoolRegistry.getPool(tenantId);
      const client = await pool.connect();
      await client.query('SELECT 1');
      client.release();
      return res.json({ status: 'ok', tenant: tenantId });
    } catch (err) {
      return res.status(503).json({ status: 'down', tenant: tenantId, error: String(err) });
    }
  }

  // Global health check
  const poolCount = TenantPoolRegistry.getPoolCount();
  res.json({ status: 'ok', active_pools: poolCount });
});

Register each new tenant with AliveMCP after provisioning: call the AliveMCP API to add a monitor for https://your-mcp-server.com/health?tenant=:tenantId. When that tenant's connection pool fails, AliveMCP alerts before the tenant notices broken tool calls. This is the difference between reactive support ("tenant files a ticket") and proactive operations ("you fix it before the tenant's agent session starts").

Frequently asked questions

How do I handle provisioning failures without leaving orphaned schemas?

Use a cleanup job that scans for tenants stuck in provisioning status for more than 10 minutes and attempts re-provisioning or deletes the partial schema. Since provisionTenant is idempotent, re-running it from scratch is safe. For cleanup of partial schemas: DROP SCHEMA IF EXISTS tenant_{id} CASCADE followed by a fresh provisionTenant call. Track provisioning attempts in a provisioning_log table (attempt number, started_at, finished_at, error) so you can identify tenants that fail repeatedly and alert ops.

What is the maximum number of tenant schemas one PostgreSQL instance can support?

PostgreSQL has no hard limit on schema count, but practical limits emerge from connection count and migration time. With 5 connections per tenant pool and a PostgreSQL max_connections of 200 (leaving 20 for admin), you can support about 36 simultaneously active tenant pools. Use lazy pool initialization with LRU eviction to support hundreds to thousands of tenants with far fewer connections — most tenants are not active simultaneously. For truly large tenant counts (10,000+), consider a single schema with row-level security rather than schema-per-tenant, which eliminates per-tenant connection pool overhead.

How do I run schema migrations across all tenant schemas without downtime?

Iterate through all tenant schemas in batches and run SET search_path TO tenant_{id}; <migration DDL> for each. Use an online migration approach: add columns as nullable first, backfill in batches, then add NOT NULL constraint. PostgreSQL's ADD COLUMN NOT NULL DEFAULT in PostgreSQL 11+ rewrites the table — for large tables, use ADD COLUMN (nullable) followed by ALTER COLUMN SET DEFAULT and background backfill. A migration tracking table within each tenant schema records which migrations have run, making the process idempotent across schema counts.

How long should tenant onboarding take?

Schema creation and seed data insertion: 100–500ms. Pool initialization and warmup: 50–200ms. Canary tool call round-trip: 100–500ms. Total: 250ms–1.2s for a fresh tenant with an empty schema. This is fast enough to complete synchronously within a user's post-checkout redirect flow if you keep it tight. For tenants with large default datasets to import (e.g., importing their existing inventory into the tenant schema), split the provisioning into two phases: basic provisioning (fast, synchronous) and data import (background job, status polled by UI).

How do I offboard a tenant (delete their data)?

Add a deprovisionTenant(tenantId) function that mirrors the provisioning pipeline in reverse: drain the tenant's pool, drop the schema (DROP SCHEMA IF EXISTS tenant_{id} CASCADE), delete the tenant row. Make it idempotent (safe to call if the schema was already dropped). Store a deprovisioned_at timestamp and archive summary metadata (but not PII) for accounting. Trigger from a Stripe customer.subscription.deleted webhook — but add a grace period (e.g., 30 days) before actually dropping data, in case of accidental cancellation. During the grace period, block new tool calls but allow data export.

Further reading

Monitor each new tenant from the moment they go live

AliveMCP lets you register per-tenant health check URLs after provisioning — so you know the moment any tenant's MCP infrastructure degrades, before their first agent session fails.

Start monitoring free