MCP Server State Machines — persisting multi-step agentic workflow state

Agents don't just call single tools in isolation — they execute workflows: validate an order, reserve inventory, charge the card, send a confirmation, then trigger fulfillment. If the card charge succeeds but the fulfillment trigger fails, the server must know where in the workflow it got stuck and resume correctly on retry rather than re-charging the card. State machines make that resumption safe by encoding exactly what states exist, what transitions are valid, and what data is carried between steps. This guide covers implementing persistent state machines in MCP servers: the table schema, transition enforcement, optimistic locking for concurrent agents, and recovery patterns for partial failures.

TL;DR

Store workflow state as a { state, context } row in Postgres. Enforce valid transitions server-side with an allowlist. Guard against concurrent modifications either with a row lock (SELECT ... FOR UPDATE) or optimistically, with WHERE state = $expected in your UPDATE. Give the agent a get_workflow_state tool (get_order_workflow in the example below) to read current state and a set of action tools that advance it. Each action tool writes the state transition and its audit event in one database transaction; side effects outside the database, such as a Stripe capture, cannot join that transaction, so make them idempotent. AliveMCP monitors a /health/workflows endpoint that alerts when workflows sit in non-terminal states for too long.

Why in-memory state is not enough

If you store workflow state in a server-side Map or module variable, you lose it when the process restarts, when the pod is replaced, or when the agent's session times out and the user comes back later. Persistent state means:

Workflows survive server restarts — the agent can resume from the last committed state
Concurrent agents (two sessions for the same order) see consistent state
You have an audit trail of every state transition for debugging and compliance
Long-running workflows that span hours or days are safe

Storage approach	Survives restart	Concurrent safe	Audit trail
In-memory Map	No	No (race conditions)	No
Redis	Yes (if persistent)	Partial (Lua scripts)	No (TTL eviction)
Postgres table	Yes	Yes (row locking)	Yes (event log)

Database schema

-- workflows table: one row per workflow instance
CREATE TABLE workflows (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  workflow_type TEXT NOT NULL,        -- 'order_fulfillment', 'invoice_approval', etc.
  state        TEXT NOT NULL,          -- current state name
  context      JSONB NOT NULL DEFAULT '{}', -- data carried between steps
  version      INTEGER NOT NULL DEFAULT 1,  -- for optimistic locking
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- workflow_events table: append-only audit log
CREATE TABLE workflow_events (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  workflow_id  UUID NOT NULL REFERENCES workflows(id),
  from_state   TEXT NOT NULL,
  to_state     TEXT NOT NULL,
  event        TEXT NOT NULL,
  payload      JSONB,
  actor        TEXT,          -- agent session, user, or system
  occurred_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON workflow_events (workflow_id, occurred_at DESC);
CREATE INDEX ON workflows (state, updated_at);

State machine definition

Define the state machine as a typed constant that lists the valid states and, for each state, the events it accepts and the state each event leads to. Guard conditions that must pass before a transition is allowed can be checked on top of this map:

// order-fulfillment-machine.ts
type OrderState =
  | 'created'
  | 'inventory_reserved'
  | 'payment_authorized'
  | 'payment_captured'
  | 'fulfillment_triggered'
  | 'shipped'
  | 'delivered'
  | 'cancelled'
  | 'refunded';

type OrderEvent =
  | 'reserve_inventory'
  | 'authorize_payment'
  | 'capture_payment'
  | 'trigger_fulfillment'
  | 'mark_shipped'
  | 'mark_delivered'
  | 'cancel'
  | 'refund';

const TRANSITIONS: Record<OrderState, Partial<Record<OrderEvent, OrderState>>> = {
  created: {
    reserve_inventory: 'inventory_reserved',
    cancel: 'cancelled'
  },
  inventory_reserved: {
    authorize_payment: 'payment_authorized',
    cancel: 'cancelled'
  },
  payment_authorized: {
    capture_payment: 'payment_captured',
    cancel: 'cancelled'
  },
  payment_captured: {
    trigger_fulfillment: 'fulfillment_triggered'
  },
  fulfillment_triggered: {
    mark_shipped: 'shipped'
  },
  shipped: {
    mark_delivered: 'delivered',
    refund: 'refunded'
  },
  delivered: {
    refund: 'refunded'
  },
  cancelled: {},
  refunded: {}
};

export function getNextState(
  currentState: OrderState,
  event: OrderEvent
): OrderState | null {
  return TRANSITIONS[currentState]?.[event] ?? null;
}
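
The map above encodes states and transitions but no guards. One way to layer guard conditions on top is sketched below. The names GUARDS, OrderContext, and checkTransition are hypothetical, the context fields are assumptions about what the JSONB context might hold, and the state and event types are trimmed so the block stands alone.

```typescript
// Trimmed copies of the guide's states and events so this block is self-contained.
type DemoState = 'created' | 'inventory_reserved' | 'payment_authorized' | 'cancelled';
type DemoEvent = 'reserve_inventory' | 'authorize_payment' | 'cancel';

// Hypothetical context shape; the real JSONB context may differ.
interface OrderContext {
  items?: string[];
  reservation_id?: string;
}

const GUARDED_TRANSITIONS: Record<DemoState, Partial<Record<DemoEvent, DemoState>>> = {
  created: { reserve_inventory: 'inventory_reserved', cancel: 'cancelled' },
  inventory_reserved: { authorize_payment: 'payment_authorized', cancel: 'cancelled' },
  payment_authorized: { cancel: 'cancelled' },
  cancelled: {}
};

// A guard returns null when the transition may proceed, or a reason when it may not.
type Guard = (ctx: OrderContext) => string | null;

const GUARDS: Partial<Record<DemoEvent, Guard>> = {
  reserve_inventory: (ctx) =>
    ctx.items && ctx.items.length > 0 ? null : 'Order has no items to reserve',
  authorize_payment: (ctx) =>
    ctx.reservation_id ? null : 'Inventory reservation is missing from context'
};

type TransitionCheck =
  | { ok: true; nextState: DemoState }
  | { ok: false; reason: string };

function checkTransition(state: DemoState, event: DemoEvent, ctx: OrderContext): TransitionCheck {
  // First the allowlist: is this event accepted in this state at all?
  const nextState = GUARDED_TRANSITIONS[state][event];
  if (!nextState) {
    return { ok: false, reason: `Invalid transition: ${event} is not allowed from state ${state}` };
  }
  // Then the guard, if one is registered for the event.
  const failure = GUARDS[event]?.(ctx) ?? null;
  return failure ? { ok: false, reason: failure } : { ok: true, nextState };
}
```

Keeping guards in a separate table leaves the transition map readable as a plain allowlist, and the returned reason string can be passed straight back to the agent.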

Transition enforcement with optimistic locking

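Two agents racing on the same workflow is the dangerous case. Agent A reads state inventory_reserved, and agent B reads state inventory_reserved. Both try to advance via authorize_payment. Without locking, both would succeed and you'd have two payment authorizations. The transition method below closes the race with a row lock: SELECT ... FOR UPDATE makes the second transaction wait until the first commits, at which point it reads payment_authorized and its transition is rejected as invalid. Strictly speaking, that is pessimistic locking. The version column is bumped on every update so that callers who read the row without holding a lock can still detect a concurrent change optimistically: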

// workflow-service.ts
import { db } from './db.js';
import { getNextState } from './order-fulfillment-machine.js';

export class WorkflowService {
  async transition(
    workflowId: string,
    event: string,
    newContextData: Record<string, unknown>,
    actor: string
  ): Promise<{ success: true; newState: string } | { success: false; reason: string }> {
    const client = await db.connect();
    try {
      await client.query('BEGIN');

      // Read current state with row lock
      const row = await client.query(
        `SELECT state, context, version FROM workflows WHERE id = $1 FOR UPDATE`,
        [workflowId]
      );

      if (!row.rows.length) {
        await client.query('ROLLBACK');
        return { success: false, reason: 'Workflow not found' };
      }

      const { state: currentState, context, version } = row.rows[0];
      const nextState = getNextState(currentState, event as any);

      if (!nextState) {
        await client.query('ROLLBACK');
        return {
          success: false,
          reason: `Invalid transition: ${event} is not allowed from state ${currentState}`
        };
      }

      const newContext = { ...context, ...newContextData };

      // Update and bump the version. The FOR UPDATE row lock taken above already
      // serializes writers; the bump lets lock-free readers detect the change.
      await client.query(
        `UPDATE workflows
         SET state = $1, context = $2, version = version + 1, updated_at = NOW()
         WHERE id = $3`,
        [nextState, newContext, workflowId]
      );

      // Append to audit log
      await client.query(
        `INSERT INTO workflow_events
         (workflow_id, from_state, to_state, event, payload, actor)
         VALUES ($1, $2, $3, $4, $5, $6)`,
        [workflowId, currentState, nextState, event, newContextData, actor]
      );

      await client.query('COMMIT');
      return { success: true, newState: nextState };
    } catch (err) {
      await client.query('ROLLBACK');
      throw err;
    } finally {
      client.release();
    }
  }
}
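
For comparison, a purely optimistic variant skips the row lock and puts the expected state and version into the UPDATE itself, so a stale writer matches zero rows. The sketch below is an illustration under assumptions: Queryable is a hypothetical minimal interface shaped after node-postgres (a result object with rowCount), and the function names are not part of the guide's WorkflowService.

```typescript
// Hypothetical minimal query interface so the sketch is not tied to one driver.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rowCount: number | null }>;
}

interface Statement {
  text: string;
  values: unknown[];
}

// Compare-and-swap UPDATE: it only matches the row if state and version are
// still what the caller read earlier, with no lock held in between.
function buildOptimisticUpdate(
  workflowId: string,
  expectedState: string,
  expectedVersion: number,
  nextState: string
): Statement {
  return {
    text: `UPDATE workflows
           SET state = $1, version = version + 1, updated_at = NOW()
           WHERE id = $2 AND state = $3 AND version = $4`,
    values: [nextState, workflowId, expectedState, expectedVersion]
  };
}

type CasResult = { success: true } | { success: false; reason: string };

// Zero affected rows means another writer got there first (or the row does not exist).
function interpretRowCount(rowCount: number | null): CasResult {
  return rowCount === 1
    ? { success: true }
    : { success: false, reason: 'Conflict: workflow was modified concurrently; re-read and retry' };
}

async function transitionOptimistic(
  db: Queryable,
  workflowId: string,
  expectedState: string,
  expectedVersion: number,
  nextState: string
): Promise<CasResult> {
  const stmt = buildOptimisticUpdate(workflowId, expectedState, expectedVersion, nextState);
  const result = await db.query(stmt.text, stmt.values);
  return interpretRowCount(result.rowCount);
}
```

The optimistic form avoids holding a lock while the caller thinks, at the cost of a retry path: on a conflict the caller re-reads the row and decides whether the transition still makes sense. The audit-log insert would still need to share a transaction with this UPDATE.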

MCP tools for workflow interaction

Expose the workflow as a set of focused tools — one to read state, one per allowed action:

const workflowService = new WorkflowService();

// Read current workflow state
server.tool('get_order_workflow', {
  workflow_id: z.string().uuid()
}, async ({ workflow_id }) => {
  const row = await db.query(
    `SELECT w.state, w.context, w.updated_at,
            json_agg(e ORDER BY e.occurred_at DESC) FILTER (WHERE e.id IS NOT NULL) AS recent_events
     FROM workflows w
     LEFT JOIN workflow_events e ON e.workflow_id = w.id
     WHERE w.id = $1
     GROUP BY w.id`,
    [workflow_id]
  );

  if (!row.rows.length) {
    return { content: [{ type: 'text', text: JSON.stringify({ error: 'Not found' }) }] };
  }

  const { state, context, updated_at, recent_events } = row.rows[0];
  return {
    content: [{
      type: 'text',
      text: JSON.stringify({
        workflow_id,
        current_state: state,
        context,
        last_updated: updated_at,
        recent_events: (recent_events ?? []).slice(0, 5)
      })
    }]
  };
});

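// Advance to next state with side-effect
server.tool('capture_order_payment', {
  workflow_id: z.string().uuid(),
  // Stripe's capture call takes a PaymentIntent ID, not a payment method ID
  payment_intent_id: z.string()
}, async ({ workflow_id, payment_intent_id }, extra) => {
  // Check the state first so the side effect never runs for an invalid transition
  const current = await db.query(`SELECT state FROM workflows WHERE id = $1`, [workflow_id]);
  const currentState = current.rows[0]?.state;
  if (currentState !== 'payment_authorized') {
    return {
      content: [{ type: 'text', text: JSON.stringify({ error: `capture_payment is not allowed from state ${currentState ?? 'unknown'}` }) }]
    };
  }

  // Execute side effect. The idempotency key makes a retry return the original
  // result instead of capturing twice.
  const intent = await stripe.paymentIntents.capture(
    payment_intent_id,
    {},
    { idempotencyKey: `${workflow_id}:capture_payment` }
  );
  if (intent.status !== 'succeeded') {
    return {
      content: [{ type: 'text', text: JSON.stringify({ error: 'Payment capture failed', payment_intent_status: intent.status }) }]
    };
  }

  // Transition state
  const result = await workflowService.transition(
    workflow_id,
    'capture_payment',
    { stripe_payment_intent_id: intent.id, captured_at: new Date().toISOString() },
    // The handler's second argument carries request metadata; fall back to 'agent'
    String(extra?.requestId ?? 'agent')
  );

  return {
    content: [{
      type: 'text',
      text: JSON.stringify(result.success
        ? { new_state: result.newState, stripe_payment_intent_id: intent.id }
        : { error: result.reason }
      )
    }]
  };
});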

Detecting stuck workflows

A workflow stuck in an intermediate state for longer than expected is a silent failure — the agent session ended, an error was swallowed, or the user navigated away mid-process. Monitor for stuck workflows:

app.get('/health/workflows', async (req, res) => {
  // Workflows that should have progressed but haven't
  const stuck = await db.query(`
    SELECT workflow_type, state, count(*) as count
    FROM workflows
    WHERE state NOT IN ('delivered', 'cancelled', 'refunded')  -- terminal states
      AND updated_at < NOW() - INTERVAL '1 hour'
    GROUP BY workflow_type, state
    ORDER BY count DESC
  `);

  const totalStuck = stuck.rows.reduce((sum: number, r: any) => sum + parseInt(r.count), 0);

  res.status(totalStuck === 0 ? 200 : 503).json({
    status: totalStuck === 0 ? 'ok' : 'degraded',
    stuck_workflows: stuck.rows,
    total_stuck: totalStuck
  });
});

Register this with AliveMCP at a 5-minute interval. A workflow stuck for more than an hour in a non-terminal state usually represents a stalled agent, a side-effect error that wasn't surfaced, or a missing human-approval step. States that legitimately last longer than an hour, such as shipped, need a longer threshold or an exclusion, or the check will report them as stuck. AliveMCP catches the condition automatically, so you get paged before customer support tickets start arriving.
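
The query above applies one threshold to every non-terminal state. A per-state threshold is one possible refinement, sketched here as a pure function. The limits shown are illustrative assumptions rather than recommendations, and isStuck and STATE_LIMITS_MS are hypothetical names.

```typescript
// Terminal states, matching the health-check query above.
const TERMINAL_STATES = new Set(['delivered', 'cancelled', 'refunded']);

const HOUR_MS = 60 * 60 * 1000;

// Default matches the one-hour interval used in the query.
const DEFAULT_LIMIT_MS = HOUR_MS;

// Illustrative per-state overrides (assumed values): some states are expected
// to last far longer than an hour without anything being wrong.
const STATE_LIMITS_MS: Record<string, number> = {
  awaiting_manager_approval: 24 * HOUR_MS,
  shipped: 7 * 24 * HOUR_MS
};

function isStuck(state: string, updatedAt: Date, now: Date): boolean {
  if (TERMINAL_STATES.has(state)) return false;
  const limit = STATE_LIMITS_MS[state] ?? DEFAULT_LIMIT_MS;
  return now.getTime() - updatedAt.getTime() > limit;
}
```

The same table could instead drive the SQL, for example by generating one condition per state, if filtering in the database is preferable to filtering in application code.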

Frequently asked questions

How does a state machine handle partial failures — the side effect succeeds but the DB write fails?

When the side effect is itself a database write, run it in the same transaction as the state transition so both commit or roll back together. Side effects outside the database (Stripe charges, email sends) cannot join that transaction, so make them idempotent: check whether the side effect already completed before re-executing. For Stripe, pass an idempotency key ({ idempotencyKey: workflowId + ':capture_payment' }); a retry with the same key returns the original result without charging again. For email, store a sent_at timestamp in the workflow context and skip the send if it's already set. This "at-least-once with idempotency" pattern is safer than trying to achieve exactly-once semantics across two external systems.
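
The two idempotency checks described above can be sketched as small pure helpers. The names idempotencyKeyFor and planConfirmationEmail are hypothetical, as is the confirmation_sent_at field, which stands in for the sent_at timestamp mentioned in the answer.

```typescript
// Hypothetical slice of the workflow context relevant to email sends.
interface EmailContext {
  confirmation_sent_at?: string;
  [key: string]: unknown;
}

// Deterministic key per workflow and step: a retry of the same step reuses the
// same key, so the payment provider can recognize the repeat.
function idempotencyKeyFor(workflowId: string, event: string): string {
  return `${workflowId}:${event}`;
}

type EmailPlan =
  | { send: true; patch: { confirmation_sent_at: string } }
  | { send: false };

// Decide whether to send, and which context patch to persist after a send.
function planConfirmationEmail(ctx: EmailContext, now: Date): EmailPlan {
  if (ctx.confirmation_sent_at) {
    return { send: false }; // already sent on an earlier attempt
  }
  return { send: true, patch: { confirmation_sent_at: now.toISOString() } };
}
```

The patch returned by planConfirmationEmail would be passed as the new context data of the transition, so the "already sent" marker is committed together with the state change.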

When should I use XState vs a custom state machine implementation?

XState provides a rich state machine specification format (hierarchical states, parallel states, guards, actions) and a visual inspector that's valuable for complex workflows. For simple linear flows of 5–10 states, the overhead of XState's machinery is usually not worth it: a plain TypeScript constant like the TRANSITIONS map above is easier to read and debug. Use XState when you have conditional branching (the same event leads to different states based on context), parallel subflows (fulfillment and notification proceeding simultaneously), or deeply nested state hierarchies. XState works with the Postgres persistence layer above: in XState v5, persist actor.getPersistedSnapshot() to the context column and restore it with createActor(machine, { snapshot }).

How do I expose workflow state to the LLM without overwhelming context?

Return the state name and a brief summary of the context, not the raw context blob. The raw JSONB context may contain internal IDs, timestamps, and implementation details the LLM doesn't need. Shape the response explicitly: { current_state: "payment_captured", next_allowed_actions: ["trigger_fulfillment"], summary: "Payment of $149.99 captured via Stripe. Ready to trigger fulfillment." }. The next_allowed_actions list is particularly valuable — it tells the LLM exactly which tools to call next, reducing the chance of the agent trying an invalid transition and getting an error it needs to reason about. See tool discovery for how to design tool schemas that align with state machine transitions.
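
A shaping function along these lines could sit between the database row and the tool response. It is a sketch: toAgentView, the ALLOWED table (a trimmed copy of the transition map), and the amount_cents and currency context fields are assumptions made for illustration.

```typescript
// Trimmed transition map: state -> event -> next state.
const ALLOWED: Record<string, Record<string, string>> = {
  payment_authorized: { capture_payment: 'payment_captured', cancel: 'cancelled' },
  payment_captured: { trigger_fulfillment: 'fulfillment_triggered' },
  fulfillment_triggered: { mark_shipped: 'shipped' },
  cancelled: {}
};

// Hypothetical raw context; internal IDs like stripe_charge_id stay out of the view.
interface RawContext {
  amount_cents?: number;
  currency?: string;
  stripe_charge_id?: string;
}

interface AgentView {
  current_state: string;
  next_allowed_actions: string[];
  summary: string;
}

function toAgentView(state: string, ctx: RawContext): AgentView {
  // The allowed events for the current state double as the list of tools to call next.
  const next = Object.keys(ALLOWED[state] ?? {});
  const amount =
    ctx.amount_cents !== undefined
      ? `${(ctx.amount_cents / 100).toFixed(2)} ${ctx.currency ?? 'USD'}`
      : 'unknown amount';
  const nextText = next.length > 0
    ? `Next: ${next.join(', ')}.`
    : 'No further actions are allowed.';
  return {
    current_state: state,
    next_allowed_actions: next,
    summary: `Order is in state ${state} (${amount}). ${nextText}`
  };
}
```

Deriving next_allowed_actions from the same map that enforces transitions keeps the hint and the enforcement from drifting apart.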

How do I handle workflows that require human approval mid-way?

Model the approval as a state: payment_authorized → awaiting_manager_approval → payment_captured. The awaiting_manager_approval state persists until an approver calls the approve_payment tool (or the approval endpoint in your UI). The agent polls get_order_workflow and sees the state is still awaiting_manager_approval. No job queue is needed — the workflow row sits in that state until the human acts. Combine with the human-in-the-loop approval pattern for the notification and UI layer. The stuck-workflow health check catches approvals that expire without being acted on.
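
As a sketch of that flow, the approval step is just one more entry in the transition map. The request_approval and reject_payment events are hypothetical additions, while approve_payment follows the tool name used above. In this variant the approver's action is what moves the workflow to payment_captured, so a capture attempt made by the agent while approval is pending is rejected.

```typescript
type ApprovalState =
  | 'payment_authorized'
  | 'awaiting_manager_approval'
  | 'payment_captured'
  | 'cancelled';

type ApprovalEvent =
  | 'request_approval'   // hypothetical: the agent asks for sign-off
  | 'approve_payment'    // the approver's tool or UI endpoint
  | 'reject_payment'     // hypothetical: the approver declines
  | 'capture_payment';   // deliberately absent from the map below

const APPROVAL_TRANSITIONS: Record<ApprovalState, Partial<Record<ApprovalEvent, ApprovalState>>> = {
  payment_authorized: { request_approval: 'awaiting_manager_approval' },
  awaiting_manager_approval: {
    approve_payment: 'payment_captured',
    reject_payment: 'cancelled'
  },
  payment_captured: {},
  cancelled: {}
};

function nextApprovalState(state: ApprovalState, event: ApprovalEvent): ApprovalState | null {
  return APPROVAL_TRANSITIONS[state][event] ?? null;
}
```

Because the waiting state is an ordinary row in the workflows table, nothing has to stay running while the human decides; the agent simply sees awaiting_manager_approval each time it reads the workflow.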