MCP Server State Machines — persisting multi-step agentic workflow state
Agents don't just call single tools in isolation — they execute workflows: validate an order, reserve inventory, charge the card, send a confirmation, then trigger fulfillment. If the card charge succeeds but the fulfillment trigger fails, the server must know where in the workflow it got stuck and resume correctly on retry rather than re-charging the card. State machines make that resumption safe by encoding exactly what states exist, what transitions are valid, and what data is carried between steps. This guide covers implementing persistent state machines in MCP servers: the table schema, transition enforcement, optimistic locking for concurrent agents, and recovery patterns for partial failures.
TL;DR
Store workflow state as a { state, context } row in Postgres. Enforce valid transitions server-side with an allowlist. Serialize concurrent writers with SELECT ... FOR UPDATE, and keep a version column so an UPDATE can also check WHERE version = $expected (optimistic locking). Give the agent a read tool (get_order_workflow in the example below) and a set of action tools that advance the state. Each action tool writes the state change and its audit event in one database transaction; external side effects such as a Stripe capture cannot join that transaction, so they need idempotency keys. AliveMCP monitors a /health/workflows endpoint that alerts when workflows sit in a non-terminal state for too long.
Why in-memory state is not enough
If you store workflow state in a server-side Map or module variable, you lose it when the process restarts, when the pod is replaced, or when the agent's session times out and the user comes back later. Persistent state means:
- Workflows survive server restarts — the agent can resume from the last committed state
- Concurrent agents (two sessions for the same order) see consistent state
- You have an audit trail of every state transition for debugging and compliance
- Long-running workflows that span hours or days are safe
| Storage approach | Survives restart | Concurrent safe | Audit trail |
|---|---|---|---|
| In-memory Map | No | No (race conditions) | No |
| Redis | Yes (if persistent) | Partial (Lua scripts) | No (TTL eviction) |
| Postgres table | Yes | Yes (row locking) | Yes (event log) |
Database schema
-- workflows table: one row per workflow instance
CREATE TABLE workflows (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
workflow_type TEXT NOT NULL, -- 'order_fulfillment', 'invoice_approval', etc.
state TEXT NOT NULL, -- current state name
context JSONB NOT NULL DEFAULT '{}', -- data carried between steps
version INTEGER NOT NULL DEFAULT 1, -- for optimistic locking
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- workflow_events table: append-only audit log
CREATE TABLE workflow_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
workflow_id UUID NOT NULL REFERENCES workflows(id),
from_state TEXT NOT NULL,
to_state TEXT NOT NULL,
event TEXT NOT NULL,
payload JSONB,
actor TEXT, -- agent session, user, or system
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON workflow_events (workflow_id, occurred_at DESC);
CREATE INDEX ON workflows (state, updated_at);
State machine definition
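Define the state machine as a typed constant that lists the valid states and the valid transitions between them. Guard conditions (checks on the context that must pass before a transition is allowed) are not part of this constant; they belong in the code that applies a transition: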
// order-fulfillment-machine.ts
type OrderState =
| 'created'
| 'inventory_reserved'
| 'payment_authorized'
| 'payment_captured'
| 'fulfillment_triggered'
| 'shipped'
| 'delivered'
| 'cancelled'
| 'refunded';
type OrderEvent =
| 'reserve_inventory'
| 'authorize_payment'
| 'capture_payment'
| 'trigger_fulfillment'
| 'mark_shipped'
| 'mark_delivered'
| 'cancel'
| 'refund';
const TRANSITIONS: Record<OrderState, Partial<Record<OrderEvent, OrderState>>> = {
created: {
reserve_inventory: 'inventory_reserved',
cancel: 'cancelled'
},
inventory_reserved: {
authorize_payment: 'payment_authorized',
cancel: 'cancelled'
},
payment_authorized: {
capture_payment: 'payment_captured',
cancel: 'cancelled'
},
payment_captured: {
trigger_fulfillment: 'fulfillment_triggered'
},
fulfillment_triggered: {
mark_shipped: 'shipped'
},
shipped: {
mark_delivered: 'delivered',
refund: 'refunded'
},
delivered: {
refund: 'refunded'
},
cancelled: {},
refunded: {}
};
export function getNextState(
currentState: OrderState,
event: OrderEvent
): OrderState | null {
return TRANSITIONS[currentState]?.[event] ?? null;
}
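The TRANSITIONS map encodes states and events but no guard conditions. One way to layer guards on top is sketched below. It uses a trimmed copy of the map so the example is self-contained, and the names `OrderContext`, `GUARDS`, and `canTransition`, along with the context fields, are hypothetical rather than part of the guide's code.

```typescript
// Hypothetical sketch: guard conditions layered on top of a transition map.
type State = 'created' | 'inventory_reserved' | 'payment_authorized' | 'cancelled';
type Event = 'reserve_inventory' | 'authorize_payment' | 'cancel';

// Assumed context shape; the real context column is free-form JSONB.
interface OrderContext {
  reservationId?: string;
  amountCents?: number;
}

const TRANSITIONS: Record<State, Partial<Record<Event, State>>> = {
  created: { reserve_inventory: 'inventory_reserved', cancel: 'cancelled' },
  inventory_reserved: { authorize_payment: 'payment_authorized', cancel: 'cancelled' },
  payment_authorized: { cancel: 'cancelled' },
  cancelled: {},
};

// A guard returns null when the transition may proceed, or a reason string.
type Guard = (ctx: OrderContext) => string | null;

const GUARDS: Partial<Record<Event, Guard>> = {
  authorize_payment: (ctx) => {
    if (!ctx.reservationId) return 'inventory reservation id missing from context';
    if (!ctx.amountCents || ctx.amountCents <= 0) return 'order amount must be positive';
    return null;
  },
};

type Decision = { ok: true; next: State } | { ok: false; reason: string };

function canTransition(state: State, event: Event, ctx: OrderContext): Decision {
  // First the allowlist, then the guard for this event (if any).
  const next = TRANSITIONS[state][event];
  if (!next) return { ok: false, reason: `${event} is not allowed from ${state}` };
  const failure = GUARDS[event]?.(ctx) ?? null;
  return failure ? { ok: false, reason: failure } : { ok: true, next };
}
```

Returning a reason string instead of a bare boolean gives the agent something concrete to act on when a transition is refused.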
Transition enforcement with optimistic locking
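Two agents racing on the same workflow is the dangerous case. Agent A reads state inventory_reserved, and agent B reads state inventory_reserved. Both try to advance via authorize_payment. Without locking, both would succeed and you'd have two payment authorizations. The service below prevents this with a row lock: SELECT ... FOR UPDATE makes the second transaction wait until the first commits, at which point it reads payment_authorized and rejects authorize_payment as an invalid transition. Strictly speaking that is pessimistic locking. The version column is bumped on every transition so that a caller holding an older read can detect that its view is stale, which is the basis of optimistic locking: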
// workflow-service.ts
import { db } from './db.js';
import { getNextState } from './order-fulfillment-machine.js';
export class WorkflowService {
async transition(
workflowId: string,
event: string,
newContextData: Record<string, unknown>,
actor: string
): Promise<{ success: true; newState: string } | { success: false; reason: string }> {
const client = await db.connect();
try {
await client.query('BEGIN');
// Read current state with row lock
const row = await client.query(
`SELECT state, context, version FROM workflows WHERE id = $1 FOR UPDATE`,
[workflowId]
);
if (!row.rows.length) {
await client.query('ROLLBACK');
return { success: false, reason: 'Workflow not found' };
}
const { state: currentState, context, version } = row.rows[0];
const nextState = getNextState(currentState, event as any);
if (!nextState) {
await client.query('ROLLBACK');
return {
success: false,
reason: `Invalid transition: ${event} is not allowed from state ${currentState}`
};
}
const newContext = { ...context, ...newContextData };
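// The FOR UPDATE lock above already serializes writers. The version predicate is a
// second check: zero updated rows means the row changed since it was read.
const updated = await client.query(
`UPDATE workflows
SET state = $1, context = $2, version = version + 1, updated_at = NOW()
WHERE id = $3 AND version = $4`,
[nextState, newContext, workflowId, version]
);
if (updated.rowCount !== 1) {
await client.query('ROLLBACK');
return { success: false, reason: 'Workflow was modified concurrently; re-read and retry' };
}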
// Append to audit log
await client.query(
`INSERT INTO workflow_events
(workflow_id, from_state, to_state, event, payload, actor)
VALUES ($1, $2, $3, $4, $5, $6)`,
[workflowId, currentState, nextState, event, newContextData, actor]
);
await client.query('COMMIT');
return { success: true, newState: nextState };
} catch (err) {
await client.query('ROLLBACK');
throw err;
} finally {
client.release();
}
}
}
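The transition method above relies on the row lock taken by FOR UPDATE. A purely optimistic variant would skip the lock and make the UPDATE conditional on the version the caller read earlier. The sketch below illustrates only that check, using an in-memory Map as a stand-in for the workflows table; `compareAndSetState` and the table contents are hypothetical.

```typescript
// In-memory stand-in for the workflows table, used only to illustrate the check.
interface WorkflowRow {
  state: string;
  version: number;
}

const table = new Map<string, WorkflowRow>();

// Mirrors: UPDATE workflows SET state = $1, version = version + 1
//          WHERE id = $2 AND version = $3
// Returns the number of rows updated; 0 means another writer got there first.
function compareAndSetState(id: string, expectedVersion: number, nextState: string): number {
  const row = table.get(id);
  if (!row || row.version !== expectedVersion) return 0;
  table.set(id, { state: nextState, version: row.version + 1 });
  return 1;
}

// Two agents read the same snapshot, then both try to write.
table.set('wf-1', { state: 'inventory_reserved', version: 1 });
const seenByA = table.get('wf-1')!.version;
const seenByB = table.get('wf-1')!.version;

const rowsA = compareAndSetState('wf-1', seenByA, 'payment_authorized'); // first writer wins
const rowsB = compareAndSetState('wf-1', seenByB, 'payment_authorized'); // stale version, no update
```

When the row count is zero, the losing caller should re-read the workflow and re-evaluate the transition against the new state rather than retrying blindly.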
MCP tools for workflow interaction
Expose the workflow as a set of focused tools — one to read state, one per allowed action:
const workflowService = new WorkflowService();
// Read current workflow state
server.tool('get_order_workflow', {
workflow_id: z.string().uuid()
}, async ({ workflow_id }) => {
const row = await db.query(
`SELECT w.state, w.context, w.updated_at,
json_agg(e ORDER BY e.occurred_at DESC) FILTER (WHERE e.id IS NOT NULL) AS recent_events
FROM workflows w
LEFT JOIN workflow_events e ON e.workflow_id = w.id
WHERE w.id = $1
GROUP BY w.id`,
[workflow_id]
);
if (!row.rows.length) {
return { content: [{ type: 'text', text: JSON.stringify({ error: 'Not found' }) }] };
}
const { state, context, updated_at, recent_events } = row.rows[0];
return {
content: [{
type: 'text',
text: JSON.stringify({
workflow_id,
current_state: state,
context,
last_updated: updated_at,
recent_events: (recent_events ?? []).slice(0, 5)
})
}]
};
});
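// Advance to next state with side-effect
server.tool('capture_order_payment', {
workflow_id: z.string().uuid(),
payment_intent_id: z.string() // Stripe captures a PaymentIntent, not a payment method
}, async ({ workflow_id, payment_intent_id }, extra) => {
// Check the current state first so a workflow in the wrong state never triggers a capture
const current = await db.query(`SELECT state FROM workflows WHERE id = $1`, [workflow_id]);
if (current.rows[0]?.state !== 'payment_authorized') {
return {
content: [{ type: 'text', text: JSON.stringify({ error: 'Workflow is not in state payment_authorized' }) }]
};
}
// Execute side effect; the idempotency key makes a retried capture return the original result
const charge = await stripe.paymentIntents.capture(
payment_intent_id,
{},
{ idempotencyKey: `${workflow_id}:capture_payment` }
);
if (charge.status !== 'succeeded') {
return {
content: [{ type: 'text', text: JSON.stringify({ error: 'Payment capture failed', charge_status: charge.status }) }]
};
}
// Transition state; the session id (when the transport provides one) is recorded as the actor
const result = await workflowService.transition(
workflow_id,
'capture_payment',
{ stripe_charge_id: charge.id, captured_at: new Date().toISOString() },
extra.sessionId ?? 'agent'
);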
return {
content: [{
type: 'text',
text: JSON.stringify(result.success
? { new_state: result.newState, stripe_charge_id: charge.id }
: { error: result.reason }
)
}]
};
});
Detecting stuck workflows
A workflow stuck in an intermediate state for longer than expected is a silent failure — the agent session ended, an error was swallowed, or the user navigated away mid-process. Monitor for stuck workflows:
app.get('/health/workflows', async (req, res) => {
// Workflows that should have progressed but haven't
const stuck = await db.query(`
SELECT workflow_type, state, count(*) as count
FROM workflows
WHERE state NOT IN ('delivered', 'cancelled', 'refunded') -- terminal states
AND updated_at < NOW() - INTERVAL '1 hour'
GROUP BY workflow_type, state
ORDER BY count DESC
`);
const totalStuck = stuck.rows.reduce((sum: number, r: any) => sum + parseInt(r.count), 0);
res.status(totalStuck === 0 ? 200 : 503).json({
status: totalStuck === 0 ? 'ok' : 'degraded',
stuck_workflows: stuck.rows,
total_stuck: totalStuck
});
});
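Register this with AliveMCP at a 5-minute interval. A workflow that sits for more than an hour in a non-terminal state usually points to a stalled agent, a side-effect error that wasn't surfaced, or a missing human-approval step. The single one-hour threshold is a simplification: states that legitimately last longer, such as shipped or a pending approval, need a longer threshold or they will hold the endpoint at 503. With the check registered, AliveMCP pages you before customer support tickets start arriving.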
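A single one-hour cutoff treats every non-terminal state alike, although a state like shipped can legitimately last for days. One way to handle that is a per-state age limit, sketched here as a pure function. The limits in `MAX_AGE_MINUTES` are illustrative values, not recommendations, and `isStuck` is a hypothetical helper.

```typescript
// Hypothetical per-state limits, in minutes. Values are illustrative only.
const MAX_AGE_MINUTES: Record<string, number> = {
  created: 60,
  inventory_reserved: 30,
  payment_authorized: 60,
  payment_captured: 15,
  fulfillment_triggered: 24 * 60,
  shipped: 14 * 24 * 60,
};

const TERMINAL_STATES = new Set(['delivered', 'cancelled', 'refunded']);

interface WorkflowSnapshot {
  state: string;
  updatedAt: Date;
}

// A workflow is stuck when it is non-terminal and older than its state's limit.
// States without an explicit limit fall back to defaultMinutes.
function isStuck(w: WorkflowSnapshot, now: Date, defaultMinutes = 60): boolean {
  if (TERMINAL_STATES.has(w.state)) return false;
  const limit = MAX_AGE_MINUTES[w.state] ?? defaultMinutes;
  const ageMinutes = (now.getTime() - w.updatedAt.getTime()) / 60_000;
  return ageMinutes > limit;
}
```

The same limits could be expressed in SQL with a CASE expression or a small lookup table joined against workflows, which keeps the check inside the health query.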
Frequently asked questions
How does a state machine handle partial failures — the side effect succeeds but the DB write fails?
Run the side effect and the DB state transition in the same database transaction where possible. For side effects outside the database (Stripe charges, email sends), implement idempotency: check whether the side effect already completed before re-executing. For Stripe, use an idempotency key ({ idempotencyKey: workflowId + ':capture_payment' }) — a second call with the same key returns the original charge without re-charging. For email, store a sent_at timestamp in the workflow context and skip the send if it's already set. This "at-least-once with idempotency" pattern is safer than trying to achieve exactly-once semantics across two external systems.
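The check-before-re-executing step can be kept separate from the side effect itself. The sketch below assumes the result of each completed step is stored in the workflow context under a `<step>_result` key; that key convention and the helpers `planStep` and `recordStep` are hypothetical.

```typescript
// Hypothetical helpers for "at-least-once with idempotency".
type Context = Record<string, unknown>;

type Plan =
  | { action: 'skip'; previousResult: unknown }
  | { action: 'run'; idempotencyKey: string };

// Decide whether a step still needs to run, and with which idempotency key.
function planStep(workflowId: string, step: string, context: Context): Plan {
  const marker = `${step}_result`;
  if (marker in context) {
    // An earlier attempt already recorded a result: do not repeat the side effect.
    return { action: 'skip', previousResult: context[marker] };
  }
  // The key is stable across retries, so the external system can deduplicate.
  return { action: 'run', idempotencyKey: `${workflowId}:${step}` };
}

// Return a new context with the step's result recorded.
function recordStep(step: string, context: Context, result: unknown): Context {
  return { ...context, [`${step}_result`]: result };
}
```

A tool handler would call `planStep`, run the external call only when the plan says `run`, and pass the output of `recordStep` as the new context data for the transition.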
When should I use XState vs a custom state machine implementation?
XState provides a rich state machine specification format (hierarchical states, parallel states, guards, actions) and a visual inspector that's valuable for complex workflows. For simple flows (5–10 states with linear transitions), the overhead of XState's machinery is usually not worth it; a plain TypeScript constant like the TRANSITIONS map above is easier to read and debug. Use XState when you have conditional branching (the same event leads to different states based on context), parallel subflows (fulfillment and notification proceeding simultaneously), or deeply nested state hierarchies. XState works with the Postgres persistence layer above: in XState v5, persist actor.getPersistedSnapshot() to the context column and restore it with createActor(machine, { snapshot }).
How do I expose workflow state to the LLM without overwhelming context?
Return the state name and a brief summary of the context, not the raw context blob. The raw JSONB context may contain internal IDs, timestamps, and implementation details the LLM doesn't need. Shape the response explicitly: { current_state: "payment_captured", next_allowed_actions: ["trigger_fulfillment"], summary: "Payment of $149.99 captured via Stripe. Ready to trigger fulfillment." }. The next_allowed_actions list is particularly valuable — it tells the LLM exactly which tools to call next, reducing the chance of the agent trying an invalid transition and getting an error it needs to reason about. See tool discovery for how to design tool schemas that align with state machine transitions.
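The shaped response can be derived from the transition map, so next_allowed_actions never drifts from what the server will accept. The following sketch uses a trimmed copy of the map and an assumed `amount_cents` context field; `toAgentView` and the summary wording are hypothetical.

```typescript
// Trimmed copy of the transition map so the example is self-contained.
const TRANSITIONS: Record<string, Record<string, string>> = {
  payment_captured: { trigger_fulfillment: 'fulfillment_triggered' },
  shipped: { mark_delivered: 'delivered', refund: 'refunded' },
  refunded: {},
};

interface AgentView {
  current_state: string;
  next_allowed_actions: string[];
  summary: string;
}

// Build the compact view returned to the LLM instead of the raw context blob.
function toAgentView(state: string, context: { amount_cents?: number }): AgentView {
  const next = Object.keys(TRANSITIONS[state] ?? {});
  const amount =
    context.amount_cents !== undefined
      ? `$${(context.amount_cents / 100).toFixed(2)}`
      : 'an unknown amount';
  const summary =
    next.length === 0
      ? `Order is ${state}. No further actions are available.`
      : `Order is ${state} (${amount}). Next: ${next.join(', ')}.`;
  return { current_state: state, next_allowed_actions: next, summary };
}
```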
How do I handle workflows that require human approval mid-way?
Model the approval as a state: payment_authorized → awaiting_manager_approval → payment_captured. The awaiting_manager_approval state persists until an approver calls the approve_payment tool (or the approval endpoint in your UI). The agent polls get_order_workflow and sees the state is still awaiting_manager_approval. No job queue is needed — the workflow row sits in that state until the human acts. Combine with the human-in-the-loop approval pattern for the notification and UI layer. The stuck-workflow health check catches approvals that expire without being acted on.
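As a rough sketch, the approval gate is just more entries in the transition map plus a rule about who may fire the approval events. The event names `request_approval` and `reject_payment`, the `ActorKind` type, and the human-only rule are assumptions made for illustration.

```typescript
// Hypothetical extension of the transition map with a human approval gate.
type State = 'payment_authorized' | 'awaiting_manager_approval' | 'payment_captured' | 'cancelled';
type Event = 'request_approval' | 'approve_payment' | 'reject_payment';
type ActorKind = 'agent' | 'human';

const APPROVAL_TRANSITIONS: Record<State, Partial<Record<Event, State>>> = {
  payment_authorized: { request_approval: 'awaiting_manager_approval' },
  awaiting_manager_approval: { approve_payment: 'payment_captured', reject_payment: 'cancelled' },
  payment_captured: {},
  cancelled: {},
};

// Events that only a human approver may fire (an assumption of this sketch).
const HUMAN_ONLY: ReadonlySet<Event> = new Set<Event>(['approve_payment', 'reject_payment']);

function nextApprovalState(state: State, event: Event, actor: ActorKind): State | null {
  // Refuse approval events from agents before consulting the allowlist.
  if (HUMAN_ONLY.has(event) && actor !== 'human') return null;
  return APPROVAL_TRANSITIONS[state][event] ?? null;
}
```

Because the agent cannot fire approve_payment itself, polling get_order_workflow is the only way it learns that the state has moved on.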
Further reading
- Long-Running Tasks in MCP Servers — async dispatch and job queue patterns
- Human-in-the-Loop for MCP Servers — approval gates as workflow states
- MCP Server Session Lifecycle — how server state relates to session boundaries
- MCP Server Shared State — concurrent access patterns for agentic servers
- MCP Server Audit Logging — recording every workflow transition