Long-Running Agents: State and Resumption

Your agent runs in a REPL. A user types a question, the agent loop makes five tool calls, returns an answer, and exits. It works beautifully. You deploy it behind a web service. The user sends a request, the agent starts working, and then:

The request takes three minutes and the HTTP client times out at sixty seconds.
The Kubernetes pod gets rescheduled mid-agent-run and the whole conversation state is lost.
A tool fails on the fourth call and your code has no way to resume from the third — it starts the whole agent over and bills the user twice for something already done.
The user closes their browser tab and comes back in an hour. Your agent has no idea what it was doing.

This is the gap between "agent demo" and "agent product." The loop from B4.2 handled the intelligence part. This post is about the durability part — state, resumption, idempotency — which is not intelligent at all, but without it no agent survives first contact with real users on real infrastructure. It's also the part where most teams prematurely reach for a framework ("we need LangGraph for this") when what they actually need is a database.

Let me show you what "durable" actually means for an agent, and the three moves that get you there without a new framework.

What "long-running" actually means

Agents run into three kinds of time pressure:

Slow tasks. A single user request that takes minutes: "research ten competitors and write a summary." The agent needs to keep working across an HTTP timeout boundary.
Conversational memory. A user has a chat with the agent now, closes the tab, comes back in two hours, expects the conversation to continue. The agent needs to rehydrate the state.
Crashes and restarts. Your server restarts — rolling deploy, pod eviction, crash. An agent that was mid-flight needs to resume from where it was, not start over.

All three reduce to the same problem: the agent's state cannot live in process memory. It has to live somewhere durable, with a shape that supports restart and resume.

"Durable somewhere" sounds obvious until you ask "what is the state, exactly?" Agent state turns out to be three things:

The conversation history. The list of messages — user, assistant, tool results — that make up the context for the next messages.create call.
The agent's current position in the task. Has it finished planning? Is it executing step 3 of 5? Has it called a tool that's still pending?
The idempotency keys for any side effects it has already performed, so a resumed agent doesn't do them twice.

Those three things, stored durably, get you everything you need. Let me walk through each.

Move 1: store the message history in a database

The first and most important move: every append to the agent's message list is a database write. After every model response, after every tool call, after every tool result — append to persistent storage. The message list is no longer a Python list you hold in memory; it's a table you read and write to.

The schema doesn't have to be fancy. Postgres is fine:

CREATE TABLE agent_runs (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  status TEXT NOT NULL, -- 'running', 'completed', 'failed', 'paused'
  task TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE agent_messages (
  id BIGSERIAL PRIMARY KEY,
  run_id UUID REFERENCES agent_runs(id),
  seq INTEGER NOT NULL,
  role TEXT NOT NULL, -- 'user', 'assistant', 'tool_result'
  content JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE (run_id, seq)
);

One row per agent run. One row per message within a run. The seq column gives you deterministic ordering. When your loop reloads a run, it reads all messages for that run_id in seq order and you have the full history ready for the next model call.

The reload-and-resume function is trivial:

def load_run(run_id: str) -> list[dict]:
    rows = db.query(
        "SELECT role, content FROM agent_messages "
        "WHERE run_id = %s ORDER BY seq",
        (run_id,),
    )
    return [{"role": r["role"], "content": r["content"]} for r in rows]

def append_message(run_id: str, role: str, content: dict) -> None:
    db.execute(
        "INSERT INTO agent_messages (run_id, seq, role, content) "
        "VALUES (%s, (SELECT COALESCE(MAX(seq), 0) + 1 FROM agent_messages WHERE run_id = %s), %s, %s)",
        (run_id, run_id, role, json.dumps(content)),
    )

Every agent loop iteration becomes:

def run_agent_step(run_id: str) -> str | None:
    messages = load_run(run_id)
    resp = client.messages.create(model=..., messages=messages, tools=TOOLS)

    if resp.stop_reason == "end_turn":
        append_message(run_id, "assistant", resp.content_as_dict)
        mark_run_completed(run_id)
        return extract_text(resp)

    # Tool use path
    append_message(run_id, "assistant", resp.content_as_dict)
    for tool_block in resp.tool_use_blocks:
        result = dispatch_tool(tool_block, run_id)
        append_message(run_id, "tool_result", {
            "tool_use_id": tool_block.id,
            "content": json.dumps(result),
        })
    return None  # not done, loop again

Each call to run_agent_step advances the run by one iteration. The state is fully durable. You can crash the process, restart it, call run_agent_step(run_id) again, and it picks up exactly where it left off.

Move 2: tool idempotency

Your agent just called send_email. The result came back successful. The process crashed before the result was written to the database. On restart, the agent reloads the messages, sees no tool_result for the email, and tries to send it again. Your user just got two emails.

This is the classic at-least-once delivery problem. The fix is at-least-once delivery combined with idempotency at the tool level. Every tool call has an idempotency key. The tool's implementation uses that key to decide: have I already done this? If yes, return the previous result. If no, do the work, record the result with the key, return it.

def send_email_idempotent(idempotency_key: str, to: str, body: str) -> dict:
    # Check if we already sent this
    prior = db.query(
        "SELECT result FROM tool_calls WHERE idempotency_key = %s",
        (idempotency_key,),
    )
    if prior:
        return json.loads(prior[0]["result"])

    # Actually send
    result = email_service.send(to=to, body=body)

    # Record before returning
    db.execute(
        "INSERT INTO tool_calls (idempotency_key, tool_name, result) "
        "VALUES (%s, 'send_email', %s)",
        (idempotency_key, json.dumps(result)),
    )
    return result

The idempotency key can be whatever uniquely identifies this specific tool invocation within this agent run. A good choice: f"{run_id}-{tool_use_id}" — the agent run ID plus the tool_use block ID from the model's response. This is deterministic and unique: the same logical tool call has the same key every time.

Now your agent can crash, restart, and re-run the tool call — and the tool will recognise the idempotency key, return the previously stored result, and no side effect runs twice.

Three practical notes on this:

Idempotency is a property of the tool, not the agent. Every tool that has side effects needs its own idempotency logic. Read-only tools (web search, database query) don't need it.
The idempotency record has to land before the side effect. If you send the email and then crash before writing the record, you've already sent twice on retry. The right pattern is: write a "pending" row with the key, perform the side effect, update the row with the result. On restart, a "pending" row means "I was about to do this; verify with the downstream system whether it happened."
Some tools can't be made idempotent, only safe. You can't un-send a Slack message. For these, the safest pattern is to mark the tool call as "requires human confirmation" (per B2.4) so the agent can't call it without a human in the loop.

Move 3: run status and resumption logic

The third move is the orchestration: how do you know which runs are paused, crashed, or in-flight, and how do you resume them?

The agent_runs table has a status column with values like running, paused, completed, failed. Every time the agent makes progress, it bumps updated_at. A background job scans for runs that are running but haven't been updated in N seconds (stale runs) and restarts them from their current state.

def resume_stale_runs():
    stale = db.query(
        "SELECT id FROM agent_runs "
        "WHERE status = 'running' "
        "AND updated_at < NOW() - INTERVAL '2 minutes'"
    )
    for row in stale:
        enqueue_work(row["id"])  # kick the agent loop for this run_id

def worker():
    while True:
        run_id = dequeue_work()
        try:
            while True:
                result = run_agent_step(run_id)
                if result is not None:
                    break  # completed
                db.execute(
                    "UPDATE agent_runs SET updated_at = NOW() WHERE id = %s",
                    (run_id,),
                )
        except Exception as e:
            mark_run_failed(run_id, str(e))

Three components:

A worker that pulls run_ids off a queue and executes one step at a time.
A watchdog that promotes stale runs back to the queue.
The run itself is stateless — the worker can pick up any run and continue it from persistent state.

You can pick your queue and your worker infra — Celery, RQ, a database-backed queue like graphile-worker, an SQS queue, a simple SELECT FOR UPDATE SKIP LOCKED loop on the agent_runs table. The queue choice is orthogonal to the agent logic. Start with the simplest one you already run.

The database-first shape

Here's the full picture once all three moves are in place:

Every step is a database transaction. Every side effect has an idempotency key. Every run is resumable from any point. The worker is stateless. The user can close their tab and come back. The process can crash and restart. The system survives.

This is what every "agent framework" is reinventing, with varying degrees of opinion about the schema and the queue. If you use a framework, you get this out of the box (and often some extra complexity). If you build it yourself, you own every piece — which is usually a good trade because the shape is simple and the framework-specific assumptions are a bigger drag than the code you'd write.

Why "database first, framework later"

Here is the argument for building the durable loop yourself before adopting LangGraph, Temporal, or any other orchestration framework:

The database is the source of truth. Message history lives in Postgres. It's there whether or not your framework is running. You can query it with SQL, back it up with your existing backup story, and inspect it with tools you already use. A framework that stores state in its own format removes this.
Debugging is SELECT * FROM agent_messages WHERE run_id = ?. Nothing fancier. Every engineer on your team can read the trace, because it's just rows in a table.
Migrations are SQL. Need to add a field? Alter the table. Need to reprocess old runs? Write an UPDATE. The operational story matches every other database-backed system you run.
The loop is 100 lines. You're not "re-implementing Temporal." You're writing a dispatcher that reads rows, calls a model, writes rows. If your requirements grow past what this simple loop can do, then reach for a framework — with evidence of the specific limit you hit.
Your ops team already knows Postgres. They don't know LangGraph's internals. When something breaks at 3am, the oncall engineer can debug SQL.

Frameworks are appropriate when:

You need cross-agent coordination with complex graph topologies (not just "one agent in a loop").
You need time-travel debugging or deterministic replay at a level a simple database can't provide.
You're running at a scale where a purpose-built orchestrator's features save you real ops work.
You have a team already fluent in the framework and the opportunity cost of learning something new is high.

For most single-agent, conversation-style products, database-first is the right starting point. Build the 100 lines. Ship it. Reach for a framework when (and only when) you hit the limit.

Admit what breaks

Database writes have latency. Every message-append is a round trip to Postgres. At 200ms round trips, a 20-iteration agent pays 4 seconds just in DB writes. Cache the last few messages in memory and flush to DB on step boundaries if the latency hurts.
JSONB columns can balloon. Conversation histories with large tool outputs can grow fast. Consider TOAST-aware storage, summarisation of old turns, or archiving completed runs to cold storage.
Idempotency keys need to be deterministic. If you're using a random key per tool call, you've lost the point. Derive the key from (run_id, tool_use_id) or similar stable identifiers.
At-least-once requires careful side-effect ordering. If the side effect is "write to an external system," the external system must also be idempotent, or you need a two-phase pattern (pending row → act → confirm row). Payment systems are the canonical hard case.
Stale-run detection has false positives. A worker that's making progress but is slow (an 8-minute LLM call for long reasoning) will look stale to the watchdog. Tune the updated_at heartbeat to be shorter than the stale-detection threshold, or have the worker post heartbeats inside long operations.
Retries amplify bugs. A bug in a tool that causes it to fail will trigger retries, which will fail, which will retry. Cap retries per run, alert when the cap is hit, and stop instead of looping forever.
Messages table grows unboundedly. Archive or partition by created_at for runs older than N days. Keep recent runs hot; move completed runs to cold storage.

What just changed in your code

Stop holding agent state in memory. Every message-append is a DB write. Every load is a DB read. Your agent loop becomes a series of database transactions.
Add an idempotency key to every side-effectful tool call. Derive it from (run_id, tool_use_id).
Add a status column to your agent runs and a watchdog that resumes stale runs.
Make the worker stateless. Any worker can pick up any run. No sticky sessions, no in-memory state.
Build the database-first version before adopting a framework. You'll understand exactly what the framework is doing and whether it's worth it.
Test resumption by killing the process mid-run. If your agent doesn't recover cleanly, you have a bug. Chaos-test this before you ship.

And that closes Module B4 — Tools and Agents. You now have the tool-use primitive, the agent loop in forty lines, the honest comparison of reactive vs planning vs reflection, the honest line on multi-agent systems, and the durability patterns that make agents survive real infrastructure.

Next up: Module B5 — Shipping. The unglamorous module that makes the difference between "demo" and "product": cost and latency, caching, observability, guardrails, and rolling out model upgrades without breaking the things you already shipped.

⬅️ Previous	📍 You are here	Next ➡️
⬅️ Previous B4.4 · Multi-Agent Systems	B4.5 of B6.4	Next ➡️ B5.1 · Cost and Latency, the Two Dials Users Feel

📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series.

Long-Running Agents: State, Resumption, and the Database You Need First

What "long-running" actually means

Move 1: store the message history in a database

Move 2: tool idempotency

Move 3: run status and resumption logic

The database-first shape

Why "database first, framework later"

Admit what breaks

What just changed in your code

Course navigation

Comments

AI for Builders

Cost and Latency: the Two Dials Users Feel

More from this blog

A Reading List and Two Habits: Staying Current in Ten Minutes a Week

What to Decide Now, What to Defer, What to Ignore: The AI Action Matrix

The Next 18 Months of AI: A Calibrated Leader's Forecast

Calibrating Your AI Exposure: Upside and Downside in One Matrix

Five AI Capabilities That Matter for Your Business, and Five That Do Not

Command Palette

What "long-running" actually means

Move 1: store the message history in a database

Move 2: tool idempotency

Move 3: run status and resumption logic

The database-first shape

Why "database first, framework later"

Admit what breaks

What just changed in your code

Course navigation

Comments

AI for Builders

Cost and Latency: the Two Dials Users Feel

More from this blog