
Why we use Temporal for every long-running AI workflow

Long-running AI workflows fail in ways HTTP retries can't fix; Temporal's durable execution model is the only thing we've found that survives production.

Published Jun 22, 2026 · 11 min read

The failure mode nobody warns you about

An AI workflow is not a request/response. A document-processing agent might run for 40 minutes: OCR a 200-page PDF, chunk it, embed it, run six LLM passes for extraction, call out to a tool that hits a third-party API with a 30-second p99, wait for a human to approve a flagged field, then write to three downstream systems.

Now ask the obvious question: what happens when the process holding that state crashes at minute 38?

If you built this as a chain of async functions behind an HTTP endpoint, the answer is: everything is gone. The embeddings you paid for, the five LLM calls you already made, the partial extraction state — all of it lives in memory or in scattered rows you have to manually reconcile. You either re-run the whole thing (and pay for it twice) or you write a mountain of checkpointing code that turns out to be a worse, buggier version of a workflow engine.

We wrote that mountain of code twice before we stopped. The first version used a Postgres jobs table with a status column and a cron worker. The second added a state machine and a step_log table. Both leaked. Both had race conditions on retry. Both turned every new workflow into a two-week project because the durability plumbing was bespoke each time.

Temporal exists to delete that plumbing. We now use it for every AI workflow that runs longer than a single synchronous request, and the decision is no longer up for debate internally.

What durable execution buys you

The core idea: Temporal records every step your workflow takes as an event in a history log. When a worker process dies, another worker picks up the workflow and replays the history to reconstruct the exact in-memory state up to the last completed step. Your code looks like a normal function with normal variables and normal control flow. Under the hood, every await on an activity is a durable checkpoint.

Concretely, take a workflow like this:

@workflow.defn
class DocumentPipeline:
    @workflow.run
    async def run(self, doc_id: str) -> Result:
        text = await workflow.execute_activity(ocr, doc_id, ...)
        chunks = await workflow.execute_activity(chunk, text, ...)
        embeddings = await workflow.execute_activity(embed, chunks, ...)
        extracted = await workflow.execute_activity(extract, chunks, ...)
        return await workflow.execute_activity(persist, extracted, ...)

If the worker crashes after embed completes, replay executes the run method again from the top, but ocr, chunk, and embed return their recorded results instantly without re-executing. Execution resumes at extract. You paid for OCR and embeddings once. The 40-minute workflow survives an arbitrary number of process restarts, deploys, and machine failures.
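To make the replay behavior concrete, here is a toy, in-process sketch. It is not Temporal's implementation: `ToyContext`, `WorkerCrash`, and the stand-in activity bodies are hypothetical, and the "history" is just a Python list that we pretend is persisted somewhere durable.

```python
from typing import Any, Callable, List, Tuple


class WorkerCrash(Exception):
    """Simulates the worker process dying mid-workflow."""


class ToyContext:
    """Toy stand-in for durable execution: completed steps live in `history`."""

    def __init__(self, history: List[Tuple[str, Any]]) -> None:
        self.history = history  # imagine this list is persisted outside the process
        self._pos = 0           # index of the next step in this run

    def execute_activity(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self._pos < len(self.history):
            # Replay: return the recorded result, do not call fn again.
            recorded_name, result = self.history[self._pos]
            if recorded_name != name:
                raise RuntimeError(
                    f"non-determinism: history has {recorded_name!r}, code asked for {name!r}"
                )
        else:
            # First execution: do the work and record it.
            result = fn(*args)
            self.history.append((name, result))
        self._pos += 1
        return result


executed: List[str] = []  # names of activities that actually ran


def ocr(doc_id: str) -> str:
    executed.append("ocr")
    return f"text of {doc_id}"


def chunk(text: str) -> List[str]:
    executed.append("chunk")
    return text.split()


def embed(chunks: List[str]) -> List[int]:
    executed.append("embed")
    return [len(c) for c in chunks]


def extract(chunks: List[str]) -> dict:
    executed.append("extract")
    return {"n_chunks": len(chunks)}


def run(ctx: ToyContext, doc_id: str, crash_after_embed: bool = False) -> dict:
    text = ctx.execute_activity("ocr", ocr, doc_id)
    chunks = ctx.execute_activity("chunk", chunk, text)
    ctx.execute_activity("embed", embed, chunks)
    if crash_after_embed:
        raise WorkerCrash()
    return ctx.execute_activity("extract", extract, chunks)


history: List[Tuple[str, Any]] = []
try:
    run(ToyContext(history), "doc-1", crash_after_embed=True)
except WorkerCrash:
    pass
result = run(ToyContext(history), "doc-1")  # a "new worker" replays from the top
```

After the second call, `executed` lists each activity exactly once even though `run` started twice, which is the property the paragraph above describes.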

This is the property that matters for AI work specifically, because the steps are expensive. A re-run isn't a wasted 50ms — it's a wasted $4 in tokens and a wasted 90 seconds of wall-clock latency. Durability is a cost-control mechanism as much as a reliability one.

Modeling an AI pipeline as a workflow

The mental model that took us longest to internalize: workflow code is orchestration, activity code is work. The workflow function decides what happens and in what order. Activities do anything that touches the outside world — network calls, disk, GPU inference, randomness.

The split maps cleanly onto AI systems. Your workflow is the agent loop. Your activities are tool calls, model invocations, retrieval queries, and writes. Each activity gets its own retry policy, timeout, and resource pool. A GPU embedding activity runs on a worker pool with GPU nodes; a cheap metadata write runs on a general pool. Same workflow, different task queues.

This also gives you natural concurrency. Fan-out is just asyncio.gather over execute_activity calls:

results = await asyncio.gather(*[
    workflow.execute_activity(extract_field, c, ...) for c in chunks
])

Temporal tracks all of those branches durably. If three of twelve parallel extractions are done and the worker dies, replay restores the three completed results, and the nine outstanding activities either finish on another worker or are retried under their retry policy. Nothing that already completed runs again. Building that yourself with a thread pool and a checkpoint table is exactly the trap we fell into earlier.
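The same idea can be sketched for fan-out with nothing but asyncio. This is an illustration under assumptions, not Temporal code: the `recorded` dict stands in for the durable history, and `extract_field` here is a hypothetical placeholder that just uppercases its input.

```python
import asyncio
from typing import Dict, List


async def extract_field(chunk: str, executed: List[str]) -> str:
    """Placeholder for an expensive extraction call."""
    executed.append(chunk)   # track which chunks actually did work
    await asyncio.sleep(0)   # yield control, as a real network call would
    return chunk.upper()


async def fan_out(chunks: List[str], recorded: Dict[str, str], executed: List[str]) -> List[str]:
    async def one(chunk: str) -> str:
        # Branches with a recorded result are served from the record.
        if chunk not in recorded:
            recorded[chunk] = await extract_field(chunk, executed)
        return recorded[chunk]

    # gather returns results in the order the awaitables were passed in.
    return list(await asyncio.gather(*(one(c) for c in chunks)))


def resume_after_crash() -> tuple:
    chunks = [f"c{i}" for i in range(12)]
    recorded = {c: c.upper() for c in chunks[:3]}  # three branches finished before the crash
    executed: List[str] = []
    results = asyncio.run(fan_out(chunks, recorded, executed))
    return results, executed
```

Calling `resume_after_crash()` returns all twelve results while `executed` contains only the nine chunks that had no recorded result.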

The determinism tax

Here is the part that bites every team. Workflow code must be deterministic, because it gets replayed. If your workflow function issues a different sequence of steps on replay than the history recorded, Temporal detects the divergence and raises a non-determinism error.

That means inside workflow code you cannot call datetime.now(), random.random(), uuid4(), read environment variables, hit a database, or iterate a set whose order isn't stable. All of that is non-deterministic and must move into activities or use Temporal's deterministic SDK equivalents (workflow.now(), workflow.random(), workflow.uuid4()).
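A small sketch of why this matters, using a toy "workflow" whose steps depend on a random draw. The function names are hypothetical, and the seeded `random.Random` is only an analogy for a deterministic SDK equivalent; it makes no claim about how Temporal implements one.

```python
import random
from typing import Callable, List


def plan_commands(draw: Callable[[], float]) -> List[str]:
    """Toy workflow body: which steps it issues depends on one draw."""
    commands = ["ocr"]
    commands.append("rerank" if draw() < 0.5 else "skip_rerank")
    commands.append("persist")
    return commands


def diverges(original_draw: Callable[[], float], replay_draw: Callable[[], float]) -> bool:
    """True when replay would issue different steps than the original run."""
    return plan_commands(original_draw) != plan_commands(replay_draw)


# A source that can answer differently on replay (like random.random() or a clock)
# breaks replay. Fixed lambdas make the divergence reproducible here.
unstable = diverges(lambda: 0.2, lambda: 0.9)

# A source seeded from recorded state gives the same draw both times.
stable = diverges(random.Random(7).random, random.Random(7).random)
```

Here `unstable` is `True` and `stable` is `False`: the workflow body is identical in both cases, and only the source of the value decides whether replay lines up with history.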

For AI workflows this trips people in a specific way: do not call the LLM from workflow code. The temptation is real because the SDK is so frictionless. But an LLM call is the most non-deterministic thing in your entire system — same prompt, different output, every time. It has to be an activity. The workflow orchestrates the prompt and consumes the response; the activity makes the actual API call.

A subtler trap: agent loops that branch on model output. If your workflow does if response.action == "search", that branch is deterministic given the recorded response, because the response came from an activity and is replayed from history. That's fine. The non-determinism is sealed inside the activity boundary, which is exactly where you want it. The discipline is simple once internalized: anything that can return two different answers must be an activity.

Handling LLM calls as activities

LLM activities need retry policies that reflect how LLM APIs actually fail. Our default:

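from datetime import timedelta
from temporalio.common import RetryPolicy

RetryPolicy(
    initial_interval=timedelta(seconds=2),   # wait before the first retry
    backoff_coefficient=2.0,                 # double the wait each time
    maximum_interval=timedelta(seconds=60),  # cap on the wait between attempts
    maximum_attempts=6,                      # total attempts, including the first
    # Terminal errors: retrying cannot change the outcome.
    non_retryable_error_types=["InvalidRequestError", "ContextLengthExceeded"],
)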

The non-retryable list matters. A 400 from a malformed prompt or a context-length error will fail identically on every retry — retrying just burns time and quota. A 429 or 503, on the other hand, is exactly what the backoff exists for. Classify your provider's errors into retryable and terminal explicitly; do not let everything fall into the default retry bucket.
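One way to make that classification explicit is a small lookup keyed on HTTP status. This is a sketch under assumptions: only 400, 429, and 503 come from the text above; the other codes and the fallback rule are illustrative choices you would replace with your provider's documented behavior.

```python
# Status codes named in the text: 400 is terminal, 429 and 503 are retryable.
# The remaining entries are assumptions for illustration only.
TERMINAL_STATUS = {400, 401, 403, 404, 422}
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status: int) -> bool:
    """Decide whether a failed provider call is worth retrying."""
    if status in TERMINAL_STATUS:
        return False
    if status in RETRYABLE_STATUS:
        return True
    # Assumed fallback: unknown server errors retry, unknown client errors do not.
    return status >= 500
```

An activity could use a check like this to raise a non-retryable error for terminal statuses and let everything else flow into the backoff.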

Set start_to_close_timeout based on real latency distributions, not optimism. A single GPT-class call with a large context can take 60+ seconds. Streaming completions that generate thousands of tokens can run several minutes. We set timeouts at roughly 3x observed p99 and let the retry policy handle the tail. If you set the timeout too tight, Temporal cancels a call that would have succeeded, and you pay for a partial generation you discard.
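As a rough worked example of the 3x p99 rule combined with the retry policy above, the sketch below derives a timeout from latency samples and estimates the worst-case wall-clock time before an activity gives up. The helper names are hypothetical, the percentile uses the simple nearest-rank method, and the backoff formula assumes each wait is the initial interval times the coefficient raised to the retry index, capped at the maximum interval.

```python
from typing import List


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile of observed latencies (seconds)."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def start_to_close_seconds(samples: List[float], multiplier: float = 3.0) -> float:
    """Timeout at roughly multiplier x observed p99."""
    return multiplier * p99(samples)


def backoff_waits(initial: float = 2.0, coefficient: float = 2.0,
                  maximum: float = 60.0, attempts: int = 6) -> List[float]:
    """Waits between attempts; there is one fewer wait than attempts."""
    return [min(initial * coefficient ** i, maximum) for i in range(attempts - 1)]


def worst_case_seconds(timeout: float, waits: List[float]) -> float:
    """Every attempt runs to its timeout, plus every backoff wait."""
    return timeout * (len(waits) + 1) + sum(waits)
```

With the default policy the waits come out to 2, 4, 8, 16, and 32 seconds, so backoff alone adds 62 seconds. The timeout, multiplied by six attempts, dominates the worst case, which is worth knowing before you pick a multiplier.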

For idempotency: LLM activities are not naturally idempotent because retries produce different output. That's usually fine — you want an answer, not the same answer. But when an activity writes to a downstream system, attach an idempotency key derived from the workflow ID and activity sequence so a retried write doesn't duplicate a row. Temporal guarantees at-least-once execution of activities, not exactly-once. The exactly-once illusion is something you build with idempotency keys on the write side.
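A minimal sketch of the write-side idea, assuming the key is a hash of the workflow ID and an activity sequence number. `ToyStore` is a hypothetical in-memory stand-in for a downstream system that enforces a unique key; in a real activity the two inputs would come from whatever identifiers your SDK exposes for the running activity.

```python
import hashlib
from typing import Dict


def idempotency_key(workflow_id: str, activity_seq: int) -> str:
    """Stable key: the same workflow step always yields the same key."""
    return hashlib.sha256(f"{workflow_id}:{activity_seq}".encode("utf-8")).hexdigest()


class ToyStore:
    """In-memory stand-in for a table with a unique constraint on the key."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def write(self, key: str, row: dict) -> bool:
        """Insert once; a retried write with the same key is a no-op."""
        if key in self.rows:
            return False
        self.rows[key] = row
        return True


store = ToyStore()
key = idempotency_key("doc-pipeline-123", 5)
first = store.write(key, {"field": "total", "value": "41.20"})
retry = store.write(key, {"field": "total", "value": "41.20"})  # simulated retry
```

The retried write returns `False` and the store still holds one row, which is the exactly-once illusion built on top of at-least-once delivery.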

One more practical note: keep activity payloads small. The inputs and outputs of every activity are serialized into the workflow history. If you pass a 50MB OCR blob through an activity result, that blob sits in the history for as long as the history is retained and slows every replay. Store large artifacts in S3 and pass references. We cap activity payloads at 2MB by convention and use blob storage for anything bigger. Temporal enforces hard limits anyway (by default the server rejects a single payload above 2MB and a gRPC message above 4MB, and starts warning at around 256KB), so this isn't optional at scale.
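The pass-by-reference pattern can be sketched as follows. This is an assumption-laden illustration: `BlobRef`, `to_payload`, and `from_payload` are hypothetical names, and a plain dict stands in for S3 or any other blob store.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Union

MAX_INLINE_BYTES = 2 * 1024 * 1024  # the 2MB convention from the text


@dataclass(frozen=True)
class BlobRef:
    """Small, serializable pointer that travels through history instead of the blob."""
    key: str
    size: int


def to_payload(data: bytes, blob_store: Dict[str, bytes],
               max_inline: int = MAX_INLINE_BYTES) -> Union[bytes, BlobRef]:
    """Return data inline when small, otherwise store it and return a reference."""
    if len(data) <= max_inline:
        return data
    key = hashlib.sha256(data).hexdigest()  # content-addressed, so retries reuse the key
    blob_store[key] = data
    return BlobRef(key=key, size=len(data))


def from_payload(payload: Union[bytes, BlobRef], blob_store: Dict[str, bytes]) -> bytes:
    """Resolve a payload back to bytes inside the activity that needs them."""
    if isinstance(payload, BlobRef):
        return blob_store[payload.key]
    return payload
```

Keying the blob on its content hash is a design choice, not a requirement: it means a retried activity that produces the same bytes overwrites the same object instead of leaking a new one.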

Human-in-the-loop and signals

The feature that sold the rest of the team was how cleanly Temporal handles human approval steps. An AI extraction pipeline flags a low-confidence field and needs a human to confirm before it writes to the system of record. The human might respond in 30 seconds or three days.

In the bespoke version, this meant a paused job, a webhook, a reconciliation worker, and a timeout sweeper. In Temporal it's a signal and a timer:

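from datetime import timedelta
from temporalio import workflow

@workflow.defn
class DocumentPipeline:
    def __init__(self) -> None:
        self._approved = None  # set by the approve signal

    @workflow.signal
    def approve(self, decision: bool) -> None:
        self._approved = decision

    @workflow.run
    async def run(self, doc_id: str) -> bool:
        await workflow.wait_condition(  # raises asyncio.TimeoutError after three days
            lambda: self._approved is not None, timeout=timedelta(days=3)
        )
        return self._approved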

The workflow blocks — durably, for three days — consuming zero compute while it waits. There is no polling, no parked thread, no row that a cron job has to remember to check. When the human hits approve, your API sends a signal to the workflow by ID, the wait_condition unblocks, and execution continues. If three days pass, the timeout fires and you route to a fallback path.
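The control flow, including the fallback path, can be tried locally with plain asyncio. To be clear about the limits of this sketch: it is an in-process analogy with none of the durability, `ApprovalGate` and `demo` are hypothetical names, and the timeouts are fractions of a second rather than days.

```python
import asyncio
from typing import Optional


class ApprovalGate:
    """In-memory analogy for a signal plus a timed wait."""

    def __init__(self) -> None:
        self._approved: Optional[bool] = None
        self._event = asyncio.Event()

    def approve(self, decision: bool) -> None:
        """Plays the role of the signal handler."""
        self._approved = decision
        self._event.set()

    async def wait(self, timeout_s: float) -> str:
        try:
            await asyncio.wait_for(self._event.wait(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "fallback"  # nobody answered in time
        return "approved" if self._approved else "rejected"


async def demo(decision: Optional[bool], timeout_s: float) -> str:
    gate = ApprovalGate()  # created inside the running loop
    if decision is not None:
        # Simulate a human responding shortly after the wait starts.
        asyncio.get_running_loop().call_later(0.01, gate.approve, decision)
    return await gate.wait(timeout_s)
```

`asyncio.run(demo(True, 1.0))` takes the approved path, `demo(False, 1.0)` the rejected one, and `demo(None, 0.05)` times out into the fallback, which are the three outcomes a human-approval step has to handle.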

This pattern generalizes to anything event-driven in an agent: waiting for a webhook from an external tool, waiting for a long-running batch job in another system, waiting for a rate limit window to reset. A workflow that sleeps for a day costs nothing. That changes how you design pipelines — you stop avoiding long waits and start modeling them honestly.

Versioning workflows in production

Replay is a gift until you deploy. If you change workflow code and a worker tries to replay an in-flight execution that was recorded against the old code, the histories diverge and you get a non-determinism error. This is the operational footgun that catches every team in their first month.

Temporal gives you workflow.patched() to gate changes:

if workflow.patched("use-reranker-v2"):
    results = await workflow.execute_activity(rerank_v2, ...)
else:
    results = await workflow.execute_activity(rerank_v1, ...)

Old in-flight workflows replay down the v1 branch; new ones take v2. Once all old executions drain, you delete the v1 branch and swap the check for workflow.deprecate_patch(), then remove that call in a later deploy. It's tedious but mechanical.
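The branching rule behind a patch check can be modeled in a few lines. This is a simplified model, not the SDK's code: `ToyPatchContext` is hypothetical, and it assumes a patch check returns true for new executions (recording a marker) and, on replay, returns true only if the marker is already in the history.

```python
from typing import Set


class ToyPatchContext:
    """Simplified model of a patch check against recorded markers."""

    def __init__(self, markers: Set[str], replaying: bool) -> None:
        self.markers = markers      # patch markers recorded in this execution's history
        self.replaying = replaying  # True when re-running recorded history

    def patched(self, patch_id: str) -> bool:
        if self.replaying:
            # Replay must follow whatever the original run did.
            return patch_id in self.markers
        # New code path: take the new branch and record that we did.
        self.markers.add(patch_id)
        return True


def choose_reranker(ctx: ToyPatchContext) -> str:
    return "rerank_v2" if ctx.patched("use-reranker-v2") else "rerank_v1"


old_history: Set[str] = set()  # execution started before the patch shipped
new_history: Set[str] = set()  # execution started after

old_replay = choose_reranker(ToyPatchContext(old_history, replaying=True))
new_first_run = choose_reranker(ToyPatchContext(new_history, replaying=False))
new_replay = choose_reranker(ToyPatchContext(new_history, replaying=True))
```

The old execution stays on `rerank_v1`, while the new one picks `rerank_v2` on its first run and again on every later replay, because the marker it recorded is now part of its history.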

Our rule: any change to the structure of a running workflow — adding, removing, or reordering activities — needs a patch or a new workflow type. Changing the internals of an activity does not, because activities aren't replayed; only their recorded results are. This is a strong reason to keep workflow code thin and push logic into activities. The more your prompt engineering and model selection lives in activities, the more you can iterate on the AI behavior without touching workflow versioning at all. We ship new prompts daily and have not patched a workflow in weeks.

The costs and when not to use it

Temporal is not free, in either sense. You run a cluster (or pay for Temporal Cloud), the programming model has a learning curve measured in weeks, and the local development story with the test server adds setup friction. The history-replay model is genuinely different from how most engineers think about code, and the determinism rules feel arbitrary until they click.

Don't reach for it when the work is a single synchronous LLM call that returns in two seconds. A chat completion behind an API endpoint does not need durable execution; it needs a timeout and a retry. Putting it in Temporal adds latency and operational surface for no benefit.

The line we draw: if a workflow runs longer than 30 seconds, makes more than two expensive external calls, involves any human wait, or must survive a deploy mid-execution, it goes in Temporal. Everything below that line stays a plain request handler.
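Written out as a predicate, the rule is short enough to put in a design-review checklist. The field names below are hypothetical; the thresholds are the ones stated above.

```python
from dataclasses import dataclass


@dataclass
class WorkflowProfile:
    expected_runtime_s: float        # typical end-to-end runtime
    expensive_external_calls: int    # LLM calls, paid APIs, GPU jobs
    has_human_wait: bool             # any approval or manual step
    must_survive_deploy: bool        # cannot be lost when workers restart


def belongs_in_temporal(p: WorkflowProfile) -> bool:
    """Any single condition is enough to cross the line."""
    return (
        p.expected_runtime_s > 30
        or p.expensive_external_calls > 2
        or p.has_human_wait
        or p.must_survive_deploy
    )


chat_completion = WorkflowProfile(2, 1, False, False)
document_pipeline = WorkflowProfile(40 * 60, 6, True, True)
```

`belongs_in_temporal(chat_completion)` is `False` and `belongs_in_temporal(document_pipeline)` is `True`, matching the two examples used throughout this post.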

For the workflows above that line — which is most real agentic systems — the alternative isn't "no orchestration engine." It's the homegrown jobs table you'll build anyway, except worse. We've built that twice. Temporal is the version that doesn't page us at 3 a.m. because a worker died mid-extraction. That's the whole pitch.
