LangGraph in production: what the docs don't tell you

The Gap Between Demo and Production

LangGraph is the best general-purpose orchestration layer for stateful agents right now. That's not in dispute. What is in dispute is whether the tutorials prepare you to run it under load, with real users, real failures, and real money on the line. They don't.

The quickstart gives you a StateGraph, a couple of nodes, a conditional edge, and a MemorySaver. It runs locally, it streams tokens, it looks magical. Then you deploy it, traffic arrives, and you discover the failure modes that never show up in a single-user notebook session.

This post is the set of things we learned the hard way running LangGraph across several production agent systems — voice pipelines, multi-agent SaaS backends, and long-running research workflows. None of it is in the docs in any consolidated form. All of it cost us incidents.

State Is Not Free

The State object is the heart of every LangGraph app, and the docs treat it like a free-form dictionary you can stuff anything into. In production, your state is serialized and persisted on every superstep. That has direct cost implications.

The default pattern people copy is appending message history to state with add_messages. For a chat agent that runs 40 turns, each with tool calls and tool results, your state object balloons. We saw single-conversation states cross 400KB because nobody pruned tool outputs. Every checkpoint write then serializes and stores that entire blob again. With a Postgres checkpointer, that's 400KB written per node execution, not per conversation.

The fix is to be deliberate about what lives in state versus what lives in external storage. Large tool outputs — a scraped page, a 50KB JSON API response, a retrieved document set — should be written to object storage or a cache, with only a reference ID in state. Your reducer functions should actively trim. We cap message history in state at the last N turns plus a running summary, and we strip raw tool payloads after the model has consumed them.

Another trap: custom reducers run on every update. If your reducer does anything non-trivial — deduplication, sorting, merging — that cost multiplies by the number of nodes that touch that key. Keep reducers O(n) and boring. We had a reducer doing a set-based dedup over a growing list, turning a graph into accidental O(n²) behavior across the run.

Treat state as a hot, frequently-serialized data structure, because that's exactly what it is. Profile its size. Set a hard alert if any state object exceeds a threshold — ours is 100KB — because a state that grows unbounded is a memory and latency leak that compounds invisibly.

Checkpointers Are Your Bottleneck

MemorySaver is for demos. The moment you need persistence across process restarts or horizontal scaling, you move to PostgresSaver or a custom checkpointer. This is where most teams hit their first real scaling wall.

LangGraph writes a checkpoint after every superstep. A graph with ten nodes that each run once produces around ten checkpoint writes per invocation, plus writes for pending writes and channel versions. The official Postgres checkpointer does this with multiple tables and several round trips per checkpoint. Under concurrent load, that connection pool saturates fast.

We measured the default PostgresSaver topping out around 40-50 concurrent graph executions before connection contention dominated latency. The checkpoint writes, not the LLM calls, became the limiting factor for a workflow that should have been entirely LLM-bound.

Three things matter here. First, tune your connection pool explicitly and separate the checkpointer pool from your application pool. They contend for the same database otherwise, and a slow checkpoint write blocks an unrelated API query. Second, consider whether you need full checkpointing at all. If a workflow is short and idempotent, you can run it without a checkpointer and only persist the final result. Checkpointing exists for resumability and human-in-the-loop; if you have neither, you're paying for nothing.

Third, when you do need durability, evaluate a checkpointer backed by something built for high write throughput. We moved several high-volume workflows to a custom checkpointer over Redis with periodic Postgres snapshotting. Redis absorbed the per-superstep write churn, and we only persisted durable checkpoints at interrupt boundaries. That dropped our p99 step latency by roughly 60%.

The lesson: checkpoint frequency is a cost you control through graph design. Fewer, coarser nodes mean fewer checkpoints. A graph with 30 fine-grained nodes writes 30 times; the same logic in 6 nodes writes 6 times. Granularity is a performance decision, not just a readability one.

Interrupts and Human-in-the-Loop Reality

The interrupt mechanism is genuinely good, and it's the main reason to choose LangGraph over a simpler orchestrator. But the docs present the happy path: you interrupt, a human responds, you resume. Production introduces ugly questions.

What happens when the human never responds? Your checkpoint sits there indefinitely. You need a sweep job that finds interrupted threads older than some TTL and either expires them or escalates. LangGraph gives you no garbage collection for abandoned threads. We accumulated tens of thousands of stale interrupted checkpoints before anyone noticed the table growth.

What happens when you change the graph topology between the interrupt and the resume? If you deploy a new version of your graph while a thread is interrupted mid-execution, the resumed run may target a node that no longer exists or whose signature changed. There is no built-in version compatibility check. We now embed a graph version in thread metadata and refuse to resume threads against incompatible graph versions, routing them to a migration or expiry path instead.

And interrupt() raises and re-runs the node from the top on resume. Any side effects before the interrupt call execute twice. If you sent a Slack message, charged a card, or wrote to a database before the interrupt, that happens again on resume. Put all side effects after the interrupt resolves, or make them idempotent with a dedup key. This single fact has caused more double-charges and duplicate notifications than any other LangGraph behavior we've seen.

Streaming and the Partial Output Problem

Streaming in LangGraph is multi-layered: you can stream state updates, LLM tokens, or custom events. The docs show each mode in isolation. Real applications need several at once and have to demultiplex them on the client.

The practical problem is that token streams and state-update streams interleave, and the consumer has to reconstruct what's happening. A node streams tokens, then a tool call fires, then another node streams more tokens. Your frontend needs to know which output belongs to which node and when a logical message is complete. LangGraph emits the events but gives you no semantic framing — you build the protocol that tells the client "this token stream is the final answer" versus "this is intermediate reasoning."

We standardized on tagging every streamable node with explicit metadata and wrapping the stream in our own event envelope before it hits the client. Without that, the client logic becomes a fragile state machine that breaks every time you add a node.

The second issue: partial output on failure. If a node streams 200 tokens and then the LLM call errors on token 201, you've already sent 200 tokens downstream. The retry then re-streams from the beginning, and the client sees duplicate content. You need a sequence number per logical output and client-side dedup, or you need to buffer and only flush complete units. For voice systems this is brutal, because you may have already spoken the partial output. We buffer at sentence boundaries for any output that goes to TTS.

Observability You Have to Build Yourself

LangSmith is the official answer and it's good for tracing. But it is not a production observability stack. It does not give you the metrics, alerts, and dashboards your SRE team needs to run a service.

What you actually need: per-node latency histograms, per-node error rates, token consumption per node, checkpoint write latency, and graph completion rates broken down by the terminal node. None of this is automatic. We instrument every node with a wrapper that emits structured metrics to our existing pipeline — the same one our non-AI services use. This matters because your on-call engineer should not have to learn a separate AI-specific tool at 3 AM.

The metric that saved us most often is per-edge transition counts. Conditional edges encode your agent's decision logic, and watching the distribution of which branches get taken tells you when a prompt change quietly shifted behavior. We caught a regression where a routing node started sending 80% of traffic down a fallback path after a model upgrade, because the new model formatted its output slightly differently and our parser silently failed into the fallback. No errors were thrown. Only the edge distribution shifted.

Log the full state at error time, sanitized. When a graph fails three nodes deep, you need to see the state that produced the failure, not just the stack trace. The stack trace tells you a parser failed; the state tells you the model returned malformed JSON because the context was truncated.

Concurrency and the Superstep Model

LangGraph executes in supersteps: all nodes scheduled in a given step run, then the framework collects their updates and applies them via reducers before the next step. This is a BSP-style model, and it has consequences people don't anticipate.

When two parallel nodes write to the same state key, your reducer arbitrates the conflict. If you didn't write a reducer for that key, the default is last-write-wins, and which write "last" means is not deterministic across parallel branches. We had a fan-out pattern where three nodes each updated a results list, and intermittently one result vanished because the default assignment behavior overwrote instead of appended. The fix was an explicit appending reducer, but the bug was invisible until we looked at edge-level counts.

Parallel branches also don't share intermediate state within a superstep. A node cannot see what its sibling produced in the same step — only what was committed before the step began. If your design assumes node B can read node A's output and they're scheduled in parallel, you have a race that the model hides until it doesn't.

For anything truly concurrent, be explicit about the join. Use a dedicated aggregation node that runs after the parallel branches complete, and put all cross-branch logic there. Don't try to coordinate inside the parallel nodes themselves.

What We Do Differently Now

We still use LangGraph, and we'd choose it again. But our defaults diverge sharply from the docs.

We cap and prune state aggressively, with size alerts. We separate checkpointer database resources from application resources and prefer Redis-backed checkpointing for high-volume paths, persisting durably only at interrupt boundaries. We put all side effects after interrupt() calls and make them idempotent. We embed graph versions in thread metadata and refuse incompatible resumes. We run a sweep job for abandoned threads. We wrap every node in our own metrics layer and alert on edge-transition distributions, not just errors. We write explicit reducers for every key touched by parallel branches.

None of this is exotic. It's the same discipline you'd apply to any stateful distributed system, which is exactly what a production LangGraph deployment is. The trap is that the framework's ergonomics make it feel like a simple library, so teams skip the operational rigor they'd never skip for a database-backed service. Treat your graph like the distributed, stateful, side-effecting system it actually is, and it holds up well. Treat it like the quickstart, and it falls over the first time real traffic arrives.