Controlling LLM costs in production before they control you

The Cost Curve Nobody Models

LLM costs do not behave like compute costs you are used to. A traditional API scales roughly with infrastructure you provision. LLM costs scale linearly with usage and quadratically with bad design. Every retry, every bloated system prompt, every overstuffed RAG context multiplies against per-token pricing on every single call.

The failure pattern is predictable. You launch with GPT-4-class models because they just work and you want to ship. Usage is low, the bill is a rounding error, nobody cares. Six months later you have 40,000 daily active users, each session fires 8 to 15 model calls, and finance is asking why inference costs $90,000 a month against $200,000 in revenue. Your gross margin is now hostage to a token meter.

The core problem: cost scales with usage, not revenue. A free-tier user costs you exactly as much as a paying one, often more, because free users churn slower and probe harder. If you have not modeled cost-per-action against revenue-per-user, you are running a business whose unit economics you cannot see.

Model this early. Pick your three highest-frequency LLM operations. Measure their actual token consumption — input and output separately, because output tokens are typically 2-4x the price of input. Multiply by realistic call volume per user, then by user count. That number is your real cost floor, and it is almost always higher than the napkin math your PM did.

Where the Tokens Actually Go

Most teams assume the user's question is the expensive part. It rarely is. The expensive part is everything you bolt around it.

A typical production prompt looks like: a 600-token system prompt establishing persona and rules, 1,200 tokens of few-shot examples, 3,000 tokens of retrieved context from your vector store, 400 tokens of conversation history, and a 30-token user question. The user contributes 0.6% of the input. You are paying to re-send the other 99.4% on every single turn.

Few-shot examples are the worst offender because they feel free. You add three examples to fix an edge case, it works, you move on. Now every call in production carries 1,200 dead tokens forever. At scale that line item alone can be five figures monthly. Most of those examples can be replaced by tighter instructions or, better, by fine-tuning a smaller model once you have enough labeled cases.

Conversation history compounds badly. A naive chat implementation appends every prior turn, so by message 20 you are sending 15,000 tokens of context to answer a one-line follow-up. The cost of turn 20 is 20x the cost of turn 1 for no added value. You need active context management, not passive accumulation.

Audit your actual prompts in production, not the ones in your repo. Log a sample of real requests with token breakdowns by component. You will find dead weight you did not know existed — duplicated instructions, retrieved chunks that overlap 80%, history that should have been summarized 10 turns ago.

Model Routing as a First-Class Concern

Using your most capable model for every request is the single most expensive mistake in production LLM systems. The price gap between frontier and small models is 10-30x. Most of your traffic does not need the frontier.

Classify your requests by required capability. Intent detection, classification, extraction, simple rewrites, and formatting are handled fine by small, fast, cheap models. Complex reasoning, multi-step planning, and nuanced generation justify the expensive tier. The trick is routing each request to the cheapest model that can actually do the job.

Build a router. The simplest version is a cheap classifier model that reads the request and outputs a difficulty tier, then dispatches accordingly. This adds one cheap call per request but can cut total spend by 60-70% because the long tail of trivial requests stops hitting your premium tier.

A more aggressive pattern is escalation. Send everything to the cheap model first. Evaluate the response with a lightweight check — confidence heuristics, schema validation, or a verifier model. Only escalate to the expensive model when the cheap one fails. This works well when your cheap model succeeds 70%+ of the time, because you only pay the premium on the 30% it cannot handle.

The trade-off is latency and complexity. Escalation means some requests pay two model calls. Routing adds a classification hop. Measure whether the savings justify the added p95 latency, and always give yourself a kill switch to route everything to one model when a cheap-tier model regresses after a provider update.

Caching That Survives Real Traffic

There are three caching layers worth running, and most teams run zero.

Provider prompt caching is the easiest win and people ignore it. Anthropic and OpenAI both cache stable prefixes. If your system prompt and few-shot block are identical across requests, structure them as a cached prefix and you pay a fraction of the price on cache hits. The catch: cache keys are prefix-sensitive. Put your stable content first and your variable content last. Inject a timestamp or user ID at the top of the prompt and you invalidate the cache on every call — a mistake I have seen cost a team 40% of their potential savings.

Semantic caching is the high-value, high-risk layer. You embed the incoming request, search for a near-identical prior request, and return the cached answer if similarity exceeds a threshold. This works beautifully for FAQ-style traffic where users ask the same thing in slightly different words. It fails catastrophically when your similarity threshold is too loose and you return a confidently wrong cached answer to a question that just looked similar. Tune the threshold conservatively, scope caches per-tenant to avoid leaking data across customers, and never semantic-cache anything with personalized or time-sensitive context.

Exact-match caching for deterministic operations is free money. If you are extracting structured data from the same document repeatedly, or running the same classification on identical input, hash the input and cache the output. No model call needed on a hit.

Cache hit rate is the metric that matters. A cache nobody hits is just added latency. Instrument hit rates per layer and per request type. If a layer sits below 15% hit rate, it is probably not worth the complexity and the staleness risk.

Context Window Discipline

Bigger context windows are a trap. Vendors market million-token windows like they are free. They are not — you pay per token in that window, and model attention degrades on long contexts anyway, so you are paying more for worse answers.

For RAG, retrieve fewer, better chunks. Teams default to top-k of 10 or 20 because more context feels safer. In practice, retrieval quality matters far more than quantity. A reranker that narrows 20 candidates down to the 3 most relevant chunks usually beats stuffing all 20 into the prompt, and it cuts your input tokens by 70%. Spend the engineering effort on retrieval precision, not context volume.

For conversations, summarize aggressively. Once history exceeds a threshold — say 3,000 tokens — replace older turns with a running summary maintained by a cheap model. You keep the last few turns verbatim for immediate coherence and compress everything before that. This caps the context cost of a conversation regardless of length, which is what stops your turn-20 cost explosion.

Trim your system prompts. Most production system prompts are accreted over months of patching, full of redundant instructions and dead rules nobody validated. Periodically rewrite them from scratch. A 600-token prompt can usually become 250 tokens with no behavioral change, and you are paying for those 350 tokens on every call forever.

You cannot control what you do not measure, and most LLM cost overruns are invisible until the invoice arrives 30 days later. By then you have already burned the money.

Log every model call with cost attributes: model used, input tokens, output tokens, computed cost, latency, cache hit or miss, and a tag for the feature or endpoint that triggered it. Aggregate this in real time, not at month end. You want a dashboard that shows cost-per-feature, cost-per-user-segment, and cost-per-tenant updating live.

The per-tenant view matters more than people expect. In any multi-tenant product, cost distribution is brutally skewed. Often 5% of tenants generate 50% of inference cost. Some of those are your best customers and the cost is justified. Some are abusing a free tier or running an integration in a tight loop. You cannot tell which is which without per-tenant attribution, and you cannot price correctly without it either.

Set up anomaly alerts on cost velocity. If today's spend is tracking 3x yesterday's at the same hour, something changed — a bad deploy, a retry storm, a customer who scripted your endpoint. A real-time alert catches it in an hour. The monthly invoice catches it after you have lost $30,000.

Track token efficiency as a first-class metric alongside latency and error rate. Cost-per-successful-request is the number to watch. If it creeps up release over release, you have prompt bloat or routing regression, and you want to catch the trend before it compounds.

Setting Hard Limits

Soft budgets get ignored. Hard limits get respected. Build the constraints into the system so cost cannot run away even when something breaks.

Put a max-tokens cap on every output. An unbounded generation can run to thousands of tokens when the model gets stuck repeating itself, and you pay for every one. Set a sane ceiling per operation. If a response legitimately needs more, that should be an explicit, justified exception, not the default.

Rate-limit per tenant and per user, scoped to your unit economics. A free-tier user should hit a wall at a usage level that keeps them profitable or at least break-even as a funnel. Without this, a single automated client can run your free tier into a five-figure loss in a weekend.

Guard retries. Retry logic is where costs hide. A naive retry on timeout can triple your spend on slow requests, and an exponential backoff that retries five times pays for five full completions. Cap retries at two, only retry on genuinely transient errors, and never retry on a content-quality failure — that is a routing or prompt problem, not a transient one.

Finally, run a circuit breaker on aggregate spend. Define a daily ceiling. When you approach it, degrade gracefully — route everything to the cheapest model, disable expensive features, queue non-urgent work. A degraded product beats a product that blew its entire monthly margin by the 12th of the month. The teams who control LLM costs are the ones who decided, in advance, exactly what happens when the meter spins faster than the revenue.