How much can prompt caching realistically save?

For workloads with a large static prefix (system prompt, tool definitions, RAG context), cache reads cost 10% of normal input pricing. Real savings are typically 60u201390% on the cached portion of each request. The bigger and more reused the prefix, the better the ratio.

When should I use Haiku instead of Sonnet?

Haiku handles classification, extraction, formatting, short summaries, and routine tool calls reliably at roughly one-quarter the cost. Reserve Sonnet for multi-step reasoning, nuanced writing, and ambiguous troubleshooting. Always validate routing decisions against a labeled eval set before shipping.

Is the Batch API safe for production workloads?

Yes, for anything that doesn't need a synchronous response within seconds. Anthropic's Batch API delivers results within 24 hours at 50% off both input and output. It's ideal for nightly jobs, enrichment pipelines, eval runs, and scheduled reports.

What's the biggest mistake founders make with Claude costs?

Shipping without per-request logging. Without token-level telemetry tagged by endpoint and request type, you can't see which 3u20135 flows are driving 80% of your bill. Most teams optimize the wrong things until they instrument first.

Does prompt caching work with tool use and RAG?

Yes. Tool definitions cache cleanly because they rarely change. RAG context caches well when you have hot documents reused across users (product docs, policies). Keep dynamic per-user content at the end of the prompt so it doesn't break the cache prefix.

Claude API cost optimization

The fastest way to blow up your runway with a Claude-powered product is to ship Sonnet on every request and never touch caching. The fastest way to fix it is four moves: cache your system prompts, route easy calls to Haiku, trim context aggressively, and push anything non-interactive through the Batch API. Done well, Claude API cost optimization can cut a production bill by 60–90% without users noticing. This guide walks through the per-token math behind each tactic so you can model burn before launch instead of after your first invoice shock.

The numbers below use Anthropic’s public list pricing as of late 2024. Prices change; the ratios between models and the structure of the optimizations don’t.

The Per-Token Math You Need Before Optimizing Anything

Start with the table. Everything downstream is a multiplier on these base rates.

Model	Input ($/MTok)	Output ($/MTok)	Cache write	Cache read
Claude 3.5 Haiku	$0.80	$4.00	$1.00	$0.08
Claude 3.5 Sonnet	$3.00	$15.00	$3.75	$0.30
Claude 3 Opus	$15.00	$75.00	$18.75	$1.50

Two ratios to memorize. Output is 5× input across every Claude model. Cache reads are 10% of normal input. That second number is the single biggest lever most teams ignore.

A worked example. Say your support agent uses Sonnet with a 4,000-token system prompt plus tool definitions, pulls in 2,000 tokens of conversation history, and writes a 500-token reply. Naive cost per turn: (6,000 × $3 + 500 × $15) / 1M = $0.0255. At 50,000 turns a month, that’s $1,275. Not catastrophic, but every optimization below stacks on this baseline.

Prompt Caching: The Single Biggest Win

Prompt caching lets Claude store a prefix of your prompt (system message, tool definitions, few-shot examples, retrieved documents) and reuse it across requests at 10% of the normal input price. You pay a 25% premium on the first write, then read at one-tenth cost for the next 5 minutes (or 1 hour on the extended TTL).

Back to the support agent. That 4,000-token system prompt is identical on every request. Cache it once, and per-turn input cost on the cached portion drops from $0.012 to $0.0012. Across 50,000 turns, the system prompt alone goes from $600 to $60. Add cache writes (one per 5-minute window, maybe 8,640 per month at most), and you’re looking at roughly $93 instead of $600-a $507/month win on a single component.

What’s cacheable in practice:

System prompts and persona instructions — almost always static.
Tool/function definitions — change on deploy, not per request.
RAG context that’s reused — product docs, policy PDFs, knowledge base chunks pulled on common queries.
Few-shot examples — the more elaborate, the bigger the saving.

Two practical traps. First, the cache is keyed on an exact prefix match. If you splice a per-user variable into the middle of your system prompt, you’ve broken caching for everything after it. Put dynamic content at the end. Second, the 5-minute TTL resets on every cache hit, so a busy endpoint keeps the cache hot for free. Quiet endpoints (nightly jobs, low-traffic tenants) may want the 1-hour TTL even though the write premium is steeper.

Model Routing: Don’t Pay Sonnet Prices for Haiku Work

Sonnet is roughly 3.75× the cost of Haiku on both input and output. Opus is 5× Sonnet. If you’re running Sonnet on every call because it’s the “smart default,” you’re overpaying for the 60–70% of requests that Haiku handles fine.

A routing pattern that works in production:

Classifier on Haiku. A small, cached prompt asks Haiku to label the incoming request as simple, complex, or requires reasoning. Costs maybe $0.0003 per call.
Route by label. Simple intents (FAQ lookups, data extraction, formatting, classification, short summaries) stay on Haiku. Complex intents (multi-step tool use, nuanced writing, ambiguous troubleshooting) go to Sonnet. Reserve Opus for genuinely hard reasoning where you’ve measured Sonnet failing.
Escalate on failure. If Haiku returns low confidence, can’t call a required tool, or hits a guardrail, retry on Sonnet. Log every escalation so you can tune thresholds.

Real numbers on a 50,000-turn month. Assume 65% of turns route to Haiku at $0.005/turn and 35% to Sonnet at $0.0255/turn. Blended cost: $0.0163/turn, $815/month. Versus all-Sonnet at $1,275, that’s a 36% cut layered on top of caching. Stack both and you’re at roughly $300/month for the same product.

One caveat. Routing only pays off if your classifier is accurate. Spend a weekend building an eval set of 200–500 labeled requests and measure routing accuracy before you ship. We’ve seen founders deploy routing logic that mis-routes 20% of complex requests to Haiku, save 30% on tokens, and lose customers because the agent suddenly feels dumb. The savings aren’t worth the churn.

Context Trimming: The Quiet Tax on Long Conversations

Conversation history is the line item that grows silently. Turn 1 sends 500 tokens of history. Turn 20 sends 12,000. By turn 50 you’re paying Sonnet input rates on content the user wrote 30 minutes ago and the model doesn’t need anymore.

Three techniques that compose well:

Sliding window with summary

Keep the last N turns verbatim (typically 4–8). Summarize everything older into a 200–400 token block that lives at the top of the message array. Refresh the summary every 10 turns. A 12,000-token history collapses to roughly 2,500 tokens, cutting per-turn input cost by ~80% on long sessions.

Tool result pruning

Tool calls produce verbose JSON. A database query might return 8,000 tokens of rows when the agent only needed the top 3 to answer. Truncate or post-process tool results before passing them back into the conversation. Better: have the tool itself return a compact view by default and a “verbose” mode the agent can request.

Retrieval over stuffing

If you’re injecting documents into context for RAG, retrieve only the top 3–5 chunks instead of the top 20. Re-rank if quality drops. Most “long context” cost overruns are RAG pipelines tuned for recall when precision would have served the same answer at one-fifth the price.

AI infographic on context trimming for lower token costs.

The Batch API: 50% Off for Anything That Can Wait 24 Hours

Anthropic’s Batch API processes requests asynchronously within 24 hours at 50% off both input and output. For any workload that isn’t user-facing in real time, this is a flat 2× efficiency gain with no engineering trade-offs beyond an async pattern.

Workloads that belong in batch:

Nightly summarization of the day’s support tickets.
Bulk enrichment of CRM records or lead lists.
Backfilling embeddings or categorizations on historical data.
Generating reports, weekly digests, or scheduled briefings.
Eval runs and regression testing against your prompt library.

A concrete case. A B2B sales-intel startup we worked with was scoring 80,000 new company records a week on Sonnet, paying about $4,800/month. Moving the entire scoring pipeline to Batch-same model, same prompts, just async-dropped it to $2,400. They also stacked prompt caching on the scoring prompt and got to $1,100. Two weeks of engineering for a $44K annual saving.

Output Tokens Are 5× Input Stop Letting the Model Ramble

Founders fixate on input optimization and forget the output side, where every token costs 5× more. A model that writes a 1,200-token answer when 300 would do is quietly tripling your bill on the most expensive side of the ledger.

Tactics that work:

Explicit length budgets in the prompt. “Reply in 3 sentences or fewer.” “Return only the JSON object, no preamble.” Sonnet and Haiku both respect these reliably.
Structured outputs over prose. If your app parses the response anyway, ask for JSON or a tight schema. You’ll cut output tokens 40–60% versus conversational answers.
Stop sequences. Define hard stops on patterns the model produces when it’s about to over-explain (“Additionally,”, “In summary,”).
Set max_tokens conservatively. Don’t leave it at 4096 if your real ceiling is 800. A runaway output can cost more than the entire conversation that preceded it.

Instrument Before You Optimize

You can’t optimize what you can’t see. Before applying any of the above, wire up per-request logging that captures: model, input tokens, output tokens, cache read tokens, cache write tokens, latency, and a request-type tag (intent, endpoint, customer tier). Pipe it to whatever you use-Datadog, a Postgres table, even a daily CSV.

Within a week you’ll have a Pareto chart showing which 3–5 request types account for 80% of spend. Optimize those first. We’ve seen teams spend two sprints shaving tokens off a flow that represented 4% of their bill while a single misconfigured webhook was burning $3,000/month in retries against Opus.

A Realistic Pre-Launch Burn Model

Put it together. Imagine a Claude-powered onboarding agent for a B2B SaaS, 2,000 active users, average 15 agent turns/user/month, 30,000 turns total.

Configuration	Per-turn cost	Monthly cost
Naive Sonnet, no caching	$0.0255	$765
+ Prompt caching	$0.0145	$435
+ Haiku routing (65/35 split)	$0.0088	$264
+ Context trimming (-40% input)	$0.0061	$183
+ Batch for nightly recap (20% of turns)	$0.0053	$159

Same product, same model quality where it matters, 79% lower bill. Run this exercise on your own assumptions before you commit to a pricing plan or a fundraise model. The difference between $159/month and $765/month at scale is the difference between gross margins that look like a SaaS business and margins that look like a reseller.

Cost discipline isn’t about being cheap. It’s about keeping enough margin to invest in the parts of the product that actually differentiate-better evals, better tools, faster iteration-instead of subsidizing token waste.