Claude Cost Calculator (as of 2026-05)

Model your monthly Claude API spend. The sliders are interactive; outputs update live. Treat the result as a governance constraint, not a curiosity: define a $/task ceiling, a $/day cap, a cache hit-rate floor, and a batch-eligibility floor before pilot launch (see governance-overlay §15).

Workload inputs

  • Requests/month: total across all use cases.
  • Input tokens/request: includes system prompt, history, and retrieved context. RAG patterns trend high (10k–50k); chat trends low (1k–5k).
  • Output tokens/request: tool-calling agents accumulate output across turns. A 5-turn agent at 800 tok/turn ≈ 4,000.
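The workload inputs scale to monthly token volume as sketched below. The function name and the `calls_per_request` fan-out parameter are illustrative, not part of the calculator; the numbers reuse the 5-turn agent example from the text.

```python
def monthly_tokens(requests_per_month, input_tok_per_req, output_tok_per_req,
                   calls_per_request=1):
    """Scale per-request tokens to monthly totals.

    calls_per_request models agent fan-out: one user request may
    trigger several model calls (see "Common modeling mistakes").
    """
    calls = requests_per_month * calls_per_request
    return calls * input_tok_per_req, calls * output_tok_per_req

# A 5-turn agent at 800 output tok/turn is ~4,000 output tok/request:
inp, out = monthly_tokens(100_000, 12_000, 800 * 5)
print(f"input: {inp:,} tok/mo, output: {out:,} tok/mo")
```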

Model mix (default routing split)

  • Haiku 4.5: 60%
  • Sonnet 4.6: 35%
  • Opus 4.7: 5%

Optimization levers

  • Cache hit rate: % of input tokens served from cache (90% cheaper). Realistic range: 50–85% once warm; new conversations start cold.
  • Batch share: % of traffic that can wait up to 24h (50% off). Interactive copilots: 0%. Nightly enrichment: 100%.
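The cache lever reduces to a blended per-token input price. A minimal sketch, assuming reads bill at 10% of the input rate and (optionally) amortizing the one-time 125% cache write across hits; the function name and `writes_per_hit` knob are illustrative.

```python
def effective_input_price(input_price, hit_rate, writes_per_hit=0.0):
    """Blended $/Mtok for input after caching.

    hit_rate: fraction of input tokens served from cache (0.0-1.0).
    writes_per_hit: cache writes amortized per cached read; 0 ignores
    write cost (reasonable when a prefix is reused many times).
    """
    read = hit_rate * 0.10 * input_price          # cache read: 10% of input
    miss = (1 - hit_rate) * input_price           # uncached input: full rate
    write = hit_rate * writes_per_hit * 1.25 * input_price  # 5m write: 125%
    return read + miss + write

# Sonnet 4.6 at a 70% hit rate, ignoring write amortization:
print(round(effective_input_price(3.00, 0.70), 2))  # 1.11, down from 3.00
```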

Monthly cost

The calculator reports total $/mo, the blended $/request, and % saved vs. a naive baseline (all-Opus, no caching, no batch).

Governance gates (§15.1)

  • Per-task ceiling: set at 1.5× the modeled $/task; the calculator shows the current $/req. Leave blank to disable.
  • Daily cap: auto-throttle at 80%, auto-disable at 100%. Estimate $/day as monthly ÷ 30. Leave blank to disable.
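The daily-cap behavior (throttle at 80%, disable at 100%, blank cap means no gate) can be sketched as below. The function name and return values are assumptions; wire this to your own spend metering.

```python
def gate(spend_today, daily_cap):
    """Return 'ok', 'throttle', or 'disable' for the current daily spend."""
    if daily_cap is None:              # blank cap disables the gate
        return "ok"
    if spend_today >= daily_cap:       # 100% of cap: auto-disable
        return "disable"
    if spend_today >= 0.80 * daily_cap:  # 80% of cap: auto-throttle
        return "throttle"
    return "ok"
```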

Cost breakdown

Component | Tokens/mo | $/mo (rows populate live from the inputs above)
Pricing assumptions (per million tokens, USD)

Model        Input    Output   Cache read   Cache write (5m)
Haiku 4.5    $1.00    $5.00    $0.10        $1.25
Sonnet 4.6   $3.00    $15.00   $0.30        $3.75
Opus 4.7     $15.00   $75.00   $1.50        $18.75

Batch API: 50% off all rates. Cache read = 10% of the input price. Cache write (5m) = 125% of the input price (one-time per cached prefix).

Pricing is approximate as of 2026-05. Verify current rates at anthropic.com/pricing before final budgeting.
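The full model the calculator implements can be sketched from the table above: apply the model mix, blend cached vs. uncached input, and discount the batch share. Prices are the 2026-05 figures quoted here; function and key names are illustrative.

```python
PRICES = {  # $/Mtok: (input, output, cache read, cache write 5m)
    "haiku-4.5":  (1.00,  5.00,  0.10,  1.25),
    "sonnet-4.6": (3.00, 15.00,  0.30,  3.75),
    "opus-4.7":   (15.00, 75.00, 1.50, 18.75),
}
MIX = {"haiku-4.5": 0.60, "sonnet-4.6": 0.35, "opus-4.7": 0.05}

def monthly_cost(input_mtok, output_mtok, cache_hit=0.0, batch_share=0.0,
                 mix=MIX):
    """Total $/mo for the given monthly volume (in millions of tokens)."""
    total = 0.0
    for model, share in mix.items():
        inp, out, read, _ = PRICES[model]
        # Cached input bills at the cache-read rate (10% of input).
        in_price = cache_hit * read + (1 - cache_hit) * inp
        cost = input_mtok * share * in_price + output_mtok * share * out
        # Batch traffic gets 50% off all rates.
        total += cost * (1 - 0.5 * batch_share)
    return total

def naive_baseline(input_mtok, output_mtok):
    """All-Opus, no caching, no batch: the comparison baseline."""
    inp, out, _, _ = PRICES["opus-4.7"]
    return input_mtok * inp + output_mtok * out
```

Savings vs. the baseline is then `1 - monthly_cost(...) / naive_baseline(...)`. Note this sketch applies the same cache hit rate and batch share to every model; the real calculator may track them per use case.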

Reading the result

What drives spend most
  • Model mix dominates. Opus is 15× Haiku on input, 15× on output. Triaging with Haiku and routing only hard cases to Sonnet/Opus is the single highest-leverage decision.
  • Cache hit rate is the second lever. A 70% hit rate on a 50k-token system prompt is roughly equivalent to running a much smaller model. RAG patterns benefit most.
  • Output tokens cost 5× input. Agentic loops that generate long chains pay disproportionately. Worth measuring tokens-per-task and capping with explicit instructions.
  • Batch is free money where latency allows. Overnight enrichment, eval runs, content generation, analytics — half-price by default.
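The triage pattern in the first bullet can be sketched as a tiny router. `is_hard` and `is_critical` are placeholders for your own difficulty signal (a heuristic, a classifier, or a cheap Haiku self-assessment pass); none of this is part of the calculator.

```python
def route(task, is_hard, is_critical):
    """Default to Haiku; escalate only the cases that need more model."""
    if is_critical(task):
        return "opus-4.7"      # rare, highest-stakes work
    if is_hard(task):
        return "sonnet-4.6"    # escalate hard cases
    return "haiku-4.5"         # default: 15x cheaper than Opus on both ends
```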
Common modeling mistakes
  • Forgetting agent fan-out — a single user request can become 5–20 model calls with tool use. Multiply.
  • Treating cache hit rate as a constant — real workloads vary 30–80% depending on conversation reset frequency.
  • Pricing all output at the highest tier — if you're routing 80% of traffic to Haiku, output blends accordingly.
  • Ignoring extended thinking budget — when enabled, thinking tokens count as output. Budget separately.
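The third mistake above is worth a quick worked number: blended output price under a routing mix vs. pricing everything at the top tier. Rates come from the pricing table; the 80/15/5 mix is illustrative.

```python
out_price = {"haiku": 5.00, "sonnet": 15.00, "opus": 75.00}  # $/Mtok output
mix = {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05}

# Mix-weighted average output price:
blended = sum(mix[m] * out_price[m] for m in mix)
print(round(blended, 2))  # 10.0, vs. 75.0 if you price all output as Opus
```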