Claude Cost Calculator (as of 2026-05)

Model your monthly Claude API spend. The sliders are interactive; outputs update live. Treat the result as a governance constraint, not a curiosity: define a $/task ceiling, a $/day cap, a cache hit-rate floor, and a batch-eligibility floor before pilot launch (see governance-overlay §15).

Workload inputs

  • Requests/month: total across all use cases.
  • Input tokens/request: includes system prompt, history, and retrieved context. RAG patterns trend high (10k–50k); chat trends low (1k–5k).
  • Output tokens/request: tool-calling agents accumulate output across turns. A 5-turn agent at 800 tok/turn ≈ 4,000.
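The workload inputs scale to monthly token volume as sketched below. The function name and the `calls_per_request` fan-out parameter are illustrative, not part of the calculator; the numbers reuse the 5-turn agent example from the text.

```python
def monthly_tokens(requests_per_month, input_tok_per_req, output_tok_per_req,
                   calls_per_request=1):
    """Scale per-request tokens to monthly totals.

    calls_per_request models agent fan-out: one user request may
    trigger several model calls (see "Common modeling mistakes").
    """
    calls = requests_per_month * calls_per_request
    return calls * input_tok_per_req, calls * output_tok_per_req

# A 5-turn agent at 800 output tok/turn is ~4,000 output tok/request:
inp, out = monthly_tokens(100_000, 12_000, 800 * 5)
print(f"input: {inp:,} tok/mo, output: {out:,} tok/mo")
```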

Model mix (default routing split)

  • Haiku 4.5: 60%
  • Sonnet 4.6: 35%
  • Opus 4.7: 5%

Optimization levers

  • Cache hit rate: % of input tokens served from cache (90% cheaper). Realistic range: 50–85% once warm; new conversations start cold.
  • Batch share: % of traffic that can wait up to 24h (50% off). Interactive copilots: 0%. Nightly enrichment: 100%.
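The cache lever reduces to a blended per-token input price. A minimal sketch, assuming reads bill at 10% of the input rate and (optionally) amortizing the one-time 125% cache write across hits; the function name and `writes_per_hit` knob are illustrative.

```python
def effective_input_price(input_price, hit_rate, writes_per_hit=0.0):
    """Blended $/Mtok for input after caching.

    hit_rate: fraction of input tokens served from cache (0.0-1.0).
    writes_per_hit: cache writes amortized per cached read; 0 ignores
    write cost (reasonable when a prefix is reused many times).
    """
    read = hit_rate * 0.10 * input_price          # cache read: 10% of input
    miss = (1 - hit_rate) * input_price           # uncached input: full rate
    write = hit_rate * writes_per_hit * 1.25 * input_price  # 5m write: 125%
    return read + miss + write

# Sonnet 4.6 at a 70% hit rate, ignoring write amortization:
print(round(effective_input_price(3.00, 0.70), 2))  # 1.11, down from 3.00
```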

Monthly cost

The calculator reports total $/mo, the blended $/request, and % saved vs. a naive baseline (all-Opus, no caching, no batch).

Governance gates (§15.1)

  • Per-task ceiling: set at 1.5× the modeled $/task; the calculator shows the current $/req. Leave blank to disable.
  • Daily cap: auto-throttle at 80%, auto-disable at 100%. Estimate $/day as monthly ÷ 30. Leave blank to disable.
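The daily-cap behavior (throttle at 80%, disable at 100%, blank cap means no gate) can be sketched as below. The function name and return values are assumptions; wire this to your own spend metering.

```python
def gate(spend_today, daily_cap):
    """Return 'ok', 'throttle', or 'disable' for the current daily spend."""
    if daily_cap is None:              # blank cap disables the gate
        return "ok"
    if spend_today >= daily_cap:       # 100% of cap: auto-disable
        return "disable"
    if spend_today >= 0.80 * daily_cap:  # 80% of cap: auto-throttle
        return "throttle"
    return "ok"
```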

Cost breakdown

Component | Tokens/mo | $/mo (rows populate live from the inputs above)
Pricing assumptions (per million tokens, USD)

Model        Input    Output   Cache read   Cache write (5m)
Haiku 4.5    $1.00    $5.00    $0.10        $1.25
Sonnet 4.6   $3.00    $15.00   $0.30        $3.75
Opus 4.7     $15.00   $75.00   $1.50        $18.75

Batch API: 50% off all rates. Cache read = 10% of the input price. Cache write (5m) = 125% of the input price (one-time per cached prefix).

Pricing is approximate as of 2026-05. Verify current rates at anthropic.com/pricing before final budgeting.
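The full model the calculator implements can be sketched from the table above: apply the model mix, blend cached vs. uncached input, and discount the batch share. Prices are the 2026-05 figures quoted here; function and key names are illustrative.

```python
PRICES = {  # $/Mtok: (input, output, cache read, cache write 5m)
    "haiku-4.5":  (1.00,  5.00,  0.10,  1.25),
    "sonnet-4.6": (3.00, 15.00,  0.30,  3.75),
    "opus-4.7":   (15.00, 75.00, 1.50, 18.75),
}
MIX = {"haiku-4.5": 0.60, "sonnet-4.6": 0.35, "opus-4.7": 0.05}

def monthly_cost(input_mtok, output_mtok, cache_hit=0.0, batch_share=0.0,
                 mix=MIX):
    """Total $/mo for the given monthly volume (in millions of tokens)."""
    total = 0.0
    for model, share in mix.items():
        inp, out, read, _ = PRICES[model]
        # Cached input bills at the cache-read rate (10% of input).
        in_price = cache_hit * read + (1 - cache_hit) * inp
        cost = input_mtok * share * in_price + output_mtok * share * out
        # Batch traffic gets 50% off all rates.
        total += cost * (1 - 0.5 * batch_share)
    return total

def naive_baseline(input_mtok, output_mtok):
    """All-Opus, no caching, no batch: the comparison baseline."""
    inp, out, _, _ = PRICES["opus-4.7"]
    return input_mtok * inp + output_mtok * out
```

Savings vs. the baseline is then `1 - monthly_cost(...) / naive_baseline(...)`. Note this sketch applies the same cache hit rate and batch share to every model; the real calculator may track them per use case.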

Reading the result

What drives spend most
  • Model mix dominates. Opus is 15× Haiku on input, 15× on output. Triaging with Haiku and routing only hard cases to Sonnet/Opus is the single highest-leverage decision.
  • Cache hit rate is the second lever. A 70% hit rate on a 50k-token system prompt is roughly equivalent to running a much smaller model. RAG patterns benefit most.
  • Output tokens cost 5× input. Agentic loops that generate long chains pay disproportionately. Worth measuring tokens-per-task and capping with explicit instructions.
  • Batch is free money where latency allows. Overnight enrichment, eval runs, content generation, analytics — half-price by default.
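The triage pattern in the first bullet can be sketched as a tiny router. `is_hard` and `is_critical` are placeholders for your own difficulty signal (a heuristic, a classifier, or a cheap Haiku self-assessment pass); none of this is part of the calculator.

```python
def route(task, is_hard, is_critical):
    """Default to Haiku; escalate only the cases that need more model."""
    if is_critical(task):
        return "opus-4.7"      # rare, highest-stakes work
    if is_hard(task):
        return "sonnet-4.6"    # escalate hard cases
    return "haiku-4.5"         # default: 15x cheaper than Opus on both ends
```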
Common modeling mistakes
  • Forgetting agent fan-out — a single user request can become 5–20 model calls with tool use. Multiply.
  • Treating cache hit rate as a constant — real workloads vary 30–80% depending on conversation reset frequency.
  • Pricing all output at the highest tier — if you're routing 80% of traffic to Haiku, output blends accordingly.
  • Ignoring extended thinking budget — when enabled, thinking tokens count as output. Budget separately.
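The third mistake above is worth a quick worked number: blended output price under a routing mix vs. pricing everything at the top tier. Rates come from the pricing table; the 80/15/5 mix is illustrative.

```python
out_price = {"haiku": 5.00, "sonnet": 15.00, "opus": 75.00}  # $/Mtok output
mix = {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05}

# Mix-weighted average output price:
blended = sum(mix[m] * out_price[m] for m in mix)
print(round(blended, 2))  # 10.0, vs. 75.0 if you price all output as Opus
```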