claude-platform-playbook

Claude Eval Starter Pack

As of 2026-05. Pin to current model surface (Opus 4.7 / Sonnet 4.6 / Haiku 4.5). Companion to adoption-playbook.md (Weeks 5–8 guardrails) and claude-code-adoption-guide.md (Phase 3 governance). See ../docs/feature-inventory.md for canonical feature surface.

Eight evaluation templates an engineering or COE lead can drop into CI on day one. Each eval is framed by decision first (what regression it catches, what it costs, who owns it) — the eval body is last, because eval mechanics are the cheapest part to build and the most expensive part to govern.

The adoption playbook names eval debt as the #2 failure mode in scaling Claude. This pack is the scaffolding it doesn’t ship: the categories, the failure modes, the ownership pattern. The specific tasks are yours — the structure is portable.

If you ship even half of these in CI, you defuse the most common scaling failure: “prompts evolved faster than the evalset; quality regressed unnoticed.”


How to use this pack

  1. Pick by failure mode, not by name. Each eval below opens with the specific regression it catches. Pick the 2–3 that map to your most-feared failure mode in production, not the most familiar names.
  2. 30–80 examples per eval to start. Enough to be statistically meaningful for binary pass/fail; small enough to keep CI under 10 minutes. Grow to 200–500 only when the eval is proven valuable.
  3. One owner per eval. Without an owner, evalsets rot inside a quarter — same failure mode as Skills. Name in the eval’s metadata.
  4. Block deploys on critical evals; advisory on everything else. Trying to make every eval blocking creates eval-bypass culture. See “Blocking vs advisory” below.
  5. Run cheaply. Use Batch API (50% off) for nightly full runs and Code execution tool to host the runner if you don’t want to maintain infrastructure. A full 500-example eval suite on Sonnet 4.6 with caching runs under $5 in batch mode.

Each template below follows the same structure:


1. Regression suite (golden tasks)

Catches. Prompt or Skill changes that improve target behavior but degrade unrelated behavior. The classic “I made it better at X and broke Y without noticing.”

Why it earns its keep. Without this, every prompt change is an unmonitored experiment. The adoption playbook calls this out explicitly: prompts evolve faster than the evalset; quality regresses unnoticed. This eval is the gate.

Failure mode of the eval itself. Golden tasks become stale — the team adds new features but only adds positive examples (no regression seeds). Within 6 months the eval passes but production breaks. Mitigate: when a bug ships, the bug becomes a regression seed before the fix merges.

Owner archetype. Pilot lead in Phase 1; COE in Phase 2+.

name: regression-golden-tasks
owner: <person-or-team>
runs_on: [pre-merge, nightly]
blocking: true  # blocks merge if pass rate drops below threshold

inputs:
  count: 30-80 to start, grow to 200-500
  composition:
    - 60% golden path (the work this prompt is for)
    - 25% common edge cases (named in pilot reports)
    - 15% regression seeds (one example per shipped bug, in perpetuity)
  format: |
    {
      "id": "stable-uuid",
      "input": "<the actual input the system sees>",
      "expected_behavior": "<what should happen, in plain English>",
      "rubric": ["criterion 1", "criterion 2", ...],
      "is_regression_seed_for": "<bug-id or null>"
    }

scoring:
  method: LLM-as-judge (Opus 4.7) against the rubric, with sampled human review
  output: per-task pass/fail + aggregate pass rate
  human_sample: 10% of judge calls reviewed weekly to detect judge drift

threshold:
  current: <pin to baseline pass rate at lock time>
  drop_tolerance: 2 percentage points before blocking
  on_threshold_breach: surface failing tasks to PR author + owner

retire_when:
  - task no longer reflects current product
  - regression seed has had 0 failures across 90 days AND the underlying bug class is structurally prevented (test, type system, lint)

2. Format / schema compliance

Catches. Outputs that look right to a human but break downstream parsing — malformed JSON, missing required fields, wrong markdown structure, inconsistent citation format.

Why it earns its keep. The most common silent failure in production-grade Claude integrations: model output passes “looks fine” review and crashes the consumer. This eval is mechanical — it should never be skipped on a deploy.

Failure mode of the eval itself. Eval is too strict (rejects valid variations) or too lenient (accepts malformed output that breaks one specific consumer). Mitigate: the schema is owned by the consumer, not the prompt author.

Owner archetype. Whoever owns the downstream integration / API consumer.

name: format-schema-compliance
owner: <integration team>
runs_on: [pre-merge, every deploy]
blocking: true

inputs:
  count: 50-100 (covers the structural variants the prompt should produce)
  format: |
    {
      "id": "stable-uuid",
      "input": "<actual prompt input>",
      "schema": "<JSON schema or pydantic model or grammar>",
      "must_contain": ["field1", "field2", ...],
      "must_not_contain": ["leaked-thinking-token", "raw-tool-call-syntax", ...]
    }

scoring:
  method: deterministic — parse the output, validate against schema
  output: parse_succeeded × schema_valid × contains_required × no_forbidden
  no LLM judge — schema compliance is binary

threshold:
  current: 100% (any failure = blocking)
  drop_tolerance: 0 — schema breaks have no acceptable rate

retire_when:
  - schema is deprecated and consumer has fully migrated

3. Tool-call accuracy

Catches. Agent picks the wrong tool, calls the right tool with wrong arguments, or hallucinates a tool that doesn’t exist. Regresses subtly when adding new tools to a working agent.

Why it earns its keep. Adding a 6th tool to a 5-tool agent commonly degrades selection accuracy on the original 5 — and the original 5 are the ones that already work in production. This eval surfaces that regression before the new tool ships.

Failure mode of the eval itself. Eval rewards “called the tool” rather than “called it with right args.” Mitigate: score arguments separately from selection — both must pass.

Owner archetype. Owner of the agent/workflow (the team whose product the agent is part of).

name: tool-call-accuracy
owner: <agent owner team>
runs_on: [pre-merge on tool changes, nightly on agent changes, on every model bump]
blocking: true

inputs:
  count: 50-150 (cover each tool's primary patterns + cross-tool ambiguity cases)
  format: |
    {
      "id": "stable-uuid",
      "input": "<user message>",
      "expected_tool": "tool-name OR null (no tool needed)",
      "expected_args_shape": {"key": "value pattern or wildcard"},
      "forbidden_tools": ["tool-that-would-be-wrong-here"],
      "ambiguity_class": "clear | ambiguous | trick"
    }

scoring:
  method: deterministic on tool selection + LLM judge on argument quality
  components:
    - selected_correct_tool: bool
    - selected_no_forbidden_tool: bool
    - args_match_expected_shape: bool (judge or schema match)
    - ambiguous_cases: tracked separately, not in primary pass rate
  output: aggregate accuracy on clear cases + reported accuracy on ambiguous

threshold:
  clear_cases: 95%+ pass rate
  ambiguous_cases: 80%+ — drops here are advisory, not blocking
  on_breach: surface to agent owner with diff vs prior run

retire_when:
  - tool is removed from the agent

4. Hallucination / grounding

Catches. Outputs that state plausible facts not supported by the provided context (RAG corpus, file uploads, tool results). The single most-quoted production failure for AI integrations.

Why it earns its keep. Customers will not distinguish “the AI was wrong” from “your product was wrong.” This eval is the only way to put a number on grounding quality before users find the bad answers.

Failure mode of the eval itself. Judge model can’t distinguish “stated correctly but wasn’t asked for” from “hallucinated.” Mitigate: scope the eval narrowly — every claim in the output must trace to a specific span in the input context, judged against citation rather than truth.

Owner archetype. Whoever owns the knowledge corpus or RAG layer. (Not the prompt author — that’s a conflict of interest.)

name: grounding
owner: <knowledge / RAG team>
runs_on: [nightly on RAG changes, weekly otherwise]
blocking: false  # advisory until you've calibrated the threshold for your domain

inputs:
  count: 50-200 (cover topical breadth of the corpus)
  format: |
    {
      "id": "stable-uuid",
      "query": "<user question>",
      "context": "<the documents the system retrieved or was given>",
      "ground_truth_answer": "<optional — if you have it>",
      "must_cite": true | false
    }

scoring:
  method: LLM judge (Opus 4.7) extracts each factual claim from the output and tests:
    - is_supported_by_context: bool
    - has_citation: bool (if must_cite)
    - citation_resolves_to_real_span: bool
  primary metric: % of claims supported / total claims
  secondary: % of outputs with at least one unsupported claim

threshold:
  initial: track only — set baseline after first 30 days
  alert: when supported-claim rate drops 3pp from rolling 7-day average

retire_when:
  - corpus is structurally migrated to a system with built-in grounding (e.g., Files API + native citations)

5. Adversarial / jailbreak resistance

Catches. System prompt extraction, role-play breaks, prompt injection via user content (in RAG, files, web fetch results), and policy violations under pressure.

Why it earns its keep. Production systems run for months without an adversary, then meet one. The eval is cheap; the breach is not. Especially required if the system has tools that read external content (web fetch, file uploads, MCP servers reading email/tickets).

Failure mode of the eval itself. Test set is too narrow (last year’s jailbreaks) and misses current technique drift. Mitigate: pull from public jailbreak corpora (DAN variants, prompt-injection benchmarks) and add 5–10 fresh attempts each quarter.

Owner archetype. Security / red team. Do NOT let the prompt author own this eval — same conflict-of-interest as grounding.

name: adversarial-jailbreak
owner: <security or red team>
runs_on: [weekly, on system prompt changes, on model bumps]
blocking: true on critical violations

inputs:
  count: 30-100, refreshed quarterly
  composition:
    - 40% direct prompt-injection attempts (system prompt extraction, instruction override)
    - 30% indirect injection (malicious content in RAG, file uploads, tool outputs)
    - 20% policy violations (refused-content elicitation under role-play)
    - 10% emerging techniques from public jailbreak corpora
  format: |
    {
      "id": "stable-uuid",
      "attack_class": "direct-injection | indirect-injection | policy-violation | extraction",
      "input": "<the adversarial input>",
      "expected_behavior": "refuse | safe-completion | flag-and-continue",
      "severity_if_breached": "critical | high | medium"
    }

scoring:
  method: LLM judge (Opus 4.7) classifies output as breached | partially-breached | held
  human review: ALL "breached" classifications reviewed within 24h
  primary metric: critical-severity breach rate (must be 0)
  secondary: any-severity breach rate

threshold:
  critical_breaches: 0 (any breach = blocking + post-mortem)
  high_breaches: <2% across the suite
  medium_breaches: tracked, not blocking

retire_when:
  - never. This is a compounding asset.

6. Cost-per-task

Catches. Cost regressions from prompt bloat (cache misses), unintended Opus calls, model-mix drift, or context-window growth that breaks the team’s unit economics.

Why it earns its keep. The single most common cause of an exec losing faith in Claude is a surprise cost spike. This eval is the early warning. The adoption playbook: “if you can’t measure cost-per-task, you’ll lose the budget conversation in week 12.”

Failure mode of the eval itself. Tracks token cost without tracking value delivered, so it punishes legitimate quality investments. Mitigate: pair with a quality eval (regression suite) — degraded quality at lower cost is not a win.

Owner archetype. Platform team / FinOps. Not the prompt author.

name: cost-per-task
owner: <platform / FinOps>
runs_on: [nightly, on prompt changes]
blocking: false  # advisory unless paired with explicit cost SLO

inputs:
  count: 100+ representative production tasks (sample from real traffic, anonymized)
  format: |
    {
      "id": "stable-uuid",
      "input": "<actual production-shaped input>",
      "task_class": "<so you can track cost by class, not just average>"
    }

scoring:
  method: deterministic — count tokens, apply pricing table from cost-calculator.html
  components:
    - input_tokens (cached vs uncached)
    - output_tokens
    - cache_hit_rate
    - model_used (track Opus % over time)
  primary metric: median cost per task (by class) + p95 cost per task

threshold:
  baseline: lock at production cutover
  alert: median rises 15% week-over-week, OR p95 rises 30%
  on_breach: surface biggest cost-mover task class to platform owner

retire_when:
  - never. This is a compounding asset.

7. Latency budget

Catches. p50/p95 latency regressions from model swaps, prompt growth, tool-loop expansion, or streaming → non-streaming changes.

Why it earns its keep. Latency feels like quality to users. A 30% p95 regression won’t show up in regression tests but tanks NPS. This eval catches it before users do.

Failure mode of the eval itself. Measures latency in CI conditions (warm cache, single-tenant) that don’t reflect production load. Mitigate: also run a thin synthetic-load eval against the production endpoint hourly.

Owner archetype. Platform / SRE.

name: latency-budget
owner: <platform / SRE>
runs_on: [hourly synthetic load, nightly full eval]
blocking: false  # latency regressions block via separate SLO alerting

inputs:
  count: 30-50 representative tasks (mix of short / medium / long)
  format: |
    {
      "id": "stable-uuid",
      "input": "<input>",
      "expected_streaming": true | false,
      "task_class": "short | medium | long | tool-using"
    }

scoring:
  method: deterministic — wall-clock time-to-first-token + total time
  components:
    - ttft_p50, ttft_p95
    - total_p50, total_p95 (by task class)
    - tool_call_overhead (where applicable)
  primary metric: p95 by task class

threshold:
  budget: defined per task class (e.g., short: ttft < 800ms, total < 5s)
  alert: regression > 20% from rolling 7-day baseline
  on_breach: surface by task class — rarely uniform

retire_when:
  - never. This is a compounding asset.

8. Refusal calibration

Catches. Both directions of failure: false refusals (model declines benign requests, user trust erodes) and unsafe completions (model assists with prohibited request, policy violation).

Why it earns its keep. Most teams optimize one direction without measuring the other. The result: either a model that refuses too much (engineers route around it, defeating the point of the safety prompt) or completes too readily (incident waiting to happen).

Failure mode of the eval itself. Test set leans toward one direction (usually only-prohibited). Mitigate: explicitly balance — at least 40% benign-but-near-policy-edge inputs.

Owner archetype. Policy / safety owner with input from product (false refusals are a UX problem too).

name: refusal-calibration
owner: <policy / safety + product>
runs_on: [weekly, on system prompt changes]
blocking: false  # tracked as a balance, not a single threshold

inputs:
  count: 50-100, balanced
  composition:
    - 30% clearly benign (must complete)
    - 30% clearly prohibited (must refuse)
    - 40% near-edge (judgment calls — track separately)
  format: |
    {
      "id": "stable-uuid",
      "input": "<input>",
      "category": "benign | prohibited | edge",
      "expected_behavior": "complete | refuse | safe-completion"
    }

scoring:
  method: LLM judge (Opus 4.7) classifies output as: completed | refused | safe-completion
  primary metrics:
    - false_refusal_rate (% of benign inputs refused)
    - over_completion_rate (% of prohibited inputs completed)
    - edge_case_distribution (just track — no single right answer)
  pair these — improving one in isolation likely worsens the other

threshold:
  false_refusal: <5%
  over_completion: <1%
  on_breach: review with policy + product owners together

retire_when:
  - never. This is a compounding asset.

Blocking vs advisory — the discipline

Trying to make every eval blocking creates eval-bypass culture. Engineers learn to disable, downsize, or commit-around the suite. The adoption playbook calls this eval bypass — once it starts, the suite is dead.

The pattern that holds:

Eval Blocking Advisory Tracked-only
Regression suite    
Format compliance    
Tool-call (clear cases)    
Tool-call (ambiguous)    
Adversarial (critical)    
Adversarial (medium)    
Hallucination / grounding    
Cost-per-task    
Latency   ✓ (via SLO)  
Refusal calibration     ✓ (balance, not threshold)

When in doubt: start advisory, promote to blocking after 60 days of stable signal.


Where to run evals

Option Pros Cons When
In your existing CI (GitHub Actions, GitLab CI) Familiar; fits team workflow Cost discipline is on you; runners need API keys + budget caps Default for engineering teams
Anthropic-hosted Code execution tool No infra; cheap; runs in batch mode Newer surface; pin to current docs Small teams; eval-as-experiment posture
Batch API for nightly full runs 50% off all token cost Async — not for blocking gates Pair with synchronous CI for the blocking subset
Dedicated eval platform (Braintrust, Langfuse, Inspect, etc.) Rich tooling; comparison views Vendor lock-in; another tool for the team to learn When eval volume justifies it (>10k runs/day)

What this pack does NOT include (and why)


Companion artifacts


© gmanch94 · CC-BY-4.0 · As of 2026-05. Verify Batch API + Code execution surface at docs.claude.com.