As of 2026-05. Pin to current model surface (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) — refresh monthly.
A 90-day arc from “we want to use Claude” to “Claude is in production with guardrails and a Center of Excellence pattern.” Built for transformation leads, not for engineers — engineers should pair this with claude-code-adoption-guide.md.
🛑 Before Week 0: run candidate use cases through anti-use-cases.md. The “Premature” rows there are the same pre-flight gates this playbook names, but treated as blocking, with cited frameworks. If a candidate hits any row, kill or re-route before scoring.
Skipping Week 0 is the single most common cause of stalled Claude pilots. Spend 5 working days here.
| Decision | Default for most orgs | When to deviate |
|---|---|---|
| Executive sponsor | Named C-level (CIO, CTO, or business unit P&L owner). Not “the AI committee.” | Skip only if pilot is a single team, single use case, no cross-functional asks. |
| Pilot use case | One internal-facing workflow with measurable cycle time and a willing team. Use pilot-selection-worksheet.html to score 2–6 candidates on 5 axes (value, speed, data, risk, sponsor) before committing (a scoring sketch follows this table). | Don’t lead with a customer-facing workflow — too much governance overhead before you have evidence. |
| Success metrics | Cycle time, accuracy/quality, user satisfaction, and cost-per-task. Numeric, not adjectival. | If you can’t measure cost-per-task, you’ll lose the budget conversation in week 12. |
| Procurement path | Direct API via console.anthropic.com for speed. Bedrock or Vertex if hyperscaler commits dictate. | Bedrock/Vertex add a regional + procurement path but cost you a model-version lag. |
| BAA / DPA | Sign before any data flows. | Healthcare, finance, EU operations: required before pilot. |
| No-train confirmation | Default for API + console workloads is no-train. Document the policy version + as-of date. | Re-verify quarterly; terms drift. |
| Data residency | Choose region per workload sensitivity. | EU PII, China, regulated public sector: region constraint may dictate procurement path. |
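To make the 5-axis scoring concrete, here is a minimal sketch of the ranking step. The axis weights and the candidate rows are illustrative placeholders, not values from pilot-selection-worksheet.html; the worksheet’s own ranked verdicts and blocker flags take precedence.

```python
# Illustrative only: axis weights and candidate scores (1-5 per axis) are placeholders,
# not the worksheet's real rubric. Blocker flags from anti-use-cases.md still apply first.
AXES = {"value": 0.30, "speed": 0.20, "data": 0.20, "risk": 0.15, "sponsor": 0.15}

candidates = {
    "support-ticket triage":   {"value": 4, "speed": 5, "data": 4, "risk": 4, "sponsor": 5},
    "contract clause review":  {"value": 5, "speed": 2, "data": 2, "risk": 2, "sponsor": 3},
    "weekly ops report draft": {"value": 3, "speed": 4, "data": 5, "risk": 4, "sponsor": 2},
}

def score(c: dict) -> float:
    """Weighted sum across the five axes; higher means a stronger pilot candidate."""
    return sum(weight * c[axis] for axis, weight in AXES.items())

for name, c in sorted(candidates.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{score(c):.2f}  {name}")
```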
Goal: prove the use case works with Claude in your environment, and surface your real constraints before they become production blockers.
| Week | Workstream | Owner | Output |
|---|---|---|---|
| 1 | Use case decomposition: who uses this, when, what data, what output | Pilot lead | Workflow diagram + data flow map |
| 1 | Model + feature selection — start with feature-decision-matrix.html | Architect | Stack decision (model tier, caching, MCP, Skills, citations) |
| 2 | Eval set v0 — 30–80 representative inputs with expected outputs (a minimal runner sketch follows this table) | Pilot lead + 1 SME | Versioned evalset in the team’s repo |
| 2 | First prompt + Skill — minimum viable, not pretty | Engineer | First passing eval run |
| 3 | Cost + latency baseline using cost-calculator.html inputs from real traffic | Architect | Cost-per-task estimate + monthly $ projection |
| 3 | Governance shadow — security/risk reviewer attached to standups, not blocking | Risk function | Issues log (not gate) |
| 4 | User trial with 5–10 pilot users, instrumented | Pilot lead | Usage logs + qualitative feedback + accuracy delta vs baseline |
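A minimal sketch of what the week-2 evalset and first eval run can look like, assuming a JSONL file of `{"input": ..., "expected": ...}` cases and the `anthropic` Python SDK. The file path, substring-match pass criterion, and model ID are illustrative; real runs should use the graded checks from eval-starter-pack.md.

```python
# Minimal eval runner sketch. Assumes evals/v0.jsonl holds one JSON object per line,
# e.g. {"input": "Summarise ticket #123 ...", "expected": "refund"}.
# The model ID and the substring pass criterion are placeholders.
import json

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

def run_case(case: dict, model: str = "claude-sonnet-4-5") -> bool:
    """Send one eval input to the model and check the expected substring appears."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": case["input"]}],
    )
    output = response.content[0].text
    return case["expected"].lower() in output.lower()

with open("evals/v0.jsonl") as f:
    cases = [json.loads(line) for line in f if line.strip()]

passed = sum(run_case(c) for c in cases)
print(f"{passed}/{len(cases)} cases passed")
```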
If any criterion fails: do not proceed to weeks 5–8. Either iterate within weeks 1–4 (extend by ≤ 2 weeks) or kill the pilot and write the postmortem.
Goal: codify what worked, paper over what didn’t, and make the pattern repeatable for the next team.
For the 8-category eval scaffolding (regression, format compliance, tool-call accuracy, grounding, adversarial, cost-per-task, latency, refusal calibration) — each framed by what it catches / failure-mode / owner — see eval-starter-pack.md. Pick 2–3 categories per use case in Phase 1; expand through Phase 2.
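To make the blocking posture concrete: a small gate script that CI can run after the eval job, failing the build when the regression category drops below a threshold. The results path, JSON shape, and 90% threshold are assumptions to adapt per use case; eval-starter-pack.md’s blocking-vs-advisory matrix decides which categories get this treatment.

```python
# CI gate sketch: exit non-zero when the regression pass rate is below threshold,
# so the merge is blocked. Paths, JSON shape, and threshold are placeholders.
import json
import sys

THRESHOLD = 0.90                              # minimum regression pass rate
RESULTS_PATH = "evals/results/latest.json"    # written by your eval runner

with open(RESULTS_PATH) as f:
    results = json.load(f)  # e.g. {"regression": {"passed": 43, "total": 48}, ...}

regression = results["regression"]
pass_rate = regression["passed"] / regression["total"]
print(f"regression pass rate: {pass_rate:.1%} (threshold {THRESHOLD:.0%})")

sys.exit(0 if pass_rate >= THRESHOLD else 1)  # non-zero exit fails the CI job
```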
Goal: prove the pattern repeats, and stand up the structure so the 3rd–10th use case doesn’t need the transformation team’s hand-holding.
| Activity | Output |
|---|---|
| Onboard use case #2 with a different team | New evalset + cost projection + governance shadow following weeks 1–4 template |
| Stand up Center of Excellence (3–5 people, fed by pilot team alumni) | Charter, intake form, weekly office hours, templates |
| Internal documentation — how to start a Claude project | One-pager + a build-vs-buy-worksheet.html walkthrough |
| Training — eng + product + risk | 2-hour intro, 1-day deep-dive, ongoing Slack channel |
| Reusable Skills + plugins library | Versioned, discoverable, evalled |
| Shared MCP servers for org-wide systems (a connector sketch follows this table) | One CRM connector, one ticketing connector — not 14 |
| Quarterly governance review | Risk register update, no-train policy verification, model version refresh, cost trend |
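A sketch of what one shared connector can look like, using the MCP Python SDK’s FastMCP helper. The server name, tool, and stubbed CRM lookup are hypothetical; the point is that the COE publishes one server with a stable tool surface instead of every team writing its own.

```python
# Shared CRM connector sketch using the MCP Python SDK (pip install mcp).
# Server name, tool name, and the stubbed lookup are illustrative placeholders;
# real CRM client calls and auth live behind the tool function.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("org-crm")

@mcp.tool()
def lookup_account(account_id: str) -> dict:
    """Return the CRM record for a single account ID."""
    # Replace with your CRM client. Keep the tool signature stable so every
    # team points Claude at this server rather than building a 15th connector.
    return {"account_id": account_id, "status": "stub"}

if __name__ == "__main__":
    mcp.run()
```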
Each failure mode is scored on probability (Low / Med / High) and cost-if-hit (★ = ~$1K, ★★ = ~$10K, ★★★ = ~$100K, ★★★★ = ~$1M, ★★★★★ = $10M+ including regulatory exposure). The early-signal column is the observable indicator that fires before the failure mode lands — wire alerting on these signals, not on the failure itself. The detection-latency column is the gap between the early signal and visible damage.
| # | Pattern | Prob | Cost | Early signal | Detection latency | Mitigation |
|---|---|---|---|---|---|---|
| 1 | Pilot purgatory | High | ★★ | Week 4 retro: nobody can name the 2nd use case | Weeks | Pre-commit 2nd use case in Week 0 charter — see pilot-selection-worksheet.html. Score 2–6 candidates so a backup exists. |
| 2 | Eval debt | High | ★★★ | Prompt or Skill change merged without eval run | Months | Block CI on missing eval pass — see eval-starter-pack.md blocking-vs-advisory matrix. |
| 3 | Cost surprise | Med | ★★★★ | Daily $ trending up >20% week-over-week, or single workload >50% of total | Days | Wire 4 numeric gates ($/task, $/day, cache floor, batch floor) — see governance-overlay.md §15. Hook-enforced, not invoice-discovered. |
| 4 | Prompt sprawl | High | ★★ | Two teams shipping similar Skills independently; no shared registry | Weeks | Canonical Skills library + COE registry by Week 10 — see claude-code-starter-skills.md. |
| 5 | Governance afterthought | High | ★★★★ | Risk function not on Week 1 stand-up; no DPA/BAA log | Months | Embed risk reviewer in Week 1 (advisory not blocking) — see governance-overlay.md. Issues surface early, cheaply. |
| 6 | Vendor concentration panic | Med | ★★ | CFO/board ask “what if Anthropic disappears?” in QBR | Weeks | governance-overlay.md §12 multi-model abstraction at the right layer. Don’t pre-build a 3-model fallback you’ll never use. |
| 7 | Model deprecation thrash | Med | ★★★ | Anthropic announces deprecation date for a pinned model | Hours | COE owns model-bump runbook; pin family not point release; gate on regression eval pass — see eval-starter-pack.md. |
| 8 | The “AI committee” tax | High | ★★★ | Decision queue >7 days; no single sponsor name on use case | Weeks | Single named sponsor with veto — see pilot-selection-worksheet.html sponsor-clarity axis. Committee informs, doesn’t decide. |
Scoring posture. Probability and cost are calibrated against post-mortems from public AI rollout failures + the pattern frequencies reported by this playbook’s own readers. Re-calibrate quarterly; if your org sees a different distribution, override the scores in your fork. The shape (early signal → detection latency → mitigation) is portable; the specific scores are not.
Below: each mode in prose, with the original symptom + fix framing. Use the heatmap above to pick which to monitor first; use the prose below for runbook depth.
| # | Pattern | Symptom | Fix |
|---|---|---|---|
| 1 | Pilot purgatory | Pilot succeeds, never scales. No second use case identified. | Pre-commit to the 2nd use case in Week 0 charter — even if you swap it later. |
| 2 | Eval debt | Prompts evolve faster than the evalset. Quality regresses unnoticed. | Block prompt changes in CI without eval pass. Owner: pilot lead. |
| 3 | Cost surprise | Month 4 bill is 5× the pilot’s monthly run rate. | Cost dashboard live by Week 6, weekly review. Cap-with-alert per use case (a gate sketch follows this table). |
| 4 | Prompt sprawl | Every team writes its own copy of the same instruction set. | Skills + plugins library by Week 10. COE owns the canonical versions. |
| 5 | Governance afterthought | Risk function shows up in Month 5 with blocking issues. | Embed risk reviewer in Week 1 (advisory, not blocking). Issues surface early, cheaply. |
| 6 | Vendor concentration panic | CFO/board asks “what if Anthropic disappears?” | Address in governance-overlay.md §12. Multi-model abstraction at the right layer. Don’t pre-build a 3-model fallback you’ll never use. |
| 7 | Model deprecation thrash | Anthropic rev-bumps; quality moves; nobody owns re-eval. | COE owns the model bump runbook. Pin model family, not point release (a pinning sketch follows this table). |
| 8 | The “AI committee” tax | Decisions take weeks. Nothing ships. | Single named sponsor with veto. Committee informs, doesn’t decide. |
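For failure mode 3, a sketch of a hook-enforced check over three of the four numeric gates ($/task, $/day, cache floor). The dollar values, metrics file, and field names are placeholders; the real gates come from governance-overlay.md §15 and your own Week 0 charter.

```python
# Cost-gate sketch (failure mode 3). Gate values, metrics path, and field names are
# placeholders; wire the real ones from governance-overlay.md into a scheduled job
# or pre-deploy hook so breaches page someone before the invoice does.
import json
import sys

GATES = {
    "usd_per_task_max": 0.25,   # $/task ceiling
    "usd_per_day_max": 400.0,   # $/day ceiling
    "cache_hit_floor": 0.50,    # minimum share of input served from cache
}

with open("metrics/today.json") as f:      # emitted by your usage-logging pipeline
    m = json.load(f)  # e.g. {"usd_total": 310.0, "tasks": 1400, "cache_hit_rate": 0.62}

usd_per_task = m["usd_total"] / max(m["tasks"], 1)
breaches = []
if usd_per_task > GATES["usd_per_task_max"]:
    breaches.append(f"$/task {usd_per_task:.3f} over ceiling {GATES['usd_per_task_max']}")
if m["usd_total"] > GATES["usd_per_day_max"]:
    breaches.append(f"$/day {m['usd_total']:.0f} over ceiling {GATES['usd_per_day_max']:.0f}")
if m["cache_hit_rate"] < GATES["cache_hit_floor"]:
    breaches.append(f"cache hit {m['cache_hit_rate']:.0%} under floor {GATES['cache_hit_floor']:.0%}")

print("\n".join(breaches) if breaches else "all cost gates green")
sys.exit(1 if breaches else 0)
```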
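For failure mode 7, a sketch of pinning the model choice in one place so a version bump is a single, eval-gated change rather than a per-team scramble. The environment variable names and default model IDs are illustrative assumptions; verify current IDs against the model surface noted at the top of this playbook.

```python
# Model-pinning sketch (failure mode 7). Env var names and default model IDs are
# illustrative; the COE bumps them only after the regression eval gate passes.
import os

MODEL_ALIASES = {
    "drafting": os.environ.get("CLAUDE_MODEL_DRAFTING", "claude-sonnet-4-5"),
    "triage":   os.environ.get("CLAUDE_MODEL_TRIAGE", "claude-haiku-4-5"),
}

def model_for(workload: str) -> str:
    """Single lookup point every service uses instead of hard-coding a point release."""
    return MODEL_ALIASES[workload]

print(model_for("drafting"))
```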
You can run a small adoption with 4–6 people. Larger orgs will scale each function but the three roles don’t merge.
- executive-briefing.html — the 10-slide deck for sponsor + board
- cost-calculator.html — model your projected $/month
- pilot-selection-worksheet.html — Week 0 use-case scoring (5 axes, ranked verdicts, blocker flags)
- feature-decision-matrix.html — pick the right Claude features for the pattern
- build-vs-buy-worksheet.html — score a use case for Claude direct vs. alternatives
- reference-architectures.html — 6 canonical patterns with diagrams
- governance-overlay.md — risk + compliance overlay
- eval-starter-pack.md — 8 evaluation templates (regression, format, tool-call, grounding, adversarial, cost, latency, refusal) — Weeks 5–8 scaffolding
- claude-code-adoption-guide.md — engineering-team rollout for Claude Code
- claude-code-starter-skills.md — 8 team-grade Skill templates for Claude Code rollouts

© gmanch94 · CC-BY-4.0 · As of 2026-05. Verify model surface + pricing at anthropic.com.