The Builder's Compendium
Vol. III · No. 001 · v1.0

Platform.Claude.com · Managed Agents · Skills · Evals

An opinionated field guide to the Claude platform surface — 34 features, 15 competitive positions, 25 use cases for builders and advisors, and 20 working prompts that assume you have API keys and a terminal open.

Compiled by
SuRaM
Principal Consultant, Forcey

what platform.claude.com actually is

There are two Claudes. The one most people know lives at claude.ai — a chat surface with Projects, Artifacts, and Memory. It is a tool. The other Claude lives at platform.claude.com — and it is not a tool. It is infrastructure.

The platform surface has four concentric layers. At the centre is the Messages API — single requests to the model, with tool use, structured outputs, and prompt caching. Around that, the developer tooling: Workbench for drafting prompts, Prompt Improver and Generator for scaffolding, the Eval Tool for measurement. Around that, agent primitives: Agents API, Environments, Sessions, Vaults, and Skills. Around that, the operational surface: workspaces, spend caps, SSO, audit logs. Each layer is accessible in isolation; together, they are how production AI products get built.

The difference between a prompt that works and a prompt that ships is the platform surface. Workbench to find it; Evals to trust it; Managed Agents to run it; workspaces to keep it from bankrupting you.

This volume catalogues it. The Feature Catalog lists every capability as of April 2026. Vs. The Industry maps each to the incumbents it displaces, contests, or respects. Twenty-Five Use Cases are tagged by domain — agency builds, internal tools, productised apps, developer workflow, and advisory. The Twenty Builder Prompts at the back are the working recipes: opinionated, hybrid-voice, code where code sharpens the point.

A note on what this volume isn't. It isn't a reference for Claude.ai — that's Volume I. It isn't a prompt library for consultants using Claude — that's Volume II. It is for the reader who is building with Claude, or advising clients who build.

every capability on the platform surface

34 features across 7 categories. Each with availability and an opinionated competitive note.

The platform surface groups into seven categories: prompt development, evaluation and testing, managed agents, skills, models and infrastructure, tools and integrations, and admin and billing. Features within each are listed with availability, a one-paragraph description, and an honest competitive note.

Prompt Development

Workbench

Browser-based prompt prototyping studio at platform.claude.com/workbench. Test prompts with variable substitution ({{var}} syntax), tune temperature and max-tokens, save versions, and export as code in 8 languages.

Availability Platform (all plans with API access)
Vs. OpenAI Playground: sharper on structured prompting, weaker on UI polish. Vs. LangSmith Studio: simpler, less opinionated. The best place to draft a prompt you'll later ship in production.
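Workbench's {{var}} substitution is plain string templating. A minimal sketch of the same behaviour in Python — the helper name is ours, not an SDK function, and leaving unknown placeholders intact is our choice here so they surface in review:

```python
import re

def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Substitute {{var}} placeholders; unknown placeholders are left
    intact (our convention) so a missing variable is visible, not silent."""
    def repl(match: re.Match) -> str:
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\{\{(\w+)\}\}", repl, template)

template = "Summarise the following {{doc_type}} for {{audience}}."
print(render_prompt(template, {"doc_type": "RFC", "audience": "executives"}))
# → Summarise the following RFC for executives.
```

The same template string pastes into Workbench unchanged, which keeps local tests and the hosted prompt in sync.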

Prompt Generator

"Describe a task; get a working prompt" tool. Produces scaffolded first drafts with XML structure, examples, and chain-of-thought by default.

Availability Platform / Console
Vs. hand-writing from scratch: removes the blank-page problem. The output is a starting point, not a finish line — every prompt in Vol. II was originally seeded this way then heavily rewritten.

Prompt Improver

Takes an existing prompt and adds chain-of-thought steps, XML tags, clearer structure. Benchmarks show ~30% accuracy gain on classification tasks.

Availability Platform / Console
Vs. manual editing: faster for baseline improvements. Will over-engineer simple prompts — watch for bloat.

Examples Manager

Structured few-shot examples with clear input/output pairs. Auto-generates synthetic examples if you have none. Examples are inserted at the start of the first user message in the actual API call.

Availability Platform / Console (within Workbench)
Vs. pasting examples into the prompt text: less error-prone, versioned with the prompt, easier to swap out.
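That insertion point matters for cost math: examples land inside the first user message, so their tokens bill on every request. A sketch of the equivalent hand-rolled payload (construction only — the example content and model id are illustrative):

```python
few_shot = [
    ("Invoice #992, net 30, $4,200",
     '{"id": "992", "terms": "net 30", "amount": 4200}'),
    ("PO 7741 due on receipt, $310",
     '{"id": "7741", "terms": "due on receipt", "amount": 310}'),
]

# Examples Manager inserts the pairs at the start of the first user
# message; hand-rolled, that is a single concatenated turn:
examples_block = "\n\n".join(
    f"<example>\n<input>{inp}</input>\n<output>{out}</output>\n</example>"
    for inp, out in few_shot
)

payload = {
    "model": "claude-sonnet-4-6",   # illustrative id
    "max_tokens": 1024,
    "system": "Extract invoice fields as JSON.",
    "messages": [
        {"role": "user",
         "content": examples_block + "\n\nInvoice #1003, net 15, $88"},
    ],
}
```

Counting `examples_block` tokens gives you the per-request overhead the closing gotcha in Prompt No. 01 warns about.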

Evaluation & Testing

Evaluation Tool

Create test-case suites with variable substitution, grade outputs on a 5-point scale, side-by-side compare prompt versions, auto-generate test cases, import from CSV.

Availability Platform / Console
Vs. PromptLayer / LangSmith: simpler, tied to Anthropic. Vs. bespoke eval harness: faster to start, less flexible. Best for structured prompts where the right answer is checkable.

Ideal Output Column

Optional column in the eval sheet where you record the target output. Used for benchmarking and regression testing prompt changes.

Availability Platform / Console (within Eval Tool)
Vs. just grading outputs: gives you a ground truth to regress against, not just a quality score.

Prompt Versioning

Every Workbench prompt has a version history. Re-run the eval suite against any historical version to see how changes affected performance.

Availability Platform / Console
Vs. git-for-prompts workflows: less portable, more integrated. Makes A/B comparisons almost free.

Managed Agents (Beta)

Managed Agents

Server-managed stateful agents launched April 8, 2026. Anthropic hosts the sandbox, credentials, state, checkpointing. $0.08/hr runtime + token costs. Removes the "second job" of building agent infrastructure.

Availability Platform / API (requires managed-agents-2026-04-01 beta header)
Vs. OpenAI Assistants API: comparable surface, stronger on long-running tasks. Vs. LangGraph + self-hosted infra: 10x faster to production, less flexibility. Vs. Amazon Bedrock Agents: simpler onboarding, weaker AWS-ecosystem integration.

Agents API

Create an agent once (model, system prompt, tools, MCP servers, skills), reference it by ID across sessions. The "agent" is the persistent config, the "session" is each execution.

Availability Platform / API
Vs. stateless tool-use loops: better for complex workflows. Vs. LangChain agents: more opinionated, less DIY plumbing.
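The agent/session split can be made concrete. The request shapes below are an assumption sketched for illustration — the beta SDK's actual field names may differ; the point is the pattern: configure once, run many sessions against the id.

```python
# Hypothetical request bodies — field names are our guess at the beta
# surface, not copied from the SDK. Config once, sessions many.
agent_config = {
    "model": "claude-sonnet-4-6",              # illustrative model id
    "system": "You are the nightly data-quality agent.",
    "tools": [],                               # tool definitions, as in the Messages API
    "mcp_servers": ["slack"],                  # hypothetical reference by name
    "skills": ["invoice-triage"],              # hypothetical skill ids
}

# Created once; the returned id is then referenced by every session:
agent_id = "agent_abc123"                      # placeholder for the create-call response

session_request = {
    "agent_id": agent_id,
    "events": [{"role": "user", "content": "Run tonight's check."}],
}
```

Everything that varies per run lives in the session request; everything stable lives in the agent config.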

Environments

Container configuration: pre-installed packages (Python, Node.js, Go, etc.), network access rules, mounted files, credential injection. Define once, reuse across sessions.

Availability Platform / API
Vs. Docker + your own orchestrator: zero infra effort. Vs. Replit-for-agents tools: Anthropic-managed, single-vendor.

Sessions

Each session is a fresh container with persistent event history. Stream events via server-sent events (SSE). Interrupt or guide mid-execution.

Availability Platform / API
Vs. one-shot tool-use: supports hours-long workflows. Vs. polling-based agent APIs: the SSE streaming is cleaner for UX.

Vaults

Secure credential storage for agent sessions. Inject API keys, database connections, and secrets without exposing them to the model context.

Availability Platform / API
Vs. environment variables in your own infra: no key rotation headache. Vs. HashiCorp Vault: narrower scope, Anthropic-native.

Outcomes, Multiagent, Memory (Research Preview)

Three experimental features gated by access request: outcome-based task completion signals, multi-agent orchestration, and persistent per-agent memory across sessions.

Availability Platform / API (request access)
Vs. the current beta: these are what separate "agent framework" from "agent platform". Worth requesting access even if you don't use them yet.

Skills (Platform)

Agent Skills

Reusable instruction packs with code, config, and context. Attach up to 20 per session. Same concept as claude.ai Skills, but installable in Managed Agents and distributable as artifacts.

Availability Platform / API / claude.ai
Vs. custom GPTs: compose together, work across surfaces, don't require a Store listing. Vs. LangChain tools: skills are richer (instructions + code + context), not just function wrappers.

Custom Skill Upload

Upload your own Skills as zip files through Settings > Features. Available on Pro, Max, Team, and Enterprise with code execution enabled.

Availability Platform / claude.ai (Pro+)
Vs. prompt templates in your own repo: Claude loads them contextually, no engineering overhead. Currently per-user — no org-wide central management yet.

Skill Builder

Create and edit Skills with a structured editor. The SKILL.md frontmatter pattern (--- name: ... description: ... ---) plus markdown body. Preview, test, publish.

Availability Platform / Claude Code
Vs. writing raw markdown files: lower friction for non-developers. Vs. Anthropic's example skills repo: fully customisable for your domain.
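For orientation, here is the frontmatter-plus-body shape the editor produces. The skill name and body text are invented for illustration:

```
---
name: invoice-triage
description: Categorise inbound invoices and flag anomalies for the finance team.
---

# Invoice Triage

Use this skill when the user asks to process or categorise invoices.
Extract id, terms, and amount; flag any amount over the approval threshold.
```

The `description` field does the heavy lifting — it is what Claude reads when deciding whether to load the skill.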

claude-api Skill (Built-in)

Official skill that teaches Claude the Messages API, Managed Agents API, and SDKs. Covers 8 languages for Messages API, 7 for Managed Agents. Uses progressive disclosure to keep context efficient.

Availability Platform / Claude Code (bundled)
Vs. reading docs yourself: Claude loads only the doc fragments relevant to your current task. The meta-move of using Claude to build with Claude.

Models & Infrastructure

Model Picker

Opus 4.7 (flagship), Sonnet 4.6 (balanced speed/intelligence), Haiku 4.5 (fast/cheap). Deprecated: Sonnet 4, Opus 4 (retire June 15, 2026); Haiku 3 (retired April 19, 2026).

Availability Platform / API
Vs. OpenAI's GPT-4o / 4.5 / o-series: cleaner model-family framing, more predictable pricing. Migrate deprecated models before sunset dates.

Extended Thinking

Sonnet 4.6 and Opus 4.7 support extended thinking — visible reasoning tokens before the final response. Tunable thinking budget. 1M-token context on Sonnet 4.6 beta.

Availability Platform / API
Vs. OpenAI o-series: comparable reasoning depth, more transparent about thinking tokens. Vs. hiding the reasoning: extended thinking is itself debuggable output.
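Enabling extended thinking is one request parameter. A minimal payload sketch — the model id is illustrative, and note that `max_tokens` must exceed the thinking budget:

```python
payload = {
    "model": "claude-sonnet-4-6",   # illustrative id; check your workspace's current models
    "max_tokens": 16000,            # must be larger than the thinking budget
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8000,      # tunable: how many tokens the model may spend reasoning
    },
    "messages": [{"role": "user", "content": "Plan the migration in steps."}],
}

assert payload["max_tokens"] > payload["thinking"]["budget_tokens"]
```

The thinking blocks come back in the response content, which is what makes the reasoning debuggable output rather than a black box.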

Advisor Tool (Beta)

Pair a faster executor model with a higher-intelligence advisor model that provides strategic guidance mid-generation. Long-horizon workloads approach advisor-solo quality at executor-model cost. Beta header: advisor-tool-2026-03-01.

Availability Platform / API
Vs. running everything on Opus: massive cost savings on long tasks. Vs. manually switching models: automated hand-off, no app logic needed.

Prompt Caching

Cache large static prompt prefixes (system prompt, long context) and reuse them across requests at ~10% of normal cost. Cache writes cost slightly more than normal tokens; reads cost a fraction.

Availability Platform / API
Vs. OpenAI's prompt caching: broadly comparable, different TTL defaults. The single biggest cost lever for any agent that re-sends context.
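Wiring it up is a single `cache_control` marker on the last static block — everything up to and including that block becomes the cached prefix. A payload sketch (model id and corpus placeholder are illustrative):

```python
LONG_CONTEXT = "[large static corpus goes here]"   # the expensive, unchanging part

payload = {
    "model": "claude-sonnet-4-6",   # illustrative id
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You answer questions about the corpus below."},
        {
            "type": "text",
            "text": LONG_CONTEXT,
            # Everything up to and including this block is the cached prefix;
            # later requests with an identical prefix pay the cheap read rate.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [{"role": "user", "content": "What changed in Q3?"}],
}
```

Only the final user turn varies between requests, so an agent that re-sends its context every loop iteration pays near-read rates after the first call.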

Batch API

Submit up to 100,000 requests in a batch, processed within 24 hours, at 50% of standard token prices.

Availability Platform / API
Vs. OpenAI Batch API: similar discount, similar latency. Vs. real-time API for bulk work: half the cost, slower turnaround.
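The batch request shape is a list of `custom_id` + `params` pairs, where `params` is an ordinary Messages API body. A construction-only sketch (model id illustrative; the submit call is shown commented out):

```python
docs = {"doc-001": "first document text", "doc-002": "second document text"}

requests = [
    {
        "custom_id": doc_id,                   # your key for matching results later
        "params": {
            "model": "claude-haiku-4-5",       # illustrative id; bulk work suits the cheap tier
            "max_tokens": 512,
            "messages": [{"role": "user", "content": f"Summarise:\n{text}"}],
        },
    }
    for doc_id, text in docs.items()
]

assert len(requests) <= 100_000                # the per-batch ceiling stated above
# client.messages.batches.create(requests=requests)
# ...then poll the batch until it has ended and fetch results by custom_id.
```

Results come back keyed by `custom_id`, not in submission order, so choose ids you can join back to your source data.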

Code Execution (Sandboxed)

Server-side Python execution as a first-party tool. Free when paired with web_search or web_fetch; standalone pricing otherwise.

Availability Platform / API
Vs. running your own Python sandbox: no security plumbing. Vs. OpenAI's Code Interpreter: tighter integration with the tool-use schema.

Structured Outputs

Force responses to conform to a JSON schema. Reduces parsing errors and LLM-vs-code contract failures.

Availability Platform / API
Vs. OpenAI structured outputs: comparable feature parity. Vs. regex-parsing free-form output: orders of magnitude more reliable.
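One widely used way to get schema-shaped output is a forced tool call: define a tool whose `input_schema` is your JSON schema, then force it with `tool_choice`. A sketch, with an invented tool name and illustrative model id:

```python
extract_tool = {
    "name": "record_invoice",                  # hypothetical tool name
    "description": "Record the extracted invoice fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
            "amount":     {"type": "number"},
            "terms":      {"type": "string"},
        },
        "required": ["invoice_id", "amount", "terms"],
        "additionalProperties": False,
    },
}

payload = {
    "model": "claude-sonnet-4-6",              # illustrative id
    "max_tokens": 1024,
    "tools": [extract_tool],
    # Forcing the tool means the response's tool_use input conforms to the schema:
    "tool_choice": {"type": "tool", "name": "record_invoice"},
    "messages": [{"role": "user", "content": "Invoice #1003, net 15, $88"}],
}
```

The structured payload then arrives as the `input` of a `tool_use` block, already parsed — no regex on free text.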

Streaming

Server-sent event streaming for real-time response rendering. Essential for chat UIs and long-form generation UX.

Availability Platform / API
Standard across major AI APIs. Anthropic's SSE format is clean and well-documented.
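If you handle the SSE stream by hand rather than via the SDK's streaming helper, the text lives in `content_block_delta` events. A minimal parser over sample lines shaped like the documented events:

```python
import json

def text_from_sse(lines):
    """Collect text deltas from a Messages API SSE stream
    (event shapes per Anthropic's documented streaming format)."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue                            # skip event-name and blank lines
        event = json.loads(line[len("data: "):])
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                out.append(delta.get("text", ""))
    return "".join(out)

sample = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hel"}}',
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "lo"}}',
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
print(text_from_sse(sample))  # → Hello
```

In practice the Python SDK's `client.messages.stream(...)` context manager does this parsing for you; the hand-rolled version is useful when proxying the stream to a browser.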

Tools & Integrations

Tool Use

Claude calls your functions with typed arguments. You execute; Claude reads the result and continues. The primitive beneath every agent framework.

Availability Platform / API
Vs. OpenAI function calling: comparable. Claude's tool-use JSON schema has fewer footguns around strict mode.
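The "you execute; Claude reads the result" half of the loop reduces to a small dispatcher: run the requested handler, wrap the result in the `tool_result` shape the API expects back. A sketch with an invented weather tool:

```python
def dispatch(tool_use, handlers):
    """Run the model's requested tool and wrap the result in the
    tool_result block the next user message must carry."""
    result = handlers[tool_use["name"]](**tool_use["input"])
    return {
        "type": "tool_result",
        "tool_use_id": tool_use["id"],     # must echo the id from the tool_use block
        "content": str(result),
    }

handlers = {"get_weather": lambda city: f"18°C and clear in {city}"}  # your function

# A tool_use content block as it appears in an assistant response:
tool_use = {"type": "tool_use", "id": "toolu_01",
            "name": "get_weather", "input": {"city": "Reykjavik"}}

print(dispatch(tool_use, handlers)["content"])  # → 18°C and clear in Reykjavik
```

The returned block goes back in a `user` message, and the loop repeats until the model stops requesting tools.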

MCP Servers in Agents

Managed Agents can call any MCP server — Gmail, Drive, Slack, GitHub, custom ones. Same protocol as Desktop app MCP, but runs server-side.

Availability Platform / API (within Managed Agents)
Vs. writing custom tool wrappers: MCP is becoming the de facto standard, so your investment is portable.

Web Search Tool

First-party web search as a tool call. Real-time results with citation metadata.

Availability Platform / API
Vs. wiring in Tavily/Perplexity search APIs: no extra vendor, no separate billing.

Web Fetch Tool

First-party URL fetch with HTML-to-markdown extraction, rate limiting, and domain allow/block lists.

Availability Platform / API
Vs. rolling your own fetcher: battle-tested. Vs. Firecrawl / Jina: free when paired with web search.

Admin & Billing

Workspace Management

Multi-workspace support with separate keys, rate limits, spend caps. Invite team members, assign roles, audit usage per workspace.

Availability Platform / Console
Vs. OpenAI orgs: cleaner model for agencies managing multiple clients. Each client can live in its own workspace with its own spend cap.

Usage Dashboard

Real-time spend, request volume, rate-limit headroom, model-by-model breakdown. Export to CSV.

Availability Platform / Console
Vs. cobbling together with webhooks and BI tools: good-enough out of the box for small/mid teams.

Spend Limits & Alerts

Hard spend caps per workspace; email alerts at configurable thresholds. Prevents runaway agent costs.

Availability Platform / Console
Vs. "hope the bill doesn't blow up": essential for anyone running agents unsupervised. Set these before shipping.

SSO & RBAC (Team/Enterprise)

SAML SSO, SCIM provisioning, custom roles scoped to specific Claude capabilities per group.

Availability Platform / Console (Team/Enterprise)
Vs. Enterprise OpenAI: comparable controls, cleaner UI for role definition.

API Key Rotation

Generate, revoke, and rotate API keys per workspace. Keys are scoped to workspace permissions.

Availability Platform / Console
Standard. Do it on schedule, not after an incident.

who the platform threatens, competes with, and respects

The platform surface competes with OpenAI Platform, Amazon Bedrock, Google Vertex AI, and a constellation of open-source and specialised builders' tools — LangSmith, LangGraph, PromptLayer. The table below is opinionated by design. Displaces means the incumbent is at real risk for the stated workload. Contested means both options have real merit and the choice depends on team profile. Respects means the moat still holds.

Claude capability vs. incumbent — verdict and what actually happens:

• Managed Agents vs. OpenAI Assistants API — Contested. OpenAI has a broader tool ecosystem; Claude leads on long-running tasks and checkpoint reliability. Many teams run both for different workloads.
• Managed Agents vs. LangGraph + self-hosted infra — Displaces, for teams that would otherwise spend 3-6 months building agent infrastructure. LangGraph still wins for teams needing maximum flexibility or multi-model orchestration.
• Managed Agents vs. Amazon Bedrock Agents — Contested. Bedrock wins on AWS-ecosystem integration, VPC, and compliance-heavy environments. Claude wins on onboarding speed and model quality.
• Managed Agents vs. Vertex AI Agent Builder (Google) — Contested. Vertex wins on GCP-native data access (BigQuery, Vertex Search). Claude wins on agent-framework simplicity and non-GCP portability.
• Workbench + Eval Tool vs. OpenAI Playground — Displaces, for structured prompt work. OpenAI Playground is simpler but lacks the Eval Tool's test-suite versioning.
• Workbench + Eval Tool vs. LangSmith — Contested. LangSmith wins on cross-model observability and traces. Claude's tooling wins on prompt iteration speed and an Anthropic-first workflow.
• Workbench + Eval Tool vs. PromptLayer / Helicone — Respects. These are observability-first products; Claude's tools are iteration-first. Different jobs.
• Skills (Platform) vs. Custom GPTs (OpenAI) — Displaces, for internal team use. Skills compose, trigger automatically, travel across surfaces, and don't require a public Store listing. Custom GPTs have larger user reach.
• Skills (Platform) vs. LangChain tools / LlamaIndex — Contested. Skills bundle instructions + code + context, which is richer than a tool wrapper. LangChain wins on plug-and-play community libraries.
• Advisor Tool vs. mixture-of-experts routing (your own) — Displaces. Rolling your own executor-advisor pairing required hours of plumbing; Advisor Tool ships it as a beta header.
• Prompt Caching vs. OpenAI prompt caching — Respects. Feature parity: both cost ~10% of the normal rate on cache reads. Anthropic's TTL defaults are more generous for long-context agents.
• Batch API vs. OpenAI Batch API — Respects. 50% discount, 24-hour SLA, same models. Different batch-size limits. A commodity feature at this point.
• Extended Thinking vs. OpenAI o-series — Contested. Both deliver reasoning-heavy output. Claude exposes thinking tokens more transparently; OpenAI's o3 edges ahead on certain benchmarks.
• Structured Outputs vs. OpenAI structured outputs / Pydantic integrations — Respects. A parity feature: both reliably enforce JSON schemas; tool-use integrations differ slightly.
• Workspace Management & RBAC vs. Enterprise OpenAI — Respects. Broadly comparable admin controls; Team-tier features roughly matched.

These verdicts are opinionated, and open to argument. Managed Agents against LangGraph is the most debatable row — teams with strong infra chops and specific flexibility needs will come out the other way. That is fine. The volume's job is to give readers a defensible starting position, not a foregone conclusion.

tagged by domain; selected for commercial leverage

Platform-surface use cases fall into five domains: agency builds, internal tools, productised apps, developer workflow, and advisory. Each use case is a seed — enough to recognise an opportunity, not so much that it displaces your own thinking.

Agency Builds

Where your firm's methodology becomes a distributable, paid product.

No. 01

Productise a client methodology as a Skill

Turn your consulting firm's signature framework into a reusable Skill. Every consultant in the firm gets it for free; new hires get it on day one. The framework becomes a distributable asset, not tribal knowledge.

No. 02

White-label Managed Agent for each client

Spin up a client-specific Managed Agent with their data, their tools, their brand. Each lives in its own workspace with its own spend cap. Onboarding a new client takes an hour, not a sprint.

No. 03

Spend-capped client pilots

Use workspace spend limits to run bounded-risk pilots. Client gets a working agent within their budget; you get the confidence to quote fixed-fee implementations.

No. 04

Eval suites as acceptance criteria

Convert client requirements into an Eval Tool test suite. Every prompt iteration is measured against it. Acceptance is when the eval score crosses the threshold, not when someone 'likes it'.

No. 05

Reusable agent templates across clients

Build one canonical agent-environment-skill configuration; deploy variants to every client. The meta-framework is your IP.

Internal Tools

Where teams build their own operators.

No. 06

RAG-over-knowledge-base without building RAG

Use a Managed Agent with web_fetch tool pointed at your internal docs. Skip the vector DB, chunking, reranking. Good enough for most teams who thought they needed RAG.

No. 07

Nightly data-quality agent

A Managed Agent that wakes up, queries your data warehouse, flags anomalies, writes to a Slack channel. Your first 'analytics engineer that doesn't sleep'.

No. 08

Customer support triage agent

Zendesk MCP + a triage agent that categorises, tags, suggests replies, and escalates. Not replacing humans — giving them a clean queue.

No. 09

Internal help-desk bot with eval-gated rollout

Build in Workbench, measure with Eval Tool against 50 historical tickets, ship to Slack when accuracy crosses 85%.

No. 10

Finance-ops month-end closer

Agent reads bank feeds, categorises transactions, flags reconciliation gaps, generates the month-end report. The bits humans hated doing.

Productised Apps

Vertical SaaS with Managed Agents as backbone.

No. 11

Vertical SaaS powered by Managed Agents

The infra that would have required a seed round of AI platform engineers is now $0.08/hr. Build your vertical-specific agent; charge for the judgement, not the GPUs.

No. 12

White-label AI assistant for non-tech businesses

Managed Agents + MCP + a thin UI = an assistant for the law firm, the accountant, the recruitment agency. Branding stays theirs.

No. 13

Agent Marketplace entry

Package your agent as a Skill bundle. Distribute via your own site, partnerships, or (coming) marketplace listings.

No. 14

Multi-tenant agent with Vaults

One agent config, many customer instances, isolated credentials via Vaults. The cleanest multi-tenancy pattern for agent-based SaaS.

No. 15

Voice-first app built on Managed Agents

Deepgram or ElevenLabs for voice + Managed Agent for brain. The hardest piece (stateful agent + tool use) is outsourced to Anthropic.

Developer Workflow

How the platform improves the act of building.

No. 16

Prompt Improver as a pre-commit hook

Every prompt changed in your repo runs through Prompt Improver and Eval Tool before merge. Regressions caught at PR time, not in production.

No. 17

Advisor Tool for cost-sensitive agent workloads

Route bulk work to Sonnet; hand off hard decisions to Opus via Advisor. Your agent bill drops 60-80% with minimal quality loss.

No. 18

Batch API for research pipelines

Reprocess 50,000 historical documents overnight at half price. Entire research-scale operations priced like a single contractor-day.

No. 19

Prompt version + eval diff as the standard PR review

Every prompt change in git includes an eval-result diff. Reviewers see 'this prompt now scores 4.2 vs 3.8 on the 50-case suite', not vibes.

No. 20

claude-api Skill for onboarding new engineers

New engineer joins the team, opens Claude Code, asks 'start onboarding for managed agents in Claude API'. They're productive in an hour.

Advisory

Where a consultant's role is about the platform, not on it.

No. 21

Platform-selection teardown for a client

For an enterprise client deciding between Anthropic, OpenAI, Bedrock, Vertex: produce a side-by-side evaluation using the feature catalog in Section II as the rubric.

No. 22

Build vs buy analysis for agent infrastructure

Quantify the 'second job' — sandboxing, state, credentials, checkpointing — that Managed Agents removes. Use real hour estimates; justify the decision.

No. 23

Responsible AI governance for client agents

Help a client stand up agent governance: spend caps per workspace, eval-gated deployment, audit logging, red-team protocols.

No. 24

Migration plan from Assistants API to Managed Agents

For clients who bought into OpenAI Assistants early and now want to compare. Produce a phased migration plan with cost and quality comparisons.

No. 25

Cost model for agent-based products

A spreadsheet model: tokens-per-session, sessions-per-day, cache hit rate, advisor routing, batch vs real-time. Input to any agent-product business case.
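The spreadsheet reduces to one formula. A sketch in Python — all the prices below are placeholders, not quoted rates; plug in current list prices before using this in a business case:

```python
def monthly_cost(
    sessions_per_day: float,
    input_tokens: int,            # per session, before caching
    output_tokens: int,           # per session
    cache_hit_rate: float,        # fraction of input tokens served from cache
    in_price: float,              # $/Mtok input  — placeholder, use current list price
    out_price: float,             # $/Mtok output — placeholder
    cache_read_discount: float = 0.1,   # cache reads at ~10% of the input rate
    runtime_hours: float = 0.0,   # managed-agent runtime per session
    runtime_rate: float = 0.08,   # $/hr, per the beta pricing above
) -> float:
    """30-day cost: fresh input + cached input + output + agent runtime."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    per_session = (
        fresh / 1e6 * in_price
        + cached / 1e6 * in_price * cache_read_discount
        + output_tokens / 1e6 * out_price
        + runtime_hours * runtime_rate
    )
    return per_session * sessions_per_day * 30

# Illustrative numbers only — 200 sessions/day, 40k in / 2k out, 80% cache hits:
print(round(monthly_cost(200, 40_000, 2_000, 0.8, 3.0, 15.0, runtime_hours=0.1), 2))
```

Sensitivity-testing `cache_hit_rate` and the advisor/batch routing mix is usually where the model changes someone's mind.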

working recipes for the platform surface

Jump directly to any prompt below, or scroll through.

Each prompt follows the same five-field structure as Volume II: Setup, The Prompt, Expected Output, Variations, and Gotchas. The voice is hybrid — strategic framing where the decision is strategic, real code where it sharpens the point. Placeholders are in [SQUARE BRACKETS]. Where a prompt references Managed Agents, the managed-agents-2026-04-01 beta header is assumed; the SDK sets it automatically. Where a prompt references advisor-tool, the advisor-tool-2026-03-01 header is required.

Some of these prompts are 700 words long. That is not padding. These are prompts that produce production-grade artefacts — architectures, cost models, memos with 8-9 structured parts. A one-paragraph prompt gets you a one-paragraph answer. Don't trim them on first use.

No. 01

The First Workbench Prompt You Write

A template for setting up any new prompt in Workbench correctly from the first save.

Setup
  • Claude Console access (platform.claude.com) with at least one API key.
  • A clear description of the task your prompt should handle.
  • 2-3 examples of 'good' input/output pairs. Even rough ones.
  • A decision on temperature: 0 for deterministic extraction/classification, 0.7-1 for creative generation.
  • Pick your model: Haiku 4.5 for fast iteration, Sonnet 4.6 for balanced production, Opus 4.7 for complex reasoning.
The Prompt
You are helping me set up a new Workbench prompt. I will describe the task; you produce a complete Workbench-ready prompt with all scaffolding.

TASK DESCRIPTION:
[Paste 2-3 paragraphs describing what the prompt should do, who it's for, what good and bad outputs look like.]

EXAMPLES I HAVE:
[Paste 2-3 input/output pairs, even rough ones. If you have none, say so — I'll generate synthetic ones.]

CONSTRAINTS:
• Target model: [Opus 4.7 / Sonnet 4.6 / Haiku 4.5]
• Target temperature: [0 / 0.3 / 0.7 / 1]
• Max tokens expected in output: [N]
• Must not: [things the prompt must refuse or avoid]
• Must always: [non-negotiable behaviours]

PRODUCE:

## PART 1: THE SYSTEM PROMPT
Structured with:
• Role framing (one sentence, concrete)
• Task statement (what, for whom, to what standard)
• Input format description (what the {{variables}} will contain)
• Output format specification (structured if possible — JSON schema, XML, or delimited)
• Failure modes and how to handle them
• Tone/voice instructions if relevant

Use XML tags for each section (<role>, <task>, <input_format>, etc.) — this is Anthropic's recommended structure.

## PART 2: THE USER MESSAGE TEMPLATE
With {{variable}} placeholders matching the input format. Wrap each variable in XML tags for clarity.

## PART 3: 3 EXAMPLES
In Workbench's structured Examples format (input → output pairs). If I provided fewer than 3, generate synthetic ones in the same style.

## PART 4: EVAL TEST CASES (seeds)
Produce 5 test cases spanning:
• Happy path (2 cases)
• Edge case (1 case — empty input, unusual format, etc.)
• Adversarial (1 case — attempted prompt injection, off-topic)
• Boundary (1 case — right at the limit of what the prompt should handle)

Format as a CSV I can paste into the Eval Tool.

## PART 5: TEMPERATURE & MODEL JUSTIFICATION
One paragraph explaining why the temperature and model I chose are right for this task. If you think a different choice would be better, say so and why.

## PART 6: "WHAT YOU'LL WANT TO ADD LATER"
Three things this prompt doesn't do now but you'd expect to need within 2-4 weeks of production use. Rate-limiting logic, fallback behaviour, observability hooks, etc.
Expected Output

A Workbench-ready prompt that follows Anthropic's structural best practices, plus 5 seed eval cases and commentary on model/temperature choice. Typically saves 30-45 minutes on first-pass setup and catches the 'I forgot to specify output format' mistake before it ships.

Variations
  • For a classification task specifically: 'Additionally produce the confusion-matrix labels and the eval metrics (accuracy, precision, recall) I should track.'
  • For a generation task: 'Include a quality rubric with 4-5 dimensions I can grade responses against.'
  • For a tool-use prompt: 'Include the tool-use JSON schema and 3 example tool-call sequences the prompt should handle.'
Gotchas
  • Do not skip the Part 6 'what you'll want to add later' — it's the single most useful part for preventing tech debt.
  • The synthetic examples Claude generates are often too clean. Manually corrupt one to stress-test the prompt.
  • Workbench's temperature slider goes 0-1, and the Claude API caps temperature at 1 as well — if you're coming from OpenAI's 0-2 range, don't expect the upper half.
  • Examples appear inside the first user message in the actual API call. Count their tokens toward your cost calculations.
No. 02

Eval Suite for a Production Prompt

Turn an existing prompt into a measured prompt in one afternoon.

Setup
  • An existing Workbench prompt you want to make evaluable.
  • 10-30 real examples of inputs your prompt handles (scrubbed of sensitive data).
  • A success definition: what does 'the prompt worked' mean? Bias it toward measurable outcomes.
  • Eval Tool access (standard with Console).
The Prompt
I want to build a proper eval suite for an existing prompt before I trust it in production.

PROMPT TO EVALUATE:
[Paste the full system prompt and user message template.]

REAL INPUT SAMPLES:
[Paste 10-30 actual inputs the prompt handles in production. Anonymised.]

SUCCESS DEFINITION:
[One paragraph. Example: "The prompt should correctly extract the 5 structured fields 95% of the time, gracefully say 'insufficient information' 100% of the time when a field is genuinely absent, and never hallucinate a value that isn't in the source."]

TASK:

## PART 1: CASE CATEGORISATION
From my 10-30 samples, cluster them into categories. Report:
• How many cases per category
• What makes each category distinct
• Which categories are currently under-represented and where I should gather more samples

## PART 2: IDEAL OUTPUTS
For each of the samples, produce the ideal output (what a perfect response looks like). Use my prompt's expected format.

If any sample is genuinely ambiguous and has multiple defensible correct answers, flag it and list them.

## PART 3: THE TEST SUITE
A CSV formatted for direct import into the Eval Tool, with columns:
• case_id
• category (from Part 1)
• input (with variable names matching my prompt's {{placeholders}})
• ideal_output
• grading_criteria (what to check for — format, specific field values, tone, completeness)
• severity_if_wrong (blocker / high / medium / low)

Rows: all my samples + 5 synthetic edge cases you design (empty inputs, malformed inputs, adversarial inputs, inputs near the capability boundary, inputs in a different language if relevant).

## PART 4: RUBRIC FOR THE 5-POINT GRADE
Define what each of the 5 grades means for THIS prompt, specifically:
• 5 = [specific description]
• 4 = [specific description]
• 3 = [specific description]
• 2 = [specific description]
• 1 = [specific description]

Not generic ("excellent, good, fair..."). Tied to the actual task.

## PART 5: REGRESSION-TEST PROTOCOL
When should I re-run this suite?
• What changes should trigger a full re-run?
• What changes can use a subset?
• What's the target score that blocks a release?

## PART 6: THE CASES YOU EXPECT TO FAIL
Your best guess at which 3-5 test cases are most likely to fail on the first run, and why. This calibrates my expectations before I click Run.
Expected Output

A CSV you can paste directly into the Eval Tool, a task-specific grading rubric, and a regression-test protocol that turns prompt changes into releases, not vibes. The 'cases you expect to fail' section often predicts the actual failures with uncanny accuracy.

Variations
  • For prompts where outputs are long-form text: 'Include an LLM-as-judge rubric with the exact judge prompt I should use.'
  • For prompts that call tools: 'Additionally grade the correctness of the tool calls (right tool, right arguments) separately from the final text output.'
  • For prompts handling user-generated content: 'Include 3 red-team cases per common attack pattern: prompt injection, PII extraction attempt, jailbreak.'
Gotchas
  • Ideal outputs are surprisingly hard. If you can't articulate what good looks like for a case, the prompt wasn't specified well enough. Go fix the prompt.
  • Don't grade on rubric dimensions that don't matter. A 7-dimension rubric sounds rigorous and is mostly noise. 3 well-chosen dimensions beat 7 vague ones.
  • The eval score is a compass, not a contract. A 4.2 vs 3.8 comparison is meaningful; chasing 4.7 vs 4.6 is usually overfitting.
  • Anonymise before pasting. Real customer data in the Eval Tool is a data-residency conversation you probably don't want to have.
No. 03

Your First Managed Agent

From zero to a running Managed Agent in 90 minutes, without building a second job's worth of infrastructure.

Setup
  • Claude API access with Managed Agents beta enabled (request access via Console).
  • The managed-agents-2026-04-01 beta header (SDK sets this automatically).
  • A clear single-purpose task the agent should handle end-to-end.
  • One to three tools or MCP servers the agent will need.
  • A spend cap set on the workspace (seriously, before you run anything).
The Prompt
I want to build my first Managed Agent. Walk me through it as a peer pair-programming session, not a tutorial. I have API access and the beta header; I've written prompts but never shipped a Managed Agent.

AGENT PURPOSE:
[One paragraph. Example: "A nightly agent that fetches yesterday's production errors from our logging API, clusters them by stack trace, and posts a prioritised summary to our #eng-oncall Slack channel."]

TOOLS/MCP NEEDED:
[List. Example: "Logging API (REST), Slack MCP, optional web_search for error-message lookups"]

TASK:

## PART 1: THE AGENT DEFINITION
Produce the exact API call to create the agent, in Python (using the official Anthropic SDK).

Include:
• Model choice (with reasoning — Sonnet 4.6 is usually right; justify the choice if you pick differently)
• System prompt (with a clear role, task statement, tool-use policy, failure handling)
• Tool definitions with JSON schemas
• MCP server references
• Any skills to attach

## PART 2: THE ENVIRONMENT CONFIG
The container setup:
• Pre-installed packages needed
• Network access rules (allow-list, not deny-list)
• Mounted files (if any)
• Vault references for credentials

Explain each choice. Tell me which defaults I should question.

## PART 3: THE SESSION LAUNCH
The code to start a session, send the initial user event, and stream SSE events back.

Include:
• How to parse the SSE stream
• Where to catch errors
• How to checkpoint and resume if the session fails mid-execution
• How to interrupt and redirect mid-run if the agent goes off-track

## PART 4: THE DEV LOOP
How I should iterate on this agent day-to-day:
• Local testing setup (what runs locally vs what's server-side)
• How to use Workbench to iterate on the system prompt without redeploying
• How to version the agent config
• How to run eval suites against the agent (not just the prompt)

## PART 5: SAFETY RAILS
Before I point this at production:
• What spend cap should I set and why that number
• What inputs should trigger human-in-the-loop escalation
• What outputs should trigger a halt
• What logs to capture for incident response

## PART 6: THE 90-MINUTE PLAN
Break the work into 6 × 15-minute increments. At each checkpoint, what should be working? What's the 'if this is broken, stop and fix' criterion?

Write this as I would actually execute it, not as a book chapter. Keep it in peer-voice.
Expected Output

A Python scaffold (≈80-150 lines), a container config explanation, a session-handling pattern, and a concrete 90-minute execution plan with checkpoints. This is the fastest path from 'I've heard of Managed Agents' to 'I have one running'.
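The session launch in Part 3 stands or falls on SSE handling, so here is a minimal, dependency-free sketch of the frame format: frames separated by blank lines, each with `event:` and `data:` lines. The event names and payloads below are invented for illustration; check the real Managed Agents event types against the API docs.

```python
import json

def parse_sse(raw: str):
    """Parse a Server-Sent Events stream into (event, data) tuples."""
    events = []
    event_type, data_lines = None, []
    for line in raw.splitlines() + [""]:   # trailing "" flushes the last frame
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # Blank line terminates a frame; multi-line data joins with \n.
            events.append((event_type, json.loads("\n".join(data_lines))))
            event_type, data_lines = None, []
    return events

# Illustrative stream; real event names will differ.
stream = (
    'event: message_delta\n'
    'data: {"text": "Clustering errors..."}\n'
    '\n'
    'event: session_done\n'
    'data: {"status": "completed"}\n'
)
for name, payload in parse_sse(stream):
    print(name, payload)
```

In production you would feed chunks from the HTTP response into an incremental version of this loop rather than buffering the whole stream, but the frame logic is the same.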

Variations
  • For a TypeScript-first stack: 'Produce the same artefacts in TypeScript using the @anthropic-ai/sdk package.'
  • For a pure-MCP agent (no custom tools): 'Skip the JSON schemas for custom tools; show how to compose existing MCP servers only.'
  • For an agent that runs for hours: 'Additionally cover checkpoint-and-resume, progress-reporting, and how to design the system prompt for long-horizon work.'
Gotchas
  • Managed Agents is in beta. The managed-agents-2026-04-01 header is mandatory — your SDK should set it, but if you're using raw HTTP, you must set it manually.
  • The $0.08/hr runtime is separate from token costs. A 24/7 agent is ~$58/month in runtime alone, before any tokens. Budget accordingly.
  • Set spend caps before you run anything. An infinite-loop bug can cost a meaningful amount before you notice.
  • SSE stream handling is where most first-time Managed Agent bugs live. If your local client hangs or drops events, the agent is fine — your stream parser is broken.
  • The 'outcomes, multiagent, memory' research preview features are gated. Don't design around them for a first project; build with the generally-available beta surface first.
No. 04

From Prompt to Skill — The Reuse Pattern

When a Workbench prompt is getting used in three places, it's a Skill. Here's how to make the leap cleanly.

Setup
  • A Workbench prompt that's being reused across 2+ projects — copy-pasted or re-typed.
  • Skill Builder access (Platform or Claude Code).
  • Understanding of the SKILL.md frontmatter pattern (name, description, body).
The Prompt
I have a Workbench prompt that's being reused across multiple projects. Help me convert it to a proper Skill.

THE PROMPT:
[Paste the full system prompt, user message template, and any examples.]

WHERE IT'S BEING USED:
[List the 2-5 places this prompt currently lives. Example: "1. Our customer-onboarding agent. 2. A Zapier workflow. 3. An internal tool at the marketing team's request."]

TASK:

## PART 1: SHOULD THIS BE A SKILL?
Honest answer first. Not every reused prompt should be a Skill. Evaluate:
• Is the prompt stable, or still rapidly iterating? (Skills want stability.)
• Does it have a clear trigger — a description that tells Claude when to use it?
• Is it self-contained, or does it depend on context from the calling system?
• Is it genuinely reusable, or is each usage a slight variant that would diverge over time?

Recommend: convert to Skill / keep as Workbench prompt / split into multiple Skills / rewrite the calling systems instead.

If the recommendation is 'don't convert', stop here and explain why. Don't force it.

## PART 2: IF YES, THE SKILL SPEC
Produce the complete SKILL.md:

---
name: [skill-name-in-kebab-case]
description: [1-2 sentences that tell Claude when to use this Skill, not what it is]
---

# [Skill Title]

## Purpose
[One paragraph.]

## When to Use
[Bullet list of specific trigger conditions.]

## When NOT to Use
[Bullet list — critical for stopping Claude from over-applying.]

## Instructions
[The instructions, rewritten for Skill-first use. Should NOT assume the prompt structure of the original Workbench prompt. Should stand alone.]

## Examples
[3 concrete input → approach → output triads.]

## Output Format
[Spec of what a Skill-generated response should look like.]

## Failure Handling
[What to do when the input is outside scope.]

## PART 3: THE MIGRATION PLAN
For each of the places the prompt is currently used:
• What changes
• What breaks if it changes wrong
• How to validate before rollout
• What to do if the Skill-generated output differs from the prompt-generated output

## PART 4: THE TESTING PROTOCOL
Eval suite for the Skill that covers:
• The original use cases (regression)
• New use cases the Skill now unlocks that the old prompt couldn't handle
• Edge cases where Claude might incorrectly trigger the Skill

## PART 5: DISTRIBUTION
Where should this Skill live?
• Personal (your claude.ai settings)
• Team (shared in your org's Claude workspace)
• Open-source (Anthropic skills repo or your GitHub)
• Bundled (inside a Managed Agent config)

Recommend one, with reasoning.

## PART 6: THE THING YOU'LL GET WRONG FIRST
Based on the prompt and its usage patterns, your best guess at the most likely failure mode once it's a Skill. This is where I'll spend my first week of debugging — warn me.
Expected Output

A SKILL.md file ready to install, a migration plan for the existing usages, and a realistic warning about where the conversion will create new failure modes. The 'should this be a Skill?' gate saves you from wrongly-converted prompts, which are worse than the original.
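Because the description field is load-bearing, a small lint pass before installing pays for itself. This sketch assumes the SKILL.md convention named in the Setup (frontmatter delimited by `---`, with `name` and `description` keys); the kebab-case rule and length threshold are this recipe's own conventions, not a spec.

```python
import re

def check_skill_frontmatter(skill_md: str):
    """Lint SKILL.md frontmatter: returns a list of problems (empty = pass)."""
    m = re.match(r"^---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if not m:
        return ["missing --- frontmatter block"]
    # Parse simple 'key: value' lines from the frontmatter body.
    fields = dict(
        line.split(":", 1) for line in m.group(1).splitlines() if ":" in line
    )
    fields = {k.strip(): v.strip() for k, v in fields.items()}
    problems = []
    name = fields.get("name", "")
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        problems.append(f"name {name!r} is not kebab-case")
    if len(fields.get("description", "")) < 20:
        problems.append("description too short to act as a trigger")
    return problems

good = (
    "---\n"
    "name: invoice-triage\n"
    "description: Use when the user asks to classify or route an incoming invoice.\n"
    "---\n"
    "# Invoice Triage\n"
)
assert check_skill_frontmatter(good) == []
```

Run it in CI next to the eval suite from Part 4: a Skill whose frontmatter regresses is a Skill that silently stops triggering.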

Variations
  • For a Skill that needs its own code (not just instructions): 'Include the script files the Skill should ship with and where they live in the Skill's directory structure.'
  • For a Skill intended for open-source release: 'Additionally produce the README.md, LICENSE recommendation, and contribution guidelines.'
  • For a Skill that will be called inside a Managed Agent: 'Show the agent-config change needed to attach this Skill and the tool-call examples it should produce.'
Gotchas
  • The 'description' field in SKILL.md is load-bearing. Claude uses it to decide whether to load the Skill. Vague descriptions mean the Skill never triggers.
  • Skills have a 20-per-session limit in Managed Agents. Don't create Skills for things that are really tools.
  • Custom Skills on Pro/Max are per-user, not shared org-wide. If you need team distribution, you're doing it manually for now.
  • A Skill that depends on context outside itself (files, env vars, previous messages) will silently fail. Skills should be self-contained.
  • 'Skills that are really prompts' is one common failure mode. 'Prompts that should be Skills' is another. The Part 1 gate catches both.
No. 05

The Advisor Tool Routing Strategy

Pair a fast executor with a smarter advisor. Keep quality, cut costs 60-80%.

Setup
  • Advisor Tool beta access (advisor-tool-2026-03-01 header).
  • An agent or long-horizon workflow that currently runs entirely on Opus 4.7.
  • Sample inputs and the observed cost per task today.
  • A quality bar: what's acceptable degradation vs what isn't?
The Prompt
I want to adopt the Advisor Tool strategy on [AGENT/WORKFLOW]. Currently it's running fully on Opus 4.7; I suspect most of the work doesn't need Opus.

CURRENT SETUP:
• Workflow: [brief description]
• Current model: Opus 4.7 throughout
• Average cost per task: $[X]
• Average tokens per task: [N input / M output]
• Quality bar: [how you measure, and the current score]

TASK:

## PART 1: WORK-TYPE TAXONOMY
Break the workflow into distinct step-types:
• Retrieval steps (fetch, parse, extract) — executor territory
• Transformation steps (format, rewrite, summarise) — executor with occasional advisor
• Judgement steps (classify, prioritise, choose) — this is where the advisor earns its keep
• Generation steps (write, design, synthesise) — advisor for the hard ones, executor for templated

For each step in my workflow, categorise it.

## PART 2: THE ROUTING STRATEGY
For each step, prescribe:
• Which model should run it (Executor: Sonnet 4.6 or Haiku 4.5; Advisor: Opus 4.7)
• What triggers advisor intervention (confidence threshold, specific conditions, always)
• What the advisor gets to see (full context, summarised context, just the question)
• What the advisor returns (guidance, full rewrite, thumbs up/down)

## PART 3: IMPLEMENTATION CODE
Produce Python (or TypeScript if preferred) showing:
• The advisor-tool beta header
• The executor-advisor message pattern
• How to pass state between them
• How to handle cases where the advisor disagrees with the executor

## PART 4: COST MODEL
A spreadsheet-ready breakdown:
• Current cost: 100% Opus. $ per task.
• Naïve Sonnet: $ per task (and expected quality drop).
• Advisor routing: $ per task (and expected quality preservation).
• Break-even: at what task volume does advisor routing pay back the engineering effort to ship it?

Don't just claim 60-80% savings — show me the math for my specific workflow.

## PART 5: THE QUALITY REGRESSION TEST
Before going live, I need to prove the advisor-routed version matches Opus quality on MY workflow.

Design:
• The test set (size, composition, sourcing)
• The grading protocol (LLM-as-judge, human eval, task-success metrics)
• The pass criterion (how much degradation is acceptable, per work-type)
• The fall-back protocol (when to revert to pure-Opus)

## PART 6: THE FIRST THING THAT WILL BREAK
Advisor routing fails in specific ways. Your best prediction of where this particular workflow will degrade, and what to monitor.
Expected Output

A routing strategy specific to your workflow, implementation code, a cost model with real math, and a regression-test design. Expect 60-80% cost reduction on long workflows with minimal quality loss — but not until you've validated on your specific task shape.
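To make the Part 4 cost model concrete, a spreadsheet-free version of the math. Every number here is a placeholder (blended per-million-token prices, invented step sizes, an assumed escalation rate); substitute your real rate card and measured token counts.

```python
def routed_cost_per_task(steps, executor_price, advisor_price, advisor_fraction):
    """Rough cost model for executor+advisor routing.

    steps: list of (input_tokens, output_tokens) per workflow step.
    Prices are $ per million tokens, blended across input/output for
    simplicity. advisor_fraction is the share of work escalated to the
    advisor. All of this is a placeholder model, not a rate card.
    """
    total_tokens = sum(i + o for i, o in steps)
    pure_advisor = total_tokens * advisor_price / 1e6
    routed = total_tokens * (
        executor_price * (1 - advisor_fraction)
        + advisor_price * advisor_fraction
    ) / 1e6
    return pure_advisor, routed

# Illustrative only: 10 steps of ~5k tokens, advisor 5x the executor price,
# 20% of the work escalated.
pure, routed = routed_cost_per_task(
    [(4000, 1000)] * 10, executor_price=3.0, advisor_price=15.0,
    advisor_fraction=0.2,
)
print(f"pure advisor ${pure:.2f}/task vs routed ${routed:.2f}/task")
```

With these invented numbers the routed version comes out around 64% cheaper, which is exactly why the escalation rate is the parameter to measure first: the savings collapse as advisor_fraction climbs.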

Variations
  • For workflows already running on Sonnet (not Opus): 'Flip the question — which steps would benefit from advisor help from Opus, vs keeping Sonnet throughout?'
  • For workflows with strict latency SLAs: 'Additionally address how advisor round-trips affect p95 latency, and where to skip the advisor to meet the SLA.'
  • For multi-tenant SaaS: 'Address per-tenant routing: should premium tenants get more advisor coverage than free-tier? Show the config pattern.'
Gotchas
  • Advisor Tool is in beta. The routing pattern works, but expect occasional oddities. Keep your fallback to pure-Opus warm.
  • The savings are real but not uniform. Some workflows see 70% cost reduction; some see 10%. Run the cost model before claiming numbers.
  • Advisor intervention adds latency — typically 1-3 seconds per intervention. For interactive UIs, choose intervention points carefully.
  • The executor can occasionally ignore or overfit the advisor's guidance. The Part 5 regression test catches this; don't skip it.
  • Quality-check on your own data. Benchmark gains reported in Anthropic blog posts assume task profiles that may not match yours.
No. 06

MCP Server for Your Own Data

Expose your proprietary systems to Claude as a first-class MCP server. Reusable across Claude.ai, Desktop, Claude Code, and Managed Agents.

Setup
  • A data source or system you want Claude to read/write: internal DB, REST API, SaaS system, file store, custom service.
  • Node.js or Python development environment (MCP SDKs available in both).
  • A hosting plan: local stdio for personal use, HTTP/SSE for team use, Cloudflare Worker or similar for distributed use.
  • Authentication story: API keys, OAuth, mTLS, or none (if local).
The Prompt
I want to build an MCP server that exposes [SYSTEM] to Claude. I've used MCP servers before but never built one.

SYSTEM TO EXPOSE:
[Description. Example: "Our internal customer database. Read-only for most operations; specific write endpoints for updating customer tags and notes."]

TARGET CONSUMERS:
[Where will this MCP server be used? Example: "Desktop app for our sales team, Managed Agents for automation, possibly Claude Code for internal dev tools."]

AUTH:
[API key / OAuth / mTLS / none. Example: "API keys scoped to a workspace, rotated quarterly."]

TASK:

## PART 1: THE TOOL SURFACE DESIGN
Before any code: what tools should this server expose?

For each tool:
• Name (verb_noun convention)
• One-sentence description (this is what Claude reads to decide when to call it)
• Parameters (name, type, required, description)
• Return type (and example response)
• Failure modes (what errors, when)

Rules:
• Prefer composable tools (list_customers, get_customer, update_customer_tag) over monolithic ones (manage_customers).
• Every parameter needs a description — Claude uses these to decide which to pass.
• Return types should be JSON, structured, with consistent error shapes.
• Don't expose destructive operations without an explicit confirmation step.

## PART 2: THE SERVER SKELETON
Produce the complete MCP server in [Python / TypeScript — pick Python unless I specified otherwise].

Include:
• SDK setup (mcp package)
• Tool registration with JSON schemas
• Auth middleware (API key validation or OAuth flow)
• Logging (request/response, latency, errors)
• Health-check endpoint
• Rate limiting or per-client throttling

Keep it production-grade but minimal. No premature optimisation.

## PART 3: THE HOSTING PLAN
For my target consumers [from brief], what's the right deployment model?
• stdio (local, per-user)
• HTTP/SSE (hosted, team-accessible)
• Cloudflare Worker / Vercel Edge (globally-distributed)

Recommend one and justify. Include the deploy command or config file.

## PART 4: CLIENT CONFIGURATION
Show the config changes needed in:
• Claude Desktop (.mcpb install or manual JSON)
• Claude Code (settings or config file)
• Managed Agents (agent config snippet)

Include the exact JSON/YAML each client expects.

## PART 5: SECURITY REVIEW
Before exposing this to a model, walk through:
• What data is at rest vs in transit
• What operations are read-only vs mutating
• What happens if a prompt injection tricks Claude into calling a destructive tool
• How to add a 'confirmation required' pattern for high-stakes operations
• What to log for forensics

## PART 6: THE TESTING STRATEGY
• Unit tests for each tool function
• Integration test that calls the running server via the MCP protocol
• End-to-end test in Claude Desktop or a Managed Agent session
• Red-team test: what happens with adversarial inputs

Produce the unit test file for Part 2's skeleton.

## PART 7: THE ONE TOOL YOU SHOULDN'T HAVE ADDED
Reviewing the tool surface from Part 1: is there one you'd advise against exposing on day one? What's the risk?
Expected Output

A complete, deployable MCP server with well-defined tool surface, auth, logging, and client configs. Part 5 and Part 7 are the sections that separate a working MCP server from a safe one.
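As a reference point for Part 1, here is what one tool from the example customer-database surface might look like as a JSON-schema definition. The name/description/input_schema layout follows the common tool-definition convention, and every parameter carries a description per the rules above; the customer fields themselves are invented.

```python
# One composable tool (verb_noun name), not a monolithic manage_customers.
update_customer_tag = {
    "name": "update_customer_tag",
    "description": (
        "Add or remove a single tag on one customer record. Use when the "
        "user asks to re-label, categorise, or flag a specific customer."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Internal customer ID, e.g. 'cus_1234'.",
            },
            "tag": {
                "type": "string",
                "description": "The tag to add or remove.",
            },
            "action": {
                "type": "string",
                "enum": ["add", "remove"],
                "description": "Whether to add or remove the tag.",
            },
        },
        "required": ["customer_id", "tag", "action"],
    },
}
```

Note the description is written as a trigger ("use when..."), not a definition: as the gotchas below say, tool descriptions are prompts.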

Variations
  • For a read-only server: 'Skip the mutating operations entirely. Focus Part 5 on data-leak prevention rather than destructive-call prevention.'
  • For an MCP server backing a SaaS product: 'Address multi-tenancy: tenant isolation, per-tenant rate limits, per-tenant auth.'
  • For an MCP server consumed by Managed Agents only: 'Emphasise server-to-server patterns (mTLS, IP allow-lists) rather than user-facing OAuth.'
Gotchas
  • Tool descriptions are prompts. Claude reads them to decide what to call. Vague descriptions mean missed tool calls; verbose descriptions cost context. Tune.
  • MCP is still evolving. Breaking changes happen. Pin SDK versions; watch the MCP spec repo.
  • Don't expose your entire DB as a single free-form 'query' tool; arbitrary SQL generated by a model is an injection surface you built yourself. Expose intention-specific tools instead.
  • Local stdio servers are the easiest to ship and the hardest to distribute. Plan for HTTP hosting if more than one person will use it.
  • Logging sensitive data in MCP servers is how you accidentally create a new data-residency surface. Log identifiers, not contents.
No. 07

Prompt Caching for Cost-Sensitive Agents

Cut agent token costs by 70-90% on long-context workflows with correctly-placed cache control.

Setup
  • An agent or prompt that sends large static context on every turn (system prompt, reference docs, examples).
  • Current token usage and cost breakdown.
  • Understanding that cache writes cost ~25% more than normal input tokens, while cache reads cost ~10% of the normal price.
The Prompt
I want to add prompt caching to [AGENT/PROMPT] to cut costs on repetitive context.

CURRENT SETUP:
• System prompt length: [N tokens]
• Static context (docs, examples, reference material): [N tokens]
• Dynamic context per turn (user message, tool results): [N tokens]
• Average turns per session: [N]
• Average sessions per day: [N]
• Current monthly cost: $[X]

TASK:

## PART 1: CACHE-ELIGIBILITY ANALYSIS
Walk through my message structure. For each block:
• Block type (system prompt, doc context, examples, user message, tool output, assistant reply)
• Size in tokens (estimated)
• Volatility (never changes / changes per session / changes per turn / always changes)
• Caching recommendation (always cache / conditionally cache / never cache)

The rule: cache stable blocks that are at least 1024 tokens and used in multiple turns or sessions.

## PART 2: THE CACHED MESSAGE STRUCTURE
Rewrite my message construction to add cache_control markers in the right places.

Show:
• Where the cache_control: {"type": "ephemeral"} markers go
• Why order matters: cached blocks must come before dynamic blocks
• The TTL default (5 min) and when to use 1-hour TTL instead (beta)
• What to do about the system prompt specifically

Include Python code for the correctly-structured message construction.

## PART 3: THE COST MODEL
Given my usage pattern, compute:
• Cache write cost per first-session turn
• Cache read cost per subsequent turn
• Break-even: how many cache hits do I need to come out ahead?
• Expected monthly cost post-caching (best/realistic/worst case)

Be honest about the worst case. If my traffic is too sparse for caching to help, say so.

## PART 4: THE CACHE-HIT MONITORING
Production caching requires observability. Design:
• What to log per request (cache_read_tokens, cache_write_tokens, cache_creation_tokens)
• Dashboards to build (cache hit rate, tokens-saved-per-day, cost-per-session)
• Alerts (cache hit rate drops below threshold, cache writes spike suggesting cache eviction)

## PART 5: CACHE INVALIDATION STRATEGY
The things that will break your cache:
• System prompt changes (any modification invalidates)
• Context doc updates (same)
• Different cache_control marker placement

How to deploy prompt changes without catastrophic cache-miss days:
• Staged rollout
• Warm-up period
• A/B variants during transition

## PART 6: WHEN CACHING DOESN'T HELP
Honest assessment. Caching is a bad fit when:
• Each session has unique reference material (per-customer docs, dynamic knowledge bases)
• Traffic is too sparse (fewer than ~10 requests per cache TTL)
• The static prefix is under ~1024 tokens
• The agent's context is being rewritten each turn anyway

Does any of this apply to me?

## PART 7: THE 1-HOUR TTL DECISION
Anthropic offers a 1-hour TTL (beta) at higher cache-write cost. When is it worth it?
• For my traffic pattern: recommend or not, with math.
Expected Output

A before/after message structure, a cost model with real savings projections, and a monitoring plan. 70-90% savings are real for high-volume agents; 10-30% is more typical for mixed workloads. The Part 6 'honest assessment' is what separates caching that pays off from caching theatre.
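The Part 2 deliverable, sketched: stable blocks first, the cache_control marker on the last stable block, dynamic content after. The dict shape mirrors the pattern described in the prompt; treat the exact field names as assumptions to verify against the current API reference before shipping.

```python
SYSTEM_PROMPT = "You are a support-triage assistant..."  # stable across sessions
REFERENCE_DOCS = "(large stable reference material)"     # stable, e.g. ~30k tokens

def build_request(user_message: str):
    """Construct a request body with the cacheable prefix marked."""
    return {
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": REFERENCE_DOCS,
                # Marker goes on the LAST stable block: everything up to
                # and including this block becomes the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic, per-turn content comes after the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Customer says the export button is greyed out.")
```

The point to internalise is the prefix rule: anything before the marker is cache-eligible, anything after it is paid at full price every turn, so a single reordered block can silently zero out your hit rate.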

Variations
  • For agents with per-customer context that changes: 'Instead of caching static docs, address whether to cache the system prompt + tool schemas separately. Partial caching still helps.'
  • For very high-volume agents: 'Address the 1-hour TTL beta specifically. For sustained high traffic, it can change the math meaningfully.'
  • For agents with user-facing latency: 'Note that cache reads also improve TTFT (time to first token) compared with reprocessing the full uncached prompt. Include a latency comparison, not just cost.'
Gotchas
  • Cache writes cost more than regular tokens. If you're over-caching (blocks that never hit), you're paying a premium for no benefit.
  • The 5-minute default TTL means sparse traffic doesn't benefit. Under ~10 req/5min on a given cache, you're mostly paying for writes.
  • Cache_control markers must be in the right position in the message array. A misplaced marker silently disables caching on everything after it.
  • Streaming + caching + retries can produce duplicate cache writes if your retry logic isn't careful.
  • Anthropic's cache is per-API-key (approximately). Multi-tenant systems sharing one key can see cross-contamination if not careful.
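The sparse-traffic gotcha reduces to one ratio. Using the ~25% write premium and ~10% read price from the Setup notes (verify both against the current pricing page), a single cache hit already pays for the write:

```python
def caching_break_even(prefix_tokens, hits_per_write,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Cached-vs-uncached cost ratio for a stable prefix. Below 1.0 wins.

    Multipliers are the approximate figures from the Setup above, not a
    rate card; plug in current pricing before trusting the output.
    """
    uncached = prefix_tokens * (1 + hits_per_write)  # full price every turn
    cached = prefix_tokens * (
        write_multiplier + hits_per_write * read_multiplier
    )
    return cached / uncached

# One write plus a single hit already wins:
print(round(caching_break_even(30_000, hits_per_write=1), 3))   # → 0.675
# Sparse traffic (cache expires before any hit) pays the premium for nothing:
print(round(caching_break_even(30_000, hits_per_write=0), 3))   # → 1.25
```

The ratio is independent of prefix size; what traffic volume changes is hits_per_write, which is why the ~10 requests per TTL rule of thumb matters.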
No. 08

The Cost-Guardrailed Agent

Before you point an agent at production, put it in a budget box.

Setup
  • An agent that will run without constant human supervision.
  • Admin access to set workspace-level spend caps.
  • A monthly budget — the number you actually want to spend, not the one you're ashamed to admit.
The Prompt
I'm about to ship an autonomous agent to production. Before it runs, I need the full cost-guardrails setup.

AGENT CONTEXT:
• Purpose: [description]
• Expected runs per day: [N]
• Expected tokens per run: [best estimate, range]
• Monthly budget: $[X]
• Consequences of overrun: [what happens if it costs 10x your estimate]

TASK:

## PART 1: MULTI-LAYER BUDGET DESIGN
Design four layers of cost protection:

LAYER 1 — WORKSPACE SPEND CAP
Set the hard ceiling. At what $ should the workspace auto-suspend? This is the "I'm okay with losing this much" number. Usually 1.5-2x the monthly budget.

LAYER 2 — ALERT THRESHOLDS
Email alerts at 50%, 75%, 90% of budget. Route to whom, with what message.

LAYER 3 — AGENT-LEVEL BUDGET
Per-session token caps. Per-run max duration. When the agent approaches its limit, does it halt or escalate?

LAYER 4 — APPLICATION-LEVEL RATE LIMITING
If the agent is triggered by user action or webhook: rate-limit the triggers. Queue, don't DoS yourself.

For each layer, produce the exact config or code change.

## PART 2: RUNAWAY DETECTION
Most cost incidents aren't gradual — they're a bug or injection attack that causes a single run to loop or fan out.

Design:
• Per-run token-count hard limit (halt, not warn)
• Per-run wall-clock limit
• Tool-call count limit (e.g. max 50 tool calls per session)
• Recursion detection (same tool with same args 3+ times)

Produce the wrapper code that enforces these at the application layer.

## PART 3: THE INCIDENT PROTOCOL
When a cost spike happens — what's the response?
• Who gets paged
• First 5 minutes: what to check
• First 30 minutes: what to contain
• Post-incident: what to change

Write this as a runbook I can put in a wiki.

## PART 4: THE BUDGET AS A FEATURE, NOT A LIMIT
The advanced move: make the budget a first-class part of the agent's behaviour.

Show how to:
• Pass the remaining budget into the agent's system prompt ("You have $X left this month")
• Have the agent downgrade model choice as budget tightens (Opus early, Sonnet mid-month, Haiku late)
• Have the agent defer non-urgent work when budget is tight

This turns 'running out of budget' from an incident into a graceful degradation.

## PART 5: THE THINGS YOUR CURRENT BUDGET MISSES
Estimate the cost of:
• Retries (transient failures, rate limits, 5xx from tool servers)
• Prompt injection attacks causing extra token use
• Accidental prompt changes causing cache-miss storms
• Bugs that cause multiple sessions for one logical task

Add each to your budget or to the guardrails.

## PART 6: THE BUDGET POST-MORTEM QUESTION
Imagine you hit your budget cap unexpectedly in week 3. Write the 5 questions the post-mortem should answer — before the incident happens, so you know what to instrument.
Expected Output

A four-layer budget protection system, a runbook, and an 'agent as budget-aware citizen' pattern that turns a hard limit into a graceful curve. The Part 5 'what your current budget misses' is where most first-time-agent teams get surprised.
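Part 2's runaway detection is plain application code, no platform features required. A minimal sketch, with illustrative default limits; wire record_tokens and record_tool_call into your session loop and halt on RunawayError.

```python
import time

class RunawayError(RuntimeError):
    """Raised when a run exceeds a hard limit. Halt, don't warn."""

class RunawayGuard:
    """Application-layer hard limits for one agent run (a Part 2 sketch)."""

    def __init__(self, max_tokens=500_000, max_seconds=1800,
                 max_tool_calls=50, max_repeats=3):
        self.max_tokens, self.max_seconds = max_tokens, max_seconds
        self.max_tool_calls, self.max_repeats = max_tool_calls, max_repeats
        self.tokens = self.tool_calls = 0
        self.started = time.monotonic()
        self.call_counts = {}  # (tool, frozen_args) -> count

    def record_tokens(self, n):
        self.tokens += n
        if self.tokens > self.max_tokens:
            raise RunawayError(f"token limit exceeded: {self.tokens}")
        if time.monotonic() - self.started > self.max_seconds:
            raise RunawayError("wall-clock limit exceeded")

    def record_tool_call(self, tool, args):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise RunawayError("tool-call limit exceeded")
        # Recursion detection: the same tool with the same args repeating.
        key = (tool, tuple(sorted(args.items())))
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.call_counts[key] >= self.max_repeats:
            raise RunawayError(f"recursion suspected: {tool} repeated")
```

Note this runs on the hot path, before each call, which is exactly what the gotchas below demand of runaway detection.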

Variations
  • For multi-tenant SaaS: 'Add per-tenant budget controls, per-tenant fairness algorithms (one noisy tenant shouldn't DoS everyone else), tenant-level cost reporting.'
  • For agencies running client agents: 'Add cross-client budget dashboard, per-client billing extraction (for pass-through pricing), margin tracking.'
  • For high-stakes agents (financial, legal, medical): 'Add the human-approval gates for operations above a cost or risk threshold. Cost is a proxy for scope; big-scope operations need review.'
Gotchas
  • Workspace spend caps are not real-time. There's a lag between hitting the cap and the API rejecting calls. Application-level limits matter more.
  • Alerts at 90% of monthly budget are mostly useless if the spike is happening in an hour. You also need per-hour velocity alerts.
  • 'The agent should stop at $X' is not the same as 'the agent WILL stop at $X'. Test the halt path. Use chaos engineering — inject fake budget-exceeded conditions and see if the agent behaves correctly.
  • Runaway detection needs to be on the hot path. A post-hoc 'we detected a runaway' log doesn't stop the bleeding — it documents it.
  • Give the agent its own budget awareness. Agents that know their budget make smarter decisions than agents that just get cut off.
No. 09

The Multi-Tenant Agent Config

One agent architecture, N customers, zero cross-tenant leaks.

Setup
  • A productised agent SaaS serving 2+ customers/tenants.
  • Managed Agents access.
  • Vaults for per-tenant credentials.
  • A clear data isolation requirement (compliance, contractual, or both).
The Prompt
I'm building a multi-tenant SaaS on Managed Agents. Each customer has their own data, their own tool credentials, and must never see another customer's data.

TENANT MODEL:
[Describe. Example: "100 SMB customers. Each has: their own Slack workspace (OAuth), their own GitHub repos (App install), and their uploaded knowledge base. A tenant is a single company; users within a tenant share data."]

ISOLATION REQUIREMENTS:
[What must never cross tenants? Example: "No cross-tenant data visibility in any retrieval. Logs scrubbed of tenant content. Failover between tenants never leaks state."]

TASK:

## PART 1: THE ISOLATION ARCHITECTURE
Design the layered isolation:

AGENT LEVEL
• One shared agent definition, parameterised by tenant, OR one agent per tenant?
• How tenant identity gets threaded through (request parameter, session metadata, custom header)

ENVIRONMENT LEVEL
• Per-tenant container config vs shared config
• Filesystem isolation (mount different files per tenant)
• Network allow-lists (some tenants may have IP allow-lists on their own services)

VAULT LEVEL
• Credential injection: per-tenant Vaults, one Vault with namespacing, or hybrid
• Key-naming convention (tenant_id is the natural namespace)
• Credential rotation story (must not cause cross-tenant outages)

SESSION LEVEL
• Session IDs and how tenant_id is tagged on them
• Event-history retention per tenant
• How to stream logs to the right tenant's observability system

For each layer, produce the config snippet and a one-paragraph justification.

## PART 2: THE TENANT-AWARE SYSTEM PROMPT
The agent's system prompt should know whose data it's operating on. Design:
• How tenant identity enters the system prompt (at session creation, via a Skill, via an init tool call)
• What the agent must never do across tenants (explicit prohibitions)
• How the agent handles ambiguous references ("the customer data" when there are multiple customers)

Produce the system prompt template with clear tenant-aware sections.

## PART 3: TOOLS AS ISOLATION BOUNDARIES
Every tool the agent calls must respect tenant scope.

For each tool in my setup:
• How tenant_id is passed in (required parameter, context variable, out-of-band)
• How the tool validates tenant context (authorisation check at the tool's backend)
• What happens on mismatched or missing tenant context (fail closed, not open)

Show the tool-call examples with tenant threading done right.

## PART 4: THE CROSS-TENANT LEAK TESTS
The tests you write knowing attackers will try to break isolation:
• Prompt injection attempting to query another tenant's data
• Tool-call argument manipulation (wrong tenant_id, no tenant_id)
• Session-confusion attacks (reusing session IDs across tenants)
• Cache contamination (prompt cache leaking context between tenants)

Write the test suite. Include the specific adversarial inputs.

## PART 5: LOGGING WITHOUT LEAKING
Logs must be debuggable per tenant AND scrubbed of cross-tenant contamination.

Design:
• Log structure with tenant_id as a top-level field
• What gets logged (metadata) vs what doesn't (content)
• How to handle logs during incidents (do you still see content? under what circumstances?)
• How to purge a tenant's logs if they offboard or request deletion

## PART 6: THE DISASTER RECOVERY QUESTIONS
Multi-tenancy makes DR harder. Walk through:
• What happens if a Managed Agent session for tenant A crashes mid-run — can it be retried without tenant-B side effects?
• What happens if a Vault rotation corrupts credentials for one tenant — blast radius?
• What happens if Anthropic has a regional outage affecting some tenants but not others?

## PART 7: THE AUDIT YOU'LL NEED LATER
If one of your tenants is SOC 2 / HIPAA / GDPR-regulated, they'll ask for evidence of isolation. Produce:
• The evidence artefacts (config diffs, log samples, test results)
• The narrative description for the auditor
• The known gaps (there will be some — better to name them than hide them)
Expected Output

A layered isolation architecture with per-layer justification, adversarial tests, and an audit-ready artefact set. This is one of the highest-risk patterns in agent SaaS; a thorough Part 4 is what separates 'we thought it was isolated' from 'we proved it'.
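Part 3's fail-closed rule fits in a dozen lines on the client side. This is an illustrative sketch with invented names; it does not replace the authoritative check, which must live in the tool's backend.

```python
class TenantScopeError(PermissionError):
    """Raised when a tool call arrives without a matching tenant scope."""

def tenant_scoped(tool_fn, session_tenant_id):
    """Wrap a tool so every call is validated against the session's tenant.

    Fail closed: a missing or mismatched tenant_id rejects the call before
    it reaches the backend.
    """
    def wrapper(**kwargs):
        tid = kwargs.get("tenant_id")
        if tid is None:
            raise TenantScopeError("tenant_id missing: failing closed")
        if tid != session_tenant_id:
            raise TenantScopeError(
                f"tenant mismatch: call={tid!r} session={session_tenant_id!r}"
            )
        return tool_fn(**kwargs)
    return wrapper

# Hypothetical backend tool for illustration.
def get_customer(tenant_id, customer_id):
    return {"tenant": tenant_id, "id": customer_id}

scoped = tenant_scoped(get_customer, session_tenant_id="acme")
```

Wrapping at tool-registration time means the agent physically cannot make a cross-tenant call through this client, which is the property the Part 4 leak tests should then try to falsify.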

Variations
  • For regulated industries (healthcare, finance): 'Additionally address: data residency (EU tenants must stay in EU), audit logs with retention requirements, compliance-specific attestations.'
  • For free-tier + paid customers: 'Add tier-based resource limits — free tenants get less context, lower-tier models, stricter rate limits. Show the config.'
  • For white-label deployments: 'Add branding isolation — each tenant's agent appears as their brand, not yours. Include how to handle this in system prompts and tool outputs.'
Gotchas
  • Prompt cache contamination is the non-obvious multi-tenant leak. If two tenants share cached context, one's history can bleed into the other's generation. Use per-tenant cache strategies.
  • Tenant_id in a tool argument is necessary but not sufficient. The tool's backend must validate it — an agent that can pass any tenant_id it wants is an agent that will.
  • Sessions ≠ tenants. A session lives in exactly one tenant's scope. Don't multiplex tenants over a session.
  • Vaults are per-agent in current beta. Per-tenant Vaults mean per-tenant agents, OR a more complex indirection layer.
  • 'Works for 10 tenants' rarely works for 1000 tenants. Test at scale early — rate limits, concurrency, cache eviction behave differently.
No. 10

The Platform-Selection Teardown for a Client

When your client asks whether to build on Anthropic, OpenAI, Bedrock, or Vertex — the answer with receipts.

Setup
  • A client (or your firm) evaluating AI platforms for a real workload.
  • A clear description of the workload: task type, scale, constraints.
  • Access to Anthropic Platform, OpenAI Platform, and at least one of Bedrock/Vertex for hands-on comparison.
The Prompt
I'm producing a platform-selection teardown for a client. They're deciding between Anthropic, OpenAI, Amazon Bedrock, and Google Vertex AI for [WORKLOAD].

CLIENT CONTEXT:
• Company type and size: [description]
• Existing cloud commitments: [AWS / GCP / Azure / multi / none]
• Existing data stack: [where data lives, what's sensitive]
• Compliance requirements: [SOC 2, HIPAA, GDPR, etc.]
• AI team maturity: [greenfield / some prototypes / ML-engineering team / full ML platform team]
• Budget sensitivity: [startup / mid-market / enterprise with no hard cap]

WORKLOAD:
[One paragraph. Example: "A customer-service copilot that reads tickets, fetches knowledge-base articles, drafts responses, escalates to humans when confidence is low. Expected 50,000 interactions/day at steady state."]

TASK:

## PART 1: EVALUATION FRAMEWORK
Produce the evaluation rubric, customised to the client. Dimensions should include (at least):
• Model quality on the specific task
• Total cost at projected scale (inference plus the easy-to-miss costs: vector DB, data egress)
• Time to production (the "second job" of infrastructure)
• Ecosystem fit with existing stack
• Data-residency and compliance
• Vendor lock-in risk
• Roadmap alignment (what's each vendor likely to ship over the next 18 months?)
• Operational concerns (observability, debugging, incident support)

Weight each dimension based on the client's context. Justify the weights.

## PART 2: SIDE-BY-SIDE COMPARISON
For each of the four platforms, score against the rubric. Per dimension:
• Anthropic Platform
• OpenAI Platform
• Amazon Bedrock
• Google Vertex AI

Scoring: 5-point scale with concrete justification. No vendor gets a score without a specific reason.

Produce this as a table, followed by a summary paragraph that highlights the signal in the noise.

## PART 3: BENCHMARK ON THE CLIENT'S ACTUAL TASK
A design for running the client's workload on each platform for a bounded period:
• Benchmark task (narrowed version of their real task)
• Eval criteria (what does 'good' look like?)
• Test set size and composition
• Instrumentation (quality score, cost, latency per platform)
• Execution plan (who, when, how much)

## PART 4: THE REAL-MONEY COST MODEL
For each platform, produce the monthly cost at projected scale. Include:
• Per-token inference cost
• Batch API discounts (where available)
• Caching discounts (where available)
• Infrastructure cost (cloud resources beyond the model API)
• Support contract cost (if applicable)
• Hidden costs (egress, cross-region, premium features)

Plot best / likely / worst cases. Call out which costs can be capped (spend caps) vs uncapped.
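A toy version of that cost model — every price, discount, and scenario multiplier is an input or placeholder, not a vendor quote:

```python
def monthly_cost(calls_per_day, in_tokens, out_tokens,
                 in_price, out_price,        # $ per million tokens -- your inputs
                 cache_discount=0.0,         # fraction saved on cached input tokens
                 batch_share=0.0,            # fraction of traffic batch-eligible
                 infra_fixed=0.0):           # vector DB, egress, support, etc.
    """Illustrative monthly cost; assumes a 50% discount on batch-eligible traffic."""
    days = 30
    base = calls_per_day * days * (
        in_tokens / 1e6 * in_price * (1 - cache_discount)
        + out_tokens / 1e6 * out_price
    )
    base *= (1 - 0.5 * batch_share)
    return base + infra_fixed

def scenarios(**kw):
    # Best/worst multipliers are placeholders -- replace with your own ranges.
    likely = monthly_cost(**kw)
    return {"best": likely * 0.7, "likely": likely, "worst": likely * 1.6}
```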

## PART 5: BUILD-VS-BUY ON THE ANTHROPIC SIDE
Given the client's profile, specifically address:
• Managed Agents vs building agent infra on any platform: worth it for this client?
• Anthropic-hosted MCP connectors vs building integrations: worth it?
• Workbench / Eval Tool vs rolling your own: worth it?

For clients with mature ML teams, the answer skews toward build. For greenfield clients, it skews toward Anthropic's managed surfaces. Where does THIS client sit?

## PART 6: THE 18-MONTH OUTLOOK
Based on each vendor's announced roadmap and current trajectory:
• What's Anthropic likely to ship that affects this decision?
• What's OpenAI likely to ship?
• What's Bedrock/Vertex's story?

Recommend the decision that ages well over 18 months, not just the one that looks best today.

## PART 7: THE RECOMMENDATION
One page. Clear recommendation. Three scenarios under which the recommendation flips. A one-paragraph executive summary the CEO can read and act on.

## PART 8: WHAT WE'RE WRONG ABOUT
Where are you least confident in this analysis? What data would change the recommendation? This matters — clients value intellectual honesty more than confident wrongness.
Expected Output

A consulting-grade platform teardown with rubric, scores, cost model, benchmark plan, and a recommendation that ages. The Part 8 'what we're wrong about' is what senior buyers specifically look for — it signals the analysis was done, not performed.

Variations
  • For a client already on one platform: 'Frame as a migration analysis: cost/quality/risk of moving from [current] to each alternative. Include stay-put as an explicit option.'
  • For a client building a consumer product: 'Weight latency, cost-per-user, and abuse-resistance more heavily. These matter more for consumer than B2B workloads.'
  • For a regulated industry: 'Weight compliance, auditability, data-residency above everything else. A vendor that scores 5/5 on quality but 2/5 on compliance is unusable.'
Gotchas
  • Quality comparisons based on public benchmarks are usually worthless for specific client tasks. Run the Part 3 benchmark on their real data or don't claim quality rankings.
  • Vendor roadmaps are sales material. Discount heavily. Recent ship history is a better predictor than roadmap slides.
  • Multi-cloud is not free. If recommending two vendors, account for the integration cost.
  • The client probably has an emotional preference (CEO read an article, VP-Eng likes OpenAI from a previous job). Surface this early; pretending it doesn't exist wastes your work.
  • Don't let the rubric become the answer. Rubrics quantify intuition; they don't replace it.
No. 11

The Eval-Gated Deployment Pipeline

Prompt changes go through the same rigour as code changes. CI runs evals; bad prompts get blocked.

Setup
  • Prompts tracked in version control (or a system you'll migrate to version control as part of this).
  • An eval suite from Prompt #2.
  • CI/CD platform: GitHub Actions, CircleCI, GitLab CI, whatever.
  • A deployment target: Managed Agent config, API-consuming service, etc.
The Prompt
I want to set up an eval-gated deployment pipeline for prompts. Today, prompt changes are deployed with the same ceremony as... no ceremony. I want every prompt change to run through evals before it reaches production.

STACK:
• Prompt storage: [location — git repo path, Workbench, config service]
• Eval Tool: [platform.claude.com Eval Tool]
• CI: [GitHub Actions / GitLab / CircleCI / other]
• Deployment target: [what consumes the prompt in production]
• Team size: [solo / small team / larger team with review requirements]

TASK:

## PART 1: THE DESIRED WORKFLOW
End-to-end, what should happen when someone changes a prompt:

1. Author changes the prompt in [where]
2. Opens a PR with the change
3. CI runs: [what evals? what thresholds?]
4. If pass, [what? auto-merge? still need human review?]
5. If fail, [what? the author sees what artefacts?]
6. On merge, deployment happens [how? immediately? staged rollout?]
7. In production, [what's monitored? rollback triggers?]

Produce this as a clear flowchart in text form.

## PART 2: THE GIT STRUCTURE
Prompts in a repo. Show:
• Directory layout (one file per prompt? directory per domain? how to organise)
• File format (YAML / JSON / markdown with frontmatter / raw)
• Accompanying files (eval suite, expected outputs, metadata)
• Versioning (semver? timestamped? git-sha-derived?)

Produce the example directory tree with actual file contents.
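One possible shape for the answer — markdown with frontmatter plus a content-derived version. The layout, frontmatter keys, and model string are illustrative, and the parser is deliberately minimal ('key: value' lines only):

```python
# One possible on-disk layout (illustrative):
#   prompts/
#     support/
#       triage.md          <- prompt body with frontmatter
#       triage.eval.jsonl  <- linked eval cases
import hashlib

EXAMPLE = """\
---
name: support-triage
model: claude-sonnet-4-5
eval_suite: triage.eval.jsonl
---
You are a support-triage assistant...
"""

def parse_prompt_file(text):
    """Split frontmatter from body. Minimal parser: 'key: value' lines only."""
    _, fm, body = text.split("---\n", 2)
    meta = dict(line.split(": ", 1) for line in fm.strip().splitlines())
    return meta, body

def prompt_version(text):
    """Content-derived version (git-sha style): same text, same version."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]
```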

## PART 3: THE CI PIPELINE
The GitHub Actions workflow (or equivalent) that:
• Triggers on PR with changes to prompts/
• Extracts the changed prompts
• Runs each against its linked eval suite via the Eval Tool API
• Posts results as a PR comment (score, diffs vs main branch, regressions flagged)
• Fails the PR if scores drop or hard assertions fail

Produce the actual workflow YAML.

## PART 4: THE PROMPT-SCORE DIFF
The most useful PR comment isn't "score: 4.2". It's "new: 4.2, main: 4.0, delta: +0.2, regressions: 2 cases, improvements: 7 cases".

Produce the comment template and the script that generates it. Include:
• Per-case comparison (old score vs new score per test case)
• Regression highlights (cases where score dropped)
• New capabilities (cases that were failing and now pass)
• Cost impact (token count old vs new)
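The diff logic can be sketched as a small script. The 0.1 noise threshold is a placeholder to be tuned against your own eval variance (see the Gotchas below on statistical thresholds):

```python
def score_diff(main_scores, new_scores, noise=0.1):
    """Per-case comparison between main-branch and PR eval scores.

    Deltas smaller than `noise` are treated as measurement error,
    not regressions.
    """
    cases = sorted(set(main_scores) & set(new_scores))
    regressions = [c for c in cases if main_scores[c] - new_scores[c] > noise]
    improvements = [c for c in cases if new_scores[c] - main_scores[c] > noise]
    mean = lambda d: sum(d[c] for c in cases) / len(cases)
    return {
        "main": round(mean(main_scores), 2),
        "new": round(mean(new_scores), 2),
        "regressions": regressions,
        "improvements": improvements,
    }

def pr_comment(diff):
    """Render the comparison as the one-line summary the PR comment leads with."""
    return (f"eval: new {diff['new']} vs main {diff['main']} | "
            f"regressions: {len(diff['regressions'])} "
            f"improvements: {len(diff['improvements'])}")
```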

## PART 5: THE DEPLOYMENT STAGING
On merge:
• Stage 1: deploy to a shadow environment that mirrors production but doesn't affect users. Run the eval suite again against the live infra.
• Stage 2: 1% canary to production. Monitor for [what signals]?
• Stage 3: full rollout.

Produce the staging config and the rollback conditions for each stage.
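A sketch of what a stage-2 canary gate might reduce to — the signals and thresholds are placeholders; set them from your own production baselines:

```python
def canary_verdict(error_rate, refusal_rate, p95_latency_ms,
                   max_error=0.02, max_refusal=0.05, max_p95=4000):
    """Roll back the canary on any breached signal; promote only when clean."""
    breaches = []
    if error_rate > max_error:
        breaches.append("error_rate")
    if refusal_rate > max_refusal:
        breaches.append("refusal_rate")
    if p95_latency_ms > max_p95:
        breaches.append("p95_latency")
    return ("rollback", breaches) if breaches else ("promote", [])
```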

## PART 6: THE DRIFT DETECTION
Once deployed, prompts don't stay "correct" — the world shifts under them.

Design:
• What to monitor in production (quality signals, not just cost)
• What triggers a "re-eval" (user-reported issue, model update, input distribution shift)
• How often to run the eval suite against live traffic samples

## PART 7: THE GUARDRAIL FOR THE HERO-DEVELOPER
One engineer bypasses CI in an incident ("just ship the fix"). Six months later, nobody remembers that commit. The eval never ran. The prompt is now broken in ways nobody sees.

Design the guardrail:
• How to allow emergency overrides without permanent damage
• Retroactive eval runs on bypassed merges
• Audit log of who bypassed what and why

## PART 8: WHAT THIS DOESN'T CATCH
Honest assessment: eval-gated deployment catches a lot, but not everything. List the failures this pipeline won't catch — so you know where to invest manual review.
Expected Output

A complete eval-gated pipeline: git structure, CI workflow, PR comment template, staged rollout config, drift detection, and override protocols. The Part 8 honesty is what makes this a real pipeline vs security theatre.

Variations
  • For teams without existing CI: 'Start with the simplest possible pipeline: a pre-commit hook that runs a small eval locally. Ship that this week; expand later.'
  • For teams with extensive CI maturity: 'Address integration with existing quality gates (SAST, DAST, perf tests). Eval is one more gate, not the gate.'
  • For orgs running many prompts across many teams: 'Address the eval-suite-per-prompt-per-team problem: central eval infra vs federated, and how to keep quality bars consistent across teams.'
Gotchas
  • Eval scores are noisy. A 4.0 vs 3.95 diff is not a regression; it's measurement error. Build in a statistical threshold, not a strict comparison.
  • LLM-as-judge eval scores can drift as the judge model changes. Pin the judge model version.
  • If CI is slow, developers work around it. Target sub-5-minute eval runs for common prompt changes. Use representative test subsets for CI, full suites for pre-merge.
  • Don't gate deployment on costs alone. A new prompt that's 10% more expensive but solves a previously-failing use case is a good trade; a 1% cheaper prompt that fails on edge cases is not.
  • The override path will be used. Design it; don't pretend it won't happen.
No. 12

The Multi-Agent Orchestration Pattern

When one agent isn't enough: coordinating specialists without building a distributed-systems nightmare.

Setup
  • A workflow too complex for one agent: multiple specialisations, parallel execution, or decision trees deep enough that a single prompt degrades.
  • Managed Agents access, ideally with multi-agent research preview (request access).
  • A clear decomposition of the workflow into agent roles.
The Prompt
I'm building a multi-agent workflow. A single agent is degrading — too many responsibilities, too much context, too many decision branches. I need orchestration.

THE WORKFLOW:
[Description. Example: "A loan-application evaluator that ingests an application packet, checks it against rules, requests missing documents from the applicant, runs background checks, synthesises a recommendation, and generates a compliance-ready report."]

CURRENT STATE:
[What's working, what isn't, on the single-agent version. Example: "Prompt is 4,500 tokens. Accuracy is 78%. Gets confused between fraud detection and completeness checking. Background-check tool calls sometimes fire in parallel incorrectly."]

TASK:

## PART 1: THE DECOMPOSITION
Break the workflow into discrete agents:
• Role name
• Sole responsibility (one sentence)
• Inputs it needs
• Outputs it produces
• Model choice (Haiku for simple, Sonnet for balanced, Opus for judgement)

Rules:
• Each agent does one thing.
• If an agent's description includes "and", split it.
• If an agent needs more than 1,500 tokens of system prompt, it's doing too much.

Produce the agent roster.

## PART 2: THE ORCHESTRATION PATTERN
Pick one. Justify.

OPTION A — ORCHESTRATOR + WORKERS
One agent is the orchestrator; others are specialists it dispatches to. Orchestrator holds the overall plan; workers return structured outputs.

OPTION B — PIPELINE
Agents chained end-to-end. Output of A feeds B feeds C. No orchestrator.

OPTION C — EVENT-DRIVEN
Agents react to shared state or messages. No central coordinator. Decoupled but harder to reason about.

OPTION D — HIERARCHICAL
Sub-orchestrators coordinate their own sub-agents. Good for very complex workflows.

For my workflow: which pattern? Why?

## PART 3: THE MESSAGE SCHEMA
Multi-agent systems live or die on the message contract between agents.

Produce:
• The structured format (JSON schema) each agent receives and returns
• Error types (including 'I can't complete this' — not just unhandled exceptions)
• Metadata (which agent, which step, which session, latency, cost)
• Versioning (how does Agent A know which version of Agent B it's talking to?)

## PART 4: THE SHARED STATE
Agents need to share something — task state, intermediate results, tool outputs.

Design:
• Storage (Managed Agents memory beta? External state store? Event history?)
• Consistency model (eventual? strong? per-agent local?)
• Who can write what
• Cleanup and TTL

If the multi-agent research preview fits, use it; explain how. If not, design the alternative.

## PART 5: THE FAILURE MODES
Multi-agent failures are weirder than single-agent. Walk through:
• One agent times out — does the whole workflow block, proceed degraded, or retry?
• Two agents disagree — who arbitrates?
• The orchestrator goes down mid-workflow — how is state recovered?
• A tool a sub-agent relies on becomes unavailable — local retry or escalation?
• The model returns invalid structured output — the retry/fix protocol

Produce the failure-handling playbook.

## PART 6: THE COST MULTIPLIER
N agents means N × base cost, plus coordination overhead. Model this:
• Expected cost per workflow run
• Where advisor routing can cut costs
• Whether any step could be non-LLM code instead (big wins here)
• When to fall back to single-agent for simple cases

## PART 7: THE OBSERVABILITY STACK
Debugging a single-agent run is annoying. Multi-agent is a distributed-systems problem.

Design:
• Per-agent trace with parent-child relationships
• End-to-end visualisation of a workflow (timeline, decision branches, tool calls)
• How to find the agent that caused a bad outcome when there are 5 in the chain
• What to log at the orchestration layer vs the agent layer

## PART 8: THE WORST MULTI-AGENT DESIGN DECISION
Reviewing the plan: where's the riskiest architectural choice you made? What would be the early warning sign that it was wrong? What's the fallback?
Expected Output

A decomposition, an orchestration pattern choice with justification, message schemas, failure playbook, and observability design. The 'riskiest choice' self-review is critical — multi-agent systems fail in ways single-agent doesn't, and the earlier you name the risks, the earlier you catch them.

Variations
  • For workflows with strict latency needs: 'Emphasise parallelism — which sub-agents can run concurrently? Produce the parallelism DAG.'
  • For workflows with human-in-the-loop steps: 'Add the human approval nodes. Design the queue, the SLA, the escalation on human delay.'
  • For workflows that might run on a single agent with clever prompting: 'First, challenge the multi-agent assumption. Under what conditions would a single well-structured agent outperform the multi-agent design? Make the case for the counterfactual before agreeing to the split.'
Gotchas
  • Multi-agent is often the wrong answer. Single agent with better prompt structure, better tools, or better models solves most 'I need multiple agents' problems. Do the counterfactual analysis in Part 2.
  • Agent-to-agent coordination overhead is real. If your agents spend more tokens talking to each other than to the user, something is wrong.
  • The multi-agent research preview is not yet stable. Don't design critical-path infra around it; prototype with it.
  • Error handling across agents is 3x harder than within one. Budget the engineering time accordingly.
  • Orchestrator-as-bottleneck is the classic pattern failure. If the orchestrator is doing heavy lifting, you've made it the agent that needed splitting.
No. 13

The Structured Output Contract

Stop parsing LLM text like it's HTML in 1998. Use structured outputs as a code contract.

Setup
  • An agent or prompt currently returning text that gets parsed downstream.
  • A JSON schema or Pydantic / Zod model representing the target structure.
  • Downstream consumers you can change (or at least agree with on a contract).
The Prompt
I want to replace fragile text-parsing with proper structured outputs on [PROMPT/AGENT].

CURRENT STATE:
[Description. Example: "Our agent returns a recommendation as a paragraph. Downstream, a regex extracts the decision, another regex the confidence, a third the reasons. About 3% of outputs break parsing; 1% are silent wrong parses."]

TARGET STATE:
[What downstream expects. Example: "A typed object: { decision: 'approve' | 'reject' | 'review', confidence: 0-1, primary_reasons: string[], red_flags: string[], recommended_next_actions: string[] }"]

TASK:

## PART 1: THE SCHEMA
Produce the JSON schema. Include:
• Every field, typed (no vague 'string'; use enums where possible)
• Required vs optional
• Constraints (min/max, patterns, lengths)
• Descriptions for each field (these become the LLM's spec — write them for an LLM reader)
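For the example target state above, the schema might start like this. The field descriptions double as the model's spec, and `quick_check` is a deliberately partial validator for the common hallucinations — not a full JSON Schema implementation:

```python
# Illustrative JSON Schema fragment for the decision object.
DECISION_SCHEMA = {
    "type": "object",
    "required": ["decision", "confidence", "primary_reasons"],
    "additionalProperties": False,
    "properties": {
        "decision": {
            "type": "string",
            "enum": ["approve", "reject", "review"],
            "description": "Final call. Use 'review' when evidence conflicts.",
        },
        "confidence": {
            "type": "number", "minimum": 0, "maximum": 1,
            "description": "Calibrated 0-1. Do not default to 0.9.",
        },
        "primary_reasons": {
            "type": "array", "items": {"type": "string"},
            "minItems": 1, "maxItems": 5,
            "description": "Concrete reasons tied to the input, most important first.",
        },
    },
}

def quick_check(obj):
    """Catch the common hallucinations: bad enum, out-of-range number, empty list."""
    props = DECISION_SCHEMA["properties"]
    errs = []
    if obj.get("decision") not in props["decision"]["enum"]:
        errs.append("decision")
    c = obj.get("confidence")
    if not isinstance(c, (int, float)) or not 0 <= c <= 1:
        errs.append("confidence")
    if not obj.get("primary_reasons"):
        errs.append("primary_reasons")
    return errs
```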

## PART 2: THE PROMPT REWRITE
Rewrite the prompt to use structured outputs:
• The schema attached via the Messages API structured_output parameter
• The system prompt updated to reference the schema structurally
• Examples showing the exact output shape
• Failure modes and what to return in each

## PART 3: THE DOWNSTREAM CONTRACT
Now that the prompt returns typed data:
• The consumer code change (ditch the regex; type-check the response)
• The validation layer (schema-valid but semantically wrong outputs — how to catch)
• The backward-compatibility plan if this prompt has existing callers

## PART 4: THE ERROR HANDLING
Structured outputs can still fail:
• Model returns invalid JSON (rare but happens) — retry with error in context?
• Model returns valid JSON but wrong values (hallucinated enum, out-of-range number)
• Schema-valid but useless (all fields set to generic defaults)

For each, the detection and handling code.
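The retry protocol can be sketched as a loop around an injected `model_call` — a hypothetical wrapper for your actual API call, kept injectable so the repair logic stays testable. Feeding the previous error back in lets the model self-correct:

```python
import json

def semantic_errors(obj):
    """Detection for 'valid JSON, wrong values'. Extend per-field as needed."""
    errs = []
    if obj.get("decision") not in {"approve", "reject", "review"}:
        errs.append("decision")
    return errs

def call_with_repair(model_call, max_attempts=3):
    """Retry structured output with the previous error in context.

    `model_call(feedback)` is your wrapper around the API; `feedback` is
    None on the first attempt, an error description on retries.
    """
    feedback = None
    for _ in range(max_attempts):
        raw = model_call(feedback)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as e:
            feedback = f"Previous output was invalid JSON: {e}"
            continue
        errs = semantic_errors(obj)
        if not errs:
            return obj
        feedback = f"Schema-valid but wrong values in: {errs}"
    raise ValueError("structured output failed after retries")
```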

## PART 5: THE TEST SUITE
Tests that exercise:
• Happy path (valid, useful output)
• Schema-valid but semantic failure (e.g., 'decision: approve' but reasons contradict)
• Ambiguous inputs (model should still produce structured output, maybe with lower confidence)
• Adversarial inputs (should still produce structured output, not free-form text)

## PART 6: THE LATENCY AND COST IMPACT
Structured outputs sometimes add tokens (the schema gets tokenised). Sometimes save tokens (no prose explanation). For MY prompt, estimate:
• Token count before / after
• Cost delta per call
• Latency impact (minimal, but measure)

## PART 7: WHEN NOT TO USE STRUCTURED OUTPUTS
Honest assessment. Structured outputs aren't always right:
• Open-ended generation (blog posts, long-form content)
• Outputs meant for human consumption with no machine parsing
• Cases where forcing structure degrades quality

For MY prompt, is structured output actually the right move? (If no, say so. Don't force it.)
Expected Output

A schema, a rewritten prompt, consumer-side changes, and an honest 'should we even do this?' check. Structured outputs are a big win for most production prompts — but not all.

Variations
  • For deeply nested outputs: 'Address schema composition — build complex schemas from smaller ones. Show the pattern.'
  • For outputs that sometimes should be free-form: 'Design a hybrid: structured outputs for the predictable fields, a free_form_notes field for anything else the model wants to say.'
  • For outputs graded by LLM-as-judge: 'Include the judge prompt that evaluates structured output correctness vs semantic quality. These are different axes.'
Gotchas
  • Overly strict schemas make the model refuse to produce anything for unusual inputs. Build in an 'unknown' or 'insufficient_data' path.
  • Enum fields are great; enum fields with 50 options are a signal you haven't clustered them well.
  • The field descriptions are a prompt. 'recommended_next_actions' with no description gets generic output; with good description gets useful output.
  • Don't include schemas hundreds of fields deep. Split into multiple calls if needed.
  • Schema-valid doesn't mean semantically correct. A 'decision: approve' with contradictory reasons is a schema pass and a business failure.
No. 14

The Batch API Research Pipeline

Run 50,000 document analyses overnight at half price. The workload profile that made expensive research tractable.

Setup
  • A dataset of 1,000+ documents, rows, or records to process.
  • A per-document task that doesn't need real-time response (you can wait 24 hours).
  • Batch API access.
  • Storage for batch inputs and outputs (S3 / GCS / local).
The Prompt
I have a research pipeline that processes [DATASET] with a per-record LLM call. Currently it runs synchronously and costs too much. I want to move it to Batch API.

CURRENT STATE:
• Dataset: [N records of type X]
• Per-record prompt: [attach or describe]
• Current execution: [sync, via Messages API, takes [T] hours, costs $[X]]
• Use case: [one-off research / recurring job / on-demand but ad-hoc]

TASK:

## PART 1: BATCH ELIGIBILITY
Is this actually right for Batch API? Check:
• Can I wait 24 hours? (Some jobs can't.)
• Are records independent? (If record N depends on record N-1, batch doesn't help.)
• Is the dataset large enough? (Under ~1,000 records, the ops overhead often isn't worth it.)
• Do I need intermediate results? (Batch is all-or-nothing for the output.)

If any answer is 'no' — say so, and suggest alternatives (prompt caching, parallel async calls, Advisor routing).

## PART 2: THE BATCH FILE
The Batch API accepts JSONL with specific structure.

Produce:
• The exact line format for my prompt (custom_id, method, url, body with model/messages)
• The script that generates batch files from my dataset
• How to split into multiple batches if the dataset exceeds single-batch limits
• Validation (malformed lines will fail the whole batch, sometimes silently)
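A sketch of the generator and the split. The line shape (custom_id plus per-request params) follows the general structure of batch request files, but verify field names against the current API reference; the model string is a placeholder:

```python
import json

def build_batch_lines(records, model="claude-haiku-4-5", max_tokens=1024):
    """One JSONL line per record, with up-front validation."""
    lines = []
    seen = set()
    for rec in records:
        cid = rec["id"]
        if cid in seen:  # duplicate custom_ids break output matching later
            raise ValueError(f"duplicate custom_id: {cid}")
        seen.add(cid)
        lines.append(json.dumps({
            "custom_id": cid,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": rec["prompt"]}],
            },
        }))
    return lines

def chunk(lines, per_batch=10_000):
    """Split into multiple batches when the dataset exceeds single-batch limits."""
    return [lines[i:i + per_batch] for i in range(0, len(lines), per_batch)]
```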

## PART 3: THE SUBMISSION AND POLLING
Code to:
• Upload the batch file
• Submit the batch job
• Poll for completion (or use a webhook if available)
• Download the output file
• Parse and match outputs back to inputs by custom_id

## PART 4: THE PARTIAL-FAILURE HANDLING
Batch jobs can have per-record failures. Design:
• How to detect failed records in the output
• Retry policy for failed records (re-batch? sync retry? abandon?)
• How to avoid losing the successful ~99% if the batch partially fails
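The custom_id matching and partial-failure split, sketched with an illustrative result shape (not the API's exact output schema):

```python
def reconcile(inputs, results):
    """Match batch results back to inputs by custom_id; isolate failures.

    `results` items look like {"custom_id": ..., "ok": bool, "output": ...}
    here -- an illustrative shape. A record missing from the output counts
    as failed, so the successful majority survives a partial failure.
    """
    by_id = {r["custom_id"]: r for r in results}
    succeeded, retry = {}, []
    for rec in inputs:
        res = by_id.get(rec["id"])
        if res and res.get("ok"):
            succeeded[rec["id"]] = res["output"]
        else:
            retry.append(rec)
    return succeeded, retry
```

The `retry` list feeds either a small re-batch or a sync retry loop, depending on its size.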

## PART 5: THE COST MODEL
Compared to sync:
• Batch discount: 50% of standard token price
• But: batch jobs can fail partially, requiring reruns (which cost again)
• Storage costs for input/output files
• Engineering time to build batch pipeline vs one-time sync run

For MY dataset and job, the real savings: $[X] sync vs $[Y] batch = $[Z] savings. Break-even on engineering effort at [N] runs.
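The break-even arithmetic, as a sketch — the discount, rerun rate, and build cost are all inputs, not guarantees:

```python
def batch_savings(sync_cost_per_run, runs, pipeline_build_cost,
                  batch_discount=0.5, expected_rerun_rate=0.02):
    """Net savings after the batch discount, rerun waste, and the one-time
    cost of building the pipeline. All inputs in dollars."""
    batch_cost_per_run = (sync_cost_per_run * (1 - batch_discount)
                          * (1 + expected_rerun_rate))
    saved = (sync_cost_per_run - batch_cost_per_run) * runs
    return {
        "saved": round(saved, 2),
        "net": round(saved - pipeline_build_cost, 2),
        "breaks_even": saved >= pipeline_build_cost,
    }
```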

## PART 6: THE RUNBOOK
Operational doc for the humans who'll actually run this:
• How to submit a new batch
• How to monitor progress
• What 'stuck' looks like and how to unstick it
• How to reprocess a subset without resubmitting the whole dataset
• Cost projections for common dataset sizes

## PART 7: THE EDGE CASE THAT WILL BREAK YOUR FIRST BATCH
Based on typical first-batch mistakes: where will YOUR batch likely fail? (Encoding issues? Token-length overruns on outlier records? API response size limits? Something else?)
Expected Output

A batch submission pipeline, cost model, and runbook. Batch is genuinely underused for research workflows — first-time users often realise a week of sync calls could have been one overnight batch at half price.

Variations
  • For recurring batches: 'Add a scheduler (cron or Airflow), input-diff detection (only batch new/changed records), and output storage versioning.'
  • For batches with tool use: 'Tool-use in batch is trickier — the tool call round-trips don't happen in batch. Design the workflow: batch generates plans, sync execution runs the tools, batch summarises. Or split differently.'
  • For batches that feed into another LLM: 'Address prompt caching for the second-stage prompt — the batch outputs can warm the cache across sync calls.'
Gotchas
  • Batch turnaround is 'within 24 hours' — usually faster, sometimes slower. Don't commit to stakeholders on a tighter SLA than Anthropic commits to.
  • Partial failures happen. Design for them from the start, not after the first 50% fail.
  • Custom_id is your only handle to match outputs to inputs. Make them unique and meaningful.
  • Batch can't use MCP servers currently. If your workflow needs external tools, batch is the wrong surface.
  • File size limits matter. 100k records in one batch hits the limit; split into multiple batches.
No. 15

The Agent Skills Library for Your Team

Ten Skills that turn your claude.ai workspace into a consistent team productivity layer.

Setup
  • A team of 3+ people using claude.ai Pro or higher.
  • A known set of recurring task types across the team.
  • Skill authoring access (Pro+ with code execution enabled).
The Prompt
I want to build a library of 10 Skills for my team to standardise our common workflows. Today, everyone prompts Claude their own way; outputs are inconsistent; new hires re-invent the wheel.

TEAM CONTEXT:
• Size and roles: [description]
• Most-used Claude workflows: [list the 5-10 tasks people do weekly]
• Current pain points: [inconsistency? re-explaining context? output quality variation?]

TASK:

## PART 1: SKILL INVENTORY
From my workflow list, identify the 10 Skills worth building. For each:
• Skill name (kebab-case)
• One-sentence description (this is the trigger for Claude to load it)
• Who on the team uses it, how often
• What inconsistency it fixes

Rank by ROI (time saved × frequency × number of users).

## PART 2: THE 10 SKILL.MD FILES
For each of the 10, produce the full SKILL.md:

---
name: [skill-name]
description: [when to use]
---

# [Title]

## Purpose
## When to Use
## When NOT to Use
## Instructions
## Examples
## Output Format
## Failure Handling

Rules for each:
• Self-contained (don't assume context from other Skills)
• Composable (should work alongside other Skills, not fight with them)
• Opinionated (a vague Skill produces generic output)
• Tested (include 2-3 example I/O pairs per Skill)

## PART 3: THE DISTRIBUTION PLAN
Custom Skills are per-user on Pro/Max. For team adoption:
• Store canonical SKILL.md files in [a shared repo / Drive folder / Notion]
• Onboarding doc for new team members: how to install each Skill
• Update protocol: when a Skill changes, how does the team know?
• Version tracking (even informal — date + short changelog per Skill)

## PART 4: THE USAGE GUIDANCE
A one-page 'how to use our Skills' doc for the team:
• Which Skill for which task (decision tree)
• What to do when two Skills could apply
• What to do when no Skill fits (when to write a new one vs prompt freeform)
• When to bypass Skills (highly creative work, novel tasks)

## PART 5: THE QUALITY RUBRIC
Per-Skill quality metric: how do you know a Skill is working?
• Output consistency score (multiple runs, similar inputs, measure drift)
• Team adoption rate (are people actually using it, or reverting to raw prompts?)
• Task-outcome quality (did the output actually help the downstream work?)

## PART 6: THE SKILLS YOU SHOULDN'T HAVE BUILT
Reviewing your 10: which one is weakest? Which is most likely to be abandoned? Which would be better as a tool or a template instead of a Skill?

Be honest. Better to cut 2 and ship 8 strong ones.

## PART 7: THE EVOLUTION PLAN
In 3 months, your Skills library has either grown to 20 or shrunk to 5. Which is it for this team? What's the mechanism that makes the right one happen?
Expected Output

A usable Skills library with adoption plan, quality rubric, and honest self-review. This turns claude.ai from 'everyone's own prompts' into a team standard. Expect 2-3 of the 10 to be abandoned within a month; that's normal — it's the 7-8 survivors that matter.

Variations
  • For an agency serving multiple clients: 'Build two tiers: firm-wide Skills (methodology, voice, output standards) and client-specific Skills (their brand, their data, their context). Show the layering.'
  • For heavily regulated teams (legal, healthcare, finance): 'Add Skills that enforce compliance — never output X without a disclaimer, always include attribution, redact specific patterns. Show the structure.'
  • For a team already on claude.ai Team plan: 'Address the Team-plan Skills sharing story if/when Anthropic ships org-level Skills management. For now, what's the best approximation?'
Gotchas
  • Custom Skills are per-user, not shared org-wide on current Pro/Max plans. Team distribution is manual. Plan accordingly.
  • Skills that fight each other are worse than no Skills. Design the library as a coherent set, not 10 independent ones.
  • The 'when NOT to use' section is the most-skipped and most-important. Without it, Claude over-applies Skills.
  • Skills degrade silently when the underlying model updates. Re-test your library after each significant model release.
  • A Skill that nobody uses after a month is technical debt. Cut ruthlessly.
No. 16

The Client-Facing Agent Demo

From idea to a working demo your client can poke at — in under a day.

Setup
  • A client meeting or pitch where a working demo would close the deal.
  • A concrete use case their team has expressed interest in.
  • Managed Agents access.
  • A willingness to build for throwaway — this is a demo, not a product.
The Prompt
I have a client pitch coming up. I want to build a working demo that shows them an agent solving one of their real problems. I have less than a day.

CLIENT CONTEXT:
• Industry: [description]
• Specific pain point they mentioned: [quote them if possible]
• Current tooling: [what they use today that the agent would augment or replace]
• Demo audience: [technical / mixed / fully non-technical]

TASK:

## PART 1: THE DEMO NARROWING
What's the MINIMUM demo that shows the value? Resist scope creep.

Narrow to:
• A single workflow step that's currently painful for them
• A happy-path input (their real data if possible, sanitised if not)
• A visible output they'll immediately understand as better-than-today

Anti-pattern: showing end-to-end automation. The demo should show one decisive moment where Claude produces something that makes them say 'oh'.

## PART 2: THE AGENT DESIGN
For the narrowed demo, produce:
• The agent definition (model, system prompt, tools)
• The minimum tool surface (2-3 tools max — more is distraction)
• The environment config (keep it simple; pre-installed packages only)
• The session launch code

Write it to be readable in a live demo — skip abstractions, skip configs that don't matter.

## PART 3: THE DATA
The demo needs data to operate on. Options:
• Client's real data (ideal but requires legal/privacy clearance before the meeting)
• Sanitised sample of their data (good — strip identifying info, keep structure)
• Public data that looks like their domain (okay fallback)
• Synthetic data you generate (last resort — looks fake, undermines trust)

Recommend one; produce the data file if synthetic or sanitised.

## PART 4: THE SCRIPT
The actual flow of the live demo. 15 minutes max. Written as a script:
• Setup (30 seconds — what's on screen, what the client is looking at)
• Show the problem (2 minutes — what's currently painful)
• Run the agent (3-5 minutes — live, not recorded)
• Show the output (2-3 minutes — what Claude produced, why it's good)
• The ask (2-3 minutes — what you want from them next)
• Q&A (remaining time)

Include the exact words you'll say at each step. Not because you'll read them, but because writing them forces precision.

## PART 5: THE FALLBACK PLAN
Live demos fail. Prepare:
• If the agent errors mid-run: what do you say? What's the recovery move?
• If the output is weird: is there a 'here's what we've seen in testing' backup?
• If network is flaky: do you have a recorded version ready?
• If a demo-specific thing breaks (auth token, data file): what's the 30-second fix?

## PART 6: THE POST-DEMO ARTIFACTS
After the demo, what do you leave behind?
• A one-page summary (not the demo script — a take-away doc)
• A suggested next-step (pilot scope, not full implementation)
• The agent config itself (optional — depends on commercial stage)

Produce the one-pager.

## PART 7: WHAT YOU CAN'T LET THEM SEE
The gap between a demo and a production agent is real. Know what you're showing vs what's glossed over:
• Edge cases you're not handling
• Scale behaviour you haven't tested
• Auth patterns that work in demo but wouldn't in their enterprise
• Cost at real volume

If they ask, have answers. Don't pretend gaps aren't there.
Expected Output

A minimum-viable demo: agent code, data, live-demo script, fallback plan, one-pager. This is the 'show, don't tell' pattern that closes deals — provided you narrow hard in Part 1.

Variations
  • For a non-technical audience: 'Strip the code entirely. Show input → magic → output. Focus Part 4 on the before/after, not the how.'
  • For a technical audience: 'Show the prompt, the tool schemas, the event stream. The "how" is the product for this audience.'
  • For a security-conscious enterprise: 'Additionally address the security questions before they're asked — how data flows, what's logged, where keys live, what Anthropic sees.'
Gotchas
  • Live demos fail. Always have a recorded backup ready to switch to without breaking stride.
  • Don't use the client's production credentials in a demo. Don't use real customer data without explicit sign-off. Don't show their data on a screen that's shared.
  • The narrowing in Part 1 is where first-time demos fail. Everyone tries to show too much. One moment of 'oh, that's good' beats ten minutes of 'here's a thing it can do'.
  • If the demo takes more than 15 minutes, you've built a pilot, not a demo. Scope down.
  • Don't claim the demo is production-ready; don't claim production is out of reach either. Name the gap honestly.
No. 17

The Prompt Improver Feedback Loop

Use Prompt Improver as a critique partner, not just a tool you click once.

Setup
  • A prompt that's working but could be better.
  • Prompt Improver access (Console).
  • Concrete feedback on what the prompt's missing.
The Prompt
I have a prompt that's working but not great. I want to run it through Prompt Improver, but I've learned that one-shot 'improve this' often over-engineers. Help me use it as a critique loop.

EXISTING PROMPT:
[Paste your current system prompt and user template.]

WHAT'S NOT GREAT:
[Be specific. Example: "Outputs are verbose when they should be terse. Occasionally misses the 'uncertain' flag. Sometimes hallucinates a field that isn't in the source."]

TASK:

## PART 1: THE DIAGNOSTIC BEFORE THE IMPROVEMENT
Before running Prompt Improver, diagnose: what's actually wrong?
• Is the task specification unclear?
• Is the output format under-specified?
• Are the examples mis-calibrated?
• Is the model choice wrong (too weak / too strong for the task)?
• Is the input itself ambiguous?

Prompt Improver fixes some of these; not others. Name the likely root cause. If Prompt Improver is the wrong tool, say so.

## PART 2: THE TARGETED PROMPT-IMPROVER RUN
If Prompt Improver IS the right move, don't feed it the whole prompt with vague feedback. Structure the input:
• The current prompt
• Specific feedback: "summaries are too basic for expert audiences" or "doesn't follow the specified JSON structure"
• Examples of failures (concrete inputs/outputs where the current prompt went wrong)

Produce the exact input you should feed Prompt Improver.

## PART 3: THE OUTPUT REVIEW
Prompt Improver's output is a first draft. Review it for:
• Length creep (did it add 30% more tokens for marginal gain?)
• Chain-of-thought bloat (useful for some tasks, useless for others)
• XML tag explosion (structure is good; over-structure is noise)
• Voice drift (did it lose your prompt's personality in favour of corporate-neutral?)

Produce a decision tree: what to keep from the improver's output, what to revert.

## PART 4: THE EVAL DIFF
Before committing the 'improved' prompt:
• Run both versions through your Eval Tool suite
• Compare scores per test case
• Flag regressions (where new is worse than old)
• Flag improvements (new wins)
• Flag non-changes (neither version handles X — unaddressed by Improver)

Which cases justify keeping the old version partially? Which justify the new one?
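The eval diff in Part 4 is mechanical enough to sketch. A minimal version, assuming each test case yields a single numeric score per prompt version (the function name and score shape are illustrative, not the Eval Tool's export format):

```python
def eval_diff(old_scores, new_scores):
    """Compare two prompt versions case by case.
    old_scores / new_scores: {test_case_id: numeric score}."""
    diff = {"regressions": [], "improvements": [], "unchanged": []}
    for case in sorted(old_scores):
        old, new = old_scores[case], new_scores[case]
        if new < old:
            diff["regressions"].append(case)   # new is worse: investigate before merging
        elif new > old:
            diff["improvements"].append(case)  # the Improver earned its keep here
        else:
            diff["unchanged"].append(case)     # neither version moved: a gap the Improver didn't touch
    return diff
```

The "unchanged" bucket is the one people forget to read — it's the list of problems the Improver never addressed.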

## PART 5: THE MERGE STRATEGY
Often the right move isn't 'keep old' or 'keep new' — it's merge. Take the Improver's best moves, discard the cruft, graft onto the existing structure.

Produce the merged prompt. Explain each change — why it's in, or why the Improver's version was rejected.

## PART 6: THE ITERATIVE LOOP
The loop:
1. Identify specific failure modes (not general 'improve this')
2. Run Prompt Improver with targeted feedback
3. Review for over-engineering
4. Eval-diff against the current version
5. Merge the keepers; discard the rest
6. Commit the incremental change
7. Next loop: what's the NEW weakest point?

When does the loop stop? When the weakest point is no longer the prompt — it's the model, the input, or the task itself — or when you're chasing a fractional gain that doesn't matter. Know when to stop.

## PART 7: WHAT PROMPT IMPROVER WON'T FIX
Be honest:
• Fundamentally wrong task framing
• Bad examples (garbage examples → garbage improvements)
• Wrong model for the task
• Ambiguity in the input that no prompt can resolve
• Mission creep (a prompt trying to do too much)

Does my prompt have any of these? If yes, Prompt Improver is treating symptoms. Point me at the root cause.
Expected Output

A targeted-use workflow for Prompt Improver that avoids the 'one-shot improve everything' trap. The Part 7 honesty is what prevents cargo-cult use of the Improver — running it on every prompt without asking whether it's the right tool.

Variations
  • For long-context prompts: 'Address whether Improver handles long prompts well or compresses them at the cost of fidelity.'
  • For classification prompts: 'Additionally address whether adding chain-of-thought helps or hurts classification specifically. Improver tends to add CoT; measure whether it improves accuracy on simple classification.'
  • For creative/generation prompts: 'Improver's structural additions can sterilise voice. Walk through how to preserve voice while accepting structural improvements.'
Gotchas
  • Prompt Improver adds tokens. If you're latency- or cost-sensitive, measure — sometimes the 'improved' prompt costs 2x for 10% quality gain.
  • It adds chain-of-thought aggressively. For simple classification or extraction, this is waste; for reasoning tasks, it's essential. Be intentional.
  • It over-adds XML tags. Two levels of nesting is usually the sweet spot; more gets parsed poorly.
  • It can't fix an ill-posed task. If your prompt is confused about what it's supposed to do, no amount of improving the words will help.
  • Run the eval diff. Always. 'Improved' is a claim, not a fact, until the eval shows it.
No. 18

The Tool-Use Retry and Fallback Pattern

When Claude calls your tool and the tool fails, what happens next defines whether your agent is production-grade.

Setup
  • An agent or prompt that makes tool calls.
  • Tools that can fail (network errors, validation errors, rate limits, auth expiry — all of them).
  • Managed Agents or Messages API with tool-use.
The Prompt
My agent calls external tools. Sometimes the tools fail. Today, the agent either gives up, lies about the outcome, or loops. I want a proper retry and fallback pattern.

AGENT CONTEXT:
• Agent purpose: [description]
• Tools it uses: [list with brief description]
• Most common failure modes: [what you've observed]
• Error budget: [how much retry-induced latency is acceptable]

TASK:

## PART 1: THE FAILURE TAXONOMY
Categorise tool failures:
• TRANSIENT — will succeed if retried (network blip, 5xx, timeout)
• RATE-LIMITED — will succeed if retried after a delay
• AUTH — will succeed only after re-authentication
• VALIDATION — arg is wrong; retry with same args will fail the same way
• PERMANENT — the resource doesn't exist, the operation isn't allowed
• DEGRADED — tool returned something, but the something is wrong/incomplete
• UPSTREAM-OUTAGE — the service is down entirely

For each tool my agent uses, map which of these can happen.

## PART 2: THE RETRY POLICY PER CATEGORY
For each failure type:

TRANSIENT: retry [N] times with exponential backoff. If still failing, escalate as UPSTREAM-OUTAGE.

RATE-LIMITED: respect the Retry-After header if present; otherwise, exponential backoff. Log for capacity planning.

AUTH: attempt one re-authentication; if that fails, halt and escalate to human.

VALIDATION: do NOT retry. Report the validation error back to Claude in the tool_result, let Claude re-plan the call.

PERMANENT: do NOT retry. Report back to Claude; Claude should accept this as information and continue without this data.

DEGRADED: depends on the tool. If validation catches it, report; if it doesn't, this is the silent killer — we'll handle it separately.

UPSTREAM-OUTAGE: attempt [fallback strategy if one exists]; otherwise halt gracefully with a useful partial result.

Produce the retry-decision code for each category.
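The category policies above can be collapsed into one dispatch function. This is a minimal sketch, not a definitive implementation: the category names and the per-tool `classify` callback are assumptions you'd map from your own Part 1 taxonomy.

```python
import random
import time

# Category names are illustrative; map them from Part 1's taxonomy.
TRANSIENT, RATE_LIMITED, AUTH, VALIDATION, PERMANENT = (
    "transient", "rate-limited", "auth", "validation", "permanent",
)

def call_with_retry(tool_fn, args, classify, max_attempts=3, base_delay=1.0):
    """Retry only where a retry can change the outcome; everything else
    returns immediately so Claude can re-plan instead of burning calls."""
    last = None
    for attempt in range(max_attempts):
        try:
            return {"ok": True, "result": tool_fn(**args)}
        except Exception as exc:
            category = classify(exc)  # per-tool classifier you supply
            last = {"ok": False, "category": category, "error": str(exc)}
            if category == TRANSIENT:
                # exponential backoff with jitter
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
            elif category == RATE_LIMITED:
                # respect Retry-After when the exception carries it
                time.sleep(getattr(exc, "retry_after", None) or base_delay * (2 ** attempt))
            else:
                # AUTH / VALIDATION / PERMANENT: same args will fail the same way
                break
    return last
```

The design choice worth noting: the non-retryable branch returns a structured failure rather than raising, because the failure itself becomes the tool_result Claude reads next.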

## PART 3: THE CLAUDE-FACING ERROR MESSAGE
When a tool fails, Claude reads the tool_result. How you phrase the error affects what Claude does next.

Good error messages to Claude:
• State what failed (which tool, which params)
• State why (in plain language, not stack traces)
• Suggest a next action when possible ("the customer_id format was wrong; try without the prefix")
• Don't encourage infinite retry ("the API is permanently deprecated — stop trying")

For each failure category, produce the error message template.
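A template builder makes the principles above concrete. The `tool_result` block shape with `is_error` matches the Messages API's tool-use format; the helper itself, its arguments, and the example IDs are illustrative:

```python
def tool_error_result(tool_use_id, category, detail, next_action=None):
    """Build the tool_result block Claude reads after a failure.
    The text is effectively a prompt: precise cause, plain language,
    and explicit guidance on whether retrying can help."""
    lines = [f"Tool call failed ({category}): {detail}"]
    if next_action:
        lines.append(f"Suggested next step: {next_action}")
    else:
        lines.append("Do not retry this call with the same arguments.")
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "is_error": True,
        "content": "\n".join(lines),
    }
```

The block goes back in the next user turn, following the assistant's `tool_use` block — which is exactly why its wording is prompt engineering, not logging.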

## PART 4: THE FALLBACK TOOL PATTERN
For each critical tool, design a fallback:
• Primary tool fails → what secondary tool does Claude try?
• Secondary fails → graceful degradation (partial result, caveat in output)
• All fallbacks fail → halt with structured error

Example: 'address validation API down → fallback to a regex sanity-check → if that fails, return the address as-provided with a flag'.
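The address-validation example above generalises to a small chain runner — a sketch under the assumption that each fallback is a plain function raising on failure (the names are illustrative):

```python
def with_fallbacks(chain, value):
    """Try each (name, fn) in order; return the first success, flagged
    if it came from anything but the primary. If everything fails,
    return the value as-provided with a clear degradation flag."""
    primary = chain[0][0]
    for name, fn in chain:
        try:
            return {"result": fn(value), "source": name, "degraded": name != primary}
        except Exception:
            continue  # in production, log each failure — see the last Gotcha below
    return {"result": value, "source": "as-provided", "degraded": True}
```

The `source` field is what keeps fallback chains honest: if your dashboards show `"regex"` for a month straight, the primary has been silently down for a month.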

## PART 5: THE LOOP DETECTION
Agents that retry-loop are a classic cost and time sink. Detect and halt:
• Same tool + same args + same error 3 times in a row → halt and escalate
• Total tool calls per session > [N] → halt
• Total retry-induced latency per session > [T] → halt
• Claude re-planning the same approach twice → warn that planning isn't converging

Produce the wrapper code that enforces this.
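A minimal guard, sitting outside the agent loop as the text requires — thresholds and class name are illustrative defaults, not platform limits:

```python
import json
from collections import Counter

class LoopGuard:
    """Lives OUTSIDE the agent: halts on repeated identical failures
    or runaway call counts, regardless of how confident the agent sounds."""
    def __init__(self, max_identical_errors=3, max_total_calls=50):
        self.seen = Counter()
        self.total = 0
        self.max_identical_errors = max_identical_errors
        self.max_total_calls = max_total_calls

    def check(self, tool_name, args, error=None):
        """Call after every tool invocation; returns a halt reason or None."""
        self.total += 1
        if self.total > self.max_total_calls:
            return "halt: total tool-call budget exceeded"
        if error is not None:
            key = (tool_name, json.dumps(args, sort_keys=True), error)
            self.seen[key] += 1
            if self.seen[key] >= self.max_identical_errors:
                return f"halt: {tool_name} failed identically {self.seen[key]} times"
        return None
```

A latency budget check would slot in the same way: track wall-clock time per session and return a halt reason when it's spent.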

## PART 6: THE DEGRADED-RESPONSE HANDLING
Tools that return 'success' but garbage are the worst:
• Empty response with 200 status
• Malformed data that passes schema validation
• Stale data (returned from cache when fresh was needed)

For each tool, what's the 'sanity check' on the response before handing it to Claude?
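One shape such a check can take, assuming responses are dicts and carry a timestamp — the `fetched_at` field name is an illustration, not a convention:

```python
import time

def sane(response, required_keys, max_age_seconds=None):
    """Reject 'successful' responses that are empty, missing fields,
    or stale — the three degraded modes listed above."""
    if not response:
        return False, "empty response with success status"
    missing = [k for k in required_keys if k not in response]
    if missing:
        return False, f"missing fields: {missing}"
    if max_age_seconds is not None:
        age = time.time() - response.get("fetched_at", 0)
        if age > max_age_seconds:
            return False, f"stale: {age:.0f}s old"
    return True, "ok"
```

The returned reason string matters: a failed sanity check becomes a DEGRADED-category error message back to Claude, not a silent pass-through.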

## PART 7: THE HUMAN ESCALATION PATH
When retries exhaust, what happens? Options:
• Halt and log, let the user retry
• Email/page a human on-call
• Queue for later retry (for non-interactive workflows)
• Produce a best-effort partial result with a clear caveat

For MY agent's context, recommend one per failure type.

## PART 8: THE POST-MORTEM ARTIFACT
When a tool-use incident happens, you'll want the logs. Design what to capture:
• The tool calls attempted (name, args, timestamp)
• The responses received (success or failure, with details)
• Claude's reasoning between attempts (if extended thinking enabled)
• The total session cost and duration
• The eventual outcome (success, partial success, halt, escalation)
Expected Output

A failure-category taxonomy, per-category retry policy, error message templates for Claude, fallback patterns, loop detection, and escalation paths. This is the difference between a prototype that works in demos and an agent that survives production.

Variations
  • For agents making many parallel tool calls: 'Address concurrent-failure handling — 3 tools fail in parallel, what's the orchestration? Which errors block others?'
  • For agents with strict latency SLAs: 'Shift the bias — fewer retries, faster fallback to partial results. When speed beats completeness.'
  • For financial / medical / legal agents: 'Tighten escalation — almost any error escalates to human. Build the pattern where retry is the exception, not the default.'
Gotchas
  • 'Retry 3 times' without categorisation is the single most common anti-pattern. Retrying a validation error 3 times is 3 wasted calls.
  • Loop detection must be outside the agent. An agent can't reliably detect its own loop because the loop might include 'try something new' that looks like progress.
  • Error messages to Claude are prompts. Vague ones produce confused re-planning; precise ones produce correct next steps.
  • The Retry-After header exists. Respect it. Ignoring it is how you end up on a vendor's block-list.
  • Fallback chains can hide real outages. If your primary tool has been down for a month but the fallback's working 'fine', you have a silent reliability problem.
No. 19

The Observability Stack for AI

What to log, what to alert on, how to debug an agent that went wrong at 3am three weeks ago.

Setup
  • An agent or prompt-based system in (or heading to) production.
  • An observability platform (Datadog / Honeycomb / Grafana / even stdout logs).
  • A willingness to invest in observability BEFORE the first incident — otherwise you'll be reading logs that don't exist.
The Prompt
I need an observability stack for my agent infrastructure. Today, if something goes wrong, my debugging options are 'stare at my code' and 'hope'. I want to fix that.

STACK CONTEXT:
• Agent(s) running: [how many, doing what]
• Current observability: [what you have today; 'nothing' is a valid answer]
• Infrastructure: [Managed Agents / Messages API / hybrid]
• Platform: [Datadog / Honeycomb / cloud-native logging / custom]
• Volume: [requests/day, approximate]

TASK:

## PART 1: THE SIGNAL HIERARCHY
What to capture, at what level of importance:

CRITICAL (must have, from day one)
• Request ID / session ID (end-to-end trace handle)
• Model used
• Input token count
• Output token count
• Cache read tokens / write tokens
• Total cost per request
• Latency (total, time-to-first-token, time-between-tokens)
• Error code (if any)
• Tool calls attempted (name + success/failure)

IMPORTANT (should have by week 2)
• Structured input (selected fields, PII-scrubbed)
• Structured output (same)
• Claude's thinking tokens (if extended thinking on)
• Prompt version / hash
• Agent config version
• Per-tool-call latency and cost

NICE-TO-HAVE (add when you have the headroom)
• Full input/output (with aggressive PII scrubbing)
• Token distribution histogram (for cost spike diagnosis)
• User/tenant metadata
• A/B variant markers

Recommend what my stack should log at each level.
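The CRITICAL tier is small enough to pin down as one structured record. A sketch — the field names are a suggestion, not any platform's schema, and the price-sheet arguments are placeholders for your actual per-million-token rates:

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class RequestLog:
    """One record per request: the day-one CRITICAL signals from above."""
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    latency_ms: float = 0.0
    ttft_ms: Optional[float] = None      # time to first token
    error_code: Optional[str] = None
    tool_calls: list = field(default_factory=list)  # [(tool_name, succeeded)]

    def cost_usd(self, in_per_mtok, out_per_mtok):
        # per-request cost from token counts and your price sheet
        return (self.input_tokens * in_per_mtok
                + self.output_tokens * out_per_mtok) / 1_000_000
```

Emitting `asdict(record)` as one JSON line per request is enough to power every dashboard in Part 3.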

## PART 2: THE TRACE CORRELATION
One user action might cause:
• Your app to make an LLM call
• The LLM to make 3 tool calls
• Each tool call to hit a different backend
• The final response to generate a log line

All of this needs one trace_id. Design:
• Where the trace_id originates (HTTP header, session creation, generated per request)
• How it propagates (into the system prompt? tool-call metadata? OpenTelemetry-style spans?)
• How you query across spans to reconstruct a full session
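In Python, a `contextvars.ContextVar` is the usual mechanism for propagating one trace_id through app code, the LLM call, and every tool handler without threading it through each signature. A minimal sketch (the header-adoption convention is an assumption):

```python
import contextvars
import uuid

# One ambient variable carries the trace id across the whole request
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace(incoming_header=None):
    """Originate at the request boundary — or adopt an upstream trace id
    if the caller already started one."""
    tid = incoming_header or uuid.uuid4().hex
    trace_id_var.set(tid)
    return tid

def log_span(name, **fields):
    """Every log line carries the ambient trace id, so spans join up
    when you query by trace_id later."""
    return {"trace_id": trace_id_var.get(), "span": name, **fields}
```

Reconstruction then reduces to one query: all records where `trace_id == X`, ordered by timestamp.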

## PART 3: THE DASHBOARDS
Three dashboards you need:

DAILY HEALTH
• Request volume, error rate, p50/p95/p99 latency, total cost
• Model breakdown (which models, what volume, what cost each)
• Tool success rate per tool
• Cache hit rate

INCIDENT RESPONSE
• For a given session_id: full timeline, every tool call, every error, cost, duration
• Ability to replay (or at least view) the prompt and response
• Correlation to upstream and downstream requests

COST INVESTIGATION
• Cost per agent, per session, per tenant
• Outlier detection (sessions > 3 standard deviations above average cost)
• Trends over time, vs budget

Produce the dashboard spec for each.
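The outlier rule in the cost dashboard is a one-liner worth writing down. A sketch, assuming per-session costs are already aggregated into a dict:

```python
import statistics

def cost_outliers(session_costs, z=3.0):
    """Flag sessions more than z standard deviations above the mean cost.
    session_costs: {session_id: usd}."""
    costs = list(session_costs.values())
    if len(costs) < 2:
        return []
    mean = statistics.mean(costs)
    sd = statistics.stdev(costs)
    if sd == 0:
        return []  # every session costs the same: nothing to flag
    return [sid for sid, c in session_costs.items() if (c - mean) / sd > z]
```

For heavily skewed cost distributions a percentile cut (say, top 0.1%) is often more robust than a z-score, but the z-score version matches the spec above.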

## PART 4: THE ALERTS
Alerts should be actionable. For each, define:
• The signal
• The threshold
• Who gets paged
• What the first action is

Proposed alerts:
• Error rate above [X]% over [T] minutes
• Cost burn-rate exceeding monthly-budget-divided-by-days projection
• p95 latency regression
• Cache hit rate dropping below [X]% (suggests cache eviction or prompt change)
• Tool success rate below [X]% for any specific tool
• Loop detection (same session hitting [N] tool calls)

## PART 5: THE PII & COMPLIANCE LAYER
Observability meets data protection:
• What fields MUST be redacted before logging (PII, credentials, sensitive content)
• How to log 'something happened' without logging the content
• Retention policy (7 days for full logs? 90 days for metadata? forever for audit trails?)
• Access controls (who can see full logs vs aggregated metrics)
• Subject-access-request support (if a user asks for their data, can you find and delete it?)

## PART 6: THE INCIDENT-RESPONSE PLAYBOOK
When an alert fires:
• Triage: how do you know it's real vs noise?
• Containment: what stops the bleeding?
• Investigation: how do you find root cause?
• Communication: who needs to know, in what order?
• Recovery: how do you verify the fix?
• Post-mortem: what goes in the doc?

Produce the playbook.

## PART 7: THE 'DEBUG THIS ONE REQUEST' WORKFLOW
A user reports 'this response was wrong'. They give you a session_id (or equivalent).

In under 10 minutes, you should be able to:
• Find the full request
• See the exact prompt Claude received
• See the exact response
• See every tool call and result
• See the Claude reasoning (thinking tokens if enabled)
• See the cost and latency
• See what version of the prompt/agent was live

If your current stack can't do this, Part 7 is your highest priority.

## PART 8: THE COST OF OBSERVABILITY
Observability costs money. Be honest:
• Logs at full verbosity: $[X]/month for your volume
• Traces with full token content: $[Y]/month
• Retention at [N] days: $[Z]/month

Where's the cost/value curve? Where do you cut?
Expected Output

A tiered logging plan, trace correlation design, three dashboards, actionable alerts, PII layer, incident playbook, and a 'debug one request' workflow. The Part 7 test is the acid test — if you can't debug one request in 10 minutes, your observability isn't production-ready.

Variations
  • For high-volume consumer-facing: 'Address sampling — you can't log every request at full fidelity. Design the sampling strategy (always keep errors, p99 latency outliers, cost outliers; sample the rest).'
  • For regulated industries: 'Emphasise PII and audit-log compliance. Retention mandated by regulators, not chosen by engineering. Access-control tightness is non-negotiable.'
  • For multi-tenant SaaS: 'Address per-tenant observability — tenants want their own data, not the aggregate. Design the tenant-facing dashboard separately from the ops one.'
Gotchas
  • Logging too much too early is expensive and hides signal in noise. Logging too little leaves you flying blind. The tiered approach (Part 1) balances this.
  • Trace IDs are cheap to add early and impossible to retrofit later. Add them before you need them.
  • Alerts that page you at 3am and turn out to be false need to be tuned, not ignored. Unmanaged alerts become ignored alerts.
  • Full token content in logs is a compliance surface. Understand what your regulators expect before deciding the retention.
  • Observability cost scales with volume. What's free at 1000 req/day is expensive at 1M req/day. Plan the cost curve.
No. 20

The Build-vs-Buy Memo for Agent Infrastructure

Before you build it yourself, here's the memo that proves whether you should.

Setup
  • A team considering whether to build agent infrastructure or adopt Managed Agents (or a competitor).
  • Honest numbers on engineering capacity, timelines, and alternative project costs.
  • A clear workload description — generic 'we want agents' doesn't work; specific workloads do.
The Prompt
I'm writing a build-vs-buy memo for our agent infrastructure decision. The question: do we build the agent harness ourselves (LangGraph, custom sandboxing, our own state management) or adopt Managed Agents (or similar)?

CONTEXT:
• Team: [size, ML/AI maturity, availability]
• Workload: [one paragraph — the actual agent workload at stake]
• Current state: [what you have today — prototypes, partial production, nothing]
• Timeline pressure: [when does this need to be in production, and what forces that date]
• Control requirements: [data residency, custom model integrations, vendor-agnostic, etc.]

TASK:

## PART 1: THE HONEST SCOPE OF "BUILD"
Building agent infrastructure is not a weekend. Enumerate:
• Sandboxing (secure container execution for untrusted agent code)
• State management (persistence across session interruptions)
• Tool orchestration (dispatch, retry, timeout, logging)
• Credential vaulting (never show keys to the model)
• Event streaming (SSE or equivalent for real-time UX)
• Checkpointing (resume a long-running session after a crash)
• Observability (tracing, logging, cost attribution)
• Multi-tenancy (if applicable)
• Incident response and on-call rotation
• Ongoing maintenance (updates, security patches, new model migrations)

For each: rough engineer-months to build to production quality. Be honest. Add 50% for the 'works in prod but weird edge case every Tuesday' tax.

## PART 2: THE HONEST SCOPE OF "BUY"
Managed Agents (or similar) handles most of Part 1. But not all.

What you still build on top of a managed platform:
• Your domain-specific tools and MCP servers
• Your prompts, skills, eval suite
• Your application integration (UI, webhooks, business logic)
• Your data pipeline feeding the agent
• Your own monitoring layered on top of the buy-platform's monitoring
• Your fallback for when the buy-platform has an outage

Rough engineer-months for the "still build on top" portion.

## PART 3: THE COST MODEL
Build:
• One-time (engineering to production)
• Ongoing (maintenance, on-call, updates)
• Infrastructure (cloud compute, storage, observability)
• Opportunity cost (what else could the team be building?)

Buy:
• Per-runtime-hour ($0.08 for Managed Agents)
• Token costs (separate)
• Lock-in risk (what if the vendor raises prices or deprecates the product?)
• Migration cost (out, if ever needed)

Run the numbers for MY workload at MY expected volume. Be specific. 'It depends' is not an answer.
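The comparison is simple arithmetic once you commit to numbers. A sketch using the $0.08/runtime-hour figure above; every other input (loaded engineer cost, maintenance load, amortisation horizon) is an illustrative assumption you must replace with your own:

```python
def annual_buy_cost(runtime_hours_per_day, rate_per_hour=0.08):
    """Managed-runtime cost only; token costs are billed separately
    and apply to both sides of the comparison."""
    return runtime_hours_per_day * rate_per_hour * 365

def annual_build_cost(engineer_months_build, engineer_months_yearly_maint,
                      loaded_cost_per_month, infra_per_year, amortise_years=2):
    """Amortise the one-time build over a chosen horizon,
    then add recurring maintenance and infrastructure."""
    one_time = engineer_months_build * loaded_cost_per_month
    ongoing = engineer_months_yearly_maint * loaded_cost_per_month + infra_per_year
    return one_time / amortise_years + ongoing
```

Worked through with assumed inputs: 100 runtime-hours/day buys at about $2,920/year, while an 18-engineer-month build at $20k/month loaded, with 4 engineer-months/year of maintenance and $30k/year infrastructure, lands near $290k/year over a 2-year amortisation. The point isn't these numbers — it's that the memo must contain numbers.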

## PART 4: THE RISK COMPARISON
Build risks:
• Timeline slip (likely)
• Security bug in sandbox (catastrophic)
• Team attrition (knowledge loss)
• Becoming an infra vendor to yourself (years of opportunity cost)

Buy risks:
• Vendor lock-in (moderate — MCP and Skills are somewhat portable)
• Pricing changes (watch the roadmap)
• Feature gaps (if something the vendor doesn't support is core to your use case)
• Compliance / data-residency gaps

Quantify where possible. Use likelihood × impact.

## PART 5: THE DECISION MATRIX
For each of the following team profiles, who should build vs buy?
• Greenfield startup, small team, fast timeline, no strong platform commitments
• Mid-market with existing ML infra, moderate timeline, multi-vendor strategy
• Enterprise with compliance requirements, long timeline, platform-lock-in tolerance varies
• Hyperscaler with ML platform team, no timeline pressure, build-everything-ourselves culture

For MY team profile, which bucket is closest, and what's the recommendation?

## PART 6: THE HYBRID OPTION
Rarely is it 100% build or 100% buy. Most real answers are:
• Start on managed. Migrate to custom if/when specific needs justify.
• Use managed for production; build a wrapper so you can switch vendors if needed.
• Build the domain-specific parts (tools, prompts); buy the generic parts (runtime, state).

Which hybrid makes sense for my context? What's the switch-out cost if I need to leave?

## PART 7: THE 2-YEAR LOOK
Technology moves. Think out 2 years:
• Will Managed Agents still be the best managed option, or will competitors (OpenAI, AWS) ship equivalents?
• Will agent infra become so commoditised that building is silly? Or so specialised that buying creates a ceiling?
• How does your own team's maturity shift the answer over time?

Recommend the decision that's robust to uncertainty — not just the one that looks best today.

## PART 8: THE EXECUTIVE READ
One page. For the CEO/CTO/board. Clear recommendation. The three scenarios under which the recommendation flips. The one question I haven't answered that would change the decision.

## PART 9: THE DECISION DOCUMENT
Not the memo — the actual decision. In writing. With an owner. With a review date.
• Decision (build / buy / hybrid, specifically)
• Why (the 2-3 deciding factors, not the full memo)
• Review trigger (when does this decision get revisited?)
• Owner (whose neck is on the line)

This is what you file in the team's decision log.
Expected Output

A rigorous build-vs-buy memo: honest scope on both sides, real cost model, quantified risks, and a recommendation robust to the 2-year horizon. The Part 9 decision document is what turns a memo into an actual decision — most build-vs-buy analyses end without a decision actually being made.

Variations
  • For a team that already built custom infra: 'Frame as migration analysis: stay-put cost vs migration cost vs greenfield-rewrite cost. Add a cost of maintaining two systems during migration.'
  • For a team evaluating multiple buy options: 'Expand Part 2 to cover all the realistic options (Managed Agents, OpenAI Assistants, Bedrock Agents, Vertex, etc.) individually.'
  • For a team with platform engineers who want to build: 'Address team morale / retention risk. Build-vs-buy is sometimes decided on ego, not economics. Name that dynamic in the memo if it's present.'
Gotchas
  • Build cost estimates are always too low. Double or triple the first number. The one team that hits its estimate was lying to itself at the start.
  • 'Vendor lock-in risk' is a real concern, but it's often overstated vs 'maintenance-of-custom-infra risk', which is understated.
  • The decision is reviewable, not permanent. Framing 'build vs buy' as irreversible generates anxiety and worse decisions.
  • If the buy option is new (like Managed Agents beta), the 2-year look is crucial. Don't buy on early-stage features; buy on what the platform will predictably be in 18 months.
  • The 'who owns the decision' question in Part 9 is where many memos die. If nobody owns it, nothing changes, and six months later someone asks the question again.

the work you do with these

The platform surface will change more in the next six months than it did in the six months before this volume went to press. Managed Agents will graduate from beta. The research-preview features — outcomes, multi-agent, memory — will move into the beta surface. New models will ship. Prices will shift.

What won't change is the shape of the work. Prompt Improver will still be over-aggressive. Multi-agent systems will still be wrong more often than they're right. Structured outputs will still beat regex parsing. The cost of observability after the first incident will still be higher than the cost before.

Volumes I and II catalogued Claude. This volume catalogues what builders do with it. A Volume IV, if there is one, will catalogue what clients pay for. The progression matters — each volume is less about the tool and more about the work.

Tools are replaceable. Methodology is durable. Write down yours.

If you found this volume useful, the highest-value thing you can do is write your own — your own prompts, your own gotchas, your own cost models, calibrated to your team and your clients. Use this as scaffolding. Burn it down as you go.

about this volume

The third volume in The Claude Compendium series. Volume I is the reference. Volume II is the consultant's working prompt library. This third volume is for builders, or for the consultants advising them.

Set in Fraunces (serif) and JetBrains Mono (monospace). Compiled in April 2026. Non-commercial reproduction permitted with attribution; commercial use requires permission.

Pair this volume with The Claude Compendium — Consulting Edition, Volume I and The Consultant's Twenty, Volume II.