10 · Note 01 — Cost Architecture
Status: Outline. Body fills in Week 4. Voice: principal-level, BFSI-threaded, Apic-calibrated.
What this file is. How to design token economics into an architecture — not as an afterthought but as a first-class constraint that shapes model selection, caching strategy, and tier design.
What this file is NOT. A pricing page summary — this is about architectural decisions that change unit costs by an order of magnitude.
The cost formula
Token cost is deterministic at design time. The formula:
Monthly cost = (input_tokens × input_rate + output_tokens × output_rate) × call_volume
− (cache_read_tokens × cache_discount × cache_hit_rate × call_volume)
April 2026 reference rates (Claude API, USD per million tokens):
| Model | Input | Output | Cache write | Cache read |
|---|---|---|---|---|
claude-opus-4-7 |
$15 | $75 | $18.75 | $1.50 |
claude-sonnet-4-6 |
$3 | $15 | $3.75 | $0.30 |
claude-haiku-4-5-20251001 |
$0.25 | $1.25 | $0.30 | $0.03 |
Key insight: output tokens cost 5× input tokens. A prompt that produces long verbose output is expensive in a different category from a prompt that reads large context. Optimize input and output separately.
Cache read costs ~10% of input cost. A 90% cache hit rate on a 2,000-token system prompt saves roughly $2.70 per million calls on Sonnet. At 200K calls/day, that's ~$16K/month saved on that one prompt alone.
The three cost levers
Lever 1: Model tier selection
The cost ratio between Opus 4.7 and Haiku 4.5 is 60:1 on input, 60:1 on output. Choosing the wrong tier is the single biggest architectural error in cost design.
Decision framework: - Haiku on the hot path — anything customer-facing, high-volume, latency-sensitive, extractive (classify, extract, route) - Sonnet for the reasoning layer — generation, synthesis, structured drafting, moderate complexity - Opus for the judgment layer — compliance review, adversarial edge cases, audit-grade reasoning, tasks where wrong output has legal/regulatory consequences
The BFSI SOP search use case: Haiku retrieves and ranks, Sonnet synthesizes the answer, Opus reviews if the query touches regulatory text. Three tiers, one workflow.
Lever 2: Prompt caching
Prompt caching converts repeated context into cheap cache reads. Anything that appears verbatim at the top of every call is a caching candidate:
- System prompts (role definition, safety guardrails, tool definitions)
- Static reference text (RBI guidelines, internal policy documents, product catalogs)
- Few-shot examples
Caching requires: content is ≥1024 tokens (Sonnet/Opus) or ≥2048 tokens (Haiku), content appears at the start of the prompt (or in cache_control: ephemeral blocks), and the cache TTL (5 minutes default, extendable) covers your call pattern.
The cache hit rate is not guaranteed. Burst traffic after a TTL expiry can cause a "cache cold start" — every request pays full input cost until the cache warms. Design for this; see the cost cliff section.
Lever 3: Batching (Message Batches API)
For offline workloads — nightly compliance review, document classification, report generation — the Message Batches API offers a 50% cost reduction with 24-hour SLA.
Not appropriate for: anything synchronous, anything customer-facing, anything with a latency requirement.
BFSI use cases suitable for batching: - Nightly compliance document review (use case 4) - Monthly executive analytics summarization (use case 6, historical data) - Bulk SOP re-indexing after policy updates
Cost-per-use-case: BFSI worked estimate
Use case 1: Customer support agent assist at 200K calls/day
Assumptions: - Avg system prompt: 3,000 tokens (role, guardrails, tool defs) — cacheable - Avg conversation context sent per call: 800 tokens (last 4 turns) - Avg user query: 120 tokens - Avg model output: 300 tokens - Model: Haiku 4.5 (hot path, customer-facing) - Cache hit rate: 85% on system prompt
Per-call token math: - Input uncached: 800 + 120 = 920 tokens at $0.25/M - Input cache read: 3,000 tokens at $0.03/M (85% of calls) - Input cache write: 3,000 tokens at $0.30/M (15% of calls — cache miss) - Output: 300 tokens at $1.25/M
Per-call cost:
Uncached input: 920 × $0.25/M = $0.000230
Cache read: 3,000 × $0.03/M × 0.85 = $0.0000765
Cache write: 3,000 × $0.30/M × 0.15 = $0.000135
Output: 300 × $1.25/M = $0.000375
──────────
Total per call: $0.000817
Monthly cost (200K/day × 30 days = 6M calls):
Without caching (same volume, paying full input for system prompt):
(920 + 3,000) × $0.25/M × 6M + 300 × $1.25/M × 6M
= 3,920 × $0.25/M × 6M + 300 × $1.25/M × 6M
= $5,880 + $2,250 = $8,130/month
Caching saves ~40% on this use case: $3,230/month.
Business unit cost attribution
At scale, "Claude costs X per month" is not useful. You need cost per team, per use case, per product SKU.
Attribution model:
- Tag every API call with metadata:
use_case,business_unit,user_role,channel - Ship these tags as request metadata; capture them in your observability layer (CloudWatch, Datadog)
- Aggregate by dimension in your cost dashboard
Per-use-case budget ownership: - Support agent assist → Contact Centre Operations budget - SOP search → HR / Internal Operations budget - RM copilot → Retail Banking / Wealth budget - Compliance review → Compliance & Legal budget - Developer productivity → Engineering / CTO office budget - Exec analytics → Corporate Affairs / CEO office budget
Showback vs. chargeback:
Start with showback (show teams what they spend, don't bill them). Once patterns are established (typically 3 months), move to chargeback with per-use-case budgets and alerting. The CISO will want to know if a single team can generate runaway costs — budget caps per business unit are a governance control, not just a finance one.
The CFO question you will get: "What's my cost per customer interaction?" Answer: $0.0008 for Haiku-tier agent assist. For context, a human agent handles ~50 interactions/day at a fully-loaded cost of ~$80/day = $1.60/interaction. Claude is 2,000× cheaper at this tier.
The cost cliff: what happens when caching fails
The cost cliff is a specific failure mode: cache TTL expires during a traffic surge, every concurrent request pays full input cost, the cost spike is 3–5× normal for 5–15 minutes, and if your rate limits are set based on normal cost, you may also hit throttling.
Scenario: Monday 9am surge on SOP search
- Normal traffic: 500 req/min, 85% cache hit rate
- Monday morning: 2,000 req/min, cache just expired at 8:58am
- All 2,000 req/min pay full input cost for 5 minutes (cache warms by 9:05am)
- Cost spike: 5× normal for 5 minutes ≈ 1.7% monthly cost blown in 5 minutes
Mitigation strategies:
- Cache warming job: A scheduled Lambda/cron job that sends one request per use case at TTL-5 minutes to keep the cache warm. Simple and effective.
- Sticky cache via API tier: Use Apic's extended cache TTL (available at higher usage tiers) for critical system prompts.
- Cost anomaly alert: CloudWatch metric alert on
TokensIn > 2× baseline for 5 min→ pages on-call, not just weekly review. - Graceful degradation budget: Reserve 15% monthly budget headroom for cache cold starts. Don't budget to the penny.
The CISO cares about this because a cost cliff at scale can be induced by a targeted traffic spike — it's a cost-based denial-of-service vector. Budget caps per use case are a defense.
Cross-references
- Note 02 — Latency Architecture
- Note 04 — Observability — cost observability dashboard
- Case Study 01 — BFSI Cost & Latency Model
- Drill 01 — Estimate Tokens from Business Volume
- Module 04 — Claude Platform Architecture
Strong-Hire bar for this file
- Can derive per-call cost from first principles (formula, not a rate card lookup)
- Names the three levers in order of impact magnitude and explains why caching > batching for synchronous workloads
- Produces the BFSI 200K/day worked estimate with explicit assumptions in under 3 minutes
- Connects cost attribution to governance (CISO, budget caps, chargeback)
- Explains the cost cliff and has at least two mitigations ready
The interview-room answer (60 sec)
"Cost architecture starts with the formula: input tokens times rate plus output tokens times rate, minus caching savings. The three levers in order of impact are model tier selection — a 60x cost ratio between Opus and Haiku means the wrong model choice dominates everything else — prompt caching, which converts your static system prompt from expensive input to cheap cache reads, and batching for offline workloads at 50% discount. For the BFSI support agent at 200K calls per day, Haiku plus prompt caching lands at roughly $5K/month. Without caching, that's $8K. The design decision that unlocks this is putting the cacheable system prompt first and keeping it stable. The one risk I'd flag proactively: cache cold starts during traffic surges are a 3–5x cost spike — I always build a cache warming job and reserve 15% budget headroom for it."