04 · Note 03 — Prompt Caching Architecture

Status: Outline. Body fills in Week 2. Voice: principal-level, BFSI-threaded, Apic-calibrated.

What this file is. Prompt caching as an architectural concern, not a "tip" — TTL, hit-rate optimization, cost impact, breakpoint placement, eval implications.

What this file is NOT. A copy of Apic's caching docs. Not a benchmark.

Why caching is architecture

Most engineers see prompt caching as "save tokens." Architects see it as the single largest lever to fit Claude into a customer's cost envelope without sacrificing capability. Treating caching as a tip leaves 50–80% of the savings on the table.

The three architectural questions caching answers

Where do you place cache breakpoints? — wrong placement burns the budget.
What hit-rate are you designing for? — sets your cost model.
How does cache invalidation interact with your eval regression gates? — wrong answer means you ship a model change with cache pollution skewing the evals.

Mechanics (the part you must know cold)

Cache breakpoints sit at chosen positions in the prompt (system prompts, document contexts, tool definitions, prior conversation turns).
Tokens before a breakpoint can be cached and reused on later calls.
Cache hits are billed at a fraction of the input-token rate; cache writes carry a small premium.
TTL is bounded — typically minutes — so cache strategy is sensitive to call cadence.
Cache scope is per-account / per-organization, not cross-customer.

→ Verify exact pricing/TTL numbers against current Apic docs before quoting in interview.

Where to place breakpoints

The single highest-leverage decision. The pattern that works in regulated enterprise:

[ system prompt ] ◄── breakpoint 1: large, stable content
[ tool definitions ]
[ retrieved context (RAG) ] ◄── breakpoint 2 if context is reused
[ conversation history ] ◄── breakpoint 3 in long-lived sessions
[ current user turn ]

Rules of thumb

Stable content first, volatile content last. Anything that changes per-request must come after the last breakpoint.
Big-and-static beats small-and-dynamic. Cache the 50K-token policy bundle once; don't try to cache the 200-token user turn.
Tool definitions go in the cached zone — they are large and rarely change.

Hit-rate optimization

The cost model depends on hit-rate:

effective_input_cost = (cache_miss_cost × miss_rate)
                     + (cache_hit_cost × hit_rate)
                     + (cache_write_premium × first_call_rate)

For a 50K-token system prompt with 90% hit-rate, effective input cost can drop 5–8× vs naive uncached. For a 5K-token system prompt with 30% hit-rate, savings are marginal.

Architectural moves to lift hit-rate

Co-locate calls in time. Burst-then-quiet patterns kill hit-rate. Steady-state traffic is cheap.
Pin one user-session to one worker when sticky-session is acceptable.
Pre-warm caches for predictable burst traffic (market open, daily-batch start).
Refuse to put random IDs in the cached zone. Even one varying byte breaks the cache.

Eval implications (the bit most teams miss)

Caching can mask regressions if you don't design for it. Two specific traps:

Trap 1 — Cache pollution across model versions

You cached against claude-sonnet-4-6 last week. You're testing claude-sonnet-4-7-preview this week. Cache misses → cost spikes → you blame the new model when it's actually fine.

Mitigation: evals run against a cleared cache state (deterministic), not the production cache state.

Trap 2 — Eval scores skew with cache hot/cold

The first call in an eval batch is cold; the next 99 are hot. Latency numbers from a cached run won't match production. Mitigation: report eval latency at both states (cold p99, hot p99).

→ See Module 06 Note 04 — Online Monitoring & Regression Gates.

BFSI customer caching pattern (preview)

For the running BFSI customer's six use cases:

Use case	Cached zone	Why
Support agent assist	Persona + policy bundle + escalation rules	~30K tokens, stable per shift
Internal SOP search	Tool defs + retrieval rerank prompt	Stable across all queries
RM copilot	Persona + customer-360 schema + tool defs	Stable per RM
Compliance review	Reg-framework reference + drafting style guide	Stable per regulator
Dev productivity	(handled by Claude Code defaults)	—
Exec analytics	Schema + safety guardrails	Stable per dashboard

The interview-room answer (60 sec)

"Caching is architecture, not a tip. Three decisions: where to place breakpoints — stable content first, volatile last, tool definitions inside the cached zone; what hit-rate you design for — that sets your cost model; and how cache state interacts with eval regression gates — evals run cleared-cache, latency reported at both cold and hot. For BFSI support agent assist with a 30K-token policy bundle and steady-state traffic, hit-rate of ~90% drops effective input cost roughly 5–8× vs naive uncached. The trap most teams miss is cache pollution across model versions — that's why evals must run with a deterministic cache state, not the prod state."

Cross-references

Predecessor: Note 01 — Claude API Surface.
Sibling: Note 02 — Model Selection Matrix — caching shifts the matrix.
CodeLab: 04 — Prompt Caching.
Drill: 02 — Whiteboard Prompt Cache ROI.
Module 10: Cost Architecture.

Strong-Hire bar for this file

Breakpoint placement rules reflex.
Cost-model formula at fingertips.
Eval implications (cache pollution + cold/hot reporting) called out without prompting.
BFSI per-use-case caching plan defended cold.