10 · Note 03 — Reliability Patterns
Status: Outline. Body fills in Week 4. Voice: principal-level, BFSI-threaded, Apic-calibrated.
What this file is. Patterns for making Claude deployments reliable — retry logic, fallbacks, circuit breakers, and graceful degradation — applied to a regulated BFSI environment where "Claude is down" has a specific, designed answer.
What this file is NOT. Generic distributed systems theory — every pattern here is calibrated to Claude API failure modes specifically.
Failure taxonomy
Not all failures are equal. Before designing retry and fallback logic, name what you're retrying from:
| Failure type | HTTP status | Retryable? | Notes |
|---|---|---|---|
| API timeout (no response) | — (timeout) | Yes, with backoff | Check idempotency first |
| Rate limit exceeded | 429 | Yes, with backoff | Respect retry-after header |
| Inference error (model error) | 500, 529 | Yes, 1–2 times | Not always transient |
| Overloaded (API capacity) | 529 | Yes, aggressive backoff | Apic-specific status |
| Streaming interruption | — (mid-stream) | Yes, from checkpoint | Requires resumption logic |
| Invalid request | 400 | No | Fix the prompt, not the retry |
| Auth failure | 401 | No | Rotate credentials, alert |
| Content policy refusal | 200 (stop_reason: "max_tokens" or refusal) | No | Handle in application logic, not retry |
The critical distinction: A 529 (overloaded) is a temporary capacity signal — retry with aggressive backoff. A 400 is your bug — retrying wastes quota and time. A content policy refusal is not a failure; it is the model working correctly. Retrying a refusal is a product logic error, not a reliability pattern.
Retry strategy
The parameters:
- Max retries: 3 for synchronous; 5 for async/batch
- Initial backoff: 1s for rate limits; 500ms for inference errors
- Backoff multiplier: 2× (exponential)
- Jitter: Full jitter (random(0, backoff)) — prevents thundering herd
- Max backoff: 30s (cap to prevent extreme delays)
- Timeout per attempt: Synchronous: 8s; Async: 60s; Batch: no per-call timeout
Pseudocode:
def call_with_retry(request, max_retries=3):
delay = 1.0
for attempt in range(max_retries + 1):
try:
return claude_client.messages.create(**request)
except RateLimitError as e:
if attempt == max_retries:
raise
wait = e.retry_after or (delay + random.uniform(0, delay))
time.sleep(wait)
delay = min(delay * 2, 30)
except OverloadedError:
if attempt == max_retries:
raise
time.sleep(delay + random.uniform(0, delay))
delay = min(delay * 2, 30)
except APITimeoutError:
if attempt == max_retries:
raise
time.sleep(delay * 0.5 + random.uniform(0, delay * 0.5))
delay = min(delay * 2, 30)
except BadRequestError:
raise # Never retry — fix the request
BFSI-specific constraint: Retry budgets must be visible in observability. A system that is silently retrying 20% of requests looks fine from the outside but is burning 20% extra quota and adding 1–3s to p99 latency. Alert on retry rate >5% as a leading indicator of API health degradation.
Fallback hierarchy
Design fallbacks before deployment. The question "what does the system do when Claude fails?" must have a documented, tested answer.
The BFSI fallback hierarchy (ordered, stop at first success):
Level 1: Retry with same model (exponential backoff, max 3 attempts)
Level 2: Downgrade model tier (Sonnet → Haiku, or Opus → Sonnet)
Level 3: Return cached static response (if use case supports it)
Level 4: Human escalation / queue for later processing
Level 5: Graceful degradation message to end user / agent
Level 2 (model downgrade) details:
Model downgrade is the most powerful and underused fallback. If Sonnet returns a 529, switch to Haiku. The response quality drops, but the interaction completes. For the support agent assist use case, Haiku's response quality is acceptable for most queries — the agent will flag low-confidence responses for human review anyway.
Implementation: maintain two client configurations (primary tier, fallback tier). The fallback is transparent to the caller — the response has the same schema, just with a model_used: haiku flag added for observability.
Level 3 (cached static response) details:
For high-frequency, low-specificity queries, a pre-generated response cache is appropriate. In SOP search, if Claude is unavailable, return the last generated answer for that query if it was generated within the past 24 hours. Cache at the query-hash level, not the request level.
Not appropriate for: anything time-sensitive, anything where the query is unique (RM copilot, compliance review).
Level 4 (human escalation):
For the support agent use case: if Claude fails after all retries, route the conversation to a human queue. The agent UI shows "Transferring to a specialist" — no error message to the end customer. The bank's existing Genesys routing handles this. The integration contract must be agreed with the Contact Centre Operations team before go-live.
Circuit breaker pattern
The circuit breaker prevents a failing API from being hammered with retries — protecting both your application and the upstream service.
Three states:
- Closed (normal): Requests flow through. Error rate tracked.
- Open (failing): Requests fail fast without calling Claude. Periodic probe after timeout.
- Half-open (recovering): One test request allowed. If it succeeds, close the circuit; if it fails, reopen.
Thresholds for BFSI:
| Use case | Open threshold | Open timeout | Notes |
|---|---|---|---|
| Support agent assist | Error rate >25% over 60s | 30s | High volume; open quickly |
| SOP search | Error rate >40% over 120s | 60s | Lower volume; more tolerance |
| Compliance review | Error rate >50% over 300s | 120s | Async; tolerant |
| RM copilot | Error rate >30% over 90s | 45s | Higher stakes; moderate tolerance |
When the circuit is open:
The system should immediately route to the fallback path (human escalation, cached response) without waiting for the retry budget. This is the difference between a 3-second degraded experience and a 30-second failed experience.
The CISO's question: "What does your system do in the 30 seconds after Claude goes down?" The circuit breaker answer: "After 25% error rate over 60 seconds, the circuit opens. All calls fail fast to fallback in <50ms, no retries. The Genesys queue gets a load spike, and we alert on-call within 2 minutes. We have capacity headroom in the human queue for a 30-minute outage."
Graceful degradation: what the support agent does when Claude is unavailable
This is the specific scenario the CISO will probe. Have a concrete answer.
The degradation design for UC1 (support agent assist):
When Claude is unavailable (circuit open or all retries exhausted):
1. The agent UI banner: "AI assistance temporarily limited —
suggested responses may be slower or unavailable."
2. Existing rule-based suggestions continue (pre-Claude playbook —
these run in the bank's existing CRM layer, no dependency on Claude).
3. If agent needs help beyond rule-based: one-click escalation to
senior agent / specialist queue.
4. All affected interactions are flagged in the audit log for
post-incident review.
5. Recovery: when circuit closes, the UI banner disappears automatically.
No agent action required.
What you do NOT do:
- Show an error message to the end customer ("AI is down")
- Let the agent see a blank response with no fallback
- Let the application spin indefinitely (timeout must be enforced)
- Re-enable Claude automatically without a smoke test ("half-open" state in circuit breaker handles this)
SLA design: 99.9% vs. 99.95%
| SLA | Allowed downtime/month | Equivalent |
|---|---|---|
| 99.9% | ~43 minutes | Acceptable for internal tools (SOP search, RM copilot) |
| 99.95% | ~22 minutes | Appropriate for customer-facing (support agent assist) |
| 99.99% | ~4 minutes | Reserved for core banking — not appropriate for AI layer |
Reality check: The Apic API / Bedrock SLA is 99.9% by contractual terms. Your application-level SLA cannot exceed the provider's SLA unless you build multi-provider redundancy (Bedrock + Vertex, with failover). For most BFSI deployments, 99.9% is achievable; 99.95% requires active-active multi-region or multi-provider.
The BFSI recommendation:
- UC1, UC2, UC3 (customer/employee-facing, synchronous): target 99.9% application SLA; document that 99.95% requires multi-provider investment
- UC4, UC6 (async, batch-tolerant): 99.5% is acceptable (retry windows absorb downtime)
- UC5 (developer productivity): 99.0% is acceptable (developer tolerance for tool outages is higher)
The CISO will ask about the SLA. The honest answer: "Claude's hosted SLA is 99.9%. Our application adds retry, circuit breaker, and fallback to get our effective application-level availability above that. We've tested for the 43-minute scenario. If you need 99.95%, we need to scope a multi-provider architecture and the additional cost."
Cross-references
- Note 02 — Latency Architecture — latency budget and timeout design
- Note 04 — Observability — error rate alerting, circuit breaker metrics
- Note 05 — Production Readiness Review — reliability section of PRR
- Drill 03 — Blast Radius Conversation
- Module 09 — Security, Privacy, Governance
Strong-Hire bar for this file
- Names the failure taxonomy and distinguishes retryable from non-retryable failures
- Specifies retry parameters with jitter and explains why jitter prevents thundering herd
- Describes the five-level fallback hierarchy with specific BFSI fallback behavior at each level
- Explains circuit breaker states and gives concrete thresholds for the BFSI use cases
- Directly answers the CISO question: "what happens in the 30 seconds after Claude goes down?"
The interview-room answer (60 sec)
"Reliability for Claude deployments has three layers. First: retry strategy — exponential backoff with full jitter, max three attempts for synchronous, separate retry budgets for 429 rate limits versus 529 overload. Never retry a 400 — that's your bug. Second: fallback hierarchy — retry same model, downgrade to cheaper tier, serve cached static response, escalate to human. The model downgrade is underused and very effective; Haiku on fallback completes the interaction at lower quality rather than failing. Third: circuit breaker — when error rate exceeds 25% over 60 seconds, fail fast to fallback without retrying. This is what answers the CISO's question: 30 seconds after Claude goes down, every support agent call is already routing to the Genesys human queue via the open circuit, not timing out after 8 seconds each. The difference is whether your customers experience a 3-second degraded experience or a 30-second failure."