10 · Note 04 — Observability

Status: Outline. Body fills in Week 4. Voice: principal-level, BFSI-threaded, Apic-calibrated.

What this file is. The metrics, traces, and logs that give operators visibility into a Claude deployment — what to instrument, what to alert on, and how it integrates with the BFSI customer's existing stack.

What this file is NOT. A generic APM tutorial — every metric here is chosen because it answers a specific operational or governance question in a regulated BFSI context.

The three observability layers

Layer 1: METRICS    — aggregated rates and distributions, per use case
                       → "Is the system healthy right now?"

Layer 2: TRACES     — per-request spans, tool call chains
                       → "Why did this specific request behave this way?"

Layer 3: LOGS       — structured audit events, immutable
                       → "Prove to the regulator that this is what happened."

The layers serve different audiences: - SRE / platform team: metrics and traces (p99 TTFT alerts, error rate dashboards) - Product / use-case owners: metrics (cost per use case, refusal rate trends) - CISO / compliance: logs (audit trail, PII flag events, policy refusals) - Regulator (RBI inspection): logs (immutable, signed, tamper-evident)

All three layers must be operational before go-live. The CISO will not sign PRR without seeing the audit log design.

Key metrics

Instrument these at the application layer (not inferred from Bedrock billing — that lags 24 hours and is too coarse):

Latency metrics:

Metric	Dimension	Alert threshold
`ttft_ms` (p50, p95, p99)	per use case, per model	p99 > 2× baseline for 5 min
`total_latency_ms` (p50, p95, p99)	per use case	p99 > SLA for 5 min
`streaming_first_token_ms`	per use case	p99 > 500ms

Volume and cost metrics:

Metric	Dimension	Alert threshold
`input_tokens_per_request` (p50, p99)	per use case	p99 > 2× baseline (prompt injection probe?)
`output_tokens_per_request` (p50, p99)	per use case	p99 > 3× baseline
`cache_hit_rate`	per use case	<60% for 15 min (cache cold start)
`estimated_cost_usd`	per use case, per business unit	>110% daily budget
`calls_per_minute`	per use case	>120% burst threshold

Quality and safety metrics:

Metric	Dimension	Alert threshold
`refusal_rate`	per use case	>5% (baseline: <1%)
`error_rate` (4xx, 5xx)	per use case	>5% over 60s
`retry_rate`	per use case	>10% (leading indicator)
`fallback_activation_rate`	per use case	>2% (leading indicator)
`pii_flag_rate`	per use case	Any spike — review immediately

Why refusal rate matters: A spike in refusal rate on the support agent use case means either (a) your prompt is triggering a guardrail inadvertently — a prompt engineering problem, or (b) users are attempting to use the AI for off-topic or adversarial queries — a security signal. Both need investigation; only one needs a security response. You can't tell which without the metric.

Why cache hit rate matters: A drop in cache hit rate below 60% is usually a cache cold start (TTL expired under traffic surge) or a prompt change that invalidated the cache key. Both are cost and latency events. Alert fast.

Trace design for agentic workflows

Metrics tell you the rate; traces tell you the path. For agentic workflows where Claude makes multiple tool calls, a trace is essential for debugging.

Span model:

Trace: user_request_id = "req-abc-123"
│
├─ Span: intent_classifier (Haiku)
│   ├─ input_tokens: 450
│   ├─ output_tokens: 30
│   ├─ ttft_ms: 180
│   └─ result: { intent: "account_balance_query" }
│
├─ Span: tool_call: fetch_account_data
│   ├─ tool: "crm_account_lookup"
│   ├─ duration_ms: 95
│   └─ result: { success: true, records: 1 }
│
├─ Span: response_synthesizer (Sonnet)
│   ├─ input_tokens: 1,200
│   ├─ output_tokens: 280
│   ├─ ttft_ms: 420
│   ├─ cache_hit: true
│   └─ result: { response_length: 280, refusal: false }
│
└─ Span: safety_filter (post-processing)
    ├─ pii_detected: false
    └─ response_approved: true

Every span must carry: trace ID, span ID, parent span ID, use case, model, input/output tokens, cache status, latency, error status.

For BFSI: The trace must be queryable by user_id (for DPDPA data subject access requests — "show me all AI processing for this customer") and by request_id (for incident investigation). Store traces for 90 days minimum; compliance review traces for 7 years (RBI audit requirement).

Implementation: AWS X-Ray (native with Bedrock) + manual span instrumentation using the AWS SDK. If the customer uses Datadog, the Datadog APM agent can ingest X-Ray traces with the Datadog-AWS integration. Both are proven; X-Ray has lower cost for high-volume use cases.

The cost observability dashboard

One dashboard, one question: "How much is this costing and where?"

Dashboard panels:

Daily cost by use case (bar chart, last 30 days) — shows trends, budget vs. actual
Cost per 1,000 calls by use case (line chart) — shows unit economics, catches prompt bloat
Cache hit rate by use case (gauge, current day) — real-time efficiency signal
Monthly cost projection (number, based on current 7-day run rate) — CFO-facing
Cost by business unit (pie chart, MTD) — chargeback visibility
Top 10 most expensive request templates (table) — identifies optimization opportunities

Who gets this dashboard: - Platform team: daily - Use-case product owners: weekly - CFO office: monthly summary (auto-generated PDF from the dashboard) - CISO: on request (ad-hoc queries tied to specific incident investigation)

Alert design

Alerts that page (immediate action required):

Alert	Condition	Page channel
Error rate high	Error rate >25% for 2 min, any use case	PagerDuty (P1)
Latency SLA breach	p99 TTFT >2× SLA for 5 min	PagerDuty (P2)
PII flag spike	PII flag rate >5% in 5 min	CISO + PagerDuty (P1)
Cost anomaly	Projected daily cost >150% budget	Slack + PagerDuty (P2)
Auth failure	Any 401	PagerDuty (P1) — potential key compromise
Refusal spike	Refusal rate >10% in 5 min	Slack (P3) — investigate next business day

Alerts that go to weekly review (no page):

Cache hit rate trending down over 7 days
Output token p99 growing over 7 days (prompt bloat)
Retry rate increasing over 7 days
Cost per use case above quarterly trend

Alert fatigue rule: If an alert fires more than 3 times per week for 3 consecutive weeks without generating a fix, either fix the root cause or raise the threshold. Alerts that are routinely acknowledged and ignored are worse than no alerts — they train the team to ignore the paging channel.

Integration with existing BFSI observability stack

The customer's existing stack: CloudWatch (AWS-native), Datadog (enterprise APM + SIEM), Splunk (security log aggregation), Grafana (dashboards).

Integration map:

Observability layer	Where it lives	How Claude data gets there
Metrics	CloudWatch + Datadog	Custom metrics via CloudWatch PutMetricData API, forwarded to Datadog via AWS integration
Traces	AWS X-Ray + Datadog APM	X-Ray instrumentation in Lambda/ECS; Datadog agent picks up via X-Ray integration
Logs	CloudWatch Logs → Splunk	Structured JSON logs to CloudWatch; Splunk HTTP Event Collector for compliance logs
Cost dashboard	Grafana	Grafana CloudWatch datasource, custom cost metrics
Alerts	PagerDuty via Datadog	Datadog monitor → PagerDuty integration (existing)

The CISO's audit log requirement:

Compliance-grade logs must be: - Structured JSON (not free-text) - Immutable (CloudWatch Logs with object lock, or S3 with WORM policy) - Tamper-evident (hash chaining or AWS CloudTrail integration) - Accessible for 7 years (RBI audit window) - Queryable within 48 hours of regulator request

Minimum fields for every Claude audit log event:

{
  "timestamp": "ISO8601",
  "request_id": "uuid",
  "use_case": "string",
  "user_id": "hashed or anonymized",
  "model": "string",
  "input_token_count": "integer",
  "output_token_count": "integer",
  "refusal": "boolean",
  "pii_flagged": "boolean",
  "fallback_activated": "boolean",
  "response_approved": "boolean"
}

Do NOT log: prompt text, response text, user PII (name, account number, phone). Log token counts and metadata only. The actual prompt/response may be stored in a separate encrypted store with a stricter access policy if the use case requires post-hoc review (e.g., compliance review outputs).

Cross-references

Note 01 — Cost Architecture — cost metrics design
Note 03 — Reliability Patterns — error rate and retry alerting
Note 05 — Production Readiness Review — "monitoring live" PRR gate
Module 06 — Evaluation Architecture
Module 09 — Security, Privacy, Governance

Strong-Hire bar for this file

Structures observability into three layers (metrics, traces, logs) and maps each to a stakeholder
Names the specific metrics that matter — not a generic list — with alert thresholds and rationale
Designs a trace schema for agentic workflows with all required fields
Specifies audit log fields and explicitly calls out what NOT to log (prompt text, PII)
Integrates with the specific BFSI stack (CloudWatch, Datadog, Splunk) with concrete plumbing

The interview-room answer (60 sec)

"Observability for Claude has three layers. Metrics tell you if the system is healthy right now — key ones are TTFT p50/p99, cache hit rate, error rate, refusal rate, and cost per use case. Traces tell you why a specific request behaved the way it did — for agentic workflows, every tool call gets its own span. Logs satisfy the CISO and regulator — structured, immutable, 7-year retention per RBI requirements. The metric I'd alert on first that most teams miss: refusal rate. A spike in refusals is either a prompt engineering bug or a security signal — you can't tell which without the metric, and both need a different response. For this customer's stack, metrics route to CloudWatch plus Datadog, traces via X-Ray, audit logs to Splunk via HTTP Event Collector. The one thing I'd be explicit about: do not log prompt text or PII in the audit log — log token counts and metadata. The actual content goes to a separate encrypted store with a tighter access policy."