04 · Note 01 — Claude API Surface

Status: Outline. Body fills in Week 2. Voice: principal-level, BFSI-threaded, Apic-calibrated.

What this file is. The architect's working knowledge of the Claude API surface — endpoints, SDK shape, request/response model, streaming, system prompts, message format. The depth needed to whiteboard a Claude integration cold.

What this file is NOT. A copy of Apic's public API reference. Not an SDK tutorial. Not exhaustive — calibrated to what you'll actually be asked.

What the architect must know cold

The shape of a request (system + messages + parameters).
The shape of a response (content blocks, stop reason, usage).
Streaming events and when streaming is the right default.
Where prompt caching, tool use, and structured outputs sit relative to the basic surface.
Rate limits and how they map to capacity planning.
The official SDKs (Python, TypeScript) and when to call the REST endpoint directly.

Anatomy of a request

Top-level fields

model — the canonical model ID. Current: claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001.
max_tokens — hard ceiling on output. Architectural decision, not a default.
messages — list of role-tagged turns: user / assistant.
system — system prompt(s); supports cache breakpoints.
temperature — typically 0–1; load-bearing for evals (set explicitly, don't rely on defaults).
tools / tool_choice — for tool use. → Note 04.
stream — boolean; streaming changes the response shape (events, not a single body).

Message format

Each message has role and content.
content can be a string or a list of blocks (text, image, tool_use, tool_result).
Tool-use loops carry tool_use blocks from assistant + tool_result blocks back from user.

System prompt as architecture, not afterthought

The system prompt is not a tip-jar for instructions.
It carries: persona, behavior boundaries, required output shape, tool-use guidance, the eval-defining constraints.
For caching, the system prompt is typically the first cache breakpoint.

Anatomy of a response

Top-level fields

id — the unique response id.
model — echoed back; useful for logging.
content — list of content blocks. Most common shapes: a single text block; or text + tool_use blocks.
stop_reason — end_turn, max_tokens, tool_use, stop_sequence. Each maps to different downstream handling.
usage — input_tokens, output_tokens, plus cache-related counters when caching is on.

Stop-reason handling

end_turn — normal.
max_tokens — ceiling hit; output may be truncated. Architectural concern (don't silently treat as success).
tool_use — handle the tool-use loop. → Note 04.
stop_sequence — your custom stop hit.

Why this matters for evals

The eval framework needs to score responses partitioned by stop_reason. A max_tokens truncation is a different failure mode than a hallucinated answer.

Streaming

Event types

message_start — the message envelope arrives.
content_block_start / content_block_delta / content_block_stop — for each content block as it streams.
message_delta — incremental updates to top-level fields.
message_stop — the response is done.

When to stream

User-facing chat-shaped UX. Default on.
Customer-support agent assist (latency-perceived UX dominates actual latency).

When not to stream

Programmatic consumers downstream of the API call (no UX to perceive).
Tool-use loops where the next step depends on the full response — streaming buys nothing.
Eval harnesses (deterministic scoring, batch shapes).

SDK shape (Python, TypeScript)

Python — `apic` package

client = Apic(); client.messages.create(...).
Streaming: client.messages.stream(...) returns an event iterator.
Tool use: tool definitions are dicts with name + description + input_schema (JSON Schema).
Async client available (AsyncApic) — non-trivial for high-concurrency workloads.

TypeScript — `@apic-ai/sdk`

Mirrors Python shape. Important if the customer's stack is JS/TS-first.

When to call REST directly

Edge functions where the SDK weight matters.
Custom infrastructure that doesn't fit the SDK's retry/timeout assumptions.
Almost never in BFSI — SDK is the right default.

Rate limits and capacity planning

Limits are tier-based: per-minute requests + per-minute input tokens + per-minute output tokens.
For Indian BFSI customers, capacity planning starts from business volume (tickets/day, documents/day) and translates backwards through token estimates per call. → Module 10 Drill 01.
Rate-limit hits should be treated as architectural events (queueing, fallback to smaller model), not as exceptions to retry-and-hope.

What is not in this note (intentional split)

Model selection between Opus/Sonnet/Haiku → Note 02.
Prompt caching mechanics → Note 03.
Tool use + structured outputs → Note 04.
Bedrock vs Vertex vs first-party API → Note 05.

Cross-references

Predecessors: Module 03.
Successors: this is the entry note for Module 04. Notes 02–05 build on it.
CodeLab: 01 — First Claude Call.
Drill: 02 — Whiteboard Prompt Cache ROI.

Strong-Hire bar for this file

Whiteboard the request/response cold without notes.
Stop-reason handling discipline articulated, not implicit.
Streaming-vs-non-streaming decision rule reflex.
Capacity planning grounded in business volume, not token theatrics.

04 · Note 01 — Claude API Surface

What the architect must know cold

Anatomy of a request

Top-level fields

Message format

System prompt as architecture, not afterthought

Anatomy of a response

Top-level fields

Stop-reason handling

Why this matters for evals

Streaming

Event types

When to stream

When not to stream

SDK shape (Python, TypeScript)

Python — apic package

TypeScript — @apic-ai/sdk

When to call REST directly

Rate limits and capacity planning

What is not in this note (intentional split)

Cross-references

Strong-Hire bar for this file

Python — `apic` package

TypeScript — `@apic-ai/sdk`