Skip to content

04 · Note 01 — Claude API Surface

Status: Outline. Body fills in Week 2. Voice: principal-level, BFSI-threaded, Apic-calibrated.

What this file is. The architect's working knowledge of the Claude API surface — endpoints, SDK shape, request/response model, streaming, system prompts, message format. The depth needed to whiteboard a Claude integration cold.

What this file is NOT. A copy of Apic's public API reference. Not an SDK tutorial. Not exhaustive — calibrated to what you'll actually be asked.


What the architect must know cold

  • The shape of a request (system + messages + parameters).
  • The shape of a response (content blocks, stop reason, usage).
  • Streaming events and when streaming is the right default.
  • Where prompt caching, tool use, and structured outputs sit relative to the basic surface.
  • Rate limits and how they map to capacity planning.
  • The official SDKs (Python, TypeScript) and when to call the REST endpoint directly.

Anatomy of a request

Top-level fields

  • model — the canonical model ID. Current: claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5-20251001.
  • max_tokens — hard ceiling on output. Architectural decision, not a default.
  • messages — list of role-tagged turns: user / assistant.
  • system — system prompt(s); supports cache breakpoints.
  • temperature — typically 0–1; load-bearing for evals (set explicitly, don't rely on defaults).
  • tools / tool_choice — for tool use. → Note 04.
  • stream — boolean; streaming changes the response shape (events, not a single body).

Message format

  • Each message has role and content.
  • content can be a string or a list of blocks (text, image, tool_use, tool_result).
  • Tool-use loops carry tool_use blocks from assistant + tool_result blocks back from user.

System prompt as architecture, not afterthought

  • The system prompt is not a tip-jar for instructions.
  • It carries: persona, behavior boundaries, required output shape, tool-use guidance, the eval-defining constraints.
  • For caching, the system prompt is typically the first cache breakpoint.

Anatomy of a response

Top-level fields

  • id — the unique response id.
  • model — echoed back; useful for logging.
  • content — list of content blocks. Most common shapes: a single text block; or text + tool_use blocks.
  • stop_reasonend_turn, max_tokens, tool_use, stop_sequence. Each maps to different downstream handling.
  • usageinput_tokens, output_tokens, plus cache-related counters when caching is on.

Stop-reason handling

  • end_turn — normal.
  • max_tokens — ceiling hit; output may be truncated. Architectural concern (don't silently treat as success).
  • tool_use — handle the tool-use loop. → Note 04.
  • stop_sequence — your custom stop hit.

Why this matters for evals

The eval framework needs to score responses partitioned by stop_reason. A max_tokens truncation is a different failure mode than a hallucinated answer.


Streaming

Event types

  • message_start — the message envelope arrives.
  • content_block_start / content_block_delta / content_block_stop — for each content block as it streams.
  • message_delta — incremental updates to top-level fields.
  • message_stop — the response is done.

When to stream

  • User-facing chat-shaped UX. Default on.
  • Customer-support agent assist (latency-perceived UX dominates actual latency).

When not to stream

  • Programmatic consumers downstream of the API call (no UX to perceive).
  • Tool-use loops where the next step depends on the full response — streaming buys nothing.
  • Eval harnesses (deterministic scoring, batch shapes).

SDK shape (Python, TypeScript)

Python — apic package

  • client = Apic(); client.messages.create(...).
  • Streaming: client.messages.stream(...) returns an event iterator.
  • Tool use: tool definitions are dicts with name + description + input_schema (JSON Schema).
  • Async client available (AsyncApic) — non-trivial for high-concurrency workloads.

TypeScript — @apic-ai/sdk

  • Mirrors Python shape. Important if the customer's stack is JS/TS-first.

When to call REST directly

  • Edge functions where the SDK weight matters.
  • Custom infrastructure that doesn't fit the SDK's retry/timeout assumptions.
  • Almost never in BFSI — SDK is the right default.

Rate limits and capacity planning

  • Limits are tier-based: per-minute requests + per-minute input tokens + per-minute output tokens.
  • For Indian BFSI customers, capacity planning starts from business volume (tickets/day, documents/day) and translates backwards through token estimates per call. → Module 10 Drill 01.
  • Rate-limit hits should be treated as architectural events (queueing, fallback to smaller model), not as exceptions to retry-and-hope.

What is not in this note (intentional split)

  • Model selection between Opus/Sonnet/Haiku → Note 02.
  • Prompt caching mechanics → Note 03.
  • Tool use + structured outputs → Note 04.
  • Bedrock vs Vertex vs first-party API → Note 05.

Cross-references

Strong-Hire bar for this file

  • Whiteboard the request/response cold without notes.
  • Stop-reason handling discipline articulated, not implicit.
  • Streaming-vs-non-streaming decision rule reflex.
  • Capacity planning grounded in business volume, not token theatrics.