Observability for LLM Apps: Minimum Viable Telemetry

Every production LLM app I've worked on that was struggling had the same root cause: the team couldn't see what the model was actually doing. Users reported wrong answers. The team stared at the prompt. The team asked the PM to reproduce. The team wrote a Slack thread. The team did not have a working trace of the failing request they could look at and think about.

If you can't answer, in under two minutes, "what exact prompt did the model see for user X's request at 3:14pm yesterday, and what did the model return?" — you have no observability. Every optimisation in this course is less effective when you're working blind.

This post is the minimum viable observability for an LLM feature. Not a vendor pitch. Not a 15-layer enterprise observability platform. The smallest set of logs, traces, and metrics that lets you debug in two minutes and improve in one cycle. You can build it in an afternoon, and it will outperform most "AI observability" products on the tiny fraction of functionality that matters.

The three-layer telemetry model

Think of LLM observability in three layers, each answering a different kind of question:

Per-call logs tell you what happened on one specific call. The full prompt the model saw, the raw output it produced, the tool calls, the settings.
Traces tell you why it happened — the whole chain of calls that made up a single user request. For an agent, that's 1 to 20 LLM calls plus tool dispatches; you want all of them linked by a trace_id.
Metrics tell you whether things are getting worse over time — aggregate latency, cost, error rate, cache hit rate, eval pass rate by day. No single request matters; the trend does.

You need all three. Per-call logs alone don't tell you about regressions. Metrics alone don't tell you why any specific request failed. Traces without logs are missing the actual content. Stack them; budget for all three.

Layer 1: per-call logs

The single most important thing in LLM observability is logging the full request and response for every LLM call. Not a sanitised summary. Not "call succeeded." The actual system prompt, user message, tool definitions, full response including any tool-use blocks, stop reason, and token usage.

A minimum schema:

# pip install pydantic
from datetime import datetime

from pydantic import BaseModel

class LLMCallLog(BaseModel):
    # Identity
    trace_id: str          # links this call to the user request
    call_id: str           # unique per call
    user_id: str           # which user (hashed if you need it)
    feature: str           # "support_bot", "code_assistant", etc
    prompt_version: str    # from B2.2 — which prompt variant

    # The call
    model: str
    temperature: float
    max_tokens: int
    system: str            # full text, not truncated
    messages: list[dict]   # full conversation history
    tools: list[dict] | None

    # The response
    output: list[dict]     # full response content blocks
    stop_reason: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cost_usd: float
    latency_ms: int
    ttft_ms: int | None    # time to first token if streaming

    # Metadata
    timestamp: datetime
    success: bool
    error: str | None

Log this object for every LLM call your app makes. Where to put it depends on your infra:

Small scale (under 1M calls/day): dump it to Postgres with a JSONB column and good indexes on trace_id, user_id, and timestamp. Query with SQL. Done.
Medium scale (1M-100M/day): dump it to a log aggregator — Datadog, CloudWatch Logs, Axiom, ClickHouse. Keep the full object; let the aggregator handle retention.
Large scale: sample. Log 100% of errors, 100% of slow calls, and 1-10% of normal traffic. Full logging is expensive at volume; sampling is fine as long as you catch the edge cases.
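A minimal sketch of that sampling rule, under some assumptions of mine: the 5-second "slow" threshold, the 5% rate, and the name `should_log` are all hypothetical. It also samples by trace_id rather than per call, which is a design choice not spelled out above, so that a sampled request keeps all of its calls.

```python
import hashlib

SLOW_CALL_MS = 5_000        # assumed threshold for a "slow" call
NORMAL_SAMPLE_RATE = 0.05   # assumed: keep 5% of ordinary traffic


def trace_bucket(trace_id: str) -> float:
    """Map a trace_id to a stable number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_log(trace_id: str, success: bool, latency_ms: int) -> bool:
    """Decide whether to persist the full log record for one call."""
    if not success:
        return True   # keep every error
    if latency_ms >= SLOW_CALL_MS:
        return True   # keep every slow call
    # Same trace_id always lands in the same bucket, so a sampled
    # request keeps all of its calls instead of a random subset.
    return trace_bucket(trace_id) < NORMAL_SAMPLE_RATE
```

Hashing the trace_id instead of rolling a random number per call means the decision is repeatable: every service that sees the same trace_id reaches the same answer without coordinating.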

Two rules about this log:

Log the prompt as the model saw it, not the template source. If you render {{ user_name }} into "Alice," log "Alice." The point of the log is to reproduce what the model saw, not what your code looked like. If you log the template, the next person trying to reproduce the bug has to re-render it from memory.
Redact PII, but keep the shape. "User's credit card was 4111-1111-1111-1111" becomes "User's credit card was [CARD]." The model's interpretation of "there is a credit card mentioned" is preserved; the actual number is gone. Never log un-redacted PII, even for debugging: retained logs are a liability.
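Here is a rough sketch of a "keep the shape" redaction pass. The three regexes and the `redact` helper are illustrative assumptions, not a complete PII detector; a real pass needs a broader, tested rule set.

```python
import re

# Illustrative patterns only. Each match is swapped for a placeholder
# so the log still shows that something sensitive was mentioned.
PII_PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    ("[CARD]", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
]


def redact(text: str) -> str:
    """Replace each matched PII span with its placeholder."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(redact("User's credit card was 4111-1111-1111-1111"))
```

Run it on the rendered prompt and on the model output before either one reaches the log store.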

Layer 2: traces

One user request might make 5 LLM calls (classifier → retrieval embedding → answer generation → guardrail check → reranking). Logging each one individually is useful; linking them into a single trace is essential. Otherwise, when a wrong answer arrives, you can't tell whether the classifier was wrong, the retrieval was wrong, the answer model was wrong, or the guardrail mangled it.

The link is a trace_id. Your HTTP request handler generates one at the top. Every downstream LLM call includes it in the log. Every tool dispatch includes it. Every RAG retrieval includes it. At the end, a query for trace_id = X returns every operation that ran for that request, in order.

import uuid
from contextvars import ContextVar

current_trace_id: ContextVar[str | None] = ContextVar("trace_id", default=None)

def handle_request(user_message: str) -> str:
    trace_id = str(uuid.uuid4())
    current_trace_id.set(trace_id)

    # Every downstream call reads current_trace_id via the context var.
    classification = classify(user_message)
    docs = retrieve(user_message)
    answer = generate_answer(user_message, classification, docs)
    return answer

def classify(user_message: str) -> str:
    trace_id = current_trace_id.get()
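    # `client` is your provider SDK client; the call arguments are elided here.
    resp = client.messages.create(...)
    log_llm_call(LLMCallLog(
        trace_id=trace_id,
        feature="classifier",
        # ...remaining LLMCallLog fields elided
    ))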
    return resp.content[0].text

Python's contextvars (and similar patterns in TypeScript with AsyncLocalStorage) let you propagate the trace ID through async code without threading it through every function signature. This is the single most important ergonomic improvement to your logging — once you have it, logging is a one-line call from anywhere in the request path.
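To see the propagation in action, here is a small self-contained sketch with two requests running concurrently under asyncio. The names `log_step`, `sink`, and `main` are hypothetical stand-ins for your real logging path; no LLM call is made.

```python
import asyncio
import uuid
from contextvars import ContextVar
from typing import Optional

current_trace_id: ContextVar[Optional[str]] = ContextVar("trace_id", default=None)


async def log_step(step: str, sink: list) -> None:
    """Stand-in for an LLM call plus its log line."""
    await asyncio.sleep(0)  # yield so the two requests interleave
    sink.append((current_trace_id.get(), step))


async def handle_request(name: str, sink: list) -> str:
    trace_id = f"{name}-{uuid.uuid4()}"
    current_trace_id.set(trace_id)
    await log_step("classify", sink)
    await log_step("answer", sink)
    return trace_id


async def main():
    sink: list = []
    # gather() runs each coroutine as its own task, and each task works
    # on its own copy of the context, so the trace IDs never mix.
    ids = await asyncio.gather(
        handle_request("a", sink), handle_request("b", sink)
    )
    return ids, sink


ids, sink = asyncio.run(main())
for trace_id, step in sink:
    print(trace_id, step)
```

Neither `log_step` call takes a trace ID argument, yet every log line carries the ID of the request that produced it.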

For "real" observability, wire OpenTelemetry or a platform like LangSmith, Langfuse, Helicone, or Braintrust. These give you pre-built UI over the trace structure. They're nice to have, and most of them are inexpensive at startup scale. But they're not a substitute for the above; they're a UI on top of it. If your logging schema is wrong, the platform can't fix it. Get the logging schema right first, then decide whether a platform is worth the integration.

Layer 3: metrics

Metrics are the layer you look at every morning, not when there's an incident. Per-call logs are for incidents; metrics are for slow rot.

The minimum viable LLM metrics dashboard has six things:

Request volume — calls per minute, broken down by feature and model. Spikes tell you about traffic shifts. Drops tell you about outages.
Latency P50 and P99 — per feature, per model. Include TTFT if streaming. This is the user-feel number from B5.1.
Cost per day — broken down by feature and model. Gives you the economic signal. Alert on unexpected spikes.
Error rate — percentage of calls that returned an error or a bad stop_reason. This catches upstream provider issues and bugs in your own code.
Cache hit rate — if you have caches (B5.2), measure hit rate and cost-weighted hit rate. Helps you tune TTLs.
Eval pass rate over time — run your eval set (B2.5) against production prompts on a schedule and track the score. Drops mean something upstream changed.

Log metric events to whatever you already use — Datadog, Prometheus, CloudWatch, Grafana. You can derive all six from the per-call log table with a few SQL aggregations if you don't have a metrics store:

-- P99 latency by feature over the last 24 hours
SELECT
  feature,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency,
  COUNT(*) AS calls
FROM llm_call_logs
WHERE timestamp > NOW() - INTERVAL '1 day'
GROUP BY feature
ORDER BY p99_latency DESC;

-- Daily cost by feature
SELECT
  DATE(timestamp) AS day,
  feature,
  SUM(cost_usd) AS daily_cost,
  COUNT(*) AS call_count
FROM llm_call_logs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY day, feature
ORDER BY day DESC, daily_cost DESC;

You do not need a vendor to give you this dashboard. You need SQL and fifteen minutes.
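If the rows are already in application code (a notebook, a cron job), the same aggregation is a few lines of Python. This is a sketch, assuming each row is a dict with `feature`, `latency_ms`, and `success` keys from the log schema; `percentile` and `summarise` are hypothetical helper names.

```python
from collections import defaultdict


def percentile(values, fraction):
    """Linear-interpolation percentile, the same idea as SQL percentile_cont."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def summarise(rows):
    """Group log rows by feature and compute calls, error rate, P50 and P99."""
    by_feature = defaultdict(list)
    for row in rows:
        by_feature[row["feature"]].append(row)

    summary = {}
    for feature, calls in by_feature.items():
        latencies = [c["latency_ms"] for c in calls]
        failures = sum(1 for c in calls if not c["success"])
        summary[feature] = {
            "calls": len(calls),
            "error_rate": failures / len(calls),
            "p50_latency_ms": percentile(latencies, 0.50),
            "p99_latency_ms": percentile(latencies, 0.99),
        }
    return summary
```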

The two alerts every LLM app needs

Alerts are the "wake someone up" subset of metrics. Don't page on everything; you'll get alert fatigue and ignore real incidents. Start with exactly two alerts:

Alert 1: error rate spike. If the error rate for any feature goes above 5% over a 5-minute window, alert. Error here means the provider returned a non-2xx, or the request timed out, or a tool call threw an exception, or stop_reason was unexpected. Almost every real incident I've seen in LLM apps shows up here first.

Alert 2: cost spike. If daily cost exceeds 150% of the rolling 7-day average, alert. Almost every cost runaway (bad prompt, infinite agent loop, caching regression) shows up here before the bill arrives.
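Both rules fit in a few lines. A minimal sketch, assuming you can pull recent `(timestamp, success)` pairs and daily cost totals from the log table; the function names and the `min_calls` floor (there to stop a single failure on a quiet feature from paging anyone) are my assumptions, not part of the rules above.

```python
from datetime import datetime, timedelta

ERROR_RATE_THRESHOLD = 0.05          # 5%
ERROR_WINDOW = timedelta(minutes=5)
COST_SPIKE_RATIO = 1.5               # 150% of the rolling average


def error_rate_alert(calls, now, min_calls=20):
    """calls: iterable of (timestamp, success) pairs for one feature."""
    recent = [ok for ts, ok in calls if now - ERROR_WINDOW <= ts <= now]
    if len(recent) < min_calls:      # assumption: ignore tiny samples
        return False
    errors = sum(1 for ok in recent if not ok)
    return errors / len(recent) > ERROR_RATE_THRESHOLD


def cost_spike_alert(today_cost, previous_daily_costs):
    """previous_daily_costs: oldest-first daily totals, excluding today."""
    last_week = previous_daily_costs[-7:]
    if not last_week:
        return False
    baseline = sum(last_week) / len(last_week)
    return today_cost > COST_SPIKE_RATIO * baseline
```

Run both on a schedule (every minute for errors, hourly for cost is plenty) and route a True result to whatever pages your on-call.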

Add more alerts only after you understand which ones are genuinely actionable. Latency alerts, quality alerts, cache-hit alerts — all tempting, all high-false-positive-rate in practice. Start with the two above.

Debugging in two minutes

The test of whether your observability is sufficient is whether you can debug a reported wrong answer in under two minutes. Walk through the steps:

User reports: "At 3:14pm today I asked Support Bot about my invoice and it told me my plan was free when it's actually Premium."
Get the trace_id for that request. Either the bot UI shows it to the user as a "request ID" for bug reports (strongly recommended), or you find it by user_id and timestamp.
SELECT * FROM llm_call_logs WHERE trace_id = ? ORDER BY timestamp — you now have every LLM call that ran for that request, in order.
Look at the classifier call. Did it classify the query correctly? If not, the bug is in the classifier prompt or routing.
Look at the retrieval call. Did it return the correct document about the user's plan? If not, the bug is in RAG (chunking, retrieval, stale data — see B3.5).
Look at the answer-generation call. Did it have the right context? Did it say the right thing? If the context was right and the answer was wrong, the bug is in the answer prompt or the model.
You now know, with high confidence, which layer broke. You can reproduce, fix, and ship.
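The trace lookup in the steps above can be sketched end to end. To keep it runnable anywhere, this uses an in-memory SQLite table as a stand-in for the Postgres JSONB setup; the column layout, the `log_call` and `fetch_trace` helpers, and the sample rows are all made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE llm_call_logs (
        call_id   TEXT PRIMARY KEY,
        trace_id  TEXT NOT NULL,
        feature   TEXT NOT NULL,
        timestamp TEXT NOT NULL,   -- ISO 8601 UTC, so text order is time order
        payload   TEXT NOT NULL    -- full log record as JSON
    )
    """
)
conn.execute("CREATE INDEX idx_logs_trace ON llm_call_logs (trace_id, timestamp)")


def log_call(call_id, trace_id, feature, timestamp, payload):
    conn.execute(
        "INSERT INTO llm_call_logs VALUES (?, ?, ?, ?, ?)",
        (call_id, trace_id, feature, timestamp, json.dumps(payload)),
    )


def fetch_trace(trace_id):
    """Return every logged call for one request, oldest first."""
    rows = conn.execute(
        "SELECT feature, timestamp, payload FROM llm_call_logs "
        "WHERE trace_id = ? ORDER BY timestamp",
        (trace_id,),
    ).fetchall()
    return [{"feature": f, "timestamp": ts, **json.loads(p)} for f, ts, p in rows]


# Made-up rows, inserted out of order on purpose.
log_call("c3", "req-1", "answer", "2025-01-01T15:14:03Z", {"output": "Your plan is free."})
log_call("c1", "req-1", "classifier", "2025-01-01T15:14:01Z", {"output": "billing"})
log_call("c2", "req-1", "retrieval", "2025-01-01T15:14:02Z", {"output": "doc: plans overview"})
log_call("c4", "req-2", "classifier", "2025-01-01T15:20:00Z", {"output": "other"})

for call in fetch_trace("req-1"):
    print(call["timestamp"], call["feature"], call["output"])
```

Reading the three rows top to bottom is the whole debugging session: the first row whose output looks wrong is the layer that broke.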

Teams that can do this are teams that improve quickly. Teams that can't are teams that stall on "I can't reproduce it."

Admit what breaks

Storage costs scale with log volume. Full-prompt logging at millions of calls a day can cost real money in Postgres. Sample, archive, or use a log-oriented store for hot data.
PII in logs is a compliance landmine. Your support user asks a question containing their SSN; your log now has the SSN. Redact on the way in, not on the way out. Use a dedicated PII redaction pass before storage.
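Contextvars don't propagate across every boundary cleanly. asyncio.create_task copies the current context at the moment the task is created, so a trace ID set afterwards, or set inside the child task, is not visible on the other side. Plain thread pools (ThreadPoolExecutor, loop.run_in_executor) don't carry the context at all unless you pass it explicitly with contextvars.copy_context().run or use asyncio.to_thread. Broken context means broken traces. Test this with a deliberate async workload.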
Metrics can be wrong due to failing calls. If every failed call is logged as cost_usd = 0, your "average cost per call" understates reality. Log the actual cost for failed calls where you can, since a provider may still bill for part of a failed call.
Vendors lock you in. "AI observability platforms" are fine, but the data inside them is often hard to export. Write to your own store as the source of truth, even if you also send to a vendor for their UI.
Eval-in-prod is hard to get right. Running your eval set against the live prompts on a schedule sounds simple, but the prompts change, the evals drift, and the "pass rate" number can swing for reasons unrelated to quality. Treat it as a weak signal, not a hard gate.
Too many dashboards create none. Resist the urge to build 40 charts. The six metrics above are enough for most teams. Add more only when an incident teaches you what was missing.
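One boundary where the trace ID commonly goes missing is a plain thread pool. The sketch below shows the loss and one way to carry the context across by hand; the helper names are hypothetical, and the "lost" behaviour described in the comments is what standard CPython builds do.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

current_trace_id = contextvars.ContextVar("trace_id", default=None)


def read_trace_id():
    """Stand-in for a logging call that runs on a worker thread."""
    return current_trace_id.get()


def run_naive(pool):
    # The worker thread has its own context, so on standard CPython
    # builds this typically returns the default (None).
    return pool.submit(read_trace_id).result()


def run_with_context(pool):
    # Snapshot the caller's context and run the function inside it.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, read_trace_id).result()


current_trace_id.set("trace-123")
with ThreadPoolExecutor(max_workers=1) as pool:
    lost = run_naive(pool)
    kept = run_with_context(pool)

print("naive submit:", lost)
print("with copy_context:", kept)
```

A check like this belongs in your test suite: assert that every log line written during a request carries that request's trace_id, including the ones written from worker threads.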

What just changed in your code

Log the full prompt and full response for every LLM call. Store them somewhere queryable. Nothing in this course matters more than this.
Propagate a trace_id through the whole request path via contextvars/AsyncLocalStorage so every downstream call is linked.
Include prompt_version, feature, and user_id on every log entry so you can filter usefully.
Build six metrics: volume, P50/P99 latency, daily cost, error rate, cache hit rate, eval pass rate. Derive them from the log table if you don't have a metrics store.
Set exactly two alerts to start: error rate spike, cost spike. Add more only after you've learned which are actionable.
Show the user a request_id in bug-report flows so you can find their trace in under two minutes.
Redact PII on the way in, not on the way out. The log store is a risk surface.

Next post, B5.4, we close the "shipping" loop: guardrails. Input validation, output filtering, content safety, abuse patterns. The things that stop your product from being a liability. Less glamorous than capabilities work, more expensive if you skip them.


📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series.
