Four Ways to Call a Model, and Which One You Actually Want

curl, SDK, streaming, async. Four ways to call an LLM, one shape that beats the rest in production, and the ones everyone reaches for at the wrong moment.

You can call an LLM four ways. All four work. Three of them are wrong for what you're probably building.

Last post we installed the mental model: an LLM API call is a function, f(prompt, settings) -> text. Today we get concrete about how you dial that function. The choice sounds like plumbing, but it decides three things that matter more than any prompt you'll write: how fast your users feel the product, how long you're holding server resources, and whether your application's first failure in production is a mystery or a debuggable one.

Let's go through the four in order of increasing sophistication, and then I'll tell you which one I reach for by default.


1 · raw curl

The simplest possible way to call a model. No SDK, no dependencies, just HTTP.

curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 200,
    "messages": [{"role": "user", "content": "Name a color."}]
  }'

Paste that into a terminal with your API key exported, and you will see a JSON response come back. No magic. Just a POST request, a JSON body, and an HTTP response. Every LLM provider exposes a shape like this, and every SDK in every language is ultimately doing exactly this under the hood.

So why have an SDK at all? Because the moment you want to do any of these:

  • Retry on 429 (rate limit) or 503 (transient error)
  • Parse the response into typed objects
  • Stream the output as it generates
  • Handle streaming errors mid-stream
  • Attach a custom HTTP client with connection pooling

…you are going to end up writing all of that yourself, and an SDK has already written it for you and someone else has tested it. Raw curl is perfect for two things, and only two things: one-off debugging (curl the endpoint to sanity-check your API key works) and scripts that need zero dependencies (a CI health check, a bash cron, a bug-report repro). Anything beyond that, reach for the SDK.
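To make the first bullet concrete, here is a rough sketch of the retry loop you would end up hand-rolling around raw HTTP. The helper name call_with_retry, the send callable, and the backoff numbers are my own illustrative assumptions, not any SDK's actual policy.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {429, 503}  # the statuses named above; a real service may use others


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts:
            break
        # Exponential backoff (0.5s, 1s, 2s, ...) plus jitter so that many
        # clients retrying together do not all fire at the same instant.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay / 2))
    return status, body
```

Here send would wrap whatever performs the POST (urllib, requests, a subprocess running curl) and return the status code and body. That is about twenty lines for one bullet out of five, which is the argument for the SDK.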


2 · the synchronous SDK call

The SDK version of the same call is what you saw in post B1.1:

# pip install anthropic
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=200,
    messages=[{"role": "user", "content": "Name a color."}],
)

print(response.content[0].text)

A few things this is buying you over the curl version:

  • Retries. If the API returns a 429 or a 503, the SDK will back off and try again. The retry policy is configurable, and the default is usually sensible.
  • Typed responses. response.content[0].text is an attribute on a typed object, not a key lookup into raw JSON. The SDK has an object model (messages, content blocks, usage metadata), so your editor or type checker can flag a misspelled field before the code runs.
  • A connection pool. The Anthropic() client holds an underlying HTTP client that reuses connections across calls. If you create the client once and reuse it, you skip a fresh TCP and TLS handshake on every request.
  • Error classes. Instead of eyeballing HTTP status codes, you catch anthropic.RateLimitError, anthropic.BadRequestError, anthropic.APIConnectionError, and handle each meaningfully.

When to use it: scripts, notebooks, most backend code where the call is part of a synchronous request handler and you don't need to stream the output. Simple, blocks the calling thread for the duration of the call, gives you back a complete response object, done. If your LLM call takes 600ms and you're fine with your request handler blocking for 600ms, this is your default.

What it costs you: the call blocks. If your web framework is sync-by-default (Flask, or Django with ordinary synchronous views), the thread handling the request is tied up for the full duration of the call. If your average LLM call takes 2 seconds and you have 20 threads in your pool, you are topped out at 10 requests per second (20 threads ÷ 2 seconds per call). We'll come back to this.
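The ceiling is just workers divided by seconds per call. A tiny sketch of that arithmetic, assuming every request holds one worker for the entire LLM call and nothing else limits you (the function name is mine):

```python
def max_requests_per_second(workers: int, seconds_per_call: float) -> float:
    """Upper bound on throughput when each request occupies one worker for the whole call."""
    if seconds_per_call <= 0:
        raise ValueError("seconds_per_call must be positive")
    return workers / seconds_per_call


print(max_requests_per_second(20, 2.0))  # 10.0, the figure from the text
print(max_requests_per_second(20, 0.6))  # roughly 33.3 for a 600ms call
```

Real throughput will land below this bound, since workers also spend time on everything around the call.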


3 · streaming

Streaming is the move that makes LLM apps feel alive instead of frozen. Instead of waiting for the full response and then returning it, you subscribe to the tokens as the model produces them, and forward them to the user in real time.

# pip install anthropic
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=800,
    messages=[{"role": "user", "content": "Explain caching in 4 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    final = stream.get_final_message()
    print()
    print(f"\nusage: {final.usage.input_tokens} in, {final.usage.output_tokens} out")

Two things to notice in that snippet. First, the with client.messages.stream(...) block — the SDK exposes streaming as a context manager, and the stream.text_stream iterator yields chunks of text as they arrive. Second, stream.get_final_message() gives you the full, assembled message after the stream closes, complete with usage metadata. You get both the live experience and the full record.

Streaming changes the user's perception of latency more than any optimisation you can make on the backend. A 6-second response that starts rendering after 200ms feels vastly faster than a 4-second response that stays blank and then dumps all at once. The user's brain measures time to first token, not total time. If your product has a chat surface, or a long-form generation surface, or anything where the user is watching the output appear, you want streaming.
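If you want to put a number on that, measure time to first chunk separately from total time. A minimal sketch that works on any iterator of text chunks; fake_model is a stand-in I made up so it runs without an API key, and its timings are arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_chunk, total_time, full_text) for an iterable of text chunks."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # what the user perceives as "it started"
        parts.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(parts)


def fake_model(n_chunks: int = 5, gap: float = 0.02) -> Iterator[str]:
    """Hypothetical model stream: emits one chunk every `gap` seconds."""
    for i in range(n_chunks):
        time.sleep(gap)
        yield f"tok{i} "


ttft, total, text = measure_stream(fake_model())
print(f"first chunk after {ttft:.3f}s, finished after {total:.3f}s")
```

With the SDK snippet above you would pass stream.text_stream in place of fake_model().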

There's a second, subtler reason to stream: you find out about failures sooner. With a non-streaming call you get nothing back until generation finishes, so an error raised partway through only reaches you after you've already waited, say, 4 seconds. With streaming, the response opens quickly and a failure surfaces as soon as it happens.

When to use it: any UI where the user sees the output (chatbots, code editors, writing assistants, copy tools). Basically all of product LLM work.

What it costs you: your HTTP response handler has to be streaming-aware. If you're using FastAPI, you return a StreamingResponse. If you're using Next.js, you use the app/ router's streaming primitives or the Vercel AI SDK. Getting this wired through to the browser, especially past proxies, CDNs, and buffering middleware, is where most teams lose an afternoon the first time. If nginx sits in front of your app, set X-Accel-Buffering: no on your response, and check everything else that sits between you and the client.
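For the FastAPI case, the wiring might look roughly like this. It is a sketch, not a full endpoint: to_byte_chunks and streaming_response are names I chose, the media type is an assumption, and FastAPI is imported lazily so the generator can be tried without it installed.

```python
from typing import Iterable, Iterator

# X-Accel-Buffering is honoured by nginx; other proxies and CDNs have their own settings.
STREAM_HEADERS = {"X-Accel-Buffering": "no"}


def to_byte_chunks(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Encode text chunks lazily so each one can be sent as soon as it arrives."""
    for chunk in text_chunks:
        if chunk:  # skip empty chunks
            yield chunk.encode("utf-8")


def streaming_response(text_chunks: Iterable[str]):
    """Wrap a chunk iterator in a FastAPI StreamingResponse (needs `pip install fastapi`)."""
    from fastapi.responses import StreamingResponse  # optional dependency, imported on use

    return StreamingResponse(
        to_byte_chunks(text_chunks),
        media_type="text/plain; charset=utf-8",
        headers=STREAM_HEADERS,
    )
```

Inside a route handler you would return streaming_response(...) with the model's text iterator as the argument. The important part is that nothing in the chain collects the chunks into a list before sending.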


4 · async, concurrent, and batch

The fourth way is what you reach for when you aren't making one call — you're making many, and you want them to go at once instead of one at a time.

# pip install anthropic
import asyncio
import os
from anthropic import AsyncAnthropic

client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def ask(prompt: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

async def main():
    prompts = [
        "Summarise: the sky is blue and the grass is green.",
        "Summarise: the ocean is deep and the moon is bright.",
        "Summarise: bread is warm and tea is hot.",
    ]
    results = await asyncio.gather(*(ask(p) for p in prompts))
    # gather returns results in the same order as the prompts
    for result in results:
        print(f"- {result}")

asyncio.run(main())

The AsyncAnthropic client is the async variant. Same API shape, same method names, but await where you'd otherwise block. With asyncio.gather, three calls go out concurrently. The wall-clock time is roughly the time of the slowest single call, not the sum. If each call takes 2 seconds and you run three in parallel, the whole thing finishes in about 2 seconds, not 6.
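You can see the "slowest call, not the sum" behaviour without spending tokens. In this sketch fake_call is a made-up stand-in that just sleeps; the durations are arbitrary.

```python
import asyncio
import time


async def fake_call(name: str, seconds: float) -> str:
    """Stand-in for an LLM call: waits without blocking the event loop."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_concurrently():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_call("a", 0.10),
        fake_call("b", 0.20),
        fake_call("c", 0.15),
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(run_concurrently())
print(results)
print(f"elapsed: {elapsed:.2f}s (run one after another, the calls would take 0.45s)")
```

The elapsed time should sit close to 0.20 seconds, the duration of the longest call.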

This is the shape you want for:

  • A single web request that needs to make several LLM calls to produce one answer (e.g. fan out to three sub-queries, gather, synthesise).
  • Batch processing (summarise 500 support tickets overnight).
  • Any service handling many concurrent user requests where each request triggers LLM work, because an async server can hold thousands of in-flight LLM calls on a single process without tying up threads per-call.

When to use it: any backend that handles many concurrent LLM calls, or any single request that fans out. Most production LLM services want their inner loop to be async.

What it costs you: you need an async-compatible web framework (FastAPI, Starlette, Hono, Next.js app router). You need to mind rate limits — if you gather 50 calls at once and hit the provider's per-minute cap, you'll get a wall of 429s all at the same instant. Most SDKs handle this with built-in retries, but if you're running heavy batch jobs you should add a semaphore and cap your own concurrency.

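# reuses asyncio and client from the snippet above
sem = asyncio.Semaphore(10)  # never more than 10 in-flight

async def ask(prompt: str) -> str:
    async with sem:  # waits here while 10 calls are already running
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text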

That one line saves you from being your own worst enemy at scale.
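If you want to convince yourself the cap holds, count the in-flight calls. A self-contained sketch with a sleep standing in for the network call; LIMIT, the counters, and limited_call are illustrative names of mine.

```python
import asyncio

LIMIT = 3
in_flight = 0
peak = 0


async def limited_call(sem: asyncio.Semaphore, i: int) -> int:
    global in_flight, peak
    async with sem:
        in_flight += 1
        peak = max(peak, in_flight)  # record the highest concurrency seen
        await asyncio.sleep(0.01)    # stand-in for the network call
        in_flight -= 1
        return i


async def main() -> list:
    sem = asyncio.Semaphore(LIMIT)  # created inside the running event loop
    return await asyncio.gather(*(limited_call(sem, i) for i in range(20)))


results = asyncio.run(main())
print(f"completed {len(results)} calls, peak in-flight: {peak}")
```

All 20 calls are scheduled at once, but the peak never exceeds LIMIT.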


So which one wins?

My actual default in any production service: the async streaming SDK. Every call in my backend goes through async with client.messages.stream(...) on an AsyncAnthropic client, even ones that aren't going to be rendered live, because:

  • It forces me to write async code from day one. No painful rewrites later.
  • Streaming gives me early error signals — if the provider 500s, I find out in 50ms, not 3 seconds.
  • The "collected full message" path is still available — I can await stream.get_final_message() and just use the full response if the caller doesn't care about tokens arriving live.
  • It costs nothing extra. The token cost is identical. The latency is identical. The code is a little more complex, but the complexity is a one-time price I pay up front and forget.

The sync SDK is fine for scripts and notebooks. Raw curl is fine for debugging. Non-streaming sync is the path I see junior teams start on, then painfully migrate away from three months in when they realise every "fast" call is blocking a worker thread.

If you start async+streaming, you don't have to do the migration. You don't have to have the migration meeting. You don't have to write the migration PR. Start as you mean to go on.
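For reference, this is roughly what that default could look like as one helper. It is a sketch under assumptions: stream_completion and on_text are my names, and client is expected to be an AsyncAnthropic instance created once elsewhere, which is why nothing here imports the SDK.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def stream_completion(
    client: Any,
    prompt: str,
    on_text: Optional[Callable[[str], Awaitable[None]]] = None,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 800,
) -> Any:
    """Stream one completion, forward chunks if a callback is given, return the final message."""
    async with client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            if on_text is not None:
                await on_text(text)  # e.g. push the chunk to the browser
        # Assembled message with usage metadata, available once the stream has ended.
        return await stream.get_final_message()


# Usage, assuming `pip install anthropic` and ANTHROPIC_API_KEY in the environment:
#   from anthropic import AsyncAnthropic
#   message = asyncio.run(stream_completion(AsyncAnthropic(), "Name a color."))
#   print(message.content[0].text)
```

Callers that want live output pass on_text; callers that only want the finished message ignore it. Both go through the same code path.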


Admit what breaks

Every post ends here. Real failure modes from real production.

  • Streams that arrive all at once. A proxy, CDN, or load balancer between your server and the user buffers the response until it is complete, defeating streaming. The user sees nothing for 4 seconds, then everything at once. Fix: set X-Accel-Buffering: no, disable proxy buffering in nginx, use streaming-native deploy targets (Vercel Edge, Cloudflare Workers, bare EC2), and test the end-to-end stream in staging, not just the server side.
  • 429 storms under async gather. You gather 200 calls at once, hit the per-minute rate limit, get 200 simultaneous 429s, SDK retries fire all together, you hit the limit again. Fix: semaphore to cap concurrency, or use the provider's batch API if you're doing overnight processing.
  • Stream errors mid-flight that your code silently swallows. A stream can start successfully and then fail halfway through with an error event. If you only handle exceptions at stream-open time, the mid-stream failure either escapes unhandled or gets logged somewhere while the user receives a truncated result that looks complete. Fix: wrap the iteration itself in a try/except, and confirm the stream produced a final message with an expected stop_reason before trusting the collected output.
  • Forgetting that async is viral. Once one function in your code is async, everything that calls it must be async too, or you need an asyncio.run bridge. Half-converted codebases where sync code calls async code are where concurrency bugs breed. Commit to async all the way down or stay fully sync — don't mix.
  • Creating one client per call. client = Anthropic() inside a hot function. You're rebuilding the HTTP pool on every request. Create the client once at module scope (or DI container) and reuse it.
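The mid-stream failure is the easiest of these to reproduce locally. A minimal sketch of wrapping the iteration itself, so a truncated result is marked as truncated; StreamResult, collect_stream, and flaky_stream are hypothetical names, and in real code you would catch your SDK's specific error classes rather than Exception.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamResult:
    text: str
    complete: bool
    error: Optional[Exception] = None


def collect_stream(chunks: Iterable[str]) -> StreamResult:
    """Consume a chunk iterator, recording whether it finished or failed partway."""
    parts = []
    try:
        for chunk in chunks:
            parts.append(chunk)
    except Exception as exc:  # narrow this to your SDK's error classes in real code
        return StreamResult("".join(parts), complete=False, error=exc)
    return StreamResult("".join(parts), complete=True)


def flaky_stream():
    """Hypothetical stream that fails after two chunks."""
    yield "The answer "
    yield "is "
    raise ConnectionError("stream dropped")


result = collect_stream(flaky_stream())
if not result.complete:
    print(f"truncated after {len(result.text)} chars: {result.error}")
```

The caller has to check complete before using text, which is exactly the check that gets skipped when exceptions are only handled at stream-open time.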

What just changed in your code

  • Use the async streaming SDK as your default in any service code. Scripts and notebooks can use sync. Everything else, async+stream.
  • Never block a web worker on a sync LLM call. Your thread pool will thank you.
  • Instantiate the client once and reuse it. One per process, not one per call.
  • Cap your concurrency with a semaphore whenever you gather more than a handful of calls.
  • Always test streaming in staging, not just locally, because the buffering landmines live in the network, not in your code.

Next post, we finally solve the thing that's bitten every LLM app I've ever shipped: getting structured output back from the model instead of a string you have to parse with a prayer. JSON mode, schema enforcement, the right way to wire Zod/Pydantic in, and why "please respond in JSON" in the prompt is not a plan.



Cover photo via Unsplash. This post is part of the AI for Builders series.
