Multimodal in Practice: Images In, Structured Data Out
Multimodal demos look flashy. The real wins are quieter. The most underrated building block of 2026 is image-in-structured-data-out, and it changes what your backend can do.
Every multimodal AI demo you see on a conference stage has the same shape. A person holds up a picture of their fridge. The model says "you have eggs, milk, two bell peppers, and some leftover Thai food." The crowd applauds. The demo ends. Nobody ships the fridge-recognising app, because it wasn't a product — it was an illustration.
The actual value of multimodal in 2026 is much quieter and much bigger. It's not "describe this picture to me." It's image in, structured data out — using a vision-capable LLM as a universal parser for the messy real-world documents, photos, screenshots, diagrams, and whiteboards that sit on every team's backlog marked "we should really do something with these."
This is one of the most underrated building blocks of the year. If you've been treating multimodal as a niche feature for "photo apps," this post is a prompt to revisit. You probably already have three or four spots in your codebase where images or PDFs are a bottleneck, and multimodal is the tool that unblocks them.
The big shift: images are now parseable
Until very recently, handling images in software meant one of three things: build a custom computer-vision model (expensive, slow, needs a team), adopt a narrow OCR library (limited to text extraction, brittle on anything layout-heavy), or route to a third-party classification API (pay per call, locked into a vendor's categories). None of these scaled to "I have a random photo a user just uploaded and I want to turn it into something my code can use."
Vision-capable LLMs changed this. A multimodal model can take an image and produce any structured output you define. It's the same schema-enforced structured output you learned in B1.3, with an image as the input instead of a text prompt. You define the output shape; the model extracts whatever it sees in the image that matches that shape.
That is the whole idea. The model is a universal parser. Whatever you can describe in a schema, you can now extract from an image.
A concrete example: receipt extraction
The canonical boring-but-important use case. A user uploads a photo of a receipt. You want to get the store name, the date, the total, and a list of line items. Before 2023 this required an OCR library plus a custom parser plus a ton of edge-case handling. Now:
```python
# pip install anthropic pydantic
import base64
import mimetypes
import os
from typing import Literal

from anthropic import Anthropic
from pydantic import BaseModel

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


class LineItem(BaseModel):
    description: str
    quantity: float | None = None  # optional, so the model can omit it
    price: float


class Receipt(BaseModel):
    store_name: str
    date_iso: str
    currency: Literal["USD", "EUR", "GBP", "JPY", "INR", "OTHER"]
    subtotal: float | None = None  # optional: not every receipt prints one
    tax: float | None = None
    total: float
    items: list[LineItem]


def extract_receipt(image_path: str) -> Receipt:
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()
    # Guess the media type from the file extension; fall back to JPEG.
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        tools=[
            {
                "name": "return_receipt",
                "description": "Return the parsed receipt data.",
                "input_schema": Receipt.model_json_schema(),
            }
        ],
        # Force the model to answer through the tool, not free text.
        tool_choice={"type": "tool", "name": "return_receipt"},
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            "Extract the receipt data from this image. "
                            "If a field is unreadable, omit it rather than guessing."
                        ),
                    },
                ],
            }
        ],
    )
    # The model returned its answer via the return_receipt tool call.
    tool_use = next(b for b in response.content if b.type == "tool_use")
    # Validation raises if a required field (such as total) is missing.
    return Receipt.model_validate(tool_use.input)
```
That's it. About sixty lines of Python for an extraction pipeline that would have been a research project three years ago. Point it at a photo, get back a typed Receipt object. Feed the object into your database, your expense tracker, your budgeting app. No OCR library. No regex. No per-merchant parser. It copes with receipts from most countries, in most layouts and lighting conditions; the cases where it struggles are covered under "Admit what breaks."
You can do this right now, today, for roughly whatever a single LLM call costs per receipt. Most teams do not realise this is a solved problem. It is.
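Before a parsed receipt reaches your database, a cheap arithmetic check catches a good share of misreads. The sketch below is one way to do it, not part of the extractor above: `ParsedReceipt`, `ParsedItem`, and `consistency_issues` are hypothetical names, plain dataclasses stand in for the Pydantic models, and it assumes `price` is the line total rather than a unit price.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedItem:
    description: str
    price: float  # assumed to be the line total, not a unit price
    quantity: Optional[float] = None


@dataclass
class ParsedReceipt:
    total: float
    subtotal: Optional[float] = None
    tax: Optional[float] = None
    items: List[ParsedItem] = field(default_factory=list)


def consistency_issues(receipt: ParsedReceipt, tolerance: float = 0.02) -> List[str]:
    """Return human-readable problems; an empty list means the numbers add up."""
    issues = []
    if receipt.total <= 0:
        issues.append("total is not positive")
    # Only compare what was actually extracted; missing fields are not errors.
    if receipt.subtotal is not None and receipt.tax is not None:
        if abs(receipt.subtotal + receipt.tax - receipt.total) > tolerance:
            issues.append("subtotal + tax does not match total")
    if receipt.items and receipt.subtotal is not None:
        items_sum = sum(item.price for item in receipt.items)
        if abs(items_sum - receipt.subtotal) > tolerance:
            issues.append("line items do not sum to subtotal")
    return issues
```

A receipt with any issues is a natural candidate for human review instead of a silent insert. The tolerance is a guess; discounts, tips, and rounding rules vary by merchant and will need their own handling.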
The categories where image-in-structured-data-out wins
Receipt extraction is one obvious example. The pattern generalises to an enormous space of tasks. Categories I see in practice:
1 · Document and form parsing
PDFs of contracts, forms, policies, reports, tables. PDFs where OCR is a nightmare because layout matters, columns shift, handwriting interrupts, or the underlying structure is baroque. Vision LLMs read them end-to-end.
Concrete wins: parse a supplier invoice into line items, extract clauses from a legal PDF into a clause database, pull tables out of financial reports, read scanned historical documents into structured records, translate a Japanese shipping manifest.
2 · Screenshot and UI understanding
A user sends you a screenshot of a bug. A QA tool needs to classify a UI state. A testing framework needs to find a button that isn't in the DOM. A screen-reader-like assistant needs to describe an app.
Concrete wins: bug-report triage ("which feature is this screenshot of?"), automated UI testing by visual matching, accessibility analysis, in-app help that reads what the user is looking at.
3 · Whiteboard and diagram digitisation
Your team holds a design meeting, someone takes a photo of the whiteboard, and the photo dies in a chat history. Vision LLMs turn the photo into a structured artifact — a list of boxes, connections, labels, or even a reconstructed diagram in a machine-readable format.
Concrete wins: meeting notes from whiteboard photos, converting hand-drawn diagrams into Mermaid or PlantUML, reading architecture sketches into deployment configs.
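To make the whiteboard case concrete: if the schema you hand the model is "a list of boxes and a list of arrows," turning the result into Mermaid is a few lines of string building. This is a minimal sketch under that assumption. `Box`, `Arrow`, and `to_mermaid` are hypothetical names, and it assumes box ids are simple alphanumeric strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    id: str     # assumed to be a short alphanumeric identifier
    label: str  # the text the model read inside the box


@dataclass
class Arrow:
    source: str
    target: str
    label: str = ""


def to_mermaid(boxes: List[Box], arrows: List[Arrow]) -> str:
    """Render extracted boxes and arrows as a Mermaid flowchart."""
    lines = ["flowchart TD"]
    for box in boxes:
        # Double quotes would end the Mermaid label early, so swap them out.
        safe_label = box.label.replace('"', "'")
        lines.append(f'    {box.id}["{safe_label}"]')
    known_ids = {box.id for box in boxes}
    for arrow in arrows:
        # Skip arrows that point at boxes the model never returned.
        if arrow.source not in known_ids or arrow.target not in known_ids:
            continue
        if arrow.label:
            safe_edge = arrow.label.replace("|", "/")
            lines.append(f"    {arrow.source} -->|{safe_edge}| {arrow.target}")
        else:
            lines.append(f"    {arrow.source} --> {arrow.target}")
    return "\n".join(lines)
```

Dropping dangling arrows silently is a design choice for the sketch; in a real pipeline you would probably want to count them, since a high count suggests the model misread the diagram.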
4 · Real-world photo classification for domain tasks
Photos from the field — an inspector's photos of equipment, a user's photo of a product defect, a farmer's photo of a crop disease, a construction site's photo of safety issues. These used to require a custom CV model per domain. Now they don't, for most tasks.
Concrete wins: field-inspection apps, user-uploaded damage claims, quality control from factory-floor photos, warranty photo triage.
5 · Video frames as a flipbook
Video support is still uneven across providers and more expensive than single images, but you can approximate a lot of video tasks by sampling frames every few seconds and describing each with the same schema. For things like "summarise this lecture" or "find the slide where X happens," this works and costs a fraction of a dedicated video pipeline.
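The sampling half of the flipbook approach is plain arithmetic. Here is a rough sketch of picking timestamps with a hard cap on frame count, so a long video cannot blow the budget. `sample_timestamps` and its defaults are hypothetical, and grabbing the frames themselves (with ffmpeg, OpenCV, or similar) is left out.

```python
import math
from typing import List


def sample_timestamps(
    duration_s: float,
    every_s: float = 5.0,
    max_frames: int = 120,
) -> List[float]:
    """Timestamps (in seconds) at which to grab frames from a video.

    Samples every `every_s` seconds; if that would exceed `max_frames`,
    the spacing is widened so the whole video is still covered.
    """
    if duration_s <= 0 or every_s <= 0 or max_frames <= 0:
        return []
    count = math.ceil(duration_s / every_s)
    step = every_s
    if count > max_frames:
        count = max_frames
        step = duration_s / max_frames
    # All timestamps fall strictly before the end of the video.
    return [round(i * step, 3) for i in range(count)]
```

Each timestamp then becomes one image call with the same schema, which is why the frame cap is worth setting before you look at the bill.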
The schema design moves that matter
Vision-model extraction has the same schema-design traps as text extraction (see B3.1 and B1.3) plus a few that are specific to images. The moves:
Include an uncertainty field. If the model can't read a number from a smudged receipt, you want it to say so, not guess.
```python
class Receipt(BaseModel):
    total: float
    total_uncertain: bool = False  # true if the model was not sure
    ...
```
Better: include per-field confidence. The model can usually flag "I'm not sure about this one," and although self-reported confidence is not perfectly calibrated, capturing it lets you route low-confidence extractions to human review.
Use structural markers for multi-region documents. If the document has a header, body, and footer, define each as a separate field so the model knows where to look. "Header fields" and "line items" and "footer totals" are easier for the model to fill correctly than a flat list of 30 fields.
Reject low-quality images explicitly. Add a top-level image_quality field with values like clear, readable_with_effort, unreadable. The model is surprisingly good at self-reporting quality, and this routes the obvious failures to "ask the user for a better photo" instead of "hallucinate the fields."
Don't over-ask. If the receipt has 3 fields you care about, extract 3 fields, not 15. Every extra field is another chance for the model to drift or hallucinate. Keep the schema tight.
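Put together, these moves give you a schema you can route on. The sketch below is one possible shape, not a prescribed one: `ExtractedField`, `Extraction`, `route`, and the 0.8 threshold are all hypothetical, and plain dataclasses stand in for whatever schema library you use.

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional

ImageQuality = Literal["clear", "readable_with_effort", "unreadable"]


@dataclass
class ExtractedField:
    value: Optional[str]  # None when the model could not read the field
    confidence: float     # the model's self-reported confidence, 0.0 to 1.0


@dataclass
class Extraction:
    image_quality: ImageQuality
    # Separate regions instead of one flat list of fields.
    header: Dict[str, ExtractedField]
    footer: Dict[str, ExtractedField]


def route(extraction: Extraction, min_confidence: float = 0.8) -> str:
    """Decide what happens to an extraction before it touches real data."""
    if extraction.image_quality == "unreadable":
        return "ask_for_new_photo"
    fields = {**extraction.header, **extraction.footer}
    shaky = [
        name
        for name, extracted in fields.items()
        if extracted.value is None or extracted.confidence < min_confidence
    ]
    if shaky or extraction.image_quality == "readable_with_effort":
        return "human_review"
    return "auto_accept"
```

The threshold is a starting guess. Tune it against real review outcomes rather than trusting the model's numbers at face value.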
The cost and latency picture
Multimodal calls are meaningfully more expensive than text-only calls, and somewhat slower. As of 2026, roughly:
- Input cost: 2-5x the equivalent text call. An image counts as a lot of tokens — a 1024x1024 image is typically billed as 1,000-2,000 input tokens depending on the provider.
- Output cost: unchanged — output tokens are output tokens.
- Latency: slightly higher than text-only; noticeable if you're streaming, invisible for batch.
Concrete numbers for receipt extraction: a single call is usually $0.01-$0.03 per receipt with Claude Sonnet or GPT-5, similar with Gemini. That's cents to process a receipt that a human would spend 30 seconds on. The economics favour the model dramatically for anything that would otherwise need a human.
Caching (B5.2) helps here too: the provider's prompt cache can reduce the repeated system-prompt cost, and for video-as-flipbook use cases, cached frames can save meaningful money.
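If you want to sanity-check the per-receipt figure for your own setup, the arithmetic is short. Everything numeric below is an assumption: the 750-pixels-per-token ratio mirrors one provider's published rule of thumb and differs elsewhere, and the prices in the usage note are placeholders, not a quote. `image_tokens` and `call_cost_usd` are hypothetical helper names.

```python
import math


def image_tokens(width_px: int, height_px: int, pixels_per_token: float = 750.0) -> int:
    """Rough input-token estimate for one image (the ratio is an assumption)."""
    return math.ceil(width_px * height_px / pixels_per_token)


def call_cost_usd(
    image_toks: int,
    text_toks: int,
    output_toks: int,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Cost of a single call given per-million-token prices."""
    input_cost = (image_toks + text_toks) * input_usd_per_mtok / 1_000_000
    output_cost = output_toks * output_usd_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder numbers: a 1024x1024 photo, ~600 tokens of prompt and schema,
# ~500 tokens of structured output, at $3 / $15 per million tokens.
receipt_cost = call_cost_usd(image_tokens(1024, 1024), 600, 500, 3.0, 15.0)
```

With those placeholder inputs the estimate lands inside the $0.01-$0.03 range above. Swap in your provider's current prices and your real image sizes before budgeting on it.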
The boring gotchas that ship bugs
A short list of things teams trip over the first time they wire up multimodal:
1 · Rotation. A photo taken on a phone has rotation metadata. Some libraries strip it; some preserve it; some half-apply it. The model sees the rotated or unrotated image depending on your stack. Always auto-rotate before passing to the model, or the extraction will be wildly wrong on "sideways" images.
2 · HEIC and other iPhone formats. Users upload HEIC files. Your backend expects JPEG. Convert explicitly at the entry point. Do not trust the model to "figure it out."
3 · Size limits. Most providers cap images at around 5-20 MB and 2048px-ish on the longest side. Downscale before sending. For text-heavy documents, aggressive downscaling hurts readability; test on real content before committing to a size.
4 · Privacy redaction. Photos users upload often contain more than the user meant to share. A receipt photo might have a credit card number, a screenshot might have a password, an inspection photo might have a license plate. Decide up front what you will and won't extract, and don't log raw images longer than you need to.
5 · OCR is sometimes cheaper. For pure text extraction on clean documents (typed letters, bank statements), traditional OCR tools (Tesseract, Amazon Textract) can be cheaper and faster than a vision LLM. Use the right tool for the job. Vision LLMs win when you need structured understanding, not just text.
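Two of these gotchas can be caught at the entry point with nothing but the standard library: sniffing the real file format from its first bytes (so a HEIC upload named photo.jpg does not slip through), and working out the downscale target. The sketch below covers only those decisions. `sniff_image_format`, `fit_within`, and the 2048px default are hypothetical, and the actual pixel work (applying EXIF rotation, decoding HEIC, resizing) still needs an imaging library such as Pillow, whose `ImageOps.exif_transpose` handles the rotation case.

```python
from typing import Tuple

# Brand codes found in the "ftyp" box of HEIF-family files (HEIC included).
HEIF_BRANDS = (b"heic", b"heix", b"hevc", b"hevx", b"mif1", b"msf1")


def sniff_image_format(head: bytes) -> str:
    """Identify an upload from its first 12 or more bytes, ignoring the filename."""
    if head[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if head[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    if head[4:8] == b"ftyp" and head[8:12] in HEIF_BRANDS:
        return "heic"
    return "unknown"


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Target size that keeps aspect ratio with the longest side <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

Anything that sniffs as "heic" gets converted, anything "unknown" gets rejected with a clear error, and `fit_within` tells the resize step what to aim for.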
Admit what breaks
- Hallucination is sneakier with images. The model will confidently "read" a number that wasn't there, or "see" a field that doesn't exist. Text hallucinations are often caught by review; image hallucinations feel like competent reading. Always spot-check extractions against the source.
- Edge cases cluster. Rare document layouts, bad lighting, handwritten content — each is a small failure rate, but they compound. Measure on real user data from day one.
- Regional and cultural bias. Receipts from South Asia use different layouts than receipts from Europe, and training data tends to skew North American and European. Budget extra validation for regions that are under-represented in that data.
- Structured output has its own format quirks for images. Some providers need an explicit "tool_choice" to force the tool call (as in the snippet above); without it, the model might include preamble text before the tool call.
- Video is not yet cheap. Flipbook mode works but costs add up fast. Budget carefully if you're processing hours of content.
- Large batches need parallelism and backoff. Running 10,000 receipt extractions overnight means 10,000 calls, rate-limited by the provider. Semaphore + retry is the pattern (per B1.2).
- "What is this image of?" is an easy task. "What exact field is at position X in this image?" is hard. The further you get from "describe this picture," the more you'll need to validate.
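For the large-batch point above, the semaphore-plus-retry pattern looks roughly like this with asyncio. It is a sketch under simplifying assumptions: `run_batch` and `with_retry` are hypothetical names, `worker` stands in for your own async extraction call, and every exception is retried, whereas real code would retry only rate-limit and transient errors.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Iterable, List


async def with_retry(
    call: Callable[[], Awaitable[Any]],
    attempts: int = 4,
    base_delay: float = 0.5,
) -> Any:
    """Run `call`, retrying with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:  # simplification: narrow this in real code
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def run_batch(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    concurrency: int = 8,
    attempts: int = 4,
    base_delay: float = 0.5,
) -> List[Any]:
    """Process items with bounded concurrency; results keep input order.

    An item that still fails after all attempts yields its exception
    instead of a result, so one bad image does not sink the batch.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def process(item: Any) -> Any:
        async with semaphore:  # at most `concurrency` calls in flight
            try:
                return await with_retry(
                    lambda: worker(item), attempts=attempts, base_delay=base_delay
                )
            except Exception as exc:
                return exc

    return await asyncio.gather(*(process(item) for item in items))
```

You would call it as `asyncio.run(run_batch(paths, extract_one))`, where `extract_one` is your own coroutine. Returning exceptions in place means the caller has to check each result's type, which is the price of keeping the batch alive.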
What just changed in your code
- Audit your codebase for places that handle images or PDFs. Every one of them is a candidate for image-in-structured-data-out. The hardest ones — the "we've been meaning to automate this" ones — are often the easiest wins.
- Write your first vision extractor as a schema. Define the shape you want; pass an image; let the model fill it in. A few dozen lines of code.
- Include confidence or uncertainty fields for any extraction that affects downstream data quality.
- Auto-rotate, convert to JPEG, and downscale before sending images to any provider. This is a three-line preprocessing step that saves a week of debugging.
- Don't abandon traditional OCR for pure text extraction on clean documents. Vision LLMs earn their cost on structured, layout-heavy, or messy tasks.
- Measure on real user data, not curated samples. The edge cases are where vision extraction breaks.
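"Measure on real user data" can start very small: hand-label a few dozen real uploads and compute accuracy per field. The sketch below assumes predictions and labels are plain dicts in the same order. `field_accuracy`, `values_match`, and the half-cent tolerance are hypothetical.

```python
from typing import Any, Dict, List


def values_match(predicted: Any, truth: Any, tolerance: float = 0.005) -> bool:
    """Compare one extracted value against its hand-labelled truth."""
    numeric = (int, float)
    if isinstance(predicted, numeric) and isinstance(truth, numeric):
        return abs(predicted - truth) <= tolerance
    if isinstance(predicted, str) and isinstance(truth, str):
        # Ignore case and surrounding whitespace for text fields.
        return predicted.strip().casefold() == truth.strip().casefold()
    return predicted == truth


def field_accuracy(
    predictions: List[Dict[str, Any]],
    labels: List[Dict[str, Any]],
    fields: List[str],
) -> Dict[str, float]:
    """Share of examples where each field was extracted correctly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        return {}
    scores = {}
    for name in fields:
        hits = sum(
            1
            for predicted, truth in zip(predictions, labels)
            if values_match(predicted.get(name), truth.get(name))
        )
        scores[name] = hits / len(labels)
    return scores
```

Per-field numbers matter more than one overall score: an extractor that nails store names but misses totals a fifth of the time is a very different product from the reverse.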
Last post of the course: B6.4 — where the frontier is heading, and how to keep learning. How to stay current on a field that moves every week, without drowning in it. And how to decide which of the new things to adopt and which to wait out. See you there.
📚 AI for Builders · Course Home — 28 posts, six modules.
Cover photo via Unsplash. This post is part of the AI for Builders series.