Embeddings in 30 Minutes of Code

Welcome to Module B3 — Retrieval, Really. Five posts about the boring-sounding, product-defining skill of getting the right context into your model at the right moment. Module B1 taught you how to call the function. Module B2 taught you how to write the instructions. This module teaches you how to stuff the prompt with facts the model couldn't possibly have memorised — and how to do it in a way that doesn't fall apart at scale.

We start where all retrieval starts: with embeddings. If you've only encountered embeddings as a buzzword in "vector database" pitches, this post is the thirty-minute version of the thing. I'm going to skip the research-paper treatment. You don't need to know how contrastive learning works. You need to know what an embedding is, how you get one, what cosine similarity is, and how it fits into a real codebase. Four things. One cup of coffee. Let's go.

What an embedding actually is

An embedding is a list of numbers that represents a piece of text. That's it. The list is usually 768, 1024, 1536, or 3072 numbers long, depending on which embedding model you use. Each number is a float between roughly -1 and 1. The list is called a vector.

Two texts with similar meanings produce similar vectors. Two texts with different meanings produce different vectors. The magic — and this is the only magic in the whole field — is that "similar meaning" has a surprisingly reliable mathematical definition in vector space: two vectors are similar if they point in similar directions.

"Direction similarity" is just a number — cosine similarity — between -1 and 1. Higher is more similar. You will use this number constantly.

That's the whole theory. If you want the longer version with the geometric intuition, go read Module 3 post 6 of Course 1. If you just want to use them, keep reading.

Getting your first embedding

Most major providers expose an embedding endpoint alongside their text-generation endpoint. You pass in a string, you get back a list of floats. Here's how that looks on OpenAI:

# pip install openai
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?",
)

vector = resp.data[0].embedding
print(f"length: {len(vector)}")        # 1536
print(f"first 4: {vector[:4]}")        # e.g. [-0.018, 0.033, ..., -0.007]

One call in, one vector out. Dimension 1536 for text-embedding-3-small. You can also request 3072 from text-embedding-3-large if you want more fidelity at higher storage cost.

The Anthropic SDK doesn't currently ship a first-party embedding endpoint. Anthropic recommends Voyage AI for embeddings to pair with Claude, and the Voyage SDK follows the same pattern as the OpenAI one. For local or self-hosted embeddings, sentence-transformers is the library to reach for. The shape is the same everywhere: text in, vector out.
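For the self-hosted route, a minimal sketch with sentence-transformers might look like this. The model name all-MiniLM-L6-v2 is my pick for illustration (this post doesn't recommend a specific local model), and embed_local is a hypothetical helper name.

```python
# pip install sentence-transformers numpy
import numpy as np


def embed_local(texts: list[str], model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """Embed texts on your own machine; returns one unit-length row per text.

    The default model name is an illustrative assumption. The first call
    downloads the model weights, so it needs network access once.
    """
    # Imported lazily so this module loads even before the package is installed.
    from sentence_transformers import SentenceTransformer

    # In real code, load the model once and reuse it across calls.
    model = SentenceTransformer(model_name)
    # normalize_embeddings=True returns unit-length vectors, ready for dot product.
    return np.asarray(model.encode(texts, normalize_embeddings=True))


# Usage (needs the package and a one-off model download):
# vecs = embed_local(["How do I reset my password?"])
# vecs.shape  # (1, 384) for this particular model
```

Note the dimension differs from the OpenAI example, which is one reason vectors from different models can't be mixed in one index.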

Cosine similarity in four lines

Given two vectors, how similar are they? Cosine similarity is the measurement. In code it's four lines:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

The result is between -1 and 1. In practice, for modern embedding models, you'll see values between about 0.1 and 0.95 — most vectors are at least a little similar to each other because they all live in a trained space that prefers some directions over others.

For real code, use numpy:

# pip install numpy
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Or — and this is the move — normalise your vectors once at storage time so the norms are all 1, and then similarity is just a dot product:

def normalise(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# At index time:
stored_vec = normalise(np.array(vector))

# At query time (query_vec is the query's embedding, normalised the same way):
similarity = float(np.dot(query_vec, stored_vec))  # both are unit length

Dot product is faster than full cosine because the norms, square roots included, are computed once at index time instead of on every comparison. Every production retrieval system I've seen uses this shortcut. Normalise once and compare with dot product.
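If you want to convince yourself the shortcut is exact rather than approximate, a quick check on random vectors does it. This is a sketch with made-up data: the random vectors stand in for real embeddings, and the 1536 length is only there to match the example above.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def normalise(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)


# Random stand-ins for two embeddings (not real model output).
rng = np.random.default_rng(0)
a = rng.normal(size=1536)
b = rng.normal(size=1536)

full = cosine_similarity(a, b)
# Same number, computed as a plain dot product of unit-length vectors.
shortcut = float(np.dot(normalise(a), normalise(b)))
# Cosine ignores magnitude: scaling one vector leaves the score unchanged.
rescaled = cosine_similarity(3.0 * a, b)

print(np.isclose(full, shortcut), np.isclose(full, rescaled))
```

The second check is the reason normalising loses nothing: cosine similarity only ever cared about direction.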

Semantic search in 30 lines

Now we put it together. Here is the minimum viable semantic search — in-memory, on a small dataset, useful for understanding the shape.

# pip install openai numpy
import os
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Your "corpus" — the facts you want to retrieve over.
DOCS = [
    "To reset your password, go to Settings > Security > Reset.",
    "Our billing cycle runs from the 1st to the 31st of each month.",
    "Our support team is available Monday to Friday, 9am to 5pm GMT.",
    "To upgrade your plan, visit the Account page and click Upgrade.",
    "To cancel your subscription, email support@example.com.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([normalise(np.array(d.embedding)) for d in resp.data])

def normalise(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Index: embed all docs once, store.
DOC_VECS = embed(DOCS)

def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    q_vec = embed([query])[0]
    scores = DOC_VECS @ q_vec  # dot product, one against many
    top_k = np.argsort(-scores)[:k]
    return [(DOCS[i], float(scores[i])) for i in top_k]

for text, score in search("I forgot my login"):
    print(f"{score:.3f}  {text}")

Run that, and "I forgot my login" matches "To reset your password, go to Settings > Security > Reset." with a similarity around 0.55, the top match by a clear margin (your exact score will vary). The query shares no words with that document, and nothing in the code says the two are related. The semantic match happens because both texts project to nearby directions in the embedding space.

Read that code twice. That is what retrieval is. Everything else in RAG is variations on this theme. Vector stores make it faster. Rerankers make it more accurate. Hybrid search combines it with keyword matching. Chunking makes the docs the right size. But the core move — embed a corpus, embed the query, find nearest neighbours — is this thirty-line function.

Where embeddings actually fit in your codebase

Four common places, in order of frequency:

1 · Semantic search inside RAG

By far the most common. You have a knowledge base. A user asks a question. You embed the question, find the nearest few chunks of the knowledge base, and stuff them into the model's prompt as context. The model answers grounded in the retrieved material. We'll build a production-shaped version of this in B3.3.

2 · Classification and clustering without labels

Embeddings let you group similar things without training a classifier. You embed every support ticket you've ever received, run k-means or HDBSCAN on the vectors, and get an unsupervised taxonomy of your users' problems. No labels needed. No training. You'll find clusters you hadn't named.
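To make the clustering step concrete, here is a deliberately tiny k-means in numpy. It's a sketch, not what you'd ship: the 2-D vectors are made-up stand-ins for real ticket embeddings, kmeans is a hypothetical helper, and in practice you would more likely reach for a library implementation of k-means or HDBSCAN.

```python
import numpy as np


def kmeans(vectors: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Tiny k-means for illustration; returns one cluster label per row."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every vector to every centroid.
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = vectors[labels == c]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[c] = members.mean(axis=0)
    return labels


# Made-up 2-D "embeddings": three tickets about one topic, three about another.
tickets = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.95, 0.0],
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.0],
])
labels = kmeans(tickets, k=2)
print(labels)
```

The labels are arbitrary integers; naming the clusters is still your job, usually by reading a handful of tickets from each.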

3 · Deduplication and near-duplicate detection

You have 100,000 forum posts and you suspect many are near-duplicates. Embed them all, find pairs with cosine similarity above a high threshold (0.95 is a reasonable starting point; tune it on your own data), and you have your near-duplicate set. Much smarter than hash-based dedup because it catches paraphrases, typos, and reworded copies.
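The pairwise comparison can be sketched in a few lines of numpy. The vectors here are toy stand-ins for embeddings, near_duplicate_pairs is a hypothetical helper, and the code assumes a positive threshold.

```python
import numpy as np


def near_duplicate_pairs(
    vectors: np.ndarray, threshold: float
) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair i < j scoring above threshold."""
    # Normalise rows so the matrix product below is cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Upper triangle only: skips self-matches and mirrored (j, i) pairs.
    i_idx, j_idx = np.where(np.triu(sims, k=1) > threshold)
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(i_idx, j_idx)]


# Toy vectors: rows 0 and 1 point almost the same way, row 2 does not.
posts = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
pairs = near_duplicate_pairs(posts, threshold=0.95)
print(pairs)
```

This builds the full n-by-n similarity matrix, which is fine for a few thousand items. At 100,000 posts that matrix has ten billion entries, so you would compare in blocks or use an approximate nearest-neighbour index instead.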

4 · Anomaly detection

Embed incoming events (log lines, user messages, transactions). Compute the mean vector of normal events. Flag any event whose distance from the mean exceeds a threshold. Cheap, effective, and surprisingly good on text-heavy data.
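As a sketch of that recipe: compute the centroid of known-normal vectors, then flag anything too far from it. The data is made up, and the threshold rule (twice the largest distance seen among normal events) is an assumption chosen for illustration, not a recommendation.

```python
import numpy as np


def distances_from(centroid: np.ndarray, vecs: np.ndarray) -> np.ndarray:
    """Euclidean distance of each row in vecs from the centroid."""
    return np.linalg.norm(vecs - centroid, axis=1)


# Made-up stand-ins for embeddings of events known to be normal.
normal = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05]])
centroid = normal.mean(axis=0)

# Assumed rule for this sketch: twice the worst distance among normal events.
threshold = 2.0 * distances_from(centroid, normal).max()

# One event that looks like the others, one that clearly does not.
incoming = np.array([[0.97, 0.06], [0.0, 1.0]])
flags = distances_from(centroid, incoming) > threshold
print(flags)
```

A single centroid assumes "normal" is one blob. If your normal traffic has several distinct modes, distance to the nearest cluster centre is the natural extension.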

The choices that actually matter at scale

Four decisions you will make once and regret if you get wrong:

1 · Which embedding model

Different models produce vectors of different dimensions, trained on different data, with different cost and latency. Rule of thumb: use the smallest frontier embedding model until you have evidence you need a bigger one. text-embedding-3-small (OpenAI), voyage-3 or voyage-3-lite (Voyage), text-embedding-005 (Google) are all reasonable starting points in 2026. The gap to larger models is smaller than providers want you to think.

2 · Whether to batch

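Embedding endpoints accept batches of inputs. Passing 100 texts in one call is much faster than 100 separate calls: one network round-trip instead of 100, and far less pressure on request-rate limits. Pricing is typically per token, so the saving is time rather than money. Your indexing code should always batch, up to the provider's cap (2048 inputs per request on OpenAI; other providers differ). Your query code is always a batch of one, which is fine.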
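A batching loop for indexing might be sketched like this. embed_all and batched are hypothetical helpers, and the embedding call is passed in as a function so the loop can be shown (and tested) without touching a real API; in your code it would wrap something like the embed() function from the search example.

```python
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 2048,  # assumed cap; check your provider's limit
) -> list[list[float]]:
    """Embed every text using as few calls as the cap allows."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors


# Fake embedder for demonstration: records batch sizes, returns dummy vectors.
calls: list[int] = []


def fake_embed_batch(chunk: list[str]) -> list[list[float]]:
    calls.append(len(chunk))
    return [[float(len(text))] for text in chunk]


vectors = embed_all([f"doc {i}" for i in range(5000)], fake_embed_batch)
print(len(vectors), calls)
```

Providers also cap total tokens per request, so with long documents you may need a smaller batch size than the item cap suggests.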

3 · How to store

For fewer than 100,000 vectors, you do not need a vector database. A Python list in memory, a pickle file, a SQLite table with a BLOB column — any of these work fine and are faster than a managed service because they skip the network round-trip. We go into this properly in B3.2.
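The SQLite option is less work than it sounds. Here is a minimal sketch; the table name, column names, and helper functions are all made up for illustration, and it stores vectors as float32 bytes, which halves storage compared with float64 at a precision cost that rarely matters for similarity search.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")  # use a file path for a real index
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, vec BLOB)")


def save(conn: sqlite3.Connection, text: str, vec: np.ndarray) -> None:
    """Store one document with its vector as raw float32 bytes."""
    blob = vec.astype(np.float32).tobytes()
    conn.execute("INSERT INTO docs (text, vec) VALUES (?, ?)", (text, blob))


def load_all(conn: sqlite3.Connection) -> tuple[list[str], np.ndarray]:
    """Read everything back as a list of texts and a 2-D array of vectors."""
    rows = conn.execute("SELECT text, vec FROM docs ORDER BY id").fetchall()
    texts = [text for text, _ in rows]
    vecs = np.array([np.frombuffer(blob, dtype=np.float32) for _, blob in rows])
    return texts, vecs


# Toy unit-length vectors standing in for real embeddings.
save(conn, "reset password", np.array([0.6, 0.8]))
save(conn, "billing cycle", np.array([0.8, 0.6]))
conn.commit()

texts, vecs = load_all(conn)
print(texts, vecs.shape)
```

Load the array once at startup and search it in memory with the same matrix product used in the search function above.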

4 · How to monitor quality

Retrieval quality is not free. Bad embeddings → irrelevant context → bad answers, and the bad answer can look fine if you're not measuring. Hook retrieval into your eval loop (B2.5): pick 30 queries, mark which documents should be retrieved for each, and measure whether your system returns them. The metric you want is recall@k — of the right documents, what fraction were in the top k results. Aim for recall@5 > 0.9 on a sanity-check set.
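The metric itself is a few lines. In this sketch the document IDs and the two-query eval set are invented for illustration; your real set would hold the 30 or so hand-labelled queries described above.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one per query."""
    return sum(recall_at_k(got, want, k) for got, want in results) / len(results)


# Invented eval set: what the system returned vs. what it should have returned.
eval_set = [
    (["doc_pw", "doc_plan", "doc_bill"], {"doc_pw"}),                     # 1 of 1 found
    (["doc_plan", "doc_hours", "doc_bill"], {"doc_bill", "doc_cancel"}),  # 1 of 2 found
]
score = mean_recall_at_k(eval_set, k=3)
print(f"recall@3 = {score:.2f}")  # (1.0 + 0.5) / 2 = 0.75
```

Run it on every change to your embedding model, chunking, or query preprocessing, and keep the number in the same place you track your other evals.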

Admit what breaks

Embedding-to-embedding comparisons across models are meaningless. A vector from text-embedding-3-small is not comparable with a vector from Voyage. If you change models, you have to re-embed your entire corpus. This is not cheap. Pick a model and stick with it.
Embedding drift between model versions. Providers occasionally update their embedding models. When they do, previously-similar vectors may be less similar. Subscribe to model deprecation notices and plan re-embedding days into your roadmap.
Short queries embed badly. A one-word query ("billing?") carries so little context that its vector often lands far from every document. Pad short queries (e.g., prefix with "User's question:") or apply query expansion before embedding.
Long documents embed lossily. A 5,000-word document is squeezed into a vector of the same 1,536 dimensions as a 50-word snippet. The long version loses detail. Chunk before embedding (this is B3.3 territory).
Cosine similarity thresholds don't generalise. A score of 0.7 is "high" for one dataset and "low" for another. Always tune thresholds per-dataset using real examples, never based on blog-post numbers.
All the fancy tricks come before "is my retrieval eval set any good." Teams spend weeks tuning models and rerankers before they have a reliable way to measure improvement. Measure first. Then tune.
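For the long-document problem above, the simplest possible chunker splits on words with some overlap between neighbouring chunks. This is only a sketch to show the shape: chunk_words is a hypothetical helper, and the default size and overlap are placeholder numbers I picked, not tuned values. B3.3 covers chunking properly.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to `size` words, sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


# A synthetic 500-word "document" so the chunk boundaries are easy to see.
document = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(document)
print(len(chunks), [c.split()[0] for c in chunks])
```

Each chunk then gets its own embedding, so a query can match the relevant passage instead of a blurred average of the whole document.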

What just changed in your code

Install the OpenAI (or Voyage, or your provider's) SDK and get your first embedding. Store it somewhere, look at the floats, feel the concreteness.
Write the 30-line semantic search function above. Actually run it. Verify the intuition that similar texts score higher.
Normalise all your vectors once at storage and use dot product for comparison. Stop writing cosine_similarity() calls at query time.
Do not reach for a vector database yet. We'll decide whether you need one in the next post. For most real codebases, the answer is "not right now."
Hook retrieval into your eval loop. The retrieval problem and the prompting problem are both graded by the same number: did the user get a good answer.

Next post, B3.2, is the most opinionated post of this course so far: Do You Actually Need a Vector Store? Spoiler: probably not. You'll see why Postgres with pgvector — or SQLite with nothing at all — beats the fancy managed services for the first 90% of real projects.


📚 AI for Builders · Course Home — 28 posts, six modules.

Cover photo via Unsplash. This post is part of the AI for Builders series.
