Guardrails: Input Validation, Output Filtering, Abuse Patterns
The unglamorous layer that stops your LLM app from being a liability. Input validation, output filtering, abuse detection, the moves that actually hold up.
Every LLM app starts the same way. The team builds the capabilities, the prompt is good, the agent works, the RAG retrieves the right docs, the demo is clean. Then the team says "OK, ship it." And someone — usually security, sometimes a nervous PM, occasionally the CEO — asks: "What happens when a user sends something terrible? What happens when the model says something terrible? What happens when someone abuses this at scale?"
The team does not have good answers. The team scrambles to add a few regex filters and a content-moderation API call the week before launch. The guardrails are an afterthought.
This is backwards. Guardrails are not a thing you add at the end; they are a layer of the architecture, and they determine whether your shipped product is a feature or a liability. This post walks through the minimum guardrails every production LLM app needs: input validation, output filtering, abuse detection, and the specific thing called "safe mode" that you should always have a switch for.
Not a lecture on AI safety. Not a moral panic. Just the practical layer that stops "your AI told a user to do something dangerous" from being the headline about your product.
The map
Guardrails live in three places relative to the model call. Each catches a different class of problem.
Three boxes around the core call: one before (input validation), one after (output filtering), one watching from the side (abuse detection). Let me take each in turn.
Guardrail 1: input validation
The first line. Before user content reaches the model at all, run it through checks that catch the obvious problems. This is the cheapest guardrail in both money and latency, and it catches the largest share of abuse traffic.
Four things to validate on every incoming user message:
1 · Length. Cap the input. 10,000 characters is a reasonable upper bound for a single user message in most chat products. Longer isn't "more expressive" — it's "the attacker is trying to stuff the context window with adversarial content." Reject hard over the cap; warn near it.
2 · Rate. Per-user, per-feature, per-day caps. A legitimate user asks 50 questions a day. An abuse script asks 5,000. Rate limits are the crudest but most reliable defence against runaway cost. Implement them at the HTTP layer, not inside the LLM code.
3 · Content class. Before the main model call, run a cheap classifier to bucket the input: safe, potentially_harmful, spam, prompt_injection, unsupported_language. You can do this with a small local model, a cheap mid-tier LLM, or a dedicated moderation API. The classifier is not the final decision — it's a routing signal:
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_input(text: str) -> dict:
    """Return the moderation category flags for `text` as a dict."""
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return resp.results[0].categories.model_dump()

# In the request handler (log_abuse_event and run_main_agent live elsewhere in your app):
def handle(user_id: str, user_message: str) -> str:
    flags = classify_input(user_message)
    if flags.get("violence", False) or flags.get("self_harm", False):
        return "I can't help with that. If you need support, please contact [resource]."
    if flags.get("sexual_minors", False):
        log_abuse_event(user_id, "minors_category")
        return "I can't help with that."
    return run_main_agent(user_message)
Don't route every class to a refusal. For most classes, just log and flag. For the few categories where you have a policy, enforce it.
4 · Structural safety. The input-sanitisation moves from B2.4: strip hidden HTML before RAG ingest, normalise unicode, wrap untrusted content in <user_input> tags, tell the model to treat those tags as data. These are not "input guardrails" in the moderation sense, but they live at the same layer and cost the same — do them here.
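The length cap and the structural-safety moves above can be sketched in one small function. This is a minimal sketch, not the B2.4 implementation: the name `sanitise_input`, the 8,000-character warning threshold, and the choice to drop invisible format characters are all assumptions for illustration.

```python
import re
import unicodedata

MAX_INPUT_CHARS = 10_000   # hard cap from the text; tune per product
WARN_INPUT_CHARS = 8_000   # hypothetical "warn near it" threshold

# Matches the wrapper tag itself, so input cannot open or close it.
_TAG_RE = re.compile(r"</?user_input>", re.IGNORECASE)


class InputTooLong(ValueError):
    """Raised when a message exceeds the hard cap."""


def sanitise_input(text: str) -> tuple[str, bool]:
    """Return (wrapped_text, near_limit); raise InputTooLong over the cap."""
    if len(text) > MAX_INPUT_CHARS:
        raise InputTooLong(f"{len(text)} chars exceeds cap of {MAX_INPUT_CHARS}")
    # NFKC folds compatibility forms (full-width letters, ligatures) together.
    cleaned = unicodedata.normalize("NFKC", text)
    # Drop invisible format characters (category Cf), e.g. zero-width spaces.
    cleaned = "".join(ch for ch in cleaned if unicodedata.category(ch) != "Cf")
    # Remove any copy of the wrapper tag smuggled in by the user.
    cleaned = _TAG_RE.sub("", cleaned)
    near_limit = len(text) > WARN_INPUT_CHARS
    return f"<user_input>\n{cleaned}\n</user_input>", near_limit
```

Dropping every `Cf` character is a blunt choice: it also removes the zero-width joiners that some emoji sequences and scripts rely on, so a product with heavy non-Latin traffic may want a narrower list.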
Guardrail 2: output filtering
The second line. Even with good input validation and a tuned model, the output can still be something you don't want to show the user. The output filter runs on the model's response before it reaches the user, and blocks or sanitises if it violates policy.
Five things worth checking on the way out:
1 · Content-class check. Run the same moderation API on the model's output as you ran on the user's input. If the model produces something in a banned category, block it and substitute a fallback response. Rare on well-trained models, non-zero, worth the half-cent per call.
2 · Leaked-system-prompt check. The model should never reproduce its system prompt to the user. If the output contains a recognisable chunk of your system prompt (more than, say, 50 characters of overlap), block it or rewrite. This catches prompt-injection attempts that made it through the input filter.
3 · Allowlist-URL check. If your product shouldn't link users to external sites, strip or block URLs in the output that aren't in an allowlist. The attack vector here is prompt injection or tool-result injection that smuggles a URL into the response — output filtering catches that even when the input filter didn't.
4 · Refusal coherence. If the system prompt says "never discuss competitor X," check that the output doesn't mention competitor X. A simple contains-check on the response is enough most of the time. Don't trust the model alone to enforce it.
5 · Schema validation for structured outputs. From B1.3. If you asked for a {"category": "billing"} and the model returned {"category": "urgent billing"}, that's a schema failure. Retry or fall back; never pass unvalidated structured output to the rest of your code.
from urllib.parse import urlparse

# extract_prompt_chunks, extract_urls, strip_disallowed_urls and log_event are
# small helpers assumed to be defined elsewhere. Requires Python 3.10+ for `str | None`.
def filter_output(model_response: str, system_prompt: str) -> str | None:
    # Leak check: block if any 50-character chunk of the system prompt appears verbatim
    if any(chunk in model_response for chunk in extract_prompt_chunks(system_prompt, min_len=50)):
        log_event("system_prompt_leak")
        return None  # blocked
    # URL allowlist: strip links to hosts we don't control
    urls = extract_urls(model_response)
    allowed = {"docs.acme.com", "support.acme.com"}
    if any(urlparse(u).netloc not in allowed for u in urls):
        log_event("off_allowlist_url")
        model_response = strip_disallowed_urls(model_response, allowed)
    # Moderation: reuse the input classifier on the model's output
    flags = classify_input(model_response)
    if flags.get("harassment") or flags.get("violence"):
        log_event("output_flagged")
        return None  # blocked
    return model_response

def safe_ask(user_message: str) -> str:
    raw = run_main_agent(user_message)
    filtered = filter_output(raw, SYSTEM_PROMPT)
    if filtered is None:
        return "I can't provide a response to that request."
    return filtered
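The filter above leans on three helpers that are not shown. Here is one plausible way to write them; the bodies are assumptions, not the course's own versions. The leak check uses every `min_len`-character window of the prompt, and the URL regex is deliberately simple.

```python
import re
from urllib.parse import urlparse

# Simplified URL pattern: scheme, then anything up to whitespace or a closing bracket/quote.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
_TRAILING = ".,;:!?"  # punctuation that usually belongs to the sentence, not the URL


def extract_prompt_chunks(system_prompt: str, min_len: int = 50) -> list[str]:
    """Every window of exactly min_len characters from the prompt."""
    if len(system_prompt) < min_len:
        return []
    last_start = len(system_prompt) - min_len
    return [system_prompt[i:i + min_len] for i in range(last_start + 1)]


def extract_urls(text: str) -> list[str]:
    """All http(s) URLs in text, with trailing sentence punctuation removed."""
    return [url.rstrip(_TRAILING) for url in URL_RE.findall(text)]


def strip_disallowed_urls(text: str, allowed: set[str]) -> str:
    """Replace URLs whose host is not in `allowed` with a placeholder."""
    def _replace(match: re.Match) -> str:
        raw = match.group(0)
        url = raw.rstrip(_TRAILING)
        trailing = raw[len(url):]
        host = urlparse(url).hostname or ""
        return raw if host in allowed else "[link removed]" + trailing
    return URL_RE.sub(_replace, text)
```

An exact-substring leak check misses paraphrased or reflowed leaks, so treat it as a cheap first pass. The window list grows with prompt length; for very long prompts a coarser stride may be worth the lost precision.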
The output filter adds latency — typically one extra moderation call or a handful of string checks, so 50–300ms depending on what you're checking. For user-facing products, it's worth it. For internal tools with trusted users, you can skip most of it.
Guardrail 3: abuse detection
The third layer watches patterns across requests, not individual ones. A single weird request is fine; a hundred weird requests from the same user in five minutes is an attack. Abuse detection sits beside the request path, reading the logs (B5.3), and produces signals that gate or block traffic.
Things to watch:
1 · Request rate from a single identity. Covered under input validation; also worth tracking at the abuse-detection layer because repeat offenders across rate-limit windows are the ones to block permanently.
2 · Pattern of input categories. A user who sends one "potentially_harmful" message is a noisy signal. A user who sends thirty in a day is trying something. Aggregate the classifier flags from input validation over a window and alert on outliers.
3 · Prompt-injection attempts. A user repeatedly trying phrases like ignore previous instructions, system:, you are now, base64 blobs, ROT13 — especially from the same identity or IP. These rarely have legitimate reasons. Count them, alert on spikes, consider rate-limiting or blocking.
4 · Expensive-request concentration. One user generating 40% of your daily spend is either a power user or an attacker. Either way, you want to know. Alert on per-user-cost outliers.
5 · Response-filter-triggered frequency. The output filter blocked 100 responses for the same user today. Either your product is genuinely hostile to them or they're probing. Flag for review.
This layer doesn't need to be real-time. Run a job every 5 minutes over recent logs, compute aggregates, emit alerts for outliers. The feedback loop is slow but the data is rich.
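The aggregation job can be as small as a counter over recent guardrail events. The sketch below assumes a hypothetical `GuardrailEvent` record pulled from your logs. The thresholds of 30 and 100 echo the examples above; the prompt-injection threshold of 5 is invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class GuardrailEvent:
    user_id: str
    kind: str        # e.g. "input_flagged", "prompt_injection", "output_blocked"
    at: datetime


# Per-user counts within the window that should raise an alert (illustrative values).
THRESHOLDS = {"input_flagged": 30, "prompt_injection": 5, "output_blocked": 100}


def find_outliers(events: Iterable[GuardrailEvent],
                  now: datetime,
                  window: timedelta = timedelta(days=1)) -> list[tuple[str, str, int]]:
    """Return sorted (user_id, kind, count) for every user at or over a threshold."""
    cutoff = now - window
    counts = Counter((e.user_id, e.kind) for e in events if e.at >= cutoff)
    return sorted(
        (user, kind, n)
        for (user, kind), n in counts.items()
        if kind in THRESHOLDS and n >= THRESHOLDS[kind]
    )
```

Run it from a scheduler every few minutes and send the result to whatever alerting you already have. Fixed thresholds are a starting point; replacing them with per-product percentiles is the usual next step.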
The "safe mode" switch
One more thing that every product needs and almost no product has when it ships: a global switch that puts the product into a restricted, conservative mode. In safe mode:
- Every user query goes through stricter input validation.
- The main model is temporarily replaced with a more conservative model and system prompt.
- Tool use is disabled or narrowed to a read-only set.
- Output filtering is more aggressive.
- Rate limits are tighter.
The switch is a feature flag. You flip it in one place. It exists so that when you discover — at 2am, from a tweet — that someone is abusing your product in a new way, you can restrict the entire surface area in one operation while you figure out the real fix. Without this switch, your options are "leave it up and keep bleeding" or "take the whole product down." Neither is good.
The time to build the switch is before you need it. It takes an afternoon. Every production LLM app should have one. Most don't, and most teams learn the hard way.
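One way to keep the switch to "one place" is to bundle every setting it changes into a single config object and pick between two instances. Everything named here (the `SAFE_MODE` environment variable, the model names, the tool names, the limits) is a placeholder assumption; a real deployment would more likely read the flag from a feature-flag service.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GuardrailConfig:
    model: str
    max_input_chars: int
    requests_per_day: int
    allowed_tools: frozenset[str]
    block_on_any_flag: bool   # stricter filtering: block on any moderation flag


NORMAL = GuardrailConfig(
    model="main-model",                 # placeholder name
    max_input_chars=10_000,
    requests_per_day=500,
    allowed_tools=frozenset({"search_docs", "create_ticket"}),
    block_on_any_flag=False,
)

SAFE = GuardrailConfig(
    model="conservative-model",         # placeholder name
    max_input_chars=2_000,
    requests_per_day=50,
    allowed_tools=frozenset({"search_docs"}),   # read-only subset
    block_on_any_flag=True,
)


def safe_mode_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical flag source: an environment variable."""
    return env.get("SAFE_MODE", "").strip().lower() in {"1", "true", "on"}


def active_config(env: Mapping[str, str] = os.environ) -> GuardrailConfig:
    """The single place the rest of the code asks which limits apply."""
    return SAFE if safe_mode_enabled(env) else NORMAL
```

Because every layer reads its limits from `active_config()`, flipping the flag changes input caps, rate limits, tools, and filtering together, with no second deploy.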
What does NOT count as a guardrail
Things teams put in the "guardrails" bucket that don't actually work:
- "Don't discuss X" in the system prompt. Nice hint, not a guardrail. Prompt-injection bypasses it trivially. See B2.4.
- Trusting the model to refuse. The model sometimes refuses, sometimes doesn't, depending on phrasing. Not a reliable line of defence.
- Hiding the system prompt. Doesn't work. Users extract them. Treat the system prompt as public.
- Regex lists of forbidden words. Too many false positives on legitimate content, trivially bypassed by paraphrase. Useful as a speed bump, never as a wall.
- Client-side input validation. Trivially bypassed. Never the only layer.
- "The model is smart enough." It isn't, for this problem. The output is a probabilistic function; guardrails provide the deterministic layer.
Every one of these can help as part of a defence-in-depth strategy. None of them is a guardrail on its own.
A realistic guardrail stack
Here's the shape I build for a typical user-facing LLM product:
- HTTP layer: rate limiting by user+IP, request size cap.
- Input validation: length check, unicode normalisation, cheap moderation-API classification, prompt-injection keyword check (as a signal, not a block).
- Safe-mode gate: if safe mode is on, route to the conservative handler.
- Main agent call: the regular B4.2 loop with tool budgets and system-prompt safety instructions.
- Output filter: moderation check, system-prompt leak check, URL allowlist, schema validation.
- Logging: everything (per B5.3), including which guardrails fired.
- Batch abuse detection: every 5 minutes, aggregate logs, alert on outliers.
- Kill switches: the safe-mode flag, per-feature disable flags, per-user block list.
Eight layers. Each one is small — a function, a middleware, a database query. The whole stack adds maybe 200–400ms of latency and a few sub-cents per request. In exchange, the product isn't the headline when something goes wrong.
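To show how small a layer can be, here is a sketch of the first one, per-user rate limiting, as an in-memory sliding-window counter. It assumes a single process. With more than one server the counts need a shared store, and many teams push this into the API gateway instead. The class name and parameters are illustrative.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per key within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock                      # injectable for tests
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False                         # over the limit: reject
        hits.append(now)
        return True
```

A typical key is `f"{user_id}:{ip}"`, matching the user+IP limit in the list. Rejected requests are not recorded, so a blocked client recovers as soon as its oldest accepted request ages out.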
Admit what breaks
- False positives hurt users. An overly aggressive input filter rejects legitimate queries. An overly aggressive output filter blocks helpful answers. Tune against real traffic, not synthetic test cases.
- Moderation APIs are probabilistic. They miss things and they over-trigger. They're a layer, not a wall. Never rely solely on the API.
- Guardrail latency adds up. Two classifier calls plus string checks can add half a second. On latency-sensitive products, parallelise input validation with the main call when possible: fire off the validation and the main request at the same time, and cancel the main request if validation decides the input is unsafe.
- Abuse detection is noisy. Most alerts are false positives. Tune thresholds to be actionable; otherwise you'll ignore the real ones.
- "Safe mode" is a kill switch, not a strategy. Use it when an incident is active. Don't leave it on forever.
- Defence in depth costs depth in money. Each layer adds anywhere from a few microseconds (string checks) to a few hundred milliseconds (classifier calls), plus up to a few tenths of a cent. Over millions of requests, that's real money. Budget for it.
- Users find ways to bypass every specific defence. Which is exactly why you layer them — no single layer has to be perfect.
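The parallelisation trick from the latency point can be sketched with `asyncio`. This assumes `validate` and `run_agent` are async functions you supply (hypothetical names), and that the agent has no side effects, such as tool calls that write, before validation finishes.

```python
import asyncio
from typing import Awaitable, Callable

REFUSAL = "I can't help with that."


async def guarded_call(message: str,
                       validate: Callable[[str], Awaitable[bool]],
                       run_agent: Callable[[str], Awaitable[str]]) -> str:
    """Start the agent and the validator together; cancel the agent if unsafe."""
    agent_task = asyncio.create_task(run_agent(message))
    try:
        is_safe = await validate(message)
    except Exception:
        agent_task.cancel()      # fail closed: do not answer if validation errored
        raise
    if not is_safe:
        agent_task.cancel()
        try:
            await agent_task     # let the cancellation settle
        except (asyncio.CancelledError, Exception):
            pass                 # the agent's result or error no longer matters
        return REFUSAL
    return await agent_task
```

The saving is the validator's latency on the happy path. The cost is that an unsafe request may already have consumed some model tokens before the cancel lands, so this suits products where blocked traffic is a small share of the total.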
What just changed in your code
- Add input validation on every user-facing endpoint: length, rate, moderation class, structural safety.
- Add output filtering on every user-facing response: moderation class, system-prompt leak, URL allowlist, schema validation.
- Build a batch abuse-detection job that aggregates logs and alerts on patterns, not just single events.
- Ship a global safe-mode feature flag that puts the product into a restricted configuration with one switch.
- Log which guardrail fired on every event so you can tune thresholds and reduce false positives.
- Do not rely on "the model will refuse" as a guardrail. The model is the thing you're wrapping, not the wrapper.
- Assume users will try to bypass every single defence. Layer accordingly.
Next post, B5.5, closes out Module B5: rolling out model upgrades and prompt changes without breaking users. A/B tests, shadow traffic, canary releases, prompt migrations. The deploy story nobody tells you and every team eventually learns the hard way.
This post is part of the AI for Builders series (28 posts, six modules).