ActiveFence is now Alice
x
Back
Blog

LLM guardrails: how to secure prompts, outputs, RAG, and agents in production

Alice Staff
-
Jun 9, 2025

TL;DR

LLM guardrails are runtime checks around a language model that decide what it can read, generate, and do, before a bad prompt, a poisoned document, or an unsafe tool call causes harm. A clever system prompt isn't enough; one injected instruction can override it. The fix: enforce policy at four points and measure it.

LLM guardrails are runtime controls that inspect and enforce policy around prompts, model outputs, retrieved context, tool calls, and high-risk actions. In production, effective LLM guardrails reduce prompt injection, data leakage, unsafe responses, hallucinations, and agent misuse, and they leave behind evidence teams can use to monitor and improve LLM security over time.

The first thing breaks the moment an LLM app stops being a demo. A bare system prompt does not survive a hostile user, a poisoned RAG document, an over-privileged tool, or a model upgrade that changes refusal behavior. Guardrails are the layer that decides what is allowed before harm reaches a user, a record, or a downstream system.

I have run pre-launch reviews where the model passed every eval set, the API was hardened, and the team still could not answer one question: what happens when a user pastes a prompt-injection payload into a support ticket, the RAG retriever pulls it back in, and an agent calls a refund tool a minute later? That gap, between model-level safety and application-level policy, is what LLM guardrails are for.

Key takeaways

  • Enforce policy at four points: Place LLM guardrails before the model, around retrieval, after the model, and around every tool call an agent can trigger.
  • Treat prompt engineering as insufficient: A system prompt asks the model to behave, but one injected instruction in a document, tool response, or multi-turn chat can override it.
  • Measure guardrails like production controls: Track true positives, false positives, false negatives, added latency, adversarial robustness, and drift after every model, prompt, or retrieval change.
  • Cover the lifecycle with WonderSuite: WonderBuild tests apps before launch, WonderFence enforces policy-trained detectors at sub-99ms latency across multimodal inputs and outputs, and WonderCheck catches drift as models and prompts shift.
  • Scope what the model can reach: Limit tools, data, and permissions so a successful injection cannot leak records or trigger irreversible actions on its own.

What are LLM guardrails?

LLM guardrails are runtime controls that sit around a large language model and decide what the system is allowed to receive, retrieve, generate, expose, or do. They inspect prompts, retrieved context, model outputs, tool calls, and downstream actions, then allow, block, redact, rewrite, route, escalate, or log each decision against a defined policy.

A working LLM guardrail evaluates the request or response against a policy, then picks an action inside a defined latency budget. The evidence record it writes after acting includes policy version, user, retrieval source, and decision. Security, AI safety, or compliance owners review that record later.

That last part is what separates a guardrail from a string filter. A regex that blocks "ignore previous instructions" is a heuristic. A guardrail logs which policy was hit, which user triggered it, which retrieval source carried the payload, and what the system did next.

LLM guardrails vs AI guardrails

LLM guardrails are a subset of full-stack AI guardrails. LLM guardrails focus on what enters and leaves the language model: prompts, retrieved context, tool inputs, and generated responses. AI guardrails are broader and cover data handling, infrastructure, application logic, agent permissions, and governance evidence across the full system.

LLM guardrails vs AI guardrails vs model-provider safety
Control scopePrimary surfaceWhat it does not cover on its own
Model-provider safetyPretrained refusal, RLHF, baseline filtersApplication policy, RAG sources, tool permissions, user entitlements
LLM guardrailsPrompts, context, outputs, tool inputs around the modelIdentity, network controls, business workflow logic, full audit pipelines
AI guardrailsInputs, outputs, data, tools, agents, infrastructure, governanceReplacing model-provider safety or product security review
Content moderationClassifying harmful content in user or generated textTool calls, RAG entitlement, regulated advice, downstream actions

Use LLM guardrails when the question is "what is the model allowed to read, write, and act on right now?" Use AI guardrails when the question expands to "who is allowed to use the system, with which data, through which tools, against which policies?"

Why LLM apps need controls beyond prompt engineering

Prompt engineering is not a security control. A well-crafted system prompt can be overwritten by a single injected instruction inside a retrieved document, a tool response, or a multi-turn conversation. The OWASP Top 10 for LLM Applications lists prompt injection, sensitive information disclosure, system prompt leakage, vector and embedding weaknesses, and excessive agency as separate top-tier risks because each of them can defeat a "just write a better prompt" approach. Alice's walkthrough of the OWASP LLM Top Ten shows how each category lands inside production GenAI apps.

The shift from chatbot to agent makes this worse. Akto's 2025 State of Agentic AI Security report found that 69% of enterprises were already piloting or running early production agent deployments while guardrail and inventory programs lagged behind. The control surface is expanding faster than the visibility.

Guardrails turn vague safety expectations into measurable runtime decisions. A system prompt asks the model to behave. A guardrail makes the application enforce that behavior regardless of what the model decides to do next.

Where guardrails fit in LLM application architecture

LLM guardrails sit at four enforcement points in a production LLM app: before the model receives the prompt, around retrieval, after the model produces the response, and around any tool call the model can trigger. Each point catches a different failure class and contributes to different evidence.

A minimal placement looks like this:

  • Pre-model input guardrails inspect the user prompt and any concatenated context for injection, jailbreaks, PII, secrets, and policy violations.
  • Retrieval guardrails evaluate which RAG sources the user is allowed to read and whether any retrieved chunk carries hidden instructions or untrusted markup.
  • Post-model output guardrails evaluate the generated response for unsafe content, leaked context, hallucinations, regulated claims, or policy violations.
  • Tool and action guardrails evaluate the model's tool call against allowed scopes, user entitlements, and irreversibility risk before any external system is touched.

The point is that the model is one decision-maker inside a multi-decision system. The application is the layer that owns the policy, not the model.

What LLM guardrails defend against

LLM guardrails defend against failure modes that prompt engineering and model alignment alone cannot resolve. The common thread is that the LLM does not get to be the only judge of what is safe; an external runtime control gets to override or escalate before the harm propagates.

Prompt injection and jailbreak attempts

Prompt injection happens when an attacker smuggles instructions into the model context so the model executes them instead of the intended task. Direct prompt injection comes from a user message. Indirect prompt injection comes from retrieved documents, web pages, emails, tickets, uploaded files, or tool responses that the model treats as trusted context.

Prompt injection guardrails have to inspect both halves of the pipeline: the prompt before inference, and the output after. A response that suddenly contains a system prompt, an internal URL, or a tool call the user never asked for is often the first observable sign that the input control missed something. Alice's research on prompt injection detection in generative AI walks through the direct and indirect variants in more detail.

Jailbreaks overlap with prompt injection but target model refusal. Roleplay, encoded prompts, language switches, and adversarial prefixes are all attempts to slip the model past its safety tuning. Treat both as production risks, not academic curiosities.

Sensitive data leakage and PII exposure

Sensitive data leakage rarely happens at the user prompt. It happens at retrieval, in memory, in summaries, in logs, in tool outputs, and in the generated response itself. An LLM that summarizes a private record for the wrong user, or that surfaces a customer's PII in a debugging log, has leaked data even if no one issued a "show me the database" prompt.

Data leakage guardrails should keep three decisions separate: whether the user is allowed to retrieve a record, whether the model is allowed to read it into context, and whether the output is allowed to reveal it. Conflating the three is how privacy incidents start.

Hallucinations, unsafe responses, and off-topic output

Hallucinations become a security problem the moment users rely on the LLM for decisions, advice, or actions. In finance, healthcare, insurance, legal, and child-facing products, a confident wrong answer can create regulatory exposure, safety incidents, or real harm.

Output guardrails can require source grounding for factual claims, restrict the answer types the model is allowed to return, attach citations to RAG responses, and route unsupported answers to refusal or human review. Off-topic output (the model wandering outside its intended domain) is the lower-severity cousin of the same control gap.

Toxicity, bias, illegal content, and policy violations

LLM guardrails should detect and act on toxicity, harassment, hate, sexual content, self-harm content, CSAM, extremist content, illegal-activity content, and other category-level policy violations. The policy is not generic; it has to match the platform, user population, and jurisdiction.

A child-facing assistant, a developer copilot, a marketplace chatbot, and a financial planning assistant cannot share the same harm taxonomy or escalation rules. Alice's review of chatbot legal accountability and AI safety shows how product context changes the threshold.

RAG context manipulation and untrusted retrieved content

RAG guardrails treat the retrieval pipeline as untrusted input. Retrieved documents can carry hidden prompt-injection payloads, conflicting policy statements, outdated answers, or content the requesting user is not entitled to read. Vector and embedding weaknesses also let attackers poison embeddings to surface a chosen document on demand.

RAG guardrails should enforce user-scoped retrieval, filter sources by trust tier, sanitize retrieved markup, and inspect any text concatenated into the model context. If RAG can read it, the guardrail has to read it first.

Tool misuse, role confusion, and agent actions

The dangerous moment for an LLM agent is not when it speaks. It is when it acts: calls an API, updates a record, writes a file, sends a message, executes code, triggers a workflow. The OWASP Agentic Top Ten maps these failures to specific design decisions around tool selection, planning, and memory, and Alice's read on the 7 subtle sins of agentic AI covers the design mistakes that show up most often in production.

Tool guardrails enforce least privilege, scope, user identity, approval requirements, and rate limits at the action layer. Role confusion (the model treating itself, or another agent, as a privileged user) belongs here too, not in a chat-safety policy.

Types of LLM guardrails

LLM guardrails are grouped by where they sit in the request path. Most production deployments combine all of them because a single layer never covers every failure mode.

Types of LLM guardrails and where they run
Guardrail typeWhere it runsWhat it controls
Input guardrailsBefore the modelUser prompts, concatenated context, files, URLs, multimodal inputs
Output guardrailsAfter the modelGenerated responses, summaries, code, structured outputs
Retrieval guardrailsAround the RAG pipelineSource eligibility, embedding integrity, retrieved-chunk inspection
Tool guardrailsAround tool and agent callsScopes, identity, approvals, irreversibility, rate limits
Data guardrailsAcross prompts, context, outputs, logsPII, secrets, regulated data, IP, entitlements, retention
Human-in-the-loop guardrailsAround high-risk decisionsEscalation routing, reviewer assignment, approval evidence

Input guardrails before prompts reach the model

Input guardrails inspect the full prompt the model will see: the user message, the system prompt, any conversation history, retrieved chunks, tool responses, file contents, and multimodal inputs. They look for prompt injection patterns, jailbreak templates, secrets, PII, prohibited requests, and instructions that violate the application policy.

Input guardrails should reject early. The cheapest place to stop an attack is before it consumes a model inference, a tool call, and a log entry.

Output guardrails before responses reach users or systems

Output guardrails inspect what the model produced before any user or downstream system sees it. They can block unsafe responses, redact PII or secrets, rewrite regulated claims into compliant language, attach citations, enforce schema or JSON structure, or route an ambiguous response to human review.

Output guardrails are also the right place to catch leaks. A response that suddenly contains a confidential string, an internal URL, or a system prompt fragment is a signal that an earlier layer was bypassed.

Retrieval guardrails for RAG context and source trust

Retrieval guardrails turn the RAG pipeline from a content firehose into a policy-aware step. The control set includes user-scoped retrieval, source allowlists by trust tier, markup sanitization, instruction stripping, embedding integrity checks, and chunk-level inspection for injection patterns.

The practical test is whether the team can answer one question: if an attacker uploads a malicious document into a knowledge base, will the next user query execute the attacker's instructions instead of the user's? If yes, the RAG guardrails are not in place.

Tool guardrails for APIs, plugins, functions, and agents

Tool guardrails sit between the model's tool call and the external system it wants to touch. They enforce scoped credentials, parameter validation, user identity, role checks, approval gates, idempotency, rate limits, and rollback paths.

For irreversible actions (payments, account changes, outbound communication, code execution) a tool guardrail should require an explicit approval signal, not just a permissive policy. Alice's overview of low-latency AI guardrails covers how to keep these checks fast enough not to break the user experience.

Data guardrails for PII, secrets, regulated data, and IP

Data guardrails apply across every layer because the same record can move from prompt to retrieval to memory to output to log. The controls include detection and redaction for PII, secrets, payment data, health data, and IP, plus retention limits, purpose binding, and entitlement-aware retrieval.

Data guardrails are also where compliance teams get the evidence they need for frameworks like the NIST AI Risk Management Framework and ISO 42001. Alice's overview of AI risk management frameworks across NIST, OWASP, MITRE, MAESTRO, and ISO shows how the frameworks fit together. If a system cannot show what data it processed, with which authority, the rest of the program is hard to defend.

Human-in-the-loop guardrails for high-risk decisions

Human in the loop is a guardrail action, not a fallback. It applies when the risk, regulatory weight, or irreversibility of a decision is higher than the model should resolve alone: escalations to a clinician, a fraud analyst, a trust and safety reviewer, or a compliance owner.

Human-in-the-loop guardrails need clear routing criteria, response-time expectations, reviewer guidance, and a feedback loop that improves the upstream classifier over time. Otherwise the queue becomes a dumping ground for everything the system did not want to decide.

How LLM guardrails work

LLM guardrails work by chaining deterministic rules, model-based classifiers, LLM-as-a-judge evaluations, and policy engines at defined enforcement points, then logging each decision for monitoring and tuning. The right mix is shaped by risk severity, latency budget, policy complexity, and the cost of false positives versus false negatives.

Deterministic rules, patterns, and schema validation

Deterministic guardrails use rules, allowlists, blocklists, regular expressions, schema validation, structured-output checks, and permission lookups. They are fast, predictable, easy to test, and the right tool when the policy is unambiguous.

Examples include rejecting prompts that contain known secret formats, requiring JSON output that matches a schema, blocking tool calls outside a user's role, or constraining outputs to a fixed answer set. Deterministic checks are the foundation; everything else builds on top.

Model-based classifiers and safety detectors

Model-based guardrails use trained classifiers and safety detectors to evaluate intent, meaning, and contextual risk. They handle the cases deterministic rules cannot: ambiguous language, multilingual prompts, paraphrased jailbreaks, multimodal inputs, and policy-specific harm categories.

The trade-off is operational. Classifiers need their own evaluation sets, latency targets, false-positive monitoring, and retraining pipeline. They should be tested against adversarial behavior, not only clean benchmarks, or they will pass evals and fail in production.

LLM-as-a-judge approaches and their tradeoffs

LLM-as-a-judge is exactly what it sounds like: a second model checking the first. It is flexible, useful for nuanced policy, and the most common pattern for evaluating long-form responses in offline LLM evaluation.

The trade-offs are real. Judge models inherit their own biases, can be manipulated by the same prompt-injection patterns they are supposed to detect, add inference latency, and need calibration against human-labeled ground truth. Use LLM-as-a-judge for evaluation depth and slow-path reviews; do not let it carry every runtime decision alone.

Policy engines for allowed, restricted, and prohibited behavior

A policy engine encodes the rules a guardrail enforces: which user populations can access which workflows, which categories are allowed/restricted/prohibited, which tools require approval, which data classes can appear in which outputs, and which actions need escalation.

Policy engines matter because they separate the policy from the detection code. Security, legal, compliance, and trust and safety owners can change the policy without rewriting the runtime, and the runtime can show which policy version drove each decision. Alice's note on guardrails trained for your policies explains why generic "out of the box" safety policies miss enterprise edge cases.

Before-agent and after-agent guardrail patterns

For agents, the before/after pattern moves up one layer. A before-agent guardrail evaluates the agent's plan, tool selection, and parameter values before any external call. An after-agent guardrail evaluates the result, the state change, and any output sent back to the user.

The around-tool pattern is the third leg: each tool call is wrapped with its own scope, identity, and approval check. Treat agents as multi-step systems where every step is a potential enforcement point.

Combining multiple guardrails without breaking user experience

Stacking guardrails creates two failure modes if it is not designed carefully: cumulative latency and aggregate false positives. A request that passes five sequential 200-millisecond checks now carries a one-second guardrail tax, and a 5% false-positive rate at each layer can compound into noticeable user friction.

The fix is to run independent checks in parallel where possible, use fast deterministic checks as a pre-filter, reserve heavier model-based and LLM-as-a-judge checks for higher-risk paths, and budget latency per workflow rather than per check. Alice's piece on five competitive advantages from real-time GenAI guardrails walks through how this trade-off shows up in customer-facing deployments.

How to implement LLM guardrails

Implement LLM guardrails by writing the policy first, mapping it to the application architecture, choosing enforcement points, deploying with explicit actions, and turning every decision into observable evidence. The order matters because guardrails fail when they are bolted onto a vague policy.

Define application policies and risk boundaries

Start with the policy, not the detector. Define what the LLM app is allowed to do, what it must refuse, what it must escalate, and what is out of scope entirely. Cover data classes, user groups, content categories, tool actions, regulated claims, and jurisdictional constraints.

Skip generic phrases like "be helpful and safe." Write the policy in terms the runtime can enforce: redact PII fields X/Y/Z, block prompts containing secret formats A/B/C, refuse regulated advice in domain D, require approval for tool E above amount F.

Place guardrails before the model, after the model, and around tools

Map each policy to the enforcement points that can hold it. A privacy policy usually needs input redaction, retrieval entitlement, output inspection, and log scrubbing. A regulated-claims policy usually needs output classification, citation requirements, and an escalation route. A tool policy needs scope, identity, approval, and rollback.

LLM guardrail enforcement points and typical controls
Enforcement pointWhat it catchesTypical control
Pre-model inputPrompt injection, secrets, prohibited requests, unsafe filesInput classifier + deterministic redaction + policy lookup
RetrievalUnauthorized sources, poisoned chunks, hidden instructionsEntitlement filter + source trust tier + chunk inspection
Post-model outputUnsafe content, leakage, hallucinations, policy violationsOutput classifier + schema/citation checks + LLM-as-a-judge for high-risk
Tool and actionScope violations, role confusion, irreversibility riskScoped credentials + approval gate + idempotency + rollback
Post-deploymentDrift, regressions, new attack patternsContinuous testing + production monitoring + re-tuning

Enforce least privilege and role isolation for tools and agents

Least privilege is the highest-impact tool guardrail. Each tool should hold its own scoped credentials, narrowest IAM scope, and minimum data access. Agents should hold the union of only the scopes the current task requires, not the union of every scope they might ever need.

Role isolation is the related discipline. A user role, an agent role, and a system role should not collapse into one identity at runtime. When they do, role confusion lets a model assume the privileges of whatever identity gives it the easiest path to the answer.

Protect RAG systems from untrusted or manipulated context

Treat the RAG index as a partially trusted system that contains hostile content by default. Enforce per-user retrieval, source allowlisting, content-type restrictions, markup sanitization, and instruction stripping on every retrieved chunk before it enters the model context.

For high-risk applications, add a retrieval-side guardrail that evaluates each chunk for injection patterns and a context-construction step that quarantines retrieved content from system instructions. Alice's research collection in the guide to guardrails goes deeper on RAG-specific patterns.

Log guardrail decisions, user intent, model output, and tool calls

Every guardrail decision should write an evidence record: the policy version, the user identity, the request, the retrieved context, the model output, the tool calls, the guardrail verdict, and the latency. Without that, LLM monitoring is guesswork and post-incident review is impossible.

Be deliberate about log hygiene. Logs themselves are a leakage surface: PII, secrets, and regulated data can land in monitoring tooling if the redaction step does not also apply to the log path.

Route high-risk interactions to review, escalation, or refusal

Not every uncertain decision should be answered by the model. Define which categories must escalate to a human reviewer, which must refuse outright, and which must surface a controlled fallback (a refusal message, a redirect to a human channel, or a constrained response template).

Routing rules belong in the policy engine. The runtime should be able to explain which rule sent a request to review and which reviewer queue picked it up.

How to evaluate LLM guardrail performance

Evaluate LLM guardrails like production controls, not static filters. The evaluation has to cover detection accuracy, latency, false-positive and false-negative behavior, robustness to adversarial pressure, fit to the application domain, and drift over time.

How to evaluate LLM guardrail performance
MetricWhat it measuresWhy it matters
True positive rateShare of unsafe inputs/outputs correctly caughtCore protection against the policy violation
False positive rateShare of safe inputs/outputs incorrectly blockedDrives user friction, workarounds, shadow tool use
False negative rateShare of unsafe inputs/outputs missedDirect security exposure
Latency addedMilliseconds per check, end to endDetermines whether the guardrail survives production
RobustnessPerformance under adversarial paraphrase and language shiftResistance to active attackers and jailbreaks
Domain fitPerformance on the application's policy edge casesGeneric safety models miss enterprise categories
DriftChange in any of the above over timeSignals model, prompt, retrieval, or attacker shifts

Test with benign prompts, jailbreaks, prompt injection, and policy edge cases

Build the evaluation set from four buckets: benign prompts that look risky but are safe, known jailbreak templates, direct and indirect prompt-injection payloads (including RAG-borne ones), and the application's own policy edge cases. A guardrail that scores 99% on a public safety benchmark and 60% on the application's policy is not production-ready.

Adversarial coverage is what separates real testing from theater. The evaluation set should include paraphrases, encoded prompts, language switches, multi-turn attacks, and tool-misuse scenarios. Alice's red teaming tactics webinar covers how teams build that coverage before launch.

Measure false positives and false negatives

Report both together. A guardrail that drives false negatives to zero by blocking 30% of safe traffic is not a security win; it is a churn engine. Tune each guardrail to the application's risk tolerance and re-baseline whenever the policy, model, prompt, or retrieval set changes.

Track false positives by category and by user segment. Concentrated false positives ("the guardrail blocks legitimate medical questions from healthcare staff") usually point at a missing entitlement layer, not at the classifier itself.

Track latency and user experience impact

Latency belongs to an owner, not to a sidebar metric. Define a per-workflow latency budget, attribute it across input guardrails, retrieval guardrails, model inference, output guardrails, and tool guardrails, and watch what happens at the 95th and 99th percentile, not the median.

When latency creeps, two failure modes follow: product teams bypass the guardrail to recover the UX, or users abandon the workflow and move to unmanaged tools. The latency tradeoff section in Alice's low-latency AI guardrails covers the practical choices.

Evaluate safety, accuracy, robustness, and domain fit

Run the evaluation across four axes:

  • Safety: does the guardrail catch the unsafe categories defined in the policy?
  • Accuracy: does it leave safe traffic alone within an acceptable false-positive rate?
  • Robustness: does it hold up against adversarial paraphrase, language shift, and multi-turn attacks?
  • Domain fit: does it perform on the application's specific edge cases, not just generic benchmarks?

Treat the four axes as a single report card. A guardrail can score high on three and still fail the use case if it loses domain fit.

Monitor production drift and regressions after changes

Re-run the evaluation after every meaningful change: model upgrades, prompt edits, retrieval source additions, new tools, new user groups, or policy revisions. Add continuous production monitoring on the same metrics so drift shows up as a trend, not as an incident.

Drift detection is also where post-launch testing earns its keep. Alice's blog on detecting AI degradation in production explains how ongoing red teaming and drift detection keep guardrails that worked at launch tested again as the surrounding system changes.

LLM guardrails for common architectures

LLM guardrails change shape with the architecture. A chatbot, a RAG assistant, a coding copilot, an agent, and a regulated workflow share the same primitives but weight them differently.

Chatbots and customer support assistants

Customer-facing chatbots need strong input guardrails for prompt injection, jailbreaks, and PII, plus output guardrails for unsafe advice, false policy claims, and account-data leakage. Escalation rules belong in the guardrail layer so abusive or high-risk conversations route to a human channel instead of producing a confident wrong answer.

The risk concentration is usually output. A chatbot that invents a refund policy or quotes the wrong terms creates the same business exposure as a misconfigured backend.

RAG applications connected to private knowledge bases

RAG assistants need entitlement-aware retrieval as the first guardrail, source trust tiers as the second, and chunk-level injection inspection as the third. The output guardrail should require citations for factual claims and quarantine answers that cannot be grounded in a retrieved source.

The worst-case scenario is the one to test: an attacker plants a poisoned document in the knowledge base, a legitimate user runs a normal query, and the model executes the attacker's instructions inside the user's session.

AI coding assistants and developer tools

AI coding assistants need guardrails for insecure code suggestions, secret exposure, dependency risk, license risk, and execution scope. Repository access should be scoped per developer, and any "agent" mode that can write files, run commands, or open pull requests needs the same tool guardrails as any other agent.

Output guardrails should treat secrets and credentials as a hard block, not a redaction. A suggested code block that contains a leaked API key is a security incident even if the developer never copies it.

AI agents connected to tools, browsers, and APIs

Agentic systems need the strongest tool, scope, and approval guardrails. Each step in a plan is an enforcement point: validate the planned tool call, validate the parameter values, validate the user identity, and require explicit approval for irreversible actions.

Indirect prompt injection is the primary attack path. An agent that reads a web page, an email, or a ticket as part of a task is reading attacker-controlled instructions; the runtime has to treat that input as untrusted no matter how the agent was prompted. Alice's coverage of the OWASP Agentic Top Ten maps these patterns to specific design fixes.

Regulated workflows that require audit trails and human review

Regulated workflows in finance, healthcare, insurance, and child-facing products need the full stack: input/output/data/tool/human-in-the-loop guardrails, plus governance evidence for every decision. The policy engine becomes the single source of truth that legal, compliance, and security teams can review without reading runtime code.

Guardrails are part of launch readiness here, not a post-launch addition. Alice's AI product launch checklist covers the artifacts auditors and reviewers expect to see before the system is approved. Alice's financial services guardrails blog and Black Forest Labs case study show how regulated and model teams document those controls.

LLM guardrails checklist

Use this checklist before approving and after deploying an LLM application. The goal is to make policy, ownership, testing, runtime enforcement, and evidence explicit instead of implicit.

Controls to validate before launch

  • Input guardrails for prompt injection, jailbreaks, secrets, and PII.
  • Retrieval guardrails for user-scoped sources and chunk-level inspection.
  • Output guardrails for unsafe content, leakage, hallucinations, and regulated claims.
  • Tool guardrails with scoped credentials, approval gates, and rollback paths.
  • Human-in-the-loop routing for high-risk and irreversible decisions.
  • Logging of every guardrail decision with policy version and user identity.
  • Adversarial test results across benign, jailbreak, injection, and domain edge cases.

Metrics to monitor in production

  • True-positive, false-positive, and false-negative rates per guardrail and per category.
  • Per-workflow latency at the 95th and 99th percentile.
  • Blocked categories, escalations, and approval outcomes by user segment.
  • Tool-call volume, scope violations, and rollback events.
  • Drift signals after model, prompt, retrieval, or policy changes.
  • User complaints, appeals, and bypass attempts.

Evidence to retain for governance and incident response

  • Policy versions, owners, and change history.
  • Pre-launch red team findings and remediation notes.
  • Runtime guardrail decision logs with retention aligned to regulation.
  • Escalation records and human-review outcomes.
  • Re-test results after model, prompt, tool, or policy changes.
  • Incident records with root cause and follow-up control changes.

Common LLM guardrail mistakes

Most LLM guardrail failures repeat across deployments. The same patterns show up whenever guardrails are treated as a one-time launch checklist instead of a lifecycle control. The failure surface is predictable, not exotic.

Treating guardrails as one-time prompt filters

A regex against "ignore previous instructions" is a starting point, not a program. Attackers paraphrase, encode, translate, and chain. A guardrail strategy that lives only in the input pre-filter will miss the indirect injection that lands through RAG and the unsafe output that surfaces three turns later.

Ignoring retrieved context, memory, and tools

Guardrails focused on the chat window miss most of the architecture. RAG chunks carry instructions. Memory retains sensitive data across sessions. Tools take actions on the world. Every one of those is an enforcement point, and skipping any of them creates a known gap.

Optimizing only for blocking malicious prompts

Block-only thinking creates two related problems. It ignores the unsafe-output path entirely, and it optimizes false negatives at the cost of false positives. Users notice. Product teams notice. Workarounds appear, and the guardrail loses authority faster than the policy team can react.

Skipping latency, false-positive, and false-negative measurement

A guardrail without measurement is a belief, not a control. If the team cannot quote the current false-positive rate, the false-negative rate, the median latency, and the 99th-percentile latency per workflow, the guardrail is not yet operational.

Failing to retest after model, prompt, policy, or data changes

Guardrails decay. Foundation model upgrades change refusal behavior. Prompt edits change context boundaries. New retrieval sources change the attack surface. New tools change the action surface. A guardrail that was tuned six months ago and never re-tested is no longer measuring what it thinks it is measuring.

How Alice supports LLM guardrails

By this point the article has named the gaps: prompt injection that bypasses a system prompt, RAG context that smuggles instructions, output paths that leak data the input filter never saw, tool calls that take action before anyone reviews them, and guardrails that decay the moment the model or policy changes. The hard part is no longer recognizing those gaps; it is running an operating layer that closes them and keeps them closed.

When the policy is mapped and the enforcement points are clear, the open question becomes operational: who runs the adversarial tests before launch, who enforces policy inside live interactions, who notices when drift breaks the controls, and who feeds new attack patterns back into the system. WonderSuite supplies that operating layer (see Alice's WonderSuite lifecycle security and safety overview for the full architecture). It does not replace model-provider safety, product security review, or human trust and safety; it adds application-specific, policy-aware LLM security around the systems that talk to users, read data, and call tools.

WonderBuild closes the pre-launch evidence gap

The earlier sections named the pre-launch problem: teams ship LLM apps without proof that the guardrails hold up against adversarial behavior. WonderBuild tests customer-facing LLM apps, agents, and workflows for prompt injection, jailbreaks, data leakage, PII leakage, unsafe outputs, RAG poisoning, and policy gaps before users or attackers do, and returns the category-level pass/fail evidence pre-launch reviews actually use.

WonderFence closes the runtime enforcement gap

The output guardrails section explained why every response and tool call has to pass through policy before the user sees it. WonderFence is the runtime layer for that work: it trains dedicated policy models on adversarial data, enforces them at sub-99ms latency, and covers text, image, audio, and video inputs and outputs. Alice's writeup of WonderFence for runtime AI oversight covers the deployment model.

WonderCheck closes the post-launch drift gap

The evaluation section made the point that guardrails decay when the surrounding system changes. WonderCheck runs ongoing production red teaming and drift detection, so regressions after model upgrades, prompt edits, or new retrieval sources show up as signals instead of incidents three model upgrades later.

Rabbit Hole closes the test-data gap

A guardrail that only sees clean-room examples is a guardrail with no view of the actual attack surface. Rabbit Hole is Alice's adversarial intelligence engine, built from years of global trust and safety research and harmful interaction data, and it supplies the attack patterns, language coverage, and abuse context that WonderBuild, WonderFence, and WonderCheck use to test guardrails against real adversaries.

The point is not that LLM guardrails belong to a vendor. They belong to the team operating the system. Alice fits when that team needs a shared layer for pre-launch testing, runtime enforcement, drift detection, and adversarial intelligence instead of building each one from scratch.

FAQ

What are LLM guardrails?

LLM guardrails are runtime controls that inspect prompts, retrieved context, model outputs, and tool calls, then allow, block, redact, route, or log each decision against a defined policy.

What are examples of LLM guardrails?

Common examples include prompt-injection and jailbreak detection on inputs, PII and secret redaction, RAG source filtering, schema/JSON validation, output toxicity and policy classifiers, and approval gates on tool calls.

What is the difference between input and output guardrails?

Input guardrails screen prompts and context before the model runs, catching injection, secrets, and policy violations. Output guardrails screen the response before it reaches the user, catching leakage, unsafe content, and hallucinations.

Do LLM guardrails prevent prompt injection?

They reduce prompt injection risk but do not eliminate it. Effective programs layer input classifiers, RAG content sanitization, output inspection, tool guardrails, and adversarial testing.

What is the difference between LLM guardrails and AI guardrails?

LLM guardrails cover the model interaction surface: prompts, context, outputs, and tool inputs. AI guardrails are broader and add data, infrastructure, application logic, agent permissions, and governance evidence.

Share

What’s New from Alice

Policy Once, Enforced Everywhere: Alice WonderFence Joins Databricks Unity AI Gateway

blog
Jun 16, 2026
,
 
Jun 16, 2026
 -
4
 min watch
June 16, 2026

How Alice WonderFence integrates with Databricks Unity AI Gateway, and how to enforce your own AI guardrails across every model, tool, and agent in production.

Learn More

It Takes AI to Break AI: The Case for AI Red Teaming

webinar
May 25, 2026
,
 
May 25, 2026
 -
This is some text inside of a div block.
 min watch
May 25, 2026

As AI systems gain autonomy, organizations need security approaches built specifically for AI behavior. Learn why AI-driven red teaming is becoming a critical defense layer.

Learn More
Inside Alice