ActiveFence is now Alice

Introducing Guardrails Trained for Your Policies

Dean Issacharoff
-
May 13, 2026

TL;DR

WonderFence can now train a personalized guardrail for each of your policies, using real-world adversarial data collected over years of protecting the world's biggest tech platforms. These guardrails are deployed in WonderFence at sub-100ms latency, giving you guardrails shaped by your policy and not the other way around.

Generic guardrails can prevent some unwanted agent behavior in client interactions, but data leaks, compliance violations, and off-brand responses still slip through. That happens because out-of-the-box guardrails fall short on the two factors that determine effectiveness: the quality of their training data, and how closely their policies match your needs.

WonderFence fills those gaps by training a personalized guardrail for each policy, using real-world adversarial data Alice has collected across industries, languages, and threat actor communities. This way, guardrails are shaped by your policies, instead of policies being shaped by the available guardrails.

WonderFence Guardrails: Customize Your Policies

Guardrails Trained on Real Threats for Real Policies

WonderFence allows you to train policy-specific guardrails within hours. Just configure your policy definition and upload a few labeled examples of wanted and unwanted behavior. We take it from there, training the detector models inside your guardrails on real-world adversarial examples from Rabbit Hole, our collection of adversarial data curated from years of protecting the biggest tech platforms in the world. The goal is accurate policy detection that catches real violations without overreaching into false positives.

For example, a retailer might define a policy that agents never knock competitors. Upload a few examples of on-brand and off-brand responses, and WonderFence produces a detector tuned to that exact line.

WonderFence Guardrails: Block Keywords

Continuous Protection for Reliable Behavior

The detectors trained by WonderFence are then deployed alongside your predefined WonderFence guardrails. These evaluate every step agents take. In block mode, malicious prompts are intercepted, internal actions are monitored, and inappropriate outputs are replaced with predefined responses in line with your policies. In detect-only mode, the same policies flag interactions for internal review without interrupting users, useful for monitoring, testing, and gradual rollout.
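To make the difference between the two modes concrete, here is a minimal sketch in Python. It is not the WonderFence API: `Mode`, `Verdict`, `apply_guardrail`, and `toy_detector` are hypothetical names made up for illustration, and the keyword check only stands in for a trained detector.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    BLOCK = "block"
    DETECT_ONLY = "detect_only"


@dataclass
class Verdict:
    text: str        # what the user actually receives
    violation: bool  # whether the detector fired on the original text
    replaced: bool   # whether the original text was swapped for the fallback


def apply_guardrail(
    text: str,
    detector: Callable[[str], bool],
    mode: Mode,
    fallback: str,
) -> Verdict:
    """Hypothetical helper: run one detector over one message."""
    violation = detector(text)
    if violation and mode is Mode.BLOCK:
        # Block mode: the user sees the predefined response instead.
        return Verdict(text=fallback, violation=True, replaced=True)
    # Detect-only mode (or no violation): the text passes through unchanged,
    # and the violation flag can be routed to internal review.
    return Verdict(text=text, violation=violation, replaced=False)


def toy_detector(text: str) -> bool:
    # Assumption: a keyword check standing in for a trained detector.
    return "competitor" in text.lower()


FALLBACK = "I can only speak about our own products."
blocked = apply_guardrail("Our competitor is awful.", toy_detector, Mode.BLOCK, FALLBACK)
observed = apply_guardrail("Our competitor is awful.", toy_detector, Mode.DETECT_ONLY, FALLBACK)
```

In this sketch, `blocked.text` is the fallback response, while `observed.text` is the original message with `observed.violation` set, which mirrors a gradual rollout: watch what a policy would catch before letting it intervene.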

Policies can run on incoming prompts, outgoing responses, or both, and apply to one agent or many. All of this runs at sub-100ms latency, so the user experience stays intact and you can count on your agents in front of your customers.

For instance, a customer-facing banking agent demands strict controls on financial guidance and layered regulatory compliance. A scheduling assistant, by contrast, has a far narrower scope but still needs to handle personal information correctly. WonderFence lets you set different policies for each agent and each stage of the interaction.
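One way to picture per-agent, per-stage policies is as a simple lookup table. The sketch below is illustrative only and does not reflect how WonderFence stores its configuration: the agent names, policy names, stage assignments, and the `policies_for` helper are all assumptions.

```python
from enum import Flag, auto


class Stage(Flag):
    PROMPT = auto()              # incoming user prompts
    RESPONSE = auto()            # outgoing agent responses
    BOTH = PROMPT | RESPONSE


# Hypothetical mapping: agent -> policy name -> stages the policy runs on.
POLICIES = {
    "banking_agent": {
        "financial_guidance": Stage.BOTH,
        "regulatory_compliance": Stage.RESPONSE,
    },
    "scheduling_assistant": {
        "personal_information": Stage.BOTH,
    },
}


def policies_for(agent: str, stage: Stage) -> list[str]:
    """Return the policy names that apply to an agent at a given stage."""
    agent_policies = POLICIES.get(agent, {})
    # A policy applies when its configured stages overlap the requested stage.
    return sorted(name for name, stages in agent_policies.items() if stage & stages)
```

Under these assumed settings, `policies_for("banking_agent", Stage.PROMPT)` returns only the financial guidance policy, while the same agent's responses are checked against both policies.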

Ship With Confidence

WonderFence complements your existing guardrails with another layer of personalized protection: fit-to-policy guardrails trained on real-world adversarial data collected during years of protecting the world's biggest tech platforms. WonderFence joins WonderSuite alongside automated red-teaming, continuous preset guardrails, and more to give you the confidence to deploy consumer-facing AI.
