TL;DR
Most enterprise red teaming programs are built to find the vulnerabilities they already know to look for: one round of testing, on one day, against threats that have since moved on. This article captures the key ideas from our HumanX 2026 talk, explaining why pre-deployment AI testing isn't enough and what an effective red teaming program looks like instead.
Here's an uncomfortable truth for anyone deploying AI applications: Every company that made AI security headlines within the last year had a red teaming program. And yet, these incidents happened anyway.
A CEO resigned within 24 hours of an AI model producing antisemitic content (following a model upgrade). A recruitment AI chatbot was breached, exposing tens of millions of applicants' private records. A 14-year-old died after months of conversations with an AI chatbot, and for the first time, a state attorney general had to ask who is accountable for what an AI system says to a child.
What these examples share is something important: these aren't the traditional, purely technical cybersecurity failures your CISO is used to dealing with. They're behavioral, unexpected, human-like.
The attack surface has expanded well beyond the model itself, into deployment decisions, data handling, and the way real users actually interact with these systems over time. Security and safety have quietly merged into the same problem. And yet many organizations still treat them as separate conversations.
At HumanX 2026, Mo Sadek, our Technical Director at Alice, took the stage to talk about what AI red teaming actually looks like in an enterprise setting, and why what most organizations are doing today isn't enough.
Why Most AI Red Teaming Programs Aren't Actually Working
Here's a snapshot of what enterprise AI testing typically looks like today: the methods most organizations rely on before shipping an AI application or agent.

This is real work. Real investment. Real expertise. None of it is wrong, and all of it is necessary. But none of it is sufficient.
In February of this year, Cisco reported an average jailbreak success rate of 64% across models. In their State of AI Security 2026 report, Cisco also found that peak success rates for open-source models reached 93% during multi-turn attacks.
And a joint paper from researchers at OpenAI, Anthropic, and Google DeepMind found adaptive attacks succeeding more than 90% of the time against published model defenses. Defenses that originally reported near-zero attack success rates. All bypassed.
So What's Going Wrong?
A typical enterprise red teaming program captures three things: how your model behaved on one day, against scenarios your internal team imagined, and with users who don't exist.
Six months into production, real users have found behaviors your testers never imagined. Your model has drifted. New attack techniques have circulated on forums and spread across communities in real time. Meanwhile, most programs are still running on last quarter's plan.
That's the core problem: internal red teams think like testers. They execute a defined scope against a known target. Attackers think like hunters. They study the system, understand the business context, find the path of least resistance, and adapt when their first attempts fail. Adversaries iterate every day. Enterprise testing cycles don't. That tempo gap is where incidents live.
It gets more complicated when you factor in how AI attacks actually work. Many of them aren't technical in the sense security teams are trained to recognize.
They're behavioral: persona manipulation, social engineering, authority escalation. The same human weaknesses that make phishing effective have been introduced into machines through natural language.
And today, with AI agents in the mix, executing multi-step processes and calling APIs across interconnected systems, the attack surface isn't just your model anymore. It's everything your model touches.
The answer isn't to abandon AI red teaming. It's to rebuild it around the nondeterministic ways AI actually behaves and the ways attackers actually operate.
"Safety Is Not an Event. It's a System."
Think about it this way. You're not looking for safety at a single point in time because the moment you treat it like a checkbox, you've already lost. It needs to be continuous, measured, and understood in the context of everything that's changing around it.
That means testing across three phases:
- Pre-launch is when you stress-test behaviors, bring in perspectives from across the org, and think about scenarios you're definitely not thinking about.
- Runtime is when you apply guardrails, monitor behavior as it happens, and feed what you're seeing back into your early-stage testing, closing the loop between production and preparation (there's a rough sketch of that loop after this list).
- Post-launch is where detection becomes enforcement and speed matters most. By the time drift becomes visible, the damage is already compounding.
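To make the runtime bullet concrete, here's a minimal sketch, assuming a Python deployment where you control the call into the model. Every name in it (check_policy, guarded_response, the flagged_interactions.jsonl corpus) is hypothetical and illustrative, not a reference to any particular product or API:

```python
import json
from datetime import datetime, timezone

TEST_CORPUS = "flagged_interactions.jsonl"  # replayed in the next pre-launch test cycle

def check_policy(text: str) -> bool:
    """Placeholder policy check. A real deployment would call a moderation
    model or a rules engine here; this just scans for example terms."""
    blocked_terms = ["example blocked phrase"]
    lowered = text.lower()
    return not any(term in lowered for term in blocked_terms)

def guarded_response(user_input: str, model_call) -> str:
    """Call the model, intercept off-policy output, and record the exchange
    so it can be replayed during the next round of adversarial testing."""
    raw = model_call(user_input)
    if check_policy(raw):
        return raw
    # Enforcement: don't let the response land, and keep the evidence.
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "blocked_output": raw,
    }
    with open(TEST_CORPUS, "a") as f:
        f.write(json.dumps(record) + "\n")
    return "Sorry, I can't help with that request."
```

The specific check doesn't matter; what matters is that every intercepted response becomes a test case for the next pre-launch cycle, which is what actually closes the loop.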
The problem is that most organizations treat pre-launch as the whole program, so the layers that actually catch what pre-launch missed never get built.
And there's no single playbook here. Everyone fails somewhere different, which means your program needs to reflect your actual organization, not a generic framework dropped from outside.
Five Things Worth Taking Back to Your Organization
Here are five directives that tend to get buried under the noise of just trying to keep up, but shouldn't.

- Separate evaluation from adversarial testing. Benchmarks and red teaming answer different questions. Stop treating them as the same exercise.
- Fund continuous testing, not periodic audits. If your testing is scheduled, adversaries are betting on that schedule.
- Assign ownership beyond security. AI risk is behavioral, operational, and technical. Security teams can't own this alone.
- Design for drift before it happens. By the time you notice drift, the damage is already compounding.
- Measure learning velocity, not just incident count. If the same things keep breaking over time, you have a foundational problem, not an incident.
Three Questions Worth Asking Honestly
These three questions have a way of surfacing in security reviews, whether you're ready for them or not.
- Who is continuously trying to break your system?
- How fast do you learn when it breaks? Not how fast you patch after a breach makes the news. How fast do you detect internally, understand the failure, and close the gap?
- Does your security team know your business logic well enough to attack it? The adversaries who want access to your system do.
If any of those feel uncomfortable to answer, that's exactly where your program needs to go next.
Want to Go Deeper? Here's Where to Start.
If you're sitting there thinking about where your program actually stands, we've got a few resources worth diving into.
First, our whitepaper on demystifying AI red teaming goes deeper on everything Mo covered in this talk: why traditional security testing leaves critical gaps, the four risk categories executives need to own, and what a mature, lifecycle-wide program actually looks like when you build it right. It's a good starting point if you're trying to make the case internally for doing this differently.
And if you're ready to see what that looks like as a product, here's how we think about it at Alice.
WonderBuild handles your pre-launch red teaming, stress-testing your AI before it ships so you know what's exploitable before your users find out.
WonderFence sits between your AI and your users at runtime, intercepting harmful or off-policy responses before they land.
And WonderCheck keeps things honest in production, running continuous adversarial testing and catching drift and regressions before they become incidents.
Together, they cover the three phases: pre-launch, runtime, and post-launch. Because the lab tests your model, but the real world tests your organization, and you need coverage across all of it.