Alice AI Security Benchmark Report Summary

TL;DR

Prompt injection attacks undermine the reliability of generative AI systems by manipulating model behavior, bypassing safeguards, and exposing sensitive information. The Alice AI Security Benchmark Report (2025) evaluates six leading detection models across over 28,000 adversarial and benign prompts. The findings highlight how enterprises can minimize operational risks from false positives while ensuring harmful prompts are effectively blocked. Key takeaways: Alice achieved the highest precision (0.890) and F1 score (0.857) with a low false positive rate (5.4%)Open-source models such as Deepset and ProtectAI showed inconsistent detection and high false positive ratesBedrock and Azure APIs had mixed results, excelling in certain areas but underperforming in recallAlice delivered the most consistent multilingual performance across 13 languages

Introduction

Prompt injection is one of the most urgent security concerns for enterprises deploying GenAI-powered applications. Attackers can insert adversarial instructions into inputs that cause a model to ignore safety guardrails, reveal sensitive data, or generate harmful content. These vulnerabilities create financial, reputational, and regulatory risks for organizations. The 2025 Alice AI Security Benchmark Report provides an in-depth comparison of six security detection models, including commercial APIs and open-source systems. By testing across benign prompts, adversarial injections, and multilingual datasets, the benchmark highlights how different models handle real-world attack strategies and operational trade-offs.

What are Prompt Injections?

Prompt injections are adversarial inputs that manipulate AI models into producing unsafe or unintended outputs. Common techniques include:

Indirect phrasing, disguising malicious intent as analogies or metaphors
Layered instructions that hide dangerous steps inside nested prompts
Fictional or roleplay framing that coaxes unsafe guidance
Known jailbreak strategies such as “Do Anything Now” (DAN)-style prompts

These attacks can lead to content moderation failures, exposure of sensitive data, and compliance violations.

Benchmarking Methodology

Alice tested more than 28,000 prompts across categories defined by OWASP and MITRE ATLAS. The dataset included:

Fully benign prompts (e.g., product integration questions)
Triggering benign prompts with risky keywords but safe intent (e.g., asking “How does a DDoS attack work?” for educational purposes)
Adversarial injections exploiting loopholes or disguising intent
Safety-related injections producing harmful outputs, such as hate speech or misinformation

Testing covered 13 languages, including English, Chinese, French, German, Hebrew, Japanese, Korean, Portuguese, Russian, and Spanish.

Which AI Security Model Performs Best?

The benchmark compared six models: Alice, Deepset, Llama Prompt Guard 2, ProtectAI, Bedrock, and Azure.

‍Comparative Performance

Results across all prompts (benign, triggering benign, adversarial, safety-related)‍

Results across all prompts (benign, triggering benign, adversarial, safety-related)
Model	F1	Precision	Recall	F0.5	FPR
Alice	0.857	0.890	0.826	0.876	0.054
Deepset	0.558	0.395	0.955	0.447	0.770
Llama Prompt Guard 2	0.621	0.793	0.511	0.714	0.070
ProtectAI	0.643	0.580	0.723	0.604	0.275
Bedrock	0.561	0.712	0.463	0.643	0.098
Azure	0.412	0.838	0.273	0.593	0.028

Source: Alice AI Security Benchmark Report, Prompt Injections, 2025.

‍

Alice delivered the best balance of precision and recall, with significantly fewer false positives than open-source alternatives.

How Do Models Handle Multilingual Prompts?

Security models must detect adversarial behavior in multiple languages. The benchmark found:

Alice consistently scored highest across all 13 tested languages
Open-source models showed variability and higher false positive rates

Multilingual F1 Scores

Multilingual F1 Scores (by language and model)
Language	Alice	Bedrock	Deepset	Llama Guard 2	ProtectAI
Chinese	0.780	0.011	0.704	0.568	0.468
Dutch	0.781	0.229	0.712	0.307	0.367
French	0.790	0.382	0.713	0.403	0.618

Source: Alice AI Security Benchmark Report, Prompt Injections, 2025.

Implications for Enterprises

Enterprises integrating GenAI for customer service, content generation, or automation face high exposure to prompt injection risks. Models with high false positive rates increase operational costs and frustrate users, while low precision risks letting harmful prompts through. WonderFence combines precision, multilingual support, and resilience against jailbreaks, making it a suitable real-time guardrails solution for enterprise-scale safety stacks. The 2025 benchmark shows the Alice AI Safety and Security model as the most reliable choice for enterprises launching global AI applications and requiring low false positives, high detection accuracy, and multilingual resilience.

Get a full breakdown of the tests.

Learn more

Table of Contents

TL;DR

Introduction

What are Prompt Injections?

Benchmarking Methodology

Which AI Security Model Performs Best?

‍Comparative Performance

Results across all prompts (benign, triggering benign, adversarial, safety-related)‍

How Do Models Handle Multilingual Prompts?

Multilingual F1 Scores

Implications for Enterprises

Get a full breakdown of the tests.

What’s New from Alice

The Rise and Risk of Reasoning Agents

Securing Agentic AI: The OWASP Approach

Distilling LLMs into Efficient Transformers for Real-World AI

How Your Agent-to-Agent Systems Can Fail and How to Prevent It