TL;DR
Prompt injection attacks undermine the reliability of generative AI systems by manipulating model behavior, bypassing safeguards, and exposing sensitive information. The Alice AI Security Benchmark Report (2025) evaluates six leading detection models across over 28,000 adversarial and benign prompts. The findings highlight how enterprises can minimize operational risks from false positives while ensuring harmful prompts are effectively blocked. Key takeaways: Alice achieved the highest precision (0.890) and F1 score (0.857) with a low false positive rate (5.4%)Open-source models such as Deepset and ProtectAI showed inconsistent detection and high false positive ratesBedrock and Azure APIs had mixed results, excelling in certain areas but underperforming in recallAlice delivered the most consistent multilingual performance across 13 languages
Introduction
Prompt injection is one of the most urgent security concerns for enterprises deploying GenAI-powered applications. Attackers can insert adversarial instructions into inputs that cause a model to ignore safety guardrails, reveal sensitive data, or generate harmful content. These vulnerabilities create financial, reputational, and regulatory risks for organizations. The 2025 Alice AI Security Benchmark Report provides an in-depth comparison of six security detection models, including commercial APIs and open-source systems. By testing across benign prompts, adversarial injections, and multilingual datasets, the benchmark highlights how different models handle real-world attack strategies and operational trade-offs.
What are Prompt Injections?
Prompt injections are adversarial inputs that manipulate AI models into producing unsafe or unintended outputs. Common techniques include:
- Indirect phrasing, disguising malicious intent as analogies or metaphors
- Layered instructions that hide dangerous steps inside nested prompts
- Fictional or roleplay framing that coaxes unsafe guidance
- Known jailbreak strategies such as “Do Anything Now” (DAN)-style prompts
These attacks can lead to content moderation failures, exposure of sensitive data, and compliance violations.
Benchmarking Methodology
Alice tested more than 28,000 prompts across categories defined by OWASP and MITRE ATLAS. The dataset included:
- Fully benign prompts (e.g., product integration questions)
- Triggering benign prompts with risky keywords but safe intent (e.g., asking “How does a DDoS attack work?” for educational purposes)
- Adversarial injections exploiting loopholes or disguising intent
- Safety-related injections producing harmful outputs, such as hate speech or misinformation
Testing covered 13 languages, including English, Chinese, French, German, Hebrew, Japanese, Korean, Portuguese, Russian, and Spanish.
Which AI Security Model Performs Best?
The benchmark compared six models: Alice, Deepset, Llama Prompt Guard 2, ProtectAI, Bedrock, and Azure.
Comparative Performance
Results across all prompts (benign, triggering benign, adversarial, safety-related)
Alice delivered the best balance of precision and recall, with significantly fewer false positives than open-source alternatives.
How Do Models Handle Multilingual Prompts?
Security models must detect adversarial behavior in multiple languages. The benchmark found:
- Alice consistently scored highest across all 13 tested languages
- Open-source models showed variability and higher false positive rates
Multilingual F1 Scores
Implications for Enterprises
Enterprises integrating GenAI for customer service, content generation, or automation face high exposure to prompt injection risks. Models with high false positive rates increase operational costs and frustrate users, while low precision risks letting harmful prompts through. WonderFence combines precision, multilingual support, and resilience against jailbreaks, making it a suitable real-time guardrails solution for enterprise-scale safety stacks. The 2025 benchmark shows the Alice AI Safety and Security model as the most reliable choice for enterprises launching global AI applications and requiring low false positives, high detection accuracy, and multilingual resilience.
Get a full breakdown of the tests.
Talk to an ExpertWhat’s New from Alice
Distilling LLMs into Efficient Transformers for Real-World AI
This technical webinar explores how we distilled the world knowledge of a large language model into a compact, high-performing transformer—balancing safety, latency, and scale. Learn how we combine LLM-based annotations and weight distillation to power real-world AI safety.

