Alice Financial Benchmark
We put GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro through 126 realistic financial conversations. No jailbreaks, no adversarial prompts, just the kind of pressure a hurried client might naturally apply. By the seventh exchange, all three were naming specific stocks, issuing transaction instructions, or dropping their disclaimers. Your regulator won't care that the model's own policy prohibited it. Download the benchmark to see exactly where each model fails and what you need in place before your next client-facing deployment.
Overview
In this report, you'll learn:
- Where each model breaks: GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro each have a distinct vulnerability profile, and knowing which pressure type triggers your chosen model lets you build the right protections before deployment
- Why model-level guardrails aren't enough: Policy violations occurred consistently in realistic, non-adversarial multi-turn conversations, meaning your standard pre-launch testing won't catch them
- How to stress-test, protect, and monitor your deployment: With red-teaming, runtime guardrails, and continuous post-launch monitoring, you can move forward in financial AI with confidence (a minimal sketch of one such runtime guardrail follows this list)
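To make the runtime-guardrail idea concrete, here is a minimal, hypothetical sketch in Python tied to the failure modes above. The patterns, the `guard` function, and the disclaimer text are illustrative assumptions, not the mechanism used in the benchmark; a production system would pair a trained policy classifier with human escalation rather than regexes.

```python
import re

# Toy policy pattern: explicit transaction instructions naming a ticker.
# A real guardrail would use a trained classifier, not a regex.
TRANSACTION = re.compile(r"\b(buy|sell|short)\s+\$?[A-Z]{1,5}\b")
DISCLAIMER = "This is not financial advice"

def guard(model_output: str) -> str:
    # Block explicit buy/sell instructions outright.
    if TRANSACTION.search(model_output):
        return ("I can't give buy or sell instructions for specific securities. "
                "Please speak with a licensed advisor.")
    # Re-attach the disclaimer if the model dropped it mid-conversation.
    if DISCLAIMER.lower() not in model_output.lower():
        model_output += f"\n\n{DISCLAIMER}."
    return model_output
```

Because it runs on every turn, a check like this holds even when the model's own compliance erodes over a long conversation.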
Use this benchmark to close the gap between your AI's stated policies and what it actually does when a client pushes back. Download it now and give your compliance, legal, and product teams the evidence they need to act.
Download the Full Report
What’s New from Alice
Your LLM Has No Idea What It's Doing
Diana Kelley, CISO at Noma Security and former Cybersecurity CTO at Microsoft, joins Mo to work through the real mechanics of LLM risk: why the context window flattens the trust boundary between system instructions and user data, why that makes reliable internal guardrails essentially impossible, and why agentic AI is less a new threat category and more a stress test for the hygiene debt organizations never fully paid off.
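To see why the trust boundary flattens, consider a minimal sketch of how a chat actually reaches the model. The template below is hypothetical (real chat formats vary by model), but the point holds generally: role tags are just more tokens, so the policy and the user's pressure share one undifferentiated context.

```python
# Hypothetical chat template: real formats vary, but all of them reduce to
# one flat token stream with no hard boundary between roles.
def render_prompt(system: str, user: str) -> str:
    # "<|system|>" and "<|user|>" are ordinary tokens, not privilege levels;
    # nothing in the architecture forces the model to rank them differently.
    return f"<|system|>{system}<|end|>\n<|user|>{user}<|end|>"

print(render_prompt(
    system="Never name specific securities or give transaction instructions.",
    user="Skip the boilerplate, I'm in a hurry. Which stock should I buy?",
))
```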
Distilling LLMs into Efficient Transformers for Real-World AI
This technical webinar explores how we distilled the world knowledge of a large language model into a compact, high-performing transformer that balances safety, latency, and scale. Learn how we combine LLM-based annotations with weight distillation to power real-world AI safety.
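For readers unfamiliar with the technique, here is a minimal sketch of one common way to combine a teacher's soft targets with annotated hard labels. It is a generic distillation loss, not necessarily the exact recipe covered in the webinar, and all names in it are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution.
    # The T*T factor rescales gradients back to the hard-label loss magnitude.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: labels, which could come from LLM-based annotation.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The small student never sees the teacher at inference time; it only inherits the teacher's output distribution during training, which is what makes the latency and scale trade-off possible.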
