Validate Model Safety and Benchmark Against Competitors for Responsible Deployment
Amazon Nova partnered with Alice to manually red team Nova Premier, their most advanced generative AI foundation model, testing safety, fairness, bias, and privacy across eight responsible AI categories ahead of enterprise deployment.


"Through this hands-on evaluation, Alice strengthened Nova’s security posture and supported Amazon’s broader Responsible AI goals, ensuring the model could be deployed with greater confidence."
To help validate its most advanced model to date, Amazon partnered with Alice to red-team Nova Premier against high-risk prompts. In comparative benchmarking, Nova Premier rated as safer than the competitor models tested across the evaluated Responsible AI categories, marking a major step toward secure enterprise deployment.
Challenge
Amazon aimed to rigorously validate the safety of Nova Premier, its most capable foundation model to date, ahead of public release. As foundation models grow more powerful, the attack surface expands: adversarial inputs, prompt injection attempts, fairness failures, and privacy exposures become harder to anticipate through automated testing alone.
Amazon sought a third-party red teaming partner with deep domain expertise to stress-test Nova Premier against real-world adversarial threats across its eight Responsible AI categories, including safety, fairness and bias, and privacy and security, before the model reached enterprise customers. External validation was essential to ensure the evaluation was rigorous, unbiased, and credible.
How Alice Helped
Alice partnered with Amazon as an independent third-party red teamer to conduct manual, blind evaluations of Nova Premier on Amazon Bedrock, ensuring the assessment was uninfluenced by internal assumptions or model familiarity.
Alice's subject matter experts crafted adversarial prompts targeting Nova Premier's most critical risk surfaces, spanning all eight of Amazon's Responsible AI categories: safety, fairness and bias, privacy and security, and more. The manual approach was deliberate: expert-led testing surfaces edge cases, nuanced policy failures, and culturally specific risks that automated pipelines routinely miss.
Alice also conducted comparative LLM benchmarking, evaluating Nova Premier's safety posture against other frontier models to give Amazon a clear picture of where the model stood relative to the competitive landscape ahead of deployment.
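As a rough illustration of how a comparative safety benchmark of this kind might be tallied, the sketch below computes the share of responses judged unsafe per model and per category. It is not Alice's or Amazon's actual methodology: the record layout, the `unsafe_rates` helper, the model names, and the sample verdicts are all hypothetical placeholders, not findings from the engagement.

```python
from collections import defaultdict


def unsafe_rates(records):
    """Return {model: {category: fraction of responses judged unsafe}}.

    `records` is an iterable of (model, category, judged_unsafe) tuples,
    where judged_unsafe is a reviewer's boolean verdict on one response.
    This layout is an assumption made for illustration only.
    """
    # counts[model][category] holds [unsafe_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, category, judged_unsafe in records:
        cell = counts[model][category]
        cell[0] += int(judged_unsafe)
        cell[1] += 1
    return {
        model: {cat: unsafe / total for cat, (unsafe, total) in cats.items()}
        for model, cats in counts.items()
    }


# Hypothetical placeholder verdicts, not real evaluation results.
records = [
    ("model_a", "safety", False),
    ("model_a", "safety", True),
    ("model_a", "privacy", False),
    ("model_b", "safety", True),
    ("model_b", "safety", True),
    ("model_b", "privacy", False),
]

rates = unsafe_rates(records)
for model, by_category in sorted(rates.items()):
    for category, rate in sorted(by_category.items()):
        print(f"{model:8s} {category:8s} unsafe rate: {rate:.2f}")
```

A lower unsafe rate in a category would indicate a comparatively safer model on that prompt set. In practice the verdicts would come from expert reviewers scoring responses to the same adversarial prompts for every model.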
The Results
The evaluation provided Amazon with a comprehensive, third-party validated picture of Nova Premier's safety posture ahead of launch.
Key outcomes included:
- Nova Premier was benchmarked as safer than its competitor models across the tested Responsible AI categories, giving Amazon confidence in its relative safety positioning at launch
- Expert-led manual testing surfaced edge cases and adversarial vulnerabilities that automated evaluation alone would not have detected
- Findings directly informed Amazon's pre-launch safety decisions, supporting responsible deployment across Amazon Bedrock
- The collaboration supported Amazon's broader Responsible AI goals with independent, audit-ready evidence of safety validation
The engagement demonstrated the value of combining expert-led manual red teaming with automated testing, a comprehensive approach that has become essential for any foundation model team preparing for enterprise deployment. For teams facing similar pre-launch validation challenges, explore how Alice approaches foundation model security.
