The 5 Most Shocking LLM Weaknesses We Uncovered in 2025

Alice Staff

Dec 25, 2025

TL;DR

Over the past year, we’ve seen no shortage of AI failures. But these five stood out, so surprising they caught even our most experienced red team researchers off guard. Here’s the countdown.

Our AI red teaming researchers are always developing new techniques to test generative AI models and agents. In 2025, they uncovered a wide range of critical vulnerabilities that revealed deep AI safety and security gaps. From the team's body of findings, they selected five that stunned them the most, from fundamental architectural weaknesses to the most dangerous user-facing social-engineering threat.

Each vulnerability exposes a breakdown in the safety and security expectations we've come to rely on in modern AI systems. And when you look at them together, they make it clear that organizations deploying public-facing AI apps must consider AI safety and security solutions, before the cracks in the foundation turn into real operational or organizational risks.

#1 Stolen Reasoning

The most architecturally devastating findings were reasoning prompt injections that allowed our red team to change what the model said by taking over how the model decided what to say. In agentic systems, models often use an internal reasoning process to quietly think through a request in natural language and decide what to do before responding or taking action.

We found that by injecting false reasoning between the model's reasoning tags (or disabling its reasoning all together) we could make the model violate policy, such as creating phishing emails. Because the model believed the unsafe reasoning was its own, it didn't detect the manipulation and continued to rely on the corrupted reasoning in later steps, propagating the attack.

While strong separation between user input, internal reasoning, and tools is essential to prevent this kind of takeover, guardrails can still help by checking user inputs for attempts to interfere with internal systems, such as references to reasoning tags, hidden instructions, or tool commands, and blocking or cleaning them before the model processes them.

#2 The Invisible Execution

We also found a vulnerability we call Ghost Calling, where an AI executes an action in response to an instruction without logging that it did so or explaining why in its reasoning. In one case, our red team triggered the creation of an email using an external tool. The model never explained why it ran the tool, leaving the action hidden from reviewers. To prevent this, tools should only run when the action clearly comes from the model's own reasoning and not directly from user prompts that could carry injected instructions.

#3 The Summoner in Your Inbox

The next shocking vulnerability leverages what AI is designed to do (summarize and process data) to steal information. We showed how an email-summarizing agent could be tricked into leaking sensitive details such as credit card numbers using indirect prompt injections that hid malicious instructions inside emails or documents the agent is asked to process.

It's a clear reminder of how critical strong input and output guardrails are when AI systems work with private content.

#4 The Ghost in the Generator

On the generative side, we found that bad actors could slip hidden, malformed characters into otherwise normal prompts. These smuggled tokens take advantage of inconsistencies in the model's processing pipeline, leading to predictable hallucinations that can generate violent or otherwise prohibited imagery without the prompt or response being flagged as unsafe or violative by the model. Using this method, our team prompted the generation of unequivocally racist, violent, and culturally insensitive images. What's most concerning is that this method still works with multiple native moderation layers in place, highlighting the need for robust, third-party guardrails.

#5 Mistaken Identity

Lastly, a concerning risk for everyday users; we showed that AI email assistants can be fooled into misidentifying who an email is actually from just by manipulating the display name (one of the easiest fields to spoof.) Since LLM-based assistants summarize emails without checking key authentication signals like SPF, DKIM, or DMARC, they end up "cleaning" attacker identities and presenting fraudulent messages as if they came from trusted sources. This reveals a major gap in the trust model: AI systems are inheriting security assumptions they can't actually verify. And that turns what should be a simple productivity feature into a surprisingly effective vector for social engineering and even financial fraud.

The Alice research team is always prodding foundational models, looking for vulnerabilities that shape the AI Safety and Security policies built into our WonderFence Guardrails so that organizations offering public-facing AI apps can deploy with confidence.

*** Special Thanks to Roey Fizitzky, Vladi Krasner, and Ruslan Kuznetsov for their contributions to this article ***

Learn more about Alice Red Teaming Solutions

Learn more

What’s New from Alice

The Former Google Cloud CISO's Take on AI, Agents, and What Comes Next

There's a lot of noise around AI and security right now, and not many people who can cut through it the way Phil Venables can. He was CISO at Goldman Sachs, then the first CISO for Google Cloud, and he's now a partner at Ballistic Ventures. In this episode, he tells us why attackers scaling up worries him more than the vulnerabilities themselves, what trust even means when an agent is acting in your environment, and why the answer to most of this comes back to the same fundamentals we've leaned on for years.

Listen Now

It Takes AI to Break AI: The Case for AI Red Teaming

webinar

May 25, 2026

This is some text inside of a div block.

min read

May 25, 2026

This is some text inside of a div block.

min watch

As AI systems gain autonomy, organizations need security approaches built specifically for AI behavior. Learn why AI-driven red teaming is becoming a critical defense layer.

Learn More