
The 5 Most Shocking LLM Weaknesses We Uncovered in 2025

Alice Staff - Dec 25, 2025

TL;DR

Over the past year, we’ve seen no shortage of AI failures. But these five stood out: they were surprising enough to catch even our most experienced red team researchers off guard. Here’s the countdown.

Our AI red teaming researchers are always developing new techniques to test generative AI models and agents. In 2025, they uncovered a wide range of critical vulnerabilities that revealed deep AI safety and security gaps. From the team's body of findings, they selected five that stunned them the most, from fundamental architectural weaknesses to the most dangerous user-facing social-engineering threat.

Each vulnerability exposes a breakdown in the safety and security expectations we've come to rely on in modern AI systems. Taken together, they make it clear that organizations deploying public-facing AI apps must consider AI safety and security solutions before the cracks in the foundation turn into real operational or organizational risks.

#1 Stolen Reasoning

The most architecturally devastating findings were reasoning prompt injections that allowed our red team to change what the model said by taking over how the model decided what to say. In agentic systems, models often use an internal reasoning process to quietly think through a request in natural language and decide what to do before responding or taking action.

We found that by injecting false reasoning between the model's reasoning tags (or disabling its reasoning altogether), we could make the model violate policy, for example by creating phishing emails. Because the model believed the unsafe reasoning was its own, it didn't detect the manipulation and continued to rely on the corrupted reasoning in later steps, propagating the attack.

While strong separation between user input, internal reasoning, and tools is essential to prevent this kind of takeover, guardrails can still help by checking user inputs for attempts to interfere with internal systems, such as references to reasoning tags, hidden instructions, or tool commands, and blocking or cleaning them before the model processes them.
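
To make the input-guardrail idea concrete, here is a minimal sketch of a pre-processing check. It is an illustration under stated assumptions, not the detection logic of any particular product: the tag names and the phrase list in `SUSPICIOUS_PATTERNS` are hypothetical, and a real deployment would need patterns matched to the delimiters its own model and tool layer actually use.

```python
import re

# Hypothetical patterns: these tag names and phrases are assumptions for
# illustration, not the delimiters of any specific model or framework.
SUSPICIOUS_PATTERNS = [
    re.compile(r"</?\s*(think|thinking|reasoning|scratchpad)\s*>", re.IGNORECASE),
    re.compile(r"</?\s*(tool_call|function_call|tool)\s*>", re.IGNORECASE),
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
]


def scan_user_input(text: str) -> list[str]:
    """Return every substring of `text` that matches a suspicious pattern."""
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


def sanitize_user_input(text: str) -> str:
    """Replace every suspicious match with a visible placeholder."""
    for pattern in SUSPICIOUS_PATTERNS:
        text = pattern.sub("[removed]", text)
    return text
```

A caller could block the request when `scan_user_input` returns anything, or pass the output of `sanitize_user_input` to the model instead. Pattern matching like this is easy to evade on its own, which is why it complements the structural separation described above and does not replace it.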

#2 The Invisible Execution

We also found a vulnerability we call Ghost Calling, where an AI executes an action in response to an instruction without logging that it did so or explaining why in its reasoning. In one case, our red team triggered the creation of an email using an external tool. The model never explained why it ran the tool, leaving the action hidden from reviewers. To prevent this, tools should only run when the action clearly comes from the model's own reasoning and not directly from user prompts that could carry injected instructions.
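
One way to picture this mitigation is a gate that sits between the model and its tools. The sketch below is a simplified assumption of how such a gate might look: `ToolRequest`, `ToolGate`, and the `justification` field are hypothetical names, and the gate only checks that a justification from the model's reasoning trace is present. It does not verify that the justification is genuine, which would need the separation controls discussed under #1.

```python
from dataclasses import dataclass, field


@dataclass
class ToolRequest:
    """Hypothetical container for a tool call proposed by the model."""
    tool_name: str
    arguments: dict
    justification: str  # text taken from the model's reasoning trace; may be empty


@dataclass
class ToolGate:
    """Records every tool attempt and refuses calls that lack a justification."""
    audit_log: list = field(default_factory=list)

    def execute(self, request: ToolRequest, tools: dict):
        allowed = bool(request.justification.strip())
        # Log before acting, so refused and executed calls both leave a record.
        self.audit_log.append({
            "tool": request.tool_name,
            "arguments": request.arguments,
            "justification": request.justification,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(
                f"Refusing to run {request.tool_name!r}: no reasoning recorded"
            )
        return tools[request.tool_name](**request.arguments)
```

Because the log entry is written before the tool runs, a reviewer can see every attempted action, including the ones the gate refused.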

#3 The Summoner in Your Inbox

The next shocking vulnerability leverages what AI is designed to do (summarize and process data) to steal information. We showed how an email-summarizing agent could be tricked into leaking sensitive details such as credit card numbers using indirect prompt injections that hid malicious instructions inside emails or documents the agent is asked to process.

It's a clear reminder of how critical strong input and output guardrails are when AI systems work with private content.
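
As a rough illustration of an output guardrail for this scenario, the sketch below redacts likely payment card numbers from a summary before it is shown or sent anywhere. It is an assumption about one possible check, not a description of how the leak was found or fixed: the function names are hypothetical, and the regular expression plus Luhn checksum only covers card-like digit runs, not other kinds of sensitive data.

```python
import re

# Candidate runs of 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    """Replace digit runs that pass the Luhn check with a placeholder."""
    def _replace(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group(0))
        return "[REDACTED CARD]" if luhn_valid(digits) else match.group(0)

    return CARD_CANDIDATE.sub(_replace, text)
```

The Luhn check keeps the filter from redacting arbitrary long numbers such as order IDs, at the cost of missing anything an attacker has re-encoded (for example, digits spelled out as words).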

#4 The Ghost in the Generator

On the generative side, we found that bad actors could slip hidden, malformed characters into otherwise normal prompts. These smuggled tokens take advantage of inconsistencies in the model's processing pipeline, leading to predictable hallucinations that can generate violent or otherwise prohibited imagery without the prompt or response being flagged as unsafe or violative by the model. Using this method, our team prompted the generation of unequivocally racist, violent, and culturally insensitive images. What's most concerning is that this method still works with multiple native moderation layers in place, highlighting the need for robust, third-party guardrails.
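
The exact characters used in this attack are not listed here, so the following is only a generic sketch of one screening step a guardrail might apply: flagging or stripping invisible and non-printing Unicode characters before a prompt reaches the generator. The category set and function names are assumptions chosen for illustration.

```python
import unicodedata

# Unicode general categories treated as suspect in this sketch (an assumption):
# Cf = format (e.g. zero-width characters), Cc = control, Co = private use,
# Cn = unassigned, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cf", "Cc", "Co", "Cn", "Cs"}
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def find_hidden_characters(prompt: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for suspect characters in `prompt`."""
    findings = []
    for index, char in enumerate(prompt):
        if char in ALLOWED_CONTROLS:
            continue
        if unicodedata.category(char) in SUSPECT_CATEGORIES:
            findings.append((index, f"U+{ord(char):04X}"))
    return findings


def strip_hidden_characters(prompt: str) -> str:
    """Return `prompt` with suspect characters removed."""
    return "".join(
        char for char in prompt
        if char in ALLOWED_CONTROLS
        or unicodedata.category(char) not in SUSPECT_CATEGORIES
    )
```

Stripping every format character is a blunt choice: it also removes legitimate uses such as zero-width joiners in emoji sequences and some scripts, so a production filter would likely flag for review instead of silently rewriting.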

#5 Mistaken Identity

Lastly, a concerning risk for everyday users: we showed that AI email assistants can be fooled into misidentifying who an email is actually from just by manipulating the display name (one of the easiest fields to spoof). Since LLM-based assistants summarize emails without checking key authentication signals like SPF, DKIM, or DMARC, they end up "cleaning" attacker identities and presenting fraudulent messages as if they came from trusted sources. This reveals a major gap in the trust model: AI systems are inheriting security assumptions they can't actually verify. That turns what should be a simple productivity feature into a surprisingly effective vector for social engineering and even financial fraud.
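
To show what checking those signals could look like, here is a minimal sketch that reads the sender from the raw message and pairs it with the SPF, DKIM, and DMARC verdicts from the `Authentication-Results` header. It assumes that header was added by a receiving mail server the assistant trusts (a forged copy from the sender would need to be stripped upstream), takes only the first verdict it finds for each mechanism, and uses a hypothetical `describe_sender` helper.

```python
import re
from email import message_from_string
from email.utils import parseaddr


def describe_sender(raw_email: str) -> dict:
    """Return the claimed sender plus the authentication verdicts found."""
    message = message_from_string(raw_email)
    display_name, address = parseaddr(message.get("From", ""))
    auth_results = message.get("Authentication-Results", "")

    verdicts = {}
    for mechanism in ("spf", "dkim", "dmarc"):
        match = re.search(rf"\b{mechanism}=(\w+)", auth_results, re.IGNORECASE)
        verdicts[mechanism] = match.group(1).lower() if match else "missing"

    return {
        "display_name": display_name,  # attacker-controlled, never trust alone
        "address": address,
        "verdicts": verdicts,
        # Simplifying assumption: treat the sender as verified only on DMARC pass.
        "verified": verdicts["dmarc"] == "pass",
    }
```

An assistant could then include the real address and a warning in its summary whenever `verified` is false, instead of repeating the display name as if it were proof of identity.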

The Alice research team is always prodding foundational models, looking for vulnerabilities that shape the AI Safety and Security policies built into our WonderFence Guardrails so that organizations offering public-facing AI apps can deploy with confidence.

*** Special Thanks to Roey Fizitzky, Vladi Krasner, and Ruslan Kuznetsov for their contributions to this article ***
