Underhanded prompts: The stealthy threat to AI security

01 Feb, 2025

There’s a fresh wrinkle in AI security: attackers are now slipping “indirect prompt injections” into data to manipulate systems like Gemini. As the article opens:

“Modern AI systems, like Gemini, are more capable than ever, helping retrieve data and perform actions on behalf of users. However, data from external sources present new security challenges if untrusted sources are available to execute instructions on AI systems.”

This isn’t just a theoretical risk. The threat model sets up a scenario where an AI agent—tasked with handling emails—could unwittingly leak sensitive details. The attacker’s goal is clear:

“Our threat model concentrates on an attacker using indirect prompt injection to exfiltrate sensitive information, as illustrated above. The evaluation framework tests this by creating a hypothetical scenario, in which an AI agent can send and retrieve emails on behalf of the user.”

To counter this, the team isn’t relying on a silver bullet. Instead, they’ve built an automated red-teaming framework that refines malicious prompts iteratively. One method, dubbed Actor Critic, works like this:

“Actor Critic: This attack uses an attacker-controlled model to generate suggestions for prompt injections. These are passed to the AI system under attack, which returns a probability score of a successful attack. Based on this probability, the attack model refines the prompt injection. This process repeats until the attack model converges to a successful prompt injection.”

Beam Search is another tactic, starting with a crude injection and tweaking it until it forces the AI to divulge private data:

“Beam Search: This attack starts with a naive prompt injection directly requesting that the AI system send an email to the attacker containing the sensitive user information. If the AI system recognizes the request as suspicious and does not comply, the attack adds random tokens to the end of the prompt injection and measures the new probability of the attack succeeding. If the probability increases, these random tokens are kept, otherwise they are removed, and this process repeats until the combination of the prompt injection and random appended tokens result in a successful attack.”

They also adapt a method called Tree of Attacks w/ Pruning (TAP) to target security violations specifically, proving that no single defense is enough. In the end, the post drives home a vital point: as AI capabilities grow, defending against such nuanced attacks demands layered, rigorous testing and continuous evolution. A must-read in the ongoing AI security arms race.