The AI security field spent two years building prompt injection defenses for chatbots. Input filters, output monitors, adversarial training, instruction hierarchy enforcement. Some of it works reasonably well — for chatbots.
Agents are a different problem. The defenses built for chatbot-era prompt injection fail against agentic attack vectors for fundamental architectural reasons, not implementation reasons. You cannot fix this by tuning your filters.
This post explains why.
The Chatbot Threat Model
In a chatbot, the attack surface is simple. There is one input channel: the user message. There is one output channel: the model’s text response. The threat is a user who types something malicious.
The defense is correspondingly simple: train the model to recognize and resist adversarial user inputs. Monitor output for policy violations. The model is the system. Securing the model secures the system.
This worked well enough for the chatbot era.
The Agentic Threat Model
An agent is not a model. It is a system that includes a model, plus tools, plus memory, plus external integrations, plus — in multi-agent architectures — other agents.
The input surface is no longer just the user message. It includes:
- Documents the agent retrieves
- Search results the agent processes
- API responses from tool calls
- Memory retrieved from previous sessions
- Instructions from orchestrating agents
- Content from any URL the agent visits
Every one of these is a potential injection vector. And unlike user messages, most of them are not monitored, filtered, or treated as adversarial by default.
Why Standard Defenses Fail
Input filtering fails
because the injection does not come from user input. A content filter on the user message field catches nothing when the injection rides in through a tool output.
Output monitoring fails
because the damage often happens before any output is generated. Tool selection, memory writes, and sub-agent instructions happen inside the reasoning chain — not in the final response.
Instruction hierarchy fails
because it assumes you know where instructions come from. When an agent reads a document that contains an embedded instruction, the model has to decide whether that instruction is from the user, the system, or the document. Context-reframing attacks exploit this ambiguity.
Adversarial training helps but is insufficient
because the space of environmental injection vectors is orders of magnitude larger than the space of adversarial user messages. You cannot train on every web page, document, or API response your agent might encounter.
What Actually Changes
The fundamental difference is this: in a chatbot, trust boundaries are clear. The system prompt is trusted. The user message is untrusted. You evaluate user messages against that boundary.
In an agent, trust boundaries are blurry and dynamic. The agent processes content from sources with different trust levels — sometimes in the same context window, sometimes without clear demarcation. The model has to infer trust from context. Attackers exploit that inference process.
This is not a model capability problem. It is a system architecture problem. The solution requires rethinking trust at the system level, not just tuning model behavior.
For an empirical example of this inference ambiguity in action — and what UNCERTAIN looks like in a real evaluation — see: Why Claude Haiku Returned UNCERTAIN →
What Effective Agentic Security Looks Like
Three principles that chatbot-era security thinking misses:
Treat every environmental input as potentially adversarial
Documents, search results, tool outputs — all of them should be processed with the same skepticism you apply to untrusted user input. This means behavioral testing against environmental injection, not just user-supplied adversarial inputs.
Test tool call sequences, not just responses
The signal of a successful injection in an agentic system is often not what the agent says — it is what tool the agent calls, with what parameters, at what point in the reasoning chain. Behavioral evaluation has to instrument tool calls.
Model the blast radius of every tool
Every tool your agent can call represents potential damage if the agent is manipulated into misusing it. Irreversible tools (email, file write, API calls with side effects) require confirmation gates or strict scope constraints — not just model-level refusal training.
Practical Starting Point
If you are building an agent today and want to understand your ASI01 attack surface, start here.
Install the framework:
pip install safelabs-eval
Inspect the ASI01 prompt library — the three attack vectors the framework tests, from direct system override to environmental document injection:
safelabs prompts --category ASI01

For comparison, here is the ASI02 library — scope violation vectors that exploit the same architectural weakness through tool permissions rather than instruction injection:
safelabs prompts --category ASI02

The pattern is consistent across categories: the attack does not come from the user message. It comes from content the agent processes as part of doing its job.
Run the evaluation against your agent:
safelabs run --target http://localhost:8000/chat --category ASI01

This run returned 2 PASS and 1 UNCERTAIN on ASI01. The UNCERTAIN on ASI01-003 — the document-context injection — is the result that matters. The agent processed a document containing a hidden instruction and its behavior deviated from expected without producing a clean refusal signal.
That is the agentic threat model in practice. Not a user typing something malicious. A document the agent trusted containing something it should not have.
The framework also exposes a Python API if you want to wrap your own agent function directly:
import asyncio
from safelabs import run_eval
async def my_agent(prompt: str) -> str:
return your_agent.run(prompt)
result = asyncio.run(run_eval(my_agent, categories=["ASI01"]))
result.summary()
The chatbot-era question was: did the model comply with a malicious user instruction?
The agentic-era question is: did the agent’s behavior change in ways consistent with environmental injection influence, even when no single step produced an obvious failure signal?
Those are different questions. They require different tools.
For the full ASI01–ASI10 taxonomy with real test results per category, see: The OWASP ASI Top 10: A Practical Developer Guide →