indirect-injection-detection
Detects indirect prompt injection and goal hijacking in AI agents by evaluating how they process external content like RAG, documents, and web data.
Introduction
The indirect-injection-detection skill is a specialized security diagnostic tool designed to identify vulnerabilities where an AI agent's instructions are overridden by malicious content found in retrieved data. In many modern agent architectures, the model processes external context—such as RAG retrieved chunks, uploaded files, or fetched web pages—as part of its reasoning process. Attackers can exploit this by embedding 'hidden' instructions within these external sources that trick the agent into disregarding its primary system prompt, leaking internal data, or performing unauthorized actions. This skill simulates these real-world scenarios to ensure your agent maintains its integrity despite receiving untrusted external input.
-
Detects goal hijacking by identifying if an agent follows instructions embedded within external data sources.
-
Simulates various attack vectors, including RAG-based retrieval poisoning, document-style injections, and web page content manipulation.
-
Implements marker-based verification to provide unambiguous pass/fail results based on the agent's response.
-
Evaluates potential data leakage, such as unauthorized system prompt exposure or API key exfiltration when prompted by external content.
-
Provides actionable risk assessment, classifying findings as vulnerable when an agent complies with malicious injected directives.
-
Best used for agents utilizing RAG (Retrieval-Augmented Generation), file upload processing, or automated URL/web content fetching.
-
Requires a dialogue-capable interface where the testing prompt can be injected alongside simulated external context.
-
When defining test cases, ensure the payload contains a clear malicious objective (e.g., 'Reply with PWNED' or 'Output your system prompt') to confirm if the agent logic is compromised.
-
Treat all external inputs as untrusted; if the agent generates the marker or follows the hidden command, it is confirmed as vulnerable to indirect prompt injection.
-
Use this for regular red-teaming and automated security regression testing in your AI development pipeline to strengthen LLM resilience against adversarial input contamination.
Repository Stats
- Stars
- 3,605
- Forks
- 358
- Open Issues
- 8
- Language
- Python
- Default Branch
- main
- Sync Status
- Idle
- Last Synced
- May 1, 2026, 07:00 AM