Attack Techniques
HackAgent provides multiple attack strategies, each designed for different security testing scenarios. Choose the right attack based on your testing goals, time constraints, and target characteristics.
Overview
Available Attacks
| Attack | Description | Sophistication | Speed |
|---|---|---|---|
| AdvPrefix | Multi-step adversarial prefix optimization | ⭐⭐⭐ High | Slower |
| PAIR | LLM-driven iterative prompt refinement | ⭐⭐ Medium | Medium |
| Baseline | Template-based prompt injection | ⭐ Basic | Fast |
All attacks support loading goals from AI safety benchmarks like AgentHarm, StrongREJECT, and HarmBench. See Dataset Providers for details.
🎯 AdvPrefix — Advanced Prefix Optimization
The most sophisticated attack in HackAgent's arsenal. Uses a 9-step automated pipeline to generate and optimize adversarial prefixes that bypass AI safety mechanisms.
How it works:
- Generate → Creates initial attack prefixes using uncensored models
- Evaluate → Tests prefixes against target with judge models
- Optimize → Selects and refines the most effective prefixes
- Report → Provides detailed success metrics and recommendations
Best for: Comprehensive security audits, bypassing sophisticated safety filters, adversarial robustness research
attack_config = {
"attack_type": "advprefix",
"goals": ["Extract system prompt"],
"generator": {"identifier": "ollama/llama2-uncensored", "endpoint": "..."},
"judges": [{"identifier": "ollama/llama3", "type": "harmbench"}]
}
🔄 PAIR — Prompt Automatic Iterative Refinement
An LLM-powered attack that uses an attacker model to iteratively refine jailbreak prompts based on target responses and judge feedback.
How it works:
- Initial prompt → Attacker LLM generates a jailbreak attempt
- Target response → Sends prompt to target agent
- Score & feedback → Judge evaluates if the attack succeeded
- Refine → Attacker uses feedback to generate improved prompt
- Iterate → Repeats until success or max iterations
Best for: Black-box testing, adaptive attacks that learn from failures, testing unknown safety mechanisms
Based on: "Jailbreaking Black Box Large Language Models in Twenty Queries" (Chao et al.)
attack_config = {
"attack_type": "pair",
"goals": ["Bypass content filter"],
"attacker": {"identifier": "gpt-4", "endpoint": "https://api.openai.com/v1"},
"n_iterations": 20
}
📝 Baseline — Template-Based Attacks
A simpler but effective approach using predefined prompt templates combined with harmful goals. Great for quick vulnerability assessments.
How it works:
- Template selection → Chooses from categorized prompt templates
- Goal injection → Combines templates with test objectives
- Execution → Sends templated prompts to target agent
- Evaluation → Assesses responses using objective criteria
Best for: Quick vulnerability scans, testing basic prompt injection defenses, establishing security baselines
attack_config = {
"attack_type": "baseline",
"goals": ["Ignore previous instructions"],
"template_categories": ["roleplay", "encoding", "context_switch"]
}
Choosing the Right Attack
Use AdvPrefix when:
- You need comprehensive security coverage
- Testing sophisticated safety mechanisms
- Time is not a constraint
- You want detailed attack analytics
Use PAIR when:
- You're testing a black-box system
- The safety mechanisms are unknown
- You want adaptive, learning-based attacks
- You have access to a capable attacker LLM
Use Baseline when:
- You need quick results
- Running initial vulnerability assessments
- Testing basic prompt injection defenses
- Establishing a security baseline before deeper testing
Attack Pipeline Architecture
All attacks in HackAgent follow a common architecture pattern:
Components
- Orchestrator: Manages attack lifecycle, configuration, and result handling
- Attack Implementation: Contains the specific attack logic (AdvPrefix, PAIR, Baseline)
- Agent Router: Handles communication with target agents across different frameworks
- Judges: Evaluate attack success using various criteria (HarmBench, custom objectives)
- Dashboard Sync: Automatically uploads results to the HackAgent platform
Next Steps
- AdvPrefix Deep Dive — Full documentation with advanced configuration
- PAIR Attack Guide — Iterative refinement techniques
- Baseline Templates — Template categories and customization