…, adopt transform hooks in eval; clean up
github-merge-queue bot pushed a commit that referenced this pull request on Dec 8, 2025
…iti( modal/lingual) Transforms (#223)
* add multimodal datatype and attack
* fix precommit errors
* Update llm target output to dn message
* Add schema to dn message structure
* add transform hooks; update tap, goat, prompt attacks with transforms, adopt transform hooks in eval; clean up
* update docs
* fix ruff
* fix precommit
* add crescendo variants and update constants
* fix precommit errors
* fix precommit
* update goat on topic rubric to better reason about jailbreaks
* precommit error
* fix crescendo rubric
* add ai red teaming eval notebook
* precommit
* add safety dataset
This PR introduces an implementation of the Crescendo multi-turn jailbreak attack and expands the text transformation capabilities available for adversarial testing. Key additions include multimodal transform support through the Message DataType architecture and advanced perturbation techniques built on a hook pattern.
Overview
Added production-ready multimodal support for adversarial attacks with:
Key Features
1. Multimodal Message DataType
* Message container supporting text, images, audio, and video
* dn.Message ↔ rg.Message
* text_parts, image_parts, audio_parts, video_parts
2. Hook-Based Transform Architecture
* apply_transforms() - Unified hook factory for input/output transforms
* apply_transforms_to_message/value/kwargs
* Agent._dispatch
3. Attack Framework Updates
* DnMessage end-to-end
* DnMessage with graph context
* DnMessage only
4. Evaluation System Enhancements
* SamplePreProcess, SamplePostProcess
* ModifyInput, ModifyOutput, SkipSample, StopEval
* Sample.transformed_input for debugging
5. Crescendo Attack Implementation
6. Advanced Text Transforms
Added Files
Modified Files
Core Framework
* eval/eval.py - Integrated hook architecture, added _dispatch_hooks(), _run_sample_with_hooks()
* eval/events.py - Added SamplePreProcess, SamplePostProcess events
* eval/sample.py - Added transformed_input field for tracking
* optimization/study.py - Passes hooks to Eval, fixed dataset injection
* optimization/stop.py - Fixed score_value() to check all finished trials
* optimization/trial.py - Added transformed_input property
* optimization/console.py - Shows transformed inputs in best trial display
Attack Components
* airt/target/llm.py - Updated to accept/return DnMessage only
* airt/attack/prompt.py - Refactored with message-aware refiner
* airt/attack/tap.py - Updated to DnMessage types
* airt/attack/goat.py - Migrated to DnMessage with hooks support
* transforms/refine.py - Updated llm_refine() for string-based refinement
Data Types
* data_types/text.py - Added explicit metadata for serialization
* data_types/image.py - Enhanced with source metadata tracking
Examples
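The multimodal Message container listed under Key Features can be pictured with a small stand-alone sketch. This is not the library's implementation: SketchMessage, Part, and the role and data fields are hypothetical names chosen for illustration; only the text_parts, image_parts, audio_parts, and video_parts accessor names come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal

Kind = Literal["text", "image", "audio", "video"]


@dataclass(frozen=True)
class Part:
    """One piece of content (hypothetical type, not the library's)."""
    kind: Kind
    data: str | bytes  # text content, or raw media bytes


@dataclass
class SketchMessage:
    """Hypothetical stand-in for a multimodal message container."""
    role: str
    parts: list[Part] = field(default_factory=list)

    def _of(self, kind: Kind) -> list[Part]:
        # Filter parts by modality, preserving their original order.
        return [p for p in self.parts if p.kind == kind]

    @property
    def text_parts(self) -> list[Part]:
        return self._of("text")

    @property
    def image_parts(self) -> list[Part]:
        return self._of("image")

    @property
    def audio_parts(self) -> list[Part]:
        return self._of("audio")

    @property
    def video_parts(self) -> list[Part]:
        return self._of("video")

    @property
    def text(self) -> str:
        # Join only the textual parts; media parts are ignored.
        return "\n".join(str(p.data) for p in self.text_parts)


msg = SketchMessage(
    role="user",
    parts=[Part("text", "Describe this image."), Part("image", b"\x89PNG")],
)
```

Keeping every modality in one ordered list and exposing per-modality views means a transform can target only the text parts, for example, without disturbing attached media.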
Using Hooks with TAP Attack
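The following is a minimal, self-contained sketch of the hook pattern described above, in which a hook built from a list of transforms rewrites each prompt before it reaches the target. It does not use the library's actual API and does not implement TAP itself: the event and reaction names (SamplePreProcess, SamplePostProcess, ModifyInput, ModifyOutput, SkipSample) and apply_transforms are borrowed from this PR description, but their fields, signatures, and behaviour here are assumptions, and run_sample, leetspeak, and echo_target are hypothetical helpers standing in for the attack loop and the target model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Union


# Stand-in event and reaction types. Names are borrowed from the PR
# description; fields and semantics are assumptions made for this sketch.
@dataclass
class SamplePreProcess:
    input: str


@dataclass
class SamplePostProcess:
    input: str
    output: str


@dataclass
class ModifyInput:
    value: str


@dataclass
class ModifyOutput:
    value: str


class SkipSample:
    pass


Event = Union[SamplePreProcess, SamplePostProcess]
Reaction = Union[ModifyInput, ModifyOutput, SkipSample, None]
Hook = Callable[[Event], Reaction]


def apply_transforms(transforms: list[Callable[[str], str]]) -> Hook:
    """Hypothetical hook factory: chain transforms over the sample input."""
    def hook(event: Event) -> Reaction:
        if isinstance(event, SamplePreProcess):
            value = event.input
            for transform in transforms:
                value = transform(value)
            return ModifyInput(value)
        return None  # this hook leaves post-process events untouched
    return hook


def run_sample(prompt: str, target: Callable[[str], str], hooks: list[Hook]):
    """Toy evaluation step: returns (transformed_input, output) or None."""
    transformed = prompt
    for hook in hooks:
        reaction = hook(SamplePreProcess(transformed))
        if isinstance(reaction, SkipSample):
            return None  # sample dropped before reaching the target
        if isinstance(reaction, ModifyInput):
            transformed = reaction.value
    output = target(transformed)
    for hook in hooks:
        reaction = hook(SamplePostProcess(transformed, output))
        if isinstance(reaction, ModifyOutput):
            output = reaction.value
    return transformed, output


def leetspeak(text: str) -> str:
    # Simple character-substitution perturbation (illustrative only).
    return text.replace("e", "3").replace("a", "4")


def echo_target(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"echo: {prompt}"


hooks = [apply_transforms([str.strip, leetspeak])]
result = run_sample("  tell me a secret ", echo_target, hooks)
```

Returning the transformed input alongside the output mirrors the transformed_input field mentioned earlier: it lets a reviewer see exactly what the target received, which helps when debugging a chain of perturbations.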