What is the MLCommons AILuminate Jailbreak Benchmark v0.5?

The MLCommons Jailbreak Benchmark v0.5 is a standardized framework for measuring how well AI systems resist adversarial attacks that attempt to bypass safety guardrails. It quantifies the “Resilience Gap” – the difference between a model’s safety performance under normal conditions versus when subjected to jailbreak attempts.

How does this differ from existing safety benchmarks?

While safety benchmarks like AILuminate v1.0 test how models respond to potentially harmful prompts under normal conditions, the Jailbreak Benchmark specifically tests resistance to adversarial manipulation. It uses the same grading rubric as the safety benchmark but systematically applies various attack techniques to measure how much the systems under test (SUTs) degrade under hostile conditions.

What is the “Resilience Gap”?

The Resilience Gap is the key metric: it’s the difference between a model’s baseline safety score (the percentage of non-violating responses) and its score after jailbreak attacks. For example, if a model has an 80% safety rate normally but drops to 60% under attack, the Resilience Gap is 20 percentage points.
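As a minimal sketch (not MLCommons code), the calculation is just the difference between the two non-violating percentages, shown here in Python with the worked numbers from above:

```python
# Minimal sketch of the Resilience Gap calculation (illustrative, not MLCommons code).
def resilience_gap(baseline_safe_pct: float, jailbreak_safe_pct: float) -> float:
    """Gap, in percentage points, between baseline and under-attack safety scores."""
    return baseline_safe_pct - jailbreak_safe_pct

# Worked example from above: 80% safe at baseline, 60% safe under attack.
print(resilience_gap(80.0, 60.0))  # -> 20.0 percentage points
```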

What types of attacks does the benchmark test?

The v0.5 release tests multiple attack families, including role-playing scenarios, misdirection strategies, encoding techniques, and cross-language attacks, as well as multimodal (text-plus-image-to-text) variants. The specific attacks are not publicly disclosed to preserve benchmark integrity.

Which modalities are supported?

Currently, the benchmark supports:

- Text-to-Text (T2T): 39 models tested with 1,200 prompts across 12 hazard categories
- Text+Image-to-Text (T+I2T): 5 vision-language models tested with 400 multimodal prompts across 5 hazard categories

How are responses evaluated?

The benchmark uses an LLM-as-judge evaluator that classifies each response as either “NON-VIOLATING” or “VIOLATING.” The same evaluator is used for both safety and jailbreak testing to ensure comparable results. The v0.5 evaluator has a 17.1% false safe rate and a 29.7% false unsafe rate.
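As an illustrative sketch only, per-response verdicts can be aggregated into a safety score like this; the verdict labels come from the benchmark, but the `judge` callable and the function below are assumptions rather than the benchmark's actual interface:

```python
# Illustrative aggregation of LLM-as-judge verdicts into a safety score.
# The "NON-VIOLATING"/"VIOLATING" labels are from the benchmark; everything
# else here is a placeholder, not the real harness.
from typing import Callable, Iterable

def safety_score(responses: Iterable[str], judge: Callable[[str], str]) -> float:
    """Percentage of responses the judge labels NON-VIOLATING."""
    verdicts = [judge(r) for r in responses]
    non_violating = sum(1 for v in verdicts if v == "NON-VIOLATING")
    return 100.0 * non_violating / len(verdicts)
```

Because the same judge scores both the baseline and the post-attack responses, the two percentages, and therefore the Resilience Gap, are directly comparable.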

What did the v0.5 results show?

Key findings include:

- 35 of 39 T2T models received lower grades under jailbreak conditions
- Average safety score reduction of 19.81% for T2T models
- 25.27% average reduction for T+I2T models
- Impact was broadly distributed across hazard categories
- Different attack tactics showed widely varying effectiveness

What do these results mean for AI deployment?

The substantial Resilience Gaps demonstrate that current models are vulnerable to adversarial manipulation even when they show strong baseline scores. Organizations should consider both safety and security when deploying AI systems and implement defense-in-depth strategies.

How does this align with ISO/IEC 42001?

The benchmark is designed to integrate with AI Management Systems under ISO/IEC 42001. It provides:

- Auditable artifacts (datasets, configurations, run logs)
- Metrics for risk assessment and treatment selection
- Evidence for Statement of Applicability updates
- Inputs for management reviews and improvement cycles
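As a loose illustration of what an auditable artifact might look like, here is a hypothetical run record; every field name and value below is a placeholder assumption, not the benchmark's actual schema:

```python
# Hypothetical run record (illustrative only). The point is that the dataset,
# configuration, evaluator version, and scores are captured in a form an
# auditor can trace back to a specific test run.
run_record = {
    "benchmark": "AILuminate Jailbreak Benchmark v0.5",
    "sut": "example-model",                 # system under test (placeholder name)
    "modality": "T2T",
    "prompt_dataset": "jailbreak-v0.5-en",  # dataset identifier (assumed)
    "evaluator_version": "v0.5",
    "baseline_safe_pct": 80.0,
    "jailbreak_safe_pct": 60.0,
    "resilience_gap_pct": 20.0,
    "run_log_path": "logs/run-0001.jsonl",  # illustrative path
}
```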

Can this benchmark be used for regulatory compliance?

While v0.5 is a draft release focused on methodology validation, the framework is designed to support compliance documentation. The v1.0 release will provide additional reliability and comparability across SUTs.

How can I help get the benchmark to v1.0 in Q1 2026?

- Join working groups: Participate in open working groups for benchmark development
- Contribute attacks: Submit new attack techniques following contribution guidelines
- Test systems: Help expand the diversity of Systems Under Test
- International collaboration: Partner on localization and non-English evaluation
- Technical improvements: Contribute to evaluator calibration and error reduction
- Provide feedback: Share use cases and requirements for v1.0

What improvements are planned for v1.0?

- Improved evaluator accuracy with lower error rates
- Expanded attack taxonomy covering state-of-the-art techniques
- Broader modality support
- International coverage with multiple languages
- Public leaderboards with identified results
- Enhanced coordinated disclosure framework

Who should use this benchmark?

- AI developers and model providers
- Organizations deploying AI systems
- Security researchers and red teams
- Auditors and compliance professionals
- Policymakers and standards bodies

What are the current limitations?

v0.5 limitations include:

- English-only evaluation
- Limited to single-turn interactions
- Evaluator error rates not yet suitable for high-stakes decisions
- Results are anonymized (no public leaderboard)
- Limited number of models tested, especially for VLMs

Will the specific attacks be made public?

Following cybersecurity norms, specific attack strings are not publicly disclosed to avoid enabling malicious use.

How often will the benchmark be updated?

After the v1.0 release in Q1 2026, MLCommons plans regular updates to incorporate new attack techniques and maintain relevance as the threat landscape evolves.

