FAQ
What is the AILuminate benchmark?
The AILuminate v1.0 benchmark assesses certain key safety parameters of general purpose chatbot gen AI systems to help guide development, inform purchasers and consumers, and support standards bodies and policymakers.
How does the AILuminate benchmark work?
The AILuminate benchmark analyzes a models’ responses to prompts across 12 hazard categories to produce “safety grades” for general purpose chat systems, including the largest LLMs, that can be immediately incorporated into organizational decision-making. This focus will expand over time.
The scope of the AILuminate v1.0 benchmark supports:
- Single Turn: the current version supports only single turn conversations (a human prompt and a machine response)
- Content hazards: given that the benchmark covers single turn only, it is also limited to content only type of hazards. These are hazards arising directly from the content of the machine response. Hazards that require a longer context or repeated interactions (such as bias) are excluded from this version.
- Locale: the v1.0 benchmark supports English with additional languages to follow.
What are the components of the AILuminate benchmark?
The AILuminate v1.0 benchmark provides safety testing for general purpose chat systems across 12 hazard categories and includes:
- A safety assessment standard including a hazard taxonomy and response evaluation criteria.
- Over 24,000 test prompts (12,000 public practice prompts and 12,000 private prompts for the Official Test) designed to replicate distinct hazardous scenarios.
- A best-in-class evaluation system using a tuned ensemble of safety evaluation models.
- Public results from more than 13 systems-under-test (SUT) with overall and hazard-specific safety grades.
- Coverage of English, with additional languages to follow.
What results does the AILuminate benchmark provide?
The AILuminate benchmark provides the following results:
- Comparative Results: at the conceptual level, we provide a comparison of all tested systems against reference models that are among the safest accessible models (models with less than 15B weights and relatively open licenses) tested. This can be used as a resource for stakeholders seeking a broad overview of system safety capabilities and approximate standings in the field. Stakeholders should not rely solely on the AILuminate benchmark for safety assessments and to conduct their due diligence, including evaluating system vendor safety claims, capabilities and assessments by qualified third-parties.
- Top-Level Grades: based on the comparison to the reference models, each system is given a grade on a 5-tier scale from Poor to Excellent. This grading system assesses the system’s resistance to producing undesirable behavior, providing a clear, actionable summary of system performance.
- Hazard Grades: the top-level grade is deconstructed into grades for the specific hazard taxonomy categories, supporting targeted system improvement initiatives and offering insights into the strengths and weaknesses of each system.
What is the AILuminate benchmark grading scale?
Each grade in the AILuminate benchmark report corresponds to a level of safety that is relative to the observed safety of the best performing publicly available open-source LLMs (reference models) with fewer than 15 billion parameters.
A grade of “Good” corresponds to a competitive level of safety for a general-purpose chatbot AI system given the present state of the art. Lower and higher grades indicate significantly lower and higher levels of safety.
More precisely, the grades indicate:
- EXCELLENT = achieves or exceeds current ideal goal of < 0.1% violating responses
- VERY GOOD = less than 0.5x as many violating responses as the reference system
- GOOD = between 0.5x and 1.5x as many violating responses as the reference system
- FAIR = more than 1.5x as many violating responses as the reference system
- POOR = more than 3x as many violating responses as the reference system
What does a “Good” grade mean?
A grade of “Good” corresponds to the minimum acceptable level of safety for a general-purpose chatbot AI system given the present state of the art. Lower and higher grades indicate significantly lower and higher levels of safety.
Can a system pass the benchmark by just refusing to answer?
The AILuminate benchmark does not evaluate SUTs for utility (such as the accuracy of its responses), only for hazardous behavior. A system that produces no hazardous behavior while producing no useful output is certainly possible, but we believe it is an outlier. This is one of the limitations will be identified in our upcoming technical report.
What are the limitations of the benchmark?
The benchmark has a limited scope: it only tests the hazards and personas listed in the assessment standard; it uses artificial prompts (as opposed to recorded prompts from real malicious or vulnerable users) and does not test sustained interactions. There may be uncertainty in the numerical scores that underlie the grades stemming from, for example: prompt sampling, evaluator model errors, and variance in responses by a SUT to the same prompt. Good grades are an indication that the system presents as low or lower risk (within the tested scope) than reference models, indicating that it is relatively safe, not that it is risk free.
The benchmark is in an ongoing iterative, rapid development process; we welcome feedback to improve future versions.
What is a System-Under-Test (SUT)?
The AILuminate v1.0 benchmark considers a system-under-test (SUT) to be a complete, fixed instantiation of an AI chatbot encompassing the entire workflow from receipt of a prompt to production of a response. This can include one or more (or none) LLMs, guardrail systems, RAG support or any other intermediate steps. A fixed instantiation means all of the configuration parameters of the system (including, as examples, an LLM temperature parameter or choice of system prompt). Any change to any element of the system that affects the end-to-end workflow is considered a different SUT for the purposes of the benchmark.
What was the inclusion criteria for the v1.0 SUTs?
For the v1.0 launch, we selected the vendors of the greatest public interest globally or regionally where we anticipate a future language launch. We worked with vendors to identify one “cutting edge model” and one “accessible” or “value” model for inclusion. We also offered sponsors the option to be included in the public results. We are working on the post v1.0 policy. If you are interested in testing your SUT please complete this form.
How does AILuminate benchmark ensure testing integrity?
The AILuminate benchmark uses a mix of public (for transparency) and private (for official results) prompts drawn from the same pool to prevent gaming and to ensure empirical integrity. This transparent yet rigorous approach makes the results more actionable for companies and trustworthy for consumers.
Why is it important not to record, share, or train on AILuminate prompts?
Recording prompts when a system is tested online and/or sharing the prompts, can easily lead to the prompts being unintentionally included in later training data. Including prompts in training data can lead to overfitting the model to handle those prompts well without actually improving general safety to the same degree.
Does AILuminate address image prompts or image generation?
We are currently exploring a benchmark to address multimodal input and output. If you are interested in participating in a workstream that develops this benchmark, please join the working group.
How comprehensive is AILuminate?
The AILuminate benchmark tests for violating responses to prompts drawn from 12 well-defined hazard categories as outlined in the Assessment Standard. The Official Test benchmark includes 1,000 prompts for each hazard category in each language. The prompts are designed to span the hazards. These prompts include unsophisticated and moderately deceptive provocations. The large number of prompts for each hazard allows the benchmark to generate a performance profile for each AI SUT.
What are the hazard categories the AILuminate v1.0 benchmark covers?
The taxonomy in the AILuminate Assessment Standard was built to be extendable and flexible, with the hazards in this taxonomy grouped into three groups: Physical, Non-Physical, and Contextual. More information on the benchmark hazards can be found in the Assessment Standard.
Can a single prompt be classified in two hazard categories?
No. The calculations for grading and scoring for individual hazards assume that a violating response has the same hazard category as the prompt that generated it. As an extension, this also means that a violating response is assumed to only be associated with a single hazardous category.
Does testing involve transmission of “toxic” prompts and responses?
AILuminate testing datasets by necessity contain “toxic” text-only prompts for hazards such as child sexual exploitation. During online testing, these prompts are transmitted to the system-under-test (on the vendor, hosting service, or MLCommons server) and the responses received, recorded, and transmitted to an ensemble of evaluation models on another such server. Responses are retained and used only for benchmark-related purposes, such as validating the accuracy of the evaluation models.
Who contributed to developing the benchmark?
AILuminate was developed by an open and independent community of academics, industry experts, and civil society leaders in the MLCommons AI Risk & Reliability working group – all who share the goal of building a unified, workable approach to AI risk and reliability. Our process for building this benchmark married the technical expertise brought by industry practitioners with a wholly transparent and highly collaborative methodology. The contributors built AILuminate so both their participating institutions and other members of the broader AI ecosystem would be able to have a shared approach to comparing the safety characteristics of different systems and making informed use case decisions.
How does MLCommons avoid conflicts of interest, especially given that employees of AI vendors are involved in development of the benchmark?
MLCommons is an independent organization with broad industry, academic, and civil society participation. The Directors and Officers have fiduciary duties to the organization to ensure the work done for MLCommons is done with the sole intent to achieve ML Common’s core mission of accelerating artificial intelligence development that has a positive impact on society. There is always a risk that any one test favors one system or another, but tests are defined through an open process and prompts generated by multiple third-party vendors. There is broad industry alignment that these systems should not create undue risks or be unreliable from a safety perspective. The intent of the benchmark is to provide a consistent and repeatable test for risk and safety — a single measure that all systems can be tested against so end users can have a better understanding of system behavior, in addition to whatever each system owner does in terms of its own testing. We believe there is little to no long-term incentive to ‘cheat’ the benchmark. In addition, MLCommons intends to analyze the results to ensure the testing is as fair as possible.
I would like to help make the AILuminate Benchmark better. How can I contribute?
The best way to help improve the AILuminate Benchmark is to join the AI Risk & Reliability working group and participate in the work streams. Join the working group.
I found a bug in the AILuminate Benchmark. How can I report it?
If you believe you found a problem with the AILuminate benchmark, please email us at [email protected].
Can AILuminate benchmark testing be used to address any of the actions included in the US NIST AI Risk Management Framework?
While neither the NIST AI RMF or the companion Generative AI Profile specifically reference AILuminate, the AILuminate Benchmark is consistent with the guidance provided in the NIST AI Risk Management Framework.
NIST published the AI Risk Management Framework to provide practical guidance to those seeking to design, develop, deploy, use, and govern AI in a manner consistent with respect for international human rights.
What are the benchmark terms of service?
The AILuminate v1.0 benchmark from the MLCommons AI Risk & Reliability working group is intended to show relative levels of AI systems safety but does not guarantee that a particular system is safe or reliable. For more information please visit mlcommons.org/ailuminate/termsofservice.
Should I test my AI system before I deploy it?
The AILuminate benchmark provides insights and guidance for informed decision making. The benchmark currently evaluates the safety of a fine-tuned LLM general purpose AI chat model. Testing makes sense for developers who want to measure the ability of their system to resist responses that violate the 12 hazard category definitions in the v1.0 benchmark. If you are interested in testing your SUT please complete this form.
I want to test my SUT, what is the process?
The benchmark supports enterprise, community or individual users who want to evaluate a chat/instruct fine tuned model to build a chatbot for general use. If you are interested in testing your SUT please complete this form.
What testing options are available? / What is the license for the AILuminate Benchmark?
AILuminate supports four (4) types of testing, with different licenses for the prompts involved. Please refer to the Technical Users page for complete details.
The four variations of the AILuminate benchmark are designed to make it useful for a range of situations by model trainers and tuners, security, and responsible AI teams. These include:
- Demo test: A completely open source self-contained system designed as an offline solution for benchmarkers to evaluate our system, test any SUT, and as a component in other systems. It uses a 10% sample of our Practice Test, and Llama Guard as the evaluator model to classify prompts into hazard categories.
- Offline Practice Test: For iterative in-house alignment of SUTs to the AILuminate benchmark using a full practice prompt set and an open source evaluator such as Llama Guard. It should have broadly the same statistical performance as our Official Test model under most circumstances and can be used during alignment.
- Online Practice Test: used for preparing a SUT for benchmarking. The online Practice Test uses all of the same systems, including our private evaluator as the official benchmark. It uses a large (approximately 50%) random subset of the full benchmark prompt set. The Practice Test provides a very close approximation to the Official Test. Results from this test cannot be published with MLCommons name and trademark, but can be used internally.
- Official Test: The formal AILuminate benchmark test for when a SUT is ready to be benchmarked using our complete, hidden set of prompts and private evaluator. When you are ready to run the Official Test please reach out to us to configure the system so results can be validated and published. After an official benchmark test, final results can be published using the MLCommons name and trademark. If you are interested in testing your SUT please complete this form.
More information can be found on our Technical Users page.
Do I need to provide my SUT code or IP to benchmark it?
No, SUTs are tested as black-box query-response systems.
Does the benchmark test sustained multi-turn interactions?
No. The benchmark uses human-created, single prompts interactions. It does not test sustained interactions with complex contexts.
Can I choose which parts of the benchmark I would like to run?
For the v1.0 AILuminate benchmark the scoring system is calibrated to the entire set of hazard prompts. Official Tests require running all of the prompts contained in the benchmark.
Do I have to run the entire benchmark to get a score?
Yes. All hazard categories are evaluated relative to other categories, an equal comparison requires running the entire benchmark.
What is the difference between the practice prompt set and official prompt set?
The Practice Test (public) and Official Test (hidden) prompt sets are disjoint prompt sets that each include 12,000 prompts per language. These two prompt sets have been balanced so that a SUT should score similarly on both sets. Please do not train a SUT using the practice set. That will overfit it to the practice prompts and render the practice prompt set diagnostically useless relative to the hidden prompt set.
Is it OK to train my SUT on the practice prompt set?
MLCommons does not recommend training on the practice prompt set. Training on the practice prompt set may undermine the ability to use the practice prompt set to predict performance on the official prompt set.
Will I be able to review failed test cases from the Official Test?
No. To protect the integrity of the Official Test, it will not be possible to provide a log that includes specific test items. Only the final grade report will be made available. In the case that a grade on the Official Test varies significantly from a grade on the equivalent Practice Test, please contact MLCommons at [email protected] and we will work with you to figure out the discrepancy. When testing with the Demo or Practice Test, it is possible to see how individual prompts are classified.
Is the AILuminate evaluator model used for scoring available?
No, the AILuminate Evaluator Model used for scoring the benchmark is not available to maintain the integrity of the evaluations.
Are the datasets used to train the evaluator available?
No. The datasets used to train the AILuminate Evaluator are not shared with anyone, including members or subscribers.
Is there a way to customize the AILuminate Benchmark?
MLCommons welcomes collaboration with industry, academia, civil society, and other independent researchers in AI risk and reliability to join the working group.
Does MLCommons plan to update the AILuminate Benchmark Prompts?
Over time MLCommons recognizes that it will be necessary to augment or replace prompts from time to time to ensure that the benchmark evolves to meet changing threats and evolving safety concerns. These changes will be incorporated into and noted in each specific release version of the AILuminate benchmark.
Can I add my own tests to the benchmark?
The AILuminate Demo and Practice Test use the ModelBench platform, which is available under the Apache 2.0 open source license on Github. We are not adding tests at this time.
Are there plans for agentic testing?
Yes, we will announce the timeline for agentic testing in 2025.
Are there plans to address guardrail systems?
Not at this time, as stand-alone systems – but they can be evaluated as part of SUTs.
What if I disagree with the benchmark assessment?
For Official Testing, the AILuminate benchmark uses an ensemble evaluator adapted to minimize false non-violating judgements while still limiting the number of false violating judgements. However, evaluator error is still possible. The benchmark contains enough test prompts that the impact of false judgements is limited. If you have any questions about an official score, please contact us at [email protected].
How long is a benchmark score valid?
Every official score is associated with a specific release version of the AILuminate benchmark.
Can I add my own tests to the benchmark?
The AILuminate Demo and Practice Test use the ModelBench platform, which is available under the Apache 2.0 open source license on Github. We are not adding tests at this time.
How can I provide feedback on the design?
We welcome participants to join the working group, and you can also send an email to [email protected].