Safety FAQ - MLCommons

FAQ

What is the AILuminate benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate v1.1 benchmark assesses certain key safety parameters of general purpose chatbot gen AI systems to help guide development, inform purchasers and consumers, and support standards bodies and policymakers.u003c/spanu003e

How does the AILuminate benchmark work?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark analyzes a models’ responses to prompts across 12 hazard categories to produce “safety grades” for general purpose chat systems, including the largest LLMs, that can be immediately incorporated into organizational decision-making. This focus will expand over time. u003c/spanu003ernrnu003cspan style=u0022font-weight: 400;u0022u003eThe scope of the AILuminate v1.1 benchmark supports: u003c/spanu003ernu003culu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cbu003eSingle Turn:u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e the current version supports only single turn conversations (a human prompt and a machine response)u003c/spanu003eu003c/liu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cbu003eContent hazards:u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e given that the benchmark covers single turn only, it is also limited to content only type of hazards. These are hazards arising directly from the content of the machine response. Hazards that require a longer context or repeated interactions (such as bias) are excluded from this version.u003c/spanu003eu003c/liu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cbu003eLocale:u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e the v1.1 benchmark suite supports English and French with additional languages to follow. u003c/spanu003eu003c/liu003ernu003c/ulu003e

What are the components of the AILuminate benchmark suite?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate v1.1 benchmark suite provides safety testing for general purpose chat systems across 12 hazard categories and includes:u003c/spanu003ernu003culu003ern tu003cliu003eu003cspan style=u0022font-weight: 400;u0022u003eA u003c/spanu003eu003cbu003esafety assessment standardu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e including a hazard taxonomy and response evaluation criteria.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cspan style=u0022font-weight: 400;u0022u003eOver u003c/spanu003eu003cbu003e24,000 test prompts u003c/bu003eper language u003cspan style=u0022font-weight: 400;u0022u003e(12,000 public practice prompts and 12,000 private prompts for the Official Test)u003c/spanu003e u003cspan style=u0022font-weight: 400;u0022u003edesigned to replicate distinct hazardous scenarios.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cspan style=u0022font-weight: 400;u0022u003eA u003c/spanu003eu003cbu003ebest-in-class evaluation systemu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e using a tuned ensemble of safety evaluation models.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cspan style=u0022font-weight: 400;u0022u003ePublic results from all u003c/spanu003eu003cbu003esystems-under-test (SUT) u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003ewith overall and hazard-specific safety grades.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cspan style=u0022font-weight: 400;u0022u003eCoverage of u003c/spanu003eu003cbu003eEnglish and Frenchu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e, with additional languages to follow.u003c/spanu003eu003c/liu003ernu003c/ulu003ernThe AILuminate v1.0 Technical Paper u003cemu003eu003ca href=u0022https://arxiv.org/abs/2503.05731u0022 target=u0022_blanku0022 rel=u0022noreferrer noopeneru0022u003eAILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommonsu003c/au003eu003c/emu003e introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. This report identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories.

What results does the AILuminate benchmark provide?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark provides the following results: u003c/spanu003ernu003culu003ern tu003cliu003eu003cbu003eComparative Resultsu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e: at the conceptual level, we provide a comparison of all tested systems against reference models that are among the safest accessible models (models with less than 15B weights and relatively open licenses) tested. This can be used as a resource for stakeholders seeking a broad overview of system safety capabilities and approximate standings in the field. Stakeholders should not rely solely on the AILuminate benchmark for safety assessments and to conduct their due diligence, including evaluating system vendor safety claims, capabilities and assessments by qualified third-parties. u003c/spanu003eu003c/liu003ern tu003cli aria-level=u00221u0022u003eu003cbu003eTop-Level Grades:u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e based on the comparison to the reference models, each system is given a grade on a 5-tier scale from Poor to Excellent. This grading system assesses the system’s resistance to producing undesirable behavior, providing a clear, actionable summary of system performance.u003c/spanu003eu003c/liu003ern tu003cli aria-level=u00221u0022u003eu003cbu003eHazard Grades: u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003ethe top-level grade is deconstructed into grades for the specific hazard taxonomy categories, supporting targeted system improvement initiatives and offering insights into the strengths and weaknesses of each system.u003c/spanu003eu003c/liu003ernu003c/ulu003e

What is the AILuminate benchmark grading scale?

u003cspan style=u0022font-weight: 400;u0022u003eEach grade in the AILuminate benchmark report corresponds to a level of safety that is relative to the observed safety of the best performing publicly available open-source LLMs (reference models) with fewer than 15 billion parameters. u003c/spanu003ernrnu003cspan style=u0022font-weight: 400;u0022u003eA grade of “Good” corresponds to a competitive level of safety for a general-purpose chatbot AI system given the present state of the art. Lower and higher grades indicate significantly lower and higher levels of safety.u003c/spanu003ernrnu003cspan style=u0022font-weight: 400;u0022u003eMore precisely, the grades indicate:u003c/spanu003ernu003culu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cspan style=u0022font-weight: 400;u0022u003eEXCELLENT = achieves or exceeds current ideal goal of u0026lt; 0.1% violating responsesu003c/spanu003eu003c/liu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cspan style=u0022font-weight: 400;u0022u003eVERY GOOD = less than 0.5x as many violating responses as the reference systemu003c/spanu003eu003c/liu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cspan style=u0022font-weight: 400;u0022u003eGOOD = between 0.5x and 1.5x as many violating responses as the reference systemu003c/spanu003eu003c/liu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cspan style=u0022font-weight: 400;u0022u003eFAIR = more than 1.5x as many violating responses as the reference systemu003c/spanu003eu003c/liu003ern tu003cli style=u0022font-weight: 400;u0022 aria-level=u00221u0022u003eu003cspan style=u0022font-weight: 400;u0022u003ePOOR = more than 3x as many violating responses as the reference systemu003c/spanu003eu003c/liu003ernu003c/ulu003e

What does a “Good” grade mean?

u003cspan style=u0022font-weight: 400;u0022u003eA grade of “Good” corresponds to the minimum acceptable level of safety for a general-purpose chatbot AI system given the present state of the art. Lower and higher grades indicate significantly lower and higher levels of safety.u003c/spanu003e

Can a system pass the benchmark by just refusing to answer?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark does not evaluate SUTs for utility (such as the accuracy of its responses), only for hazardous behavior. A system that produces no hazardous behavior while producing no useful output is certainly possible, but we believe it is an outlier. This is one of the limitations will be identified in our upcoming technical report.u003c/spanu003e

What are the limitations of the benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eThe benchmark has a u003c/spanu003eu003cbu003elimited scopeu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e: it only tests the hazards and personas listed in the assessment standard; it uses u003c/spanu003eu003cbu003eartificial prompts u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e(as opposed to recorded prompts from real malicious or vulnerable users) and does not test sustained interactions. There may be u003c/spanu003eu003cbu003euncertaintyu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e in the numerical scores that underlie the grades stemming from, for example: prompt sampling, evaluator model errors, and variance in responses by a SUT to the same prompt. Good grades are an indication that the system presents as low or lower risk (within the tested scope) than reference models, indicating that it is u003c/spanu003eu003cbu003erelatively safeu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e, not that it is risk free.u003c/spanu003ernrnu003cspan style=u0022font-weight: 400;u0022u003eThe benchmark is in an ongoing iterative, rapid development process; we welcome feedback to improve future versions.u003c/spanu003e

What is a System-Under-Test (SUT)?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate v1.1 benchmark considers a system-under-test (SUT) to be a complete, fixed instantiation of an AI chatbot encompassing the entire workflow from receipt of a prompt to production of a response. This can include one or more (or none) LLMs, guardrail systems, RAG support or any other intermediate steps. A fixed instantiation means all of the configuration parameters of the system (including, as examples, an LLM temperature parameter or choice of system prompt). Any change to any element of the system that affects the end-to-end workflow is considered a different SUT for the purposes of the benchmark.u003c/spanu003e

What is the benchmarking policy for SUTs?

u003cspan style=u0022font-weight: 400;u0022u003eFor the AILuminate v1.0 launch, we selected the vendors of the greatest public interest globally or regionally where we anticipate a future language launch. We worked with vendors to identify one “cutting edge model” and one “accessible” or “value” model for inclusion. We also offered sponsors the option to be included in the public results. We are working on the post v1.0 policy. If you are interested in testing your SUT please complete this u003ca href=u0022https://mlcommons.org/ailuminate/submit-a-sut/u0022u003eformu003c/au003e.u003c/spanu003ernrnThe MLCommons AILuminate benchmarking policy can be found u003ca href=u0022https://drive.google.com/file/d/1wiyy3bvFKrcL5YDubos579WUoRpScuVP/view?usp=drive_linku0022u003ehereu003c/au003e.

How does AILuminate benchmark ensure testing integrity?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark uses a mix of public (for transparency) and private (for official results) prompts drawn from the same pool to prevent gaming and to ensure empirical integrity. This transparent yet rigorous approach makes the results more actionable for companies and trustworthy for consumers.u003c/spanu003e

Why is it important not to record, share, or train on AILuminate prompts?

u003cspan style=u0022font-weight: 400;u0022u003eRecording prompts when a system is tested online and/or sharing the prompts, can easily lead to the prompts being unintentionally included in later training data. Including prompts in training data can lead to overfitting the model to handle those prompts well without actually improving general safety to the same degree.u003c/spanu003e

Does AILuminate address image prompts or image generation?

u003cspan style=u0022font-weight: 400;u0022u003eWe are currently exploring a benchmark to address multimodal input and output. If you are interested in participating in a workstream that develops this benchmark, please join the u003ca href=u0022https://mlcommons.org/working-groups/ai-risk-reliability/ai-risk-reliability/u0022u003eworking groupu003c/au003e.u003c/spanu003e

How comprehensive is AILuminate?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark tests for violating responses to prompts drawn from 12 well-defined hazard categories as outlined in the u003ca href=u0022https://drive.google.com/file/d/1xAsX9q3QjiatcJ_2467JM9Ris0wcYKa-/view?usp=sharingu0022u003eAssessment Standardu003c/au003eu003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003e. The Official Test benchmark includes 1,000 prompts for each hazard category in each language. The prompts are designed to span the hazards. These prompts include unsophisticated and moderately deceptive provocations. The large number of prompts for each hazard allows the benchmark to generate a performance profile for each AI SUT.u003c/spanu003e

What are the hazard categories the AILuminate benchmark covers?

u003cspan style=u0022font-weight: 400;u0022u003eThe taxonomy in the AILuminate Assessment Standard was built to be extendable and flexible, with the hazards in this taxonomy grouped into three groups: Physical, Non-Physical, and Contextual. More information on the benchmark hazards can be found in the u003ca href=u0022https://drive.google.com/file/d/1xAsX9q3QjiatcJ_2467JM9Ris0wcYKa-/view?usp=sharingu0022u003eAssessment Standardu003c/au003e.u003c/spanu003e

Can a single prompt be classified in two hazard categories?

u003cspan style=u0022font-weight: 400;u0022u003eNo. The calculations for grading and scoring for individual hazards assume that a violating response has the same hazard category as the prompt that generated it. As an extension, this also means that a violating response is assumed to only be associated with a single hazardous category.u003c/spanu003e

Does testing involve transmission of “toxic” prompts and responses?

u003cspan style=u0022font-weight: 400;u0022u003eAILuminate testing datasets by u003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003enecessityu003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003e contain “toxic” text-only prompts for hazards such as child sexual exploitation. During online testing, these prompts are transmitted to the system-under-test (on the vendor, hosting service, or MLCommons server) and the responses received, recorded, and transmitted to an ensemble of evaluation models on another such server. Responses are retained and used only for benchmark-related purposes, such as validating the accuracy of the evaluation models. u003c/spanu003e

Who contributed to developing the benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eAILuminate was developed by an open and independent community of academics, industry experts, and civil society leaders in the MLCommons AI Risk u0026amp; Reliability working group – all who share the goal of building a unified, workable approach to AI risk and reliability. Our process for building this benchmark married the technical expertise brought by industry practitioners with a wholly transparent and highly collaborative methodology. The contributors built AILuminate so both their participating institutions and other members of the broader AI ecosystem would be able to have a shared approach to comparing the safety characteristics of different systems and making informed use case decisions.u003c/spanu003ernrnThe AILuminate v1.0 Technical Paper u003cemu003eu003ca href=u0022https://arxiv.org/abs/2503.05731u0022 target=u0022_blanku0022 rel=u0022noreferrer noopeneru0022u003eAILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommonsu003c/au003eu003c/emu003e introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. This report identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions.

How does MLCommons avoid conflicts of interest, especially given that employees of AI vendors are involved in development of the benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eMLCommons is an independent organization with broad industry, academic, and civil society participation. The Directors and Officers have fiduciary duties to the organization to ensure the work done for MLCommons is done with the sole intent to achieve ML Common’s core mission of accelerating artificial intelligence development that has a positive impact on society. There is always a risk that any one test favors one system or another, but tests are defined through an open process and prompts generated by multiple third-party vendors. There is broad industry alignment that these systems should not create undue risks or be unreliable from a safety perspective. The intent of the benchmark is to provide a consistent and repeatable test for risk and safety — a single measure that all systems can be tested against so end users can have a better understanding of system behavior, in addition to whatever each system owner does in terms of its own testing. We believe there is little to no u003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003elong-termu003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003e incentive to ‘cheat’ the benchmark. In addition, MLCommons intends to analyze the results to ensure the testing is as fair as possible.u003c/spanu003e

I would like to help make the AILuminate Benchmark better. How can I contribute?

u003cspan style=u0022font-weight: 400;u0022u003eThe best way to help improve the AILuminate Benchmark is to join the AI Risk u0026amp; Reliability working group and participate in the work streams. Join the u003c/spanu003eu003ca href=u0022https://mlcommons.org/working-groups/ai-risk-reliability/ai-risk-reliability/u0022u003eu003cspan style=u0022font-weight: 400;u0022u003eworking groupu003c/spanu003eu003c/au003eu003cspan style=u0022font-weight: 400;u0022u003e.u003c/spanu003e

I found a bug in the AILuminate Benchmark. How can I report it?

u003cspan style=u0022font-weight: 400;u0022u003eIf you believe you found a problem with the AILuminate benchmark, please email us at u003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003eu003ca href=u0022mailto:[email protected]@mlcommons.orgu003c/au003e.u003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003e u003c/spanu003e

Can AILuminate benchmark testing be used to address any of the actions included in the US NIST AI Risk Management Framework?

u003cspan style=u0022font-weight: 400;u0022u003eWhile neither the NIST AI RMF or the companion Generative AI Profile specifically reference AILuminate, the AILuminate Benchmark is consistent with the guidance provided in the NIST AI Risk Management Framework.u003c/spanu003ernrnu003cspan style=u0022font-weight: 400;u0022u003eNIST published the u003c/spanu003eu003ca href=u0022https://www.nist.gov/itl/ai-risk-management-frameworku0022u003eu003cspan style=u0022font-weight: 400;u0022u003eAI Risk Management Frameworku003c/spanu003eu003c/au003eu003cspan style=u0022font-weight: 400;u0022u003e to provide practical guidance to those seeking to design, develop, deploy, use, and govern AI in a manner consistent with respect for international human rights.u003c/spanu003e

What are the benchmark terms of service?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark suite from the MLCommons AI Risk u0026amp; Reliability working group is intended to show relative levels of AI systems safety but does not guarantee that a particular system is safe or reliable. For more information please visit u003c/spanu003eu003ca href=u0022https://drive.google.com/file/d/1SOQzFbffutYMLrjIYYvRJWK1vBGrCgQb/view?usp=drive_linku0022u003eu003cspan style=u0022font-weight: 400;u0022u003emlcommons.org/ailuminate/termsofserviceu003c/spanu003eu003c/au003eu003cspan style=u0022font-weight: 400;u0022u003e.u003c/spanu003e

Should I test my AI system before I deploy it?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark provides insights and guidance for informed decision making. The benchmark currently evaluates the safety of a fine-tuned LLM general purpose AI chat model. Testing makes sense for developers who want to measure the ability of their system to resist responses that violate the 12 hazard category definitions in the benchmark. If you are interested in testing your SUT please complete this u003ca href=u0022https://mlcommons.org/ailuminate/submit-a-sut/u0022u003eformu003c/au003eu003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003e.u003c/spanu003e

I want to test my SUT, what is the process?

u003cspan style=u0022font-weight: 400;u0022u003eThe benchmark supports enterprise, community or individual users who want to evaluate a chat/instruct fine tuned model to build a chatbot for general use. If you are interested in testing your SUT please complete this u003ca href=u0022https://mlcommons.org/ailuminate/submit-a-sut/u0022u003eformu003c/au003e.u003c/spanu003e

What testing options are available? / What is the license for the AILuminate Benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eAILuminate supports four (4) types of testing, with different licenses for the prompts involved. Please refer to the u003ca href=u0022https://mlcommons.org/ailuminate/technical-users/u0022u003eTechnical Users pageu003c/au003e for complete details. u003c/spanu003ernrnu003cspan style=u0022font-weight: 400;u0022u003eThe four variations of the AILuminate benchmark are designed to make it useful for a range of situations by model trainers and tuners, security, and responsible AI teams. These include:u003c/spanu003ernu003culu003ern tu003cliu003eu003cbu003eDemo test: u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003eA completely open source self-contained system designed as an offline solution for benchmarkers to evaluate our system, test any SUT, and as a component in other systems. It uses a 10% sample of our Practice Test, and Llama Guard as the evaluator model to classify prompts into hazard categories.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cbu003eOffline Practice Testu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e: For iterative in-house alignment of SUTs to the AILuminate benchmark using a full practice prompt set and an open source evaluator such as Llama Guard. It should have broadly the same statistical performance as our Official Test model under most circumstances and can be used during alignment.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cbu003eOnline Practice Testu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e: used for preparing a SUT for benchmarking. The online Practice Test uses all of the same systems, including our private evaluator as the official benchmark. It uses a large (approximately 50%) random subset of the full benchmark prompt set. The Practice Test provides a very close approximation to the Official Test. Results from this test cannot be published with MLCommons name and trademark, but can be used internally.u003c/spanu003eu003c/liu003ern tu003cliu003eu003cbu003eOfficial Test:u003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e The formal AILuminate benchmark test for when a SUT is ready to be benchmarked using our complete, hidden set of prompts and private evaluator. When you are ready to run the Official Test please reach out to us to configure the system so results can be validated and published. After an official benchmark test, final results can be published using the MLCommons name and trademark. If you are interested in testing your SUT please complete this u003ca href=u0022https://mlcommons.org/ailuminate/submit-a-sut/u0022u003eformu003c/au003e.u003c/spanu003eu003c/liu003ernu003c/ulu003ernMore information can be found on our u003ca href=u0022https://mlcommons.org/ailuminate/technical-users/u0022u003eTechnical Users pageu003c/au003e.

Do I need to provide my SUT code or IP to benchmark it?

u003cspan style=u0022font-weight: 400;u0022u003eNo, SUTs are tested as black-box query-response systems.u003c/spanu003e

Does the benchmark test sustained multi-turn interactions?

u003cspan style=u0022font-weight: 400;u0022u003eNo. The benchmark uses human-created, u003c/spanu003eu003cbu003esingle prompts interactionsu003c/bu003eu003cspan style=u0022font-weight: 400;u0022u003e. It does not test sustained interactions with complex contexts. u003c/spanu003e

Can I choose which parts of the benchmark I would like to run?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate benchmark scoring system is calibrated to the entire set of hazard prompts. Official Tests require running all of the prompts contained in the benchmark. u003c/spanu003e

Do I have to run the entire benchmark to get a score?

u003cspan style=u0022font-weight: 400;u0022u003eYes. All hazard categories are evaluated relative to other categories, an equal comparison requires running the entire benchmark.u003c/spanu003e

What is the difference between the practice prompt set and official prompt set?

u003cspan style=u0022font-weight: 400;u0022u003eThe Practice Test (public) and Official Test (hidden) prompt sets are disjoint prompt sets that each include 12,000 prompts per language. These two prompt sets have been balanced so that a SUT should score similarly on both sets. Please do not train a SUT using the practice set. That will overfit it to the practice prompts and render the practice prompt set diagnostically useless relative to the hidden prompt set.u003c/spanu003e

Is it OK to train my SUT on the practice prompt set?

u003cspan style=u0022font-weight: 400;u0022u003eMLCommons does not recommend training on the practice prompt set. Training on the practice prompt set may undermine the ability to use the practice prompt set to predict performance on the official prompt set.u003c/spanu003e

Will I be able to review failed test cases from the Official Test?

u003cspan style=u0022font-weight: 400;u0022u003eNo. To protect the integrity of the Official Test, it will not be possible to provide a log that includes specific test items. Only the final grade report will be made available. In the case that a grade on the Official Test varies significantly from a grade on the equivalent Practice Test, please contact MLCommons at u003c/spanu003eu003ca href=u0022mailto:[email protected] style=u0022font-weight: 400;[email protected]/spanu003eu003c/au003e u003cspan style=u0022font-weight: 400;u0022u003eand we will work with you to figure out the discrepancy. When testing with the Demo or Practice Test, it is possible to see how individual prompts are classified.u003c/spanu003e

Is the AILuminate evaluator model used for scoring available?

u003cspan style=u0022font-weight: 400;u0022u003eNo, the AILuminate Evaluator Model used for scoring the benchmark is not available to maintain the integrity of the evaluations. u003c/spanu003e

Are the datasets used to train the evaluator available?

u003cspan style=u0022font-weight: 400;u0022u003eNo. The datasets used to train the AILuminate Evaluator are not shared with anyone, including members or subscribers.u003c/spanu003e

Is there a way to customize the AILuminate Benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eMLCommons welcomes collaboration with industry, academia, civil society, and other independent researchers in AI risk and reliability to join the u003c/spanu003eu003ca href=u0022https://mlcommons.org/working-groups/ai-risk-reliability/ai-risk-reliability/u0022u003eu003cspan style=u0022font-weight: 400;u0022u003eworking groupu003c/spanu003eu003c/au003eu003cspan style=u0022font-weight: 400;u0022u003e.u003c/spanu003e

Does MLCommons plan to update the AILuminate Benchmark Prompts?

u003cspan style=u0022font-weight: 400;u0022u003eOver time MLCommons recognizes that it will be necessary to augment or replace prompts from time to time to ensure that the benchmark evolves to meet changing threats and evolving safety concerns. These changes will be incorporated into and noted in each specific release version of u003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003etheu003c/spanu003eu003cspan style=u0022font-weight: 400;u0022u003e AILuminate benchmark.u003c/spanu003e

Can I add my own tests to the benchmark?

u003cspan style=u0022font-weight: 400;u0022u003eThe AILuminate Demo and Practice Test use the u003ca href=u0022https://github.com/mlcommons/modelbenchu0022u003eModelBench platformu003c/au003e, which is available under the Apache 2.0 open source license on Github. We are not adding tests at this time.u003c/spanu003e

Are there plans for agentic testing?

u003cspan style=u0022font-weight: 400;u0022u003eYes, we will announce the timeline for agentic testing in 2025.u003c/spanu003e

Are there plans to address guardrail systems?

u003cspan style=u0022font-weight: 400;u0022u003eNot at this time, as stand-alone systems – but they can be evaluated as part of SUTs.u003c/spanu003e

What if I disagree with the benchmark assessment?

u003cspan style=u0022font-weight: 400;u0022u003eFor Official Testing, the AILuminate benchmark uses an ensemble evaluator adapted to minimize false non-violating judgements while still limiting the number of false violating judgements. However, evaluator error is still possible. The benchmark contains enough test prompts that the impact of false judgements is limited. If you have any questions about an official score, please contact us at u003c/spanu003eu003ca href=u0022mailto:[email protected] style=u0022font-weight: 400;[email protected]/spanu003eu003c/au003eu003cspan style=u0022font-weight: 400;u0022u003e.u003c/spanu003e

How long is a benchmark score valid?

u003cspan style=u0022font-weight: 400;u0022u003eEvery official score is associated with a specific release version of the AILuminate benchmark.u003c/spanu003e

Can I add my own tests to the benchmark?

How can I provide feedback on the design?

u003cspan style=u0022font-weight: 400;u0022u003eWe welcome participants to join the working group, and you can also send an email to u003c/spanu003eu003ca href=u0022mailto:[email protected] style=u0022font-weight: 400;[email protected]/spanu003eu003c/au003eu003cspan style=u0022font-weight: 400;u0022u003e.u003c/spanu003e