Technical Users
This page provides more information about the technical components of the AILuminate benchmark and how they work together to benchmark an AI system-under-test (SUT). The AILuminate benchmark was developed through a rapid, iterative process, and we continue to solicit feedback on the methodology. To share feedback, we encourage you to join the MLCommons AI Risk & Reliability working group.
Key elements of the benchmark
The AILuminate benchmark is composed of five main components:
- The Assessment Standard, which defines the user personas and hazards to be tested and provides guidelines for how responses to prompts are evaluated.
- Over 24,000 custom-written prompts (12,000 public practice prompts and 12,000 private prompts for the Official Test) that resemble "typical" prompts to a general-purpose generative AI chatbot. These exercise the system-under-test (SUT) across the taxonomy of hazards described in the Assessment Standard.
- A cloud service that prompts a SUT, one prompt at a time, and receives the responses.
- The AILuminate state-of-the-art ensemble evaluator model, which classifies the responses following the guidelines in the Assessment Standard.
- A SUT report that uses the AILuminate v1.0 benchmark grading scale to assess the quantitative and qualitative performance of the SUT overall and in each hazard category.
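Taken together, these components form a simple pipeline: each prompt is sent to the SUT, the response is classified by the evaluator, and per-hazard results are aggregated into the report. The following is a minimal sketch of that flow, where `query_sut` and `classify_response` are hypothetical stand-ins for the real cloud service and ensemble evaluator, not part of the benchmark itself:

```python
from collections import defaultdict

def run_benchmark(prompts, query_sut, classify_response):
    """Sketch of the AILuminate flow: prompt -> SUT -> evaluator -> per-hazard tally.

    `prompts` is an iterable of (hazard_category, prompt_text) pairs;
    `query_sut` and `classify_response` are caller-supplied stand-ins for
    the real cloud service and ensemble evaluator (assumed, not shown here).
    """
    tally = defaultdict(lambda: {"safe": 0, "unsafe": 0})
    for hazard, text in prompts:
        response = query_sut(text)                    # one prompt at a time
        verdict = classify_response(text, response)   # "safe" or "unsafe"
        tally[hazard][verdict] += 1
    # Per-hazard safe-response rates feed the SUT report and grading scale.
    return {h: c["safe"] / (c["safe"] + c["unsafe"]) for h, c in tally.items()}
```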
SUT requirements
A SUT must implement a chat-style interface, but can do so in a variety of ways: through a single model, an ensemble of models, or a mixture of model(s) and algorithm(s). For official online testing, the model must be available from a hosting service or through an API.
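For illustration only, a SUT hosted behind an OpenAI-compatible chat completions endpoint could be queried with a thin wrapper like the one below; the endpoint URL, model name, and credential are placeholders rather than anything the benchmark prescribes, and the actual hosting service used for official online testing may differ:

```python
import os
import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder hosting endpoint
API_KEY = os.environ["SUT_API_KEY"]                   # placeholder credential

def query_sut(prompt: str, model: str = "my-chat-model") -> str:
    """Send a single user prompt to a chat-style SUT and return its reply.

    Assumes an OpenAI-compatible chat completions API for illustration.
    """
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```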
Who should use the benchmark
The AILuminate benchmark is designed for:
- Responsible AI technical teams who want to integrate a standardized tool into their responsible AI stack and/or evaluate models for procurement.
- Machine learning engineers, data scientists, and researchers tuning or training interactive LLMs who want a standard tool for measuring alignment; who want to keep testing with our practice test tool until a model is ready and then submit it to our service for final evaluation; who want to verify that a model stays aligned after RAG, training, or tuning; and who want independent third-party validation as part of a red-teaming stack.
- Risk managers who want to set a baseline based on industry-standard tools, set realistic goals, and use an independent monitoring tool to identify alignment drift.
How to use the benchmark
We provide four variations of the AILuminate benchmark, designed to make it useful in a wide range of situations for model trainers and tuners, security teams, and responsible AI teams.
Demo Test
The Demo Test is a completely open-source, self-contained system designed as an offline solution for anyone who wants to use the AILuminate benchmark to evaluate a system, test any SUT, or embed the benchmark as a component in other systems. The demo test uses a 10% random sample of the official test prompts and uses Llama Guard as the evaluator model to classify SUT responses by hazard category.
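As an illustration of the sampling idea only (the demo test ships its own fixed sample), a 10% subset could be drawn from an assumed CSV prompt file like this:

```python
import csv
import random

def sample_demo_prompts(path: str, fraction: float = 0.10, seed: int = 0) -> list:
    """Draw a reproducible random subset of prompts, mirroring the ~10% demo sample.

    Assumes a CSV file with one prompt per row; this is a sketch, not the
    demo test's actual sampling procedure.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rng = random.Random(seed)
    return rng.sample(rows, k=max(1, int(len(rows) * fraction)))
```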
Offline Practice Test
The Offline Practice Test supports iterative, in-house alignment of SUTs to the AILuminate benchmark using the full practice prompt set and an open-source evaluator such as Llama Guard. Under most circumstances it should have broadly the same statistical performance as the official test, and it can be used throughout alignment work.
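As a rough sketch of what an open-source evaluator pass can look like, the snippet below scores a single prompt/response pair with a Llama Guard model via Hugging Face transformers. The specific model variant, generation settings, and any prompt-template details used by the offline practice tooling are assumptions here, not the benchmark's own configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed open-source evaluator checkpoint; the variant actually used may differ.
MODEL_ID = "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_response(prompt: str, response: str) -> str:
    """Return Llama Guard's verdict for one prompt/response pair.

    The model emits "safe", or "unsafe" followed by the hazard codes it flags.
    """
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
```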
Online Practice Test
The Online Practice Test is used to prepare a SUT for benchmarking. It uses the same systems as the official benchmark, including the private AILuminate evaluator, but runs against a large (approximately 50%) random subset of the full benchmark prompt set, so it provides a very close approximation of the official test. Results from the online practice test cannot be published with the MLCommons name and trademark, but they can be used internally.
Official Test
The Official Test is the formal benchmarking test, used when a SUT is ready to be officially benchmarked against the complete, confidential prompt set with the private AILuminate evaluator. When you are ready to run an official test, please reach out so we can configure the system and your results can be validated and published. After an official benchmark test, final results can be published using the MLCommons name and trademark.
AILuminate benchmark test types:
| AILuminate Version | Prompt License | Prompts | Evaluator | Offline/Online | Results Publishable |
|---|---|---|---|---|---|
| Demo Test | CC-BY 4.0 | 1,200 | Open source (Llama Guard, etc.) | Offline | No |
| Offline Practice Test | MLCommons AILuminate | 12,000 | Open source | Offline | No |
| Online Practice Test | MLCommons AILuminate | 12,000 | AILuminate private | Online | No |
| Official Test | Confidential | 12,000 | AILuminate private | Online | Yes |
To inquire about testing your system, complete this form.
For AI developers and researchers
Quick start guide:
- AILuminate is available as ModelBench on GitHub
- Requirements:
  - An account on TogetherAI
  - Install Poetry
- The GitHub page includes installation instructions and a brief tutorial on how to run and review a test
Information:
- AILuminate is available as ModelBench on GitHub
- Background and Installation
- Release notes
- Tutorial: Adding a New SUT
- AILuminate is distributed under the Apache 2.0 license
How to contact us:
- Discord
- For support questions, email: [email protected]
For Language Model Tuners
Machine learning engineers, data scientists, and researchers tuning or training interactive LLMs who want them to stay aligned will benefit from the AILuminate benchmark. Aligned, broadly safe LLMs have been shown to lose alignment after further tuning or training. The AILuminate benchmark is an independent, third-party benchmark that helps measure how much a model drifts after tuning or training.
Benchmarking is an important component of an overall observability, evaluation, or red teaming process and toolkit. Standardized metrics allow you to measure improvement, compare apples to apples when training and tuning models, and track alignment drift through time.
- The AILuminate open source testing framework allows for unlimited iterated offline testing of your SUT with open source guardrails tools such as Llama Guard.
- The online tool suite provides secure, independent, enterprise-grade, third-party testing using our dataset of unique, regularly updated, custom prompts, and it can be integrated into your in-house testing platform. The ensemble response evaluator combines multiple models to achieve best-in-class classification results.
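As a sketch of what tracking alignment drift over time can look like, the snippet below compares per-hazard safe-response rates between two evaluation runs (for example, before and after tuning). The record format is hypothetical and is not the benchmark's own report schema:

```python
from collections import defaultdict

def safe_rates(records):
    """records: iterable of (hazard_category, verdict) pairs, verdict in {"safe", "unsafe"}."""
    counts = defaultdict(lambda: [0, 0])  # hazard -> [safe count, total count]
    for hazard, verdict in records:
        counts[hazard][0] += verdict == "safe"
        counts[hazard][1] += 1
    return {hazard: safe / total for hazard, (safe, total) in counts.items()}

def alignment_drift(before, after):
    """Per-hazard change in safe-response rate after tuning (negative values flag regressions)."""
    baseline, tuned = safe_rates(before), safe_rates(after)
    return {hazard: tuned.get(hazard, 0.0) - rate for hazard, rate in baseline.items()}
```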
To integrate the benchmark into your in-house safety, red-teaming, or observability suite, contact us by completing this form.
For Risk Managers
The AILuminate benchmark is an open-source, industry-adopted technology for understanding, measuring, and making progress on AI risk and reliability. It is designed for risk managers who are seeking:
- Industry-aligned definitions
- Clear measurement tools
- Actionable insights
The sections below detail how to use the report produced by the AILuminate benchmark to identify your organization's AI risks.
Use the AILuminate benchmark to establish a baseline with industry-aligned definitions
Obtaining a clear understanding of an organization's AI application performance is a critical challenge in AI risk management. Without a directive from policy and regulation, this clarity can remain elusive because of varying definitions, standards, and approaches across and within organizations. Our Assessment Standard details our industry-aligned definitions of what the benchmark measures and how we brought policy-makers, technology companies, and industry leaders together to define them for general use cases.
The AILuminate benchmark not only provides aligned definitions of safety, it demonstrates how to measure them. Any organization can benchmark their SUT using our evaluator and prompt library to understand their performance. The How to use the benchmark page provides step-by-step guidance on how to measure your system to determine where your current safety measures are adequate and where there are potential gaps.
Use the AILuminate benchmark to determine realistic goals based on clear measurement tools
Understanding where your organization's models stand is just the beginning. A broader perspective helps organizations using LLMs in a production environment set achievable goals and guide improvements as models and components change. With our benchmark, you can:
- Compare your AI against industry leaders
- Identify areas where you are ahead
- Spot opportunities where you need to leapfrog
The benchmark can also help you recognize red flags so you can:
- Clearly define and identify unsafe AI behavior
- Establish non-negotiable safety thresholds
- Prioritize critical areas for immediate improvement
Continued measurement against an evolving landscape is necessary but difficult to do alone. Explore the AILuminate Benchmark to see where the current state-of-the-art models are trending. View the Resources page for documentation to understand what you need in place to submit your system.
Actionable insights: Using the AILuminate benchmark to monitor and report progress
In the complex world of risk management, AI risk is a moving target against the backdrop of known and real security risks. The AILuminate benchmark offers transparent reporting that can be included in your own risk framework, with clarity on where you are exceeding and where you are falling behind. View the Resources page to learn how to incorporate the AILuminate benchmark into your risk management framework.
Tailor the AILuminate benchmark to your needs
Every industry faces unique AI challenges. That's why we offer:
- Working groups to collaborate with peers to address specific AI risks
- Pilot projects to customize the benchmark to your use case
Visit our Methodology, prompts, and pilot projects page to learn more.
Our FAQ addresses the most common questions about the AILuminate benchmark.