Technical Users
This page provides more information about the technical components of the AILuminate benchmark and how they work together to benchmark an AI system-under-test (SUT). The AILuminate benchmark was developed through a rapid, iterative process, and we continue to solicit feedback on the methodology. To share feedback, we encourage you to join the MLCommons AI Risk & Reliability working group.
Key elements of the benchmark
The AILuminate benchmark is composed of five main components:
- The Assessment Standard, which defines the user personas and hazards to be tested and provides guidelines for how responses to prompts are evaluated.
- Over 24,000 custom-written prompts (12,000 public practice prompts and 12,000 private prompts for the Official Test) that resemble "typical" prompts to a general-purpose generative AI chatbot. These exercise the system-under-test (SUT) across the taxonomy of hazards described in the Assessment Standard.
- A cloud service that prompts a SUT, one prompt at a time, and receives the responses.
- The AILuminate state-of-the-art ensemble evaluator model, which classifies the responses following the guidelines in the Assessment Standard.
- A SUT report that uses the AILuminate v1.0 benchmark grading scale to assess the quantitative and qualitative performance of the SUT overall and in each hazard category.
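Taken together, these components form a simple pipeline: each prompt is sent to the SUT, the response is classified by the evaluator, and per-hazard results are aggregated into the report. The following is a minimal sketch of that flow, where `query_sut` and `classify_response` are hypothetical stand-ins for the real cloud service and ensemble evaluator, not part of the benchmark itself:

```python
from collections import defaultdict

def run_benchmark(prompts, query_sut, classify_response):
    """Sketch of the AILuminate flow: prompt -> SUT -> evaluator -> per-hazard tally.

    `prompts` is an iterable of (hazard_category, prompt_text) pairs;
    `query_sut` and `classify_response` are caller-supplied stand-ins for
    the real cloud service and ensemble evaluator (assumed, not shown here).
    """
    tally = defaultdict(lambda: {"safe": 0, "unsafe": 0})
    for hazard, text in prompts:
        response = query_sut(text)                    # one prompt at a time
        verdict = classify_response(text, response)   # "safe" or "unsafe"
        tally[hazard][verdict] += 1
    # Per-hazard safe-response rates feed the SUT report and grading scale.
    return {h: c["safe"] / (c["safe"] + c["unsafe"]) for h, c in tally.items()}
```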
SUT requirements
A SUT must implement a chat-style interface, but can do so in a variety of ways: through a single model, an ensemble of models, or a mixture of model(s) and algorithm(s). For official online testing, the model must be available from a hosting service or through an API.
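For illustration only, a SUT hosted behind an OpenAI-compatible chat completions endpoint could be queried with a thin wrapper like the one below; the endpoint URL, model name, and credential are placeholders rather than anything the benchmark prescribes, and the actual hosting service used for official online testing may differ:

```python
import os
import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder hosting endpoint
API_KEY = os.environ["SUT_API_KEY"]                   # placeholder credential

def query_sut(prompt: str, model: str = "my-chat-model") -> str:
    """Send a single user prompt to a chat-style SUT and return its reply.

    Assumes an OpenAI-compatible chat completions API for illustration.
    """
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```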
Who should use the benchmark
The AILuminate benchmark is designed for:
- Responsible AI technical teams who want to integrate a standardized tool into their responsible AI stack and/or evaluate models for procurement.
- Machine learning engineers, data scientists, and researchers tuning or training interactive LLMs who want a standard tool for measuring alignment; who want to keep testing with our practice test tool until a model is ready and then submit it to our service for final evaluation; who want to verify that a model stays aligned after RAG, training, or tuning; and who want independent third-party validation as part of a red-teaming stack.
- Risk managers who want to set a baseline based on industry-standard tools, set realistic goals, and use an independent monitoring tool to identify alignment drift.
How to use the benchmark
We provide four variations of the AILuminate benchmark, designed to make it useful in a wide range of situations for model trainers and tuners, security teams, and responsible AI teams.
Demo Test
The Demo Test is a completely open-source, self-contained system designed as an offline solution for anyone who wants to use the AILuminate benchmark to evaluate a system, test any SUT, or embed the benchmark as a component in other systems. The demo test uses a 10% random sample of the official test prompts and uses Llama Guard as the evaluator model to classify SUT responses by hazard category.
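As an illustration of the sampling idea only (the demo test ships its own fixed sample), a 10% subset could be drawn from an assumed CSV prompt file like this:

```python
import csv
import random

def sample_demo_prompts(path: str, fraction: float = 0.10, seed: int = 0) -> list:
    """Draw a reproducible random subset of prompts, mirroring the ~10% demo sample.

    Assumes a CSV file with one prompt per row; this is a sketch, not the
    demo test's actual sampling procedure.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rng = random.Random(seed)
    return rng.sample(rows, k=max(1, int(len(rows) * fraction)))
```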
Offline Practice Test
The Offline Practice Test supports iterative, in-house alignment of SUTs to the AILuminate benchmark using the full practice prompt set and an open-source evaluator such as Llama Guard. Under most circumstances it should have broadly the same statistical performance as the official test, and it can be used throughout alignment work.
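As a rough sketch of what an open-source evaluator pass can look like, the snippet below scores a single prompt/response pair with a Llama Guard model via Hugging Face transformers. The specific model variant, generation settings, and any prompt-template details used by the offline practice tooling are assumptions here, not the benchmark's own configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed open-source evaluator checkpoint; the variant actually used may differ.
MODEL_ID = "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_response(prompt: str, response: str) -> str:
    """Return Llama Guard's verdict for one prompt/response pair.

    The model emits "safe", or "unsafe" followed by the hazard codes it flags.
    """
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
```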
Online Practice Test
The Online Practice Test is used to prepare a SUT for benchmarking. It uses the same systems as the official benchmark, including the private AILuminate evaluator, but runs against a large (approximately 50%) random subset of the full benchmark prompt set, so it provides a very close approximation of the official test. Results from the online practice test cannot be published with the MLCommons name and trademark, but they can be used internally.
Official Test
The Official Test is the formal benchmarking test, used when a SUT is ready to be officially benchmarked against the complete, confidential prompt set with the private AILuminate evaluator. When you are ready to run an official test, please reach out so we can configure the system and your results can be validated and published. After an official benchmark test, final results can be published using the MLCommons name and trademark.
AILuminate benchmark test types:
| AILuminate Version | Prompt License | Prompts | Evaluator | Offline/Online | Results Publishable |
|---|---|---|---|---|---|
| Demo Test | CC-BY 4.0 | 1,200 | Open source (Llama Guard, etc.) | Offline | No |
| Offline Practice Test | MLCommons AILuminate | 12,000 | Open source | Offline | No |
| Online Practice Test | MLCommons AILuminate | 12,000 | AILuminate private | Online | No |
| Official Test | Confidential | 12,000 | AILuminate private | Online | Yes |
To inquire about testing your system, complete this form.
For AI developers and researchers
Quick start guide:
- AILuminate is available as ModelBench on GitHub
- Requirements:
  - An account on TogetherAI
  - Install Poetry
- The GitHub page includes installation instructions and a brief tutorial on how to run and review a test
Information:
- AILuminate is available as ModelBench on GitHub
- Background and Installation
- Release notes
- Tutorial: Adding a New SUT
- AILuminate is distributed under the Apache 2.0 license
How to contact us:
- Discord
- For support questions, email: [email protected]
For Language Model Tuners
Machine learning engineers, data scientists, and researchers tuning or training interactive LLMs who want them to stay aligned will benefit from the AILuminate benchmark. Aligned, broadly safe LLMs have been shown to lose alignment after further tuning or training. The AILuminate benchmark is an independent, third-party benchmark that helps measure how much a model drifts after tuning or training.
Benchmarking is an important component of an overall observability, evaluation, or red teaming process and toolkit. Standardized metrics allow you to measure improvement, compare apples to apples when training and tuning models, and track alignment drift through time.
- The AILuminate open source testing framework allows for unlimited iterated offline testing of your SUT with open source guardrails tools such as Llama Guard.
- The online tool suite provides secure, independent, enterprise-grade, third-party testing using our dataset of unique, regularly updated, custom prompts, and it can be integrated into your in-house testing platform. The ensemble response evaluator combines multiple models to achieve best-in-class classification results.
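As a sketch of what tracking alignment drift over time can look like, the snippet below compares per-hazard safe-response rates between two evaluation runs (for example, before and after tuning). The record format is hypothetical and is not the benchmark's own report schema:

```python
from collections import defaultdict

def safe_rates(records):
    """records: iterable of (hazard_category, verdict) pairs, verdict in {"safe", "unsafe"}."""
    counts = defaultdict(lambda: [0, 0])  # hazard -> [safe count, total count]
    for hazard, verdict in records:
        counts[hazard][0] += verdict == "safe"
        counts[hazard][1] += 1
    return {hazard: safe / total for hazard, (safe, total) in counts.items()}

def alignment_drift(before, after):
    """Per-hazard change in safe-response rate after tuning (negative values flag regressions)."""
    baseline, tuned = safe_rates(before), safe_rates(after)
    return {hazard: tuned.get(hazard, 0.0) - rate for hazard, rate in baseline.items()}
```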
To integrate the benchmark into your in-house safety, red-teaming, or observability suite, contact us by completing this form.
For Risk Managers
The AILuminate benchmark is an open-source, industry-adopted technology for understanding, measuring, and making progress on AI risk and reliability. It is designed for risk managers who are seeking:
- Industry-aligned definitions
- Clear measurement tools
- Actionable insights
The sections below detail how to use the report produced by the AILuminate benchmark to identify your organization's AI risks.
Use the AILuminate benchmark to establish a baseline with industry-aligned definitions
Obtaining a clear understanding of an organization's AI application performance is a critical challenge in AI risk management. Without a directive from policy and regulation, this clarity can remain elusive because of varying definitions, standards, and approaches across and within organizations. Our Assessment Standard details our industry-aligned definitions of what the benchmark measures and how we brought policy-makers, technology companies, and industry leaders together to define them for general use cases.
The AILuminate benchmark not only provides aligned definitions of safety, it demonstrates how to measure them. Any organization can benchmark their SUT using our evaluator and prompt library to understand their performance. The How to use the benchmark page provides step-by-step guidance on how to measure your system to determine where your current safety measures are adequate and where there are potential gaps.
Use the AILuminate benchmark to determine realistic goals based on clear measurement tools
Understanding where your organization's models stand is just the beginning. A broader perspective helps organizations using LLMs in a production environment set achievable goals and guide improvements as models and components change. With our benchmark, you can:
- Compare your AI against industry leaders
- Identify areas where you are ahead
- Spot opportunities where you need to leapfrog
The benchmark can also help you recognize red flags so you can:
- Clearly define and identify unsafe AI behavior
- Establish non-negotiable safety thresholds
- Prioritize critical areas for immediate improvement
Continued measurement against an evolving landscape is necessary but difficult to do alone. Explore the AILuminate Benchmark to see where the current state-of-the-art models are trending. View the Resources page for documentation to understand what you need in place to submit your system.
Actionable insights: Using the AILuminate benchmark to monitor and report progress
In the complex world of risk management, AI risk is a moving target against the backdrop of known and real security risks. The AILuminate benchmark offers transparent reporting that can be included in your own risk framework, with clarity on where you are exceeding and where you are falling behind. View the Resources page to learn how to incorporate the AILuminate benchmark into your risk management framework.
Tailor the AILuminate benchmark to your needs
Every industry faces unique AI challenges. That's why we offer:
- Working groups to collaborate with peers to address specific AI risks
- Pilot projects to customize the benchmark to your use case
Visit our Methodology, prompts, and pilot projects page to learn more.
Our FAQ addresses the most common questions about the AILuminate benchmark.