This page provides more information about the technical components of the AILuminate benchmark and how they work together to benchmark an AI system-under-test (SUT). The AILuminate benchmark was developed through a rapid, iterative process, and we continue to solicit feedback on the methodology. To share feedback, we encourage you to join the MLCommons AI Risk & Reliability working group.

The AILuminate benchmark is composed of five main components (a sketch of how they fit together follows the list):

  • The Assessment Standard which defines the user personas and hazards to be tested, and provides guidelines for how responses to prompts are evaluated.  
  • Over 24,000 custom-written prompts (12,000 public practice prompts and 12,000 private prompts for the Official Test) that resemble “typical” prompts for a general-purpose generative AI chatbot. These exercise the system-under-test (SUT) across the taxonomy of hazards described in the Assessment Standard.
  • A cloud service that prompts a SUT, one prompt at a time, and receives the responses.
  • The AILuminate state-of-the-art ensemble evaluator model that classifies the responses following the guidelines in the Assessment Standard.
  • A SUT report that uses the AILuminate v1.0 benchmark grading scale to assess the quantitative and qualitative performance of the SUT overall and in each hazard category.
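As a rough illustration of how these components interact, here is a minimal sketch in Python. The names (`sut.chat`, `classify`, `prompt.hazard`) are hypothetical stand-ins, not the actual AILuminate implementation: each prompt is sent to the SUT one at a time, each response is classified by the evaluator, and per-hazard safe-response rates are aggregated for the report.

```python
# Minimal sketch of the benchmark flow; all names are hypothetical.
from collections import defaultdict

def run_benchmark(sut, prompts, classify):
    """Send each prompt to the SUT and tally safe responses per hazard."""
    totals = defaultdict(int)
    safe = defaultdict(int)
    for prompt in prompts:                    # the harness sends one prompt at a time
        response = sut.chat(prompt.text)      # SUT exposes a chat-style interface
        verdict = classify(prompt, response)  # ensemble evaluator labels the response
        totals[prompt.hazard] += 1
        if verdict == "safe":
            safe[prompt.hazard] += 1
    # Per-hazard safe-response rates feed the graded SUT report.
    return {hazard: safe[hazard] / totals[hazard] for hazard in totals}
```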

A SUT must implement a chat-style interface, but can do so in a variety of ways: through a single model, an ensemble of models, or a mixture of model(s) and algorithm(s). For official online testing, the model must be available from a hosting service or through an API.
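For illustration only, a SUT wrapper around a hosted model might look like the sketch below. It assumes an OpenAI-compatible chat-completions endpoint; the class name, endpoint schema, and response field paths are assumptions, not requirements of the benchmark.

```python
# Hypothetical chat-style SUT wrapper around a hosted model API.
import requests

class HostedSUT:
    """Exposes the simple chat interface a test harness would call."""
    def __init__(self, endpoint: str, api_key: str, model: str):
        self.endpoint, self.api_key, self.model = endpoint, api_key, model

    def chat(self, user_message: str) -> str:
        resp = requests.post(
            self.endpoint,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": user_message}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        # Assumes an OpenAI-style response shape.
        return resp.json()["choices"][0]["message"]["content"]
```

The same `chat` interface could just as well be backed by an ensemble of models, or by models combined with guardrail algorithms, as noted above.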

The AILuminate benchmark is designed for:

Responsible AI technical teams who want to integrate a standardized tool into their responsible AI stack and/or evaluate models for procurement.

Machine learning engineers, data scientists, and researchers tuning or training interactive LLMs who want a standard tool for measuring alignment; who want to verify that a model stays aligned after RAG training or tuning; or who want independent third-party validation as part of a red-teaming stack. You can keep testing with our practice test tool until your model is ready, then submit it to our service for final evaluation.

Risk managers who want to set a baseline using industry-standard tools, set realistic goals, and use an independent monitoring tool to identify alignment drift.

How to use the benchmark

We provide four variations of the AILuminate benchmark, designed to make it useful in a wide range of situations for model trainers and tuners as well as security and responsible AI teams.

AILuminate benchmark test types:
| AILuminate Version | Prompt License | Prompts | Evaluator | Offline/Online | Results Publishable |
| --- | --- | --- | --- | --- | --- |
| Demo Test | CC-BY 4.0 | 1,200 | Open source (Llama Guard etc.) | Offline | No |
| Offline Practice Test | MLCommons AILuminate | 12,000 | Open source | Offline | No |
| Online Practice Test | MLCommons AILuminate | 12,000 | AILuminate private | Online | No |
| Official Test | Confidential | 12,000 | AILuminate private | Online | Yes |

To inquire about testing your system complete this form.


Machine learning engineers, data scientists, and researchers tuning or training interactive LLMs who want them to stay aligned will benefit from the AILuminate benchmark. Aligned, broadly safe LLMs have been shown to lose alignment after further tuning or training. The AILuminate benchmark is an independent, third-party benchmark that helps measure how much a model drifts after tuning or training.

Benchmarking is an important component of an overall observability, evaluation, or red-teaming process and toolkit. Standardized metrics allow you to measure improvement, make apples-to-apples comparisons when training and tuning models, and track alignment drift over time.
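As a simple illustration of tracking drift over time, the sketch below compares per-hazard safe-response rates from two benchmark runs. The hazard names and rates are invented for the example; they are not AILuminate results.

```python
# Illustrative drift check between two benchmark runs.
# Each dict maps hazard category -> safe-response rate (invented values).
def alignment_drift(baseline: dict, current: dict) -> dict:
    """Change in safe-response rate per hazard; negative means regression."""
    return {h: current[h] - baseline[h] for h in baseline if h in current}

baseline_run = {"violent_crimes": 0.97, "privacy": 0.94}
post_tuning_run = {"violent_crimes": 0.95, "privacy": 0.96}

drift = alignment_drift(baseline_run, post_tuning_run)
# Flags hazard categories whose safe-response rate dropped after tuning.
regressions = {h: d for h, d in drift.items() if d < 0}
```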

  • The AILuminate open-source testing framework allows unlimited, iterated offline testing of your SUT with open-source guardrail tools such as Llama Guard (a rough sketch of this pattern follows this list).
  • The online tool suite provides secure, independent, enterprise-grade, third-party testing using our dataset of unique, regularly updated, custom prompts, and can be integrated into your in-house testing platform. The ensemble response evaluator model combines multiple models to achieve best-in-class classification results.
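The sketch below shows one way to classify a prompt/response pair offline with an open-source guardrail model such as Llama Guard, using Hugging Face transformers. The checkpoint name, chat-template behavior, and output parsing are assumptions; check the model card for exact usage before relying on this pattern.

```python
# Rough sketch: offline safety classification with an open-source guardrail.
# Checkpoint name, template behavior, and output format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # assumed checkpoint; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def is_safe(prompt: str, response: str) -> bool:
    """Ask the guardrail model to label a prompt/response exchange."""
    chat = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=20)
    # The model is expected to answer "safe" or "unsafe" plus hazard codes.
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("safe")
```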

To integrate the benchmark into your in-house safety, red-teaming, or observability suite, contact us by completing this form.

The AILuminate benchmark is an open-source, industry-adopted technology for understanding, measuring, and making progress on AI risk and reliability. It is tuned for risk managers who are seeking:

  • Industry-aligned definitions
  • Clear measurement tools
  • Actionable insights

The sections below detail how to use the report produced by the AILuminate benchmark to identify your organization's AI risks.

Use the AILuminate benchmark to establish a baseline with industry-aligned definitions

Obtaining a clear understanding of an organization's AI application performance is a critical challenge in AI risk management. Without a directive from policy and regulation, this clarity can remain elusive because of varying definitions, standards, and approaches across and within organizations. Our Assessment Standard details our industry-aligned definitions of what the benchmark measures and how we brought policy-makers, technology companies, and industry leaders together to define them for general use cases.

The AILuminate benchmark not only provides aligned definitions of safety, it also demonstrates how to measure them. Any organization can benchmark its SUT using our evaluator and prompt library to understand its performance. The How to use the benchmark page provides step-by-step guidance on measuring your system to determine where your current safety measures are adequate and where there are potential gaps.

Use the AILuminate benchmark to determine realistic goals based on clear measurement tools

Understanding where your organization's models stand is just the beginning. A broader perspective helps organizations using LLMs in a production environment set achievable goals and guide improvements as models and components change. With our benchmark, you can:

  • Compare your AI against industry leaders
  • Identify areas where you are ahead
  • Spot opportunities where you need to leapfrog 

The benchmark can also help you recognize red flags so you can:

  • Clearly define and identify unsafe AI behavior
  • Establish non-negotiable safety thresholds
  • Prioritize critical areas for immediate improvement

Continued measurement against an evolving landscape is necessary but difficult to do alone. Explore the AILuminate Benchmark to see where the current state-of-the-art models are trending. View the Resources page for documentation to understand what you need in place to submit your system.

Actionable insights: Using the AILuminate benchmark to monitor and report progress

In the complex world of risk management, AI risk is a moving target against a backdrop of known, real security risks. The AILuminate benchmark offers transparent reporting that can be included in your own risk framework, with clarity on where you are ahead and where you are falling behind. View the Resources page to learn how to incorporate the AILuminate benchmark into your risk management framework.

Tailor the AILuminate benchmark to your needs

Every industry faces unique AI challenges. That's why we offer:

  • Working groups to collaborate with peers to address specific AI risks
  • Pilot projects to customize the benchmark to your use case

Visit our Methodology, prompts, and pilot projects page to learn more.

Our FAQ addresses the most common questions about the AILuminate benchmark.