We are witnessing the rapid evolution of AI technologies, alongside early efforts to incorporate them into products and services. In response, a wide range of stakeholders, including advocacy groups and policymakers, continue to raise concerns about how we can tell whether AI technologies are safe to deploy and use. To address these concerns and provide greater transparency, the MLCommons® AI Safety (AIS) working group, convened last fall, has been working diligently to construct a neutral set of benchmarks for measuring the safety of AI systems, starting with large language models (LLMs). The working group reached a significant milestone in April with the release of a v0.5 AI Safety benchmark proof of concept (POC). Since then, it has continued to make rapid progress toward delivering an industry-standard v1.0 benchmark release later this fall. As part of MLCommons' commitment to transparency, here is an update on our progress across several important areas.

Feedback from the v0.5 POC

The v0.5 POC was intended as a first “stake in the ground” to share with the community for feedback to inform the v1.0 release. The POC focused on a single domain: general-purpose AI chat models. It ran 14 different systems-under-test against a set of tests covering 7 hazards to validate the basic structure of a safety test and the components it must contain. The team is now hard at work incorporating feedback and extending and adapting that structure to a full taxonomy of hazards for inclusion in the v1.0 release this fall.
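To make that structure concrete, here is a minimal, purely hypothetical sketch of what such a harness can look like: a set of hazard-specific prompt collections, a system under test (SUT) that maps prompts to responses, and an evaluator that judges each response. None of the names or logic below come from the MLCommons codebase; they simply illustrate the shape of a safety test.

```python
# Hypothetical sketch of a hazard-based safety test harness. All names and
# structures are illustrative and do not reflect the MLCommons implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyTest:
    hazard: str         # e.g. "example_hazard" (placeholder label)
    prompts: list[str]  # prompts that probe this hazard

def run_benchmark(
    sut: Callable[[str], str],              # system under test: prompt -> response
    tests: list[SafetyTest],
    evaluator: Callable[[str, str], bool],  # (prompt, response) -> response is safe?
) -> dict[str, float]:
    """Return the fraction of safe responses per hazard for one SUT."""
    results = {}
    for test in tests:
        safe = sum(evaluator(p, sut(p)) for p in test.prompts)
        results[test.hazard] = safe / len(test.prompts)
    return results

# Toy usage: a canned SUT and a keyword evaluator stand in for a real chat
# model and a tuned evaluator model.
if __name__ == "__main__":
    tests = [SafetyTest("example_hazard", ["prompt 1", "prompt 2"])]
    scores = run_benchmark(
        sut=lambda prompt: "I can't help with that.",
        tests=tests,
        evaluator=lambda prompt, response: "can't help" in response,
    )
    print(scores)  # {'example_hazard': 1.0}
```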

The POC also did its job of provoking thoughtful conversations around key elements of the benchmark, such as how transparent test prompts and evaluators should be in an adversarial ecosystem where technology providers and policymakers are attempting to keep users safe from threats, and from bad actors who could leverage openness to their advantage. In a sense, complete transparency for safety benchmarks is akin to telling students what questions will be on a test before they take it. To preserve the integrity of the tests, the v1.0 release will include a degree of “hidden testing,” in which a portion of the prompts remains undisclosed.
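As a simple illustration of the idea (not the actual v1.0 mechanism), a benchmark operator could hold back part of the prompt pool along these lines; the split fraction and seeding below are placeholders.

```python
# Hypothetical public/hidden prompt split. The real v1.0 split mechanism and
# proportions are decided by the working group and are not specified here.
import random

def split_prompts(prompts: list[str], hidden_fraction: float = 0.5, seed: int = 0):
    """Partition a prompt pool into a disclosed set and an undisclosed set."""
    rng = random.Random(seed)
    shuffled = prompts[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * hidden_fraction)
    return shuffled[cut:], shuffled[:cut]  # (public, hidden)

public_prompts, hidden_prompts = split_prompts([f"prompt {i}" for i in range(100)])
```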

Moving forward

To ensure a comprehensive and agile approach to the benchmark, the working group has established eight workstreams that are working in parallel to build the v1.0 release and to look ahead to future iterations.

  • Commercial beta users: providing user-focused product management for the benchmarks and platform, with an initial focus on second-party testers, including purchasers, system integrators, and deployers;
  • Evaluator models: creating a pipeline to tune models to automate a portion of the evaluation, and generating additional data in a continual learning loop;
  • Grading, scoring, and reporting: developing useful and usable systems for quantifying and reporting statistically valid benchmark results for AI safety (a simplified scoring sketch follows this list);
  • Hazards: identifying, describing, and prioritizing hazards and biases that could be incorporated into the benchmark;
  • Multimodal: investigating scenarios where text prompts generate images and/or image prompts generate text, aligning with specific harms, generating relevant prompts, and tracking the state of the art with partner organizations;
  • Prompts: creating test specifications, acquiring and curating a collection of prompts to AI systems that might generate unsafe results, and organizing and reporting on human verification of prompt quality;
  • Scope: defining the scope of each release, including hazard taxonomy definitions, use cases, personas, and languages for localization;
  • Test integrity and score reliability: ensuring that benchmark results are consistent and that the benchmark cannot be “gamed” by a bad actor, and assessing release options.
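The following is a simplified, purely illustrative sketch of the kind of statistically grounded reporting the grading workstream is concerned with: it summarizes per-hazard safe-response counts with a standard Wilson score interval. The actual v1.0 grading scheme, thresholds, and statistical methodology are defined by the working group, not by this example.

```python
# Illustrative reporting sketch: per-hazard safe-response rates with 95%
# Wilson score intervals. This is not the MLCommons grading methodology.
import math

def wilson_interval(safe: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed safe-response rate."""
    if total == 0:
        return (0.0, 1.0)
    p = safe / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = (z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))

def report(per_hazard: dict[str, tuple[int, int]]) -> None:
    """per_hazard maps a hazard name to (safe responses, total prompts)."""
    for hazard, (safe, total) in per_hazard.items():
        low, high = wilson_interval(safe, total)
        print(f"{hazard}: {safe}/{total} safe (95% CI {low:.2f}-{high:.2f})")

report({"example_hazard": (92, 100)})
```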

“I’m proud of the team that has assembled to work on the v1.0 release and beyond,” said Rebecca Weiss, Executive Director of MLCommons and AI Safety lead. “We have some of the best and brightest from industry, research, and academia collaborating to move each workstream forward. A broad community of diverse stakeholders, both individual contributors and their employers, came together to help solve this difficult problem, and the emerging results are exciting and visionary.” 

Ensuring Diverse Representation

The working group has completed the initial version of the benchmark’s taxonomy of hazards, which will initially be supported in four languages: English, French, Simplified Chinese, and Hindi. To provide an AI Safety benchmark that tests the full range of the taxonomy and is broadly applicable, the Prompts workstream is sourcing written prompts through two mechanisms. First, it is contracting directly with a set of primary suppliers for prompt generation, to ensure complete coverage of the taxonomy of hazards. Second, it is publicly sourcing additional prompts through an Expression of Interest (EOI) to draw on a diverse pool of deep expertise across different languages, geographies, disciplines, industries, and hazards. This work is ongoing to ensure continued responsiveness to emerging and changing hazards.

Global Partnerships

The team is building a network of global partnerships to ensure access to relevant expertise, data, and technology. In May, MLCommons signed an MOI with the Singapore-based AI Verify Foundation to collaborate on developing a set of common safety testing benchmarks for generative AI models. The working group is also looking to incorporate evaluator models from multiple sources. This will enable the AI Safety benchmark to address a wide range of global hazards, incorporating regional and cultural nuances as well as domain-specific threats.

Over 50 academic, industry, and civil society organizations have provided essential contributions to the MLCommons AI Safety working group, with initial funding from Google, Intel, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. to support the version 1.0 release of the AI Safety benchmark. 

Long-term goals

As the version 1.0 release approaches completion later this fall, the team is already planning for future releases. “This is a long-term process, and an organic, rapidly evolving ecosystem,” said Peter Mattson, MLCommons President and co-chair of the AI Safety working group. “Bad actors will adapt and hazards will continue to shift and change, so we in turn must keep investing to ensure that we have strong, effective, and responsive benchmark testing for the safety of AI systems. AI safety benchmarking is progressing quickly thanks to the contributions of the several dozen individuals and organizations who support our effort, but the AI industry is also moving fast and our challenge is to build a safety benchmark process that can keep pace with it. At the moment, we have organized the AI Safety effort to operate at the pace of a startup, given the rate of development of AI-deploying organizations. But as more AI companies mature into multi-billion dollar industries, our benchmarking efforts will also evolve towards a long-term sustainable model with equivalent maturity.”

“The AI Safety benchmark helps us to support policymakers around the world (many of whom are struggling to keep pace with a fast-paced, emerging industry) with rigorous technical measurement methodology,” said Weiss. “Benchmarks provide a common frame of reference, and they can ground discussions in facts and evidence. We want to make sure that the AI Safety benchmark not only accelerates technology and product improvements by providing careful measurements, but also helps to inspire thoughtful policies supported by the technical metrics needed to measure the behavior of complex AI systems.”

MLCommons welcomes additional participation to help shape the v1.0 AI Safety benchmark suite and beyond. To contribute, please join the MLCommons AI Safety working group.