In October 2024, MLCommons and Partnership on AI (PAI) virtually convened practitioners and policy analysts to assess the current state of general-purpose AI evaluations and the challenges to their adoption. The workshop explored collaborative approaches to responsibly evaluating and governing AI across the value chain. Against a backdrop of unprecedented regulatory activity — including over 300 federal bills, 600 state-level bills, a U.S. AI executive order, and the EU’s new draft General-Purpose AI Code of Practice — the workshop provided an opportunity to examine how each actor in the AI ecosystem contributes to safe deployment and accountability. 

From foundation model providers to downstream actors, discussions emphasized that each actor has unique roles, obligations, and needs for guidance, highlighting the limitations of a “one-size-fits-all” approach. PAI’s recent work on distributed risk mitigation strategies for open foundation models, along with MLCommons’ development of safety benchmarks for AI, highlights the need for evaluation and governance practices tailored to each actor’s unique role. Together, these efforts blend normative guidance with technical evaluation, supporting a comprehensive approach to AI safety.

First Workshop Takeaway: Layered and Role-specific evaluation approaches 

The first takeaway is the importance of layered and role-specific evaluation approaches across the AI ecosystem. Workshop participants recognized that different actors—such as foundation model providers, model adapters who fine-tune these models, model hubs and model integrators—play unique roles that require tailored evaluation strategies. This includes comprehensive evaluation immediately after a model is initially trained, using both benchmarking and adversarial testing. For model adapters, post-training evaluation and adjustments during fine-tuning are essential to uncover new risks and ensure safety.

Participants emphasized proactive, layered approaches where early benchmarking and regular red-teaming work together to support safety, reliability, and compliance as AI technology advances. Benchmarking early establishes a foundation for continuous improvement, helping identify performance gaps and ensuring models meet necessary safety standards. Red-teaming, or adversarial testing, is critical to pair with benchmarks, as it helps uncover vulnerabilities not apparent through standard testing and stress-tests the model’s resilience to misuse. Frequent red-teaming enables developers to stay ahead of emerging risks, especially in the fast-evolving field of generative AI, and to address potential misuse cases proactively before widespread deployment. For model integrators, continuous testing and re-evaluation procedures were also recommended, particularly to manage the dynamic changes that occur as generative AI models are updated.

Second Workshop Takeaway: Adaptability in Evaluation Methods 

The second takeaway is the necessity of adaptable evaluation methods to ensure that safety assessments remain realistic and resistant to manipulation as AI models evolve. A critical part of adaptability is using high-quality test data sets and avoiding overfitting risks—where models become too tailored to specific test scenarios, leading them to perform well on those tests but potentially failing in new or real-world situations. Overfitting can result in evaluations that give a false sense of security, as the model may appear safe but lacks robustness.

To address this, participants discussed the importance of keeping elements of the evaluation test-set “held back” privately, to ensure that model developers cannot over-train to a public test-set. By using private test-sets as a component of the evaluation process, evaluations can better simulate real-world uses and identify vulnerabilities that traditional static benchmarks and leaderboards might miss.

Workshop participants agreed that responsibility for implementing AI evaluations will need to be shared across the AI value chain. The recent U.S. AI Safety Institute guidance highlights the importance of involving all actors in managing misuse risks throughout the AI lifecycle. While the EU’s draft General-Purpose AI Code of Practice currently focuses on model providers, the broader regulatory landscape indicates a growing recognition of the need for shared responsibility across all stakeholders involved in AI development and deployment. Together, these perspectives underscore the importance of multi-stakeholder collaboration in ensuring that general purpose  AI systems are governed safely and responsibly.

Learn more about the MLCommons AI Risk and Reliability working group building a benchmark to evaluate the safety of AI.

Learn more about Partnership on AI’s Open Foundation Model Value Chain