Last month, MLCommons® participated in the ISO/IEC JTC 1/SC 42 plenary meeting in Sydney, Australia. SC 42 is the international standards committee for artificial intelligence, and we’ve been increasingly engaged in its technical work on AI testing and management systems.

For an organization built on open, collaborative development, contributing to international standards is a natural next step. It’s also a bridge between grassroots technical work and the formal frameworks that guide AI development worldwide.

MLCommons Benchmarks Are Now in ISO Standards

Two standards in the ISO/IEC 42119 series on AI testing have recently advanced to the publication stage and now reference MLCommons benchmarks:

  • ISO/IEC 42119-2 (Testing of AI systems) covers testing techniques throughout the AI system lifecycle, including how assessment metrics integrate into verification and validation processes.
  • ISO/IEC 42119-3 (Verification and validation analysis) establishes approaches for V&V analysis of AI systems, covering formal methods, simulation, and evaluation frameworks.

Both standards cite MLCommons benchmarks as examples of standardized testing methodologies. This reflects something our community has practiced for years: reproducible, open approaches to AI evaluation are essential infrastructure, not optional extras.

What We’re Working On Now

We’re actively contributing to two emerging standards that align with our technical work:

ISO/IEC 42119-8: Defining Good Benchmarks

This standard tackles a fundamental question: what makes a benchmark actually useful? It focuses on benchmarking of AI systems, offering guidance for the construction and implementation of benchmarks. It also addresses quality assessment of prompt-based generative AI, including red teaming and safety evaluation methodologies.

Given MLCommons’ experience with both performance benchmarks (MLPerf®) and safety benchmarks (AILuminate), we’re well-positioned to contribute here. How do you design benchmarks that are practical yet comprehensive? How do you keep them relevant as AI capabilities advance? These are questions we’ve worked through in our community, and now we’re helping answer them at the standards level.

ISO/IEC 42003: From Testing to Governance

This standard provides implementation guidance for ISO/IEC 42001, the first AI management system standard. Our focus is on showing how benchmarking integrates across the AI system lifecycle—not just as a final checkpoint, but as a continuous tool for development, deployment, and ongoing governance.

Think of it this way: traditional testing happens at the end, as a gate. But when benchmarking is woven throughout, it becomes continuous feedback. It informs decisions during development, validates readiness before deployment, and provides ongoing assurance in production. That’s what we’re working to encode in this standard.

Why This Work Matters

International standards have a concrete impact that scales:

  • They create shared language. When developers in Tokyo, AI safety researchers in London, and procurement officers in São Paulo all reference the same standards, they can communicate precisely about system capabilities and requirements.
  • They establish baseline expectations. Standards codify “minimum acceptable practice,” raising the floor for AI development quality across the industry.
  • They democratize access. When standards reference open, community-developed benchmarks rather than proprietary methodologies, they lower barriers to entry. A startup or academic lab can adopt the same testing frameworks as major tech companies, without licensing fees or proprietary dependencies.
  • They mitigate risks. Standards give people systematic ways to identify risks and avoid them. In doing so, they help grow the market for AI: less risk means more opportunities for deployment.

As a participant in ISO, we are involved as technical contributors. But participation also means we’re learning from the broader standardization community’s perspectives on testing, governance, and risk management. It’s a two-way exchange that strengthens both our benchmarks and the emerging standards.

Your Expertise Matters: Getting Involved

Part of the reason we are engaged with ISO is the truism that “standards are made by those in the room.” It’s also part of why we talk openly about our standards work: as a membership organization, we seek your input so we can reflect your state-of-the-art knowledge back to ISO.

If your work touches AI benchmarking, testing methodologies, or AI governance implementation, please join us, either directly at ISO or through our working groups:

  • Direct ISO Participation: National standards bodies coordinate participation in SC 42 working groups. Contact your national standards body to explore formal involvement in standards development.
  • MLCommons Working Groups: Our AI Risk & Reliability (AIRR) working group develops benchmarks and methodologies that directly inform standards work. Community members can contribute to benchmark development, testing frameworks, and documentation that supports standardization efforts—no need to navigate the formal ISO process.

We’re particularly interested in perspectives on:

  • Practical challenges in implementing AI testing frameworks: Where do theory and practice diverge?
  • Coverage gaps in existing benchmarking approaches: What aren’t we measuring that we should be?
  • Integration points between benchmarks and organizational AI governance: How do benchmarks fit into real-world AI management systems?

A Note from AIRR Leadership (Rebecca, Kurt, Andrew, Sean…)

“AI is moving at an unprecedented pace, but we think that by reducing risk through coordinated, international standards it could reach people’s lives even more quickly. The work AIRR is doing with ISO right now is an opportunity to ensure the global standards that govern this technology are built on open science and transparent measurement. We are collaborating to build the rails for responsible AI development.”

The Bigger Picture

MLCommons was founded on a simple but powerful premise: that open collaboration and transparent measurement make AI better for everyone. We started with performance benchmarks because the industry needed common ground for evaluating AI systems. We expanded to safety benchmarks because AI governance demands more than speed metrics. And now, we’re engaging with international standardization because the frameworks that guide global AI development should be informed by the same principles of openness and community participation that drive our technical work.

The benchmark you run today, the testing methodology you propose, the gap you identify in current approaches—these contributions don’t just advance MLCommons’ work. When our benchmarks inform international standards, your contributions help shape the global framework for AI development.

That’s the opportunity before us, and we’re eager to pursue it together.

Learn more about our standards work:
Visit the AI Risk & Reliability working group at mlcommons.org/working-groups/ai-risk-reliability

Get involved in standards-related development:
Find information on MLCommons working groups at mlcommons.org

MLCommons is a non-profit engineering consortium developing open benchmarks, datasets, and best practices for AI. Learn more at mlcommons.org.