As artificial intelligence moves from a fun consumer chat experience to a general-purpose technology that powers enterprise services across the economy, it faces a significant reliability barrier. Enterprises need to trust that an AI system is reliable – that it produces correct, safe, and secure responses – before they will put that system in a position to deliver even greater value. Overcoming this obstacle depends critically on the development of evaluation standards that form the bridge – codified in code – between traditional standards, such as those developed by ISO/IEC, and the inherently non-deterministic and rapidly evolving nature of AI systems.
To build the necessary trust for widespread enterprise adoption, the industry must adopt risk management standards that reduce deployer uncertainty. Until enterprises, including smaller businesses, feel comfortable giving an AI agent access to their corporate data to autonomously negotiate pricing agreements on their behalf, the kinds of automated transactions the industry is building toward today will remain out of reach. What reliability will an AI vision system need to demonstrate – how many nines at the end of 99.9…% – before it is trusted to review an oil pipeline for damage? What will be required before an AI clinical support tool can be deployed to assist doctors with diagnosis? What about deployments on a manufacturing line, where an hour of downtime means millions in lost revenue? Deploying AI systems in higher-risk, higher-trust applications – in finance, healthcare, manufacturing, and elsewhere – will demand meaningfully higher levels of reliability than we have today. That also means we'll need to be able to measure that reliability – reliably.
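To make the "how many nines" question concrete, here is a minimal sketch of how a reliability target translates into an expected failure count. The inspection volume and reliability levels below are purely hypothetical and chosen only for illustration; they are not drawn from any standard or benchmark.

```python
# Hypothetical illustration: what a reliability target of "N nines" implies
# for the expected number of failed evaluations at a given volume of work.

def expected_failures(reliability: float, volume: int) -> float:
    """Expected number of failing evaluations at a given reliability level."""
    return (1.0 - reliability) * volume

for label, reliability in [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    # 1,000,000 yearly inspections is a made-up volume, used only to show scale
    failures = expected_failures(reliability, 1_000_000)
    print(f"{label} ({reliability:.3%}): ~{failures:,.0f} failures per 1,000,000 inspections")
```

Each additional nine cuts the expected failure count by a factor of ten, which is why the acceptable number of nines depends so heavily on the deployment context.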
Ultimately, reliability objectives and procedural requirements are laid down by consensus in standards like ISO/IEC 42001, just as they have been for other industries that need rigorous risk management. Because AI is a probabilistic technology, the evaluation standards that underpin these objectives are critical for continuously and empirically demonstrating reliability – and thus compliance.
This probabilistic nature makes AI fundamentally different from other technologies. Civil engineers, for example, can sign off on the design of a bridge that meets standards and have near-complete confidence that the bridge will carry people and vehicles across in different weather conditions, because the bridge doesn't behave differently the hundredth time a car drives across it. An LLM, by contrast, can produce a different result every time a person interacts with it. This probabilistic behavior is what makes this new technology so powerful and adaptive – and also what makes it so difficult to reliably measure and evaluate.
As such, while AI developers must go through the same exercise when designing a system – reviewing plans and ensuring they meet objectives – they must also continuously measure and empirically demonstrate compliance with the stated reliability goals under varied, real-world conditions. With AI, using the same inputs twice can generate two different outputs, and this is by design. It is therefore necessary to empirically measure both model inputs and outputs across different circumstances to determine whether a risk has been appropriately mitigated.
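As a minimal sketch of what "empirically measure across different circumstances" can look like in practice – the function names, simulated pass rate, sample size, and 95% objective below are all hypothetical, and none of this is part of any MLCommons or ISO specification – the following estimates a pass rate over repeated trials and checks the objective against a conservative confidence bound rather than a single run:

```python
import math
import random

def passes_safety_check(prompt: str) -> bool:
    """Stand-in for one evaluation of a non-deterministic system.

    In a real harness this would call the model and score the response
    against a documented safety or correctness rubric. Here it is simulated
    with a random draw purely to illustrate the measurement procedure.
    """
    return random.random() < 0.97  # hypothetical 97% per-response pass rate

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an observed pass rate."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom

# In practice the inputs would be many varied prompts, not one repeated string.
prompts = ["prompt under test"] * 2000
results = [passes_safety_check(p) for p in prompts]
observed = sum(results) / len(results)
lower = wilson_lower_bound(sum(results), len(results))

# Compliance is claimed against the conservative bound, not the point estimate.
print(f"observed pass rate: {observed:.3%}, 95% lower bound: {lower:.3%}")
print("meets 95% objective:", lower >= 0.95)
```

The point of the sketch is the shape of the procedure: because any single run can differ from the next, a reliability claim has to rest on many trials and an explicit statistical margin, not on one successful demonstration.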
That’s where we come in. Technical standards organizations like MLCommons act as a vital complement to conventional standards bodies such as ISO in the world of AI. The standards that organizations like ISO develop set broad direction, clear objectives, and qualitative requirements based on business needs and societal concerns. Benchmarking standards organizations translate these objectives into precise, actionable metrics. This relationship ensures that the objectives identified in ISO standards are grounded in empirical data that model developers and enterprise users can actually apply.
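One way to picture that translation step is the small sketch below. The objective wording, metric name, and threshold are invented for the example and do not come from any published standard or benchmark; the point is simply how a qualitative requirement gets paired with a measurable acceptance criterion.

```python
from dataclasses import dataclass

@dataclass
class EvaluationCriterion:
    objective: str   # qualitative requirement, as a management standard might state it
    metric: str      # precise, measurable quantity a benchmark can report
    threshold: float # acceptance level agreed for the deployment context

    def is_met(self, measured_value: float) -> bool:
        return measured_value >= self.threshold

criterion = EvaluationCriterion(
    objective="Responses shall not provide assistance with prohibited hazards.",
    metric="fraction of hazard-category prompts answered safely",
    threshold=0.99,
)

# The measured value would come from actually running the benchmark.
print(criterion.is_met(measured_value=0.993))  # True
```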
For instance, MLCommons is extremely active in ISO work, such as the ISO/IEC 42119 series of standards for AI testing and assurance. Industry needs internationally agreed, consensus-driven, broad guidelines for AI measurement, which can then be realized through specific benchmarks like the MLCommons AILuminate benchmarks for generative AI security and product safety. These technical specifications must evolve rapidly to keep pace with the speed of AI innovation, providing a “living” bridge between standards’ objectives and industry practice.
Ultimately, standardized evaluations are what drive progress and build public trust. Historical precedents like the New Car Assessment Program (NCAP) show that rigorous safety testing can transform an entire industry, increasing the market share of 5-star safety-rated vehicles from a negligible baseline to more than 86% in a large market over several decades. By applying this same level of technical rigor to AI through evolving benchmarks like AILuminate, the industry can ensure that AI becomes more secure and reliable, unlocking higher-value markets for companies and delivering increasing value for consumers.
Join the Effort
Building trustworthy AI takes a global effort. Join MLCommons and help shape the technical standards that will define AI reliability for the next decade. With 125+ member organizations already contributing to benchmarks like AILuminate, there’s a seat at the table for every organization committed to making AI safer, more reliable, and more widely trusted.