Regardless of the intended use of an AI system – be it healthcare, banking, energy, etc. – the complex, black-box nature of LLMs means we need to define the behavior we want and then evaluate how reliably the system delivers it. It’s only with an understanding of reliability that we can manage both risk (is 80 percent reliable enough for a life-critical system?) and cost (what rate of customer service mistakes is acceptable if tolerating it reduces per-interaction cost?).

This is why reliability measurement of AI software is the focus – and namesake – of the MLCommons AI Risk and Reliability Working Group. Significantly higher AI reliability would both grow markets and protect society. We believe that to successfully increase AI reliability across the industry, we need a systematic and thoughtful plan. We then need to effectively and collaboratively implement, iterate on, and sustain that plan over time.

As with any long journey, our planning should begin with a map – a map we expect to evolve and improve over time. 

We begin our mapping effort by focusing on pre-deployment testing of an AI system’s behavioral reliability. Reliability needs to be addressed across the AI application lifecycle: during development, at deployment, and during operation. Each of these three stages demands a different focus: development processes, deployment testing, and operation monitoring. Our initial focus is on the testing that gates deployment, because it gives us the most concrete opportunity for change. We further focus on the AI system’s behavior: the responses it gives and the actions it takes. Existing approaches already manage the reliability of the hardware and conventional software substrate on which the AI system runs.

The essence of AI reliability (AIR) is consistently adhering to behavioral rules across varied circumstances. We introduce the AI Reliability Map to organize reliability concerns along exactly these two axes: the rules a system must follow and the circumstances under which it must follow them:

AI Reliability Map: Rules and Circumstances

| Rules (rows) / Circumstances (columns) | Correctness: Obeying rules while following instructions | Security: Resisting rule violations caused by malicious actors |
| --- | --- | --- |
| Functionality | Testing needed | Testing needed |
| Data Protections | Testing needed | Testing needed |
| Product Safety | Testing needed | Testing needed |
| Frontier Safety | Testing needed | Testing needed |
| Psychosocial Limits | Testing needed | Testing needed |

The rows express rules the system is intended to obey; the columns express the circumstances under which the system is expected to obey those rules. It is just as important to follow a functionality rule, such as a deployer instruction, when given normal instructions as when being manipulated by a malicious actor, either directly (through prompt hacking) or indirectly (through prompt injections or misinformation). Likewise, a normal instruction might tempt the system to violate a privacy rule just as much as an attack would, if violating that rule would lead to a “better” outcome for the user. This is why the system must follow all rules in all circumstances, from normal use to malicious action.
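
To make the structure concrete, here is a minimal sketch of the map as a test matrix (in Python; the category names come from the table above, but everything else is an illustrative assumption, not an implementation the working group prescribes):

```python
from itertools import product

# The two axes of the AI Reliability Map (names mirror the table above).
RULES = [
    "Functionality",
    "Data Protections",
    "Product Safety",
    "Frontier Safety",
    "Psychosocial Limits",
]
CIRCUMSTANCES = ["Correctness", "Security"]

# A complete pre-deployment plan covers the full cross product:
# every rule must hold under every circumstance.
test_matrix = list(product(RULES, CIRCUMSTANCES))
for rule, circumstance in test_matrix:
    print(f"Testing needed: {rule} under {circumstance}")

# The matrix is extensible: appending a new rule (or circumstance)
# automatically adds the corresponding cells to the plan.
RULES.append("Hypothetical New Concern")
assert len(list(product(RULES, CIRCUMSTANCES))) == len(test_matrix) + 2
```

The point of the cross product is simple: skipping any cell leaves an untested way for the system to fail.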

It is worth noting that AI Security cannot be tested in isolation: it must be tested by attempting to violate a rule, and the system’s behavior may vary substantially depending on which rule is probed. Security is therefore tested across all system behaviors.
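
As an illustration of that point, the hedged sketch below parameterizes each attack class over every rule category, so a security test always targets some specific rule. The harness and judge are stubs invented here for the sketch, not a real MLCommons or AILuminate API:

```python
import pytest

RULES = ["Functionality", "Data Protections", "Product Safety",
         "Frontier Safety", "Psychosocial Limits"]
# Hypothetical attack classes; real adversarial suites are far richer.
ATTACKS = ["direct_prompt_hack", "indirect_prompt_injection"]

def run_system_under_attack(rule: str, attack: str) -> str:
    """Stub harness: would send an attack aimed at `rule` to the system."""
    return f"response to {attack} targeting {rule}"

def obeys_rule(response: str, rule: str) -> bool:
    """Stub judge: would score whether the response violates `rule`."""
    return True  # placeholder so the sketch runs end to end

# Security has no standalone cell of its own: every attack is an attempt
# to induce a violation of some specific rule, so we sweep each attack
# across every rule category.
@pytest.mark.parametrize("attack", ATTACKS)
@pytest.mark.parametrize("rule", RULES)
def test_security_across_rules(rule: str, attack: str):
    assert obeys_rule(run_system_under_attack(rule, attack), rule)
```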

This AI reliability map helps us shape our definitions, but we need to be much more detailed to take action. Below, we provide additional detail on both the rule and circumstance categories, breaking each into subcategories. The subcategories are intended to address most of the known, salient concerns in pre-deployment testing of commercial systems. The entire matrix is extensible to accommodate missing or new concerns as they emerge or grow in importance or urgency.

It’s worth noting several things about this expansion. First, Functionality encompasses complying with regulatory and deployment requirements, with the former consistently taking precedence over the latter. Second, Data Protection encompasses an individual’s expectations of data privacy as well as “discrete information management,” such as the proper use of corporate data and IP. Third, in our map, Frontier Safety encompasses CBRN and offensive cyber.
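
On the precedence point, here is a minimal sketch (assuming a simple two-level priority, with all names hypothetical) of how conflicting Functionality rules might be resolved so that regulatory requirements always outrank deployment requirements:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    text: str
    source: str  # "regulatory" or "deployment"

PRECEDENCE = {"regulatory": 0, "deployment": 1}  # lower rank wins

def governing_rule(conflicting: list[Rule]) -> Rule:
    """Pick which rule governs when instructions conflict."""
    return min(conflicting, key=lambda r: PRECEDENCE[r.source])

# Hypothetical conflict: a deployer wants full logs, a regulation
# requires data minimization; the regulatory rule wins.
print(governing_rule([
    Rule("retain full chat logs for analytics", "deployment"),
    Rule("minimize retention of personal data", "regulatory"),
]).text)
```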

The utility of the map in understanding the state of AI testing is shown by the colored cells. The yellow cells correspond to the scope addressed by most public capability testing; the green and blue cells correspond to the scope addressed to date by the MLCommons AILuminate Safety and Jailbreak benchmarks.

Any moderately advanced AI agent with a natural language interface, regardless of purpose, is theoretically capable of failing across this entire scope. An AI system for personal finance could, in theory, offer bad financial advice, design a virus upon request, enable a hacker to access confidential corporate information, or deceive a user into an unintended purchase. These AI tools will carry both specific risks (bad financial advice, unintended purchases) and general risks (viruses, hacking) across all verticals. Our challenge, as an industry and a field, is to develop a robustly structured yet constantly evolving approach to deployment testing that covers this map.

This is the work of the MLCommons AI Risk and Reliability Working Group, where industry, academia, and government collaborate to translate frameworks like this into actionable benchmarks, including AILuminate. As an open engineering consortium, MLCommons is uniquely positioned to lead this effort – bringing together the organizations that build AI systems with those that deploy, regulate, and are affected by them. If your organization is working to understand or improve AI reliability, we want to build this with you. Learn more and join the AIRR Working Group at mlcommons.org/working-groups/ai-risk-reliability.