
The Agentic Product Maturity Ladder is a collection of benchmarks measuring the ability of agentic products to reliably support specific tasks. Our goal in releasing this Maturity Ladder is to establish a trustworthy, community-enabling benchmark framework that supports stakeholders in confidently assessing and comparing the risk and reliability of AI agents, thereby accelerating safer deployments.
While many agentic AI benchmarks already exist, they are not designed to inform real-world deployment decisions, especially in safety-critical domains where errors could have severe consequences. The absence of reliable benchmarking has made it difficult to trust the reliability of agents, and adoption of agentic systems has consequently lagged behind progress in agentic capabilities. Problematically, this trust can only be established through investment in testing (i.e., benchmarking) across thousands of different tasks: a fully mature risk and reliability benchmarking ecosystem would require immense effort to achieve any reasonable degree of coverage.
The goal of the MLCommons Agentic Product Maturity Ladder is twofold:
- Inform agentic product adoption decisions, thereby motivating reliability innovation across industry and better products for society.
- Prioritize benchmarking capacity to where it will be most informative.
Below, we produce one graphic per agentic task of interest, for example “Airline Booking” or “New Car Sales”. The graphic below is illustrative only; real data for agents appears in our paper. In each graphic, the task is benchmarked against principles that are bundled together to answer the following questions (see the data sketch after the example graphic below):
- Capable: Can the agent do the task?
- Bounded: Will the agent stay confined to its supported tasks?
- Confidential: Will the agent protect confidential information?
- Controlled: Does the agent act at the direction of the user?
- Robust: Does the agent handle unusual circumstances appropriately?
- Secure: Is the agent resilient to attack?
- Reliable: Does the agent behave in a consistent and helpful manner?
(Illustrative example graphic, February 2025)
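To make the structure of such a graphic concrete, here is a minimal data sketch in Python. It is an illustration only: the field names, status values, and the example record are assumptions, not an MLCommons schema.

```python
# Illustrative sketch of one per-task graphic: one entry per principle,
# each recording where benchmarking of that principle stands for the task.
# Field names and status values are assumptions, not an MLCommons schema.
from dataclasses import dataclass, field

PRINCIPLES = [
    "Capable", "Bounded", "Confidential", "Controlled",
    "Robust", "Secure", "Reliable",
]

@dataclass
class TaskMaturityRecord:
    task: str                 # e.g. "Airline Booking" (hypothetical example)
    as_of: str                # snapshot date shown with the graphic
    # Hypothetical status per principle: "not_benchmarked", "in_progress", or "benchmarked"
    status: dict[str, str] = field(
        default_factory=lambda: {p: "not_benchmarked" for p in PRINCIPLES}
    )

# Hypothetical record mirroring the kind of illustrative graphic above.
example = TaskMaturityRecord(task="Airline Booking", as_of="2025-02")
example.status["Capable"] = "benchmarked"
```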
How should I interpret this graphic?
Each “column” of the graphic shows progress towards reliability. For example, before investing resources into benchmarking how well “bounded” an agent is (the Maturity Level 2 column, i.e. “The agent will not do things it should not do”), we assert that the agent must first be found to be “capable” (the Maturity Level 1 column, i.e. “The agent can do the task”) under benchmarking conditions. Details about the principles that inform measurement of “capable” and “bounded” can be found in the MLCommons Agentic Principles Paper.
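The column-by-column gating described above can be sketched as a simple ordering rule. This is a hedged illustration, not MLCommons tooling: only the requirement that “capable” be established before “bounded” comes from this page, and the ordering of the later principles is an assumption.

```python
# Sketch of the gating rule: invest in benchmarking a principle only after the
# earlier principles on the ladder have been established under benchmarking
# conditions. Only "Capable" before "Bounded" is stated on this page; the rest
# of the ordering is an assumption for illustration.
LADDER_ORDER = [
    "Capable", "Bounded", "Confidential", "Controlled",
    "Robust", "Secure", "Reliable",
]

def next_principle_to_benchmark(established: set[str]) -> str | None:
    """Return the first principle in ladder order that is not yet established,
    i.e. where benchmarking effort should be invested next."""
    for principle in LADDER_ORDER:
        if principle not in established:
            return principle
    return None  # every principle on the ladder has been established

# Example: once "Capable" is established, "Bounded" is the next column to benchmark.
print(next_principle_to_benchmark({"Capable"}))  # -> Bounded
```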
What is the status today of agentic maturity?
Most benchmarks today are produced for scientific or optimization purposes (i.e., to characterize or produce capabilities) rather than for real-world decision making about the reliability of a system. As a result, these “Research Benchmarks” suffer from a variety of design, coverage, and longevity issues that make them unreliable for informing real-world decisions. While research benchmarks may not be well suited to making real-world decisions, they do indicate whether systems may be “Capable” of performing complex agentic tasks. Therefore, real-world benchmarking can follow strong performance on a research benchmark.
What will it look like for a system to be “Capable” in this maturity model?
The driving principle for “Capable” from the Agentic Product Maturity Ladder is to “Be Correct in the Presence of Usual Circumstances”: ensure agents do not perform clearly incorrect actions for the task in question. Tests for this principle should be unambiguous and should not depend on the interests of the deployer, the user, or their unstated preferences.
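As a hedged illustration of what an unambiguous test might look like, the sketch below checks a hypothetical booking task purely against the user's explicitly stated request; the task, field names, and pass/fail rule are assumptions, not tests from the Maturity Ladder.

```python
# Sketch of an unambiguous "Capable"-style check for a hypothetical booking
# task: pass/fail depends only on the explicitly stated request, never on the
# deployer's or user's unstated preferences. Fields and task are assumptions.
def booking_is_clearly_correct(request: dict, booking: dict) -> bool:
    """True only if the booked itinerary matches every explicitly stated
    requirement; anything else counts as a clearly incorrect action."""
    return all(booking.get(key) == value for key, value in request.items())

request = {"origin": "SFO", "destination": "JFK", "date": "2025-02-10"}
booking = {"origin": "SFO", "destination": "JFK", "date": "2025-02-11"}
print(booking_is_clearly_correct(request, booking))  # False: the wrong date is unambiguously incorrect
```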
How does this address the known agentic benchmarking scale problem?
We propose to address the benchmarking scale problem by developing benchmarking capacity only when research-grade benchmarks demonstrate agent capability that warrants additional investment. That investment should produce increasingly sophisticated, industry-standard benchmarks.
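A minimal sketch of that conditional-investment rule, assuming a made-up 0-1 score scale and threshold (the 0.8 value is an assumption for illustration, not an MLCommons policy):

```python
# Sketch of the conditional-investment rule: develop industry-standard
# benchmarking for a task only once research-grade benchmarks show strong
# enough capability. The 0.8 threshold and 0-1 score scale are assumptions.
def warrants_industry_benchmark(research_scores: list[float],
                                threshold: float = 0.8) -> bool:
    """Invest in industry-standard benchmarking when the best research-benchmark
    score for the task clears the (illustrative) capability threshold."""
    return bool(research_scores) and max(research_scores) >= threshold

print(warrants_industry_benchmark([0.55, 0.83]))  # True: capability warrants further investment
print(warrants_industry_benchmark([0.40, 0.62]))  # False: keep relying on research benchmarks
```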
Who is this maturity model for?
- AI Researchers who are developing benchmarks that evaluate agentic task performance
- Deployers deciding which agentic solutions to purchase
- Public Citizens who are interested in knowing which AI-supported activities are most reliable
Looking Forward
We will continue to add to the list of agentic tasks of interest and include corresponding visualizations of how systems perform on those tasks.
We will populate a list of agentic tasks where systems are showing enough progress on task capability to warrant investment in benchmarking “Capable.”