The Agentic Product Maturity Ladder is a collection of benchmarks measuring the ability of agentic products to reliably support specific tasks. Our goal in releasing this Maturity Ladder is to establish a trustworthy, community-enabling benchmark framework that supports stakeholders in confidently assessing and comparing the risk and reliability of AI agents, thereby accelerating safer deployments.

Although many agentic benchmarks exist today, they are not designed to inform real-world deployment decisions, especially in safety-critical domains where errors could have severe consequences. The absence of reliable benchmarking has made it difficult to trust the reliability of agents, and as a result adoption of agentic systems has lagged behind progress in agentic capabilities. Problematically, this trust can only be established through investment in testing (i.e., benchmarking) across thousands of different tasks: a fully mature risk and reliability benchmarking ecosystem would require immense effort to achieve any reasonable degree of coverage.

The goal of the MLCommons Agentic Product Maturity Ladder is twofold: 

  1. Inform agentic product adoption decisions, thereby motivating reliability innovation across industry and better products for society.
  2. Prioritize benchmarking capacity to where it will be most informative.

Below, we produce one graphic per agentic task of interest, for example “Airline Booking” or “New Car Sales”. The table below is illustrative only; real data for agents appears in our paper. For these tables, each task is benchmarked against principles bundled together to answer the following questions:

  1. Capable: Can the agent do the task?
  2. Bounded: Will the agent stay confined to its supported tasks?
  3. Confidential: Will the agent protect confidential information?
  4. Controlled: Does the agent act at the direction of the user?
  5. Robust: Does the agent handle unusual circumstances appropriately?
  6. Secure: Is the agent resilient to attack?
  7. Reliable: Does the agent behave in a consistent and helpful manner?
| Agent | Readiness Level 0: Research | Readiness Level 1: Capable | Readiness Level 2: Bounded | Readiness Level 3a: Confidential | Readiness Level 3b: Controlled | Readiness Level 3c: Robust | Readiness Level 4: Secure | Readiness Level 5: Reliable |
|---|---|---|---|---|---|---|---|---|
| BodegaBot | 98% | 90% | 95% | pending | 84% | 23% | not eligible | not eligible |
| InventoryAI | 95% | 54% | not eligible | not eligible | not eligible | not eligible | not eligible | not eligible |
| CartLogic | 81% | 51% | not eligible | not eligible | not eligible | not eligible | not eligible | not eligible |
| Clive-3.7 Acrostic (February 2025) | 83% | 51% | not eligible | not eligible | not eligible | not eligible | not eligible | not eligible |
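For readers who want to manipulate these scorecards programmatically, here is a minimal sketch of how one row of the table above might be represented in Python. All names (TaskScorecard, LEVELS, and so on) are our own illustrative assumptions, not part of any MLCommons release.

```python
from dataclasses import dataclass, field

# Hypothetical ordering of readiness levels, matching the columns above.
LEVELS = [
    "0: Research", "1: Capable", "2: Bounded",
    "3a: Confidential", "3b: Controlled", "3c: Robust",
    "4: Secure", "5: Reliable",
]

@dataclass
class TaskScorecard:
    """One row of the illustrative table: an agent's results for one task.

    Each level maps to a score in [0, 1], or to the strings "pending"
    (benchmark run not yet complete) or "not eligible" (an earlier
    level was not passed).
    """
    agent: str
    scores: dict = field(default_factory=dict)

bodegabot = TaskScorecard(
    agent="BodegaBot",
    scores={
        "0: Research": 0.98, "1: Capable": 0.90, "2: Bounded": 0.95,
        "3a: Confidential": "pending", "3b: Controlled": 0.84,
        "3c: Robust": 0.23, "4: Secure": "not eligible",
        "5: Reliable": "not eligible",
    },
)
```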

How should I interpret this graphic?

Each “column” of the graphic shows progress towards reliability. For example, before investing resources into benchmarking how well “bounded” an agent is (the Readiness Level 2 column, i.e., “The agent will not do things it should not do”), we assert that the agent must first be found to be “capable” (the Readiness Level 1 column, i.e., “The agent can do the task”) under benchmarking conditions. Details about the principles that inform measurement of “capable” and “bounded” can be found in the MLCommons Agentic Principles Paper. This gating rule is sketched in code below.
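To make the gating rule concrete, here is a minimal sketch in Python. It assumes a single pass threshold gates progression from one level to the next; the threshold value and function name are illustrative assumptions, not part of the ladder's specification.

```python
# Hypothetical gating rule: an agent may be benchmarked at a readiness level
# only after scoring at or above a pass threshold at every earlier level.
PASS_THRESHOLD = 0.80  # assumed value, for illustration only

def eligible_levels(ordered_scores: list[float]) -> int:
    """Return how many ladder levels the agent is eligible for, given its
    scores at the levels already benchmarked (in ladder order)."""
    eligible = 1  # every agent is eligible for Level 0 (Research) benchmarking
    for score in ordered_scores:
        if score < PASS_THRESHOLD:
            break       # failed this level: all later levels are "not eligible"
        eligible += 1   # passed this level: the next one opens up
    return eligible

# Example mirroring the table: InventoryAI passes Research (95%) but not
# Capable (54%), so it is not eligible for Bounded or beyond.
print(eligible_levels([0.95, 0.54]))  # -> 2 (Levels 0 and 1 only)
```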

What is the status today of agentic maturity?

Most benchmarks today are produced for scientific or optimization purposes (i.e., to characterize or produce capabilities) rather than for real-world decision making about the reliability of a system. As a result, these “Research Benchmarks” suffer from a variety of design, coverage, and longevity issues that make them unreliable for informing real-world decisions. While research benchmarks may not be well suited to making real-world decisions, they do indicate whether systems may be “Capable” of performing complex agentic tasks. Therefore, benchmarking for the real world can follow strong performance on a research benchmark.

What will it look like for a system to be “Capable” in this maturity model?

The driving principle for “Capable” in the Agentic Product Maturity Ladder is to “Be Correct in the Presence of Usual Circumstances”: ensure agents do not perform clearly incorrect actions for the task in question. Tests for this principle should be unambiguous and should not depend on the interests of the deployer, the user, or their unstated preferences. A sketch of what such a test might look like follows.
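Purely as an illustration, here is one way an unambiguous “Capable” test case might be encoded in Python. The schema, field names, and checks are our own assumptions, not a published MLCommons format.

```python
# Hypothetical test case for an "Airline Booking" task. Pass/fail is decided
# by objective facts about the resulting booking, not by anyone's preferences.
capable_test = {
    "task": "Airline Booking",
    "instruction": "Book one economy seat from SFO to JFK on 2025-06-01.",
    # Each check is verifiably true or false after the agent run.
    "checks": [
        "exactly one booking was created",
        "origin is SFO and destination is JFK",
        "cabin class is economy",
        "travel date is 2025-06-01",
    ],
}

def passes(results: dict[str, bool]) -> bool:
    """A run passes only if every unambiguous check held."""
    return all(results[check] for check in capable_test["checks"])
```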

How does this address the known agentic benchmarking scale problem?

We propose to meet the benchmarking scale problem by developing benchmarking capacity conditionally: only when research-grade benchmarks demonstrate agent capability that warrants additional investment. That investment should then produce increasingly sophisticated, industry-standard benchmarks. A sketch of this rule appears below.
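As a minimal sketch of this conditional-investment rule (the task names, scores, and threshold below are invented for illustration):

```python
# Hypothetical portfolio of tasks with their best research-benchmark scores.
research_scores = {
    "Airline Booking": 0.91,
    "New Car Sales": 0.62,
    "Retail Checkout": 0.88,
}

# Assumed bar for funding an industry-standard "Capable" benchmark.
INVESTMENT_THRESHOLD = 0.85

# Invest only where research results warrant it, most promising tasks first.
funded = sorted(
    (task for task, score in research_scores.items()
     if score >= INVESTMENT_THRESHOLD),
    key=lambda task: -research_scores[task],
)
print(funded)  # -> ['Airline Booking', 'Retail Checkout']
```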

Who is this maturity model for?

  1. AI Researchers who are developing benchmarks that evaluate agentic task performance
  2. Deployers deciding which agentic solutions to purchase
  3. Public Citizens who are interested in knowing which AI-supported activities are most reliable

Looking Forward

We will continue to add to the list of agentic tasks of interest and include corresponding visualizations of how systems perform on those tasks.
We will populate a list of agentic tasks where systems show enough progress on task capability to warrant investment in benchmarking “capability.”
