The Agentic Product Maturity Ladder is a collection of benchmarks measuring the ability of agentic products to reliably support specific tasks. Our goal in releasing this Maturity Ladder is to establish a trustworthy, community-enabling benchmark framework that supports stakeholders in confidently assessing and comparing the risk and reliability of AI agents, thereby accelerating safer deployments.

Although many agentic benchmarks exist today, they are not designed to inform real-world deployment decisions, especially in safety-critical domains where errors could have severe consequences. The absence of reliable benchmarking has made it difficult to trust the reliability of agents, and as a result adoption of agentic systems has lagged behind progress in agentic capabilities. Problematically, this trust can only be established through investment in testing (i.e., benchmarking) across thousands of different tasks: a fully mature risk and reliability benchmarking ecosystem would require immense effort to achieve any reasonable degree of coverage.

The goal of the MLCommons Agentic Product Maturity Ladder is twofold: 

  1. Inform agentic product adoption decisions, thereby motivating reliability innovation across industry and better products for society.
  2. Prioritize benchmarking capacity to where it will be most informative.

Below, we produce one graphic per agentic task of interest, for example “Airline Booking” or “New Car Sales”. The table below is illustrative only; real data for agents appears in our paper. For these tables, each task is benchmarked against principles bundled together to answer the following questions:

  1. Capable: Can the agent do the task?
  2. Bounded: Will the agent stay confined to its supported tasks?
  3. Confidential: Will the agent protect confidential information?
  4. Controlled: Does the agent act at the direction of the user?
  5. Robust: Does the agent handle unusual circumstances appropriately?
  6. Secure: Is the agent resilient to attack?
  7. Reliable: Does the agent behave in a consistent and helpful manner?
| Agent | Readiness Level 0: Research | Readiness Level 1: Capable | Readiness Level 2: Bounded | Readiness Level 3a: Confidential | Readiness Level 3b: Controlled | Readiness Level 3c: Robust | Readiness Level 4: Secure | Readiness Level 5: Reliable |
|---|---|---|---|---|---|---|---|---|
| BodegaBot | 98% | 90% | 95% | pending | 84% | 23% | not eligible | not eligible |
| InventoryAI | 95% | 54% | not eligible | not eligible | not eligible | not eligible | not eligible | not eligible |
| CartLogic | 81% | 51% | not eligible | not eligible | not eligible | not eligible | not eligible | not eligible |
| Clive-3.7 Acrostic (February 2025) | 83% | 51% | not eligible | not eligible | not eligible | not eligible | not eligible | not eligible |
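For readers who want to manipulate these scorecards programmatically, here is a minimal sketch of how one row of the table above might be represented in Python. All names (TaskScorecard, LEVELS, and so on) are our own illustrative assumptions, not part of any MLCommons release.

```python
from dataclasses import dataclass, field

# Hypothetical ordering of readiness levels, matching the columns above.
LEVELS = [
    "0: Research", "1: Capable", "2: Bounded",
    "3a: Confidential", "3b: Controlled", "3c: Robust",
    "4: Secure", "5: Reliable",
]

@dataclass
class TaskScorecard:
    """One row of the illustrative table: an agent's results for one task.

    Each level maps to a score in [0, 1], or to the strings "pending"
    (benchmark run not yet complete) or "not eligible" (an earlier
    level was not passed).
    """
    agent: str
    scores: dict = field(default_factory=dict)

bodegabot = TaskScorecard(
    agent="BodegaBot",
    scores={
        "0: Research": 0.98, "1: Capable": 0.90, "2: Bounded": 0.95,
        "3a: Confidential": "pending", "3b: Controlled": 0.84,
        "3c: Robust": 0.23, "4: Secure": "not eligible",
        "5: Reliable": "not eligible",
    },
)
```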

How should I interpret this graphic?

Each “column” of the graphic shows progress towards reliability. For example, before investing resources into benchmarking how well “bounded” an agent is (the Readiness Level 2 column, i.e., “The agent will not do things it should not do”), we assert that the agent must first be found to be “capable” (the Readiness Level 1 column, i.e., “The agent can do the task”) under benchmarking conditions. Details about the principles that inform measurement of “capable” and “bounded” can be found in the MLCommons Agentic Principles Paper. This gating rule is sketched in code below.
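To make the gating rule concrete, here is a minimal sketch in Python. It assumes a single pass threshold gates progression from one level to the next; the threshold value and function name are illustrative assumptions, not part of the ladder's specification.

```python
# Hypothetical gating rule: an agent may be benchmarked at a readiness level
# only after scoring at or above a pass threshold at every earlier level.
PASS_THRESHOLD = 0.80  # assumed value, for illustration only

def eligible_levels(ordered_scores: list[float]) -> int:
    """Return how many ladder levels the agent is eligible for, given its
    scores at the levels already benchmarked (in ladder order)."""
    eligible = 1  # every agent is eligible for Level 0 (Research) benchmarking
    for score in ordered_scores:
        if score < PASS_THRESHOLD:
            break       # failed this level: all later levels are "not eligible"
        eligible += 1   # passed this level: the next one opens up
    return eligible

# Example mirroring the table: InventoryAI passes Research (95%) but not
# Capable (54%), so it is not eligible for Bounded or beyond.
print(eligible_levels([0.95, 0.54]))  # -> 2 (Levels 0 and 1 only)
```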

What is the status today of agentic maturity?

Most benchmarks today are produced for scientific or optimization purposes (i.e., to characterize or produce capabilities) rather than for real-world decision making about the reliability of a system. As a result, these “Research Benchmarks” suffer from a variety of design, coverage, and longevity issues that make them unreliable for informing real-world decisions. While research benchmarks may not be well suited to making real-world decisions, they do indicate whether systems may be “Capable” of performing complex agentic tasks. Therefore, benchmarking for the real world can follow strong performance on a research benchmark.

What will it look like for a system to be “Capable” in this maturity model?

The driving principle for “Capable” in the Agentic Product Maturity Ladder is to “Be Correct in the Presence of Usual Circumstances”: ensure agents do not perform clearly incorrect actions for the task in question. Tests for this principle should be unambiguous and should not depend on the interests of the deployer, the user, or their unstated preferences. A sketch of what such a test might look like follows.
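Purely as an illustration, here is one way an unambiguous “Capable” test case might be encoded in Python. The schema, field names, and checks are our own assumptions, not a published MLCommons format.

```python
# Hypothetical test case for an "Airline Booking" task. Pass/fail is decided
# by objective facts about the resulting booking, not by anyone's preferences.
capable_test = {
    "task": "Airline Booking",
    "instruction": "Book one economy seat from SFO to JFK on 2025-06-01.",
    # Each check is verifiably true or false after the agent run.
    "checks": [
        "exactly one booking was created",
        "origin is SFO and destination is JFK",
        "cabin class is economy",
        "travel date is 2025-06-01",
    ],
}

def passes(results: dict[str, bool]) -> bool:
    """A run passes only if every unambiguous check held."""
    return all(results[check] for check in capable_test["checks"])
```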

How does this address the known agentic benchmarking scale problem?

We propose to meet the benchmarking scale problem by developing benchmarking capacity conditionally: only when research-grade benchmarks demonstrate agent capability that warrants additional investment. That investment should then produce increasingly sophisticated, industry-standard benchmarks. A sketch of this rule appears below.
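As a minimal sketch of this conditional-investment rule (the task names, scores, and threshold below are invented for illustration):

```python
# Hypothetical portfolio of tasks with their best research-benchmark scores.
research_scores = {
    "Airline Booking": 0.91,
    "New Car Sales": 0.62,
    "Retail Checkout": 0.88,
}

# Assumed bar for funding an industry-standard "Capable" benchmark.
INVESTMENT_THRESHOLD = 0.85

# Invest only where research results warrant it, most promising tasks first.
funded = sorted(
    (task for task, score in research_scores.items()
     if score >= INVESTMENT_THRESHOLD),
    key=lambda task: -research_scores[task],
)
print(funded)  # -> ['Airline Booking', 'Retail Checkout']
```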

Who is this maturity model for?

  1. AI Researchers who are developing benchmarks that evaluate agentic task performance
  2. Deployers deciding which agentic solutions to purchase
  3. Public Citizens who are interested in knowing which AI-supported activities are most reliable

Looking Forward

We will continue to add to the list of agentic tasks of interest and include corresponding visualizations of how systems perform on those tasks.
We will populate a list of agentic tasks where systems show enough progress on task capability to warrant investment in benchmarking “capability.”
