The AI industry ships new frontier models every few months. Each generation is more capable than the last, and each generation changes the risk profile that regulators, enterprises, and the public need to evaluate. But the benchmarks used to measure those risks do not update themselves. A benchmark built to evaluate last year’s models can lose its diagnostic power against this year’s models.

This is a central challenge in AI evaluation: the instruments must keep pace with the technology they measure. When they don’t, the result is not a dramatic failure — it is a quiet one. Scores still get produced. Grades still get assigned. But the numbers gradually stop reflecting real-world risk, leaving the organizations that rely on them to operate on stale signal.

After the launch of LLM chatbots, AI benchmarks proliferated rapidly. However, very few have a mechanism to address this fundamental problem of “benchmark freshness”. An additional complicating factor is that benchmarks are often published with open evaluation datasets, which allows model developers to train directly on test data. Many foundation model organizations have policies against doing so, but even these organizations have difficulty ensuring test data is absent from the increasingly massive datasets required to expand the frontiers of capability. When models train on the benchmarks, the resulting scores reflect memorization rather than genuine risk management or capability.

BenchRisk, an independent framework for assessing benchmark quality across 57 failure modes, quantifies this problem: among 26 AI benchmarks assessed, the median longevity score is 5 out of 100. The benchmarks saturate, get gamed, or simply stop distinguishing between systems. AILuminate, the first benchmark developed by the AI Risk and Reliability (AIRR) Working Group at MLCommons, was specifically designed to resist this pattern. Its v1.0 prompt dataset includes 24,000 human-authored prompts across 12 hazard categories. It is privately administered, built with reserve prompt sets to allow for prompt rotation, and earned the highest composite score across all 26 benchmarks assessed, including a longevity score of 75. However, while AILuminate’s longevity is likely better than that of peer benchmarks, it will still degrade over time. Ensuring that AILuminate continues to provide reliable real-world information means AILuminate itself requires maintenance.

A critical component of the long-term value of AILuminate is operational infrastructure for sustained benchmark freshness: we call it the Continuous Prompt Stewardship System. In this system, “Continuous” signals that prompt refresh is a technical requirement driven by quantitative measurement of prompt performance, one that cannot wait for organizational bandwidth or calendar cycles. “Stewardship” signals custodial management of a shared resource on behalf of a community, with obligations of care, transparency, and accountability. This reflects MLCommons’ mission to achieve Better AI for Everyone. MLCommons’ multi-stakeholder community spans industry, academia, government, civil society, and the broader public. Our prompt stewardship infrastructure is designed to sustain benchmark integrity on their behalf.

What it takes to keep a benchmark fresh

This sounds simple in principle, but it requires solving several interconnected problems at once. You need per-prompt quality metrics to detect staleness. You need reserve prompts ready to rotate in. You need quality metrics for new prompts, as well as metrics for the prompt dataset as a whole, to ensure comprehensive coverage across the hazard taxonomy and sufficient diversity to resist overfitting. These metrics need defensible scientific grounding, not just editorial judgment. You need a contributor pipeline broad enough to produce diverse, appropriately representative, naturalistic prompts at the pace the benchmark requires. This contributor pipeline must include quality controls rigorous enough to satisfy the scrutiny that an industry-standard benchmark attracts. And you need all of this to be documented and auditable, because the credibility of every benchmark MLCommons runs ultimately rests on the integrity of the prompts that generated its results. To address these requirements, the Prompt Stewardship System makes the following changes to how AILuminate manages its prompt datasets.

Refresh cadence driven by prompt metrics. Prompt rotation will be driven by empirical performance signals: observed declines in discriminative power, ceiling effects, emerging correlations between prompts, and similar indicators. We are adopting a measurement approach grounded in psychometric principles, particularly Item Response Theory, the measurement framework used in standardized testing from the SAT to medical licensing exams.
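As a rough illustration of what metric-driven rotation can look like (the thresholds, data layout, and function names below are illustrative assumptions, not the AILuminate implementation), the following sketch computes two classical item statistics from a matrix of pass/fail results and flags prompts that no longer carry signal:

```python
import numpy as np

def prompt_item_stats(responses: np.ndarray):
    """Classic item statistics for a binary response matrix.

    responses[i, j] = 1 if system i handled prompt j safely, else 0.
    Returns per-prompt pass rates (a difficulty proxy) and corrected
    item-total correlations (a discrimination proxy).
    """
    pass_rate = responses.mean(axis=0)
    n_prompts = responses.shape[1]
    discrimination = np.zeros(n_prompts)
    for j in range(n_prompts):
        # Correlate each prompt with the rest-of-test score so the item
        # does not correlate with itself.
        rest_score = responses.sum(axis=1) - responses[:, j]
        if responses[:, j].std() == 0 or rest_score.std() == 0:
            discrimination[j] = 0.0  # no variance: the prompt no longer discriminates
        else:
            discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]
    return pass_rate, discrimination

def rotation_candidates(pass_rate, discrimination,
                        ceiling=0.98, min_discrimination=0.10):
    """Flag prompts showing ceiling effects or weak discriminative power."""
    return np.where((pass_rate >= ceiling) | (discrimination < min_discrimination))[0]

# Toy example: 6 systems x 5 prompts.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
])
rates, disc = prompt_item_stats(responses)
print(rotation_candidates(rates, disc))
# -> [0 3 4]: ceiling effects (prompts 0 and 3) and weak discrimination (prompt 4)
```

Corrected item-total correlation is a classical-test-theory stand-in for the IRT discrimination parameter; a production system would fit a full IRT model over many more systems and prompts, but the rotation trigger works the same way: flag items whose statistics fall below documented thresholds.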

Closed-loop dataset rebalancing. Whenever prompts are added or retired, the system recomputes dataset-level metrics such as coverage balance across all 12 hazard categories, difficulty distribution, and linguistic diversity. Gaps identified through rebalancing (e.g., a hazard category losing coverage or a difficulty band becoming sparse) generate specifications and requirements for the next prompt generation cycle. Rebalancing closes the loop between retirement and generation: the dataset’s overall measurement properties hold even as individual prompts rotate through.
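A minimal sketch of the rebalancing step, with made-up category names, coverage targets, and difficulty bands standing in for the real taxonomy and policy:

```python
from collections import Counter

# Illustrative targets only; the real AILuminate taxonomy has 12 hazard
# categories and its own difficulty banding.
TARGET_PER_CATEGORY = 200
DIFFICULTY_BANDS = ("easy", "medium", "hard")
TARGET_PER_BAND = 60  # per category

def rebalancing_specs(prompts):
    """Given active prompts as (hazard_category, difficulty_band) pairs,
    return generation specs describing where coverage fell below target."""
    by_cat = Counter(cat for cat, _ in prompts)
    by_cell = Counter(prompts)
    specs = []
    for cat in sorted({cat for cat, _ in prompts}):
        if by_cat[cat] < TARGET_PER_CATEGORY:
            specs.append({"hazard_category": cat,
                          "needed": TARGET_PER_CATEGORY - by_cat[cat],
                          "reason": "category below coverage target"})
        for band in DIFFICULTY_BANDS:
            if by_cell[(cat, band)] < TARGET_PER_BAND:
                specs.append({"hazard_category": cat,
                              "difficulty_band": band,
                              "needed": TARGET_PER_BAND - by_cell[(cat, band)],
                              "reason": "difficulty band became sparse"})
    return specs

# After a retirement pass, recompute and hand the specs to prompt authors.
active = [("violent_crimes", "easy")] * 80 + [("violent_crimes", "hard")] * 70
print(rebalancing_specs(active))
```

The output is a machine-readable list of gaps that can be handed directly to the next prompt generation cycle as specifications.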

A community-driven contributor model. The v1.0 prompts were produced by contracted suppliers writing to specification, an approach that Eric Raymond’s foundational essay on open-source development calls the “cathedral” model. It delivered the initial dataset effectively, but it concentrates expertise in a few organizations and limits the pace and diversity of prompt production. The Prompt Stewardship System moves toward the model of open collaboration that Raymond likened to the “bazaar” by broadening authorship to include MLCommons staff, member-organization volunteers, authenticated public contributors, and hired specialists. This shift improves scale and quality, since a diverse contributor base produces prompts with greater natural variation in style, vocabulary, and cultural framing. Open contribution models only work, though, if quality control scales with them. Wikimedia produces reference-quality knowledge at a scale no contracted workforce could match, not because anyone can edit anything, but because of tiered trust levels and shared standards. The Prompt Stewardship System applies the same principle: every contributor progresses through a documented qualification pathway, and their status is recorded at every step. The result is not vague assertions of “expert authorship” but documented, quantitative evidence that every contributor meets the same standard.
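To make the qualification pathway concrete, here is a hedged sketch; the tier names, promotion thresholds, and fields are hypothetical placeholders rather than MLCommons’ actual criteria:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Tier names and promotion thresholds are illustrative, not the real pathway.
TIERS = ["applicant", "trainee", "qualified", "trusted_reviewer"]
PROMOTION_RULES = {
    ("applicant", "trainee"): lambda c: c.training_modules_completed >= 3,
    ("trainee", "qualified"): lambda c: c.accepted_prompts >= 25 and c.acceptance_rate() >= 0.8,
    ("qualified", "trusted_reviewer"): lambda c: c.accepted_prompts >= 200 and c.acceptance_rate() >= 0.9,
}

@dataclass
class Contributor:
    name: str
    tier: str = "applicant"
    training_modules_completed: int = 0
    submitted_prompts: int = 0
    accepted_prompts: int = 0
    history: list = field(default_factory=list)  # auditable record of status changes

    def acceptance_rate(self) -> float:
        return self.accepted_prompts / self.submitted_prompts if self.submitted_prompts else 0.0

    def try_promote(self) -> bool:
        """Advance one tier if the documented criteria are met, logging the change."""
        idx = TIERS.index(self.tier)
        if idx + 1 >= len(TIERS):
            return False
        nxt = TIERS[idx + 1]
        if PROMOTION_RULES[(self.tier, nxt)](self):
            self.history.append((datetime.now(timezone.utc).isoformat(), self.tier, nxt))
            self.tier = nxt
            return True
        return False
```

The point is that promotion is a function of recorded evidence, and every status change leaves an auditable entry.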

Dual-path review for boundary cases. AILuminate uses an “LLM-as-judge” approach. Using specialized evaluator models to score responses is highly scalable, but every LLM-as-judge has limits. When a prompt is ambiguous, culturally nuanced, or tests a difficult risk boundary, the evaluator may struggle to produce a judgment with high confidence. Industry-wide, benchmarks lack infrastructure to handle these items: they are typically either included with noisy scores or quietly excluded, and in neither case does human review fill the gap. We think this convention gets it backwards. Evaluator-incompatible prompts often test the most important and hardest boundaries, the cases where human judgment matters most. The Prompt Stewardship System routes these cases to qualified human reviewers, building ground truth in the zones that are most challenging to measure.
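A simple sketch of the routing logic, assuming the evaluator exposes a confidence score (the threshold and field names are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    prompt_id: str
    verdict: str        # e.g. "safe" / "unsafe"
    confidence: float   # evaluator's own confidence in [0, 1]

def route(results, confidence_floor=0.85):
    """Split evaluator outputs into an automated path and a human-review path.

    The confidence floor is a placeholder; in practice it would be calibrated
    against observed evaluator-human agreement.
    """
    automated, needs_human = [], []
    for r in results:
        (automated if r.confidence >= confidence_floor else needs_human).append(r)
    return automated, needs_human

results = [
    EvaluatorResult("p-001", "safe", 0.97),
    EvaluatorResult("p-002", "unsafe", 0.62),   # ambiguous boundary case
]
automated, needs_human = route(results)
print([r.prompt_id for r in needs_human])  # ['p-002'] goes to qualified reviewers
```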

Human ground truth density. Most benchmarks, including AILuminate v1.0, conduct human review on an as-needed basis that relies on individual judgment about where and when human oversight is warranted. That approach is reasonable, but it is a different paradigm from treating human review as a measured, tracked property of the benchmark. Doing the latter requires a ground truth density metric: a dataset-level measure of how much of the prompt set has been validated by qualified human reviewers, with coverage tracked across the hazard taxonomy. The metric converts human oversight from an ad hoc practice into a reportable, tiered coverage target, allowing MLCommons to make quantitative claims about the level of human review underlying benchmark results.
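A sketch of how the metric might be computed and reported; the hazard category names and coverage targets below are purely illustrative:

```python
from collections import defaultdict

def ground_truth_density(prompts, targets):
    """Fraction of prompts per hazard category validated by qualified human reviewers.

    `prompts` is an iterable of (hazard_category, human_validated) pairs;
    `targets` maps categories to tiered coverage targets.
    """
    validated, total = defaultdict(int), defaultdict(int)
    for category, human_validated in prompts:
        total[category] += 1
        validated[category] += int(human_validated)
    report = {}
    for category in total:
        density = validated[category] / total[category]
        report[category] = {
            "density": round(density, 3),
            "target": targets.get(category, 0.0),
            "meets_target": density >= targets.get(category, 0.0),
        }
    return report

# Example: a high-severity category might carry a higher coverage target.
targets = {"child_sexual_exploitation": 0.50, "defamation": 0.20}
prompts = ([("child_sexual_exploitation", True)] * 40 +
           [("child_sexual_exploitation", False)] * 60 +
           [("defamation", True)] * 25 + [("defamation", False)] * 75)
print(ground_truth_density(prompts, targets))
```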

Whitelisted testing channels. Prompts designed to probe AI risk are, by definition, designed to elicit harmful responses. Submitting thousands of such prompts through standard LLM API access would trigger the abuse detection mechanisms that providers use to protect their platforms. The Prompt Stewardship System operates through whitelisted channels: direct agreements with AI system providers that authorize the submission of evaluation prompts. Few, if any, academic benchmarks maintain this infrastructure, and it is a core reason why institutional benchmarking at this scale requires an independent organization with established provider relationships.

Auditable provenance. Every prompt will carry a documented record of who wrote it, when, under what methodology, and why it was included. As a 501(c)(6) nonprofit that produces industry-standard benchmarks, MLCommons expects benchmark decisions to withstand external scrutiny. The provenance framework ensures that prompt selection is defensible — not just technically sound, but transparently so.
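A minimal sketch of what such a provenance record could look like in code; the field names and event types are assumptions for illustration, not MLCommons’ schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ProvenanceEvent:
    timestamp: str
    action: str      # e.g. "authored", "reviewed", "included", "retired"
    actor: str       # contributor or reviewer identifier
    rationale: str

@dataclass
class PromptProvenance:
    prompt_id: str
    author_id: str
    authored_at: str
    methodology: str              # e.g. authoring guideline or spec version followed
    hazard_category: str
    inclusion_rationale: str
    events: list = field(default_factory=list)

    def log(self, action: str, actor: str, rationale: str) -> None:
        """Append an immutable audit event; the record is never edited in place."""
        self.events.append(ProvenanceEvent(
            timestamp=datetime.now(timezone.utc).isoformat(),
            action=action, actor=actor, rationale=rationale))
```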

Why this matters beyond MLCommons

Prompt stewardship addresses problems that are not unique to AILuminate. Every benchmark faces similar lifecycle challenges: staleness, contamination risk, contributor quality variance, and the tension between scale and rigor. The prompt metrics we are developing (discriminative power, difficulty calibration, overfitting resistance, non-redundancy), and the BenchRisk failure modes they are designed to mitigate, are relevant to anyone building or relying on AI evaluations. Doing this with the defensibility that an industry-standard benchmark requires means collaborating with leading academics and industry contributors to ensure the scientific grounding and practical purchase of the measurement methods we use.

We wrote this in part because we believe the methodology should be open even when the prompts themselves must remain private. Transparent prompt management processes (documented criteria, auditable decisions, public lifecycle policies) are how a privately administered benchmark earns trust. The benchmark community can scrutinize and improve the lifecycle methods without accessing the confidential prompt sets that make up the MLCommons AIRR benchmarks.

Get involved

We are building the first operational components of the Prompt Stewardship System throughout 2026. Long term, we envision a stewardship infrastructure that scales across every AI Risk & Reliability workstream, including security, agentic evaluation, and multilingual and multimodal benchmarks. A forthcoming paper will present the full technical framework, including the prompt-level and dataset-level selection criteria, BenchRisk failure mode mappings, and the contributor qualification architecture.

We want to make a direct invitation. Researchers and government agencies building AI evaluation capacity face the same challenges the Prompt Stewardship System addresses. We encourage you to join the AI Risk & Reliability working group, contribute to building and operating the Prompt Stewardship System, and co-author the paper with us. MLCommons has built the infrastructure (provider relationships, private administration, multi-stakeholder governance, a global contributor community) and it is designed to be extended. The risks the AIRR benchmarks measure require sustained attention, innovation, and investment, and we look forward to collaborating with individuals and communities who want to join us in that effort.

Isaac Holeman, PhD is a Senior Consultant at Working Paper. He leads product development for the Continuous Prompt Stewardship System at MLCommons.

1 Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. (2024). tinyBenchmarks: Evaluating LLMs with fewer examples. In Proceedings of the 41st International Conference on Machine Learning (ICML). arXiv:2402.14992. IRT-based approach to efficient LLM benchmarking; complementary to G-theory decomposition.

2 Messing (2026) shows that item-level evaluator disagreement is the dominant source of variance in LLM safety evaluation, accounting for 44–61% of total measurement variance depending on judge tier. Benchmarks that do not identify and route these high-disagreement items to human review are either publishing noisy scores or excluding the prompts where measurement matters most.