Generative AI adoption has exploded. ChatGPT alone saw roughly 8x growth in users between mid-2023 and early 2025, and every major provider — Anthropic, Google, Meta, Microsoft, Mistral, OpenAI — is shipping new models at a pace that makes six-month benchmark cycles feel like geological time. For the organizations spending millions on inference infrastructure, one question keeps getting louder: how do you actually compare these systems in a way that reflects production reality?
At GTC, MLCommons® co-founder David Kanter unveiled the answer: MLPerf® Endpoints, a ground-up rethinking of how the industry’s benchmark of record measures generative AI performance. With over 125 member organizations, more than 90,000 reproducible results to date, and recognition by IEEE and ISO/IEC SC42, MLPerf already underpins critical procurement decisions across government, industry, and academia. Endpoints is designed to keep that trust intact while adapting to a landscape that looks fundamentally different from what it was just two years ago. You can try it out here.
Why Traditional Approaches Need to Change
Traditional MLPerf inference benchmarks used a tightly coupled architecture: the load generator and model server ran as a single local process with shared dependencies. That worked well for classical ML, but generative AI deployments are API-first — whether on-prem, in the cloud, or via managed cloud endpoints.
Meanwhile, measuring GenAI performance is harder than it looks. Real serving combines accuracy, latency, throughput, and sequence length into a non-linear, multi-dimensional surface. Long-tail queries, variable arrival patterns, and tight SLAs interact in ways that simple scenarios miss entirely.
An API-Centric Architecture
MLPerf Endpoints replaces the monolithic design with a decoupled client that communicates with any model-serving API endpoint via standard interfaces such as HTTP or gRPC. The benchmark client is lightweight and production-ready; the system under test is simply a URL. This means zero-effort integration for submitters — point the client at your endpoint and run. The architecture also enables benchmarking of managed cloud services alongside bare-metal deployments on an equal footing, something the previous framework could not easily support.
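The decoupled pattern is easy to picture: the harness only needs a URL that answers HTTP. Below is a minimal sketch, not the actual MLPerf client; the route and JSON schema are invented for illustration. It stands up a stub "model endpoint" and times one request against it, treating the system under test as nothing more than a URL:

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubEndpoint(BaseHTTPRequestHandler):
    """Stand-in for a model-serving endpoint: anything that speaks HTTP
    can be a system under test. The request/response schema is invented."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps({"completion": f"echo: {body['prompt']}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # silence per-request logging
        pass

def measure_request(url: str, prompt: str) -> tuple[str, float]:
    """Send one request to the endpoint URL and time the round trip."""
    payload = json.dumps({"prompt": prompt}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        completion = json.loads(resp.read())["completion"]
    return completion, time.perf_counter() - start

# Start the stub on an ephemeral port, then benchmark it by URL alone.
server = HTTPServer(("127.0.0.1", 0), StubEndpoint)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/v1/completions"
text, latency = measure_request(url, "hello")
server.shutdown()
```

Because the client only depends on the wire protocol, the same code path works whether the URL points at a local vLLM instance, a rack-scale deployment, or a managed cloud service.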
Under the hood, a new scalable load generator uses separate worker processes, pre-warmed connection pools, and ZeroMQ-based IPC to ensure the harness itself never becomes the bottleneck, even when testing rack-scale systems.
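The worker-pool idea can be sketched in a few lines. This is illustrative only: the real harness uses separate worker processes with ZeroMQ-based IPC and pre-warmed connection pools, while threads and a stdlib queue stand in here so the example stays self-contained.

```python
import queue
import threading
import time

def worker(work: queue.Queue, results: queue.Queue) -> None:
    """Each worker drains a shared work queue independently, so no single
    loop in the harness serialises request issuance."""
    while True:
        item = work.get()
        if item is None:           # sentinel: no more work
            break
        start = time.perf_counter()
        time.sleep(0.001)          # placeholder for a real endpoint call
        results.put((item, time.perf_counter() - start))

work: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()
workers = [threading.Thread(target=worker, args=(work, results)) for _ in range(4)]
for t in workers:
    t.start()
for i in range(100):               # enqueue 100 timed requests
    work.put(i)
for _ in workers:                  # one sentinel per worker
    work.put(None)
for t in workers:
    t.join()
latencies = [results.get()[1] for _ in range(100)]
print(f"completed {len(latencies)} requests")
```

Replacing the in-process queue with ZeroMQ sockets and the threads with processes removes the interpreter and the queue itself as shared bottlenecks, which is what matters once the system under test is a rack-scale cluster.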
Pareto Curves and Step Functions: New Visualization for New Metrics and Easy Comparison
One of the most compelling innovations is how results are presented. Each benchmark run varies concurrency and captures key metrics: TTFT (time to first token), throughput (tokens per second), interactivity (tokens per second per user), and response latency. Submitters tune parallelism and batch settings for each operating point, and the visualizer plots the results as a Pareto curve (e.g., throughput vs. interactivity). This gives buyers an immediate picture of the real-world trade-offs, such as serving more users versus keeping each user's experience responsive.
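The Pareto curve itself is straightforward to derive from measured operating points: discard any point that some other point beats on both axes. A minimal sketch, with invented metric pairs rather than real submission data:

```python
def pareto_frontier(points):
    """Keep only operating points that are not dominated, i.e. where no
    other point has both higher throughput (tokens/s, system-wide) and
    higher interactivity (tokens/s per user)."""
    frontier = [
        p for p in points
        if not any(q[0] > p[0] and q[1] > p[1] for q in points)
    ]
    return sorted(frontier)

# Hypothetical measured operating points: (throughput, interactivity).
measured = [(1200, 18), (2400, 12), (3100, 7), (2000, 10), (900, 15)]
print(pareto_frontier(measured))  # → [(1200, 18), (2400, 12), (3100, 7)]
```

The points (2000, 10) and (900, 15) drop out because another configuration serves more tokens overall while also giving each user a faster stream; only the genuine trade-off points survive.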
Crucially, MLPerf Endpoints uses step functions rather than interpolated trend lines. GenAI performance is highly non-linear; interpolating between measured points can suggest performance levels that were never actually achieved, masking memory overflows or P99 latency spikes. Step functions show only verified operating points, eliminating what the presentation aptly called “paper performance.” Customers can easily compare these step functions against each other and match these verified points to their own use cases — high concurrency during the day, best possible interactivity at night.
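In code terms, a step function over verified points is a "largest measured point at or below the query" lookup rather than a line fit. A small sketch, with invented operating points:

```python
import bisect

def verified_throughput(operating_points, concurrency):
    """Step-function lookup: return the throughput of the highest verified
    concurrency <= the requested one. No interpolation, so we never report
    a performance level that was not actually measured."""
    concurrencies = [c for c, _ in operating_points]
    i = bisect.bisect_right(concurrencies, concurrency) - 1
    if i < 0:
        return 0.0  # below the lowest verified point: nothing demonstrated
    return operating_points[i][1]

# Hypothetical verified (concurrency, tokens/s) points, sorted by concurrency.
points = [(1, 95.0), (8, 640.0), (32, 2100.0), (128, 5200.0)]
print(verified_throughput(points, 64))  # prints 2100.0
```

A query at 64 concurrent users returns the verified 32-user figure, not an optimistic blend of the 32- and 128-user results, which is exactly the "no paper performance" guarantee.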
Rolling Submissions: Benchmarking at the Speed of Software Updates
Perhaps the boldest change is operational. MLPerf has historically published results on a fixed schedule, typically twice a year, that varies by benchmark suite (Training, Inference, Storage, etc.). In a market where major model releases land every few weeks, that cadence is too slow for buyers writing RFPs and vendors launching hardware. Starting in Q2 2026, MLPerf Endpoints will move to continuous rolling submissions: submitters can publish peer-reviewed, audited results at any time. Incremental submissions let vendors start with a baseline Pareto curve and iteratively add more operating points as their software stack matures.
The approach is inspired by proven methodologies from other industry standards bodies such as SPEC and TPC, adapted to the world of AI. Peer review and audit requirements will remain fully intact to deliver the robustness that the industry demands.
What Comes Next
The first MLPerf Endpoints v0.5 demonstration features results from AMD, Google, Intel, KRAI, and NVIDIA, backed by over 30 supporting organizations, including Argonne National Laboratory, Broadcom, Dell, HPE, Lambda, Lenovo, Oracle, Red Hat, and the University of Florida. The results include models such as DeepSeek-R1, GPT OSS 120B, Llama 3.1 8B, Qwen3 Coder 480B, and more, running on nearly a dozen different systems.
Looking ahead, MLCommons is inviting the broader ecosystem to shape what comes next. Enterprise and IT buyers can join the advisory council. OEMs, CSPs, and ODMs can contribute results to the rolling leaderboard. Model developers and API providers can integrate next-generation SOTA models and build managed roadmaps. Researchers can anchor reproducible baselines using the Endpoints framework. New models — especially popular and commercially relevant ones — are continuously evaluated for inclusion. You can try it out for yourself here.
Get involved: MLPerf Endpoints rolling submissions open in Q2 2026. To participate, contribute, or learn more, visit https://mlcommons.org/benchmarks/endpoints/ or join our working group.