Benchmark Suite Results
MLPerf Endpoints
No matter the workload, no matter the hardware, no matter the deployer, GenAI is an API endpoint. That is why we have started the MLPerf Endpoints Working Group at MLCommons.
The mission of this working group is to build a trusted, open, fair, and reproducible MLPerf benchmark for evaluating GenAI endpoint performance. MLPerf has become the industry standard, but it needs to evolve to bring clarity and velocity to the GenAI era. We believe this work on endpoints, combined with quality targets and backed by a consensus-based governance model, will expand the AI community and advance capabilities for the entire ecosystem.
If it has an API, we can measure it. If this mission speaks to you, we invite you to join us.
We released the first demonstration version of MLPerf Endpoints on 19 March 2026 at GTC with the support of over 30 organizations, including submissions from five member organizations: AMD, Intel, Google, Krai, and NVIDIA. You can try it here.

See How Systems Really Perform Across Their Full Operating Range
- Start with what matters to you. Select the model most relevant to your use case, then compare systems side-by-side on a Throughput vs. Interactivity graph — instantly see the real-world tradeoffs across the full operating range, not just peak numbers.
- No more single-point measurements. Concurrency X-axis graphs for Throughput, Time to First Token (TTFT), and Interactivity show you the complete performance surface. Use the concurrency selector to focus on the region that matches your production workload — for example, systems handling at least 10 simultaneous users.
- Find the right system for your use case. Filter and compare systems by accelerator, software stack, and more. Whether you’re evaluating on-prem infrastructure or managed API endpoints, the data is right there.
- Hover over any run to understand utilization. Each point on the curve shows tokens/sec for that run relative to the system’s peak throughput — so you can see not just what a system can do, but how hard it’s working to do it.
- Click any point to get the full picture. Every data point links to a detailed run report covering the System Under Test (SUT) summary, model and dataset, node-level hardware and software descriptions (including heterogeneous and disaggregated systems), and complete run data: concurrency, TTFT, TPOT, tokens/sec, queries per second, and more. (The sketch after this list shows one way these per-request metrics can be measured.)
- Transparent, reproducible, and auditable. Every result is peer-reviewed and self-contained. The detail you see here is the same detail available to anyone — buyers, analysts, and the industry at large.
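To make these metrics concrete, here is a minimal sketch (not the MLPerf Endpoints harness) of how TTFT and TPOT might be measured for a single streaming request against any OpenAI-compatible endpoint. The base URL, API key, model name, and prompt are placeholders, and counting stream chunks is only a rough proxy for output tokens; the real benchmark drives many concurrent requests and reports peer-reviewed results.

```python
# Illustrative sketch only -- not the MLPerf Endpoints client.
# Assumes an OpenAI-compatible server at a placeholder base_url.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder endpoint

start = time.perf_counter()
first_token_time = None
chunk_count = 0

stream = client.chat.completions.create(
    model="example-model",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the MLPerf Endpoints goals."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # arrival of the first generated token
        chunk_count += 1            # rough proxy for output tokens

end = time.perf_counter()
ttft = first_token_time - start                            # Time to First Token
tpot = (end - first_token_time) / max(chunk_count - 1, 1)  # Time Per Output Token
print(f"TTFT {ttft:.3f}s  TPOT {tpot * 1000:.1f}ms  ~{chunk_count / (end - start):.1f} tok/s")
```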
Goals
MLPerf Inference is the industry standard benchmark for AI system performance and efficiency. MLPerf Endpoints extends this foundation for the GenAI era, delivering on five key goals:
- Velocity — shift from a fixed twice-yearly schedule to a continuous, flexible rolling submission process; making the benchmark easier to set up, run, and submit means more tests in less time; rapidly update the suite with day-zero support for new models and platforms, so vendors can include real MLPerf Endpoints results in new product launch material and buyers can request MLPerf Endpoints scores in RFPs
- Clarity — measure performance that mirrors the real customer deployment experience; Pareto curves capture performance across a broader range of use cases; results are easier to understand and compare visually across systems, making MLPerf Endpoints data more accessible to a wide range of users (including system purchasers)
- API endpoint-centric architecture — simplified, production-ready, lightweight, and decoupled; if a system has an API, it can be benchmarked; measures everything from on-prem systems to managed endpoints
- Standardized Pareto curves — each run captures TTFT, Throughput, Interactivity, and Query Latency; customers can match results to different production use cases (e.g., high usage during the day, low at night); see the Pareto-frontier sketch after this list
- Broad participation — diverse members and solutions, encompassing developers, enterprise buyers, CSPs, OEMs, and open-source contributors
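As a rough illustration of the Pareto idea, the sketch below takes hypothetical per-concurrency measurements (the numbers are invented, not benchmark results) and keeps only the points that are not dominated on both throughput and interactivity, which is the frontier the results viewer plots.

```python
# Illustrative sketch: build a throughput-vs-interactivity Pareto frontier
# from per-concurrency measurements. All numbers below are made up.
measurements = [
    # (concurrency, system throughput tok/s, interactivity tok/s per user)
    (1,     95,  95.0),
    (8,    640,  80.0),
    (32,  1900,  59.0),
    (128, 4200,  33.0),
    (256, 4100,  20.0),  # dominated by the 128-user point
    (512, 5100,   9.5),
]

def pareto_frontier(points):
    """Keep points for which no other point is at least as good on both axes."""
    frontier = []
    for c, thr, inter in points:
        dominated = any(
            t2 >= thr and i2 >= inter and (t2, i2) != (thr, inter)
            for _, t2, i2 in points
        )
        if not dominated:
            frontier.append((c, thr, inter))
    return frontier

for c, thr, inter in pareto_frontier(measurements):
    print(f"concurrency={c:4d}  throughput={thr:5d} tok/s  interactivity={inter:5.1f} tok/s/user")
```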
Principles
- Relevant
- Focus on the important problems; mirror customer deployments to ensure relevance
- GenAI performance is a complex, non-linear, and multi-dimensional surface
- Real-world traffic involves “long-tail” queries, and latency explodes as utilization nears its peak; measuring only averages ignores this reality (illustrated with a simple queueing model after this list)
- Pareto curves accurately and realistically measure performance for the full range of customer use cases
- Fair and neutral
- MLCommons is a non-profit, committed to neutrality and fairness for all
- Clear rules for submissions under different categories, which apply to all submitters
- Architectural neutrality across hardware, software stacks, and deployment models
- Reproducible
- Extensive rules and well-documented workloads; robust peer-review of results with auditing
- Decoupled client-server architecture — submitted results are self-contained and reproducible by third parties
- Enables customers and the whole ecosystem to trust results and develop best practices
- Inclusive
- Well-structured governance with robust participation that drives industry consensus
- Open-source codebase, broadly accessible; standard OpenAI-compatible API interface
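The “latency explodes” point above reflects basic queueing behavior rather than anything specific to MLPerf Endpoints. As a simple illustration, in a textbook M/M/1 queue the mean time in the system is 1/(service rate - arrival rate), which diverges as utilization approaches 1; the numbers below are made up.

```python
# Illustrative M/M/1 queueing sketch (not an MLPerf formula): mean latency
# grows without bound as utilization approaches 1.
service_rate = 100.0  # requests/s the system can serve (made-up capacity)
for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    arrival_rate = utilization * service_rate
    mean_latency = 1.0 / (service_rate - arrival_rate)  # mean time in system, seconds
    print(f"utilization={utilization:.2f}  mean latency={mean_latency * 1000:6.1f} ms")
```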
Call for Participation
In an era of black-box benchmarks, help us build the one the industry actually trusts. Rolling submissions start in Q2 2026.
- Enterprise & IT Buyers — influence the standard by joining the advisory council; ensure the gold-standard benchmark tests what matters to you; benefit from trusted, up-to-date performance data that simplifies RFP requirements
- Infrastructure & Software (OEMs, CSPs, ODMs) — demonstrate leadership by shaping the spec and adding results to the rolling leaderboard
- Model Developers & API Providers — scale the roadmap by integrating next-gen SOTA models and building roadmaps for managed cloud endpoints
- Researchers & Community — anchor your science by using MLPerf Endpoints for reproducible baselines, and contribute feedback