New MLPerf Training v4.1 Benchmarks Highlight Industry’s Focus on New Systems and Generative AI Applications

Today, MLCommons® announced new results for the MLPerf® Training v4.1 benchmark suite, including several preview category submissions using the next generation of accelerator hardware. The v4.1 round also saw increased participation in the benchmarks that represent generative AI model training, highlighting the strong alignment between the benchmark suite and the current direction of the AI industry.

New generations of hardware accelerators

MLPerf Training v4.1 includes preview category submissions using new hardware accelerators that will be generally available in the next round:

Google “Trillium” TPUv6 accelerator (preview)
NVIDIA “Blackwell” B200 accelerator (preview)

“As AI-targeted hardware rapidly advances, the value of the MLPerf Training benchmark becomes more important as an open, transparent forum for apples-to-apples comparisons,” said Hiwot Kassa, MLPerf Training working group co-chair. “Everyone benefits: vendors know where they stand versus their competitors, and their customers have better information as they procure AI training systems.”

MLPerf Training v4.1

The MLPerf Training benchmark suite comprises full system tests that stress models, software, and hardware for a range of machine learning (ML) applications. The open-source and peer-reviewed benchmark suite provides a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry.

The latest Training v4.1 results show a substantial shift in submissions for the three benchmarks that represent “generative AI” training workloads: GPT3, Stable Diffusion, and Llama 2 70B LoRA fine-tuning. Total submissions across those three benchmarks increased by 46%.

The two newest benchmarks in the MLPerf Training suite, Llama 2 70B LoRA and Graph Neural Network (GNN), both had notably higher submission rates: a 16% increase for Llama 2, and a 55% increase for GNN. They both also saw significant performance improvements in the v4.1 round compared to v4.0 when they were first introduced, with a 1.26X speedup in the best training time for Llama 2 and a 1.23X speedup for GNN.
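For readers who want to reproduce this kind of comparison from the published results, the sketch below shows how a speedup factor and a submission growth percentage are conventionally computed. It assumes speedup is the ratio of the earlier best time-to-train to the newer one. The function names and all input numbers are hypothetical placeholders chosen for illustration; they are not figures from the MLPerf results tables.

```python
def speedup(old_minutes: float, new_minutes: float) -> float:
    """Ratio of the earlier best time-to-train to the newer one (>1 means faster)."""
    if old_minutes <= 0 or new_minutes <= 0:
        raise ValueError("training times must be positive")
    return old_minutes / new_minutes


def percent_increase(old_count: int, new_count: int) -> float:
    """Relative growth in submission count between two rounds, in percent."""
    if old_count <= 0:
        raise ValueError("the earlier count must be positive")
    return 100.0 * (new_count - old_count) / old_count


# Hypothetical inputs, not taken from the actual v4.0 or v4.1 results.
example_speedup = speedup(old_minutes=12.6, new_minutes=10.0)
example_growth = percent_increase(old_count=20, new_count=31)
print(f"{example_speedup:.2f}X speedup, {example_growth:.0f}% more submissions")
```

Substituting the best time-to-train and submission counts from two rounds of the results tables gives figures in the same form as those quoted above.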

“The MLCommons Training working group regularly updates the benchmark to keep pace with industry trends, and the trend in submissions we are seeing validates that our open process is giving stakeholders the information they are looking for. In this round we see continued steady, incremental progress in improving AI training performance,” observed Shriya Rishab, MLPerf Training working group co-chair.

Continued Robust Industry Participation

The MLPerf Training v4.1 round includes 155 results from 17 submitting organizations: ASUSTeK, Azure, Cisco, Clemson University Research Computing and Data, Dell, FlexAI, Fujitsu, GigaComputing, Google, Krai, Lambda, Lenovo, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Tiny Corp.

“We would especially like to welcome first-time MLPerf Training submitters FlexAI and Lambda,” said David Kanter, Head of MLPerf at MLCommons. “We are also very excited to see Dell’s first MLPerf Training results that include power measurement; as AI adoption skyrockets, it is critical to measure both the performance and the energy efficiency of AI training.”

Participation in the benchmarking process by a broad set of industry stakeholders, as well as by academic groups, strengthens the AI/ML ecosystem as a whole and helps to ensure that the benchmark is serving the community’s needs. We invite submitters and other stakeholders to join the MLPerf Training working group and help us continue to evolve the benchmark.

View the results

To view the full results for MLPerf Training v4.1 and find additional information about the benchmarks, please visit the Training benchmark page.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact [email protected].