Benchmark Suite Results

MLPerf Training

The MLPerf Training benchmark suite measures how fast systems can train models to a target quality metric. Current and previous results can be reviewed through the results dashboard below.

The MLPerf Training benchmark paper provides a detailed description of the motivation and guiding principles behind the MLPerf Training benchmark suite.

Results

MLCommons results are shown in an interactive table that you can explore. Apply filters to see only the information you want, and click the tabs across the top to view the results visually. To see all result details, expand the columns by clicking the “+” icon, which appears when you hover over “System Name” and subsequent columns.

Published results are sometimes modified or invalidated for various reasons. The change log contains information about changes made to any results after their initial publication.

Benchmarks

Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite. The rules remain the official source of truth.

Area	Benchmark	Dataset	Quality Target	Reference Implementation Model	Latest Version Available
Vision	Object detection (light weight)	Open Images	34.0% mAP	RetinaNet	v5.1
Language	NLP	C4	TBD	Llama 3.1 8B	v5.1
Language	LLM	C4	5.6 log perplexity	Llama 3.1 405B	v5.1
Language	LLM finetuning	SCROLLS GovReport	0.925 cross entropy loss	Llama 2 70B	v5.1
Commerce	Recommendation	Criteo 4TB multi-hot	0.8032 AUC	DLRM-dcnv2	v5.1
Marketing, Art, Gaming	Image Generation	cc12m	TBD	FLUX.1	v5.1
Graph neural network	Graph neural network (GNN)*	IGBH-Full	72% classification accuracy	R-GAT	v5.1
Language	NLP	Wikipedia 2020/01/01	0.72 Mask-LM accuracy	BERT-large	v5.0
Marketing, Art, Gaming	Image Generation	LAION-400M-filtered	FID<=90 and CLIP>=0.15	Stable Diffusion v2	v5.0
Language	LLM	C4	2.69 log perplexity	GPT3	v4.1
Vision	Image classification	ImageNet	75.90% classification	ResNet-50	v4.1
Vision	Image segmentation (medical)	KiTS19	0.908 Mean DICE score	3D U-Net	v4.0
Vision	Object detection (heavy weight)	COCO	0.377 Box min AP and 0.339 Mask min AP	Mask R-CNN	v3.1
Language	Speech recognition	LibriSpeech	0.058 Word Error Rate	RNN-T	v3.1
Commerce	Recommendation	1TB Click Logs	0.8025 AUC	DLRM	v2.1
Research	Reinforcement learning	Go	50% win rate vs. checkpoint	Mini Go (based on Alpha Go paper)	v2.1
Vision	Object detection (light weight)	COCO	23.0% mAP	SSD	v1.1
Language	Translation (recurrent)	WMT English-German	24.0 Sacre BLEU	NMT	v0.7
Language	Translation (non-recurrent)	WMT English-German	25.00 BLEU	Transformer	v0.7

Scenarios & Metrics

Each benchmark measures the wall clock time required to train a model on the specified dataset to achieve the specified quality target.
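The time-to-train metric described above can be sketched as a loop that trains and evaluates until the quality target is met, then reports elapsed wall clock time. This is a minimal illustration, not the official harness: `train_one_epoch`, `evaluate`, and `max_epochs` are hypothetical names, and the `higher_is_better` flag is an assumption added because some targets in the table (accuracy, AUC) are lower bounds while others (log perplexity, word error rate) are upper bounds.

```python
import time
from typing import Callable, Tuple


def time_to_train(
    train_one_epoch: Callable[[], None],   # hypothetical: runs one epoch of training
    evaluate: Callable[[], float],         # hypothetical: returns the current quality metric
    target: float,
    higher_is_better: bool = True,
    max_epochs: int = 1000,
) -> Tuple[float, int]:
    """Return (elapsed_seconds, epochs_run) once the quality target is reached."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        quality = evaluate()
        # Accuracy-style metrics must rise to the target; loss-style metrics must fall to it.
        reached = quality >= target if higher_is_better else quality <= target
        if reached:
            return time.perf_counter() - start, epoch
    raise RuntimeError("quality target not reached within max_epochs")
```

For a loss-style target such as a word error rate, the same function would be called with `higher_is_better=False`.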

To account for the substantial variance in ML training times, final results are obtained by running each benchmark a benchmark-specific number of times, discarding the lowest and highest results, and averaging the remaining results. Even this multi-run average does not eliminate all variance: results for imaging benchmarks vary by very roughly +/- 2.5%, and results for other benchmarks by very roughly +/- 5%.
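The aggregation described above (drop the lowest and highest run, average the rest) can be sketched in a few lines. The function name `olympic_mean` and the example run times are illustrative assumptions; the required number of runs per benchmark is set by the rules and is not encoded here.

```python
from statistics import mean
from typing import List


def olympic_mean(run_times: List[float]) -> float:
    """Average the run times after discarding the single lowest and single highest."""
    if len(run_times) < 3:
        raise ValueError("need at least three runs to drop the lowest and highest")
    ordered = sorted(run_times)
    return mean(ordered[1:-1])  # skip the first (lowest) and last (highest) entry


# Hypothetical run times in minutes; 9.0 and 30.0 are discarded.
print(olympic_mean([10.0, 12.0, 11.0, 30.0, 9.0]))  # 11.0
```

Dropping the extremes keeps a single unusually slow or fast run from dominating the reported result, at the cost of needing at least three runs.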

For non-HPC training, results that converged in fewer epochs than a reference-implementation run with the same hyperparameters were normalized to the expected number of epochs.
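One plausible reading of this normalization is a linear rescaling of the measured time by the ratio of expected to observed epochs. The sketch below assumes exactly that and nothing more; the function name, the linear-scaling formula, and the choice to leave slower-converging runs untouched are all assumptions, and the official rules define the actual procedure.

```python
def normalize_to_expected_epochs(
    measured_time: float,
    observed_epochs: float,
    expected_epochs: float,
) -> float:
    """Scale a time-to-train result up to the expected epoch count (assumed linear)."""
    if observed_epochs <= 0 or expected_epochs <= 0:
        raise ValueError("epoch counts must be positive")
    if observed_epochs >= expected_epochs:
        # Only runs that converged in fewer epochs than expected are adjusted.
        return measured_time
    return measured_time * expected_epochs / observed_epochs


# Hypothetical numbers: a 40-minute run that converged in 8 epochs instead of 10.
print(normalize_to_expected_epochs(40.0, 8, 10))  # 50.0
```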

Divisions

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. There are two Divisions that allow different levels of flexibility during reimplementation:

The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation.
The Open division is intended to foster innovation and allows using a different model or retraining.

Availability

MLPerf divides benchmark results into Categories based on availability.

Available systems contain only components that are available for purchase or for rent in the cloud.
Preview systems must be submittable as Available in the next submission round.
Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.

Submission Information

Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:

Submitter
The organization that submitted the results.

Software
The ML framework and primary ML hardware library used.

System
General system description.

Benchmark Results
Results for each benchmark as described above.

Processor and Count
The type and number of CPUs used, if CPUs perform the majority of ML compute.

Details
Link to metadata for submission.

Accelerator and Count
The type and number of accelerators used, if accelerators perform the majority of ML compute.

Code
Link to code for submission.
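The row fields described above could be modeled as a small record type, which also makes the division and availability filters easy to express. This is only an illustrative sketch: `ResultRow`, `Division`, `Availability`, `filter_rows`, and all field names are hypothetical and do not reflect the actual schema of the results table.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Division(Enum):
    CLOSED = "Closed"
    OPEN = "Open"


class Availability(Enum):
    AVAILABLE = "Available"
    PREVIEW = "Preview"
    RDI = "Research, Development, or Internal"


@dataclass
class ResultRow:
    submitter: str                      # organization that submitted the results
    software: str                       # ML framework and primary ML hardware library
    system: str                         # general system description
    division: Division
    availability: Availability
    processor: Optional[str] = None     # set if CPUs perform the majority of ML compute
    processor_count: int = 0
    accelerator: Optional[str] = None   # set if accelerators perform the majority of ML compute
    accelerator_count: int = 0
    # Benchmark name -> reported result, in whatever unit the table uses.
    benchmark_results: Dict[str, float] = field(default_factory=dict)
    details_url: Optional[str] = None   # link to submission metadata
    code_url: Optional[str] = None      # link to submission code


def filter_rows(
    rows: List[ResultRow], division: Division, availability: Availability
) -> List[ResultRow]:
    """Keep only rows in the given division and availability category."""
    return [r for r in rows if r.division is division and r.availability is availability]
```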

Open Divisions

Each Open division row may add the following information:

Model Used
The model used to produce the results, which may or may not match the Closed Division requirement.

Notes
Arbitrary notes from submitter.

Power Measurements

Each row will add columns for each benchmark containing the following:

System Power
for Server and Offline scenarios, or…

Energy Per Stream
for Single stream and Multiple stream scenarios

Rules

Reference Implementations

Results Usage Guidelines*
