AlgoPerf: Training Algorithms Benchmark Results
Overview
The AlgoPerf: Training Algorithms benchmark measures how much faster we can train neural network models to a given target performance by changing the underlying training algorithm, e.g. the optimizer or the hyperparameters.
For a detailed motivation, description, and explanation of the AlgoPerf: Training Algorithms benchmark see the AlgoPerf benchmark paper.
Results
Below are the AlgoPerf competition leaderboards for both the external tuning and the self-tuning rulesets. We also provide the performance profiles as well as the detailed results, with all individual workload runtimes for you to explore.
Each line in the plot represents the performance profile of a single submission, summarizing its performance across all tested workloads and comparing it to the best submission for each workload. The final benchmark score is the integrated performance profile, i.e. the area under the curve, with higher benchmark scores being better. A step in the performance profile occurs at 𝜏 if, on one workload, the submission reaches the target performance in 𝜏 times the fastest submission's time. For example, a step at 𝜏=2.0 indicates that, for one workload, this submission takes exactly twice as long to reach the target performance as the workload's fastest submission.
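For reference, the performance profile and the resulting benchmark score can be written as follows. This is a sketch following the standard performance-profile definition used in the AlgoPerf benchmark paper; the notation below (ρ for the profile, t for time-to-target, W for the set of workloads, and r_max for the upper end of the plotted 𝜏 range) is chosen here for illustration.

$$
\rho_{\bar{s}}(\tau) \;=\; \frac{1}{|\mathcal{W}|}\,\left|\left\{\, w \in \mathcal{W} \;:\; \frac{t_{\bar{s},w}}{\min_{s} t_{s,w}} \le \tau \,\right\}\right|,
\qquad
B_{\bar{s}} \;=\; \frac{1}{r_{\max} - 1}\int_{1}^{r_{\max}} \rho_{\bar{s}}(\tau)\,\mathrm{d}\tau .
$$

Dividing by r_max − 1 keeps the integrated score between zero and one, consistent with the benchmark scores reported below.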
The detailed results report each submission's time as a fraction of the workload's runtime budget. Submissions that did not reach the target on a workload have a fraction of 'inf' in the table. Submissions whose code raised an error before any evaluation was completed are marked with '*'.
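To connect the per-workload times to the plotted profiles, here is a minimal pure-Python sketch that builds a performance profile and its normalized area from per-workload times-to-target, with math.inf marking a missed target. All function names and numbers are hypothetical; this is not the benchmark's actual scoring code.

```python
import math

def performance_profile(times, all_times, taus):
    """For each tau, the fraction of workloads on which this submission's
    time-to-target is within tau times the per-workload best time."""
    best = {w: min(all_times[w]) for w in all_times}
    ratios = {w: times[w] / best[w] for w in times}
    return [sum(r <= tau for r in ratios.values()) / len(ratios) for tau in taus]

def benchmark_score(profile, taus):
    """Normalized area under the performance profile (between 0 and 1),
    computed with the trapezoidal rule."""
    area = sum(
        0.5 * (profile[i] + profile[i + 1]) * (taus[i + 1] - taus[i])
        for i in range(len(taus) - 1)
    )
    return area / (taus[-1] - taus[0])

# Hypothetical example: two workloads, and submission "A" misses one target.
all_times = {"wmt": [40_000.0, 45_000.0], "ogbg": [math.inf, 15_000.0]}
times_a = {"wmt": 40_000.0, "ogbg": math.inf}
taus = [1.0 + 0.01 * i for i in range(301)]  # tau from 1.0 to 4.0
print(benchmark_score(performance_profile(times_a, all_times, taus), taus))  # ≈ 0.5
```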
Benchmark Workloads
To test that submitted training algorithms can train a variety of deep learning models, the benchmark contains eight fixed workloads. Submissions are scored based on their time-to-results across all eight workloads. Each workload consists of a dataset, model, loss function, and performance target. Submissions cannot modify any part of the workloads and must train all workloads without any additional workload-specific settings (see the Ruleset section). The following table summarizes the workloads in this iteration of the benchmark.
Task | Dataset | Model | Loss | Metric | Validation Target | Runtime Budget (in secs) |
---|---|---|---|---|---|---|
Clickthrough rate prediction | Criteo 1TB | DLRMsmall | CE | CE | 0.123735 | 7,703 |
MRI reconstruction | fastMRI | U-Net | L1 | SSIM | 0.723653 | 8,859 |
Image classification | ImageNet | ResNet-50 | CE | ER | 0.22569 | 63,008 |
Image classification | ImageNet | ViT | CE | ER | 0.22691 | 77,520 |
Speech recognition | LibriSpeech | Conformer | CTC | WER | 0.085884 | 61,068 |
Speech recognition | LibriSpeech | DeepSpeech | CTC | WER | 0.119936 | 55,506 |
Molecular property prediction | OGBG | GNN | CE | mAP | 0.28098 | 18,477 |
Translation | WMT | Transformer | CE | BLEU | 30.8491 | 48,151 |
Summary of the fixed base workloads used in the AlgoPerf benchmark. The possible losses are the cross-entropy loss (CE), the mean absolute error (L1), and the Connectionist Temporal Classification loss (CTC). The evaluation metrics additionally include the structural similarity index measure (SSIM), the error rate (ER), the word error rate (WER), the mean average precision (mAP), and the bilingual evaluation understudy score (BLEU).
Ruleset
To ensure generally useful training algorithms, submissions must automate and account for any required workload-specific tuning. Submissions to the benchmark compete under two separate tuning rulesets:
- The external tuning ruleset simulates hyperparameter tuning with a fixed amount of parallel resources. Multiple hyperparameter settings can be tried in parallel, and training stops once one such setting reaches the desired target performance.
- In contrast, the self-tuning ruleset simulates tuning on a single machine, where all workload-specific tuning must be performed within the training time of a single run.
Both rulesets have individual leaderboards and winners.
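To make the practical difference concrete, here is a minimal sketch of how a single workload's time-to-target could be determined under each ruleset, based only on the descriptions above. The function names are hypothetical and this is not the benchmark's actual scoring code.

```python
import math

def external_tuning_time(trial_times_to_target):
    """External tuning ruleset (sketch): hyperparameter settings run in
    parallel and training stops once the first one reaches the target,
    so the workload's time is the fastest trial's time-to-target."""
    return min(trial_times_to_target)

def self_tuning_time(run_time_to_target):
    """Self-tuning ruleset (sketch): a single run on a single machine;
    any workload-specific tuning must happen within this run, so the
    workload's time is simply that run's time-to-target."""
    return run_time_to_target

# Hypothetical example: three parallel tuning trials, one of which misses.
print(external_tuning_time([math.inf, 9_200.0, 7_800.0]))  # 7800.0
print(self_tuning_time(12_500.0))                          # 12500.0
```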
Standardized Benchmarking System
The AlgoPerf: Training Algorithms benchmark uses fixed hardware to ensure that submitters innovate on the training algorithm part of the ML pipeline. For this iteration of the benchmark, all results are gathered on a standardized benchmarking system with 8x V100 GPUs (16 GB of VRAM each), 240 GB of RAM, and 2 TB of storage. Standardizing the hardware and software stack allows a fair comparison of wall-clock training times across all submissions. All results were computed by the competition organizers.
Submission Information
Each row in the results table describes a submission by a submission team. Teams are allowed to submit multiple submissions, provided they are sufficiently different. All results use the same software and hardware system and only differ in the training algorithm.
Submission name
The name of the submission describing the training algorithm.
Organizations
List of affiliations of the individual team members of this submission.
Authors
Individual team members of this submission.
Fraction of runtime budget
The fraction of the wall-clock runtime budget required by the submission to reach the target performance on each workload, reported as the median across five repetitions. If a submission did not reach the target within the runtime budget, its time is reported as "infinity". A sketch of this computation follows the field descriptions below.
Benchmark score
The final benchmark score of the AlgoPerf: Training Algorithms benchmark, used to rank the submissions. All scores are between zero and one, with higher scores indicating better performance.
Prize eligibility
Whether or not the submission is eligible to win prize money. To ensure fairness, we excluded some submissions from winning prize money, e.g. those by the working group chairs or their affiliated organizations.
Framework
The ML framework used. The benchmark supports submissions in either JAX or PyTorch.
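Below is a minimal sketch of how the "Fraction of runtime budget" column described above could be computed from raw per-repetition measurements. The helper name and the numbers are hypothetical; this is not the benchmark's actual scoring code.

```python
import math
import statistics

def budget_fraction(times_to_target, runtime_budget):
    """Median, across repetitions, of the fraction of the workload's runtime
    budget needed to reach the target; repetitions that miss the target
    count as infinity."""
    fractions = [t / runtime_budget for t in times_to_target]  # inf stays inf
    return statistics.median(fractions)

# Hypothetical example on the WMT workload (runtime budget of 48,151 seconds):
times = [30_000.0, 31_000.0, 29_500.0, 32_000.0, math.inf]  # five repetitions
print(budget_fraction(times, 48_151))  # ≈ 0.64
```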
Rules
For the rules of the AlgoPerf: Training Algorithms competition see our competition rules and the technical documentation and FAQs of the benchmark.
Submissions
All submissions and training logs are available open source under an Apache 2.0 license here.
Reference Implementations
The benchmark codebase can be found here. This includes baseline submissions such as the prize-qualification baseline.
Results Usage Guidelines
If you use the results and refer to AlgoPerf results, you must follow the results guidelines. MLCommons reserves the right to determine appropriate uses of its trademark.