Benchmark Suite Results
AlgoPerf: Training Algorithms Benchmark Results
The AlgoPerf: Training Algorithms benchmark measures how much faster we can train neural network models to a given target performance by changing the underlying training algorithm, e.g. the optimizer or the hyperparameters.
For a detailed motivation, description, and explanation of the AlgoPerf: Training Algorithms benchmark, see the AlgoPerf benchmark paper.
Results
Below are the AlgoPerf competition leaderboards for both the external tuning and the self-tuning ruleset. We also provide the performance profiles as well as the detailed results, including all individual workload runtimes, for you to explore.
Each line in the plot represents the performance profile of a single submission, summarizing its performance across all tested workloads and comparing it to the best submission for each workload. The final benchmark score is the integrated performance profile, i.e. the area under the curve, with higher benchmark scores being better. A step in the performance profile occurs at 𝜏 if, for one workload, the submission achieves the target performance within 𝜏 times the fastest submission’s time. For example, a step at 𝜏=2.0 indicates that, for one workload, this submission takes exactly twice as long to reach the target performance as the workload’s fastest submission.
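To make the scoring concrete, the sketch below computes performance profiles and integrated scores from hypothetical per-workload runtimes. All submission names and numbers are illustrative, and the integration range and normalization are simplified; the official scoring code in the benchmark repository is authoritative.

```python
import numpy as np

# Hypothetical times-to-target (in seconds) per submission and workload.
# np.inf marks a workload where the submission never reached the target.
times = {
    "submission_a": {"criteo": 6000.0, "fastmri": 7000.0, "wmt": 40000.0},
    "submission_b": {"criteo": 7500.0, "fastmri": np.inf, "wmt": 35000.0},
}
workloads = list(next(iter(times.values())))

# Fastest time per workload across all submissions.
best = {w: min(times[s][w] for s in times) for w in workloads}

def performance_profile(sub, taus):
    """Fraction of workloads on which `sub` reached the target within
    tau times the fastest submission's time, for each tau."""
    ratios = np.array([times[sub][w] / best[w] for w in workloads])
    return np.array([(ratios <= tau).mean() for tau in taus])

# Integrated performance profile: area under the curve on [1, tau_max],
# normalized so a submission that is fastest everywhere scores 1.0.
tau_max = 4.0
taus = np.linspace(1.0, tau_max, 1001)
for sub in times:
    score = np.trapz(performance_profile(sub, taus), taus) / (tau_max - 1.0)
    print(f"{sub}: benchmark score = {score:.3f}")
```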
The detailed results report each submission's training time as a fraction of the corresponding workload's runtime budget. Submissions that did not reach the target on a workload are listed with a fraction of 'inf' in the table. Submissions whose code raised an error before any evaluation was completed are marked with '*'.
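As an illustration of how a single entry of the detailed-results table is derived, the following sketch (with made-up numbers) turns a run's time-to-target and the workload's runtime budget into the reported fraction.

```python
import math

def budget_fraction(time_to_target_secs, budget_secs, reached_target):
    """One entry of the detailed-results table: the submission's training
    time as a fraction of the workload's runtime budget, or inf if the
    validation target was never reached within the budget."""
    if not reached_target:
        return math.inf
    return time_to_target_secs / budget_secs

# Hypothetical example: hitting the Criteo 1TB target after 5,400 s of the
# 7,703 s budget yields a fraction of roughly 0.70.
print(budget_fraction(5_400, 7_703, reached_target=True))   # ~0.701
print(budget_fraction(None, 7_703, reached_target=False))   # inf
```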
Benchmark Workloads
To test that submitted training algorithms can train a variety of deep learning models, the benchmark contains eight fixed workloads. Submissions are scored based on their time-to-results across all eight workloads. Each workload consists of a dataset, model, loss function, and performance target. Submissions cannot modify any part of the workloads and must train all workloads without any additional workload-specific settings (see the Ruleset section). The following table summarizes the workloads in this iteration of the benchmark.
Task | Dataset | Model | Loss | Metric | Validation Target | Runtime Budget (in secs) |
---|---|---|---|---|---|---|
Clickthrough rate prediction | Criteo 1TB | DLRMsmall | CE | CE | 0.123735 | 7,703 |
MRI reconstruction | fastMRI | U-Net | L1 | SSIM | 0.723653 | 8,859 |
Image classification | ImageNet | ResNet-50 | CE | ER | 0.22569 | 63,008 |
Image classification | ImageNet | ViT | CE | ER | 0.22691 | 77,520 |
Speech recognition | LibriSpeech | Conformer | CTC | WER | 0.085884 | 61,068 |
Speech recognition | LibriSpeech | DeepSpeech | CTC | WER | 0.119936 | 55,506 |
Molecular property prediction | OGBG | GNN | CE | mAP | 0.28098 | 18,477 |
Translation | WMT | Transformer | CE | BLEU | 30.8491 | 48,151 |
Summary of the fixed base workloads used in the AlgoPerf benchmark. The possible losses are the cross-entropy loss (CE), the mean absolute error (L1), and the Connectionist Temporal Classification loss (CTC). The evaluation metrics additionally include the structural similarity index measure (SSIM), the error rate (ER), the word error rate (WER), the mean average precision (mAP), and the bilingual evaluation understudy score (BLEU).
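For readers who prefer the workload summary in machine-readable form, here is the same table expressed as a plain Python data structure. The `WorkloadSpec` class and its field names are purely illustrative and are not part of the benchmark codebase's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadSpec:
    """Illustrative summary of one fixed AlgoPerf workload (not the official API)."""
    task: str
    dataset: str
    model: str
    loss: str             # training loss: "CE", "L1", or "CTC"
    metric: str           # evaluation metric the validation target is defined on
    validation_target: float
    runtime_budget_secs: int

WORKLOADS = [
    WorkloadSpec("Clickthrough rate prediction", "Criteo 1TB", "DLRMsmall",
                 "CE", "CE", 0.123735, 7_703),
    WorkloadSpec("MRI reconstruction", "fastMRI", "U-Net",
                 "L1", "SSIM", 0.723653, 8_859),
    WorkloadSpec("Image classification", "ImageNet", "ResNet-50",
                 "CE", "ER", 0.22569, 63_008),
    WorkloadSpec("Image classification", "ImageNet", "ViT",
                 "CE", "ER", 0.22691, 77_520),
    WorkloadSpec("Speech recognition", "LibriSpeech", "Conformer",
                 "CTC", "WER", 0.085884, 61_068),
    WorkloadSpec("Speech recognition", "LibriSpeech", "DeepSpeech",
                 "CTC", "WER", 0.119936, 55_506),
    WorkloadSpec("Molecular property prediction", "OGBG", "GNN",
                 "CE", "mAP", 0.28098, 18_477),
    WorkloadSpec("Translation", "WMT", "Transformer",
                 "CE", "BLEU", 30.8491, 48_151),
]
```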
Ruleset
To ensure generally useful training algorithms, submissions must automate and account for any required workload-specific tuning. Submissions to the benchmark compete under two separate tuning rulesets:
- The external tuning ruleset simulates hyperparameter tuning with a fixed amount of parallel resources. Multiple hyperparameter settings can be tried in parallel, and training stops once one such setting reaches the desired target performance.
- In contrast, the self-tuning ruleset simulates tuning on a single machine, where all workload-specific tuning must be performed within the training time of a single run.
Both rulesets have individual leaderboards and winners.
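The sketch below illustrates, in simplified form, how the scored time differs between the two rulesets. It deliberately ignores details such as the exact number of tuning trials, repeated studies, and how failed runs are aggregated; the competition rules are authoritative.

```python
import numpy as np

def external_tuning_time(trial_times_secs):
    """Simplified external-tuning score for one workload: several
    hyperparameter settings run in parallel, and training stops once the
    fastest one reaches the target, so the scored time is the minimum
    over trials (np.inf for trials that never reached the target)."""
    return float(np.min(trial_times_secs))

def self_tuning_time(run_time_secs):
    """Simplified self-tuning score: all workload-specific tuning happens
    inside a single run, so the scored time is simply that run's
    time-to-target (np.inf if the target was never reached)."""
    return float(run_time_secs)

# Hypothetical example: five parallel tuning trials vs. one self-tuned run.
print(external_tuning_time([np.inf, 9100.0, 7600.0, np.inf, 8300.0]))  # 7600.0
print(self_tuning_time(11800.0))                                        # 11800.0
```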
Standardized Benchmarking System
The AlgoPerf: Training Algorithms benchmark uses fixed hardware to ensure that submitters innovate on the training-algorithm part of the ML pipeline. For this iteration of the benchmark, all results were gathered on a standardized benchmarking system with 8x NVIDIA V100 GPUs (16 GB of VRAM each), 240 GB of RAM, and 2 TB of storage. Standardizing the hardware and software stack allows a fair comparison of wall-clock training times across all submissions. All results were computed by the competition organizers.
Submission Information
Each row in the results table describes one submission by a submission team. Teams may enter multiple submissions, provided they are sufficiently different. All results use the same software and hardware system and differ only in the training algorithm.
Rules
For the rules of the AlgoPerf: Training Algorithms competition, see our competition rules as well as the technical documentation and FAQs of the benchmark.
Submissions
All submissions and training logs are available as open source under the Apache 2.0 license here.
Reference Implementations
The benchmark codebase can be found here. This includes baseline submissions such as the prize-qualification baseline.
Results Usage Guidelines
If you use or refer to AlgoPerf results, you must follow the results guidelines. MLCommons reserves the right to determine appropriate uses of its trademark.