Benchmark Suite Results
MLPerf Training: HPC
The MLPerf Training: HPC benchmark suite measures how fast systems can train models to a target quality metric. Current and previous results can be reviewed through the dashboard below.
The MLPerf HPC benchmark paper provides a detailed description of the motivation and guiding principles behind the MLPerf Training: HPC benchmark suite.
Results
MLCommons results are shown in an interactive table so you can explore them. You can apply filters to see just the information you want and click across the top tabs to view the results visually. To see all result details, expand the columns by clicking the “+” icon that appears when you hover over “System Name” and subsequent columns.
Benchmarks
Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite. The rules remain the official source of truth.
| Area | Benchmark | Dataset | Quality Target | Reference Implementation Model | Latest Version Available |
|---|---|---|---|---|---|
| Scientific | Climate segmentation | CAM5+TECA simulation | IOU 0.82 | DeepCAM | v3.0 |
| Scientific | Cosmology parameter prediction | CosmoFlow N-body simulation | Mean average error 0.124 | CosmoFlow | v3.0 |
| Scientific | Quantum molecular modeling | Open Catalyst 2020 (OC20) | Forces mean absolute error 0.036 | DimeNet++ | v3.0 |
| Scientific | Protein folding | OpenProteinSet and Protein Data Bank (May 2022 snapshot) | Local Distance Difference Test (lDDT) 0.8 | AlphaFold2 (PyTorch) | v3.0 |
Metrics
MLPerf Training: HPC defines two performance metrics, strong scaling and weak scaling, which are applied across the benchmark applications listed above.
The strong scaling metric measures the wall-clock time required to train a model on the specified dataset to the specified quality target. To account for the substantial variance in ML training times, final results are obtained by running the benchmark a benchmark-specific number of times, discarding the lowest and highest results, and averaging the remaining results.
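As an illustration only, the aggregation described above could be sketched as follows; the function name, the number of runs, and the sample times are hypothetical, and the MLPerf rules remain the authoritative definition of the procedure.

```python
def strong_scaling_score(run_times_minutes):
    """Aggregate repeated strong-scaling runs (a sketch, not the official scorer):
    discard the lowest and highest time-to-train, then average the rest."""
    if len(run_times_minutes) < 3:
        raise ValueError("need at least three runs to discard the extremes")
    trimmed = sorted(run_times_minutes)[1:-1]  # drop the fastest and slowest run
    return sum(trimmed) / len(trimmed)

# Hypothetical example: five wall-clock times (minutes) to reach the quality target
print(strong_scaling_score([41.2, 39.8, 44.5, 40.6, 42.0]))  # averages the middle three
```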
The weak scaling metric measures the throughput of a supercomputing system training multiple models concurrently on the specified dataset to the specified quality target. Submitters choose the number of concurrently trained model instances so as to fill their system. To reduce the impact of variability on the measured throughput, submitters may prune a chosen number of model instances. The time to train all remaining models is then used to compute the aggregate throughput in models per minute.
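A minimal sketch of the weak-scaling throughput computation is shown below, assuming that pruning removes the slowest instances and that the measurement ends when the last remaining instance reaches the target quality; the function name and sample times are hypothetical.

```python
def weak_scaling_throughput(instance_times_minutes, pruned=0):
    """Aggregate throughput in models per minute (a sketch, not the official scorer)."""
    kept = sorted(instance_times_minutes)
    if pruned:
        kept = kept[:-pruned]          # assumption: pruning discards the slowest instances
    time_to_train_all = max(kept)      # run ends when the last kept instance finishes
    return len(kept) / time_to_train_all

# Hypothetical example: 8 concurrent instances, the single slowest one pruned
print(weak_scaling_throughput([95, 98, 101, 97, 99, 103, 100, 130], pruned=1))
# -> 7 models / 103 minutes, roughly 0.068 models per minute
```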
Divisions
MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. There are two Divisions that allow different levels of flexibility during reimplementation:
- The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model and optimizer as the reference implementation.
- The Open division is intended to foster faster models and optimizers and allows any ML approach that can reach the target quality.
Availability
MLPerf divides benchmark results into Categories based on availability.
- Available systems contain only components that are available for purchase or for rent in the cloud.
- Preview systems must be submittable as Available in the next submission round. Any Preview system that is not submitted as Available in the next round will be marked Invalid.
- Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.
Submission Information
Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:
Open Division
Each Open division row adds the following information:
Model Used
The model used to produce the results, which may or may not match the Closed Division requirement.
Notes
Arbitrary notes from the submitter.
Each weak-scaling result, in both the Closed and Open divisions, additionally includes the following information:
Instances
The number of model instances concurrently training on the system.
Instance scale
The number of processors or accelerators per instance.
Time
Time-to-train all instances to the target quality.
Throughput
The total throughput of the system, as measured in models trained per minute. Computed as (# of instances / time).
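For example, a hypothetical system training 8 model instances concurrently, with the last instance reaching the target quality after 100 minutes, would report a throughput of 8 / 100 = 0.08 models per minute.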