Benchmark Suite Results
MLPerf Storage
The MLPerf Storage benchmark suite measures how fast storage systems can supply training data to a model while it is being trained. Below is a short summary of the workloads and metrics from the latest round of benchmark submissions.
Results
MLCommons results are shown in an interactive table so that you can explore them. You can apply filters to see just the information you want, and click across the top tabs to view the results visually. To see all result details, expand the columns by clicking the “+” icon that appears when you hover over “System Name” and subsequent columns.
Workloads
Each workload supported by MLPerf Storage is defined by a corresponding MLPerf Training benchmark. The following table summarizes the workloads in this version of the benchmark (the rules remain the official source of truth):
| Area | Task | Model | Nominal Dataset | Latest Version Available |
| --- | --- | --- | --- | --- |
| Vision | Medical image segmentation | 3D U-Net | KiTS 2019 (602x512x512) | v1.0 |
| Vision | Image classification | ResNet-50 | ImageNet | v1.0 |
| Scientific | Cosmology parameter prediction | CosmoFlow | CosmoFlow N-body simulation | v1.0 |
| Language | Language processing | BERT-large | Wikipedia (2.5KB/sample) | v0.5 |
The dataset is referred to as a “nominal dataset” above because the MLPerf Storage benchmark simulates each of the named real datasets with a synthetically generated population of files whose size distribution matches that of the real dataset. The size of the dataset used in each benchmark submission is automatically scaled up to prevent significant caching of the dataset on the systems actually running the benchmark code.
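As an illustration of the “matching distribution” idea, the following minimal Python sketch draws per-file sizes from a distribution parameterized by a real dataset’s statistics. The normal distribution, the mean/stddev parameters, and the function name are illustrative assumptions; the actual benchmark derives its size distribution directly from the reference dataset.

```python
import numpy as np

def synthetic_file_sizes(num_files, mean_bytes, stddev_bytes, seed=0):
    """Sketch: generate per-file sizes whose distribution mimics a real dataset.

    A normal distribution with the real dataset's mean/stddev is used here as a
    stand-in; the real benchmark matches the reference dataset's actual size
    distribution rather than these hypothetical parameters.
    """
    rng = np.random.default_rng(seed)
    sizes = rng.normal(mean_bytes, stddev_bytes, size=num_files)
    # Guard against non-positive sizes produced by the tail of the distribution.
    return np.clip(sizes, 1, None).astype(np.int64)

# Example: 10,000 synthetic "Wikipedia-like" samples averaging 2.5 KB each.
sizes = synthetic_file_sizes(10_000, 2_500, 500)
print(len(sizes), int(sizes.mean()), int(sizes.sum()))
```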
Divisions
MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. There are two Divisions that allow different levels of flexibility during reimplementation:
- The Closed division is intended to allow comparisons between storage systems in an “apples-to-apples” fashion and requires using a fixed set of benchmark tunables and options when running the benchmark.
- The Open division is intended to foster innovation, to show how performance could be increased if some changes were made. As a result, it allows using different data storage formats, access methods, tunables, and options.
- See the rules for specifics on what can be changed in each Division.
Availability
MLPerf divides benchmark results into categories based on the availability of the storage solution:
- Available on premise – shows results for systems that are available in a customer datacenter.
- Available via the ALCF Discretionary Allocation Program – shows results for systems that are available in the Argonne National Laboratory Discretionary Allocation Program.
- Research, Development, or Internal (RDI) – contains experimental, in development, or internal-use hardware or software.
Submission Information
Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information for each workload submitted:
Throughput
This is the maximum performance the storage system was able to deliver while keeping all of the simulated accelerator(s) at 90% utilization or above (i.e., the accelerator(s) were idle, waiting for the storage system to deliver data, no more than 10% of the time). It is reported both as “samples/second”, a metric that should be intuitive to AI/ML practitioners, and as “MB/s”, a metric that should be intuitive to storage practitioners.
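The relationship between the utilization constraint and the two reported units can be sketched as below. The function names, the idle/busy accounting, and the example numbers are hypothetical; only the 90% threshold and the two units come from the description above.

```python
AU_THRESHOLD = 0.90  # accelerators must stay at or above 90% utilization

def accelerator_utilization(busy_seconds, idle_seconds):
    """Fraction of wall-clock time a simulated accelerator was not waiting on storage."""
    return busy_seconds / (busy_seconds + idle_seconds)

def throughput_metrics(samples_read, mean_sample_bytes, elapsed_seconds):
    """Report the same throughput as samples/second and as MB/s."""
    samples_per_second = samples_read / elapsed_seconds
    mb_per_second = samples_per_second * mean_sample_bytes / 1e6
    return samples_per_second, mb_per_second

# Illustrative run: 9 minutes busy, 30 seconds idle -> passes the 90% check.
run_ok = accelerator_utilization(busy_seconds=540.0, idle_seconds=30.0) >= AU_THRESHOLD
print(run_ok, throughput_metrics(samples_read=1_200_000,
                                 mean_sample_bytes=150_000,
                                 elapsed_seconds=570.0))
```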
Number of Simulated Accelerators
The number of simulated accelerators active during this test, i.e., how many accelerators of the given type this storage system can keep busy.
Dataset Size
Because the dataset used in this test was synthesized and must be sized to prevent significant caching of data on the compute node(s) running the benchmark, the size of the dataset used in this test is reported here.
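A minimal sketch of that sizing rule is shown below, assuming the dataset must simply exceed the aggregate memory of the compute node(s) by some margin. The cache_margin factor and the function names are illustrative assumptions, not the official scaling rule.

```python
def required_dataset_bytes(node_memory_bytes, num_nodes, cache_margin=5.0):
    """Sketch: the synthetic dataset must be large enough that it cannot sit
    in the page cache of the compute node(s) running the benchmark.

    cache_margin (how many times larger than aggregate memory the dataset must
    be) is an illustrative value, not the official factor.
    """
    return int(cache_margin * node_memory_bytes * num_nodes)

def required_num_samples(node_memory_bytes, num_nodes, mean_sample_bytes,
                         cache_margin=5.0):
    total = required_dataset_bytes(node_memory_bytes, num_nodes, cache_margin)
    return -(-total // mean_sample_bytes)  # ceiling division

# Example: two client nodes with 512 GiB each, 150 KB samples.
print(required_num_samples(512 * 2**30, 2, 150_000))
```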