Benchmark Suite Results
MLPerf Inference: Mobile
The MLPerf Inference: Mobile benchmark suite measures how fast systems can process inputs and produce results using a trained model. Below is a short summary of the current benchmarks and metrics. Please see the MLPerf Mobile Inference benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite.
Results
MLCommons presents results in an interactive table. You can apply filters to see just the information you want and click the tabs across the top to view the results visually. To see all result details, expand the columns by clicking the “+” icon, which appears when you hover over “System Name” and subsequent columns.
Scenarios & Metrics
To enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described below. A standard load generator (LoadGen) evaluates each scenario by generating inference requests in a particular pattern and measuring a specific metric.
Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
---|---|---|---|---|---|---|
Single stream | LoadGen sends next query as soon as SUT completes the previous query | 1,024 queries and 60 seconds | 1 | None | 90% | 90%-ile measured latency |
Multiple stream (1.1 and earlier) | LoadGen sends a new query every latency constraint if the SUT has completed the prior query, otherwise the new query is dropped and is counted as one overtime query | 270,336 queries and 60 seconds | Variable, see metric | Benchmark specific | 99% | Maximum number of inferences per query supported |
Multiple stream (2.0 and later) | LoadGen sends next query as soon as SUT completes the previous query | 270,336 queries and 600 seconds | 8 | None | 99% | 99%-ile measured latency |
Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60 seconds | 1 | Benchmark specific | 99% | Maximum Poisson throughput parameter supported |
Offline | LoadGen sends all queries to the SUT at start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |
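As a rough illustration of the two metrics that apply to the Mobile suite, the sketch below computes a 90th-percentile latency (Single stream) and a samples-per-second throughput (Offline) from recorded timings. The function names and the sample timings are hypothetical, and the nearest-rank percentile definition is an assumption; LoadGen itself may compute percentiles differently.

```python
import math


def percentile_latency(latencies_s, pct):
    """Nearest-rank percentile: the smallest recorded latency such that
    at least pct percent of the samples are less than or equal to it."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def offline_throughput(num_samples, elapsed_s):
    """Samples processed per second over the whole run."""
    return num_samples / elapsed_s


# Hypothetical per-query latencies in seconds (not real results).
latencies = [0.010, 0.012, 0.011, 0.030, 0.013,
             0.012, 0.011, 0.014, 0.012, 0.050]

p90 = percentile_latency(latencies, 90)

# Hypothetical Offline run: 24,576 samples finished in exactly 60 seconds.
samples_per_second = offline_throughput(24_576, 60.0)
```

With these made-up timings, the 90th-percentile latency is 0.030 s: nine of the ten queries finished at or below that value, so the single 0.050 s outlier does not affect the reported metric.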
Benchmarks
Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth):
Area | Task | Model | Dataset | Mode | Quality | Latest Available Version |
---|---|---|---|---|---|---|
Generative AI | Text to image | Stable Diffusion 1.5 | MS-COCO 2014 captions | Single-stream | Text to image CLIP score, NIMA IQA-A | v4.1 |
Vision | Object detection | MobileDETs | MS-COCO 2017 | Single-stream | 95% of FP32 (mAP: 0.285) | v4.1 |
Vision | Segmentation | MOSAIC | ADE20K (32 classes, 512×512) | Single-stream | 96% of FP32 (32-class mIOU: 59.8) | v4.1 |
Language | Language Processing | Mobile-BERT | SQuAD 1.1 | Single-stream | 93% of FP32 (F1 score: 90.5) | v4.1 |
Image Processing | Super Resolution | EDSR F32B5 | OpenSR | Single-stream | 33 dB PSNR (peak signal to noise ratio) | v4.1 |
Vision | Image Classification | MobileNetV4 | ImageNet | Single-stream & Offline | 81% Top-1 (about 98% of FP32 Top-1: 82.68%) | v4.1 |
Vision | Image Classification | MobileNetEdgeTPU | ImageNet | Single-stream & Offline | 74.66% Top-1 (about 98% of FP32 Top-1: 76.19%) | v4.0 |
Vision | Segmentation | DeepLabV3+ (MobileNetV2) | ADE20K (32 classes, 512×512) | Single-stream | 97% of FP32 (32-class mIOU: 54.8) | v2.1 |
Vision | Object detection | SSD-MobileNetV2 | MS-COCO 2017 | Single-stream | 93% of FP32 (mAP: 0.244) | v0.7 |
Each Mobile benchmark requires the Single-stream scenario. The Image Classification benchmark permits an optional Offline scenario.
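To show how a relative quality target in the table translates into an absolute threshold, here is a minimal sketch. The helper names are hypothetical and the measured score is made up for illustration; the rules remain the authority on how accuracy is actually evaluated.

```python
def quality_threshold(fp32_reference, fraction):
    """Absolute score a submission must reach, given the FP32 reference
    score and the required fraction of it (e.g. 0.95 for 95%)."""
    return fp32_reference * fraction


def meets_target(measured, fp32_reference, fraction):
    """True if the measured score reaches the required fraction of FP32."""
    return measured >= quality_threshold(fp32_reference, fraction)


# From the table: MobileDETs requires 95% of the FP32 mAP of 0.285.
mobiledets_min_map = quality_threshold(0.285, 0.95)

# From the table: Mobile-BERT requires 93% of the FP32 F1 score of 90.5.
mobilebert_min_f1 = quality_threshold(90.5, 0.93)

# Hypothetical measured mAP for a quantized MobileDETs model.
passes = meets_target(0.275, 0.285, 0.95)
```

Under these assumptions, the MobileDETs threshold works out to an mAP of roughly 0.271, so the hypothetical measured value of 0.275 would pass.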
Divisions
MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. There are two Divisions that allow different levels of flexibility during reimplementation:
- The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation.
- The Open division is intended to foster innovation and allows using a different model or retraining.
Availability
MLPerf divides benchmark results into Categories based on availability.
- Available systems contain only components that are available for purchase or for rent in the cloud.
- Preview systems must be submittable as Available in the next submission round.
- Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.
Submission Information
Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:
Open Divisions
You may add the following rows:
- Model Used: The model used to produce the results, which may or may not match the Closed Division requirement.
- Notes: Arbitrary notes from submitter.
Power Measurements
Each row will add columns for each benchmark containing the following:
- System Power for Server and Offline scenarios, or…
- Energy Per Stream for Single stream and Multiple stream scenarios