Benchmark MLPerf Inference: Tiny | MLCommons V1.1 Results

Overview

The MLPerf Inference: Tiny benchmark suite measures how fast systems can process inputs and produce results using a trained model. Below is a short summary of the current benchmarks and metrics. Please see the MLPerf Inference benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite.

Results

MLCommons results are shown in an interactive table to enable you to explore the results. You can apply filters to see just the information you want and click across the top tabs to view the results visually. To see all result details, expand the columns by clicking on the “+” icon, which appears when you hover over “System Name” and subsequent columns.

Scenarios and Metrics

To enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described below. Each scenario is evaluated by a standard load generator (LoadGen) that issues inference requests to the system under test (SUT) in a particular pattern and measures a specific metric.

Scenario	Query Generation	Duration	Samples/query	Latency Constraint	Tail Latency	Performance Metric
Single stream	LoadGen sends next query as soon as SUT completes the previous query	1024 queries and 60 seconds	1	None	90%	90%-ile measured latency
Multiple stream (1.1 and earlier)	LoadGen sends a new query at each latency-constraint interval if the SUT has completed the prior query; otherwise the new query is dropped and counted as one overtime query	270,336 queries and 60 seconds	Variable, see metric	Benchmark specific	99%	Maximum number of inferences per query supported
Multiple stream (2.0 and later)	LoadGen sends next query as soon as SUT completes the previous query	270,336 queries and 600 seconds	8	None	99%	99%-ile measured latency
Server	LoadGen sends new queries to the SUT according to a Poisson distribution	270,336 queries and 60 seconds	1	Benchmark specific	99%	Maximum Poisson throughput parameter supported
Offline	LoadGen sends all queries to the SUT at start	1 query and 60 seconds	At least 24,576	None	N/A	Measured throughput
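
As an illustration of the two metrics most relevant here, the sketch below computes a 90th-percentile latency (the Single stream metric) and a samples-per-second throughput (the Offline metric) from recorded timings. It is a minimal sketch, not LoadGen itself: the function names are hypothetical, the latency values are made up, and the nearest-rank percentile definition is an assumption, so LoadGen's exact calculation may differ.

```python
import math


def percentile_latency(latencies_s, percentile=90.0):
    """Return the nearest-rank percentile of per-query latencies (seconds).

    Assumption: nearest-rank definition, i.e. the smallest recorded latency
    such that at least `percentile` percent of samples are at or below it.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    if not 0.0 < percentile <= 100.0:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


def offline_throughput(num_samples, elapsed_s):
    """Return measured throughput in samples per second."""
    if elapsed_s <= 0.0:
        raise ValueError("elapsed time must be positive")
    return num_samples / elapsed_s


# Made-up latencies: ten queries taking 10 ms, 11 ms, ..., 19 ms.
example_latencies = [0.010 + 0.001 * i for i in range(10)]
p90 = percentile_latency(example_latencies, 90.0)
throughput = offline_throughput(24576, 60.0)
```

With these made-up inputs, nine of the ten latencies are at or below the reported value, which is what a 90% tail-latency figure expresses.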

Benchmarks

All MLPerf Tiny benchmarks are single stream, meaning they measure the latency of a single inference. The benchmarks also measure the model quality, which is either accuracy or AUC depending on the benchmark. MLPerf Tiny also enables optional energy benchmarking.

Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth):

Task	Dataset	Model	Mode	Quality	Latest Version Available
Keyword Spotting	Google Speech Commands	DS-CNN	Single-stream, Offline	90% (Top 1)	v1.1
Visual Wake Words	Visual Wake Words Dataset	MobileNetV1 0.25x	Single-stream	80% (Top 1)	v1.1
Image classification	CIFAR10	ResNet-8	Single-stream	85% (Top 1)	v1.1
Anomaly Detection	ToyADMOS	Deep AutoEncoder	Single-stream	0.85 (AUC)	v1.1
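
The quality column above can be read as a pass/fail threshold on a measured accuracy or AUC. The following sketch shows one way to compute both quantities and compare them with the targets in the table. It is illustrative only: the function and dictionary names are hypothetical, the scores and labels are made up, and treating the target as an inclusive lower bound is an assumption (the rules are authoritative on how quality is evaluated).

```python
# Quality targets copied from the table above, expressed as fractions.
QUALITY_TARGETS = {
    "Keyword Spotting": 0.90,      # Top-1 accuracy
    "Visual Wake Words": 0.80,     # Top-1 accuracy
    "Image classification": 0.85,  # Top-1 accuracy
    "Anomaly Detection": 0.85,     # AUC
}


def top1_accuracy(predicted, expected):
    """Fraction of samples whose predicted label equals the expected label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1) gets a
    higher score than a randomly chosen negative (label 0); ties count half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def meets_target(task, measured):
    """Assumption: a result passes when it is at or above the target."""
    return measured >= QUALITY_TARGETS[task]


# Made-up anomaly scores (higher means more anomalous) and true labels.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
passes = meets_target("Anomaly Detection", auc)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based formulation would be the usual choice for larger evaluation sets.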

Divisions

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. There are two Divisions that allow different levels of flexibility during reimplementation. The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation. The Open division is intended to foster innovation and allows using a different model or retraining.

Availability

MLPerf divides benchmark results into Categories based on availability.

Available systems contain only components that are available for purchase or for rent in the cloud.
Preview systems must be submittable as Available in the next submission round.
Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.

Submission Information

Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:

Submitter

The organization that submitted the results.

Software

The ML framework and primary ML hardware library used.

System

General system description.

Benchmark Results

Results for each benchmark as described above.

Processor and Count

The type and number of CPUs used, if CPUs perform the majority of ML compute.

Details

Link to metadata for submission.

Accelerator and Count

The type and number of accelerators used, if accelerators perform the majority of ML compute.

Code

Link to code for submission.

Each Open division row may add the following information:

Model Used

The model used to produce the results, which may or may not match the Closed division requirement.

Accuracy/AUC

The accuracy/AUC measured on the device. This does not have to meet the Closed division quality threshold.

Rules

MLPerf Tiny rules are available here.

Reference Implementations

Reference implementations for the benchmarks are here.

Results Usage Guidelines

MLPerf™ is a trademark of MLCommons®. If you use it and refer to MLPerf results, you must follow the results guidelines. MLCommons reserves the right to determine appropriate uses of its trademark.
