Benchmark Suite Results

MLPerf Inference: Datacenter

The MLPerf Inference: Datacenter benchmark suite measures how fast systems can process inputs and produce results using a trained model. Below is a short summary of the current benchmarks and metrics. 

The MLPerf Inference benchmark paper provides a detailed description of the motivation and guiding principles behind the MLPerf Inference: Datacenter benchmark suite. 


Results

MLCommons results are shown in an interactive table that you can explore. You can apply filters to see just the information you want and click the tabs across the top to view the results visually. To see all result details, expand the columns by clicking on the “+” icon, which appears when you hover over “System Name” and subsequent columns.

Published results are sometimes modified or invalidated for various reasons. The change log contains information about changes made to any results after their initial publication. 

Scenarios & Metrics

To enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described below. A given scenario is evaluated by a standard load generator (LoadGen) that issues inference requests to the system under test (SUT) in a particular pattern and measures a specific metric.

| Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
|---|---|---|---|---|---|---|
| Single stream | LoadGen sends next query as soon as SUT completes the previous query | 1024 queries and 60 seconds | 1 | None | 90% | 90%-ile measured latency |
| Multiple stream (1.1 and earlier) | LoadGen sends a new query every latency constraint if the SUT has completed the prior query, otherwise the new query is dropped and is counted as one overtime query | 270,336 queries and 60 seconds | Variable, see metric | Benchmark specific | 99% | Maximum number of inferences per query supported |
| Multiple stream (2.0 and later) | LoadGen sends next query as soon as SUT completes the previous query | 270,336 queries and 600 seconds | 8 | None | 99% | 99%-ile measured latency |
| Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60 seconds | 1 | Benchmark specific | 99% | Maximum Poisson throughput parameter supported |
| Offline | LoadGen sends all queries to the SUT at start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |
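
To make the Single stream and Server scenarios more concrete, here is a minimal Python sketch. It is not LoadGen's actual implementation: the helper names (`percentile_latency`, `poisson_arrival_times`, `meets_latency_constraint`) are hypothetical, and the nearest-rank percentile rule is an assumption, so LoadGen's exact percentile and scheduling logic may differ.

```python
import math
import random


def percentile_latency(latencies, pct):
    """Latency at percentile `pct` (0 < pct <= 100), nearest-rank method (an assumption)."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def poisson_arrival_times(target_qps, num_queries, seed=0):
    """Arrival timestamps in seconds for a Poisson process with rate `target_qps`.

    A Poisson arrival process has exponentially distributed gaps between queries.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_queries):
        now += rng.expovariate(target_qps)
        times.append(now)
    return times


def meets_latency_constraint(latencies, pct, bound_seconds):
    """True if the `pct`-th percentile latency is within `bound_seconds`."""
    return percentile_latency(latencies, pct) <= bound_seconds


if __name__ == "__main__":
    # Single stream style metric: 90th percentile of measured latencies.
    measured = [0.004, 0.005, 0.005, 0.006, 0.007, 0.007, 0.008, 0.009, 0.010, 0.030]
    print("90%-ile latency (s):", percentile_latency(measured, 90))

    # Server style check: 99th percentile against a benchmark-specific bound.
    print("within 15 ms at 99%:", meets_latency_constraint(measured, 99, 0.015))
```

In this reading, the Server metric would be the largest `target_qps` for which the tail-latency check still passes; searching for that value is left out of the sketch.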

Benchmarks

Each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth): 

| Area | Task | Model | Dataset | QSL Size | Quality | Server latency constraint | Latest Version Available |
|---|---|---|---|---|---|---|---|
| Vision | Image classification | Resnet50-v1.5 | ImageNet (224×224) | 1024 | 99% of FP32 (76.46%) | 15 ms | v4.0 |
| Vision | Object detection | Retinanet | OpenImages (800×800) | 64 | 99% of FP32 (0.20 mAP) | 100 ms | v4.0 |
| Vision | Medical image segmentation | 3D UNET | KITS 2019 (602x512x512) | 16 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A | v4.1 |
| Language | LLM – Q&A | Llama 2 70B | OpenOrca | 24576 | ROUGE-1 = 44.4312; ROUGE-2 = 22.0352; ROUGE-L = 28.6162 | TTFT: 2s & TPOT: 200ms | v4.1 |
| Language | LLM – Summarization | GPT-J 6B | CNN-DailyMail News Text Summarization | 13368 | 99.9% or 99% of the original FP32 (ROUGE 1 – 42.9865, ROUGE 2 – 20.1235, ROUGE L – 29.9881) | 20 seconds | v4.1 |
| Language | LLM – Text generation (Question Answering, Math and Code Generation) | Mixtral 8x7B | OpenOrca, GSM8K, MBXP | 15000 | 99% or 99.9% of FP32 (ROUGE 1 – 45.4911, ROUGE 2 – 23.2829, ROUGE L – 30.3615, (gsm8k) Accuracy 73.78, (mbxp) Accuracy 60.12) | TTFT: 2s & TPOT: 200ms | v4.1 |
| Image | Image Generation | SDXL 1.0 | COCO-2014 | 5000 | FID ∈ (23.0108, 23.9501); CLIP ∈ (31.686, 31.813) | 20 seconds | v4.1 |
| Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms | v4.1 |
| Commerce | Recommendation | DLRM-DCNv2 | Criteo 4TB Multi-hot | 204800 | 99.9% or 99% of the original FP32 AUC metric (80.31%) | 60 ms | v4.1 |
| Speech | Speech-to-text | RNNT | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 – WER, where WER=7.452253714852645%) | 1000 ms | v4.0 |
| Commerce | Recommendation | DLRM | 1TB Click Logs | 204800 | 99% of FP32 and 99.9% of FP32 (AUC=80.25%) | 30 ms | v3.0 |
| Vision | Object detection (large) | SSD-ResNet34 | COCO (1200×1200) | 64 | 99% of FP32 (0.20 mAP) | 100 ms | v2.0 |
| Vision | Medical image segmentation | 3D UNET | BraTS 2019 (224x224x160) | 16 | 99% of FP32 and 99.9% of FP32 (0.85300 mean DICE score) | N/A | v1.1 |
| Vision | Image classification | MobileNet-v1 | ImageNet (224×224) | 1024 | 99% of FP32 (71.68%) | 10 ms | v0.5 |
| Vision | Object detection (small) | SSD-MobileNets-v1 | COCO (300×300) | 256 | 99% of FP32 (0.22 mAP) | 10 ms | v0.5 |
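
As a worked illustration of how a Quality entry such as "99% of FP32 (76.46%)" translates into a pass/fail threshold, here is a small Python sketch. The function names are hypothetical, the comparison assumes a higher-is-better metric, and the range check reads the parentheses in the SDXL entry as an open interval; the rules remain the authoritative definition.

```python
def quality_threshold(fp32_reference, fraction):
    """Minimum acceptable score: a fraction (e.g. 0.99 or 0.999) of the FP32 reference."""
    return fp32_reference * fraction


def passes_quality(measured, fp32_reference, fraction=0.99):
    """True if a higher-is-better metric reaches the required fraction of FP32."""
    return measured >= quality_threshold(fp32_reference, fraction)


def within_open_range(value, low, high):
    """True if low < value < high (assumes the interval in the table is open)."""
    return low < value < high


if __name__ == "__main__":
    # Resnet50-v1.5: 99% of the FP32 accuracy of 76.46%.
    print(round(quality_threshold(76.46, 0.99), 4))

    # BERT-large: the stricter 99.9% target on f1_score=90.874%.
    print(round(quality_threshold(90.874, 0.999), 6))

    # SDXL 1.0: FID must fall inside the listed range.
    print(within_open_range(23.5, 23.0108, 23.9501))
```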

Each Datacenter benchmark requires the following scenarios: 

| Area | Task | Required Scenarios |
|---|---|---|
| Vision | Image classification | Server, Offline |
| Vision | Object detection | Server, Offline |
| Vision | Medical image segmentation | Offline |
| Speech | Speech-to-text | Server, Offline |
| Language | Language processing | Server, Offline |
| Language | Summarization | Server, Offline |
| Language | Question Answering | Server, Offline |
| Commerce | Recommendation | Server, Offline |
| Image generation | Text-to-image | Server, Offline |
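
The required-scenario list above can be treated as a simple lookup. The following Python sketch is illustrative only: `REQUIRED_SCENARIOS` and `missing_scenarios` are hypothetical names, not part of any MLCommons tool, and the task keys are copied from the list above.

```python
# Required scenarios per task, as listed above for the Datacenter suite.
REQUIRED_SCENARIOS = {
    "Image classification": {"Server", "Offline"},
    "Object detection": {"Server", "Offline"},
    "Medical image segmentation": {"Offline"},
    "Speech-to-text": {"Server", "Offline"},
    "Language processing": {"Server", "Offline"},
    "Summarization": {"Server", "Offline"},
    "Question Answering": {"Server", "Offline"},
    "Recommendation": {"Server", "Offline"},
    "Text-to-image": {"Server", "Offline"},
}


def missing_scenarios(task, submitted):
    """Return the required scenarios for `task` that are absent from `submitted`."""
    return sorted(REQUIRED_SCENARIOS[task] - set(submitted))


if __name__ == "__main__":
    print(missing_scenarios("Image classification", ["Offline"]))
    print(missing_scenarios("Medical image segmentation", ["Offline"]))
```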

Divisions

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. There are two Divisions that allow different levels of flexibility during reimplementation:

  • The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation.
  • The Open division is intended to foster innovation and allows using a different model or retraining. 

Availability

MLPerf divides benchmark results into Categories based on availability. 

  • Available systems contain only components that are available for purchase or for rent in the cloud. 
  • Preview systems must be submittable as Available in the next submission round.
  • Research, Development, or Internal (RDI) systems contain experimental, in-development, or internal-use hardware or software.

Submission Information

Each row in the results table is a set of results produced by a single submitter using the same software stack and hardware platform. Each Closed division row contains the following information:

  • Submitter

    The organization that submitted the results.

  • Software

    The ML framework and primary ML hardware library used.

  • System

    General system description.

  • Benchmark Results

    Results for each benchmark as described above.

  • Processor and Count

    The type and number of CPUs used, if CPUs perform the majority of ML compute.

  • Details

    Link to metadata for submission.

  • Accelerator and Count

    The type and number of accelerators used, if accelerators perform the majority of ML compute.

  • Code

    Link to code for submission.

Open Divisions

Open division rows may add the following information:

  • Model Used

    The model used to produce the results, which may or may not match the Closed Division requirement.

  • Notes

    Arbitrary notes from submitter.

Power Measurements

Each row will add columns for each benchmark containing the following:

  • System Power

    For Server and Offline scenarios, or…

  • Energy Per Stream

    For Single stream and Multiple stream scenarios.

These metrics are computed using the measured average AC power (energy) consumed by the entire system for the duration of the performance measurements of a benchmark (e.g., a single network under a single scenario); the AC power is measured at the wall. 
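
The relationship between average power and energy described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample format (timestamp in seconds, power in watts), the trapezoidal integration, and the function names are hypothetical, and dividing total energy by the number of streams is an assumed reading of "Energy Per Stream", since the formula is not given here.

```python
def energy_and_average_power(samples):
    """Integrate (timestamp_s, watts) samples with the trapezoid rule.

    Returns (energy_joules, average_watts) over the sampled window.
    Samples must be sorted by strictly increasing timestamp.
    """
    if len(samples) < 2:
        raise ValueError("need at least two power samples")
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    duration = samples[-1][0] - samples[0][0]
    return energy, energy / duration


def energy_per_stream(energy_joules, num_streams):
    """Assumption: total energy divided by the number of streams processed."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    return energy_joules / num_streams


if __name__ == "__main__":
    wall_samples = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]
    energy, avg_watts = energy_and_average_power(wall_samples)
    print("energy (J):", energy, "average power (W):", avg_watts)
    print("energy per stream (J):", energy_per_stream(energy, 50))
```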

The measured power is only valid for the accompanying benchmark. MLPerf Power is only capable of measuring and validating the full system power. Any other references to power in any description (e.g., a TDP configuration, a power supply rating) are not measured or validated by MLCommons.