Introduction

The MLCommonsⓇ MLPerfⓇ Inference benchmark suite is an industry standard for measuring the performance of machine learning (ML) and artificial intelligence (AI) workloads from diverse domains including vision, speech, and natural language processing. For each of these domains, the suite includes a carefully selected set of workloads that represents the state of the art in the industry across different application segments. These benchmarks not only provide key information to consumers for making application deployment and budget decisions but also enable vendors to deliver critical workload optimizations within certain practical constraints to their customers.

Over the past year, we have seen rapid advancements in the capabilities of video generative models like OpenAI's Sora 2. Gone are the days of hobbyists generating uncanny clips of Will Smith eating spaghetti; now professional artists create entire workflows around the capabilities of these models. As they transition from mere curiosities to core parts of creative workflows, the need for a standardized benchmark has become clear. And thus the MLPerf Text-to-Video Task Force was convened to incorporate a dedicated video generation benchmark into the MLPerf suite.

Model selection

For this benchmark we picked the Wan2.2-T2V-A14B-Diffusers model (released July 2025) by Alibaba, as it was one of the best open-weights models on the Text-to-Video leaderboard at the time. The model has been fully open sourced under an Apache 2.0 license and can be run via Hugging Face Diffusers.

The Wan2.2 model is run as a pipeline of 3 models:

  1. The UMT5 XXL Text Encoder from Google, used to encode prompts
  2. The Wan2.2 A14B Diffusion Transformer, used to generate a latent video representation
  3. The Wan2.2 VAE Decoder, used to decode the latent video into a series of frames.

The crucial architectural feature of the Wan2.2-T2V-A14B-Diffusers model is that it is a type of Mixture of Experts (MoE) model. However, unlike the standard MoE architecture, there is no gating network to route tokens to experts. Instead, the model consists of 2 experts that are activated sequentially during the denoising process. The first, known as the “High Noise Expert”, is active during the early stages of denoising; the model then switches to the “Low Noise Expert”, which completes the denoising process.
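
The switching logic can be sketched in a few lines. This is an illustrative sketch only: the function name, the boundary value, and the timestep convention are assumptions for exposition, not the actual Diffusers API for Wan2.2.

```python
# Illustrative sketch of Wan2.2's sequential expert switching. Timesteps are
# taken to run from 1.0 (pure noise) down to 0.0 (clean latent); the boundary
# value below is a placeholder, not Wan2.2's real configuration.

def select_expert(timestep: float, boundary: float = 0.875) -> str:
    """Pick the expert for this denoising step: the High Noise Expert handles
    the early (noisy) steps, the Low Noise Expert finishes the job."""
    return "high_noise" if timestep >= boundary else "low_noise"

# Walk a toy 20-step schedule from high noise to low noise.
schedule = [1.0 - i / 20 for i in range(20)]
experts = [select_expert(t) for t in schedule]
```

Note that because the switch is a fixed function of the timestep, only one expert's weights are active at any point in the denoising loop, which is what makes the sequential design different from token-routed MoE.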

Counterintuitively, most video generation models do not generate videos frame by frame; instead, they generate the entire video at once by denoising a massive video latent. This video latent usually represents both a spatial and temporal segment of the video; for example, the Wan2.2 latent represents an area of 32×32 pixels across 4 frames. This means that generating a 5 second video at 720p resolution and 16fps would require a sequence length of 19,320, which leaves the model heavily compute bound.
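
The sequence-length figure follows directly from the latent geometry described above. The sketch below reproduces the article's arithmetic; the exact rounding Wan2.2 applies internally may differ, so treat the helper as a back-of-the-envelope calculation rather than the model's real tokenizer.

```python
import math

# Back-of-the-envelope sequence length for Wan2.2's video latent, using the
# figures from the text: each latent token covers a 32x32 pixel patch across
# 4 frames, and the diffusion model denoises the whole video at once.

def latent_sequence_length(width: int, height: int, frames: int,
                           patch: int = 32, temporal: int = 4) -> int:
    tokens_per_frame = math.ceil(width / patch) * math.ceil(height / patch)
    latent_frames = math.ceil(frames / temporal)
    return tokens_per_frame * latent_frames

# 5 seconds of 720p video at 16 fps is generated as 81 frames.
seq_len = latent_sequence_length(1280, 720, 81)  # 40 * 23 tokens * 21 latent frames
```

Attention cost grows quadratically with this sequence length, which is why the workload is dominated by compute rather than memory traffic.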

Performance metrics

One of the key difficulties we encountered when designing this benchmark was deciding which performance metrics to use. The text-to-video task is computationally expensive with long runtimes, with many systems taking multiple minutes per query.

One consequence of this performance profile is that we had to limit the length, resolution and number of videos generated to ensure that this benchmark remained feasible, whilst also demonstrating frontier capabilities. Therefore we made the following decisions.

  • Configuration: We limited the length of the generated videos to 5 seconds, whilst fixing the resolution at 720p. This means generating 81 frames at 720×1280 and 16fps.
  • Runtime Target: To ensure the benchmark remains accessible to a wide range of submitters, we reduced the dataset size to a practical subset (100 of the 248 samples for performance mode; all 248 are kept for accuracy mode).

Replacing Server with SingleStream

One of the key changes we’re introducing in this benchmark is the replacement of the Server scenario with a SingleStream scenario for latency measurement. We based this decision on the large amount of compute needed to generate a single video, meaning that videos often take multiple minutes to generate. 

This poses a problem with the Server scenario where the system is assumed to be able to operate in near real-time. In practice this would mean that the System Under Test would quickly become overloaded with requests, meaning that most requests would spend the majority of their time waiting to be processed instead of being processed. This in turn would mean that the final latency measurements wouldn’t accurately reflect the hardware’s performance.

To solve this problem, we replaced the Server scenario with a SingleStream one, in which we measure only the time taken to process each request, ignoring all wait times.
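
The essence of the scenario can be sketched as follows. This is a simplified stand-in for what MLPerf's LoadGen does in practice: the function and percentile choice below are illustrative, though MLPerf SingleStream does report a high-percentile (90th) latency.

```python
import time

# Minimal sketch of the SingleStream idea: queries are issued strictly one at
# a time, so each measurement captures only processing time, never queueing
# delay. generate_video is a stand-in for the System Under Test.

def run_single_stream(queries, generate_video):
    latencies = []
    for q in queries:
        start = time.perf_counter()       # timer starts when the query is issued
        generate_video(q)                 # the SUT processes exactly one query
        latencies.append(time.perf_counter() - start)
    return latencies

def p90(latencies):
    # SingleStream reports a high-percentile latency rather than the mean.
    ordered = sorted(latencies)
    return ordered[int(0.9 * (len(ordered) - 1))]
```

Because the next query is only issued once the previous one completes, an overloaded queue simply cannot form, and the reported number reflects the hardware's per-video generation time.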

Dataset, accuracy metric, and task selection

We selected VBench as the official dataset and accuracy framework for this benchmark. We reached this decision after a comparative analysis of available options, prioritizing licensing feasibility, robustness, and ease of adoption.

We evaluated a wide range of datasets at the beginning, including OpenVid-1M, VidGen-1M, WebVid-10M, and ActivityNet. A primary filter for selection was commercial viability, as MLPerf submissions often come from industry partners for commercial hardware validation.

  • VidGen-1M and WebVid-10M were disqualified due to restrictive licensing (e.g., Non-Commercial or Research-Only terms), which poses legal risks for benchmarking.
  • OpenVid-1M offered a permissive license (CC BY 4.0) but functioned solely as a dataset and lacked an integrated evaluation framework.

VBench distinguished itself by offering a holistic solution that combined a diverse prompt set with a pre-validated scoring suite. Unlike raw datasets that would require us to independently develop and validate separate accuracy metrics (such as FVD or IS), VBench provided:

  • Comprehensive Metrics: A suite of 16 distinct quality dimensions, including Subject Consistency, Motion Smoothness, and Aesthetic Quality.
  • Standardized Prompts: A curated list of ~950 prompts designed to stress-test specific generation capabilities.
  • Determinism: Experiments confirmed that VBench scores remained stable across different hardware backends (NVIDIA, AMD) with fixed seeds, a critical requirement for cross-vendor fairness.
  • Widespread Adoption: VBench is a widely used benchmark by video model builders and has been cited in several technical reports, including that of the Wan model family.

While VBench provided the most robust framework, its default configuration was computationally expensive, requiring over 80 hours for a full inference pass. To align with MLPerf’s accuracy check runtime, we adapted VBench by:

  • Subsetting the Metrics: We reduced the evaluation to the average of 6 key metrics: Subject Consistency, Background Consistency, Motion Smoothness, Dynamic Degree, Appearance Style, and Scene.
    This selection kept the most discriminatory dimensions, such as Dynamic Degree and Scene, while removing metrics that were redundant, static, or computationally trivial for datacenter-class hardware.
  • Reducing Dataset Size: By focusing on these 6 metrics, the dataset was reduced to a statistically significant subset (248 samples), striking the necessary balance between rigorous accuracy validation and manageable submission runtimes.
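
The adapted accuracy score described above reduces to a plain average over the 6 retained dimensions. The sketch below illustrates that aggregation; the metric key names and any weighting or normalization the official harness applies on top are assumptions here, and real scores come from the VBench evaluation suite itself.

```python
# Sketch of the adapted accuracy score: the final number is taken as the plain
# average of the 6 retained VBench dimensions. Values are made up for
# illustration; real per-metric scores come from VBench's evaluators.

METRICS = [
    "subject_consistency", "background_consistency", "motion_smoothness",
    "dynamic_degree", "appearance_style", "scene",
]

def vbench_score(per_metric: dict) -> float:
    missing = [m for m in METRICS if m not in per_metric]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    return sum(per_metric[m] for m in METRICS) / len(METRICS)
```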

VBench was chosen by the task force because it offered the only commercially viable, legally cleared, and methodologically complete framework that could be adapted to meet the rigorous runtime constraints of the MLPerf Inference benchmark.

Reference implementation

To ensure a fair and reproducible benchmark, the reference implementation for the Text-to-Video task is built on a standardized open-source foundation. 

Here is the reference setup:

  • Model Architecture: We utilize the Wan2.2-T2V-A14B-Diffusers model (hosted by Wan-AI). This is a 14-billion parameter Diffusion Transformer designed for high-quality video generation. 
  • Precision & Compute: The reference implementation runs in BF16 (BFloat16) precision. This choice reflects modern datacenter standards, balancing numerical stability with efficient memory usage.
    • Reference Accuracy Score: 70.48 (VBench).
    • Minimum Accuracy Threshold: 69.77 (99% of reference).
  • Generation Pipeline: The reference pipeline is adapted from the Hugging Face Diffusers library, ensuring broad compatibility and ease of use.
    • Input: Text prompt + Fixed Latent Tensor (to ensure deterministic outputs for debugging and verification).
    • Scheduler: Uses the UniPCMultistepScheduler (aligned with the Wan2.2 default) to optimize step efficiency.
    • Reduced diffusion steps: 20 denoising steps (per task force discussion).
    • Output: 720×1280 resolution at 16 frames per second, 81 frames per video.
  • Containerization: To simplify deployment, the entire reference stack, including Python dependencies, CUDA 12.1 libraries, and VBench evaluation tools, is provided as a Docker container. This allows submitters to “build and run” the benchmark with a single launch.sh script.
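
The accuracy numbers above imply a simple pass/fail gate: a submission's VBench score must reach at least 99% of the reference run's score. The sketch below illustrates that check; the function name is ours, not part of the MLPerf tooling.

```python
# Sketch of the 99% accuracy gate: a submission passes if its VBench score is
# at least 99% of the reference score (70.48, giving a threshold of ~69.77).

REFERENCE_SCORE = 70.48

def passes_accuracy(score: float, reference: float = REFERENCE_SCORE,
                    ratio: float = 0.99) -> bool:
    return score >= reference * ratio
```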

Conclusion

With the introduction of the Text-to-Video task in MLPerf Inference v6.0, we have taken a significant step toward covering generative video workloads. This benchmark provides the industry with a reliable, reproducible approach to measure the rapidly evolving capabilities of both hardware and software.

The architectural decisions behind this inaugural benchmark reflect the realities of video generation tasks today:

  • Model Selection: We selected Wan2.2-A14B-Diffusers, a powerful open-weights T2V model that represents the state-of-the-art in open-source generation.
  • Performance: We adopted SingleStream as the latency scenario while keeping the standard Offline scenario. High-fidelity video generation is currently a compute-bound, high-latency task, and this combination best fits its current use cases.
  • Accuracy: We integrated VBench as our evaluation framework, ensuring that performance optimizations do not come at the cost of visual fidelity, motion coherence, or prompt adherence.

This release represents a foundational baseline, but the field is moving fast. As generation latencies drop from minutes to seconds, we expect the benchmark to evolve to include a Server scenario reflecting real-time use cases.