About MLPerf Training
Text-to-image models create images from text prompts. One widely used technique is diffusion, where models learn to iteratively transform random noise into complete images. Since Stable Diffusion v2 (SDv2) was added to the MLCommons MLPerf Training benchmark suite in October 2023, the importance of these models has increased dramatically.
Model sizes and architectures have also evolved since then, with parameter counts growing from ~800M to 3.5B in Stable Diffusion XL. Diffusion Transformer (DiT) models successfully integrated transformer architecture into the diffusion process. Leading models, including Stable Diffusion 3.5, have adopted this innovation, marking a shift the SDv2 benchmark no longer reflects.
Recognizing this gap, the MLPerf Training Working Group formed a task force to evaluate a refresh of the text-to-image benchmark. MLPerf Training v5.1 introduces a new text-to-image benchmark based on Black Forest Labs’ Flux.1, an 11.9B-parameter transformer-based model that reflects the current state of generative AI.
Model Selection
The task force evaluated candidates on four primary criteria: image quality, architecture, size, and availability.
- Image Quality: SDv2 can no longer compete with more modern models such as Imagen3, Flux.1, and SD3.5 in image quality and prompt adherence.
- Architecture: A representative benchmark must mirror the architectural approaches used in state-of-the-art models and represent real-world scenarios. For text-to-image generation, transformer-based architectures have become dominant, making this a critical requirement.
- Size: Modern text-to-image models have parameter counts in the billions, while SDv2 has fewer than 900M parameters. These larger parameter counts introduce new memory and computation challenges, such as different parallelization requirements, that smaller models don’t face.
- Availability: MLPerf Training requires open-source architectures, significantly limiting the possible choices.
Black Forest Labs’ Flux.1 met all four criteria. This transformer-based latent diffusion model has 11.9B parameters and a fully open-source license. It can compete with closed-source models across multiple quality benchmarks, making it a strong representative of current text-to-image generation approaches.
Dataset Selection
MLPerf Training requires publicly available, open-source datasets, a constraint that presents some challenges since even open-sourced models are typically trained on closed, proprietary data. The SDv2 benchmark used a subset of the LAION dataset for training, making it a natural early candidate for this new benchmark as well. The task force also evaluated CC12M, another popular open-source dataset of captioned images. For validation, both benchmarks use a subset of COCO-2014, which remains the standard for evaluating text-to-image models.
The two training datasets differ significantly in composition. The LAION subset contains 6.5M samples drawn from a 400M-sample collection, prioritizing web-scale coverage. CC12M provides nearly 12M samples but emphasizes clean, high-precision captions over raw scale.
To make an evidence-based decision, the task force trained an initial Flux.1 implementation on both datasets for short and long training periods, then evaluated them on COCO-2014 using FID and CLIP as metrics. The number of training steps was chosen as an approximate upper bound of the desired training duration for the benchmark. As shown in the charts below, CC12M consistently outperformed the LAION subset across both training durations, making it the clear choice for the benchmark.


Further experimentation revealed that, for the chosen benchmarking region (detailed in the following section), a 90% reduction in dataset size had minimal impact on image quality. As such, the final training dataset consists of 1,099,776 samples from CC12M.
Implementation Details
Flux.1 is a latent diffusion model built on DiT blocks, with architectural details and training procedures following the report: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. The model uses multimodal transformer blocks to concurrently process text and image data. From a modelling perspective, a key architectural difference from SDv2 is the shift to rectified flows, which encourage linear denoising trajectories for more efficient decoding.
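As a brief sketch of that objective (the notation here is ours, following the cited report rather than the reference code): given a clean latent $x_0$, Gaussian noise $\varepsilon$, and conditioning $c$, the model is trained on points along the straight line between data and noise and regresses the constant velocity pointing from one to the other:

$$x_t = (1 - t)\,x_0 + t\,\varepsilon, \qquad t \sim \mathcal{U}(0, 1)$$
$$\mathcal{L} = \mathbb{E}\big[\, \lVert v_\theta(x_t, t, c) - (\varepsilon - x_0) \rVert_2^2 \,\big]$$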
The reference code is implemented with torchtitan, an open-source framework for large-scale AI training built on native PyTorch. Its focus on clean, minimal implementations enabled fast prototyping and development in collaboration with Meta’s torchtitan team. Because it builds directly on native PyTorch, the code is straightforward to read and understand. The implementation has been tested on NVIDIA B200 GPUs.
Since the original Flux.1 training code is not open-source, the reference builds on the torchtitan implementation of latent diffusion. The training procedure works as follows:
- Given an image-text pair, encode the text using T5-XXL and CLIP-ViT-H encoders, and encode the image using a VAE (Variational Autoencoder).
- Sample Gaussian noise (φ) matching the shape of the image encoding (i). Select a random point along the line between the image encoding and this noise to obtain a noisy latent. The difference (φ – i) becomes the ground truth.
- Feed the noisy latent and text encodings to the transformer model, which tries to regress the ground truth using mean squared error loss.
Intuitively, the network learns to identify the noise that should be removed to obtain a clean latent. During inference, this process is applied iteratively, starting from Gaussian noise, producing a latent that the VAE decodes into an image.
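A minimal sketch of that training step in PyTorch is shown below. The model and encoder interfaces are placeholders for illustration, not the reference implementation’s actual API.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, latent, t5_emb, clip_emb):
    """One illustrative training step on pre-computed encodings.

    `model` stands in for the Flux.1 transformer; its call signature here is
    hypothetical, not the reference code's interface.
    """
    noise = torch.randn_like(latent)                        # Gaussian noise matching the image encoding
    t = torch.rand(latent.shape[0], device=latent.device)   # random point on the data-noise line
    t_ = t.view(-1, 1, 1, 1)
    noisy_latent = (1.0 - t_) * latent + t_ * noise         # linear interpolation between latent and noise

    target = noise - latent                                  # ground truth: noise minus image encoding
    pred = model(noisy_latent, t, t5_emb, clip_emb)          # transformer regresses the velocity
    return F.mse_loss(pred, target)                          # mean squared error loss
```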
As with the previous text-to-image benchmark, the encoders are frozen during training. This allows all encodings for the dataset to be pre-computed once and made available to submitters, eliminating redundant computation during benchmark runs.
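Conceptually, the pre-computation can look like the sketch below; the encoder objects and their call signatures are hypothetical stand-ins, not the reference implementation’s API.

```python
import torch

@torch.no_grad()  # the encoders are frozen, so no gradients are needed
def precompute_encodings(batch, t5_encoder, clip_encoder, vae):
    """Run the frozen encoders once per sample and return cacheable tensors."""
    t5_emb = t5_encoder(batch["caption"])      # T5-XXL text embedding
    clip_emb = clip_encoder(batch["caption"])  # CLIP text embedding
    latent = vae.encode(batch["image"])        # VAE image latent
    return {"t5": t5_emb.cpu(), "clip": clip_emb.cpu(), "latent": latent.cpu()}

# The cached tensors can be saved to disk (e.g. with torch.save, one file per shard)
# so that training runs load encodings directly instead of re-running the encoders.
```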
Evaluation Approach
A major difference from the SDv2 benchmark is the evaluation metric, a change that significantly streamlines the benchmarking process.
The previous benchmark used FID and CLIP scores, which remain industry-standard metrics for text-to-image models. However, these metrics require generating complete images and running them through separate evaluation models, which makes them time-consuming and compute-intensive during a training submission.
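For context, a generic sketch of that style of evaluation using the torchmetrics library is shown below; it is not the benchmark’s actual evaluation stack, and the metric configuration (feature size, CLIP checkpoint) is an illustrative choice.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)                         # Inception features for FID
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")  # illustrative CLIP checkpoint

def score_generated_images(real_images, generated_images, captions):
    """real_images / generated_images: uint8 tensors of shape (N, 3, H, W); captions: list of N strings."""
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    clip.update(generated_images, captions)
    return fid.compute(), clip.compute()
```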
Following the Scaling Rectified Flow Transformers for High-Resolution Image Synthesis research, the task force found that validation loss computed over equally-sampled noise levels correlates highly with FID and CLIP scores. This correlation allows submitters to leverage a much simpler metric that requires only a single forward pass per sample, rather than full image generation. This alternative metric significantly speeds up evaluation while maintaining meaningful quality assessment.
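A rough sketch of this loss-based metric is below, reading “equally-sampled noise levels” as a fixed grid of equally spaced timesteps; the model interface and batch layout are illustrative assumptions, not the reference code’s API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_loss(model, val_batches, num_levels=8):
    """Average the rectified-flow loss over equally spaced noise levels in (0, 1)."""
    timesteps = torch.linspace(0.0, 1.0, num_levels + 2)[1:-1].tolist()  # drop the trivial endpoints 0 and 1
    total, count = 0.0, 0
    for latent, t5_emb, clip_emb in val_batches:
        noise = torch.randn_like(latent)
        for t in timesteps:
            noisy = (1.0 - t) * latent + t * noise                       # same interpolation as in training
            t_batch = torch.full((latent.shape[0],), t, device=latent.device)
            pred = model(noisy, t_batch, t5_emb, clip_emb)               # one forward pass per (batch, level)
            total += F.mse_loss(pred, noise - latent).item()
            count += 1
    return total / count
```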
Benchmark Parameters
A full training run of Flux.1 would take prohibitively long for benchmarking purposes, so the task force selected a suitable benchmarking region. Training the model from scratch until a target validation loss of 0.586 provided a good balance between run-to-run variance and room for scaling compute. Using 64 NVIDIA B200 GPUs with torch.compile and bf16 training, the reference implementation completes a run in approximately 95 minutes, a baseline submitters are expected to improve upon significantly. Due to run-to-run variance, each submission requires 10 runs.
Conclusion
The Flux.1 benchmark brings MLPerf Training’s text-to-image evaluation in line with current generative AI practices. With more than 13x the parameters of SDv2 and a transformer-based architecture, it better reflects the current landscape of text-to-image models.
This update ensures MLPerf Training continues to provide relevant performance insights as text-to-image generation rapidly evolves, offering submitters a benchmark that reflects the systems and techniques used in production environments today.
Our submission rules and reference code are available on GitHub.
About MLCommons
MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.
For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email [email protected].