Pretraining large language models (LLMs) requires massive computational resources. While this reflects the reality of AI training in the industry, it creates challenges for accessible benchmarking due to system size and training time requirements. To address this, MLPerf Training v5.1 introduces a benchmark based on Meta’s Llama 3.1 8B, replacing BERT with a modern model that can still run on single-node systems. 

The scale of this challenge is significant. The Llama 3.1 405B benchmark added in MLPerf Training v5.0, for example, required a minimum of 256 GPUs per submission. This mirrors leading AI development but creates barriers for organizations looking to benchmark their systems without massive GPU clusters. 

Since v0.7 in 2020, MLPerf Training has used BERT as one of its pretraining benchmarks. Its versatility allows it to run on a single node and scale to hundreds of GPUs. However, the BERT model, introduced by Google in 2018, is now significantly outdated and no longer reflects the architecture or training requirements of modern LLMs. Finding a suitable replacement required balancing accessibility with architectural relevance.

Model Selection

The MLPerf Training Working Group spun out a task force to find the best alternative. One of the primary requirements was a model small enough to run on a single node for easy execution.

The task force evaluated several candidates, and Llama 3.1 8B quickly became the front-runner. With Llama 3.1 405B already added to the Training suite in v5.0, the 8B variant allowed submitters to leverage the existing implementation while working at a much smaller scale. Its architecture also makes it a good proxy for evaluating the performance of even larger models.

Training Data 

Like the Llama 3.1 405B benchmark, the 8B benchmark uses the C4 (Colossal Clean Crawled Corpus) dataset. However, the 8B model's smaller parameter count and shorter training time require only a subset of the full dataset. This reduces the effort involved for submitters and keeps benchmark run times reasonable.

The benchmark uses the default C4 split between training and validation data. For training, the last 256 of the 1,024 c4-train.<x>-of-01024.json.gz shards (768 <= x <= 1023) are used, presented in a randomly shuffled order. For evaluation, a customized subset, c4-validation-91205-samples.en.json.gz, is used; it contains the first 91,205 samples from the unshuffled C4 validation set, the smallest number of samples needed to yield over 47,185,920 tokens. During each evaluation run, the first 1,024 sequences (roughly 8.4 million tokens) from this subset are evaluated. The training data is shuffled to introduce variability, while the validation data remains unshuffled to ensure consistent assessment across runs.
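To make the split concrete, the sketch below lays out which files each split draws from. The shuffling seed and variable names are illustrative, not taken from the reference implementation; only the file names and counts come from the benchmark description above.

```python
# Illustrative layout of the training and validation data for the Llama 3.1 8B
# benchmark. File names follow the public C4 naming scheme; the shuffling seed
# is a placeholder, not the one used by the reference code.
import random

# Training: the last 256 of the 1,024 English C4 training shards (768..1023),
# presented in a randomly shuffled order.
train_shards = [f"c4-train.{i:05d}-of-01024.json.gz" for i in range(768, 1024)]
random.Random(0).shuffle(train_shards)  # seed is illustrative

# Validation: a fixed, unshuffled subset hosted by MLPerf containing the first
# 91,205 validation samples; each evaluation uses only its first 1,024 sequences.
validation_file = "c4-validation-91205-samples.en.json.gz"
eval_sequences_per_run = 1024
```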

To simplify setup, MLPerf hosts this preprocessed dataset subset. Submitters can download it by following these instructions.  

Implementation Details

The reference code is implemented in the NVIDIA NeMo Framework, an open-source, scalable AI framework built for large language models. By offering modular components and pre-trained model recipes, NeMo enables developers to build performant models tailored to specific use cases. The reference code is functionally tested on NVIDIA B200 and AMD MI325X GPUs.
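For orientation, here is a minimal sketch of what launching a single-node Llama 3.1 8B pretraining run looks like with NeMo 2.0's recipe API. This is not the MLPerf reference code; the recipe name, arguments, and executor settings are assumptions based on NeMo's public documentation, and submitters should follow the reference repository instead.

```python
# Minimal sketch of a single-node Llama 3.1 8B pretraining launch with the
# NeMo 2.0 recipe API. This is NOT the MLPerf reference code; recipe and
# executor arguments are assumptions and may differ from the official setup.
import nemo_run as run
from nemo.collections import llm

def configure_recipe():
    # Assumed recipe name; output directory is illustrative.
    return llm.llama31_8b.pretrain_recipe(
        name="llama31_8b_pretrain",
        dir="/results",
        num_nodes=1,            # single-node target of this benchmark
        num_gpus_per_node=8,
    )

if __name__ == "__main__":
    recipe = configure_recipe()
    # Launch locally via torchrun; nemo_run also provides cluster executors.
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor)
```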

Unlike many MLPerf Training benchmarks, including Llama 3.1 405B, this one starts from randomly initialized weights rather than a pretrained checkpoint. This simplifies running and porting the benchmark across different systems.

The benchmark also uses the native Llama 3.1 8B tokenizer, whereas the 405B benchmark deliberately uses a different tokenizer (the 32k-vocabulary Mixtral 8x22B tokenizer) to force the model checkpoint to adapt to a new token distribution. Because it starts without a checkpoint, the 8B benchmark can use its native tokenizer, making setup even easier. The tokenizer can be downloaded directly from the MLPerf website by following these instructions.
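As a quick sanity check after downloading, the snippet below loads the tokenizer and round-trips a sentence. It assumes the downloaded files are in Hugging Face format and uses the transformers library; the local path is illustrative.

```python
# Sanity-check the downloaded Llama 3.1 8B tokenizer (path is illustrative and
# assumes Hugging Face-format files as provided by the MLPerf instructions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/llama31_8b_tokenizer")
ids = tokenizer.encode("MLPerf Training v5.1 adds a Llama 3.1 8B pretraining benchmark.")
print(len(ids), "tokens")
print(tokenizer.decode(ids))
```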

To optimize convergence to the target accuracy criterion, submitters can adjust three hyperparameters: batch size, learning rate, and number of warmup samples. However, any chosen combination must match the convergence behavior defined by the Reference Convergence Points (RCPs). The full rules governing RCPs are available here.
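For illustration, the three tunable knobs can be thought of as a small configuration like the one below; the values are placeholders, not the reference hyperparameters.

```python
# The three submitter-tunable hyperparameters. Values are placeholders only;
# any real choice must still satisfy the RCP rules.
tunable_hparams = {
    "global_batch_size": 256,   # illustrative
    "learning_rate": 3e-4,      # illustrative
    "warmup_samples": 100_000,  # illustrative
}
```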

The task force kept validation log perplexity as the convergence criterion, maintaining consistency with the 405B benchmark. Extensive experiments on reference hardware established a target value of 3.3.
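Concretely, the log perplexity is the mean per-token cross-entropy over the evaluated sequences, so a run converges once that average drops to the target. A minimal illustration with made-up numbers:

```python
# Relation between evaluation loss and the convergence criterion: validation
# log perplexity is the mean per-token cross-entropy (in nats) over the
# evaluated sequences. Numbers below are made up for illustration.
import math

TARGET_LOG_PPL = 3.3  # benchmark target

def has_converged(total_nll: float, total_tokens: int) -> bool:
    log_ppl = total_nll / total_tokens
    print(f"log perplexity = {log_ppl:.3f} (perplexity = {math.exp(log_ppl):.1f})")
    return log_ppl <= TARGET_LOG_PPL

print(has_converged(total_nll=3.28e7, total_tokens=10_000_000))  # -> True
```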

Conclusion

This new pretraining benchmark based on Meta’s Llama 3.1 8B model brings modern LLM pretraining evaluation within reach of more organizations. It was intentionally designed to be easy to set up and run on small to moderately sized computational resources. By requiring only a single node, using a subset of the C4 dataset, and starting from random weights, it lowers the barrier to entry while maintaining relevance to current AI development practices. Organizations with experience running the Llama 3.1 405B benchmark can leverage that existing expertise, while those new to MLPerf Training gain an accessible entry point.

With this addition to MLPerf Training v5.1, the suite now offers LLM pretraining benchmarks from single-node systems to massive multi-cluster workloads, enabling standardized evaluation across different system scales.

For technical specifications and submission guidelines, visit our MLPerf Training benchmark page. Our submission rules and reference code are available on GitHub.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email [email protected].