Introduction
The MLCommons® MLPerf® Inference benchmark suite is an industry standard for measuring the performance of machine learning (ML) and artificial intelligence (AI) workloads from diverse domains including vision, speech, and natural language processing. For each of these domains, the suite includes a carefully selected set of workloads that represents the state of the art in the industry across different application segments. These benchmarks not only provide key information to consumers for making application deployment and budget decisions but also enable vendors to deliver critical workload optimizations to their customers within practical constraints.
The Small LLM task force was convened in early 2025 to ensure the MLCommons® MLPerf® Inference 5.1 benchmark is based on up-to-date LLMs for evaluating inference performance.
Model selection
The first model used in the LLM category of MLPerf Inference was GPT-J, with 6B parameters. While bigger models were later introduced, the small-sized LLM remains a convenient entry point for vendor benchmarking programs. To keep abreast of the rapid improvements in model capabilities and accompanying software/hardware performance, the task force sought a suitable replacement for GPT-J that reflects industry trends while providing a low barrier to entry for accelerator vendors. Llama3.1-8B conveniently fits the bill: it is among the top models on Hugging Face by download frequency, is reasonably sized to run on a broad range of hardware and accelerators, and offers a low cost of deployment.
Reference implementation
The reference implementation for Llama3.1-8B Instruct leverages vLLM, a fast and versatile library for LLM inference and serving. This choice was driven by vLLM’s robust support for large language models and its compatibility with a diverse range of hardware, including CPUs and GPUs from AMD, Google, Intel, NVIDIA, and others, making it an ideal choice for benchmark testing and a marked improvement over the transformers-based GPT-J reference implementation.
Readers interested in running the benchmark themselves are encouraged to follow the reference implementation, which contains the code and instructions needed to run the entire benchmark end to end.
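As an illustration only (not the reference implementation itself), the sketch below shows how Llama3.1-8B Instruct can be loaded and queried through vLLM’s offline Python API; the model identifier, prompt, and sampling settings here are assumptions for demonstration purposes.

```python
# Minimal sketch of offline inference with vLLM. The model ID and sampling
# values are illustrative, not the settings used by the MLPerf reference code.
from vllm import LLM, SamplingParams

# Load the instruction-tuned 8B checkpoint from Hugging Face.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Greedy decoding with a modest output budget, as in a summarization-style workload.
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["Summarize the following article:\n<article text goes here>"]
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.outputs[0].text)
```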
Dataset and task selection
Llama3.1-8B offers a large context length of 128,000 tokens, compared to GPT-J’s 2,048. This allows the model to handle extended-context tasks such as text summarization, which is of high importance for aiding human understanding and decision-making over large bodies of text. The dataset for Llama3.1-8B was carried over from GPT-J to cover a similar domain, with the added benefits of no longer being constrained by GPT-J’s context length for longer inputs and of a more efficient tokenizer (~4.5 characters per token).
The CNN-DailyMail validation dataset is among the most popular publicly available text-summarization datasets with an average input length of 778 tokens and output length of 73 tokens. This long-input/short-output combination differs from other existing LLM benchmark suites and presents a different challenge for inference benchmarking.
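For illustration, the sketch below loads the publicly hosted CNN-DailyMail validation split and measures token lengths with the Llama3.1 tokenizer. The dataset and tokenizer identifiers are assumptions, and the benchmark ships its own preprocessed copy of the data, so the computed averages may not match the quoted figures exactly.

```python
# Sketch: inspect input/output token lengths of the CNN-DailyMail validation split.
# Dataset and tokenizer IDs are assumptions for illustration only.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("cnn_dailymail", "3.0.0", split="validation")  # 13,368 articles
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def mean_token_length(texts):
    return sum(len(tok.encode(t)) for t in texts) / len(texts)

# Compare with the average input/output lengths quoted above.
print("mean input tokens :", mean_token_length(ds["article"]))
print("mean output tokens:", mean_token_length(ds["highlights"]))
```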

For this summarization task, we adopted ROUGE-1, ROUGE-2, and ROUGE-L, widely recognized NLP metrics that measure lexical overlap between generated summaries and ground-truth references, as our accuracy metrics. We set the accuracy threshold for closed-division submissions to 99% of the FP16 reference.
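As a sketch of what this accuracy check involves (not the official accuracy script), the example below scores candidate summaries with the Hugging Face evaluate library and applies the 99% criterion against FP16 reference scores supplied by the caller; the official script may differ in library choice and aggregation details.

```python
# Sketch: ROUGE scoring and the 99%-of-FP16 closed-division criterion.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]        # model-generated summaries
references  = ["a cat was sitting on the mat"]  # ground-truth highlights

scores = rouge.compute(predictions=predictions, references=references,
                       use_stemmer=True)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum

# Closed-division criterion: each ROUGE metric must reach at least 99% of the
# value obtained by the FP16 reference model on the same dataset.
def passes_accuracy(candidate, fp16_reference, threshold=0.99):
    return all(candidate[k] >= threshold * fp16_reference[k] for k in fp16_reference)
```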
Due to end-to-end running time concerns, the dataset size for edge submissions is reduced to 5,000 randomly sampled articles from the 13,368 articles in the CNN-DailyMail validation set.
Performance metrics
Measuring the performance of a small LLM such as Llama3.1-8B requires carefully chosen metrics that reflect both throughput and latency under realistic deployment conditions. The inference computation can be divided into two phases: the prompt (prefill) phase, which processes all input tokens and is measured as time to first token (TTFT), and the generation (decode) phase, which produces subsequent output tokens sequentially while reusing the prior context and is measured as time per output token (TPOT). For consistency across MLPerf Inference benchmarks, throughput is expressed in tokens per second (tokens/s), which accounts for the highly variable lengths of input and output sequences.
This is a change from the GPT-J benchmark, whose metric observed only the time to generate the last output token.
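The sketch below shows how TTFT, TPOT, and tokens/s follow from per-token completion timestamps; it mirrors the definitions above rather than MLPerf LoadGen’s internal measurement code.

```python
# Sketch: deriving per-query TTFT and TPOT, plus aggregate throughput,
# from the issue time of a query and the completion time of each output token.
def per_query_metrics(t_issue, token_timestamps):
    """t_issue: time the query was issued; token_timestamps: completion times of output tokens."""
    ttft = token_timestamps[0] - t_issue  # prompt (prefill) phase
    if len(token_timestamps) > 1:
        # generation (decode) phase: average gap between consecutive output tokens
        tpot = (token_timestamps[-1] - token_timestamps[0]) / (len(token_timestamps) - 1)
    else:
        tpot = 0.0  # single-token output has no decode gaps
    return ttft, tpot

def throughput_tokens_per_second(total_output_tokens, wall_clock_seconds):
    return total_output_tokens / wall_clock_seconds
```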
Latency constraints for the server scenario
For the server scenario, which models online services with random query arrivals, response latency is critical. Llama3.1-8B adopts the following latency thresholds: TTFT ≤ 2 seconds and TPOT ≤ 100 milliseconds.
Compared to the Llama 2 benchmarks, this is an adjustment that accounts for the smaller model size while ensuring responsiveness suitable for most general-purpose applications. A TPOT of 100 ms is effectively ~480 words per minute, significantly faster than typical human reading rates, enabling fluid interaction for chat or knowledge retrieval.
Latency constraints for the interactive scenario
In addition, Llama3.1-8B introduces a new interactive scenario to reflect use cases where user engagement and responsiveness are paramount, such as code assistants and real-time creative tools. In this mode, latency constraints are set more aggressively with TTFT ≤ 0.5 seconds and TPOT ≤ 30 milliseconds.
These constraints define a tighter performance envelope that ensures near-immediate feedback, with a TPOT equivalent to ~1,600 words per minute. They reflect workloads where the effective tokens/s per user is higher, stressing the system’s ability to maintain low-latency responses under higher concurrency.
With these differentiated latency targets, the benchmark captures the spectrum of real-world deployment for small LLMs. The server scenario is suitable for broad enterprise uses, while the interactive scenario highlights ultra-low-latency inference for applications where responsiveness drives the user experience.
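As a rough illustration of how these two envelopes could be checked against measured latencies, the sketch below compares latency percentiles against the thresholds quoted above; the percentile used for compliance is defined by the official MLPerf rules, so the p99 choice here is an assumption.

```python
# Sketch: checking measured TTFT/TPOT samples against the two scenario envelopes.
import numpy as np

LATENCY_TARGETS = {
    "server":      {"ttft_s": 2.0, "tpot_s": 0.100},
    "interactive": {"ttft_s": 0.5, "tpot_s": 0.030},
}

def meets_targets(ttfts_s, tpots_s, scenario, percentile=99):
    target = LATENCY_TARGETS[scenario]
    return (np.percentile(ttfts_s, percentile) <= target["ttft_s"] and
            np.percentile(tpots_s, percentile) <= target["tpot_s"])

# Back-of-the-envelope reading speed: 0.100 s/token -> 10 tok/s -> 600 tok/min
# -> ~480 words/min at ~0.8 words/token; 0.030 s/token -> ~1,600 words/min.
```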
Metrics for edge systems
Edge submissions for Llama3.1-8B require two scenarios: offline and single stream. For the offline scenario, we use the same performance metric as the data center category, tokens per second. For the single-stream scenario, we use the 90th-percentile per-sequence latency. We chose a sequence-level latency statistic rather than the per-token latencies (TTFT and TPOT) because per-token metrics alone do not capture the end-to-end performance of the system on a complete sequence.
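A minimal sketch of the single-stream metric, computed from end-to-end sequence latencies collected during a run:

```python
# Sketch: the single-stream edge metric is the 90th-percentile latency of a
# complete sequence (prompt processing plus all generated tokens), not TTFT/TPOT.
import numpy as np

def single_stream_metric(sequence_latencies_s):
    """sequence_latencies_s: end-to-end latency of each query, in seconds."""
    return np.percentile(sequence_latencies_s, 90)
```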
Conclusion
In this post, we described the complexities involved in choosing the model, task, dataset, accuracy metrics, performance metrics, and verification method for a contemporary LLM benchmark. We believe the design choices of this benchmark objectively address most of the critical deployment decisions faced by practitioners while delivering important performance insights to customers.
Details of the MLPerf Inference Llama3.1-8B benchmark and reference implementation can be found here.
Additional technical contributors: Pablo Gonzalez, MLCommons.