MLPerf Client
AI benchmark for personal computers
MLPerf Client is a new benchmark developed collaboratively at MLCommons to evaluate the performance of large language models (LLMs) and other AI workloads on personal computers, from laptops and desktops to workstations.
How it works
By simulating real-world AI tasks, the MLPerf Client benchmark provides clear metrics for understanding how well systems handle generative AI workloads. The MLPerf Client working group intends for this benchmark to drive innovation and foster competition, ensuring that PCs can meet the challenges of the AI-powered future.
The large language model tests
The LLM tests in MLPerf Client v0.5 use the Llama 2 7B large language model from Meta. Large language models are among the most exciting forms of generative AI available today, capable of performing a wide range of tasks based on natural-language interactions. The working group chose to focus its efforts on LLMs because there are many promising use cases for running them locally on client systems, from chat interactions to AI agents, personal information management, and more.
Tasks | Model | Dataset | Mode | Quality metric
---|---|---|---|---
Content generation; Creative writing; Summarization, light; Summarization, moderate | Llama 2 7B | OpenOrca | Single stream | MMLU score
For more information on how the benchmark works, please see the Q&A section below.
System requirements and acceleration support
Supported hardware
- AMD Radeon RX 7900 Series GPUs with 16GB or more VRAM
- AMD Ryzen AI 9 Series with 32GB or more system memory
- Intel Arc GPUs with 8GB or more VRAM
- Intel Core Ultra processors (Series 2) with built-in Intel Arc GPUs and 16GB or more system memory
- NVIDIA GeForce RTX 4000 Series GPUs with 16GB VRAM (recommended) or 12GB VRAM (minimum) and 32GB system memory (recommended) or 16GB system memory (minimum)
Other requirements
- Windows 11 x86-64 – latest updates recommended
- The Microsoft Visual C++ Redistributable must be installed
- 20GB of free space on the drive from which the benchmark will run (see the quick check after this list)
- We recommend installing the latest GPU drivers for your system
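Because the first run also downloads several gigabytes of model files, it is worth confirming free space up front. Here is an optional quick check in Python; it is a convenience only and not part of the benchmark itself:

```python
import shutil

# Run this from the folder where you extracted the benchmark
# (or pass that folder's path instead of ".").
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free space on this drive: {free_gb:.1f} GB")
if free_gb < 20:
    print("Warning: MLPerf Client needs at least 20GB of free space.")
```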
Supported acceleration paths
This first release of MLPerf Client includes two different paths for hardware-accelerated execution of its AI workload:
- ONNX Runtime GenAI with the DirectML execution provider (EP) for GPUs
- Intel OpenVINO for Intel GPUs
The ONNX Runtime GenAI path with the DirectML EP runs on integrated and discrete GPUs from AMD and Intel, as well as discrete GPUs from NVIDIA, provided they have sufficient memory. The ORT GenAI path is the default for both the AMD and NVIDIA configs in this release of the benchmark.
OpenVINO is the default path for Intel integrated and discrete GPUs, which can also optionally use ONNX Runtime GenAI with DirectML EP acceleration.
We plan to add more acceleration paths supporting various types of processing units in future releases.
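If you want an informal indication of which execution providers are present on your system, the onnxruntime Python package can list them, as shown below. Note that this inspects a separately installed Python build of ONNX Runtime rather than the ONNX Runtime GenAI libraries bundled with the benchmark, so treat it only as a rough sanity check:

```python
import onnxruntime as ort

# List the execution providers compiled into this ONNX Runtime build.
providers = ort.get_available_providers()
print(providers)

# "DmlExecutionProvider" is the DirectML EP used by the ORT GenAI path;
# "OpenVINOExecutionProvider" corresponds to the Intel OpenVINO path.
print("DirectML available:", "DmlExecutionProvider" in providers)
print("OpenVINO available:", "OpenVINOExecutionProvider" in providers)
```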
Downloading the benchmark
The latest release of MLPerf Client can be downloaded from the GitHub release page. The end-user license agreement is included in the download archive file.
The public GitHub repository also contains the source code for MLPerf Client for those who might wish to inspect or modify it.
More information about how to run the benchmark is provided in the Q&A section below.
Common questions and answers
How does one run the benchmark?
This benchmark is built for Windows 11 and requires the installation of the Microsoft Visual C++ Redistributable, which is available for download from Microsoft. Please install this package before attempting to run MLPerf Client.
The MLPerf Client benchmark is distributed as a Zip file. Once you’ve downloaded the Zip archive, extract it to a folder on your local drive. Inside the folder you should see a few files, including the MLPerf Client executable and several JSON config files for specific brands of hardware.
To run MLPerf Client, open a command line instance in the directory where you’ve unzipped the files. In Windows 11 Explorer, if you are viewing the unzipped files, you can right-click in the Explorer window and choose “Open in Terminal” to start a PowerShell session in the appropriate folder.
The JSON config files specify the acceleration paths and configuration options recommended by each of the participating hardware vendors: AMD, Intel, and NVIDIA.
To run the benchmark, you need to call the benchmark executable with the -c flag, specifying which config file to use. If you don’t use the -c flag to specify a config file, the benchmark will not run.
The command takes the following form:
mlperf-windows.exe -c <config-filename>.json
So, for instance, you could run the NVIDIA ONNX Runtime DirectML path by typing:
mlperf-windows.exe -c Nvidia_ORT-GenAI-DML_GPU.json
Once you’ve kicked off the test run, the benchmark will begin downloading the files it needs, which may take a while. For the ONNX Runtime GenAI path, for instance, the downloaded files total just over 7GB.
After the download, the test should begin running, and in the end, it should output the results.
What are these “configuration tested” notifications?
The MLPerf Client benchmark was conceived as more than just a static test. It offers extensive control and configurability over how it runs, and the MLPerf Client working group encourages its use for experimentation and testing beyond the included configurations.
However, scores generated with modified executables, libraries, or configuration files are not to be considered comparable to the fully tested and approved configurations shipped with the benchmark application.
When the benchmark runs with one of the standard config files, you should see a “Configuration tested by MLCommons” notification in the output.
This “Configuration tested by MLCommons” note means that both the executable program and the JSON config file you are using have been tested and approved by the MLPerf Client working group ahead of the release of the benchmark application. This testing process involves peer review among the organizations participating in the development of the MLPerf Client benchmark. It also includes MMLU-based accuracy testing. Only scores generated with this “configuration tested” notice should be considered valid MLPerf Client scores.
Future releases of MLPerf Client will support a wider range of acceleration paths, so there will be more tested configs capable of producing a broader range of comparable scores.
What do the performance metrics mean?
The LLM scenario produces results in terms of two key metrics. Both metrics characterize important aspects of the user experience.
Time to first token (TTFT): This is the wait time, in seconds, before the system produces the first token in response to each prompt. In LLM interactions, the wait for the first output token is typically the longest pause in the response, so, following widespread industry practice, we report this result separately. For TTFT, lower numbers, and thus shorter wait times, are better.
Tokens per second (TPS): This is the average rate at which the system produces the rest of the tokens in the response, excluding the first one. For TPS, higher output rates are better.
In a chat-style LLM interaction, TTFT is the time you wait until the first character of the response appears, and TPS is the rate at which the rest of the response appears. The total latency for a complete response is therefore approximately TTFT + (number of output tokens − 1) / TPS.
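For example, with an assumed TTFT of 1.0 second and 20 tokens per second, a 256-token response would take roughly 1.0 + 255/20 ≈ 13.8 seconds. A minimal sketch of that arithmetic (the numbers are hypothetical, not measured results):

```python
def estimated_response_time(ttft_s, tokens_per_s, output_tokens):
    """Rough end-to-end latency: the first token arrives after TTFT,
    and the remaining tokens stream out at the measured tokens/s rate."""
    return ttft_s + (output_tokens - 1) / tokens_per_s

# Hypothetical numbers for illustration only (not measured results):
print(f"{estimated_response_time(1.0, 20.0, 256):.1f} s")  # ~13.8 s
```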
What are the different task categories in the benchmark?
Here’s a look at the token lengths for the four categories of work included in the benchmark:
Category | Approximate input tokens | Approximate expected output tokens |
---|---|---|
Content generation | 128 | 256 |
Creative writing | 512 | 512 |
Summarization, Light | 1024 | 128 |
Summarization, Moderate | 1566 | 256 |
You may find that hardware and software solutions vary in their performance across the categories, with larger context lengths being more computationally expensive.
All of the input prompts used in the four categories come from the OpenOrca data set.
What is a token?
The fundamental unit of length for inputs and outputs in an LLM is a token. A token is a component of language that the machine-learning model uses to understand and generate text. Tokens can be words, parts of words, or punctuation marks. For instance, a complex word like “everyone” could be made up of two tokens, one for “every” and another for “one.”
An example of tokenization. Each colored region represents a token.
As a rule of thumb, 100 tokens typically translate into about 75 English words. An online tokenizer tool can give you a sense of how words and phrases are broken down into tokens.
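To see tokenization concretely, you can run any subword tokenizer over a sentence. The sketch below uses the tiktoken library’s cl100k_base encoding purely as an illustration; it is not the Llama 2 tokenizer used by the benchmark, so the exact splits will differ:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Everyone enjoys summarizing long documents quickly."

# Encode the sentence, then decode each token id individually
# to show how the text breaks down into token pieces.
token_ids = enc.encode(text)
pieces = [enc.decode([tid]) for tid in token_ids]
print(len(token_ids), "tokens:", pieces)
```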
How is the benchmark optimized for client systems, and how do you ensure quality results?
Many generative AI models are too large to run in their native form on client systems, so developers often modify them to fit within the required memory and computational footprints. This footprint reduction is typically achieved through model quantization: changing the datatype used for the model weights from the original 16- or 32-bit floating-point format (fp16 or fp32) to something more compact. For LLMs on client systems, the 4-bit integer format (int4) is a popular choice. Thus, in MLPerf Client v0.5, the Llama 2 7B model has been quantized from fp16 to int4. This change allows the model to fit and function well on a broad range of client devices, from laptops to desktops and workstations.
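As a minimal sketch of the idea, the snippet below applies symmetric per-tensor quantization to the int4 range. Production int4 schemes for LLM weights are more sophisticated (for example, quantizing per group of weights with carefully chosen scales), so this is only an illustration of the principle:

```python
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Symmetric quantization of floating-point weights to the int4 range [-8, 7]."""
    scale = np.max(np.abs(weights)) / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # stored in int8 here
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate floating-point weights from int4 values and a scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, scale = quantize_int4(w)
print("original:     ", np.round(w, 3))
print("reconstructed:", np.round(dequantize_int4(q, scale), 3))
```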
That said, reducing the precision of the weights can also come with a drawback: a reduction in the quality of the AI model’s outputs. Other model optimizations can also impact quality. To ensure that a model’s functionality hasn’t been too compromised by optimization, the MLPerf Client working group chose to impose an accuracy test requirement based on the MMLU dataset. Each accelerated execution path included in the benchmark must achieve a passing score on an accuracy test composed of the prompts in the MMLU corpus.
The accuracy test is not part of the standard MLPerf Client performance benchmark run because accuracy testing can take quite some time. Instead, the accuracy test is run offline by working group participants in order for an acceleration path to qualify for inclusion in the benchmark release. All of the supported frameworks and models in MLPerf Client v0.5 have been verified to achieve the required accuracy level.
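MMLU is a multiple-choice test, so scoring it comes down to comparing the model’s selected answer against the reference answer for each question. The working group’s actual harness and passing threshold are not reproduced here; the loop below is only an illustrative sketch over hypothetical records:

```python
# Hypothetical (question, model_answer, reference_answer) records for illustration.
results = [
    ("Q1", "B", "B"),
    ("Q2", "C", "A"),
    ("Q3", "D", "D"),
]

correct = sum(1 for _, predicted, reference in results if predicted == reference)
accuracy = correct / len(results)
print(f"Accuracy: {accuracy:.1%}")  # compared against the required passing score
```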
Does the benchmark require an Internet connection to run?
Only for the first run with a given acceleration path. After that, the downloaded files will be stored locally, so they won’t have to be downloaded again on subsequent runs.
Do I need to run the test multiple times to ensure accurate performance results?
Although it’s not visible from the CLI output, the benchmark actually runs each test four times internally: one warm-up run followed by three performance runs. The application then reports the average of the results from the three performance runs.
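Conceptually, the scheme looks like the sketch below, where run_benchmark_once is a placeholder standing in for a full pass over the workload (the real benchmark does this internally; this is only an illustration):

```python
import random
import statistics

def run_benchmark_once():
    """Placeholder for one full pass over the workload; returns (ttft_s, tokens_per_s).
    Here it just fabricates plausible-looking numbers for illustration."""
    return random.uniform(0.8, 1.2), random.uniform(18.0, 22.0)

# One warm-up run followed by three performance runs, mirroring the description above.
runs = [run_benchmark_once() for _ in range(4)]
_warm_up, *performance_runs = runs  # the warm-up result is discarded

avg_ttft = statistics.mean(r[0] for r in performance_runs)
avg_tps = statistics.mean(r[1] for r in performance_runs)
print(f"Average TTFT: {avg_ttft:.3f} s, average TPS: {avg_tps:.1f} tokens/s")
```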
The benchmark reports an error message and stops running without reporting results. What should I do?
We are aware of an issue on systems using the Intel OpenVINO path that have both an integrated and a discrete GPU enabled. This issue may cause the benchmark to encounter an error. If you run into it, we recommend disabling the integrated GPU in the system BIOS or device manager before testing; the benchmark should then run on the discrete GPU. Unfortunately, some systems, such as laptops, may not support discrete-GPU-only operation. We intend to address this issue in a future update.
The benchmark doesn’t perform as well as expected on my system’s discrete GPU. Any ideas why?
Even if you run the benchmark with a config file intended for a specific brand of discrete GPU, the benchmark may sometimes run on the integrated GPU if one is present in the system. Fortunately, the benchmark output indicates which device it’s using, so you can inspect the output to verify what’s happening. A possible workaround is to disable the integrated GPU in the system BIOS or device manager for the duration of testing, if the system can operate without it. Some systems, like laptops, may not support this workaround.
How can I contribute to the ongoing development of MLPerf Client?
For details on becoming an MLCommons member and joining the MLPerf Client working group, please contact [email protected].