Introduction
The MLCommons Inference Working Group is excited to introduce a new Vision-Language Model (VLM) benchmark for the MLPerf Inference v6.0 round. As multimodal AI continues to reshape industries from retail to robotics, measuring the performance of these complex models in realistic production scenarios is more critical than ever.
For this round, we are debuting the Qwen3-VL-235B-A22B-Instruct model (released on Sept 23, 2025), paired with the Shopify Product Catalog dataset (released on Dec 12, 2025) to perform product classification and understanding, a common task in retail and commerce applications. This benchmark represents a significant leap forward in evaluating how AI systems handle the messy, high-stakes reality of hyperscale e-commerce, where the infrastructure behind the scenes processes tens of millions of products daily.
The submission deadline is February 13, 2026. We invite all hardware vendors, cloud providers, and AI system experts to submit their results and help raise the bar for multimodal inference performance.
The Rise of Multimodal AI
The shift from unimodal (text-only) to multimodal AI is one of the most transformative trends in the industry. By integrating visual and textual data, businesses can automate complex workflows that were previously impossible, such as visual search, automated product tagging, and complex document understanding.
Market research underscores this urgency. The global multimodal AI market was valued at approximately USD 1.73 billion in 2024 and is projected to reach USD 10.89 billion by 2030, growing at a compound annual growth rate (CAGR) of 36.8% [1]. In the retail and e-commerce sectors specifically, adoption is accelerating even faster, with a projected CAGR of 34.6% through 2030 as merchants deploy personalized styling tools and augmented reality features [2]. This benchmark directly addresses this market demand by simulating a core “product understanding” workload that drives these commercial applications.
Model Selection: The Era of Qwen
This benchmark marks a milestone for the MLPerf Inference Benchmark Suite: it is the first time a model from the Qwen family has been introduced into the suite.
The open-weights landscape has evolved rapidly over the last 18 months, with the Qwen family emerging as a dominant force in the ecosystem. Recent data highlights this ascent:
- Adoption & downloads: Data from Hugging Face confirms that Qwen has cemented its status as a tier-one open-weights model family globally. By late 2025, it achieved record-breaking cumulative download numbers, driven by widespread adoption among developers and researchers [3, 4, 5].
- Performance: Independent benchmarks consistently rank Qwen models at the forefront of the industry. Qwen3-VL 235B A22B, in particular, delivers state-of-the-art results across a wide range of tasks, achieving top-tier response quality while activating only a fraction of its total parameters per token [6, 7].
By selecting Qwen3-VL 235B A22B [8], we ensure MLPerf Inference remains at the bleeding edge, reflecting the models that developers and enterprises are actually deploying in 2026.
Qwen3-VL 235B A22B is a massive-scale mixture-of-experts (MoE) vision-language model designed to unify high-performance reasoning with efficient inference. It features 235 billion total parameters with 22 billion activated parameters per token, allowing it to deliver dense-model quality at a fraction of the computational cost. Built on the Qwen3 backbone [9], it natively supports a 256K token context window for interleaved text, image, and video inputs, and it introduces key architectural innovations like DeepStack (which fuses multi-level vision features) [10] and interleaved-MRoPE (for enhanced spatial-temporal modeling) [11]. This architecture enables state-of-the-art performance in complex tasks ranging from visual agentic workflows to long-context video understanding.
Case Study: Shopify Catalog
To ensure this benchmark reflects real-world production needs, we partnered with Shopify to curate the dataset and define the task. This workload mimics the “hierarchical taxonomy classification” task found in the “Shopify Catalog” product understanding layer, which processes 40 million products daily [12].
In production, the product understanding layer utilizes VLMs to perform a series of tasks that transform products’ unstructured data (e.g., non-canonical photos or descriptions provided by merchants) into standardized metadata, including hierarchical taxonomy classification, attribute extraction, image understanding, title standardization, description analysis, and review summarization [12].
For the MLPerf Inference v6.0 round, we focus specifically on the hierarchical taxonomy classification task. The system must ingest a product’s title, description, and photo and select the correct category from a dynamic set of potential categories. This task perfectly captures the challenge of modern AI: maintaining high accuracy and throughput on noisy, real-world data at a massive scale.
Example Request and Response
An example request to a VLM inference endpoint might look like the following:
Python
[
{
'content': """Please analyze the product from the user prompt
and provide the following fields in a valid JSON object:
- category
- brand
- is_secondhand
You must choose only one, which is the most appropriate, correct, and specific
category out of the list of possible product categories.
The description of the product sometimes contains various types of source code
(e.g., JavaScript, CSS, HTML, etc.), where useful product information is embedded
somewhere inside the source code. For this task, you should extract the useful
product information from the source code and leverage it, and discard the
programmatic parts of the source code.
Your response should only contain a valid JSON object and nothing more, e.g.,
you should not fence the JSON object inside a ```json code block.
The JSON object should match the following JSON schema:
```json
{
"additionalProperties": false,
"description": "Json format for the expected responses from the VLM.",
"properties": {
"category": {
"description": "The complete category of the product, e.g.,\n\"Clothing & Accessories > Clothing > Shirts > Polo Shirts\".\nEach categorical level is separated by \" > \".",
"title": "Category",
"type": "string"
},
"brand": {
"description": "The brand of the product, e.g., \\"giorgio armani\\".",
"title": "Brand",
"type": "string"
},
"is_secondhand": {
"description": "True if the product is second-hand, False otherwise.",
"title": "Is Secondhand",
"type": "boolean"
}
},
"required": [
"category",
"brand",
"is_secondhand"
],
"title": "ProductMetadata",
"type": "object"
}
```
""",
'role': 'system'
},
{
'content': [
{
'text': """The title of the product is the following:
```text
Frendorf | Fahrradindikator-Signalweste
```
The description of the product is the following:
```text
Die RADFAHR-INDIKATOR-SIGNALWESTE IST EIN NEUES PRODUKT DERZEIT NICHT IM HANDEL ERHÄLTLICH UND NICHT VERFÜGBARE . VERMEIDEN SIE UNNÖTIGE UND GEFÄHRLICHE RISIKEN FÜR IHR LEBEN! DIE SICHERHEITSWESTE IST DER Einfachste UND SICHERSTE WEG, UM DIE SICHTBARKEIT FÜR FAHRER ZU VERBESSERN. Geeignet für alle Radfahrer Für Kinder, die Fahrrad fahren Für Kinder auf dem Roller Für Jogger, die nachts laufen Für Kinder, die neben einer Straße gehen (die Weste passt ideal über einen Schulranzen) Diese Weste lässt sich hervorragend über einem Rucksack tragen. Eigenschaften: EINFACH FÜR AUTOFARER ZU ERKENNEN FÜR ERWACHSENE UND KINDER Die beiden verstellbaren Riemen machen sie für Kinder und Erwachsene geeignet, was bedeutet, dass eine Weste von mehreren Familienmitgliedern genutzt werden kann. FÜR VERSCHIEDENE KLEIDUNGEN Über einer Jacke oder einem dicken Mantel Über einem Rucksack oder Schulranzen SICHTBARKEIT = SICHERHEIT FERNBEDIENUNG Die Fernbedienung ist äußerst benutzerfreundlich und somit auch für kleine Kinder leicht zu bedienen: rechter Knopf zum Abbiegen nach rechts, linker Knopf zum Abbiegen nach links usw. Das Licht der Fernbedienung und das Blinken der Weste sind synchronisiert, sodass der Nutzer immer informiert ist, was hinter ihm geschieht. IN WENIGEN MINUTEN BEREIT ZUM EINSATZ Die Fernbedienung lässt sich dank der kleinen Kunststoffhalterungen ganz einfach am Lenker anbringen. Lieferumfang: 1 x RADFAHR-INDIKATOR-SIGNALWESTE 1 x Fernbedienung
```
The following are the possible product categories:
```json
['Sporting Goods > Fitness & General Exercise Equipment > Sport Safety Lights & Reflectors > Sport Safety Lights', 'Animals & Pet Supplies > Pet Supplies > Pet Collars & Harnesses > LED Collars', 'Sporting Goods > Outdoor Recreation > Cycling > Bicycle Accessories > Bicycle Computer Accessories > Bicycle Computer Cadence Sensors', 'Vehicles & Parts > Vehicle Parts & Accessories > Motor Vehicle Parts > Motor Vehicle Lighting > Tail Lights', 'Vehicles & Parts > Vehicle Parts & Accessories > Motor Vehicle Parts > Motor Vehicle Lighting > Light Bars', 'Business & Industrial > Work Safety Protective Gear > Work Safety Harnesses > Vest-Style Work Safety Harnesses', 'Sporting Goods > Fitness & General Exercise Equipment > Cardio > Cardio Machine Accessories & Parts > Exercise Bike Accessories & Parts > Exercise Bike Heart Rate Monitors', 'Vehicles & Parts > Vehicle Parts & Accessories > Motor Vehicle Parts > Motor Vehicle Lighting > Turn Signals']
```
""",
'type': 'text'
},
{
'image_url': {
'url': 'data:image/JPEG;base64,<omitted base64 encoding of the following image>'
},
'type': 'image_url'
}
],
'role': 'user'
}
]
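For reference, the base64 payload in the image_url field can be built from a local image file as in the following sketch (the helper name and file path are hypothetical; in the benchmark, product photos come from the dataset):
Python
import base64

def to_image_data_uri(path: str, media_type: str = "image/jpeg") -> str:
    """Read an image file and return a base64 data URI suitable for image_url."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{media_type};base64,{encoded}"

# Hypothetical local file used only for illustration.
image_content = {"type": "image_url", "image_url": {"url": to_image_data_uri("product_photo.jpg")}}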

Then, the response from the VLM inference endpoint would look like the following:
JSON
{
    "category": "Sporting Goods > Fitness & General Exercise Equipment > Sport Safety Lights & Reflectors > Sport Safety Lights",
    "brand": "Frendorf",
    "is_secondhand": false
}
Performance Metrics
For the “offline” scenario, we measure and compare the overall request throughput (i.e., the number of completed requests per second) when every sample from both the train and test splits of the dataset is sent to the VLM inference endpoint at least once. This arrangement closely mirrors Shopify’s main batch-processing requirement in production: processing as many product samples as possible with as few GPU/accelerator resources as possible, beyond the current volume of tens of millions of product samples daily.
For the “server” scenario, we set a 99th-percentile request latency constraint of 12 seconds and compare the maximum number of requests per second that can be sent to the VLM inference endpoint before that constraint is violated. This scenario reflects Shopify’s online processing use case of handling demand surges before and during events or holidays (e.g., Black Friday).
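As an illustration, the offline throughput and the server-scenario latency check can be derived from per-request timing records roughly as in the sketch below. The timing data and the nearest-rank percentile are assumptions for illustration only; the benchmark’s load generator defines the actual measurement.
Python
import math

# Hypothetical per-request timing records (send_time, completion_time) in seconds;
# real measurements come from the benchmark harness, not from this sketch.
records = [(0.0, 4.2), (0.1, 5.0), (0.2, 11.8), (0.3, 6.4)]

# Offline scenario: overall request throughput across the whole run.
start = min(sent for sent, _ in records)
end = max(done for _, done in records)
offline_throughput = len(records) / (end - start)  # completed requests per second

# Server scenario: the 99th-percentile request latency must not exceed 12 seconds.
latencies = sorted(done - sent for sent, done in records)
p99 = latencies[min(len(latencies) - 1, math.ceil(0.99 * len(latencies)) - 1)]
constraint_met = p99 <= 12.0
print(offline_throughput, p99, constraint_met)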
Accuracy Metric
The response will be parsed by the following schema:
Python
import pydantic


class ProductMetadata(pydantic.BaseModel):
    """Json format for the expected responses from the VLM."""

    category: str
    """The complete category of the product, e.g.,
    "Clothing & Accessories > Clothing > Shirts > Polo Shirts".
    Each categorical level is separated by " > ".
    """

    brand: str
    """The brand of the product, e.g., "giorgio armani"."""

    is_secondhand: bool
    """True if the product is second-hand, False otherwise."""
We compute the hierarchical F1 score [13, 14] between the predicted categories and the ground-truth categories of all product samples in the dataset. Unlike exact matching, the hierarchical F1 score gives proportional credit for a partial, but not complete, match between a predicted category and the ground-truth product category, such as
Cameras & Optics > Cameras > Video Cameras > Drone Video Cameras
vs.
Cameras & Optics > Cameras > Video Cameras > Cinema Video Cameras
If a response cannot be parsed by the given JSON schema, it is treated as a complete mismatch between the predicted and ground-truth product categories. In this case, the number of category levels of the (mis-)prediction, which affects the hierarchical F1 calculation, is taken to be the same as that of the ground truth. For example, when encountering a JSON parsing error, if the ground truth is
Cameras & Optics > Cameras > Video Cameras > Drone Video Cameras
then the (mis-)prediction will be treated as
|NA| > |NA| > |NA| > |NA|
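To make the scoring concrete, a per-example hierarchical F1 between two category paths can be computed over their ancestor sets, following the definitions in [13, 14]. The sketch below is illustrative; the reference implementation’s exact aggregation over the dataset (e.g., micro-averaging across all samples) may differ.
Python
def hierarchical_f1(predicted: str, ground_truth: str, sep: str = " > ") -> float:
    """Per-example hierarchical F1 between two category paths.

    Each path is expanded into its set of ancestor prefixes, e.g.
    "A > B > C" -> {"A", "A > B", "A > B > C"}, and F1 is computed on
    the overlap of those sets.
    """
    def ancestors(path: str) -> set[str]:
        levels = path.split(sep)
        return {sep.join(levels[: i + 1]) for i in range(len(levels))}

    pred, truth = ancestors(predicted), ancestors(ground_truth)
    overlap = len(pred & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)

# The partial match from the example above scores 0.75 (3 of 4 levels shared).
print(hierarchical_f1(
    "Cameras & Optics > Cameras > Video Cameras > Drone Video Cameras",
    "Cameras & Optics > Cameras > Video Cameras > Cinema Video Cameras",
))
Under this sketch, the all-|NA| fallback shares no ancestors with any real ground-truth path, so it scores 0, consistent with the complete-mismatch treatment.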
We set the accuracy target for the category hierarchical F1 score to 0.7824. This is 99% of 0.7903037, the mean category hierarchical F1 score across 10 runs of the original Qwen3-VL-235B-A22B-Instruct.
Join the Benchmark
This is your opportunity to demonstrate the capabilities of your hardware and software stacks on one of the most relevant and rapidly growing AI workloads in the industry.
- Model: Qwen/Qwen3-VL-235B-A22B-Instruct
- Commit SHA: 710c13861be6c466e66de3f484069440b8f31389
- Dataset: Shopify/product-catalogue
- Commit SHA: d5c517c509f5aca99053897ef1de797d6d7e5aa5
- Submission Deadline: Feb 13, 2026
You can find the reference implementation along with the performance benchmarking code here. The reference implementation is based on vLLM, but its command-line interface and plugin registration let you benchmark arbitrary inference systems that expose endpoints via the OpenAI Chat Completions API. We look forward to seeing your submissions!
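Because any system exposing an OpenAI Chat Completions endpoint can be benchmarked, a single request like the example above can be issued with the standard openai Python client. The snippet below is a minimal sketch; the base URL, API key, and stand-in messages are placeholders rather than values from the reference implementation.
Python
from openai import OpenAI

# Hypothetical local endpoint, e.g., a vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# `messages` would normally be the system + user messages from the example request;
# a trivial stand-in is used here so the snippet is self-contained.
messages = [{"role": "user", "content": "Classify this product: Frendorf signal vest."}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=messages,
)
print(response.choices[0].message.content)  # expected to be a raw JSON object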
Acknowledgments
We would like to thank the following individuals (sorted alphabetically) for their feedback and help that guided us in the journey of developing this benchmark:
- Ashwin Nanjappa (NVIDIA)
- Diego Castañeda (Wealthsimple; ex-Shopify)
- Frank Han (Dell)
- Gennady Pekhimenko (NVIDIA)
- Harika Pothina (Red Hat)
- Javier Moreno (Shopify)
- Junhao Li (Ubicloud)
- Lei Pan (Pinterest)
- Mert Unsal (Mistral)
- Michael Goin (Red Hat)
- Mikhail Parakhin (Shopify)
- Mingyuan Ma (NVIDIA)
- Miro Hodak (AMD)
- Naveen Miriyalu (Red Hat)
- Qidong Su (NVIDIA)
- Rachata Ausavarungnirun (MangoBoost)
- Roger Wang (Inferact)
- Shobhit Verma (NVIDIA)
- Thomas Atta-Fosu (Intel)
- Viraat Chandra (NVIDIA)
- Yubo Gao (NVIDIA)
- Zhanda Zhu (NVIDIA)
- Zhihan Jiang (NVIDIA)
References
- Grand View Research. (2024). Multimodal AI Market Size And Share | Industry Report, 2030. https://www.grandviewresearch.com/industry-analysis/multimodal-artificial-intelligence-ai-market-report
- Mordor Intelligence. (2025). Multimodal AI Market Size & Share Analysis – Industry Reports. https://www.mordorintelligence.com/industry-reports/multimodal-ai-market
- Xinhua / China Daily. (Jan 12, 2026). Alibaba’s Qwen leads global open-source AI community with 700 million downloads. https://www.chinadailyhk.com/hk/article/626974
- AI World. (2025). Chinese developers account for over 45% of top open-model public downloads. https://www.aiworld.eu/story/chinese-developers-account-for-over-45-of-top-open-model-public-downloads
- Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, Anjney Midha. (2025). State of AI: An Empirical 100 Trillion Token Study with OpenRouter. https://openrouter.ai/state-of-ai
- LLM-Stats. (2026). Qwen3 VL 235B A22B Instruct. https://llm-stats.com/models/qwen3-vl-235b-a22b-instruct
- Artificial Analysis. (2026). Qwen3 VL 235B A22B (Reasoning) Intelligence, Performance & Price Analysis. https://artificialanalysis.ai/models/qwen3-vl-235b-a22b-reasoning
- Qwen Team. (2025). Qwen3-VL Technical Report. https://arxiv.org/pdf/2511.21631
- Qwen Team. (2025). Qwen3 Technical Report. https://arxiv.org/pdf/2505.09388
- Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. In Advances in Neural Information Processing Systems, volume 37, pp. 23464–23487, 2024.
- Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision-language models, 2025.
- Audrey-Anne Guindon. (2025). Leveraging multimodal LLMs for Shopify’s global catalogue: Recap of expo talk at ICLR 2025. https://shopify.engineering/leveraging-multimodal-llms
- Silla, C. N., & Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1), 31-72.
- Kiritchenko, S., Matwin, S., Nock, R., & Famili, A. F. (2006, June). Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 395-406). Springer, Berlin, Heidelberg.