
Expanding AILuminate: Researching Multilingual, Multicultural, and Multimodal Benchmarking

As AI adoption accelerates worldwide, the need for evaluations that reflect the diversity of languages, cultures, and modalities people actually use has never been greater. Today’s benchmarks were largely built for English-language, text-only interactions, leaving significant gaps in how we understand AI risk and reliability globally. Our goal is to build safety evaluations that are globally relevant and culturally grounded, enabling stakeholders to confidently assess AI systems in the languages and modalities people actually use.
MLCommons is actively researching how to expand the AILuminate Benchmark Suite to support multilingual, multicultural, and multimodal evaluation. This research explores how to adapt AI safety measurement for new languages, cultural contexts, and input types, including text-and-image interactions.
Importantly, this research goes beyond text-only evaluation. Where existing AILuminate benchmarks assess text-to-text interactions, the new research explores Text+Image-to-Text (T+I2T) evaluation, measuring how AI systems respond when both text and images are provided as input. This reflects how people increasingly interact with AI in practice, and captures risks that emerge only when visual and textual content combine. The multimodal evaluation methodology builds on the Multimodal Safety Test Suite (MSTS) research (Röttger et al., 2025), which established a framework for assessing risks that arise specifically from the combination of text and image inputs.
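To make the T+I2T setup concrete, here is a minimal sketch of what such an evaluation loop could look like. All names here (`MultimodalPrompt`, `model_fn`, `judge_fn`, `unsafe_response_rate`) are illustrative assumptions, not part of any published AILuminate or MSTS interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MultimodalPrompt:
    """A single Text+Image-to-Text (T+I2T) test item."""
    text: str          # textual portion of the prompt
    image_path: str    # path to the paired image
    hazard: str        # hazard category the item probes

def unsafe_response_rate(
    prompts: list[MultimodalPrompt],
    model_fn: Callable[[str, str], str],
    judge_fn: Callable[[str], bool],
) -> float:
    """Fraction of prompts that elicit a response the judge flags as unsafe.

    `model_fn` maps (prompt text, image path) to the system-under-test's
    text response; `judge_fn` decides whether that response is unsafe.
    Both are placeholders for whatever system and evaluator are plugged in.
    """
    if not prompts:
        return 0.0
    flagged = sum(judge_fn(model_fn(p.text, p.image_path)) for p in prompts)
    return flagged / len(prompts)
```

The key point the sketch captures is that the unit of evaluation is the text-and-image pair, so it can surface unsafe behavior that neither input would trigger alone.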
Early research is exploring languages and cultural contexts across the Asia-Pacific region, including Hindi, Tamil, Malay, Korean, and Japanese, among others; work in these languages is already underway and will continue to expand. A pilot dataset of more than 7,000 text-and-image prompts is in development across the target languages. Critically, these prompts are not simply translated from English: they are authored for cultural relevance and validated by native speakers in each language to ensure they reflect authentic local contexts and concerns. This builds on AILuminate’s existing English, French, and limited Chinese text-to-text benchmarks.
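As a rough illustration of what one record in such a dataset might carry, the sketch below pairs native-language prompt text with an image reference, a hazard label, and native-speaker validation metadata. The schema and field names are assumptions for illustration; the actual dataset format has not been published.

```python
from dataclasses import dataclass, field

@dataclass
class LocalizedPrompt:
    """One culturally grounded T+I2T test item (hypothetical schema)."""
    prompt_id: str
    language: str                # language tag, e.g. "hi", "ta", "ms", "ko", "ja"
    text: str                    # prompt authored in the target language
    image_path: str              # culturally relevant paired image
    hazard: str                  # one of AILuminate's 12 hazard categories
    authored_natively: bool = True  # authored in-language, not translated
    validated_by: list[str] = field(default_factory=list)  # native-speaker reviewers

# Example: a record authored in Hindi rather than translated from English.
item = LocalizedPrompt(
    prompt_id="hi-0001",
    language="hi",
    text="...",                  # native-language prompt text elided here
    image_path="images/hi-0001.png",
    hazard="physical_hazards",   # placeholder category name
    validated_by=["reviewer_a", "reviewer_b"],
)
```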
Like the prompts, AILuminate’s 12 hazard categories, which span physical, non-physical, and contextual hazards, are being adapted rather than merely translated for each region. What constitutes harmful content can vary significantly across cultural contexts, and this research aims to ensure that safety evaluations reflect those differences rather than imposing a single set of assumptions.
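One way to picture this adaptation step is as locale-specific guidance layered over a shared taxonomy, as in the sketch below. The category keys and guidance text are invented for illustration and do not reflect MLCommons’ actual localization guidelines.

```python
# Hypothetical: per-locale adaptation notes layered over a shared hazard
# taxonomy. Category names and guidance text are illustrative only.
HAZARD_ADAPTATIONS: dict[str, dict[str, str]] = {
    "hate": {
        "hi-IN": "Include caste- and region-specific slurs absent in English data.",
        "ko-KR": "Include harassment framed through age and status hierarchies.",
    },
    "dangerous_activities": {
        "ja-JP": "Reflect locally regulated substances and applicable law.",
    },
}

def adaptation_note(hazard: str, locale: str) -> str | None:
    """Return the locale-specific guidance for a hazard category, if any."""
    return HAZARD_ADAPTATIONS.get(hazard, {}).get(locale)
```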
This work is led by Hiwot Tesfaye (Microsoft Office of Responsible AI), along with Lora Aroyo and Alicia Parrish (Google DeepMind). It is being pursued in collaboration with leading industry, academic, and government partners, including:
Industry
- Google (Trust & Safety, DeepMind)
- Microsoft (Office of Responsible AI, Microsoft Research India)
Academic & Research Institutions
- IIT Madras — Center for Responsible AI (CeRAI)
- Seoul National University
- Yonsei University
- National Institute of Informatics (NII), Japan
Government & Standards Bodies
- IMDA / AI Verify Foundation, Singapore
- Korea AI Safety Institute (Korea-AISI)
This research is part of MLCommons’ broader program of building AI benchmarks that are globally relevant, technically rigorous, and aligned with emerging international standards. We look forward to sharing findings as this work progresses.
For more information about the AILuminate Benchmark Suite, visit mlcommons.org/ailuminate.