Datasets

Evaluating AI systems depends on rigorous, standardized test datasets. MLCommons builds open, large-scale, and diverse datasets and a rich ecosystem of techniques and tools for AI data, helping the broader community deliver more accurate and safer AI systems.

Cognata

The MLCommons Cognata dataset is a set of photorealistic synthetic automotive data frames covering urban and highway scenarios across several cities, weather conditions, and times of day. It consists of data licensed for use by MLCommons Members.

Dollar Street

The MLCommons Dollar Street dataset is a collection of images of everyday household items from homes around the world that visually captures the socioeconomic diversity of traditionally underrepresented populations.

Multilingual Spoken Words

The MLCommons Multilingual Spoken Words corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search. It is licensed under CC-BY 4.0.

People’s Speech

The MLCommons People’s Speech dataset is among the world’s largest English speech recognition corpora licensed for academic and commercial use under CC-BY-SA and CC-BY 4.0.

Unsupervised People’s Speech

The Unsupervised People’s Speech dataset is a compilation of audio files extracted from Archive.org and licensed for academic and commercial use under CC-BY and CC-BY-SA licenses. It includes more than one million hours of audio from a diverse set of speakers.