MLCommons Datasets
Evaluating AI systems depends on rigorous, standardized test datasets. MLCommons builds open, large-scale, and diverse datasets and a rich ecosystem of techniques and tools for AI data, helping the broader community deliver more accurate and safer AI systems.
MLCommons provides the following datasets:
Cognata
The MLCommons Cognata Dataset is a set of photorealistic synthetic automotive data frames covering urban and highway scenarios across several cities, weather conditions, and times of day. It consists of data licensed for use by MLCommons Members.
Dollar Street
The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures the socioeconomic diversity of traditionally underrepresented populations.
Multilingual Spoken Words
The MLCommons Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages, intended for academic research and commercial applications in keyword spotting and spoken term search. It is licensed under CC-BY 4.0.
People’s Speech
The MLCommons People’s Speech Dataset is among the world’s largest English speech recognition corpora. It is licensed for academic and commercial use under CC-BY-SA and CC-BY 4.0.