MLCommons Datasets
Unsupervised People’s Speech
The MLCommons Unsupervised People’s Speech dataset is among the world’s largest multilingual speech corpora, containing over 1 million hours of audio spanning dozens of languages, licensed for academic and commercial usage under CC-BY and CC-BY-SA.
About the dataset
The MLCommons Unsupervised People’s Speech dataset includes over 821,000 hours of speech in at least 89 languages from a diverse set of speakers. This open dataset is large enough to train self-supervised speech systems and is available under a permissive license. Building on the impact of the Supervised People’s Speech dataset on models like Whisper, the Unsupervised People’s Speech dataset aims to spur innovation in multilingual speech research and in products available to users across the globe, with particular benefits for low-resource languages.
- Connect with other Unsupervised People’s Speech users through the MLCommons Discord server
- Join us in our Google Group
Dataset Details
- Date: 2025-01-23
- Hours: over 1 million of audio, over 820,000 of speech
- Languages: at least 89
- Audio format: MP3, FLAC
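Because the audio comes in two formats, it can help to check what a local copy actually contains before building a training pipeline. The following is a minimal sketch, not an official MLCommons tool: the function name `count_audio_files` and the assumption that the files sit in an ordinary directory tree with `.mp3` and `.flac` extensions are illustrative and not taken from the dataset documentation.

```python
from collections import Counter
from pathlib import Path

# Extensions inferred from the "Audio format" entry above (assumption:
# files use the conventional .mp3 and .flac suffixes).
AUDIO_EXTENSIONS = {".mp3", ".flac"}


def count_audio_files(root):
    """Count files under `root` by audio extension, ignoring case."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        # Skip directories and any non-audio files (metadata, licenses, ...).
        if path.is_file() and suffix in AUDIO_EXTENSIONS:
            counts[suffix] += 1
    return dict(counts)
```

Calling `count_audio_files("path/to/local/copy")` returns a dictionary mapping each extension found to its file count, which gives a quick sanity check on a download.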