MLCommons Datasets

Unsupervised People’s Speech

The MLCommons Unsupervised People’s Speech dataset is among the world’s largest multilingual speech corpora. It contains over 1 million hours of audio, more than 820,000 hours of which are speech, in at least 89 languages, and it is licensed for academic and commercial use under CC-BY and CC-BY-SA.

About the dataset

The MLCommons Unsupervised People’s Speech dataset includes over 821,000 hours of speech in at least 89 languages from a diverse set of speakers. This open dataset is large enough to train self-supervised speech systems and is released under a permissive license. Building on the impact that the Supervised People’s Speech dataset has had on models like Whisper, the Unsupervised People’s Speech dataset aims to drive innovation in multilingual speech research and in products available to users across the globe, with particular benefits for low-resource languages.

Connect with other Unsupervised People’s Speech users through the MLCommons Discord server
Join us in our Google Group

Dataset Details

Date: 2025-01-23
Hours: 1M+ of audio, 820k+ of speech
Languages: at least 89
Audio format: MP3, FLAC
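
Because the audio ships in two formats, it can help to take a quick inventory of a local copy before building a training pipeline. The following is a minimal sketch, not part of any MLCommons tooling: the function name `count_audio_files` and the idea that the files sit somewhere under a single local directory are assumptions made for illustration, and the actual archive layout may differ.

```python
from collections import Counter
from pathlib import Path

# The two audio formats listed in the dataset details.
AUDIO_SUFFIXES = {".mp3", ".flac"}


def count_audio_files(root):
    """Count files under `root` by audio format.

    `root` is a hypothetical local directory holding downloaded audio.
    Suffix matching is case-insensitive, so `.FLAC` counts as `.flac`.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in AUDIO_SUFFIXES:
            counts[suffix] += 1
    return counts
```

Calling `count_audio_files("path/to/local/copy")` returns a `Counter` keyed by `".mp3"` and `".flac"`; files in any other format are ignored.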

Download

Data: Hugging Face (HF)