MLCommons Datasets

Unsupervised People’s Speech

The MLCommons Unsupervised People’s Speech dataset is among the world’s largest multilingual speech corpora, containing over 1 million hours of audio spanning dozens of languages, licensed for academic and commercial usage under CC-BY and CC-BY-SA.

About the dataset

The MLCommons Unsupervised People’s Speech dataset includes over 821,000 hours of speech in at least 89 languages from a diverse set of speakers. This open dataset is large enough to train self-supervised speech systems and is available under a permissive license. Building on the Supervised People’s Speech dataset, which influenced models like Whisper, the Unsupervised People’s Speech dataset aims to spur innovation in multilingual speech research and in products available to users across the globe, with particular benefits for low-resource languages.

Dataset Details

  • Date: 2025-01-23
  • Hours: over 1 million of audio, over 820,000 of speech
  • Languages: at least 89
  • Audio format: MP3, FLAC

Download
