MLCommons Datasets

Unsupervised People’s Speech

The MLCommons Unsupervised People’s Speech dataset is among the world’s largest multilingual speech corpora, containing over 1 million hours of audio spanning dozens of languages, licensed for academic and commercial usage under CC-BY and CC-BY-SA.

About the dataset

The MLCommons Unsupervised People’s Speech dataset includes over 821,000 hours of speech across 89+ languages from a diverse set of speakers. This open dataset is large enough to train self-supervised speech systems and is available under a permissive license. Building on the impact that the Supervised People’s Speech dataset has had on models like Whisper, the Unsupervised People’s Speech dataset aims to spur innovation in multilingual speech research and in products available to users across the globe, with particular benefits for low-resource languages.

Dataset Details

  • Date: 2025-01-23
  • Hours: 1M+ of audio, 820k+ of speech
  • Languages: at least 89
  • Audio format: MP3, FLAC
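
Because the audio comes in two formats, a quick inventory of a local copy can be a useful first check. The following is a minimal sketch, not part of the dataset's tooling: the function name `count_audio_files` and the assumption that files carry `.mp3` and `.flac` extensions are illustrative only, and the actual on-disk layout of the dataset may differ.

```python
from collections import Counter
from pathlib import Path

# Assumption: files use these extensions, inferred from the "Audio format" entry.
AUDIO_EXTENSIONS = {".mp3", ".flac"}


def count_audio_files(root):
    """Count files under `root` (recursively) by audio extension, case-insensitively."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in AUDIO_EXTENSIONS:
            counts[suffix] += 1
    return dict(counts)
```

Calling `count_audio_files` on a directory returns a dictionary keyed by extension; files with any other extension are ignored.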

