MLCommons Datasets
People’s Speech
The MLCommons People’s Speech dataset is among the world’s largest English speech recognition corpora licensed for both academic and commercial use, under CC-BY-SA and CC-BY 4.0.
About the dataset
The MLCommons People’s Speech dataset includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available under a permissive license. Just as ImageNet catalyzed machine learning for vision, People’s Speech will unleash innovation in speech research and in products available to users across the globe.
- Read our full paper here.
- Connect with other People’s Speech users on the MLCommons Discord server.
- Download the credits here.
- Join us in our Google Group.
Dataset Details
- Date: 2022-11-17
- Hours: 30,000+
- Examples: 23.7 million
- Audio Format: FLAC
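
As a rough sanity check on these figures, the total hours and example count imply an average example length of a few seconds. The sketch below only divides the two numbers listed above; the function name is hypothetical, and because the hour count is a floor (30,000+), the result is a lower-bound estimate rather than an official statistic.

```python
# Figures taken from the dataset details above.
# HOURS is a floor ("30,000+"), so the average is a lower-bound estimate.
HOURS = 30_000
EXAMPLES = 23_700_000


def average_clip_seconds(total_hours: float, num_examples: int) -> float:
    """Return the mean example length in seconds (hypothetical helper)."""
    return total_hours * 3600 / num_examples


avg = average_clip_seconds(HOURS, EXAMPLES)
print(f"Average example length: about {avg:.1f} seconds")
```

An average this short suggests the corpus is segmented into utterance-sized clips, which is worth keeping in mind when budgeting storage and batching for training.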