Introducing the People’s Speech dataset - 30,000+ hours of diverse speech data to drive ML innovation

We are thrilled to announce the public release of the People’s Speech, a 30,000-hour and growing supervised conversational English speech recognition dataset. It is the largest diverse English speech recognition corpus today that is licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. Crucially, the People’s Speech will drive innovation by democratizing machine learning – it is large enough to train fully end-to-end models and improve applications such as automated speech recognition, transcription, diarization, and noise reduction. In future releases we will expand the reach and impact by adding more complex acoustic conditions and additional languages.

Language and voice are natural and universal for nearly everyone on the planet and foundational to human interactions. Voice interfaces have so often captured our imagination through movies and science fiction. But this vision of the future is increasingly becoming real – voice assistants are common in everything from smartphones to cars, and are expected to reach the majority of the world’s population by the middle of the decade. More subtly, today’s phone calls and audio conferences are routinely augmented with transcription and audio enhancement.

But the machine learning community is only just scratching the surface of the possibilities unlocked by speech technologies. One of the biggest challenges to progress is the availability of diverse, large-scale datasets with permissive licensing; experts estimate that training an end-to-end speech recognition model requires 10,000-15,000 hours of labeled speech. Most existing public speech datasets are limited by some combination of restrictive licensing that prohibits commercial use, relatively small size, and simpler types of speech such as read audiobooks.

The People’s Speech is built from audio data on Archive.org that is either public domain or available with CC-BY or CC-BY-SA licenses. Rather than having humans transcribe everything, which is expensive and time-consuming, we collected only audio that already had a transcript. We then used an innovative software pipeline to automatically match the audio with the transcript. As part of our commitment to updating and maintaining the dataset, we are also releasing our software as open source to empower the community to build on our contributions.
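
To give a feel for what "matching the audio with the transcript" can involve, here is a minimal sketch in Python. It is not the pipeline used to build the People’s Speech. It assumes a hypothetical speech recognizer has already produced rough, timestamped words for a recording, and it uses the standard library's `difflib.SequenceMatcher` to find where those words agree with the existing transcript. The function name `align_transcript` and the input format are illustrative assumptions only.

```python
from difflib import SequenceMatcher


def align_transcript(asr_words, transcript_words):
    """Attach timestamps to transcript words that agree with recognizer output.

    asr_words: list of (word, start_sec, end_sec) tuples from a hypothetical
        recognizer (an assumption for this sketch, not the real pipeline).
    transcript_words: list of words from the pre-existing transcript.
    Returns a list of (transcript_word, start_sec, end_sec) for exact,
    case-insensitive matches; unmatched words are simply skipped.
    """
    ref_tokens = [w.lower() for w in transcript_words]
    asr_tokens = [w.lower() for w, _, _ in asr_words]

    # autojunk=False keeps frequent words (e.g. "the") eligible for matching.
    matcher = SequenceMatcher(a=ref_tokens, b=asr_tokens, autojunk=False)

    aligned = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of `size` identical tokens in both sequences.
        for k in range(block.size):
            _, start, end = asr_words[block.b + k]
            aligned.append((transcript_words[block.a + k], start, end))
    return aligned


if __name__ == "__main__":
    asr = [("the", 0.0, 0.2), ("people", 0.2, 0.6), ("speech", 0.7, 1.1)]
    transcript = ["The", "People's", "Speech"]
    for word, start, end in align_transcript(asr, transcript):
        print(f"{word}: {start:.1f}s to {end:.1f}s")
```

In this toy input the recognizer hears "people" where the transcript says "People's", so only the two agreeing words receive timestamps. A real system would need to handle such mismatches, long recordings, and acoustic alignment, none of which this sketch attempts.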

Why? Because making the dataset available to more people will help both machine learning research and commercial applications advance significantly. In short, the People’s Speech provides a solid jumping-off point for other companies and individuals to innovate and experiment.

Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech. For more information, please read our paper accepted to the 2021 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.