Introducing the Unsupervised People’s Speech dataset
The MLCommons Datasets working group is pleased to announce the release of the Unsupervised People’s Speech dataset. Built in collaboration with HuggingFace, the Unsupervised People’s Speech dataset contains over 1 million hours of audio spanning dozens of languages. Given the impact the previously released Supervised People’s Speech dataset had on models such as Whisper, we expect this new version to drive innovation across different tasks, including self-supervised implementations as well as improvements to automatic speech recognition pipelines in numerous languages.
The Rationale Behind the Dataset
As the field of speech technology continues to advance, the need for large, diverse audio datasets grows increasingly crucial. The Unsupervised People’s Speech dataset aims to address this need by providing a vast collection of multilingual audio data to support research and development across speech technology. Supporting broader Natural Language Processing (NLP) research for languages other than English helps bring communication technologies to more people globally, including speakers of low-resource languages.
Ensuring Useful Data
To ensure the dataset is as useful as possible, the MLCommons Datasets working group ran several data pipelines to characterize its contents across dimensions such as language distribution and speech detection.
- Speech detection: the working group created a custom data loader that resamples the audio and converts it to a single channel (available on the HuggingFace dataset card). We then ran Silero’s Voice Activity Detection pipeline, a model that uses a multi-head attention mechanism with short-time Fourier transform (STFT) features; a minimal sketch of this step appears after this list. This pass identified a total of 821,412+ hours of speech.
- Language identification: language identification is often the first task in a series of NLP tasks, so it’s crucial to get it right. The working group used Nvidia’s TensorRT-LLM implementation of Whisper Large v3 to run inference on the subset of the dataset in which our speech detection pipeline had found speech utterances (see the second sketch below). This pipeline detected a total of 89 languages. We are confident more are present: the third most common label the model produced was a “no speech” tag, meaning the model could not determine the language of those segments.
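For readers who want to reproduce the speech detection step on their own audio, the sketch below follows Silero VAD’s documented torch.hub usage. The file name and sampling rate are placeholders; this is a minimal illustration rather than the working group’s exact pipeline.

```python
import torch

# Load the Silero VAD model and its helper utilities from torch.hub.
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

SAMPLING_RATE = 16000  # Silero VAD operates on 8 kHz or 16 kHz audio.

# "example.wav" is a placeholder. read_audio resamples to the requested rate
# and returns a mono tensor, mirroring the dataset's single-channel loader.
wav = read_audio("example.wav", sampling_rate=SAMPLING_RATE)

# Returns a list of {"start": ..., "end": ...} sample offsets for speech regions.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)

# Summing detected speech per file across the corpus is how an aggregate
# figure such as 821,412+ hours can be tallied.
speech_seconds = sum(t["end"] - t["start"] for t in speech_timestamps) / SAMPLING_RATE
print(f"{speech_seconds / 3600:.4f} hours of speech in this file")
```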
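The language identification step can be approximated with the reference openai-whisper package, which exposes the same language-detection head as the TensorRT-LLM build the working group used for throughput. The file name below is a placeholder, and this is a sketch, not the production pipeline.

```python
import whisper

# Reference implementation standing in for the TensorRT-LLM deployment.
model = whisper.load_model("large-v3")

# "clip.wav" is a placeholder for a segment the VAD pass flagged as speech.
audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect_language returns per-language probabilities; the argmax is the
# predicted language code (e.g. "en").
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```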
Technical Hurdles
While the focus of the Unsupervised People’s Speech dataset is on its potential applications, it’s worth noting some of the technical challenges the working group overcame:
- Data Upload and Storage: we developed a custom script utilizing a Git LFS backend to efficiently upload the 48+ TB dataset to S3 and HuggingFace, overcoming the speed limitations of standard upload tools; a sketch of a comparable upload flow follows this list.
- Self-Supervised Learning Potential: we’ve included a training pipeline to train a Wav2Vec model, which could unlock new possibilities in unsupervised speech representation learning (a minimal pretraining sketch appears below).
- Deduplication: we are in the process of creating embedding representations of the entire dataset using Meta’s Encodec so that we can identify and remove duplicate content, ensuring the dataset’s uniqueness and quality (see the final sketch below).
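As a rough illustration of the upload problem, the sketch below combines boto3 multipart uploads to S3 with huggingface_hub’s upload_large_folder (available in recent releases). The bucket, repo id, and paths are hypothetical, and the working group’s actual Git LFS-based script differs.

```python
import boto3
from boto3.s3.transfer import TransferConfig
from huggingface_hub import HfApi

# Multipart upload: 64 MiB parts, 16 in flight; tune for the available link.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=16,
)
s3 = boto3.client("s3")
# Bucket, key, and shard names are placeholders.
s3.upload_file(
    "shards/shard-00000.tar", "example-bucket",
    "unsupervised-peoples-speech/shard-00000.tar", Config=config,
)

# upload_large_folder splits a huge folder into resumable chunked commits,
# which is what makes multi-terabyte pushes to the Hub practical.
HfApi().upload_large_folder(
    repo_id="example-org/unsupervised-peoples-speech",  # hypothetical repo id
    repo_type="dataset",
    folder_path="shards/",
)
```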
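The released pipeline is the authoritative reference for self-supervised training; as a self-contained illustration of what one contrastive pretraining step involves, here is a sketch using Hugging Face transformers’ Wav2Vec2ForPreTraining, following the library’s documented example, with a random stand-in waveform.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

# One second of random noise stands in for a real 16 kHz waveform.
waveform = torch.randn(16000).numpy()
input_values = feature_extractor(
    waveform, sampling_rate=16000, return_tensors="pt"
).input_values

batch_size, raw_sequence_length = input_values.shape
sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length).item()

# Mask spans of the latent features and sample distractors for the
# contrastive objective, as in the library's documented example.
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, sequence_length), mask_prob=0.2, mask_length=2
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, sequence_length),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)
mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.long)
sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

# The returned loss combines the contrastive and codebook-diversity terms;
# a real pipeline would backpropagate it over many batches.
loss = model(
    input_values,
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=sampled_negative_indices,
).loss
print(loss.item())
```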
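Finally, as a sketch of how Encodec representations can drive deduplication, the snippet below compresses audio to discrete codes with the transformers port of Encodec and hashes them: identical audio yields identical codes, so matching hashes flag exact duplicates. The hashing scheme is our illustrative assumption, not the working group’s method; near-duplicate detection would instead compare continuous embeddings with approximate nearest-neighbor search.

```python
import hashlib

import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

def encodec_signature(raw_audio: np.ndarray) -> str:
    """Compress audio to EnCodec codes and hash them into a dedup signature."""
    inputs = processor(
        raw_audio=raw_audio, sampling_rate=processor.sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
    # audio_codes holds discrete codebook indices; identical audio produces
    # identical codes, so a content hash exposes exact duplicates.
    return hashlib.sha1(encoded.audio_codes.numpy().tobytes()).hexdigest()

clip = np.random.randn(24000).astype(np.float32)  # 1 s of placeholder audio
assert encodec_signature(clip) == encodec_signature(clip.copy())
```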
Future Work
The Unsupervised People’s Speech dataset is built from audio data on Archive.org that is either public domain or available under CC-BY or CC-BY-SA licenses. As part of our commitment to updating and maintaining the dataset, we are also releasing the software as open source to empower the community to build on our contributions.
As the working group completes the deduplication and self-supervised efforts, we anticipate several avenues for the research community to build on, especially in improving speech models for low-resource languages, enhancing speech recognition across different accents and dialects, and developing novel applications in speech synthesis.
Join Us
There are many ways to get involved in the MLCommons data effort. We are currently seeking speakers of over 130 languages to create a benchmark for a text language identification task. If this sounds interesting, head to Dynabench’s text classification task.
If you want to be part of the Datasets working group or any other working group at MLCommons, please visit the Get Involved page. We meet weekly to share ideas and implement projects. To read more about the MLCommons Unsupervised People’s Speech dataset, visit https://mlcommons.org/datasets/unsupervised-peoples-speech/