The People's Speech Dataset just got better - dynamic data in action

Our big, freely available speech dataset just got its first upgrade. Find out what’s new, and why dynamic datasets are the future of machine learning.

What is The People’s Speech Dataset?

The People’s Speech Dataset was designed to improve the accuracy of Automatic Speech Recognition (ASR) Machine Learning (ML) models for recognizing English speakers. It includes over 30,000 hours of transcribed English-language speech from a diverse set of speakers with a wide range of accents.

We published the People’s Speech in November 2021 under CC-BY-SA and CC-BY 4.0 licenses, meaning it’s free for both academic and commercial use.

Why have we already updated it to v1.1?

We believe that dynamic datasets are critical to the future advancement of machine learning. MLCommons® is committed to actively maintaining and updating its datasets, so that developers can always train their models on the most relevant and robust data.

With the People’s Speech Dataset, this is particularly important because of the speed at which modern languages evolve. Fuelled by globalization and the proliferation of the internet, the real-world manifestations of language are highly dynamic. It’s therefore vital that the datasets built to represent everyday speech are dynamic too.

When datasets are not proactively maintained, they become less relevant with every passing day because they increasingly reflect historical rather than present-day trends. For developers, using static datasets also brings a major risk of reinforcing implicit algorithmic biases.

By continuously improving the quality of our datasets, we make development easier, improve the real-world performance of the ML models trained on them, and make machine learning better for everyone.

What’s new for v1.1?

In this first update, our Datasets Working Group has improved the speed, quality, and usability of the dataset, particularly for those using smaller subsets of data.

A major focus for us was making the dataset more user-friendly. We’ve created a tutorial that shows users how to train an ASR model using NVIDIA NeMo, including on Google Cloud Platform.

Behind the scenes, we’ve improved download speeds, fixed bugs, and moved the dataset hosting to Hugging Face. We also raised the bar for our “clean” data samples, which has increased the proportion of “dirty” data and improved the overall quality of the remaining “clean” data.
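To see why raising the bar for “clean” samples has both effects at once, here is a minimal sketch in Python. It assumes each sample carries a single quality score and that a sample counts as “clean” when its score meets a threshold. The scores, sample IDs, threshold values, and function names are hypothetical and are not taken from the People’s Speech pipeline, whose actual criteria are not described here.

```python
from statistics import mean

# Hypothetical (sample_id, quality_score) pairs; a higher score stands for
# a more reliable transcript. These values are illustrative only.
SAMPLES = [("a", 0.95), ("b", 0.90), ("c", 0.80), ("d", 0.70), ("e", 0.60)]


def split_by_threshold(samples, threshold):
    """Split samples into (clean, dirty) using a minimum quality score."""
    clean = [s for s in samples if s[1] >= threshold]
    dirty = [s for s in samples if s[1] < threshold]
    return clean, dirty


def summarize(samples, threshold):
    """Report the share of dirty samples and the mean score of clean ones."""
    clean, dirty = split_by_threshold(samples, threshold)
    return {
        "dirty_share": len(dirty) / len(samples),
        # mean() raises StatisticsError if no sample clears the threshold.
        "clean_mean_score": mean(score for _, score in clean),
    }


# Assumed thresholds: a looser bar before the update, a stricter one after.
before = summarize(SAMPLES, threshold=0.65)
after = summarize(SAMPLES, threshold=0.85)
```

With the stricter threshold, more samples fall into the “dirty” group, and the samples that remain “clean” have a higher average score.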

Summary

We hope these upgrades to the People’s Speech dataset inspire ML practitioners to pay more attention to the quality of their training data and to the importance of dynamic datasets.

Discover more and start using the dataset here
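For readers who want to try the Hugging Face hosting, the sketch below shows one way to peek at a few examples without downloading a whole subset. It uses the `load_dataset` function from the Hugging Face `datasets` library in streaming mode. The repository ID `MLCommons/peoples_speech` and the subset name `clean` are assumptions, so check them against the dataset card. The helper names `open_stream` and `first_examples` are made up for this example.

```python
from itertools import islice

# Assumed Hub repository ID and subset name; verify both on the dataset card.
REPO_ID = "MLCommons/peoples_speech"
SUBSET = "clean"


def first_examples(stream, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(stream, n))


def open_stream(repo_id=REPO_ID, subset=SUBSET, split="train"):
    """Open the dataset in streaming mode so examples are fetched lazily."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    return load_dataset(repo_id, subset, split=split, streaming=True)
```

Calling `first_examples(open_stream())` needs network access and should return a short list of examples. Streaming is a convenient choice for a dataset of this size because only the requested examples are fetched.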

About MLCommons

MLCommons is an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. The foundation for MLCommons began with the MLPerf benchmark in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 50+ founding partners (global technology providers, academics, and researchers), MLCommons is focused on collaborative engineering work that builds tools for the entire machine learning industry through benchmarks and metrics, public datasets, and best practices.

For additional information on MLCommons and details on becoming a Member or Affiliate of the organization, please visit MLCommons or contact [email protected].

To get involved with The People’s Speech, please join the Datasets Working Group, and follow @MLCommons on Twitter. Help us keep growing this community effort, and don’t hesitate to get in touch if you would like to be involved.

Press Contact:
David Kanter
[email protected]