Datasets Working Group


Create new datasets to fuel innovation in machine learning.


Datasets fuel machine learning: a model is only as good as the data it was trained on. ImageNet, created for less than half a million dollars, arguably gave rise to modern machine learning. Unfortunately, most public datasets today are small (relative to private commercial datasets), static, licensed for research use only, or some combination of the three. Datasets must be large to train accurate models. To stay relevant, they should be continually improved as gaps in their coverage are identified. Lastly, datasets require a permissive public license to enable new businesses, products, and services globally.

The Datasets Working Group creates and hosts public datasets that are large, actively maintained, and permissively licensed, especially for commercial use. We aim to develop a center of expertise and supporting technologies that dramatically improve the quality and reduce the cost of new public datasets. We believe that a modest investment in public datasets can yield an impressive ROI in machine learning innovation and market growth. The Datasets Working Group’s first project is the People’s Speech dataset, an open speech recognition dataset that is approximately 100x larger than existing open alternatives. We are currently validating the utility of the data in preparation for public release.


  1. The People’s Speech Dataset v0.5 (100k hours of diverse speech)
  2. The People’s Speech Dataset v1.0 (100k hours of speech in 1,000 languages)

Meeting Schedule

Weekly on Fridays from 8:30 to 9:00 a.m. Pacific Time.

Mailing List

Working Group Resources

Google Drive (Members only)

Working Group Chair Emails

Greg Diamos

Peter Mattson

Working Group Chair Bios

Greg leads transformation engineering at Landing AI, focusing on bringing AI to every major industry. He is a founding member of MLPerf™. He led AI research at Baidu’s Silicon Valley AI Lab (SVAIL), where he helped develop the Deep Speech and Deep Voice systems using Mixed Precision Training. At NVIDIA, Greg contributed compiler and microarchitecture technologies used in the Volta GPU, including the invention of the SIMT independent thread scheduling system. Greg holds a PhD from the Georgia Institute of Technology, where he led the development of the GPU-Ocelot dynamic compiler, which targeted CPUs and GPUs from the same program representation.


Peter Mattson leads ML Metrics at Google. He co-founded and is President of MLCommons™, and co-founded and was General Chair of the MLPerf consortium that preceded it. Previously, he founded the Programming Systems and Applications Group at NVIDIA Research, was VP of software infrastructure for Stream Processors Inc (SPI), and was a managing engineer at Reservoir Labs. His research focuses on understanding machine learning models and data through quantitative metrics and analysis. Peter holds a PhD and MS from Stanford University and a BS from the University of Washington.