Datasets Working Group
Mission
Create new datasets to fuel innovation in machine learning.
Purpose
Datasets fuel machine learning: a model is only as good as the data it was trained on. ImageNet, created for less than half a million dollars, arguably gave rise to modern machine learning. Unfortunately, most public datasets today are small (relative to private commercial datasets), static, licensed for research use only, or some combination of these. Datasets must be large to train accurate models. To stay relevant, datasets should be continually improved as gaps in their coverage are identified. Lastly, datasets require a permissive public license to enable new businesses, products, and services globally.
The Datasets Working Group creates and hosts public datasets that are large, actively maintained, and permissively licensed – especially for commercial use. We aim to develop a center of expertise and supporting technologies that dramatically improve the quality and reduce the cost of new public datasets. We believe that a modest investment in public datasets can have impressive ROI in terms of machine learning innovation and market growth. The Datasets Working Group’s first project was the People’s Speech dataset, an open speech recognition dataset that is approximately 100x larger than existing open alternatives.
Deliverables
- The People’s Speech Dataset v0.5 (100k hours of diverse speech)
- The People’s Speech Dataset v1.0 (100k hours of speech in 1,000 languages)
Meeting Schedule
Thursday October 10, 2024
Weekly – 11:05 – 12:00 Pacific Time
Related Blog
- Dollar Street – Bringing Diversity to Computer Vision
  Harnessing the power of diverse data to reduce bias and build better machine learning for everyone
- The People’s Speech Dataset just got better – dynamic data in action
  Updating datasets improves speech for everyone
- Harnessing Human-AI Collaboration
  Dynamic Adversarial Data Collection augments large scale datasets by adding diverse and high-quality data
MLCommons Datasets
MLCommons Cognata Dataset
MLCommons Dollar Street Dataset
MLCommons Multilingual Spoken Word Dataset
MLCommons People’s Speech Dataset
How to Join and Access Datasets Working Group Resources
- To sign up for the group mailing list, receive the meeting invite, and access shared documents and meeting minutes:
  - Fill out our subscription form and indicate that you’d like to join the Datasets Working Group.
  - Associate a Google account with your organizational email address.
  - Once you’ve joined the Datasets Working Group, you’ll be able to access the Datasets folder in the Public Google Drive.
- To engage in group discussions, join the group’s channels on the MLCommons Discord server.
- To access the GitHub repositories (public):
  - If you want to contribute code, please submit your GitHub ID to our subscription form.
  - Visit the GitHub repositories:
Datasets Working Group Chairs
To contact all Datasets Working Group chairs, email [email protected].
Kurt Bollacker
Kurt is the Digital Research Director of the Long Now Foundation. He has a research history in the areas of machine learning, search engines, graph databases, digital archiving, and cardiac simulation. He built the first prototype of the Internet Archive’s Wayback Machine and created the computer science publication search engine CiteSeerX. He was one of the original creators of Freebase (now Google Knowledge Graph). He received a Ph.D. in computer engineering from The University of Texas at Austin.
Sarah Luger
Sarah has accumulated over two decades of expertise in Artificial Intelligence and Natural Language Processing, focusing on using technology to help improve human communication. Her recent work encompasses low-resource machine translation, online toxicity identification, evaluation of generative AI, increasing data annotator diversity, and AI Safety. She holds a Master’s and PhD in Informatics (Computer Science) from the University of Edinburgh, specializing in automated question answering. Sarah’s background includes significant roles at IBM Watson, particularly in NLP tasks for the Jeopardy! Challenge, as well as leadership positions in engineering and research teams at startups and Orange Silicon Valley. Actively engaged in the human computation and AI data benchmarking research communities, she emphasizes the fusion of art and science in creating AI/NLP innovations.