Data-centric ML
Accelerate research innovation and increase scientific rigor in machine learning by defining, developing, and operating benchmarks for datasets and data-centric algorithms, facilitated by a flexible ML benchmarking platform.
Purpose
The Data-centric ML Research (DMLR) working group challenges existing ML benchmarking dogma by driving novel approaches to benchmark curation such as dynamic adversarial data collection. Benchmarks for machine learning solutions based on static datasets have well-known issues: they saturate quickly, are susceptible to overfitting, contain exploitable annotator artifacts and have unclear or imperfect evaluation metrics. This new paradigm of data-centric benchmarking is powered by the Dynabench platform. The key scientific question we investigate in this working group is: is it possible to make faster progress if data is collected dynamically, with humans and models in the loop, rather than in the old-fashioned static way? This further enables an ecosystem of other ML benchmarks in areas such as ML datasets and algorithms for working with datasets. We leverage these benchmarks through challenges and leaderboards.
Deliverables
DataPerf
- Organize three workshops at major ML conferences per year
- Recruit more participants to the working group
- Measure and improve quality of Common Crawl (based on human annotation)
- Domains for impact challenges:
  - LLMs + Common Crawl
  - Safety
  - Science
- Research challenges beyond the chosen domains
Dynabench
- Support existing tasks and add three new tasks a year
- A major task, such as Safety
- Support academic research using Dynabench
- Product improvements for LLM experiments
Shared objectives:
- Come up with a sustainable funding model for ongoing research
- Develop an approach to product management of Dynabench
- Define standard processes and decision-making criteria around accepting new challenges
- Establish standard processes for promoting new challenges and driving engagement
Meeting Schedule
2nd and 4th Thursday of every month, 10:30-11:30 AM Pacific.
DMLR Working Group Projects
How to Join and Access DMLR Working Group Resources
To sign up for the group mailing list, receive the meeting invite, and access shared documents and meeting minutes:
- Fill out our subscription form and indicate that you’d like to join the DMLR Working Group.
- Associate a Google account with your organizational email address.
- Once your request to join the DMLR Working Group is approved, you’ll be able to access the DMLR folder in the Public Google Drive.
To engage in working group discussions, join the group’s channels on the MLCommons Discord server.
To access the GitHub repositories (public):
- If you want to contribute code, please submit your GitHub ID to our subscription form.
- Visit the GitHub repositories:
DMLR Working Group Chairs
To contact all DataPerf working group chairs, email [email protected].