The MLCommons® DataPerf working group is excited to announce the winners of the inaugural DataPerf 2023 challenges, which centered around pushing the boundaries of data-centric AI and addressing the crucial ‘data bottleneck’ faced by the machine learning (ML) community.

In the challenges, participants were asked to approach ML from a novel perspective: by evolving training and test data alongside models to overcome data bottlenecks. The results were inspiring. Without further ado, here are the winners of the DataPerf 2023 challenges:

Quality of Training Data

The first two challenges targeted the quality of training data in the Vision and Speech domains. The winners successfully designed selection strategies to determine the best training sets from large candidate pools of weakly labeled images and automatically extracted spoken word clips.

Quality of Training Data: Vision Domain 

In the Vision domain, Paolo Climaco, a PhD candidate at the Institute for Numerical Simulation, University of Bonn, triumphed by presenting a robust and efficient method for training data selection. He proposed an approach based on farthest point sampling, selecting negative examples for each label by iteratively searching the feature space for points that maximize the l2 distance to those already chosen. The best core set was then selected under nested cross-validation. Paolo’s best-performing model scored a mean-F1 of 81.00 across the ‘cupcake’, ‘hawk’ and ‘sushi’ classes.
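
For readers curious about the mechanics, here is a minimal sketch of farthest point sampling over a pool of precomputed image embeddings. It illustrates the general technique rather than Paolo’s exact implementation (his code is linked in the appendix), and the feature matrix and budget are assumed inputs.

```python
import numpy as np

def farthest_point_sampling(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedily pick k points that are maximally spread out in l2 distance.

    features: (n, d) array of embeddings for the candidate pool.
    Returns the indices of the selected core set.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]                 # arbitrary starting point
    # distance from every point to its nearest already-selected point
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        next_idx = int(np.argmax(min_dist))           # farthest from the current selection
        selected.append(next_idx)
        new_dist = np.linalg.norm(features - features[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)     # update nearest-selected distances
    return np.array(selected)
```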

Second place was awarded to Danilo Brajovic, Research Associate at Fraunhofer IPA, for his pseudo-label generation technique. He trained multiple neural networks and classical machine learning models on a labeled subset of the data to classify the remaining points into positive and negative image pools. The best-performing model was then used to propose core sets across multiple sampling experiments, achieving a best mean-F1 score of 78.98.
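
Only the write-up linked from the leaderboard has the full details, but the core pseudo-labeling step could look roughly like the sketch below, with generic scikit-learn classifiers standing in for the actual mix of neural networks and classical models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pseudo_label_pools(X_labeled, y_labeled, X_unlabeled):
    """Fit several candidate classifiers on the labeled subset, keep the one with
    the best cross-validated score, and use it to split the unlabeled pool into
    pseudo-positive and pseudo-negative examples."""
    candidates = [LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=200)]
    scores = [cross_val_score(m, X_labeled, y_labeled, cv=5).mean() for m in candidates]
    best = candidates[int(np.argmax(scores))]
    best.fit(X_labeled, y_labeled)
    preds = best.predict(X_unlabeled)
    return np.where(preds == 1)[0], np.where(preds == 0)[0]   # positive pool, negative pool
```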

Third place was awarded to Steve Mussmann, a Computer Science postdoc at the University of Washington, who used modified uncertainty sampling to score a mean-F1 of 78.06 on the leaderboard. He trained a binary classifier on noisy labels from OpenImages and used it to assign images to positive and negative pools, with the core set randomly sampled from both pools.
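
As a rough illustration of the pool-assignment-and-sampling step described above (not Steve’s actual code), the sketch below uses LogisticRegression as a stand-in for the binary classifier and assumes a hypothetical positive/negative split of the selection budget.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_core_set(X, noisy_labels, budget, pos_fraction=0.5, seed=0):
    """Train on noisy binary labels, split the pool by the classifier's
    predictions, then randomly draw the core set from each pool."""
    rng = np.random.default_rng(seed)
    clf = LogisticRegression(max_iter=1000).fit(X, noisy_labels)
    preds = clf.predict(X)
    pos_pool, neg_pool = np.where(preds == 1)[0], np.where(preds == 0)[0]
    n_pos = min(int(budget * pos_fraction), len(pos_pool))
    n_neg = min(budget - n_pos, len(neg_pool))
    return np.concatenate([
        rng.choice(pos_pool, size=n_pos, replace=False),
        rng.choice(neg_pool, size=n_neg, replace=False),
    ])
```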

We would also like to highlight Margaret Warren from the Institute for Human and Machine Cognition for her submission, which used innovative techniques of human-centered, axiomatic data selection.

Quality of Training Data: Speech Domain

The Speech domain was won by the Cleanlab team, led by Chief Scientist Jonas Mueller, who showcased a data-driven approach that uses out-of-sample predicted probabilities to select training sets of spoken words in multiple languages. The Cleanlab solution achieved a macro F1 of 46.76 across English, Portuguese, and Indonesian, versus a cross-validation baseline score of 42.23.
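
The general recipe of out-of-sample predicted probabilities can be sketched with scikit-learn’s cross_val_predict: score each clip by the cross-validated probability assigned to its own label and keep the most confidently labeled examples. This is an illustrative simplification, not the exact Cleanlab pipeline, and the classifier and budget are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def select_by_self_confidence(X, labels, budget):
    """Rank candidates by the out-of-sample probability of their own label
    (self-confidence) and keep the most confidently labeled examples.
    labels must be integer class ids 0..K-1."""
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
    )
    self_conf = pred_probs[np.arange(len(labels)), labels]
    return np.argsort(-self_conf)[:budget]
```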

Training Data Cleaning

The third challenge centered on data cleaning within the Vision domain. Participants were tasked with devising data cleaning strategies to select samples from noisy training sets that would benefit from re-labeling. Team Akridata, led by Sudhir Kumar Suman, took first place with an ingenious, adaptive cleaning algorithm. Impressively, by addressing only 11.58% of the samples, Akridata achieved 95% of the accuracy of a model trained on a fully clean dataset. The team’s approach began with a thorough analysis of the dataset to identify any significant class imbalance, followed by a careful identification of misclassified samples. They also studied the inherent limitations of the model, which helped them make corrections more efficiently, and adopted a boosting-like iterative methodology that progressively honed the model’s performance with each cycle. Notably, they used a weighted scheme to fold Shapley values into the scoring system so that the importance of each data sample is better represented. Their effort echoed that of other participants (namely Shaopeng Wei from ETH Zurich & SWUFE China) but differed from the more generic calculation of importance valuation.
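
Akridata’s exact algorithm is not open source (see the appendix), but the idea of blending a mislabel-suspicion signal with a Shapley-style importance estimate into one weighted priority score can be sketched as follows. The weighting, the probability source, and the Shapley estimates are all placeholders supplied by the caller.

```python
import numpy as np

def cleaning_priority(pred_probs, noisy_labels, shapley_values, alpha=0.5, budget=100):
    """Score each training example by how suspicious its label looks
    (low predicted probability for its own label), blended with an
    importance estimate (e.g. an approximate data Shapley value),
    and return the indices most worth sending for re-labeling.
    noisy_labels must be integer class ids 0..K-1."""
    self_conf = pred_probs[np.arange(len(noisy_labels)), noisy_labels]
    suspicion = 1.0 - self_conf                                   # high = likely mislabeled
    importance = (shapley_values - shapley_values.min()) / (np.ptp(shapley_values) + 1e-12)
    score = alpha * suspicion + (1 - alpha) * importance          # weighted combination
    return np.argsort(-score)[:budget]
```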

Training Dataset Acquisition

Data marketplaces are increasingly crucial in a world driven by data. The dataset acquisition challenge focused on a critical problem in making data marketplaces work: participants had to devise a data acquisition strategy to select the best training set from multiple data sellers based on limited exchanged information. The submissions of Hanrui Lyu and Yifan Sun, PhD students at Columbia University advised by Prof. Yongchan Kwon, and of Feiyang Kang, a PhD student at Virginia Tech advised by Prof. Ruoxi Jia, were nearly tied, and both achieved high accuracy. Team Columbia’s approach, which reached an accuracy of 76.45% (7.92% better than the baseline), used a brute-force search to find a single seller to which the entire budget is allocated. This approach leveraged multiple submissions to Dynabench, which was not forbidden by our rules, although it is not practical in a real marketplace. Feiyang’s approach, a customized distribution-matching method with dimension reduction, achieved an accuracy of 76.17%. Because the submissions were so close, both were declared co-winners.
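
As an illustration of the distribution-matching idea (not Feiyang’s actual method), one could reduce the buyer’s reference features with PCA and rank sellers by how closely their shared summary statistics sit to the buyer’s distribution. The summary format below is hypothetical, standing in for whatever limited information the marketplace exposes.

```python
import numpy as np
from sklearn.decomposition import PCA

def rank_sellers(buyer_features, seller_summaries, n_components=32):
    """Project the buyer's reference set to a low-dimensional space and rank
    sellers by how closely their shared summaries match it.
    seller_summaries: {seller_id: (mean_vector, per_dim_variance)} in the
    original feature space."""
    pca = PCA(n_components=n_components).fit(buyer_features)
    buyer_mean = pca.transform(buyer_features).mean(axis=0)
    distances = {}
    for seller, (mean_vec, _var) in seller_summaries.items():
        seller_mean = pca.transform(mean_vec.reshape(1, -1))[0]
        distances[seller] = float(np.linalg.norm(buyer_mean - seller_mean))
    return sorted(distances, key=distances.get)       # closest match first
```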

DataPerf and the Future

We are immensely proud of the work done by all the participants in these challenges. These challenges represent a pivotal moment in data-centric AI research, highlighting the crucial role data plays in ML’s evolution and how we can continue to improve it.

Congratulations to all the winners and thank you to all participants. Your efforts and solutions have provided a significant leap forward in our understanding and tackling of data bottlenecks.

If you’re interested in getting involved in more challenges, the MLCommons DataPerf working group is currently accepting submissions to the Adversarial Nibbler Challenge, in which users interact with an image generation model to identify “safe” prompts that lead to “unsafe” image generations. Adversarial Nibbler has a dedicated challenge track at the ART of Safety Workshop (AACL), and participants are encouraged to write up their findings as red-teaming papers for the workshop. In addition, Adversarial Nibbler is intended as a continuous challenge with multiple rounds, so follow the Nibbler Twitter account to stay tuned for future rounds of the challenge.

For those inspired by the work done here, we will announce the DataPerf 2024 challenges shortly. Please visit the DataPerf website to learn more about the challenges, join upcoming challenges, or contribute to the design of new data-centric tasks. Join the MLCommons DataPerf working group to stay abreast of exciting developments in the data-centric ML community. We presented the DataPerf results at NeurIPS 2023 in New Orleans in December 2023.

Appendix: DataPerf Resources

  • Challenge Creators
  • Challenge Submissions
    • Vision Selection
      • GitHub repository of the winning submission. It includes code to run the algorithm, slides presenting the developed method and a challenge report explaining the proposed approach.
    • Speech Selection
    • Adversarial Nibbler
      • Releases planned by the end of the year for (i) the dataset and (ii) a paper describing the dataset and the challenge results
    • Vision Cleaning
      • Akridata won the competition, but its solution is not open source.
      • Leaderboard: https://dynabench.org/tasks/vision-debugging

Acknowledgements

The DataPerf benchmarks were created over the last year by engineers and scientists from Coactive AI, Eidgenössische Technische Hochschule (ETH) Zurich, Google, Harvard University, Meta, MLCommons, and Stanford University. In addition, this would not have been possible without the support of DataPerf working group members from Carnegie Mellon University, Mod Op, Factored, Hugging Face, the Institute for Human and Machine Cognition, Landing.ai, the San Diego Supercomputer Center, Thomson Reuters Lab, and TU Eindhoven.