By: Leshem Choshen, Rafael Mosquera, Aaron Mueller, Adina Williams

The MLCommons® Dynabench working group is pleased to announce a new collaboration with the BabyLM Challenge. In line with Dynabench's commitment to human-centric data collection and to advancing the state of the art in model benchmarking, the challenge brings together a global research community, incentivizing researchers interested in pretraining or cognitive modeling to focus their efforts on optimizing pretraining under data limitations inspired by human development.

BabyLM is the fifth community hosted on the Dynabench platform. It joins DADC, DataPerf, Flores, and other communities dedicated to pushing the limits of AI through dynamic adversarial data collection, data-centric benchmarking for machine learning (ML) and large language models, and machine translation evaluation.

Large language models have recently ushered in an ML renaissance, achieving unprecedented performance on a wide range of tasks. Their capacity to ingest huge amounts of data and then produce human-like text allows them to seemingly overcome classical challenges across the full spectrum of Natural Language Processing (NLP) tasks. For example, large models such as Chinchilla were trained on 1.4 trillion words, well over 10,000 words for every one word a 13-year-old child has heard in their entire life. Because these models require enormous resources, both in compute and in data, they are inaccessible to most researchers.

The BabyLM Challenge is specifically focused on unlocking the capabilities of language models trained on smaller training sets, thereby enabling much broader participation throughout the ML community. To paraphrase Oliver Whang’s reporting in The New York Times, the BabyLM Challenge aims to turn the bigger-is-better paradigm on its head. More specifically, BabyLM encourages researchers to train smaller language models from scratch, using only the amount of linguistic data available to a child, and upload them to the challenge page to be ranked. As Whang reports, “A successful mini-model would be nearly as capable as the high-end models but much smaller, more accessible and more compatible with humans.”

Models submitted to the challenge will be evaluated in three different tracks: strict, strict-small, and loose. The challenge restricts the amount of linguistic data that can be used for pretraining but encourages the use of data from other modalities as well. You can learn more about the motivation behind the challenge here. All three tracks of the BabyLM Challenge are hosted on Dynabench, where you can find the rules and upload requirements. We encourage all researchers interested in pretraining and cognitive modeling to participate. Submit your predictions to any track directly via the Dynabench platform by Saturday, July 22nd, 2023.