Today, MLCommons® is announcing the release of Croissant, a metadata format to help standardize machine learning datasets. The aim of Croissant is to make datasets easily discoverable and usable across tools and platforms. Today’s release includes the format documentation, an open source library, and a visual editor, with industry support from Hugging Face, Google Dataset Search, Kaggle, and OpenML, among others.
Data is at the core of every AI and ML model. However, there is currently no standardized method of organizing and arranging the data and files that make up each dataset. As a result, finding, understanding, and using ML datasets can be tedious and time-consuming.
One of the goals of Croissant is to make data more easily accessible and discoverable. The Croissant vocabulary is an extension to schema.org, a machine-readable standard for describing structured data that is used by over 40M datasets on the Web. Building on schema.org allows Croissant datasets to be discovered through dataset search engines such as Google Dataset Search.
Croissant is easy to adopt because it doesn’t require changing the data itself or how it is represented. Instead, it adds a layer of metadata that represents the contents of the dataset in a standardized way, describing key attributes and properties.
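To make the idea of a metadata layer concrete, here is a minimal sketch of what a Croissant-style description might look like, written as a Python dictionary that serializes to JSON-LD. The dataset name, URL, and column names are hypothetical, and the structure is simplified for illustration; it is not guaranteed to validate against the full specification, which remains the authority on required properties.

```python
import json

# Hypothetical names used throughout: "toy_flowers", "flowers.csv",
# "measurements", and the example.org URL are illustrative only.

def make_field(name, data_type):
    """Describe one column of the CSV file as a Croissant-style field."""
    return {
        "@type": "cr:Field",
        "@id": f"measurements/{name}",
        "name": name,
        "dataType": data_type,
        # Point at the file and column the values come from; the data
        # file itself is left unchanged.
        "source": {
            "fileObject": {"@id": "flowers.csv"},
            "extract": {"column": name},
        },
    }

metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "sc:Dataset",
    "name": "toy_flowers",
    "description": "A hypothetical tabular dataset of flower measurements.",
    "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
    # Resources: the files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "flowers.csv",
            "contentUrl": "https://example.org/data/flowers.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Structure: how records and their fields map onto those files.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "measurements",
            "field": [
                make_field("petal_length", "sc:Float"),
                make_field("species", "sc:Text"),
            ],
        }
    ],
}

def field_names(meta):
    """List the field names declared across all record sets."""
    return [f["name"] for rs in meta["recordSet"] for f in rs["field"]]

if __name__ == "__main__":
    print(json.dumps(metadata, indent=2))
    print(field_names(metadata))
```

In this sketch the CSV file is only referenced, never rewritten: a tool that understands the description can learn which file to fetch and which columns to expect from the metadata alone.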
Croissant enables datasets to be loaded into different ML platforms without the need for reformatting. Popular ML frameworks like TensorFlow, JAX, and PyTorch can already load Croissant datasets via the TensorFlow Datasets library. Additionally, because Croissant provides operationalized documentation, users can easily understand the best practices for contributing to and utilizing the data.
Users looking to publish a dataset in the Croissant format benefit from the Croissant editor, which allows them to easily inspect, create, or modify Croissant descriptions for their dataset.
“Data is a critical element of any model’s performance, and some experts suggest it will run out, making the need to harness it even more important,” said Elena Simperl, professor of Computer Science at King’s College London and a Croissant working group co-chair. “Croissant allows more people to do more with data. As co-chair of the working group, it is a privilege to collaborate with world-class machine learning scientists and engineers around the globe making an enormous contribution to the AI data ecosystem.”
In addition to the release of the Croissant 1.0 specification and open source library, we are announcing support from major ML repositories, including Kaggle, Hugging Face, and OpenML. You can also find public datasets using the Croissant format through Google Dataset Search.
“The development of Croissant was grounded in the needs of ML practitioners, and the technical requirements of ML tools, platforms, and datasets. Our goal with Croissant is to unlock real value for users by enabling the tools they use to work seamlessly together, while keeping the format as simple and intuitive as possible,” said Omar Benjelloun, software engineer at Google and Croissant working group co-chair.
We encourage dataset creators to provide Croissant descriptions, and dataset hosts to provide Croissant dataset files for download, as well as embed Croissant metadata into their corresponding dataset pages. Tool creators should include Croissant dataset support in data analysis and labeling tools to make datasets easier to find and work with.
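For dataset hosts, one common way to embed schema.org-based metadata in a web page is a JSON-LD script element. The sketch below assumes that approach; the helper name to_jsonld_script and the sample metadata are hypothetical, and a real page would embed the dataset's complete Croissant description.

```python
import json

def to_jsonld_script(metadata: dict) -> str:
    """Wrap a metadata dictionary in an HTML JSON-LD script element."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so text inside the metadata cannot close the script
    # element early; "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'

if __name__ == "__main__":
    # Hypothetical, heavily abbreviated description for illustration.
    sample = {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": "toy_flowers",
        "url": "https://example.org/datasets/toy_flowers",
    }
    print(to_jsonld_script(sample))
```

The resulting snippet can be placed in the dataset page's HTML, where crawlers that understand JSON-LD can read it without any change to the visible page.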
Croissant is made possible thanks to efforts by the MLCommons Croissant working group, which includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Google, Harvard, Hugging Face, Kaggle, King’s College London – Open Data Institute, Meta, NASA, NASA IMPACT – UAH, North Carolina State University, Open University of Catalonia – Luxembourg Institute of Science and Technology, Sage Bionetworks, and TU Eindhoven.
We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to start implementing the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.