Croissant
Standardize how ML datasets are described to make them easily discoverable and usable across tools and platforms.
Purpose
Data is paramount in machine learning (ML). However, finding, understanding and using ML datasets is still unnecessarily tedious. One reason is the lack of a consistent way to describe ML datasets to facilitate reuse. Croissant aims to fill that gap.
Croissant is an open, community-built, standardized metadata vocabulary for ML datasets, including key attributes and properties of datasets, as well as information required to load these datasets in ML tools. Croissant enables data interoperability between ML frameworks and beyond, which makes ML work easier to reproduce and replicate.
By building the vocabulary as an extension to schema.org, a machine-readable standard to describe structured data, Croissant also makes ML datasets discoverable beyond the scope of the repository where they have been published. Finally, Croissant operationalises dataset documentation, complementing and extending existing approaches such as data cards to respond to data-centric Responsible AI (RAI) concerns.
Deliverables
- A shared standard vocabulary to describe ML datasets.
- A representative set of real-world ML datasets described in this format.
- An open-source Python library capable of validating Croissant datasets, consuming their records, constructing Croissant datasets programmatically, and serializing them.
- An open-source visual editor that supports the creation, modification and loading of Croissant dataset descriptions.
- Extensions for RAI, geospatial datasets, life sciences, and digital humanities datasets.
- Baseline implementations of these extensions in the Python library and the visual editor listed above.
Meeting Schedule
Weekly on Wednesdays from 9:05am to 10:00am Pacific.
Croissant is for:
Creators and maintainers of ML datasets
Data work is tedious and often under-appreciated. Croissant makes datasets more widely available, across repositories and ML frameworks. Croissant is designed to be modular and extensible – new vocabulary extensions are encouraged to address the distinct characteristics of datasets of certain modalities (e.g. audio, video) or in certain sectors (e.g. life sciences, geospatial).
ML researchers and practitioners
Users of Croissant-enabled datasets have access to dataset documentation to understand how to make the most of the data and contribute to it. They can find the data they need no matter where it was published online. They can load the data into different ML platforms without any overhead to transform the data from one format to another.
RAI researchers and practitioners
Croissant offers a machine-readable summary of important attributes captured in a variety of data cards and similar approaches, which is portable and discoverable no matter where the dataset and its data card live, hence promoting better documentation practices.
Policy makers
As AI regulation emerges across the world, Croissant provides a standardized way to collect core information about datasets, hence facilitating the development of data-centric AI audit and assurance tools such as transparency indexes.
Getting started with Croissant is easy. You can:
- Find public datasets in the Croissant format on Google Dataset Search.
- Download Croissant dataset descriptions from repositories such as Hugging Face, Kaggle and OpenML.
- Inspect, create or modify Croissant descriptions using the Croissant editor. You can load your data into the editor and it will derive the metadata for you to fine-tune. You can find the editor on GitHub or try a hosted version here.
- Validate and consume Croissant datasets in Python using the open-source ML Croissant library, available on GitHub.
- Load a dataset into TensorFlow, JAX or PyTorch using custom-built loaders.
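To give a rough sense of what a Croissant description contains, here is a minimal sketch that builds one as a Python dictionary and serializes it to JSON-LD using only the standard library. It is an illustration, not the official tooling: the dataset name, URLs and checksum are placeholders, the `@context` is heavily abridged compared with the one in the specification, and the helper functions `build_description`, `field_names` and `unresolved_sources` are hypothetical. The overall shape (a schema.org `Dataset` with `distribution` entries and `recordSet` entries whose fields point back to files) follows our reading of the Croissant 1.0 format.

```python
import json

# Abridged JSON-LD context (assumption: the official context defines many more terms).
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
    "conformsTo": "dct:conformsTo",
    "recordSet": "cr:recordSet",
    "field": "cr:field",
    "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
    "source": "cr:source",
    "fileObject": "cr:fileObject",
    "extract": "cr:extract",
    "column": "cr:column",
}


def _csv_field(record_set: str, name: str, data_type: str, file_id: str) -> dict:
    """Describe one field that is read from a column of a CSV file."""
    return {
        "@type": "cr:Field",
        "@id": f"{record_set}/{name}",
        "name": name,
        "dataType": data_type,
        "source": {"fileObject": {"@id": file_id}, "extract": {"column": name}},
    }


def build_description() -> dict:
    """Build a toy dataset description; every value here is a placeholder."""
    return {
        "@context": CONTEXT,
        "@type": "sc:Dataset",
        "name": "toy_reviews",
        "description": "A hypothetical two-column review dataset.",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "url": "https://example.org/toy_reviews",
        # distribution: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "name": "reviews.csv",
                "contentUrl": "https://example.org/toy_reviews/reviews.csv",
                "encodingFormat": "text/csv",
                "sha256": "0" * 64,  # placeholder checksum
            }
        ],
        # recordSet: how records and their fields map onto those files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "name": "reviews",
                "field": [
                    _csv_field("reviews", "text", "sc:Text", "reviews.csv"),
                    _csv_field("reviews", "label", "sc:Integer", "reviews.csv"),
                ],
            }
        ],
    }


def field_names(description: dict) -> list:
    """List the field names across all record sets."""
    return [
        field["name"]
        for record_set in description.get("recordSet", [])
        for field in record_set.get("field", [])
    ]


def unresolved_sources(description: dict) -> list:
    """Return names of fields whose source file is not declared in distribution."""
    known = {item["@id"] for item in description.get("distribution", [])}
    missing = []
    for record_set in description.get("recordSet", []):
        for field in record_set.get("field", []):
            file_id = field.get("source", {}).get("fileObject", {}).get("@id")
            if file_id not in known:
                missing.append(field["name"])
    return missing


if __name__ == "__main__":
    print(json.dumps(build_description(), indent=2))
```

The consistency check above is only a toy. For real validation and for reading records, the ML Croissant library mentioned in the list is the appropriate tool; consult its documentation for the current API.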
Croissant Specifications
How to Join and Access Croissant Resources
To sign up for the group mailing list, receive the meeting invite, and access shared documents and meeting minutes:
- Fill out our subscription form and indicate that you’d like to join the Croissant Working Group.
- Associate a Google account with your organizational email address.
- Once you’ve joined the Croissant Working Group, you’ll be able to access the Croissant folder in the Public Google Drive.
If you need help with Croissant or have technical questions, feel free to email [email protected]. You can also join the forum here.
To access the GitHub repositories (public):
- If you want to contribute code, please submit your GitHub ID to our subscription form.
- Visit the GitHub repository.
Croissant Working Group Chairs
To contact all Croissant Working Group chairs, email [email protected].