Croissant Working Group
Mission
Standardize how ML datasets are described to make them easily discoverable and usable across tools and platforms.
Purpose
Data is paramount in machine learning (ML). However, finding, understanding and using ML datasets is still unnecessarily tedious. One reason is the lack of a consistent way to describe ML datasets so that they can be reused. Croissant aims to fill that gap.
Croissant is an open, community-built, standardized metadata vocabulary for ML datasets. It covers key attributes and properties of datasets, as well as the information required to load these datasets in ML tools. Croissant enables data interoperability between ML frameworks and beyond, which makes ML work easier to reproduce and replicate.
By building the vocabulary as an extension to schema.org, a machine-readable standard to describe structured data, Croissant also makes ML datasets discoverable beyond the scope of the repository where they have been published. Finally, Croissant operationalises dataset documentation, complementing and extending existing approaches such as data cards to respond to data-centric Responsible AI (RAI) concerns.
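To make the idea of a schema.org-based description concrete, here is a minimal sketch of what a Croissant-style JSON-LD document might look like when built in Python. It is a simplified illustration under our reading of the format, not a complete or validated description: the `@context` is abbreviated, and the dataset name, URLs, and field names are hypothetical. The Croissant Spec remains the authoritative reference for property names and required fields.

```python
import json

# Simplified, illustrative Croissant-style description of a made-up CSV dataset.
# The context is abbreviated and the names/URLs are hypothetical; consult the
# Croissant Spec for the authoritative context and the required properties.
description = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "recordSet": "cr:recordSet",
        "field": "cr:field",
        "source": "cr:source",
        "fileObject": "cr:fileObject",
        "extract": "cr:extract",
        "column": "cr:column",
        "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
    },
    "@type": "sc:Dataset",  # a schema.org Dataset, extended with cr: terms
    "name": "toy_reviews",
    "description": "A hypothetical table of short product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_reviews",
    # Resources: the files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/toy_reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Structure: how records and their fields map onto the files.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                }
            ],
        }
    ],
}


def to_jsonld(doc: dict) -> str:
    """Serialize the description as pretty-printed JSON-LD text."""
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(to_jsonld(description))
```

The top-level properties (`name`, `description`, `license`, `url`) are plain schema.org terms, which is what lets general-purpose dataset search engines index the description; the `cr:` terms add the ML-specific structure.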
Croissant is for:
- Creators and maintainers of ML datasets – data work is tedious and often under-appreciated. Croissant makes datasets more widely available, across repositories and ML frameworks. Croissant is designed to be modular and extensible – new vocabulary extensions are encouraged to address the distinct characteristics of datasets of certain modalities (e.g. audio, video) or in certain sectors (e.g. life sciences, geospatial).
- ML researchers and practitioners – users of Croissant-enabled datasets have access to dataset documentation to understand how to make the most of the data and contribute to it. They can find the data they need no matter where it was published online. They can load the data into different ML platforms without any overhead to transform the data from one format to another.
- RAI researchers and practitioners – Croissant offers a machine-readable summary of important attributes captured in a variety of data cards and similar approaches, which is portable and discoverable no matter where the dataset and its data card live, hence promoting better documentation practices.
- Policy makers – as AI regulation emerges across the world, Croissant provides a standardized way to collect core information about datasets, hence facilitating the development of data-centric AI audit and assurance tools such as transparency indexes.
Getting started with Croissant is easy. You can:
- Find public datasets in the Croissant format on Google Dataset Search.
- Download Croissant dataset descriptions from repositories such as Hugging Face, Kaggle and OpenML.
- Inspect, create or modify Croissant descriptions using the Croissant editor. You can load your data into the editor and it will derive the metadata for you to fine-tune. You can find the editor on GitHub or try a hosted version here.
- Validate and consume Croissant datasets in Python using the open-source ML Croissant library, available on GitHub.
- Load a dataset into TensorFlow, JAX or PyTorch using custom-built loaders.
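Once a description has been downloaded, a useful first step is simply to see which files, record sets, and fields it declares. The sketch below does this with the Python standard library only. It does not use the ML Croissant library and is not a substitute for its validation; the `summarize` helper and the embedded sample description are hypothetical and assume the simplified layout shown in the sample.

```python
import json

# Hypothetical, abbreviated Croissant-style description used as sample input.
# A real description carries a full @context and more properties.
SAMPLE = """
{
  "@type": "sc:Dataset",
  "name": "toy_reviews",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "reviews.csv",
     "contentUrl": "https://example.org/toy_reviews/reviews.csv",
     "encodingFormat": "text/csv"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "reviews",
     "field": [
       {"@type": "cr:Field", "@id": "reviews/text", "dataType": "sc:Text"},
       {"@type": "cr:Field", "@id": "reviews/label", "dataType": "sc:Integer"}
     ]}
  ]
}
"""


def summarize(jsonld_text: str) -> dict:
    """Return the dataset name, file URLs, and fields per record set."""
    doc = json.loads(jsonld_text)
    # Collect the download location of every declared file.
    files = [f.get("contentUrl") for f in doc.get("distribution", [])]
    # Map each record set id to its (field id, data type) pairs.
    record_sets = {
        rs["@id"]: [(f["@id"], f.get("dataType")) for f in rs.get("field", [])]
        for rs in doc.get("recordSet", [])
    }
    return {"name": doc.get("name"), "files": files, "record_sets": record_sets}


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print(summary["name"])
    for rs_id, fields in summary["record_sets"].items():
        print(rs_id, fields)
```

A summary like this is enough to decide which record set to load; actually reading the records, with type conversion and file handling, is the job of the library and the framework loaders mentioned above.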
Deliverables
- A shared standard vocabulary to describe ML datasets.
- A representative set of real-world ML datasets described in this format.
- An open-source Python library capable of validating Croissant datasets, consuming their records, constructing Croissant datasets programmatically, and serializing them.
- An open-source visual editor that supports the creation, modification and loading of Croissant dataset descriptions.
- Extensions for RAI, geospatial datasets, life sciences, and digital humanities datasets.
- Baseline implementations of these extensions in the Python library and the visual editor listed above.
Meeting Schedule
Weekly on Wednesdays, 9:05 am to 10:00 am Pacific.
Related Blog
- New Croissant Metadata Format helps Standardize ML Datasets
  Support from Hugging Face, Google Dataset Search, Kaggle, OpenML, and TFDS makes datasets easily discoverable and usable.
Croissant Specifications
Croissant Spec
Croissant RAI Spec
How to Join and Access Croissant Working Group Resources
- To sign up for the group mailing list, receive the meeting invite, and access shared documents and meeting minutes:
  - Fill out our subscription form and indicate that you’d like to join the Croissant Working Group.
  - Associate a Google account with your organizational email address.
  - Once you’ve joined the Croissant Working Group, you’ll be able to access the Croissant folder in the Public Google Drive.
- If you need help with Croissant or have technical questions, feel free to email [email protected]. You can also join the forum here.
- To access the GitHub repositories (public):
  - If you want to contribute code, please submit your GitHub ID to our subscription form.
  - Visit the GitHub repository.
Croissant Working Group Chairs
To contact all Croissant Working Group chairs, email [email protected].
Elena Simperl
Elena Simperl is a professor of computer science at King’s College London and the director of research at the Open Data Institute (ODI). She is also a Fellow of the British Computer Society and of the Royal Society of Arts, and features in the top 100 most influential scholars in knowledge engineering of the last decade. Elena’s research is at the intersection of AI and social computing, helping designers understand how to build smart sociotechnical systems that combine data and algorithms with human and social capabilities. She is the president of the Semantic Web Science Association, a not-for-profit organization whose purpose is to promote and exchange scholarly work in semantic technologies and related fields throughout the world.
Omar Benjelloun
Omar Benjelloun is a software engineer at Google, where he has developed data-focused products (Google Public Data Explorer, Google Dataset Search) and Search features (media reviews, public statistics answers, related entities, …) for over a decade and a half. Prior to joining Google, Omar received a PhD in Databases from INRIA / University of Paris Orsay, and spent two years as a postdoc in the Database group at Stanford University.