Croissant

Standardize how ML datasets are described to make them easily discoverable, governable, and usable across tools and platforms.


Purpose


Data is paramount in machine learning (ML). However, finding, understanding, and using ML datasets remains unnecessarily tedious, in part because there is no consistent way to describe them that facilitates reuse. Providing that consistent description is the aim of Croissant.

Croissant is an open, community-built, standardized metadata vocabulary for ML datasets, including key attributes and properties of datasets, as well as information required to load them into ML tools. Croissant enables data interoperability across ML frameworks and beyond, making ML easier to reproduce and replicate.

By building the vocabulary as an extension to schema.org, a machine-readable standard to describe structured data, Croissant also makes ML datasets discoverable beyond the scope of the repository where they have been published. Finally, Croissant operationalizes dataset documentation, extending existing approaches and vocabularies to describe a dataset’s contents, provenance, and usage restrictions.
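To make this concrete, a minimal Croissant description is a JSON-LD document whose vocabulary is schema.org plus Croissant-specific terms. The sketch below is hand-written for illustration: the dataset name, URLs, and file are invented, and a real description would also include record sets describing the data's structure:

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/"
  },
  "@type": "Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "example-dataset",
  "description": "A toy dataset described with Croissant metadata.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data.csv",
      "contentUrl": "https://example.org/data.csv",
      "encodingFormat": "text/csv"
    }
  ]
}
```

Because the top-level terms are plain schema.org properties, any crawler that understands schema.org `Dataset` markup can already index such a file, while Croissant-aware tools can additionally use the `cr:` terms to load the data.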

Deliverables


  • A shared standard vocabulary to describe ML datasets.
  • A representative set of real-world ML datasets described in this format.
  • An open-source Python library capable of validating Croissant datasets, consuming their records, constructing Croissant datasets programmatically, and serializing them.
  • An open-source visual editor that supports the creation, modification and loading of Croissant dataset descriptions.
  • An integration with the Model Context Protocol (MCP) and other emerging data access methods for AI.
  • Extensions for responsible AI (RAI), geospatial, life sciences, and digital humanities datasets.
  • Baseline implementations of these extensions in the Python library and visual editor above.


Facilitating data discovery and use

Any dataset with Croissant metadata is discoverable via Google Dataset Search, making more than 700,000 datasets published on Hugging Face, Kaggle, OpenML, and the rest of the web easily findable and accessible. Croissant can also be used to load a dataset into an ML workflow, with implementations in TensorFlow, JAX, and PyTorch, and MCP tooling via Eclair.


Describing data provenance and governance

Responsible ML requires a clear understanding of the lifecycle a dataset has been through, as well as what it is allowed to be used for. By building on the PROV-O and DUO ontologies, Croissant provides machine-readable descriptions of dataset provenance and permissions, supporting faster auditing and governance.


Extending across domains and ontologies

Croissant is extensible, interoperable, and translatable to other metadata vocabularies and ontologies, allowing it to be used across domains from the life sciences to space weather. Authors can mix and match vocabularies to describe their datasets as well as possible, with Croissant functionality under the surface.

Meeting Schedule

Weekly on Wednesdays, 9:05–10:00am Pacific.

Croissant is for:


Creators and maintainers of ML datasets

Data work is tedious and often under-appreciated. Croissant makes datasets more widely available across repositories and ML frameworks. Croissant is designed to be modular and extensible: new vocabulary extensions are encouraged to address the distinct characteristics of datasets in certain modalities (e.g., audio, video) or sectors (e.g., life sciences, geospatial).


ML researchers and practitioners

Users of Croissant-enabled datasets have access to dataset documentation to understand how to make the most of the data and contribute to it. They can find the data they need no matter where it was published online, and load it into different ML platforms without the overhead of transforming data from one format to another.


RAI researchers and practitioners

Croissant offers a portable, machine-readable summary of the important attributes captured in data cards and similar approaches. Because it remains discoverable no matter where the dataset and its data card live, it promotes better documentation practices.


Policy makers

As AI regulation emerges across the world, Croissant provides a standardized way to collect core information about datasets, facilitating the development of data-centric AI audit and assurance tools such as transparency indexes.

Getting started with Croissant is easy. You can:
  • Find public datasets in the Croissant format on Google Dataset Search.
  • Download Croissant dataset descriptions from repositories such as Hugging Face, Kaggle, and OpenML.
  • Inspect, create, or modify Croissant descriptions using the Croissant editor, which can load your data and derive metadata for you to fine-tune. The editor is available on GitHub, and a hosted version is also available.
  • Validate and consume Croissant datasets in Python using the open-source ML Croissant library, available on GitHub.
  • Load a dataset into TensorFlow, JAX or PyTorch using custom-built loaders.
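To give a flavour of the "validate" step, the sketch below checks a Croissant-style description for a handful of top-level properties using only the Python standard library. The property list and the inline document are illustrative assumptions, not the specification's normative requirements; for real validation and record loading, use the mlcroissant library.

```python
import json

# A tiny, hand-written Croissant-style description (illustrative only;
# a real Croissant file carries a richer @context, a conformsTo property,
# distribution entries, and recordSet definitions).
CROISSANT_JSONLD = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "example-dataset",
  "description": "A toy dataset described with Croissant-style metadata.",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""

# Properties to check for; chosen for illustration, not the normative
# required-property list from the Croissant specification.
REQUIRED = ("@context", "@type", "name", "description", "license")

def missing_properties(jsonld_text: str) -> list:
    """Parse a JSON-LD string and return any REQUIRED properties it lacks."""
    doc = json.loads(jsonld_text)
    return [prop for prop in REQUIRED if prop not in doc]

print(missing_properties(CROISSANT_JSONLD))  # → []
```

A fuller check would also walk nested structures such as `distribution` and `recordSet`, which is exactly what the mlcroissant validator does.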

Croissant Specifications


How to Join and Access Croissant Resources


Croissant Working Group Workstreams and Leads

  • Croissant RAI, workstream leads: Albert Merono Penuela and Joan Giner-Miguelez, [email protected]
  • GeoCroissant, workstream leads: Rajat Shinde and Manil Maskey, [email protected]


Croissant Working Group Chairs

To contact all Croissant working group chairs email [email protected]

Elena Simperl

Elena Simperl is a professor of computer science at King’s College London and the director of research at the Open Data Institute (ODI). She is also a Fellow of the British Computer Society and of the Royal Society of Arts, and is listed among the 100 most influential scholars in knowledge engineering of the last decade. Elena’s research is at the intersection of AI and social computing, helping designers understand how to build smart sociotechnical systems that combine data and algorithms with human and social capabilities. She is the president of the Semantic Web Science Association, a not-for-profit with the purpose of promoting and exchanging scholarly work in semantic technologies and related fields throughout the world.

Omar Benjelloun

Omar Benjelloun is a software engineer at Google, where he has developed data-focused products (Google Public Data Explorer, Google Dataset Search) and Search features (media reviews, public statistics answers, related entities, …) for over a decade and a half. Prior to joining Google, Omar received a PhD in Databases from INRIA / University of Paris Orsay, and spent two years as a postdoc in the Database group at Stanford University.