Croissant

Standardize how ML datasets are described to make them easily discoverable, governable, and usable across tools and platforms.


Purpose


Data is paramount in machine learning (ML). However, finding, understanding, and using ML datasets remains unnecessarily tedious, in part because there is no consistent way to describe them that facilitates reuse. Providing that consistent description is the aim of Croissant.

Croissant is an open, community-built, standardized metadata vocabulary for ML datasets, including key attributes and properties of datasets, as well as information required to load them into ML tools. Croissant enables data interoperability across ML frameworks and beyond, making ML easier to reproduce and replicate.

By building the vocabulary as an extension to schema.org, a machine-readable standard to describe structured data, Croissant also makes ML datasets discoverable beyond the scope of the repository where they have been published. Finally, Croissant operationalizes dataset documentation, extending existing approaches and vocabularies to describe a dataset’s contents, provenance, and usage restrictions.
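To make this concrete, a minimal Croissant description is a JSON-LD document whose vocabulary is schema.org plus Croissant-specific terms. The sketch below is hand-written for illustration: the dataset name, URLs, and file are invented, and a real description would also include record sets describing the data's structure:

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/"
  },
  "@type": "Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "example-dataset",
  "description": "A toy dataset described with Croissant metadata.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "data.csv",
      "contentUrl": "https://example.org/data.csv",
      "encodingFormat": "text/csv"
    }
  ]
}
```

Because the top-level terms are plain schema.org properties, any crawler that understands schema.org `Dataset` markup can already index such a file, while Croissant-aware tools can additionally use the `cr:` terms to load the data.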

Deliverables


  • A shared standard vocabulary to describe ML datasets.
  • A representative set of real-world ML datasets described in this format.
  • An open-source Python library capable of validating Croissant datasets, consuming their records, constructing Croissant datasets programmatically, and serializing them.
  • An open-source visual editor that supports the creation, modification and loading of Croissant dataset descriptions.
  • An integration with the Model Context Protocol (MCP) and other emerging data access methods for AI.
  • Extensions for responsible AI (RAI), geospatial, life sciences, and digital humanities datasets.
  • Baseline implementations of these extensions in the Python library and visual editor above.


Facilitating data discovery and use

Any dataset with Croissant metadata is discoverable via Google Dataset Search, making more than 700,000 datasets published on Hugging Face, Kaggle, OpenML, and the rest of the web easily findable and accessible. Croissant can also be used to load a dataset into an ML workflow, with implementations in TensorFlow, JAX, and PyTorch, and MCP tooling via Eclair.


Describing data provenance and governance

Responsible ML requires a clear understanding of the lifecycle a dataset has been through, as well as what it is allowed to be used for. By building on the PROV-O and DUO ontologies, Croissant provides machine-readable descriptions of dataset provenance and permissions, supporting faster auditing and governance.


Extending across domains and ontologies

Croissant is extensible, interoperable, and translatable to other metadata vocabularies and ontologies, allowing it to be used across domains from the life sciences to space weather. Authors can mix and match vocabularies to describe their datasets as well as possible, with Croissant functionality under the surface.

Meeting Schedule

Weekly on Wednesdays, 9:05–10:00am Pacific.

Croissant is for:


Creators and maintainers of ML datasets

Data work is tedious and often under-appreciated. Croissant makes datasets more widely available across repositories and ML frameworks. Croissant is designed to be modular and extensible: new vocabulary extensions are encouraged to address the distinct characteristics of datasets in certain modalities (e.g., audio, video) or sectors (e.g., life sciences, geospatial).


ML researchers and practitioners

Users of Croissant-enabled datasets have access to dataset documentation to understand how to make the most of the data and contribute to it. They can find the data they need no matter where it was published online, and load it into different ML platforms without the overhead of transforming data from one format to another.


RAI researchers and practitioners

Croissant offers a portable, machine-readable summary of the important attributes captured in data cards and similar approaches. Because it remains discoverable no matter where the dataset and its data card live, it promotes better documentation practices.


Policy makers

As AI regulation emerges across the world, Croissant provides a standardized way to collect core information about datasets, facilitating the development of data-centric AI audit and assurance tools such as transparency indexes.

Getting started with Croissant is easy. You can:
  • Find public datasets in the Croissant format on Google Dataset Search.
  • Download Croissant dataset descriptions from repositories such as Hugging Face, Kaggle, and OpenML.
  • Inspect, create, or modify Croissant descriptions using the Croissant editor, which can load your data and derive metadata for you to fine-tune. The editor is available on GitHub, and a hosted version is also available.
  • Validate and consume Croissant datasets in Python using the open-source ML Croissant library, available on GitHub.
  • Load a dataset into TensorFlow, JAX or PyTorch using custom-built loaders.
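To give a flavour of the "validate" step, the sketch below checks a Croissant-style description for a handful of top-level properties using only the Python standard library. The property list and the inline document are illustrative assumptions, not the specification's normative requirements; for real validation and record loading, use the mlcroissant library.

```python
import json

# A tiny, hand-written Croissant-style description (illustrative only;
# a real Croissant file carries a richer @context, a conformsTo property,
# distribution entries, and recordSet definitions).
CROISSANT_JSONLD = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "name": "example-dataset",
  "description": "A toy dataset described with Croissant-style metadata.",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""

# Properties to check for; chosen for illustration, not the normative
# required-property list from the Croissant specification.
REQUIRED = ("@context", "@type", "name", "description", "license")

def missing_properties(jsonld_text: str) -> list:
    """Parse a JSON-LD string and return any REQUIRED properties it lacks."""
    doc = json.loads(jsonld_text)
    return [prop for prop in REQUIRED if prop not in doc]

print(missing_properties(CROISSANT_JSONLD))  # → []
```

A fuller check would also walk nested structures such as `distribution` and `recordSet`, which is exactly what the mlcroissant validator does.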

Croissant Specifications


How to Join and Access Croissant Resources


Croissant Working Group Workstreams and Leads

  • Croissant RAI, workstream leads: Albert Merono Penuela and Joan Giner-Miguelez, [email protected]
  • GeoCroissant, workstream leads: Rajat Shinde and Manil Maskey, [email protected]


Croissant Working Group Chairs

To contact all Croissant working group chairs email [email protected]

Elena Simperl

Elena Simperl is a professor of computer science at King’s College London and the director of research at the Open Data Institute (ODI). She is also a Fellow of the British Computer Society and of the Royal Society of Arts, and is listed among the 100 most influential scholars in knowledge engineering of the last decade. Elena’s research is at the intersection of AI and social computing, helping designers understand how to build smart sociotechnical systems that combine data and algorithms with human and social capabilities. She is the president of the Semantic Web Science Association, a not-for-profit with the purpose of promoting and exchanging scholarly work in semantic technologies and related fields throughout the world.

Omar Benjelloun

Omar Benjelloun is a software engineer at Google, where he has developed data-focused products (Google Public Data Explorer, Google Dataset Search) and Search features (media reviews, public statistics answers, related entities, …) for over a decade and a half. Prior to joining Google, Omar received a PhD in Databases from INRIA / University of Paris Orsay, and spent two years as a postdoc in the Database group at Stanford University.