Croissant

Standardize how ML datasets are described to make them easily discoverable and usable across tools and platforms.

Purpose


Data is paramount in machine learning (ML). However, finding, understanding, and using ML datasets is still unnecessarily tedious. One reason is the lack of a consistent way to describe ML datasets in a form that facilitates reuse. Croissant aims to provide exactly that.

Croissant is an open, community-built, standardized metadata vocabulary for ML datasets. It captures key attributes and properties of datasets, as well as the information required to load them in ML tools. Croissant enables data interoperability between ML frameworks and beyond, which makes ML work easier to reproduce and replicate.

By building the vocabulary as an extension of schema.org, a machine-readable standard for describing structured data, Croissant also makes ML datasets discoverable beyond the repository where they were published. Finally, Croissant operationalizes dataset documentation, complementing and extending existing approaches such as data cards to respond to data-centric Responsible AI (RAI) concerns.
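
To make the schema.org grounding concrete, the sketch below builds a deliberately minimal Croissant description as a Python dictionary and serializes it to JSON-LD. The dataset name, URLs, and column are illustrative placeholders, and the abbreviated @context and property names follow our reading of the Croissant 1.0 format; consult the specification for the authoritative shape.

```python
import json

# A minimal, illustrative Croissant description (placeholder names and URLs).
# The full @context defined by the Croissant spec is longer; this abbreviated
# form only declares the schema.org and Croissant namespaces used below.
croissant_jsonld = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "toy_dataset",
    "description": "A toy dataset used to illustrate the Croissant format.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_dataset",
    # distribution: the raw resources (files) that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "data.csv",
            "name": "data.csv",
            "contentUrl": "https://example.org/toy_dataset/data.csv",
            "encodingFormat": "text/csv",
            "sha256": "<checksum of data.csv>",
        }
    ],
    # recordSet: the structured records ML tools actually consume,
    # with fields extracted from the files listed above.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "examples",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "examples/label",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "data.csv"},
                        "extract": {"column": "label"},
                    },
                }
            ],
        }
    ],
}

with open("croissant.json", "w") as f:
    json.dump(croissant_jsonld, f, indent=2)
```

Because the description is plain JSON-LD built on schema.org, the same file can be indexed by search engines and parsed by ML tooling alike.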

Deliverables


  • A shared standard vocabulary to describe ML datasets.
  • A representative set of real-world ML datasets described in this format.
  • An open-source Python library capable of validating Croissant datasets, consuming their records, constructing Croissant datasets programmatically, and serializing them (see the sketch after this list).
  • An open-source visual editor that supports the creation, modification and loading of Croissant dataset descriptions.
  • Extensions for RAI, geospatial, life sciences, and digital humanities datasets.
  • Baseline implementations of these extensions in the Python library and visual editor above.
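
As a sketch of the Python library deliverable, the snippet below constructs a small Croissant description programmatically with mlcroissant, prints its validation report, and serializes the result to JSON-LD. The class names follow the library's documented builder API, but exact parameter names (for example encoding_formats versus encoding_format) have changed between releases, so treat this as a sketch to check against the installed version; all names and URLs are placeholders.

```python
import json
import mlcroissant as mlc

# Describe the dataset's resources (a single CSV file here; URL is a placeholder).
distribution = [
    mlc.FileObject(
        id="data.csv",
        name="data.csv",
        content_url="https://example.org/toy_dataset/data.csv",
        encoding_formats=["text/csv"],
        sha256="<checksum of data.csv>",
    ),
]

# Describe the records and the fields extracted from that file.
record_sets = [
    mlc.RecordSet(
        id="examples",
        name="examples",
        fields=[
            mlc.Field(
                id="examples/label",
                name="label",
                data_types=mlc.DataType.TEXT,
                source=mlc.Source(
                    file_object="data.csv",
                    extract=mlc.Extract(column="label"),
                ),
            ),
        ],
    ),
]

metadata = mlc.Metadata(
    name="toy_dataset",
    description="A toy dataset used to illustrate programmatic construction.",
    url="https://example.org/toy_dataset",
    distribution=distribution,
    record_sets=record_sets,
)

# Validation: report any issues found in the description.
print(metadata.issues.report())

# Serialization: write the description as Croissant JSON-LD.
with open("croissant.json", "w") as f:
    json.dump(metadata.to_json(), f, indent=2)
```
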
Meeting Schedule

Weekly on Wednesdays, from 9:05am to 10:00am Pacific.

Croissant is for:


Creators and maintainers of ML datasets

Data work is tedious and often under-appreciated. Croissant makes datasets more widely available across repositories and ML frameworks. Croissant is designed to be modular and extensible: new vocabulary extensions are encouraged to address the distinct characteristics of datasets of certain modalities (e.g. audio, video) or in certain sectors (e.g. life sciences, geospatial).


ML researchers and practitioners

Users of Croissant-enabled datasets have access to dataset documentation that helps them make the most of the data and contribute to it. They can find the data they need no matter where it was published online, and they can load it into different ML platforms without the overhead of transforming it from one format to another.


RAI researchers and practitioners

Croissant offers a machine-readable summary of important attributes captured in a variety of data cards and similar approaches, which is portable and discoverable no matter where the dataset and its data card live, hence promoting better documentation practices.


Policy makers

As AI regulation emerges across the world, Croissant provides a standardized way to collect core information about datasets, hence facilitating the development of data-centric AI audit and assurance tools such as transparency indexes.

Getting started with Croissant is easy. You can:
  • Find public datasets in the Croissant format on Google Dataset Search.
  • Download Croissant dataset descriptions from repositories such as Hugging Face, Kaggle, and OpenML.
  • Inspect, create, or modify Croissant descriptions using the Croissant editor. You can load your data into the editor, and it will derive the metadata for you to fine-tune. The editor is available on GitHub, and a hosted version is also available.
  • Validate and consume Croissant datasets in Python using the open-source ML Croissant library, available on GitHub (see the example after this list).
  • Load a dataset into TensorFlow, JAX or PyTorch using custom-built loaders.
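
As referenced above, here is a minimal sketch of consuming a Croissant dataset with the mlcroissant library. The JSON-LD location and record set name are illustrative assumptions: the description could be a local file such as the one produced earlier, or a URL served by a repository's API, and the available record sets can be read from the loaded metadata.

```python
import mlcroissant as mlc

# Location of a Croissant JSON-LD description (illustrative; a local path or
# a repository URL serving Croissant JSON-LD both work).
jsonld_url = "https://example.org/toy_dataset/croissant.json"

dataset = mlc.Dataset(jsonld=jsonld_url)

# The parsed metadata lists the record sets declared by the dataset.
for record_set in dataset.metadata.record_sets:
    print(record_set.name)

# Stream a few records from one record set (the name depends on the dataset).
for i, record in enumerate(dataset.records(record_set="examples")):
    print(record)
    if i >= 2:
        break
```

The same records stream is what the TensorFlow, JAX, and PyTorch loaders build on, so a dataset described once can be consumed by any of these frameworks.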

Croissant Specifications


How to Join and Access Croissant Resources


Croissant Working Group Chairs

To contact all Croissant working group chairs email [email protected]

Elena Simperl

Elena Simperl is a professor of computer science at King’s College London and the director of research at the Open Data Institute (ODI). She is also a Fellow of the British Computer Society and of the Royal Society of Arts, and is listed among the top 100 most influential scholars in knowledge engineering of the last decade. Elena’s research is at the intersection of AI and social computing, helping designers understand how to build smart sociotechnical systems that combine data and algorithms with human and social capabilities. She is the president of the Semantic Web Science Association, a not-for-profit with the purpose of promoting and exchanging scholarly work in semantic technologies and related fields throughout the world.

Omar Benjelloun

Omar Benjelloun is a software engineer at Google, where he has developed data-focused products (Google Public Data Explorer, Google Dataset Search) and Search features (media reviews, public statistics answers, related entities, …) for over a decade and a half. Prior to joining Google, Omar received a PhD in Databases from INRIA / University of Paris Orsay, and spent two years as a postdoc in the Database group at Stanford University.