Bringing Open Source Principles to AI Data through Croissant

Croissant, a metadata format for standardizing machine learning (ML) datasets, has gained significant popularity within the open-source AI community by making ML datasets more consistent and more accessible. Developed by the MLCommons Croissant working group, it is a community-driven metadata standard that simplifies how datasets are used, making them more understandable and shareable. This surge in adoption underscores a growing awareness of the critical role transparency and collaboration play in advancing AI development.

By aligning with the Open Source AI Definition, Croissant helps ensure that datasets are FAIR (findable, accessible, interoperable, and reusable) and advocates for responsible data management practices. As an increasing number of organizations and individuals embrace Croissant, cultivating innovation and collaboration across platforms becomes essential.

Thomas Carey Wilson’s recent Medium post, “Bringing Open Source Principles to AI Data through Croissant”, explains why embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms. We are republishing the Medium post here for the community. To get involved, join the Croissant working group here.

Republished from the ODI research team Medium account (“Canvas”), November 27, 2024, authored by Thomas Carey Wilson.

Bringing Open Source Principles to AI Data through Croissant

Introduction

Artificial Intelligence (AI) is increasingly influencing many sectors, from industry to government operations. Despite some risks, its potential can be unlocked by greater and more responsible openness, enabling freedom, transparency, and collaboration. The Open Source Initiative’s (OSI) Open Source AI Definition seeks to bring these core principles to AI systems. To our delight, central to this mission is the availability and standardisation of AI data, in addition to models and weights. This is where MLCommons’ Croissant comes into play. Croissant is a community-driven metadata standard that simplifies how datasets are used in machine learning (ML), making them more accessible, understandable, and shareable.

Understanding the Open Source AI Definition

The Open Source AI Definition extends the foundational freedoms of open source software to AI systems. It outlines four essential freedoms:

Use: the right to use the AI system for any purpose without seeking permission.
Study: the freedom to examine how the system works by accessing its components.
Modify: the freedom to alter the system to change its outputs or behaviour.
Share: the right to distribute the system to others, with or without modifications, for any purpose.

These freedoms apply not only to complete AI systems but also to their components, such as models, data, and parameters. The OSI explains that a critical enabler of these freedoms is ensuring access to the “preferred form to make modifications”: the essential resources required for others to understand and alter certain aspects of the AI system.

This includes providing detailed information about the data used to train the system (“data information”), the complete source code for training and running the system (“code”), and the model parameters like weights or configuration settings (“parameters”). Among these, data information is particularly crucial because it empowers users to study and modify AI systems by understanding the data foundation upon which they are built. This is a core area where Croissant makes a significant contribution.

We value the focus on data transparency but suggest, as highlighted in our data for AI taxonomy, specifying what measures enabling transparency could look like for different data types in AI systems. This can support more equitable access, informed debate, and the implementation of responsible practices across the AI ecosystem.

Croissant’s alignment with the Open Source AI Definition

Croissant, as an open-source tool, ensures that ML datasets are FAIR (findable, accessible, interoperable, and reusable), aligning with open science and open data practices. It embodies the principles of the Open Source AI Definition by providing a standardized, machine-readable format for ML dataset metadata. Here’s how Croissant supports each of the core freedoms and data information requirements:

Use

To be useful, core Croissant metadata should be openly available for all AI datasets, even if the datasets themselves cannot be made public. For an open-source AI model based on open data, Croissant can enable free use of AI datasets by providing detailed metadata that makes datasets easier to find and utilize. Once the required dataset(s) are located, it allows practitioners to quickly understand data collection methods, structures, and processing steps by standardizing how datasets are described. This ease of access and clarity empowers users to employ datasets for any purpose without needing permission or facing unnecessary barriers.

Moreover, Croissant’s design supports interoperability by defining its attributes precisely, so that tools share a common understanding of what those attributes mean. The use of widely recognized vocabularies like schema.org allows integration with existing data and metadata management tools, such as data crawlers, search engines, data catalogues, and metadata assurance systems. This aligns with the Open Source AI Definition’s emphasis on freedom of use by enabling seamless integration across various ML frameworks like PyTorch, TensorFlow, and JAX.

Study

Studying an AI system requires access to detailed information about its components. Croissant facilitates this by offering rich metadata that supports a range of responsible AI use cases, including the structured discovery of datasets, representation of human contributions, and analysis of data diversity and bias. Specifically, it provides:

Dataset-level information: Includes descriptions, licenses, creators, and versioning.
Resources: Details about files and file sets within the dataset.
RecordSets: Structures that describe how data is organized, including fields and data types.

For example, Croissant enables researchers to analyze the demographic characteristics of annotators and contributors to assess dataset diversity or potential biases. For this purpose and others, the inclusion of data types, hierarchical record sets, and fields supports machine interoperability, allowing tools to interpret and integrate datasets effectively.
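As a rough illustration of the three layers listed above, the sketch below builds a heavily simplified Croissant-style description as a Python dictionary. The property names (distribution, recordSet, field, contentUrl, encodingFormat) are intended to follow the Croissant 1.0 vocabulary, but the dataset name, URLs, and column names are hypothetical and the @context is abbreviated, so this should be read as a sketch rather than a complete or validated Croissant file.

```python
import json

# Simplified, illustrative Croissant-style metadata. All names and URLs are
# hypothetical; a real file needs the full @context from the specification.
metadata = {
    "@context": {"sc": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "sc:Dataset",
    # Dataset-level information
    "name": "example_reviews",
    "description": "Hypothetical product reviews with sentiment labels.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "sc:Organization", "name": "Example Lab"},
    "version": "1.0.0",
    # Resources: the files that make up the dataset
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/data/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # RecordSets: how records and their fields map onto those files
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {"fileObject": {"@id": "reviews.csv"},
                               "extract": {"column": "text"}},
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/label",
                    "dataType": "sc:Text",
                    "source": {"fileObject": {"@id": "reviews.csv"},
                               "extract": {"column": "label"}},
                },
            ],
        }
    ],
}


def summarize(meta):
    """Return a small summary covering dataset, resource, and record-set layers."""
    return {
        "name": meta["name"],
        "files": [f["@id"] for f in meta["distribution"]],
        "fields": [fld["@id"] for rs in meta["recordSet"] for fld in rs["field"]],
    }


print(json.dumps(summarize(metadata), indent=2))
```

Because every field declares a data type and points back to the resource it is extracted from, a tool can work out how to read the dataset without relying on free-text documentation.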

Modify

Croissant’s open and extensible format encourages modification and adaptation. Users can:

Extend metadata: Croissant is designed to be modular, allowing for extensions like the Geo-Croissant extension, which captures information specific to geospatial data.
Adapt datasets: the standardization of metadata means that datasets can be modified more easily, whether by adding new data, altering existing records, or adjusting data structures.
Integrate with tools: Croissant metadata can be edited through a visual editor or a Python library, making it accessible for developers to modify datasets programmatically.
Enable interoperability across repositories: Croissant acts as a glue between platforms like Kaggle and Hugging Face, providing adapters that enable seamless dataset interchange while retaining provenance and versioning information for compatibility across frameworks.

For example, if a practitioner wants to add new fields to a dataset or alter the way data is processed, Croissant’s standardized format makes these modifications straightforward. The ability to represent complex data types and transformations within the metadata ensures that changes are accurately captured and can be shared with others.
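To make the field-addition case concrete, here is a minimal sketch that treats a Croissant-style description as plain JSON data and appends a field to one record set. The helper name add_field, the dataset, and the field identifiers are all hypothetical, and the metadata is pared down to the parts the example needs; in practice the visual editor or Python library mentioned above would likely be the more convenient route.

```python
import copy


def add_field(metadata, record_set_id, field):
    """Return a copy of `metadata` with `field` appended to the named record set."""
    updated = copy.deepcopy(metadata)  # leave the caller's metadata untouched
    for record_set in updated.get("recordSet", []):
        if record_set.get("@id") == record_set_id:
            record_set.setdefault("field", []).append(field)
            return updated
    raise KeyError(f"no record set with @id {record_set_id!r}")


# Hypothetical, pared-down metadata with a single record set and field.
original = {
    "name": "example_reviews",
    "version": "1.0.0",
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {"@type": "cr:Field", "@id": "reviews/text", "dataType": "sc:Text"}
            ],
        }
    ],
}

new_field = {"@type": "cr:Field", "@id": "reviews/language", "dataType": "sc:Text"}
modified = add_field(original, "reviews", new_field)
modified["version"] = "1.1.0"  # record that the description has changed

print([f["@id"] for f in modified["recordSet"][0]["field"]])
```

Working on a copy and bumping the version keeps the original description intact, so the change can be shared as a new, clearly identified revision.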

Share

Sharing is at the heart of open source, and Croissant facilitates this by making datasets and their metadata easily distributable. The use of JSON-LD and alignment with schema.org means that metadata can be embedded in web pages, enabling search engines to index and discover datasets. Croissant also supports:

Portability: Croissant datasets can be seamlessly loaded into different ML frameworks, promoting sharing across platforms.
Collaboration: Croissant fosters collaboration among practitioners, researchers, and organizations by providing a common language for dataset metadata.

An illustrative example is how Croissant enables dataset repositories to infer metadata from existing documentation like data cards, making it easier to publish and share datasets with metadata attached.
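Embedding JSON-LD in a web page usually means placing it inside a script element of type application/ld+json. The sketch below shows one way to do that with the Python standard library; the function name to_jsonld_script and the dataset values are hypothetical, and the metadata is reduced to a few schema.org properties for brevity.

```python
import json


def to_jsonld_script(metadata):
    """Wrap metadata in a <script> element suitable for embedding in an HTML page."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so text inside the JSON cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so parsers recover the original text.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical dataset description using plain schema.org properties.
page_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "example_reviews",
    "description": "Hypothetical product reviews with sentiment labels.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/example_reviews",
}

print(to_jsonld_script(page_metadata))
```

A crawler that understands JSON-LD can then read the dataset description directly from the page that hosts it.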

Data information compliance

As for the practical requirements which enable these freedoms, the Open Source AI Definition places significant emphasis on data information, requiring:

A complete description of all data used for training: Croissant provides a detailed schema for dataset-level information, including descriptions, licenses, creators, and data versions. It also specifies how to represent resources like files and file sets, ensuring that all data used in training is thoroughly documented.
Listing of publicly available training data: through the distribution property and detailed resource descriptions, Croissant makes it clear where data is stored and how it can be accessed. The use of URLs and content descriptions aligns with the requirement to list publicly available data and where to obtain it.
Listing of training data obtainable from third parties: Croissant’s metadata can include references to external data sources, including third-party datasets. Specifying licenses and access URLs ensures that users know how to obtain additional data, even if it requires a fee.
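As a small sketch of how such listings could be read back out, the function below walks the distribution of a Croissant-style description and reports where each file can be obtained. The function name list_data_sources and all URLs are hypothetical; the example assumes that file sets are defined relative to another resource (via containedIn) and so carry no URL of their own.

```python
def list_data_sources(metadata):
    """Return (resource id, URL) pairs for every resource that has its own URL."""
    sources = []
    for resource in metadata.get("distribution", []):
        url = resource.get("contentUrl")
        if url is None:
            continue  # e.g. a file set defined relative to another resource
        sources.append((resource.get("@id", "?"), url))
    return sources


# Hypothetical metadata mixing a first-party file and a third-party archive.
metadata = {
    "name": "example_reviews",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "cr:FileObject", "@id": "reviews.csv",
         "contentUrl": "https://example.org/data/reviews.csv"},
        {"@type": "cr:FileObject", "@id": "images.zip",
         "contentUrl": "https://example.com/third-party/images.zip"},
        {"@type": "cr:FileSet", "@id": "image-files",
         "containedIn": {"@id": "images.zip"}, "includes": "*.jpg"},
    ],
}

for resource_id, url in list_data_sources(metadata):
    print(f"{resource_id}: {url} (license: {metadata['license']})")
```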

Moreover, Croissant’s support for versioning and checksums addresses the need for transparency in data changes over time. Croissant ensures that users can verify the integrity of data and understand its provenance by documenting dataset versions and providing file checksums.
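The integrity check described here can be sketched with Python's hashlib. The example assumes the file's checksum is stored under a sha256 property on the resource, which is how Croissant file objects record it as far as we are aware; the helper names and file contents are hypothetical, and the data is kept in memory instead of being downloaded.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_file_object(file_object: dict, data: bytes) -> bool:
    """Check downloaded bytes against the sha256 recorded in the metadata."""
    expected = file_object.get("sha256")
    if expected is None:
        raise ValueError("metadata records no sha256 checksum")
    return sha256_hex(data) == expected.lower()


# The publisher computes the checksum once and records it in the metadata.
published = b"text,label\ngreat product,positive\n"
file_object = {
    "@type": "cr:FileObject",
    "@id": "reviews.csv",
    "sha256": sha256_hex(published),
}

# A consumer later checks that what they received matches what was published.
print(verify_file_object(file_object, published))           # unchanged copy
print(verify_file_object(file_object, published + b"x\n"))  # altered copy
```

Pairing the recorded checksum with the dataset version lets a user confirm both which release they hold and that its files have not been altered.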

Conclusion

Croissant epitomizes the core principles of the Open Source AI Definition by helping AI datasets to be fully open, accessible, and adaptable. Through its standardized format, Croissant empowers everyone to use, study, modify, and share AI data freely, driving innovation, fostering collaboration, and building trust in AI systems. While details on how the Open Source AI Definition will work in practice are still emerging, embracing Croissant is a decisive step toward an open-source AI future where transparency and collective advancement are the norms.
