Since its introduction in March 2024, the Croissant metadata format for ML-ready datasets has quickly gained momentum within the ML data community. It was recently featured at the NeurIPS conference in December 2024. Croissant is supported by major ML dataset repositories, such as Kaggle and Hugging Face. Over 700,000 Croissant datasets are accessible and searchable via Google Dataset Search, and Croissant datasets can be loaded automatically to train ML models using popular toolkits such as JAX and PyTorch.

MLCommons’ Croissant working group co-chairs, Omar Benjelloun and Elena Simperl, sat down to share their thoughts on Croissant’s initial popularity, the approach the working group is taking to develop it, and next steps.

[MLC] Why do you think Croissant has become so popular?

[ES] We’re delighted to see it being discussed in many contexts, beyond AI and machine learning research. I think we’re solving a problem that is easy to recognize, a problem that many practitioners who work with machine learning have, but one whose solution requires a range of skills that traditionally haven’t been that important in that community. Those techniques have to do with how you structure and organize data – data used by machine learning systems or any other type of data – in a way that makes it easy for that data to be shared across applications, to be found and reused. That’s a topic that Omar and I and other people in the working group know a lot about. And with Croissant, we are applying these techniques, which have been around for 10 to 15 years, to a new use case: machine learning data work. And I suppose the magic happened when we brought together all the relevant expertise in one working group to deliver something that ticks lots of boxes and addresses many pain points.

[MLC] What was the approach you took to developing Croissant?

[OB] We took an approach that was very concrete. A lot of this metadata standards work happens in a way where some experts get together in a working group and create what they think is the perfect standard. And then they try to convince others to adopt it and use it in their products.

Instead, from the start we said, “Okay, let’s define the standard at the same time and in close collaboration with the tools and systems that are going to use it.” So we worked with the repositories of machine learning datasets, like Kaggle and Hugging Face and OpenML initially, to implement support for this format as we were defining it, as well as to participate in the definition of the format. That’s the “supply side” of datasets. On the “demand side,” we worked with the teams that develop tools for loading datasets for machine learning to implement support for loading Croissant datasets. And they were also part of the definition of the format.

So getting these people to work together helped us to create something which was not, “Here’s a specification. We think it’s going to make things better. Now, please go and implement this so that we have some data on one side and some tools on the other side.” Instead we could say, “Here’s the specification and here are some datasets that are already available in this format because these repositories support it. And here are some tools that you can use to work with these datasets.” And of course, it’s just the beginning. There are more repositories and datasets that should be available in this format. And there are a lot more tools that should add support for Croissant so that ML developers have what they need to work with this data. But the initial release of Croissant was a strong start with a useful, concrete proposition. And I think this is part of what made it so successful.
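To make the “supply side” concrete: a Croissant dataset description is a JSON-LD file that combines schema.org vocabulary with Croissant-specific terms for file objects, record sets, and fields. The sketch below builds a deliberately minimal description in Python. The dataset name, file path, and the abbreviated `@context` are illustrative assumptions; real Croissant 1.0 files carry a fuller context and more properties (checksums, `conformsTo`, licensing, and so on), so treat this as a shape sketch rather than a spec-complete example.

```python
import json

# A deliberately minimal, illustrative Croissant-style description.
# The @context is abbreviated and the dataset/file names are hypothetical;
# real Croissant 1.0 metadata includes many more required properties.
metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "toy-reviews",  # hypothetical dataset name
    "description": "A tiny example dataset of text reviews.",
    "distribution": [
        {
            # A FileObject describes one concrete file in the dataset.
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "encodingFormat": "text/csv",
            "contentUrl": "data/reviews.csv",  # hypothetical path
        }
    ],
    "recordSet": [
        {
            # A RecordSet describes the logical records tools can load.
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {"@type": "cr:Field", "@id": "reviews/text",
                 "dataType": "Text"},
                {"@type": "cr:Field", "@id": "reviews/label",
                 "dataType": "Integer"},
            ],
        }
    ],
}

# Serialize the description; "demand side" tools consume this JSON.
croissant_json = json.dumps(metadata, indent=2)
print(croissant_json)
```

A loader that understands this structure can enumerate the record sets and fields without knowing anything else about the dataset, which is the interoperability the working group describes.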

[MLC] Do you consider Croissant a “standard” now?

[ES] Personally, I have mixed feelings about the word “standard.” Sometimes when I hear someone talking about standards, I think about those lengthy, tedious, and to some degree inaccessible processes that are run in standardization committees, where a lot of work is put into coming to a proposal, with lots of different stakeholders and many competing interests to agree upon. And there are hundreds and hundreds of pages explaining what someone needs to do to use the standard in the intended way. And then there are some implementations. And then if you’re an academic researcher in a small lab or a startup who doesn’t have the resources to be part of this exercise, you feel left out.

Having said that, standards, once they are agreed upon, and if there is critical mass in adoption, can be very, very powerful, because they do reduce the frictions of doing business. So if I am a startup that is creating a new type of machine learning functionality – for example, to manage workflows, or to do some sort of fair machine learning, or any specialist functionality that the community needs – then having a standard way to process data that has been created and processed using any other machine learning tool is hugely valuable. But I love what Omar said: we’re following a slightly different approach. We really want to release everything to the community. We want to build everything with that adoption pathway in mind, so that we put the results in the hands of practitioners as quickly as possible and in the most accessible way possible.

The other thing I would say is that even if a competing proposal emerges, the way the Croissant vocabulary – and the technology it is built with – is designed means that it would be quite easy to use both. There are many vocabularies that describe datasets, and something like Google Dataset Search can work with all of them because of the technologies used to do that. But of course, if you just look at the number of datasets that use this or any other machine-readable vocabulary with similar aims, at the moment we seem to have an advantage. And interest in adopting Croissant from more and more platforms is still there, almost a year after the launch of the initial version.

[OB] We released the first version of the Croissant format, but I don’t consider it to be a standard yet. I prefer to say “community-driven specification.” And we’re not afraid to say Croissant will evolve and continue to change. Once we have a critical mass of data platforms and tools that have adopted it and use it and make their users happy thanks to Croissant, then it will feel stable and we can say, “we don’t see this changing much from here.” At that point, declaring it a standard makes sense. And of course, all the companies, startups, and university labs that build tools and participate in the community will have contributed to that standard through their work.

[MLC] How important is the support from industry?

[ES] Speaking as someone who is in academia, I think industry support is very important. I am personally very happy with the diversity of industry participants in the group. We have OpenML, which is an open source initiative driven among others by academics, but it is a project with lots of contributors. We have obviously Google, and we have participants from Meta as well – household names in the AI space. Then we have Hugging Face and Kaggle, which have huge communities behind them. And we have NASA as well, which brings in a very interesting use case around geospatial datasets and their use in machine learning. Going forward, I’d love for the working group to attract more industries beyond tech that perhaps have their own machine learning use cases and their own enterprise data that they want to prepare for machine learning usage. If you think about a big company and the types of data management problems it has, having something like Croissant would be very valuable there as well.

We have a good initial proposal for what Croissant can be, how it should be. But equally, we also have a very long list of things that we need to work on this year to get to that stable version that you could perhaps call a standard. And once you reach that, then you can actually make a case to any enterprise to explore this internally.

[MLC] What are your goals for expanding participation in the working group and in finding new partners?

[OB] First of all, this is an open group, and everyone is welcome! We would love to see more companies involved that develop tools used by machine learning practitioners. This speaks to the promise of Croissant: once all datasets are described in the same format, any tool that understands that format can work with any of those datasets. So if the tools that provide functionality around loading data for training, fine-tuning, or evaluating ML models, or for data labeling or data analysis for machine learning, look at Croissant and consider supporting it – which does not require a huge amount of effort, because it’s not a very complex format – then that will really make Croissant a lot more useful to machine learning practitioners. A goal for us this year is to get more tool vendors to come and join the Croissant effort and add support for it. Any vendor, no matter what size, that does AI data work – whether that’s data augmentation, data collection (maybe responsible data collection), data labeling, data debiasing, or any other data-related activity that makes a dataset more likely to be ready for machine learning use – would add a lot of value to what we do right now. And it will also bring in new perspectives and new pathways to adoption.

[ES] And would using Croissant be valuable for them? Absolutely. I think it was a researcher from Google who said, “No one wants to do the data work in AI,” a few years ago, and it still rings true. The vendors of data-related products solve some data challenges, but at the same time they also struggle with the fact that many people still think, “I’m just going to find the data somehow, and I’m just going to try to get to training my model as quickly as possible.” Croissant and these vendors are saying the same thing: data is important, we need to pay attention to it. We have very similar aims, and we’re providing these developers with a data portability solution. We’re helping them to interoperate with a range of other tools in the ecosystem. And that’s an advantage, especially when you’re starting out. 

[MLC] Have there been any surprises for you in how the community has been using Croissant?

[ES] We see a lot of excitement around use cases that we hadn’t really thought about so much. For instance, anything related to responsible AI, including AI safety, that’s a huge area that is looking for a whole range of solutions, some of them technical, some of them socio-technical. We’ve seen a lot of interest to use Croissant for many scenarios where there is a need to operationalize responsible AI practices. And to some degree, the challenge that we’re facing is to decide which of those use cases are really part of the core roadmap, and which ones are best served by communities that have that sectorial expertise.

We’ve also seen a lot of interest from the open science community. They’ve been working on practices and solutions for scientists to share not just their data, but their workflows – in other words, the protocols, experiment designs, and other ways of doing science. Scientific datasets are now increasingly used with machine learning, so we’ve had lots of interest, and there are conversations on how to take advantage of the decades of experience that community brings.

[MLC] How was Croissant received at the NeurIPS 2024 conference?

[OB] We presented a poster at NeurIPS that opened a lot of conversations. We also conducted what turned out to be perhaps our most interesting community engagement activity of the year. We asked authors in the conference’s Datasets and Benchmarks track to submit their datasets with a Croissant metadata description. It was not a strict requirement; it was more of a recommendation. I think this was great for Croissant as a budding metadata format, getting that exposure and real-life testing. We learned a lot from that experience about details in our spec that were not clear, or things that were not explained as well as they should have been. And we also got real-world feedback on the tools that we built, including an initial metadata editor with which users can create the Croissant metadata for their datasets. It works in some cases, but we also knew that it was not feature-complete, so users ran into issues using it. For instance, researchers often create bespoke datasets for their own research problems. And they asked us, “I have my custom binary format that I created for this; how do I write Croissant for it? It’s not CSV or images or text.” And so it pushed us into facing some of Croissant’s limits. This experience also taught us the importance of good creation tools for users to prepare their datasets, because nobody wants to write by hand a JSON file that describes their dataset. I think for us it’s a strong incentive, as a community, to invest in building good editing tools that can also connect to platforms like Hugging Face or Kaggle so that users can publish their datasets there. So this is something we learned from this interaction – and at the next conference, the experience will be much smoother for the authors who contribute.
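The point about creation tools can be made concrete: even a small script can draft Croissant-style metadata from a CSV file’s header, which is exactly the drudgery an editor should automate. The sketch below is a simplification under stated assumptions – the property names are abbreviated, there is no validation, and this is not how the actual Croissant Editor works – but it shows why generating the JSON beats writing it by hand.

```python
import csv
import io
import json

def draft_metadata(name: str, csv_text: str) -> dict:
    """Draft a minimal, Croissant-style description from a CSV header.

    Illustrative sketch only: real Croissant metadata needs a full
    @context, data types, file objects, and more; a proper editor
    would infer or prompt for those details.
    """
    # Read just the header row to discover the column names.
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "@type": "Dataset",
        "name": name,
        "recordSet": [{
            "@type": "cr:RecordSet",
            "@id": name,
            # One Field per CSV column, with spec-style slash-scoped ids.
            "field": [
                {"@type": "cr:Field", "@id": f"{name}/{col}"}
                for col in header
            ],
        }],
    }

# Example: a two-column CSV yields two fields in the record set.
draft = draft_metadata("reviews", "text,label\ngreat movie,1\n")
print(json.dumps(draft, indent=2))
```

A researcher with a custom binary format would still need hand-written extraction logic, which is precisely the limitation the NeurIPS authors ran into.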

[MLC] What does the roadmap for Croissant look like moving forward?

[ES] In general, we are working on extending the metadata format so that we can capture more information. There are a few features that we’ve heard over and over again would be very important. For instance, anything related to the provenance of the data. We’re also working on tools, as Omar mentioned, and that includes technical tools like editors, but also tutorials and accessible material. 

In terms of what I’d like to see personally, as I was saying earlier, we’d love to have more contributors who are working in companies or on some sort of data work – for instance, data labeling. And a topic that we haven’t really touched on is the quality of that metadata. Croissant can solve many problems, but that assumes the metadata is going to be of good quality: detailed and complete. That will almost certainly require some manual effort, from people who publish the data and want to describe it, and maybe even from people who use the data and have gained experience with it. So we’d love to have more datasets featuring excellent Croissant metadata. That will require a campaign to involve many users of, say, the 100 most popular machine learning datasets to come together, join efforts, and create those excellent Croissant records.

[OB] We also have a new effort to describe not just datasets, but also machine learning tasks that use those datasets, so that we go to the next level of interoperability and describe things like machine learning evaluation benchmarks, or even competitions and things like that, and how they relate to datasets. That’s also a very exciting effort that is happening in the working group.

[ES] It’s quite an ambitious agenda – we’ll be busy this year and we invite others to join the working group to continue to shape the future of Croissant!

We invite others to join the Croissant Working Group, contribute to the GitHub repository, and stay informed about the latest updates. You can download the Croissant Editor to implement the Croissant vocabulary on your existing datasets today! Together, we can reduce the data development burden and enable a richer ecosystem of AI and ML research and development.