Croissant: a metadata format for ML-ready datasets – Google Research Blog

Posted by Omar Benjelloun, Software Engineer, Google Research, and Peter Mattson, Software Engineer, Google Core ML and President, MLCommons Association

Machine learning (ML) practitioners often spend a significant amount of time understanding and organizing datasets before they can train a model. The work is slow largely because datasets come in a wide variety of representations, and it holds back progress in the field.

ML datasets encompass various content types like text, structured data, images, audio, and video. Each dataset has its own arrangement of files and data formats, which makes datasets hard to discover and hard to load for training. To address this issue, a new metadata format for ML-ready datasets, called Croissant, has been introduced.

Croissant, developed collaboratively by a community from industry and academia as part of the MLCommons effort, provides a standard way to describe and organize data without changing how the data is represented. It builds upon schema.org, a standard for publishing structured data on the Web, and includes ML-relevant metadata, data organization, and default ML semantics.
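To make the idea of layering ML metadata on top of schema.org more concrete, here is a minimal sketch of what a Croissant-style description might look like when built as a Python dictionary and serialized to JSON-LD. The dataset name, URLs, file name, and field names are hypothetical, and the `@context` is heavily abbreviated; the Croissant specification remains the authority on required properties and on the full context.

```python
import json


def build_metadata() -> dict:
    """Return a simplified, illustrative Croissant-style description.

    Everything dataset-specific here (name, URLs, columns) is made up for
    the example, and the @context is shortened for readability.
    """
    csv_id = "reviews.csv"  # hypothetical file in the dataset

    def column_field(column: str, data_type: str) -> dict:
        # A field states where its values come from: a column of the CSV file.
        return {
            "@type": "cr:Field",
            "@id": f"reviews/{column}",
            "dataType": data_type,
            "source": {
                "fileObject": {"@id": csv_id},
                "extract": {"column": column},
            },
        }

    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "source": "cr:source",
            "fileObject": "cr:fileObject",
            "extract": "cr:extract",
            "column": "cr:column",
            "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
        },
        # schema.org layer: ordinary dataset-level metadata.
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": "toy_reviews",
        "description": "A hypothetical sentiment dataset used for illustration.",
        "url": "https://example.com/toy_reviews",
        # Resource layer: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": csv_id,
                "contentUrl": "https://example.com/toy_reviews/reviews.csv",
                "encodingFormat": "text/csv",
            }
        ],
        # Organization layer: how file contents map to records and fields.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "field": [
                    column_field("text", "sc:Text"),
                    column_field("label", "sc:Integer"),
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_metadata(), indent=2))
```

Note that the CSV file itself is untouched: the description only points at it and says how its columns map to fields, which is the sense in which Croissant describes data without changing how it is represented.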

Major tools and repositories like Kaggle, Hugging Face, and OpenML will start supporting the Croissant format for datasets they host. The Dataset Search tool allows users to search for Croissant datasets, and popular ML frameworks such as TensorFlow, PyTorch, and JAX can easily load Croissant datasets using the TensorFlow Datasets (TFDS) package.

The 1.0 release of Croissant includes a complete specification of the format, example datasets, a Python library for validating and generating Croissant metadata, and a visual editor for creating and inspecting Croissant dataset descriptions.

Supporting Responsible AI (RAI) was a key goal of the Croissant effort, leading to the release of the Croissant RAI vocabulary extension. This extension includes properties for describing important RAI use cases like data life cycle management, data labeling, ML safety, fairness evaluation, and compliance.

Why a shared format for ML data?

Training data largely determines the behavior of an ML model, yet the lack of a common format adds complexity to every step of data-centric ML development. Croissant aims to simplify this process by making datasets easier to discover, easier to work with in data cleaning tools, and easier to load into ML frameworks.

Dataset authors can enhance the value of their datasets by adopting Croissant, which requires minimal effort thanks to available creation tools and platform support.

What can Croissant do today?

In the Croissant ecosystem, users can search for Croissant datasets, download them from major repositories, and load them into their favorite ML frameworks. They can create, inspect, and modify Croissant metadata using the Croissant editor.

Today, users can find Croissant datasets and publish datasets with Croissant metadata to enhance discoverability and reusability.

Future direction

The community is encouraged to support Croissant by providing metadata for datasets, embedding Croissant metadata in dataset web pages, and developing tools that support Croissant datasets. Together, we can reduce the data development burden and foster a richer ML research and development ecosystem.
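Embedding metadata in a dataset web page typically follows the usual schema.org practice of placing JSON-LD inside a `<script type="application/ld+json">` element. The sketch below shows one way this could be done in Python; `embed_jsonld` is a hypothetical helper written for this example, not part of any Croissant tooling, and the metadata passed to it is a toy placeholder.

```python
import json


def embed_jsonld(metadata: dict) -> str:
    """Wrap JSON-LD metadata in a script element for a dataset web page.

    Hypothetical helper for illustration only.
    """
    payload = json.dumps(metadata, indent=2)
    # "\/" is a valid JSON escape for "/", so rewriting "</" keeps the JSON
    # equivalent while ensuring a string value cannot close the element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # Toy placeholder metadata; a real page would embed the full description.
    snippet = embed_jsonld({"@type": "sc:Dataset", "name": "toy_reviews"})
    print(snippet)
```

The resulting snippet can be placed in the page's HTML, where crawlers that understand schema.org markup can pick it up.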

Contributors from various organizations have been instrumental in developing Croissant, and the community is invited to join in contributing to the effort.

Acknowledgements

Croissant was developed by the Dataset Search, Kaggle, and TensorFlow Datasets teams from Google, as part of an MLCommons community working group that also includes contributors from other organizations such as Bayer, Harvard, Hugging Face, and NASA.
