MLPerf Storage
Define and develop the MLPerf Storage benchmarks to characterize performance of storage systems that support machine learning workloads.
Purpose
Storing and processing training data is a crucial part of the machine learning (ML) pipeline. The way we ingest, store, and serve data into ML frameworks can significantly impact the performance of training and inference, as well as resource costs. However, even though data management can pose a significant bottleneck, it has received far less attention and specialization for ML.
The main goal of the MLPerf Storage working group is to create a benchmark that evaluates performance for the most important storage aspects in ML workloads, including data ingestion, training, and inference. Our end goal is to create a storage benchmark for the full ML pipeline which is compatible with diverse software frameworks and hardware accelerators. The benchmark will not require any specific hardware for performing computation.
Creating this benchmark will establish best practices in measuring storage performance in ML, contribute to the design of next generation systems for ML, and help system engineers find the right sizing of storage relative to compute in ML clusters.
Deliverables
- Storage access traces for representative ML applications, from the applications’ perspective. Our initial targets are Vision, NLP, and Recommenders. (Short-term goal)
- Storage benchmark rules for:
  - Data ingestion phase (Medium-term goal)
  - Training phase (Short-term goal)
  - Inference phase (Long-term goal)
  - Full ML pipeline (Long-term goal)
- Flexible generator of datasets:
  - Synthetic workload generator based on analysis of I/O in real ML traces, which is aware of compute think-time. (Short-term goal)
  - Trace replayer that scales the workload size. (Long-term goal)
- User-friendly testing harness that is easy to deploy with different storage systems. (Medium-term goal)
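To illustrate what a synthetic workload that is "aware of compute think-time" might look like, here is a minimal Python sketch. It is not the working group's benchmark code: the function names, parameters, and the reported `compute_fraction` metric are all hypothetical, chosen only for illustration. The idea shown is that the generator reads a batch of samples from storage and then sleeps for a fixed interval in place of real accelerator work, which is why no specific compute hardware is needed. The loop is strictly serial, so it does not model the prefetching that real data loaders use to overlap I/O with compute.

```python
import os
import tempfile
import time


def make_synthetic_dataset(directory, num_files, file_size):
    """Write num_files files of random bytes (hypothetical stand-in for samples)."""
    paths = []
    for i in range(num_files):
        path = os.path.join(directory, f"sample_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(file_size))
        paths.append(path)
    return paths


def run_synthetic_epoch(paths, batch_size, think_time_s):
    """Read files batch by batch, sleeping after each batch to emulate compute."""
    io_time = 0.0
    compute_time = 0.0
    bytes_read = 0
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        t0 = time.perf_counter()
        for path in batch:
            with open(path, "rb") as f:
                bytes_read += len(f.read())
        io_time += time.perf_counter() - t0
        # Emulated accelerator work: sleep for the think-time, compute nothing.
        time.sleep(think_time_s)
        compute_time += think_time_s
    total = io_time + compute_time
    return {
        "bytes_read": bytes_read,
        "io_time_s": io_time,
        "compute_time_s": compute_time,
        # Share of wall time spent in emulated compute rather than waiting on I/O.
        "compute_fraction": compute_time / total if total > 0 else 0.0,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = make_synthetic_dataset(tmp, num_files=8, file_size=4096)
        print(run_synthetic_epoch(files, batch_size=4, think_time_s=0.01))
```

In this sketch, a slower storage system raises `io_time_s` and lowers `compute_fraction` for the same think-time, which is one simple way to relate storage performance to how busy the emulated compute stays.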
Meeting Schedule
Friday January 17, 2025 Weekly – 08:05 – 09:00 Pacific Time
How to Join and Access MLPerf Storage Resources
To sign up for the group mailing list, receive the meeting invite, and access shared documents and meeting minutes:
- Fill out our subscription form and indicate that you’d like to join the MLPerf Storage Working Group.
- Associate a Google account with your organizational email address.
- Once your request to join the Storage Working Group is approved, you’ll be able to access the Storage folder in the Public Google Drive.
To engage in group discussions, join the group’s channels on the MLCommons Discord server.
To access the GitHub repository (public):
- If you want to contribute code, please submit your GitHub ID to our subscription form.
- Visit the GitHub repository.
Storage Working Group Chairs
To contact all Storage working group chairs, email [email protected].