MLCommons

Research Working Group

Science Working Group

Mission

Encourage and support the curation of large-scale experimental and scientific datasets and the engineering of ML benchmarks operating on those datasets.

Purpose

The WG will engage with scientists, academics, and national laboratories and facilities, such as synchrotrons, in securing, engineering, curating, and publishing datasets and machine learning benchmarks that operate on experimental scientific datasets. This will entail working across different domains of science, including materials, life, environmental, and earth sciences, particle physics, and astronomy, to mention a few. We will include traditional observational and computer-generated data.

Although scientific data is widespread, curating, maintaining, and distributing large-scale, useful datasets for public consumption is a challenging process, covering various aspects of data (from FAIR principles to distribution to versioning). With large data products, various ML techniques have to be evaluated against different architectures and different datasets. Without these benchmarking efforts, the community has no clear pathway for utilizing these advanced models. We expect that the collection will have significant tutorial value, as examples from one field and one observational or computational experiment can be modified to advance other fields and experiments.

As ML benchmarks, they can measure conventional ML aspects, such as training time or inference time. In addition, we envisage that the benchmarks will also report domain-specific metrics, which will be useful in assessing various ML techniques for scientific purposes. Note that we will pay strong attention to both similarities (e.g., many scientific datasets are images in some wavelength) and differences (e.g., simulation surrogates have no direct commercial analog) relative to other MLCommons™ benchmarks, such as MLPerf™ Training.
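As a rough illustration of reporting conventional and domain-specific measurements together, the sketch below times training and inference and evaluates one science-oriented metric. It is not an MLCommons API: the names `run_benchmark`, `train_mean_model`, `predict_mean`, and `mean_absolute_error` are hypothetical, and the toy "model" simply predicts the mean of its training data.

```python
import time
from typing import Any, Callable, Dict, Sequence


def run_benchmark(
    train: Callable[[Sequence[float]], Any],
    predict: Callable[[Any, Sequence[float]], Sequence[float]],
    domain_metric: Callable[[Sequence[float], Sequence[float]], float],
    train_data: Sequence[float],
    test_inputs: Sequence[float],
    test_targets: Sequence[float],
) -> Dict[str, float]:
    """Time training and inference, then score predictions with a domain metric."""
    start = time.perf_counter()
    model = train(train_data)
    training_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = predict(model, test_inputs)
    inference_seconds = time.perf_counter() - start

    return {
        "training_seconds": training_seconds,
        "inference_seconds": inference_seconds,
        "domain_metric": domain_metric(predictions, test_targets),
    }


# Hypothetical toy model: it "learns" the mean of the training data.
def train_mean_model(data: Sequence[float]) -> float:
    return sum(data) / len(data)


def predict_mean(model: float, inputs: Sequence[float]) -> Sequence[float]:
    return [model for _ in inputs]


# Stand-in for a domain-specific metric; a real benchmark would use a
# quantity that is meaningful to the science domain in question.
def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


result = run_benchmark(
    train_mean_model,
    predict_mean,
    mean_absolute_error,
    train_data=[1.0, 2.0, 3.0],
    test_inputs=[0.0, 0.0],
    test_targets=[1.0, 3.0],
)
print(result)
```

Keeping the domain metric as a pluggable function means the same timing harness can be reused across fields, with only the scientific scoring swapped out.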

Deliverables

  1. Scientific, experimental datasets (real or simulated)
  2. Benchmarks that operate on these datasets
  3. Mechanisms for using these datasets or benchmarks
  4. Tutorials

Meeting Schedule

Bi-weekly on Wednesday from 8:00-9:00 AM Pacific.

Mailing List

science@googlegroups.com

Working Group Resources

Google Drive

Working Group Chair Emails

Geoffrey Fox (gcfexchange@gmail.com)

Tony Hey (tony.hey@stfc.ac.uk)

Jeyan Thiyagalingam (t.jeyan@stfc.ac.uk)

Working Group Chair Bios

Geoffrey Fox received a Ph.D. in Theoretical Physics from Cambridge University, where he was Senior Wrangler. He is now a distinguished professor of Engineering, Computing, and Physics at Indiana University, where he is the director of the Digital Science Center. He previously held positions at Caltech, Syracuse University, and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory, and Peterhouse College Cambridge. He has supervised 73 Ph.D. students and published around 1500 papers (550 with at least ten citations) in physics and computing, with an h-index of 82 and over 39000 citations. He received the High-Performance Parallel and Distributed Computing (HPDC) Achievement Award and the ACM-IEEE CS Ken Kennedy Award for foundational contributions to parallel computing in 2019. He is a Fellow of APS (Physics) and ACM (Computing) and works on the interdisciplinary interface between computing and applications. He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science.

CV

Tony Hey began his career as a theoretical physicist with a doctorate in particle physics from the University of Oxford. After a career in physics that included research positions at Caltech, MIT and CERN, and a professorship at the University of Southampton, he became interested in parallel computing and moved into computer science. His research group pioneered distributed-memory message-passing computers, which are now the standard architecture for supercomputing systems. His group produced the first distributed-memory ‘Genesis’ benchmark suite for message-passing parallel computers. He also initiated a US-European standardization process for MPI, the now universally accepted message-passing interface. In addition, he wrote up Nobel Prize winner Richard Feynman’s ‘Lectures on Computation’, in which Feynman proposed the idea for a quantum computer.

Tony was both Head of Department and Dean of Engineering at Southampton before leaving to lead the U.K.’s ground-breaking eScience initiative in 2001. He recognized the importance of Big Data for science and wrote one of the first papers on the Data Deluge in 2003. In 2005 Tony joined Microsoft as a Vice-President and was responsible for Microsoft’s global university research engagements. He worked with Turing Award winner Jim Gray on applying computer science technologies to science and edited the much-cited ‘Fourth Paradigm: Data-Intensive Scientific Discovery’ in tribute to Jim. In 2014 he became a Senior Data Science Fellow at the eScience Institute, University of Washington before returning to the UK in 2015.

Tony is now Chief Data Scientist at the Rutherford Appleton Laboratory and founded the ‘Scientific Machine Learning’ group at the Lab. The group is working with the Alan Turing Institute in applying machine learning technologies to the ‘Big Scientific Data’ generated by the Diamond Synchrotron and the CryoEM facilities. In 2020, he chaired a US DOE subcommittee that proposed a new Office of Science initiative in ‘AI for Science’. He is a fellow of the Association for Computing Machinery, the American Association for the Advancement of Science, and the Royal Academy of Engineering. Tony is the author of several books including ‘The Computing Universe: A Journey through a Revolution’ published in 2015. In 2005, he was awarded a CBE by Prince Charles for his services to science.

CV

Jeyan Thiyagalingam received his PhD degree in Computer Science from Imperial College London in 2005. Currently, he is the head of the Scientific Machine Learning (SciML) research group at the Science and Technology Facilities Council, Rutherford Appleton Laboratory (STFC-RAL), UK. The remit of the group is AI for Science, particularly focussed on the UK’s largest experimental facilities around the Harwell Campus. Before joining STFC-RAL, he was an academic at the University of Liverpool, and previously held positions at MathWorks UK and at the University of Oxford. His research interests include machine learning, as well as machine learning and signal processing for science. He is a Fellow of the British Computer Society and a senior member of the IEEE. He has served as an Associate Editor for the journals Concurrency and Computation: Practice and Experience and SoftwareX. He has also been a technical program committee (TPC) member for a number of conferences.

CV