The AI Safety Ecosystem Needs Standard Benchmarks

One of the management guru Peter Drucker’s most over-quoted turns of phrase is “what gets measured gets improved.” But it’s over-quoted for a reason: It’s true.

Nowhere is it truer than in technology over the past 50 years. Moore’s law—which predicts that the number of transistors (and hence compute capacity) in a chip would double every 24 months—has become a self-fulfilling prophecy and north star for an entire ecosystem. Because engineers carefully measured each generation of manufacturing technology for new chips, they could select the techniques that would move toward the goals of faster and more capable compute. And it worked: Computing power, and more impressively computing power per watt or per dollar, has grown exponentially in the past five decades. The latest smartphones are more powerful than the fastest supercomputers from the year 2000.

Measurement of performance, though, is not limited to chips. All the parts of our computing systems today are benchmarked—that is, compared to similar components in a controlled way, with quantitative score assessments. These benchmarks help drive innovation.

And we would know.

As leaders in the field of AI, from both industry and academia, we build and deliver the most widely used performance benchmarks for AI systems in the world. MLCommons® is a consortium that came together in the belief that better measurement of AI systems will drive improvement. Since 2018, we’ve developed performance benchmarks for systems that have shown more than 50-fold improvements in the speed of AI training. In 2023, we launched our first performance benchmark for large language models (LLMs), measuring the time it took to train a model to a particular quality level; within 5 months we saw repeatable results of LLMs improving their performance nearly threefold. Simply put, good open benchmarks can propel the entire industry forward.

Read the rest of the MLCommons working group authored blog: MLCommons has made benchmarks for AI performance—now it’s time to measure safety on IEEE Spectrum .