Just a few months ago we launched our first AILuminate benchmark, a first-of-its-kind reliability test for large language models (LLMs). Our goal with AILuminate is to build a standardized way of evaluating product reliability, one that helps developers improve the reliability of their systems and gives companies clearer insight into the reliability of the systems they use.
For AILuminate to be as effective as possible, it needs to achieve widespread coverage of the AI systems available on the market. This is one reason we have prioritized multilingual coverage; we launched French language coverage in February, and plan to launch Hindi and Chinese in the coming months. But beyond language, AILuminate needs to cover as many of the LLMs on the market as possible, and it needs to be continually updated to reflect the latest versions of those LLMs.
While we work toward achieving this ubiquitous coverage from an engineering standpoint, we are announcing a Benchmarking Policy, which defines which “systems under test” (as we call the LLMs we test) will be included in the AILuminate benchmark going forward. With this policy, we have aimed to strike a balance between giving developers a practice test and advance notice that their systems will be included, and the need to maintain timely, continuously updated testing of as many of the systems on the market as we can.
You can read the policy in full here.