Sunday, August 4, 2024

The Hidden Flaws in AI Benchmark Scores: A Call for Independent Evaluation

Meta's claim that Llama 3.1 rivals the best AI models rests on benchmark scores that deserve scrutiny, because self-assessment is the norm in the industry. Developing new, more challenging benchmarks, and having independent parties administer them, is crucial to accurately measuring the real-world capabilities and safety of AI models as they continue to evolve.

How can we trust the claims made by AI model-makers when they are the ones grading their own performance? When Meta announced its latest open-source large language model (LLM), Llama 3.1, on July 23, 2024, it claimed the model had "state-of-the-art capabilities" that could rival the best closed-source models like GPT-4o and Claude 3.5 Sonnet. Meta backed up these claims with a table showcasing the scores achieved by these models on popular benchmarks such as MMLU, GSM8K, and GPQA. For instance, Llama 3.1 scored 88.6% on the MMLU benchmark, while GPT-4o scored 88.7% and Claude 3.5 Sonnet scored 88.3%. But can we really trust these numbers?

Having accurate and reliable benchmarks for AI models is crucial, not just for the firms making them but for the entire AI community and consumers. Benchmarks "define and drive progress," providing a standard against which model-makers can measure their achievements and motivating them to improve, as noted by Percy Liang of Stanford University's Institute for Human-Centered Artificial Intelligence. Benchmarks map the field's overall progress, show how AI systems compare with humans on specific tasks, and guide users in choosing the right model for their needs. Yet, as Dr. Clémentine Fourrier of Hugging Face warns, benchmark scores should be taken with a pinch of salt: model-makers often use them to hype their products and company valuations, and they do not always align with real-world performance.

One fundamental issue with benchmarks like MMLU (Massive Multitask Language Understanding) is that they are simply too easy for today's models. Created in 2020, MMLU consists of 15,908 multiple-choice questions across 57 topics, including math, American history, science, and law. At the time, most language models scored little better than 25%, the accuracy one would get by guessing randomly among four options; OpenAI's GPT-3 did best with a score of 43.9%. As models have improved, however, the best now score between 88% and 90%. This saturation makes it difficult to draw meaningful distinctions between models, akin to grading high-school students on middle-school tests.
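To make the arithmetic concrete, here is a minimal, hypothetical sketch of how a multiple-choice benchmark is scored (not MMLU's actual evaluation code): accuracy is simply the fraction of correct answers, and with four options per question, random guessing already lands at 25%.

```python
import random

def random_baseline(num_choices: int) -> float:
    """Expected accuracy from guessing uniformly at random."""
    return 1.0 / num_choices

def score(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions answered correctly -- the figure that
    MMLU-style leaderboards report."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Toy illustration: with four options the chance baseline is 25%, so
# early models at ~26-44% had learned only a little, while today's
# 88-90% scores crowd against the ceiling and stop being informative.
questions = [{"choices": ["A", "B", "C", "D"], "answer": "C"} for _ in range(1000)]
guesses = [random.choice(q["choices"]) for q in questions]
print(random_baseline(4))                                # 0.25
print(score(guesses, [q["answer"] for q in questions]))  # roughly 0.25
```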

To address this, more difficult benchmarks have been developed. MMLU-Pro has tougher questions and ten possible answers instead of four, which lowers the random-guess baseline from 25% to 10%. GPQA (Graduate-Level Google-Proof Q&A) is like MMLU at a PhD level, covering selected science topics; the best models score between 50% and 60%. Another benchmark, MuSR (Multistep Soft Reasoning), tests reasoning ability using scenarios such as murder mysteries. Whereas a person would combine an understanding of motivation, language comprehension, and logical deduction to solve these, AI models struggle with such "soft reasoning" over multiple steps and often score no better than random.

MMLU also illustrates other significant problems with benchmarks. One is the accuracy of the answers themselves. A study led by Aryo Gema of the University of Edinburgh found that 57% of MMLU's virology questions and 26% of its logical-fallacy questions contained errors, some with no correct answer or more than one. Cleaning up these errors led to the creation of a new benchmark, MMLU-Redux.

Another issue is "contamination": LLMs are trained on data scraped from the internet, which may include the exact questions and answers used in benchmarks like MMLU. A model may therefore inadvertently (or intentionally) cheat, having seen the test in advance. Some model-makers may even deliberately train on benchmark data to boost scores that then fail to reflect the model's true ability. Private benchmarks with secret questions, like GPQA, can help mitigate this, but they restrict independent verification of scores.
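One way evaluators look for this kind of leakage is to check whether long word n-grams from a test question appear verbatim in the training corpus; GPT-3's decontamination used 13-gram overlap, for example. The sketch below is a simplified, hypothetical version of that heuristic, not any lab's actual pipeline, and real decontamination is considerably more elaborate.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams; 13-gram overlap is a commonly used
    contamination heuristic."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_question: str, training_doc: str, n: int = 13) -> bool:
    """Flag a benchmark item if any of its word n-grams appears verbatim
    in a training document."""
    return bool(ngrams(benchmark_question, n) & ngrams(training_doc, n))

# Usage: in practice each test question is checked against (a sample of)
# the training corpus; a shorter n is used here only because the toy
# strings are short.
benchmark_q = "What is the capital of Australia? (A) Sydney (B) Canberra (C) Melbourne (D) Perth"
clean_doc = "Canberra was chosen as Australia's capital as a compromise between Sydney and Melbourne."
leaked_doc = "Quiz answers: What is the capital of Australia? (A) Sydney (B) Canberra ..."
print(is_contaminated(benchmark_q, clean_doc, n=8))   # False
print(is_contaminated(benchmark_q, leaked_doc, n=8))  # True
```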

Even small changes in how questions are posed to models can significantly affect scores. For example, in a multiple-choice test, asking a model to state the answer directly or to reply with the corresponding letter or number can yield different results, affecting reproducibility and comparability.
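A hypothetical illustration of the point: the two templates below pose the same question, but a model's measured accuracy can differ depending on which one the evaluator chooses.

```python
QUESTION = "Which planet has the most moons?"
CHOICES = ["Venus", "Earth", "Mars", "Saturn"]

def prompt_letter(question: str, choices: list[str]) -> str:
    """Template 1: ask for the letter of the correct option."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{lettered}\nAnswer with a single letter (A-D):"

def prompt_freeform(question: str, choices: list[str]) -> str:
    """Template 2: ask the model to state the answer directly."""
    return f"{question} Choose one of: {', '.join(choices)}. Answer:"

# The same model, graded on the same question, can give different answers
# (and be scored differently) depending on which template the evaluator
# picks -- which is why harnesses pin the prompt format down exactly.
print(prompt_letter(QUESTION, CHOICES))
print(prompt_freeform(QUESTION, CHOICES))
```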

To ensure consistent testing, automated evaluation systems are now used to run models against benchmarks under identical conditions. Dr. Liang's team at Stanford developed HELM (Holistic Evaluation of Language Models), which generates leaderboards showing how a range of models perform across various benchmarks. Hugging Face's Open LLM Leaderboard does the same for open-source models, using EleutherAI's LM Evaluation Harness to run the tests. Scores produced by these standardized systems are more trustworthy than self-reported numbers from model-makers.
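The value of a shared harness is that every model is scored under identical conditions. Below is a minimal sketch of the idea, with stand-in models rather than the actual HELM or EleutherAI code, purely to show why a fixed prompt template and grading rule make scores comparable.

```python
from typing import Callable

# A "model" here is anything that maps a prompt string to a completion.
Model = Callable[[str], str]

def evaluate(models: dict[str, Model], benchmark: list[dict]) -> dict[str, float]:
    """Score every model on the same questions, with the same prompt
    template and the same grading rule, so the results are comparable."""
    results = {}
    for name, model in models.items():
        correct = 0
        for item in benchmark:
            prompt = f"{item['question']}\nAnswer with a single letter:"
            if model(prompt).strip().upper().startswith(item["answer"]):
                correct += 1
        results[name] = correct / len(benchmark)
    return results

# Usage with stand-in models; a real harness would plug in API calls or
# locally hosted LLMs instead of these lambdas.
benchmark = [{"question": "2 + 2 = ?  A. 3  B. 4", "answer": "B"}]
models = {"always_b": lambda _: "B", "always_a": lambda _: "A"}
print(evaluate(models, benchmark))  # {'always_b': 1.0, 'always_a': 0.0}
```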

As AI models gain new skills, new benchmarks are being developed. GAIA tests models on real-world problem-solving, with some answers kept secret to avoid contamination. NoCha (Novel Challenge), announced in June 2024, is a long-context benchmark whose questions concern recently published novels and are therefore unlikely to appear in training data. Other benchmarks assess models' abilities in biology or their tendency to hallucinate.

Creating new benchmarks is expensive, often requiring human experts to craft detailed questions and answers. One way to cut costs is to use LLMs themselves to generate new benchmarks. Dr. Liang's AutoBencher project does this by extracting questions and answers from source documents and identifying the hardest ones. Anthropic, the startup behind the Claude LLM, has begun funding benchmark creation with a focus on AI safety. Logan Graham, an Anthropic researcher, noted the urgent need for benchmarks that assess dangerous capabilities, such as building cyber-attack tools or giving harmful advice. On July 1, 2024, Anthropic invited proposals for new benchmarks, which it aims to make publicly available.
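The gist of generating benchmarks with LLMs can be sketched roughly as follows. The `generator` and `candidates` callables are hypothetical stand-ins, and this is a loose illustration of the approach rather than AutoBencher's actual pipeline: draft questions from trusted documents, then keep the ones current models find hardest.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-in, text-out helper

def build_benchmark(generator: LLM, candidates: list[LLM],
                    documents: list[str], keep_hardest: int = 100) -> list[dict]:
    """Draft Q&A pairs from source documents with one LLM, then keep the
    questions that most candidate models get wrong -- the general idea
    behind benchmark-generation approaches such as AutoBencher."""
    items = []
    for doc in documents:
        draft = generator("Write one factual question, then a line starting "
                          "with 'Answer:' giving its answer, based on:\n" + doc)
        question, _, answer = draft.partition("\nAnswer:")
        wrong = sum(answer.strip().lower() not in model(question).lower()
                    for model in candidates)
        items.append({"question": question.strip(), "answer": answer.strip(),
                      "difficulty": wrong / max(len(candidates), 1)})
    # Hardest questions first; keep the top slice as the new benchmark.
    return sorted(items, key=lambda x: x["difficulty"], reverse=True)[:keep_hardest]
```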

Historically, AI benchmarks were devised by academics. But as AI is commercialized and used in various fields, there's a growing need for reliable, specific benchmarks. Startups specializing in AI benchmarks are emerging, aiming to provide the tools necessary for assessing AI capabilities. With these advancements, the era of AI labs marking their own homework may soon end.

In plain terms, while benchmarks are vital for measuring AI progress and capabilities, consumers should remain skeptical of self-reported scores from model-makers. Are the benchmarks truly reflective of real-world performance? Are the scores free from contamination and biases? As the AI field evolves, independent, rigorous testing and new benchmarks will be crucial to ensuring trustworthy and accurate evaluations of AI models.
