Meta's claim that Llama 3.1 rivals the best AI models rests on benchmark scores, and in an industry where firms grade their own work, such claims deserve scrutiny. Developing new, more challenging benchmarks is crucial to accurately measuring the real-world capabilities and safety of AI models as they continue to evolve.
How can we trust the claims made by AI model-makers when they are the ones grading their own performance? When Meta announced its latest open-source large language model (LLM), Llama 3.1, on July 23, 2024, it claimed the model had "state-of-the-art capabilities" that could rival the best closed-source models like GPT-4o and Claude 3.5 Sonnet. Meta backed up these claims with a table comparing the scores these models achieved on popular benchmarks such as MMLU, GSM8K, and GPQA. For instance, Llama 3.1 scored 88.6% on the MMLU benchmark, while GPT-4o scored 88.7% and Claude 3.5 Sonnet scored 88.3%. But can we really trust these numbers?
Having
accurate and reliable benchmarks for AI models is crucial, not just for the
firms making them but for the entire AI community and consumers. Benchmarks
"define and drive progress," providing a standard against which
model-makers can measure their achievements and motivating them to improve, as
noted by Percy Liang from Stanford University's Institute for Human-Centered
Artificial Intelligence. Benchmarks help map the field's overall progress,
demonstrate how AI systems compare with human capabilities in specific tasks,
and guide users in choosing the appropriate model for their needs. Yet, as Dr.
Clémentine Fourrier from Hugging Face warns, these benchmark scores should be
taken with a pinch of salt. Model-makers often use these scores to hype their products and company valuations, even though the numbers may not align with real-world performance.
One
fundamental issue with benchmarks like MMLU (Massive Multitask Language Understanding) is that they are simply too easy for today's models. Created in 2020, MMLU consists of 15,908 multiple-choice questions, each with four answer options, across 57 topics,
including math, American history, science, and law. At the time, most language
models scored little better than 25%, which is what one would get by guessing
randomly. OpenAI's GPT-3 did best with a score of 43.9%. However, as models
have improved, the best now score between 88% and 90%. This saturation makes it
difficult to draw meaningful distinctions between models, akin to grading
high-school students on middle-school tests.
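To get a concrete sense of what the benchmark looks like, here is a minimal sketch that pulls a few MMLU items from the copy commonly hosted on the Hugging Face Hub; the dataset id cais/mmlu and the field names below are assumptions based on that public copy and may differ between mirrors or versions.

```python
# Minimal sketch: peek at a few MMLU items to see the four-choice format.
# Assumes the public Hugging Face copy at "cais/mmlu"; the dataset id and
# field names may differ between mirrors or versions.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
print(len(mmlu))  # roughly 14,000 test questions spanning 57 subjects

for item in mmlu.select(range(3)):
    print(item["subject"], "|", item["question"])
    for i, choice in enumerate(item["choices"]):
        print(f"  {chr(65 + i)}. {choice}")
    print("  correct:", chr(65 + item["answer"]))

# With four options per question, random guessing scores about 25%, which is
# why scores of 88-90% leave little room to tell top models apart.
```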
To
address this, more difficult benchmarks have been developed. For instance,
MMLU-Pro has tougher questions with ten possible answers instead of four, which lowers the random-guessing baseline from 25% to 10%. GPQA (Graduate-Level Google-Proof Q&A) is like MMLU at PhD level on selected science topics, with the best models scoring between 50% and 60%. Another
benchmark, MuSR (Multi-step Soft Reasoning), tests reasoning ability using
scenarios like murder mysteries. While a person might combine understanding
motivation, language comprehension, and logical deduction to solve these, AI
models struggle with such "soft reasoning" over multiple steps, often
scoring no better than random.
MMLU
also highlights other significant problems. One is the accuracy of the answers
in these tests. A study conducted by Aryo Gema of the University of Edinburgh
found that 57% of MMLU's virology questions and 26% of its logical-fallacy
questions contained errors, some with no correct answer or more than one
correct answer. This led to the creation of a new benchmark, MMLU-Redux, after
cleaning up the errors.
Another
issue is "contamination," where LLMs are trained on data from the
internet that may include exact questions and answers from benchmarks like
MMLU. Such models have, in effect, seen the test in advance, whether inadvertently or by design. Some model-makers may even train deliberately on benchmark data to boost scores that then overstate a model's true ability.
Private benchmarks with secret questions, like GPQA, can help mitigate this,
but they restrict independent verification of scores.
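One simple way to probe for this kind of leakage is to check whether benchmark questions appear, verbatim or as long word overlaps, in a training corpus. Below is a minimal sketch of such a check; the toy corpus, the example questions, and the 8-word overlap threshold are illustrative assumptions, and real decontamination pipelines are considerably more sophisticated.

```python
# Minimal sketch of a contamination check: flag benchmark questions whose
# text overlaps a training corpus. The toy corpus, questions, and the 8-word
# threshold are illustrative; real decontamination pipelines are far more
# elaborate (fuzzy matching, paraphrase detection, canary strings, etc.).

def ngrams(text: str, n: int = 8) -> set[str]:
    """All n-word sequences in the text, lower-cased."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(questions: list[str], corpus: str, n: int = 8) -> list[str]:
    """Return the questions that share at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus, n)
    return [q for q in questions if ngrams(q, n) & corpus_grams]

if __name__ == "__main__":
    corpus = (
        "scraped web page ... which organelle is known as the powerhouse "
        "of the cell? answer: the mitochondrion ... more scraped text"
    )
    questions = [
        "Which organelle is known as the powerhouse of the cell?",
        "What is the standard enthalpy of formation of liquid water at 298 K?",
    ]
    print(flag_contaminated(questions, corpus))  # only the first is flagged
```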
Even
small changes in how questions are posed to models can significantly affect
scores. For example, in a multiple-choice test, asking a model to state the
answer directly or to reply with the corresponding letter or number can yield
different results, affecting reproducibility and comparability.
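To see how small a change can matter, consider the two phrasings of the same question sketched below; the toy question and the naive scoring function are illustrative, but they show how a model's score can hinge on how its reply is parsed rather than on what it knows.

```python
# Sketch: two ways of posing the same multiple-choice item. The toy question
# and the naive scorer are illustrative; in a real evaluation each prompt
# would be sent to the model under test and its reply parsed for scoring.

QUESTION = "Which planet in the Solar System is the largest?"
OPTIONS = ["Mercury", "Jupiter", "Earth", "Mars"]

# Variant 1: ask for the letter of the correct option.
prompt_letter = (
    QUESTION + "\n"
    + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(OPTIONS))
    + "\nReply with a single letter:"
)

# Variant 2: ask the model to name the answer directly.
prompt_free_text = f"{QUESTION} Choose one of: {', '.join(OPTIONS)}. Answer:"

def is_correct(reply: str) -> bool:
    """Naive scorer: accepts a bare 'B' or the word 'Jupiter', but a verbose
    reply like 'The answer is B.' is marked wrong -- the parsing rule, not
    the model's knowledge, decides the score."""
    reply = reply.strip().lower()
    return reply.startswith("b") or "jupiter" in reply

print(prompt_letter)
print(prompt_free_text)
print(is_correct("B"), is_correct("Jupiter"), is_correct("The answer is B."))
```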
To
ensure consistent testing, automated systems are now used to evaluate models
against benchmarks. Dr. Liang's team at Stanford developed HELM (Holistic
Evaluation of Language Models), generating leaderboards that show model
performance across various benchmarks. EleutherAI's Language Model Evaluation Harness does a similar job and underpins Hugging Face's Open LLM Leaderboard for open-source models. Scores produced by such standardized systems are more trustworthy than numbers self-reported by model-makers.
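At its core, such a harness applies one fixed prompt template to every question, parses every model's replies the same way, and reports aggregate accuracy. The sketch below illustrates that idea in miniature; the stand-in model and toy items are assumptions, and real systems like HELM and the Evaluation Harness do this at far larger scale across many tasks and prompt settings.

```python
# Minimal sketch of what an evaluation harness does: one fixed prompt
# template, one fixed answer parser, aggregate accuracy. The stand-in model
# and toy items are illustrative; HELM and EleutherAI's harness do the same
# thing at scale across dozens of benchmarks and prompt settings.

def evaluate(model_fn, items) -> float:
    """Score a model (a prompt -> reply callable) on multiple-choice items."""
    correct = 0
    for item in items:
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(item["choices"]))
            + "\nAnswer with a single letter:"
        )
        reply = model_fn(prompt).strip().upper()
        if reply[:1] == chr(65 + item["answer"]):
            correct += 1
    return correct / len(items)

def always_b(prompt: str) -> str:
    """Stand-in 'model' that always answers B, for demonstration only."""
    return "B"

if __name__ == "__main__":
    toy_items = [
        {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "22"], "answer": 1},
        {"question": "Capital of France?", "choices": ["Paris", "Lyon", "Nice", "Lille"], "answer": 0},
    ]
    print(evaluate(always_b, toy_items))  # 0.5 -- right on one of two items
```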
As
AI models gain new skills, new benchmarks are being developed. GAIA tests
models on real-world problem-solving, with some answers kept secret to avoid
contamination. NoCha (Novel Challenge), announced in June 2024, is a long-context benchmark with questions about recently published novels, unlikely to be in
training data. Other benchmarks assess models' abilities in biology or their
tendency to hallucinate.
Creating
new benchmarks is expensive, often requiring human experts to develop detailed
questions and answers. An innovative approach is using LLMs themselves to
generate new benchmarks. Dr. Liang's AutoBencher project does this by extracting questions and answers from source documents and identifying the hardest ones; a rough sketch of the idea follows this paragraph. Anthropic, the startup behind the Claude LLM, has begun funding
new benchmark creation with a focus on AI safety. Logan Graham, an Anthropic
researcher, noted the urgent need for benchmarks assessing AI models' safety
capabilities, like developing cyber-attack tools or giving dangerous advice. On
July 1, 2024, Anthropic invited proposals for new benchmarks, aiming to make
them publicly available.
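The AutoBencher idea mentioned above can be sketched in rough outline (this is the general shape of the approach, not the project's actual code): ask a generator LLM to draft question-answer pairs from a source document, quiz a candidate model on them, and keep the questions it gets wrong. Both helper functions below are placeholders for real LLM calls.

```python
# Rough sketch of LLM-generated benchmarking in the spirit of AutoBencher
# (not its actual code): draft Q&A pairs from a document, then keep the
# questions a candidate model gets wrong. Both helpers below are placeholders
# for real LLM API calls.

def generate_qa_pairs(document: str) -> list[dict]:
    """Placeholder: a generator LLM would draft questions grounded in the document."""
    return [{"question": "In which year was the Peace of Westphalia signed?",
             "answer": "1648"}]

def candidate_answer(question: str) -> str:
    """Placeholder: the model being benchmarked would answer here."""
    return "1658"

def hardest_questions(document: str) -> list[dict]:
    """Keep only the questions the candidate model answers incorrectly."""
    pairs = generate_qa_pairs(document)
    return [p for p in pairs if candidate_answer(p["question"]).strip() != p["answer"]]

if __name__ == "__main__":
    print(hardest_questions("... source text about seventeenth-century treaties ..."))
```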
Historically,
AI benchmarks were devised by academics. But as AI is commercialized and used
in various fields, there's a growing need for reliable, specific benchmarks.
Startups specializing in AI benchmarks are emerging, aiming to provide the
tools necessary for assessing AI capabilities. With these advancements, the era
of AI labs marking their own homework may soon end.
In
plain terms, while benchmarks are vital for measuring AI progress and
capabilities, consumers should remain skeptical of self-reported scores from
model-makers. Are the benchmarks truly reflective of real-world performance?
Are the scores free from contamination and biases? As the AI field evolves,
independent, rigorous testing and new benchmarks will be crucial to ensuring
trustworthy and accurate evaluations of AI models.