Retro-Future Columnist
The more AI models are updated, the more dazzling the numbers become. Yet this dazzle often hides the tactile feel of actually using them. Discussing performance through single tests like MMLU makes progress easier to gauge, but it blurs practical qualities such as the naturalness of conversation, the handling of long texts, tool integration, and safety.[3][6] Rising benchmark scores and the feeling that work has become easier are not found in the same place.
Stanford CRFM’s HELM framework has given this discomfort an institutional voice.[1][3] HELM promotes multifaceted evaluation that covers not just accuracy but also calibration, robustness, fairness, toxicity, and efficiency, and it explicitly holds that models can’t be judged by a single score.[3][10] In another domain, the image-focused HEIM showed that no model excels in every aspect.[3][5] The notion of an "ultimate" AI can never be confined to a single chart.
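To make one of those extra axes concrete, here is a minimal sketch of expected calibration error, a common way to measure the gap between how confident a model claims to be and how often it is actually right. This is not HELM's own implementation; the function name, the ten-bin default, and the two toy models are assumptions chosen purely for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many answers it holds.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Two hypothetical models (made-up values), both right on 3 of 4 questions.
correct = [True, True, True, False]
overconfident = [0.99, 0.99, 0.99, 0.99]
well_calibrated = [0.75, 0.75, 0.75, 0.75]

ece_overconfident = expected_calibration_error(overconfident, correct)
ece_calibrated = expected_calibration_error(well_calibrated, correct)
print(ece_overconfident, ece_calibrated)
```

On an accuracy chart the two toy models are indistinguishable; only the second number separates the one that knows when it might be wrong from the one that does not.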
Yet companies still put numbers front and center. OpenAI’s public technical report for GPT-4 shows gains on key capability benchmarks while separately noting limits and failures.[7][11] Anthropic’s Claude 4 announcement annotates its reported results and even distinguishes whether extended reasoning was used.[2] Google’s Gemini reporting likewise conveys the premise that benchmarks and actual use are not the same.[6] Companies compete on numbers not just to show off, but because they must sell comparability itself in a market without clear yardsticks.
This reveals a dynamic in which research and sales share the same desk. Annual reports like the AI Index quietly record this ongoing competition, which is as much a means of explaining performance to investors, developers, and procurement specialists as it is a matter of technical progress.[6][8] For companies, benchmarks are both instruments that indicate model performance and markers that attract funding. That’s why scores keep updating, headlines get shorter, and comparison tables multiply.
However, questioning benchmarks does not mean stopping evaluations. Quite the opposite: evaluations that can’t explain what they measure struggle to hold up in practical judgment. Tasks like code generation, maintaining long contexts, handling in-house data, and respecting safety boundaries cannot be fully seen through typical academic tests alone.[2][4][6] Anthropic’s reporting on Claude 4, by foregrounding safety and real-world evaluations, shifts the focus from measuring a model’s cleverness to observing how it breaks.[2][4] Here lie the outlines of a new evaluation culture.
On the other hand, verifying which comparisons are truly fair remains difficult. Benchmarks with the same name may differ in preprocessing or settings across companies, and if test items have leaked into the training data, the numbers reflect echoes of memory rather than real skill.[9][10] Around Claude 4, too, published safety research has sparked discussion of benchmark contamination and of how the very act of creating good tests may produce new biases.[9] What’s needed here isn’t assertion but disclosure of reproducible evaluation conditions: what tools are used, how measurements are taken, and where external verification is possible.
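As a rough picture of what a contamination check can look like, the sketch below measures how many of a test item's word n-grams also appear in a sample of training text. It is a deliberately simple heuristic under stated assumptions, not the procedure any of the companies above has published; the function names, the whitespace tokenization, and the toy sentences are all hypothetical.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy data: one "training" document and two candidate test items.
corpus = ["the quick brown fox jumps over the lazy dog near the river"]
leaked = "quick brown fox jumps over the lazy dog"
fresh = "a slow green turtle swims under the busy bridge"

leaked_score = overlap_ratio(leaked, corpus, n=3)
fresh_score = overlap_ratio(fresh, corpus, n=3)
print(leaked_score, fresh_score)
```

A high ratio does not prove memorization and a low one does not rule it out, since paraphrased leaks slip past exact matching. That is part of why disclosed, repeatable conditions matter more than any single check.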
This issue is closely tied to the habits of tech journalism. With every new model, headlines lean toward comparison, and score fluctuations easily become news. Yet what users want is not rankings but responses that don’t disrupt workflows and conversations that don’t grow tiresome over long use. The feeling that "AI no longer feels like software. It feels like atmosphere." emerges not from a performance chart, but from the air inside daily workspaces.[5][6] Benchmarks cannot fully capture that atmosphere.
So why can’t companies stop the race? The answer is simple: numbers communicate easily in the market. They serve as a common language for researchers, persuasive material for sales, and evidence of growth curves for investors.[6][8] The more useful this convenience becomes, the more the values users truly feel recede into the background. Natural responses, low hallucination rates, perseverance on long tasks, accountability, and a tangible sense of safety all bleed away into a single score. That’s why there shouldn’t be just one number to read going forward.
Beyond a model’s score, we want to know under what conditions it was measured, what failures lurk in the appendices, and how much real-world evaluation is publicly available. Benchmarks can serve as lighthouses pointing toward AI’s future, but on foggy nights their light can mislead us about how close we are.[1][3][6] What we should study next is not the rankings themselves, but the principles behind evaluation design.[1][2][6][9]
References
Small numbered tags in the article body point to the sources below.
- [1] AI21 Labs: Jurassic-2
- [2] Introducing Claude 4 - Anthropic
- [3] Holistic Evaluation of Language Models (HELM)
- [4] Claude 4 and Anthropic's bet on code - by Nathan Lambert
- [5] Holistic Evaluation of Language Models (HELM)
- [6] [PDF] Technical Performance - Stanford HAI
- [7] Peer review of GPT-4 technical report and systems card
- [8] HELM Capabilities - Stanford CRFM
- [9] The Claude 4 System Card is a Wild Read - by Charlie Guo
- [10] HELM: Holistic Evaluation of Language Models - VerifyWise
- [11] GPT-4 Release: Briefing on Model Improvements and Limitations