AI benchmarks are broken. Here’s what we need instead.

Foto: MIT Tech Review
Modern AI systems are achieving scores close to 90% on tests that were considered impossible for machines to pass just a few years ago; however, these impressive figures are increasingly detached from reality. Traditional benchmarks, such as MMLU, have fallen victim to "data contamination"—models are trained on test question sets available online, meaning that instead of solving problems, they simply recite memorized answers. Consequently, high marks in rankings do not translate into the actual utility of the tools in daily work. For end users, this means growing information chaos. Selecting the appropriate Large Language Model (LLM) based on raw statistics is becoming unreliable, as models optimized for rankings often fail at non-standard, creative tasks. Experts from MIT Technology Review point out that the industry urgently needs to transition to dynamic tests that evolve alongside technology and evaluate AI in "human-in-the-loop" scenarios. Instead of static spreadsheets, the future of artificial intelligence assessment will rely on the subjective perceptions of testers and the systems' ability to handle entirely new problems not present on the internet. Without this shift, benchmarks will remain merely a marketing shell that tells us nothing about the true intelligence of machines.
For decades, the foundation of artificial intelligence assessment has been a simple, almost seductive question: can a machine outperform a human? From historic victories in chess, through solving complex mathematical problems, to writing essays and generating code – the performance of AI models is measured almost exclusively in contrast to individual human capabilities. This anthropocentric testing model, while intuitive, is currently becoming the biggest brake on the development of technology that we do not understand as well as we think.
Current benchmarks are fundamentally flawed because they attempt to enclose the multidimensional nature of AI systems within the narrow framework of isolated problems. When OpenAI announces the results of the GPT-4 model in bar exams, and Google boasts of Gemini's success in MMLU (Massive Multitask Language Understanding) tests, we only receive a slice of reality. These numbers tell us nothing about how these systems will behave in a dynamic, unpredictable work environment where tasks do not have a single, correct answer recorded in a test key.
The illusion of the human measure in the world of algorithms
The traditional approach to AI evaluation is based on static comparison. If the Claude 3 Opus model from Anthropic scores higher on a coding test than the average programmer, the industry declares success. The problem is that machines do not think or work like humans. AI does not possess "common sense" or the contextual understanding of the world that allows a human to improvise when task conditions change. Benchmarks focus on the final result, completely ignoring the process of arriving at it and the stability of that solution.
Read also
Modern Large Language Models (LLMs) are masters of pattern recognition, which makes standard tests increasingly easy for them to "hack." A phenomenon known as data contamination occurs when questions from popular benchmarks end up in the models' training sets. As a result, the AI does not solve the problem through intelligence, but through memory – reproducing answers it has already seen. This makes high scores on leaderboards empty numbers that do not translate into real utility value in business or science.
We need a shift away from the "AI vs. Human" paradigm toward systemic assessment. Instead of asking if an AI can write a poem better than a philology student, we should examine how the tool integrates with human decision-making processes. The real challenge is not creating a machine that passes a medical exam, but one that, in the hands of a doctor, actually reduces the number of misdiagnoses in the chaos of a hospital environment. Current benchmarks completely overlook this aspect of interaction.
The end of the era of static leaderboards
To break the impasse, we must define new metrics that are resistant to simple data memorization. One direction is the introduction of dynamic tests, where tasks are generated procedurally or modified in real-time. If the GPT-4o model can solve a logic puzzle, let's see if it can handle it when we change one parameter that is irrelevant from a logical standpoint. It often turns out that a minor change in phrasing causes a total regression in performance – proof that the model does not "understand" the concept, but merely follows the statistical probability of words.
- Process evaluation: Analysis of not just the result, but the reasoning path (e.g., through Chain of Thought), which allows for the detection of so-called hallucinations at an early stage.
- Robustness tests: Checking how the model reacts to deliberate attempts to mislead it or to low-quality data.
- Operational metrics: Measuring resource consumption, latency, and the cost of obtaining a correct answer relative to its value.
- In-context learning capability: Testing how quickly a model adapts to new instructions without the need for retraining.
Another key element of the new era of evaluation must be the objective measurability of safety and ethics. Current Red Teaming tests are often subjective and depend on the creativity of the testers. We need automated yet nuanced tools that can assess a model's tendency to generate bias or dangerous instructions in a repeatable and scalable way. Without this, every new version of a Llama or Mistral model will be a deployment Russian roulette.
Isolated tasks are a dead end
Another mistake of the current methodology is the focus on atomic tasks. In the real world, work consists of a chain of activities. A programmer doesn't just write a function; they must understand the existing architecture, anticipate technical debt, and consult with the team. Benchmarks like HumanEval only test the first of these elements. We lack tools to evaluate Agentic AI – systems that operate autonomously over long periods, making multi-stage decisions to achieve a complex goal.
Modern analysis should emphasize "collaborative capacity." At Pixelift, we often observe that models with lower scores in raw mathematical benchmarks perform much better as creative assistants because their architecture favors better interpretation of user intent. This suggests that our current hierarchy of models is flawed because it is based on the wrong priorities. Intelligence is not just pure computing power; it is also flexibility and relevance in a social context.
It is also worth noting the problem of "overscaling" benchmarks. Models are becoming so large that testing them takes weeks and costs millions of dollars. This creates a barrier to entry for smaller players and academic centers, promoting only giants like Microsoft or Google. The democratization of AI requires benchmarks that are efficient – those that can reliably evaluate a Phi-3 Mini model with the same precision as giant computing clusters, without the need to burn gigawatt-hours of energy for validation alone.
A new paradigm: AI as a component, not a solo player
The future of AI system evaluation must lie in holistic testing. Instead of isolating a model in a sterile testing lab, we must start measuring "Human Augmentation." The real benchmark of tomorrow will be an indicator of the percentage increase in the efficiency of a design team using a given AI tool, while maintaining or improving final quality. This requires the involvement of sociologists, organizational psychologists, and domain experts, not just machine learning engineers.
We must stop treating AI as a digital equivalent of a human competing in a game show. AI is a new category of tool, closer to an operating system than an employee. Therefore, benchmarks should evolve toward performance tests known from the software industry, combined with deep semantic analysis. Only then can we stop getting excited about the next few percentage points in MMLU tables and start building systems that actually solve real problems instead of just pretending to.
My prediction is clear: within the next two years, there will be a total collapse of trust in current, public leaderboards. Companies will begin to create their own internal and hermetic "private gold sets," which will be the only reliable indicator of technology value. The era of "standardized tests" is ending before our eyes, giving way to an era of specific, contextual validation, where the most important metric will not be a "score better than a human," but "utility in a specific process."
More from Research
Related Articles

There are more AI health tools than ever—but how well do they work?
23h
The Pentagon’s culture war tactic against Anthropic has backfired
23h
Kris Jenner's image spreads in Chinese social media good luck trend
Mar 30



