
The PhD students who became the judges of the AI industry

Redakcja Pixelift

Photo: TechCrunch AI

A few years ago, a group of PhD students from UC Berkeley decided to change the way artificial intelligence models are evaluated. Today, their platform is among the most important reference points in the AI industry, and its ranking effectively shapes how models are compared across the entire sector. The story began with a simple observation: the existing benchmarks for testing AI were insufficient and did not reflect how people actually use these systems. Instead of waiting for the large corporations to act, these young scientists took matters into their own hands. Their work on how large language models actually behave, and where they fail, became a point of reference for the whole industry. Today, as OpenAI, Google, and Meta develop ever more powerful systems, precisely this kind of independent voice is essential. These PhD students showed that you don't need billions of dollars to significantly influence the direction of a technology; a deep understanding of the problem and determination are enough. Their work directly influences which models reach the market and how their capabilities are judged.

At a moment when the artificial intelligence market is accelerating at a pace that seemed impossible just recently, a fundamental problem emerges: how to assess which model is the best? This is not about academic discussions or internal benchmarks of corporate laboratories. It's about something far more influential — a public ranking that actually determines millions of dollars in funding, product launch dates, and the direction of entire PR campaigns. And it turns out that this arbiter has become a project created by three PhD students from UC Berkeley, who just seven months ago were still unknown outside the university walls. This is a story about how a niche research tool transformed into the most important judge in the AI industry.

LM Arena, previously known as Chatbot Arena, is a platform that allows users to compare responses from different language models to the same questions. It looks simple at first glance: you type a prompt, the system returns responses from two randomly selected, anonymized models, and you choose which is better. But in reality, this tool has become something like an Oscar for artificial intelligence, with the difference that the statuettes are awarded not by an elite jury, but by millions of users from around the world. The ranking generated by Arena appears in the media, influences investor decisions, and model producers obsessively monitor their positions. When Claude 3.5 Sonnet from Anthropic advanced to first place, it was a headline-worthy event. When Grok from Elon Musk's xAI appeared in the ranking, everyone waited with bated breath to see where it would land.

The story of how three PhD students changed the competitive landscape in AI tells us a lot about the times we live in. It shows how quickly power can shift in an industry, how decisions can be made by inexperienced but intelligent people, and how public consensus can become more influential than the internal metrics of technology giants. But it is also a story full of tensions, conflicts of interest, and questions about whether a single ranking should have such power over the future of an entire industry.

Three students who didn't know they were creating an industry standard

The genesis of Arena is typical for the technology startup ecosystem: a group of intelligent people, sometimes inexperienced in business, identifies a problem and creates an elegant solution. In this case, the problem was the fragmentation of AI model evaluation. When the boom in large language models began in 2023, every organization evaluated them differently. OpenAI had its own benchmarks, Google had its own, Anthropic had its own. Researchers published papers with their own metrics. The results were inconsistent, sometimes even contradictory. Nobody really knew which model was best, because each measurement gave a different answer.

Three PhD students from UC Berkeley — Lianmin Zheng, Ying Sheng, and Hao Zhang — decided to create something radically simpler: a crowdsourced platform where ordinary users could compare models. They weren't looking for scientific precision or advanced metrics. It was about what actually works in practice, what people think is better when they have to choose between two answers. This was a brilliant intuition: instead of asking what benchmarks say, ask people who actually use these models.

The platform was simple but effective. A user-friendly interface, no unnecessary complications. You type a question, you get two answers, you choose the better one. The system collects thousands, then millions of such votes and generates a ranking from them. The algorithm behind the ranking is the Bradley-Terry model, closely related to the Elo system used in chess and other sports rankings. The mathematics was solid, but the approach was revolutionary: instead of handcrafting a benchmark, they created a system where the benchmark builds itself through the aggregation of user preferences.

When the platform was launched, nobody expected it to achieve such scale and influence. The students counted on it being a useful tool for researchers, maybe for an academic paper. Instead, within a few months, Arena attracted millions of users. The ranking began to be cited in press articles. Companies began to obsessively monitor it. Investors began to check it before making financing decisions. What was supposed to be an academic project became de facto an industry standard.

How a niche project became the arbiter of millions of dollars

The breakthrough moment for Arena came when the results from the platform began to differ from official benchmarks published by model producers. When a model received a high rating on Arena but a low one in official tests, or vice versa, something interesting began: people started to trust Arena more. Why? Because Arena tests models in real-world scenarios, with questions that people actually ask, not an artificial test set created by the producer.

This trust quickly translated into power. When a startup sought funding, investors asked: "What's your position on Arena?" When a company planned to launch a new model, PR teams counted on it landing high in the ranking. When researchers published papers about new architectures, they compared their results with Arena. The ranking became what economists might call a price discovery mechanism — a mechanism through which the market discovers the true value of a product.

The most dramatic example of this power was the appearance of Claude 3.5 Sonnet from Anthropic at the top of the ranking. It was a new generation compared with previous versions of Claude. For many observers, the advancement to first place was a surprise, because OpenAI and its GPT-4 had dominated until then. But when Anthropic began to promote the result, the media picked up the story, investors paid attention, and for Anthropic itself it became a key argument in conversations with venture capital funds. The ranking from Arena became part of the success narrative. It was not just another benchmark; it was validation by the public.

At the same time, a problem emerged: if one ranking has such power, shouldn't it be more transparent, more regulated, more independent? The students from UC Berkeley found themselves in a situation that is hard to imagine: three people who a few months earlier had been writing doctoral dissertations were now effectively deciding the hierarchy of the entire AI industry. It was power that nobody formally entrusted to them and nobody elected them for, but that everyone respected.

Credibility crisis and conflicts of interest

Of course, such a concentration of influence could not pass without problems. Questions began to arise about bias in the ranking, about whether the system is really fair, or perhaps favors certain types of models or questions. Model producers who were unhappy with the results began to question the methodology. Why exactly these questions? Why exactly this user population? Doesn't the system favor models that are better at answering questions for Western, English-speaking audiences?

There was also the problem with the nature of crowdsourcing itself. When a ranking becomes important, people start to manipulate it. They can organize to vote for a specific model. They can create questions that favor certain models. They can even hire click farms to artificially boost results. Arena had to introduce protection systems against this type of manipulation, but it's always a cat-and-mouse game — when a security system appears, new ways to circumvent it emerge.

Another problem is conflicts of interest. The three founders of Arena are researchers from UC Berkeley, but at the same time they work on their own AI projects. Could their decisions about how the ranking works be biased? Could they favor certain research approaches that interest them? This might be paranoia, but in a world where one ranking decides millions of dollars, paranoia is justified.

Add to this one more problem: strategic participation. Model producers know that Arena exists. They can therefore optimize their models not for what is really good, but for what will perform well in Arena tests. This is the classic Goodhart's law problem — when a measure becomes a target, it stops being a good measure. If everyone optimizes for Arena, then the ranking stops telling us about the real capability of models, and starts telling us only about how well models are optimized for Arena.

Polish perspective: does this concern us?

For Polish creators, researchers, and technology companies, Arena might seem distant — something that happens in Silicon Valley. But in reality, it has a direct impact on the Polish AI ecosystem. First, every Polish startup working on AI models must now think about how it will perform on Arena. If you want to attract investors from the West, you need to show them that your model is competitive — and one way to prove this is a good position on Arena.

Second, Polish academia and researchers have access to the same platform as everyone else. If a Polish research group develops a new model or new architecture, they can test it on Arena and compare it with models from technology giants. This is democratization — nobody needs to invite you, nobody needs to validate you. Your work can be evaluated by the public on equal terms with the work of OpenAI or Anthropic.

Third, Arena shows us that the future of technology will not be determined only by large corporations. It can be shaped by a small group of intelligent people who create a tool that everyone wants to use. This is a lesson for the Polish startup ecosystem: sometimes you don't need to build a better product than the competition, you need to build a better system for evaluating competitors' products. This can be more influential and more profitable than the product itself. Arena earns through subscriptions and partnerships, but its real value is in the influence it has on the industry. This is a business model worth studying.

Mathematics behind the ranking: why the Bradley-Terry model is genius

To understand why Arena is so influential, it's worth delving into the mathematics behind the ranking. The founders chose the Bradley-Terry model, a probabilistic method for analyzing paired comparisons that dates back to the 1950s. The model assumes that when you compare two items, the probability that one is chosen over the other depends on their hidden "strength" or "ability".
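Written out, the model is a one-line formula. If each model $i$ is assigned a positive latent strength $\pi_i$, the probability that it wins a head-to-head vote against model $j$ is:

```latex
P(i \text{ beats } j) \;=\; \frac{\pi_i}{\pi_i + \pi_j}
  \;=\; \frac{1}{1 + e^{-(r_i - r_j)}}, \qquad r_i = \log \pi_i
```

The logistic form on the right shows the family resemblance to Elo ratings: only the difference $r_i - r_j$ matters, never the absolute values, which is why such a leaderboard can be reported on an Elo-like scale.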

It's an elegant solution to a problem that seems simple but isn't. If you have millions of votes from millions of users, how do you aggregate these votes into a coherent ranking? You can't just count votes, because that would be unfair — a model that is tested more often would receive more votes. The Bradley-Terry model solves this through statistical modeling: each vote is treated as a data point in a probabilistic model, and the ranking is the result of maximizing the likelihood of that model.
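As a concrete illustration of that aggregation step, here is a minimal sketch of Bradley-Terry fitting by iterative maximum likelihood (Zermelo's classic update). The vote data and model names are invented, and this is not Arena's actual code, just the textbook algorithm:

```python
# Hypothetical pairwise votes: (winner, loser). Invented data for illustration.
votes = [
    ("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a"),
    ("model_a", "model_c"), ("model_a", "model_c"), ("model_c", "model_b"),
]

def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths by iterative maximum likelihood
    (Zermelo's update). Returns model -> strength, normalized to sum to 1."""
    models = {m for pair in votes for m in pair}
    wins = {m: 0 for m in models}        # W_i: total wins of model i
    n_pair = {}                          # n_ij: comparisons per unordered pair
    for w, l in votes:
        wins[w] += 1
        key = tuple(sorted((w, l)))
        n_pair[key] = n_pair.get(key, 0) + 1

    p = dict.fromkeys(models, 1.0)
    for _ in range(iters):
        new_p = {}
        for m in models:
            # denominator: sum over opponents o of n_mo / (p_m + p_o)
            denom = 0.0
            for (a, b), n in n_pair.items():
                if m == a:
                    denom += n / (p[m] + p[b])
                elif m == b:
                    denom += n / (p[m] + p[a])
            new_p[m] = wins[m] / denom if denom else p[m]
        total = sum(new_p.values())      # renormalize so strengths sum to 1
        p = {m: v / total for m, v in new_p.items()}
    return p

strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking[0])  # model_a wins most often, so it tops the ranking
```

Unlike chess-style Elo, which updates ratings one game at a time, this batch fit uses all votes at once, so the resulting ranking does not depend on the order in which votes arrived.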

The result is a ranking that is not only fair but also has a built-in measure of uncertainty. Arena knows how confidently it can say that model A is better than model B. If the difference is small, the ranking shows it. This is science, but hidden behind a simple interface. For the user, it's just a ranking, but behind the scenes, advanced statistics are happening.
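Uncertainty of this kind is typically estimated with a bootstrap: resample the votes with replacement many times, recompute the statistic, and read off an interval. A minimal, self-contained sketch for a single head-to-head win rate (the vote counts are invented; a real leaderboard pipeline is more elaborate):

```python
import random

random.seed(0)
# Invented head-to-head results: 1 = model A won the vote, 0 = model B won.
votes = [1] * 70 + [0] * 30

def bootstrap_winrate_ci(votes, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for A's win rate."""
    stats = []
    for _ in range(n_boot):
        resample = [random.choice(votes) for _ in votes]
        stats.append(sum(resample) / len(resample))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2))]

low, high = bootstrap_winrate_ci(votes)
# If the interval straddled 0.5, we could not confidently rank A above B.
print(f"95% CI for A's win rate: [{low:.2f}, {high:.2f}]")
```

With only 100 votes the interval is wide; at the scale of millions of votes it shrinks, which is why a leaderboard can meaningfully report both a rating and a confidence interval around it.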

However, even this mathematics has its limitations. The Bradley-Terry model assumes that each vote is independent and that user preferences are transitive — if A is better than B, and B is better than C, then A should be better than C. In reality, people's preferences can be contradictory and context-dependent. A model might be better at writing code but worse at translation. The ranking cannot capture this — it must give one number for each model.

The future: is one ranking enough?

The question that increasingly appears in the industry is: should one platform have such power? Alternative rankings already exist. Hugging Face has its own leaderboards, and competing comparison platforms are emerging. But none of them has the influence of Arena. This is dangerous, both for the industry and for Arena itself. If the ranking becomes too important, it can become a target for attacks and manipulation, or it might simply buckle when everyone depends on it.

The three founders of Arena are aware of this. In recent months, they have tried to communicate more transparently about the methodology, about the limitations of the ranking, about what it measures and what it does not. But the question remains: can you be a disinterested arbiter of an industry while also being a player in it? Should Arena be handed over to a neutral organization, such as the IEEE or another scientific institution? Should it be decentralized so that no single party controls it?

For now, Arena remains what it is: a tool created by three intelligent people that accidentally became the most important benchmark in the AI industry. This is both its strength and its weakness. Strength, because it means that tools can be created by small teams and can have great impact. Weakness, because it means that one ranking, created by three people, decides billions of dollars in investment and the direction the entire industry is heading. The story of Arena is a story of power and responsibility that comes unexpectedly and sometimes overwhelms.

Lessons for the ecosystem: when a tool becomes power

If there is any universal lesson from the story of Arena, it is this: in a rapidly developing industry, the first person to create a tool to measure and compare will gain enormous power. This is not always intentional, but it is a natural consequence. When everyone needs an answer to the question "which model is best?" and you have the only credible answer, you have power.

This has implications for the future. If AI develops in the direction of specialized models — one for medicine, one for law, one for coding — then specialized rankings will be needed. Who will create them? Probably a small group of people who will have power over those industries. This can be good if these people are ethical and transparent. It can be bad if they have hidden agendas.

For Polish creators, researchers, and companies, the same lesson applies: the system everyone uses to evaluate products can carry more influence, and sometimes more profit, than any single product.

Ultimately, the story of three PhD students from UC Berkeley who changed the landscape of AI evaluation is a story about how, in times of rapid change, a small group of intelligent people can have enormous impact. It is inspiring, but it should also concern us. When one platform has such power, questions about transparency, independence, and representativeness become key. Arena has shown us that a benchmark is not a neutral tool — it is a battlefield where the future of an industry is decided.

Source: TechCrunch AI