TechCrunch Startups · 10 min read

The leaderboard “you can’t game,” funded by the companies it ranks

Pixelift Editorial Team

Photo: TechCrunch Startups

A new ranking platform financed solely by the companies it rates claims to be resistant to manipulation. The project, which aims to bring transparency to the technology industry, runs into a fundamental paradox: the very entities that want to be ranked are funding its operation. The initiative promises a methodology so advanced that no company can artificially boost its position. The business model, however, invites doubt: when the rater is financially dependent on the rated, questions about independence are legitimate. The leaderboard is meant to analyze key metrics of technology companies, from hiring practices to data security, and for users it could be a valuable source of information when choosing services. The problem is that even the best-designed system can be susceptible to subtle influence when its survival depends on cooperation with the entities it ranks. The project is a test of faith in whether transparency can exist in a system where the financier and the rated are two sides of the same coin.

Arena, a platform that until recently was merely an academic project, has become the arbiter in the war over who has the best AI model. Seven months: that's how long it took the startup to go from a quiet existence in UC Berkeley's laboratories to a position where its rankings sway millions of dollars in valuations, product launch schedules, and the PR narratives of the biggest tech companies. It is a fascinating story because it shows how quickly new power structures can be established in an industry that has itself undergone fundamental change in just a few years.

But here's the problem, and it's serious. Arena is not the independent arbiter we might imagine. Its investors are exactly the same companies that it evaluates — OpenAI, Anthropic, Google and other AI giants. This is a situation that could be called a conflict of interest on an industrial scale. How can you trust a leaderboard that claims it can't be "hacked" if it's financed by players whose positions on that leaderboard directly affect their revenues, valuations, and strategic decisions?

The story of Arena is first and foremost a tale of how a small academic project became a tool of power in the AI ecosystem. At the same time, it's a story about how the industry is trying to solve a fundamental problem: how do you compare models that are increasingly advanced and increasingly difficult to evaluate objectively? The answer the industry has settled on is troubling: it has handed that job to a platform financed by the very companies it evaluates.

From academic project to industry arbiter

Arena started as a typical academic project — a team of researchers from UC Berkeley worked on a system for comparing language models. The idea was to create something that could work like Elo rating in chess, but for AI. Users could compare responses from different models to the same questions, and the system would automatically recalculate rankings based on the results of these comparisons. Elegant, simple, seemingly objective.
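The chess-style mechanic described above is easy to sketch. Below is an illustrative Elo update in Python; the `k=32` factor and 400-point scale are the classic chess defaults, not Arena's published parameters:

```python
def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, a_won, k=32):
    """Recalculate both ratings after one user comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # The winner's gain equals the loser's loss, so total points are conserved.
    return rating_a + delta, rating_b - delta

# One vote: the underdog (1500) beats the favorite (1600),
# so it gains more points than it would for beating an equal.
a, b = elo_update(1500, 1600, a_won=True)
```

An upset win moves the ratings more than an expected one, which is what lets the leaderboard converge toward relative model strength as votes accumulate.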

The problem emerged when the project began to attract attention. Tech companies noticed that this platform could be really important to them — that a model's position on the Arena leaderboard could determine whether investors would be interested in their model, whether media would write about them, whether startup teams would want to build on their infrastructure. The moment this was realized, Arena went from an academic curiosity to a strategic resource.

The startup quickly formalized and began raising funding. And here's the irony — its investors became exactly those companies that knew how important this platform would be. OpenAI, Anthropic, Google, as well as venture capital funds interested in the AI ecosystem — everyone had a stake in Arena. It was as if all the teams in a football league together bought the television that broadcasts the matches and decides the rankings.

The mechanics of the leaderboard, which seems neutral but isn't

The Arena ranking system does indeed look intelligently designed. Users compare responses from two models to the same question and choose which is better. The system converts this into something similar to the Elo algorithm — the more comparisons, the more stable a model's position on the leaderboard. Theoretically, if the system is really used by thousands of people, it becomes difficult to manipulate, because you would need to mobilize a huge number of people to distort the results.
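A toy simulation can make the volume argument concrete. The parameters here are assumptions for illustration (a fixed K factor of 32 and equal 1500-point starting ratings, not Arena's actual configuration): a model that genuinely wins most comparisons drifts clearly ahead, while a handful of hostile votes vanishes in the noise of thousands of honest ones.

```python
import random

def simulate(true_p, n_votes, k=32, seed=0):
    """Simulate n_votes pairwise comparisons in which model A
    genuinely wins with probability true_p; return A's final rating."""
    rng = random.Random(seed)
    a, b = 1500.0, 1500.0
    for _ in range(n_votes):
        exp_a = 1.0 / (1.0 + 10 ** ((b - a) / 400))
        won = 1.0 if rng.random() < true_p else 0.0
        delta = k * (won - exp_a)
        a += delta
        b -= delta
    return a

# A model that truly wins 70% of matchups pulls clearly ahead...
strong = simulate(true_p=0.7, n_votes=2000)
# ...while an evenly matched pair hovers near the starting rating.
even = simulate(true_p=0.5, n_votes=2000)
```

With a fixed K factor the rating hovers around its equilibrium rather than settling exactly, so production leaderboards typically shrink K over time or fit a Bradley-Terry model to the full vote history; either way, shifting a high-volume ranking requires a vote stream large enough to outweigh genuine traffic.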

But here's the first layer of the problem: who decides what questions are asked? If Arena allows companies or their employees to suggest questions that reach users, then we already have manipulation. A question formulated in a specific way can favor a specific model — for example, a question that requires a particular kind of creativity might be better for a model that happens to be good at that.

The second layer: who makes up Arena's user base? If the platform is used primarily by employees of tech companies, their friends, or people interested in AI, then their preferences can be systematically skewed. It may turn out that a particular model is preferred by this group, but less so by society at large. The leaderboard would then be a reflection of the preferences of a narrow group, not the actual quality.

The third layer, the worst: stimulating demand for your own product. If employees of a company that invested in Arena are encouraged to use the platform and compare models, they can (consciously or not) favor the model that their company sponsors. This doesn't have to be intentional — it can simply be a cognitive bias where people subconsciously prefer what is close to them.

Conflict of interest in capital letters

What's most interesting is that Arena itself claims it can't be "hacked" — that its system is so resistant to manipulation that it can be trusted. This is a statement that could be called naive, if not for the fact that it is strategically beneficial to the platform's investors. If everyone believes the leaderboard is reliable, then everyone will use it — and then its influence on the industry grows.

OpenAI has an interest in its GPT-4 or GPT-5 model occupying a high position on Arena. Not because the platform is unbiased — because a high position means more PR, more investor interest, more adoption. Similarly with Anthropic, Google, or any other player. Everyone has motivation for Arena to show their model in the best light. And everyone is an investor in the platform that determines that position.

This doesn't mean Arena deliberately manipulates results in favor of a specific company — though that would be possible. It means that the platform's funding structure creates systematic incentives to favor investors. Even if the algorithm is honest, even if there is no direct manipulation, the system as a whole can systematically bias toward companies that finance it.

Compare this to a situation that would be unacceptable in other industries. Imagine if all major car manufacturers together bought a car magazine that publishes car reviews. Everyone would be shareholders, everyone would have influence on editorial policy. Would anyone believe the reviews are unbiased? Of course not. Yet in AI this happens and everyone pretends it's normal.

How the leaderboard shapes industry reality

Arena's influence on the industry is real and measurable. A model's position on the leaderboard directly affects startup valuations, investor decisions, and company PR strategies. If your model ranks high on Arena, you can show that to potential investors, media, clients. If you rank low, it's a problem — you need to explain why the leaderboard is wrong, or work on improvement.

This creates a perverse incentive — companies are motivated to optimize their models not for what is actually useful to users, but for what looks good on Arena. If you know what questions are on the leaderboard, or can guess what questions will be there, you can tune the model to perform well on them. This is a form of gaming the metrics — exactly what Arena claims its system prevents.

A second effect: centralization of power in evaluation. Before Arena became dominant, there were many ways to evaluate models — academic benchmarks, informal user feedback, blog reviews, media comparisons. Each of these methods had its weaknesses, but together they created a more distributed field of evaluation. Now everything is concentrated on a single platform, financed by the same players. This is centralization that is not beneficial for the industry.

Polish implications and local ecosystems

In Poland this might seem distant — we don't have OpenAI or Anthropic, we don't invest billions in developing frontier LLMs. But it matters to us, because Polish startups and teams working on AI will have to compete in a space where the Arena leaderboard has significance. If a Polish company wants to build a model or an AI-based application, it will have to consider how its solution will perform on Arena.

Additionally, the Polish tech industry is interested in foreign investment. If investors use Arena as a source of information about which AI technologies are worth supporting, then indirectly Arena affects decisions about financing Polish projects. This is an indirect effect, but a real one.

There's also the question of transparency — Polish regulation, particularly in the context of discussions about the AI Act, should be interested in how technological rankings that impact the market are created. If a platform is financed by those it evaluates, it should be transparent about this and should be subject to regulatory oversight.

Alternatives that will never work

Theoretically, there could be an independent leaderboard, financed by entities neutral to the industry — for example by non-profit organizations, foundations, or even public institutions. Such a leaderboard would be more credible because it would have no financial interest in any model performing better or worse.

The problem is that such a leaderboard would be less useful to the industry. Tech companies would not be interested in supporting it because they couldn't control it. Media would be less interested because there would be no such PR potential. Users might be less interested because there wouldn't be the same "gamification" and competition that Arena offers.

In other words, the structure that would be more credible would be less influential. And the structure that is influential is financed by those it evaluates. This is a situation where the market naturally moves toward a solution that is beneficial for the industry but harmful to objectivity.

The future of leaderboards and the question of credibility

The question we should ask ourselves is not whether Arena is completely unbiased — it probably never will be — but whether its funding model is acceptable. Should we allow a platform that has such great influence on the industry to be financed by those it evaluates?

The answer the market gives is yes; everyone accepts this model because it benefits the main players. But that doesn't make it the right answer. Arena could be required to be more transparent about its finances, publishing detailed information about who funds it and how that funding affects its operations. Its board could be required to include representatives of disinterested parties: university scientists, regulators, and user advocates.

But this won't happen because the industry has no motivation to do it. Arena works well for those who finance it, and that's enough. The rest — transparency, independence, credibility — these are values that are less important than the practical benefits of having a platform you can influence.

The story of Arena is above all a lesson in how quickly new power structures can be established in an industry, and how easy it is to accept solutions that are convenient for the elite, even if they are problematic for the general public. This is not specific to AI — it is a universal problem when a small group of entities has great power. But in AI, where everything changes so quickly, this is particularly important.
