Limitations of traditional testing
If AI companies have been slow to respond to the growing failure of benchmarks, it is partly because the test-scoring approach has been so effective for so long.
One of the biggest early successes of contemporary AI was the ImageNet challenge, in many ways the forerunner of contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to classify into 1,000 different categories.
Crucially, the test was completely agnostic about methods, and any successful algorithm quickly gained credibility no matter how it worked. When an algorithm called AlexNet broke through in 2012, using a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural networks would be the secret to unlocking image recognition, but once it scored well, no one dared to argue. (Ilya Sutskever, one of AlexNet’s developers, would go on to cofound OpenAI.)
A large part of what made the challenge so effective was that there was little practical difference between ImageNet’s object-classification task and the actual process of asking a computer to recognize an image. Even if there were disputes over methods, no one doubted that the highest-scoring model would have an advantage when deployed in a real image-recognition system.
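That method-agnosticism is easy to see in the scoring itself. Below is a minimal, hypothetical sketch (in Python, not ImageNet’s actual tooling) of how such a leaderboard can be computed: nothing about a model’s architecture or training enters the calculation, only whether its predicted labels match the ground truth.

```python
# A minimal, hypothetical sketch of ImageNet-style, method-agnostic scoring.
# The benchmark never inspects how a model works; it only checks whether each
# predicted class matches the held-out ground-truth label.

def top1_accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of images whose predicted class matches the labeled class."""
    correct = sum(
        1 for image_id, label in ground_truth.items()
        if predictions.get(image_id) == label
    )
    return correct / len(ground_truth)

# Illustrative labels and submissions (not real ImageNet data).
labels  = {"img_001": "golden_retriever", "img_002": "espresso", "img_003": "pickup_truck"}
entry_a = {"img_001": "golden_retriever", "img_002": "espresso", "img_003": "minivan"}
entry_b = {"img_001": "tabby_cat",        "img_002": "espresso", "img_003": "minivan"}

print(top1_accuracy(entry_a, labels))  # 2 of 3 correct
print(top1_accuracy(entry_b, labels))  # 1 of 3 correct
```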
But in the 12 years since, AI researchers have applied that same method to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.
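SWE-Bench-style evaluations follow the same pattern for code: the harness applies a model-generated patch to a repository and checks whether the project’s tests pass. The sketch below is a simplified illustration under that assumption; the repository paths, patch text, and test command are hypothetical placeholders, not the benchmark’s real harness.

```python
# Simplified, hypothetical sketch of an exam-style coding evaluation in the spirit
# of SWE-Bench: apply a model-generated patch, then let the test suite decide.
import subprocess

def resolves_issue(repo_dir: str, model_patch: str, test_command: list[str]) -> bool:
    """Return True if the patch applies cleanly and the project's tests pass."""
    applied = subprocess.run(
        ["git", "apply", "-"],            # read the patch from stdin
        input=model_patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if applied.returncode != 0:
        return False                      # the patch did not even apply
    tests = subprocess.run(test_command, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0          # pass/fail is the only signal recorded

# A leaderboard number is then just the fraction of issues marked resolved, e.g.:
# resolved = [resolves_issue(repo, patch, ["pytest", "-q"]) for repo, patch in tasks]
# score = sum(resolved) / len(resolved)
```

A single pass/fail bit per issue is what ends up standing in for “coding ability,” and that gap between the measurement and the claim is exactly what troubles the researchers quoted below.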
Where things break down
Anka Reuel, a doctoral student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced that the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel said. “It’s no longer a single task but a whole bunch of tasks, so evaluation becomes harder.”
Like Jacobs of the University of Michigan, Reuel thinks “the main problem with benchmarks is validity, even more than the practical implementation,” noting, “that’s where a lot of things break down.” For a task as complex as coding, for instance, it is nearly impossible to incorporate every possible scenario into your problem set. As a result, it is hard to tell whether a model scores better because it is genuinely better at coding or because it has manipulated the problem set more effectively. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
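One way researchers probe for this kind of gaming (a general technique, not one attributed to Reuel or Jacobs here) is to rerun a model on a freshly written, unpublished set of comparable problems and compare the scores. The numbers below are hypothetical.

```python
# Hypothetical sketch of a held-out validity check: a model that has merely
# adapted to the public problem set tends to drop sharply on unseen but
# comparable tasks, while a genuine improvement in skill largely carries over.

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

public_results  = [True, True, True, True, False]    # score on the well-known benchmark
heldout_results = [True, False, True, False, False]  # score on a fresh, unpublished set

print(accuracy(public_results))   # 0.8
print(accuracy(heldout_results))  # 0.4
# A large gap between the two is a warning sign that the headline score reflects
# familiarity with the problem set rather than better coding.
```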
For developers, the hope is that success on many specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean that a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There are just many more knobs you can turn,” said Sayash Kapoor, a computer scientist at Princeton and a well-known critic of sloppy practices in the AI industry. “When it comes to agents, they have given up on the best practices for evaluation.”