Limitations of traditional testing
If AI companies have been slow to respond to the growing failure of benchmarks, it is partly because the test-scoring approach has been so effective for so long.
One of the biggest early successes of contemporary AI was the ImageNet challenge, in many ways the forerunner of contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to classify into 1,000 different categories.
Crucially, the test was completely agnostic about methods, and any successful algorithm quickly gained credibility no matter how it worked. When an algorithm called AlexNet broke through in 2012, using a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural networks would be the secret to unlocking image recognition, but once it scored well, no one dared to argue. (Ilya Sutskever, one of AlexNet’s developers, would go on to cofound OpenAI.)
A large part of what made the challenge so effective was that there was little practical difference between ImageNet’s object-classification task and the actual process of asking a computer to recognize an image. Even if there were disputes over methods, no one doubted that the highest-scoring model would have an advantage when deployed in a real image-recognition system.
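That method-agnosticism is easy to see in the scoring itself. Below is a minimal, hypothetical sketch (in Python, not ImageNet’s actual tooling) of how such a leaderboard can be computed: nothing about a model’s architecture or training enters the calculation, only whether its predicted labels match the ground truth.

```python
# A minimal, hypothetical sketch of ImageNet-style, method-agnostic scoring.
# The benchmark never inspects how a model works; it only checks whether each
# predicted class matches the held-out ground-truth label.

def top1_accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of images whose predicted class matches the labeled class."""
    correct = sum(
        1 for image_id, label in ground_truth.items()
        if predictions.get(image_id) == label
    )
    return correct / len(ground_truth)

# Illustrative labels and submissions (not real ImageNet data).
labels  = {"img_001": "golden_retriever", "img_002": "espresso", "img_003": "pickup_truck"}
entry_a = {"img_001": "golden_retriever", "img_002": "espresso", "img_003": "minivan"}
entry_b = {"img_001": "tabby_cat",        "img_002": "espresso", "img_003": "minivan"}

print(top1_accuracy(entry_a, labels))  # 2 of 3 correct
print(top1_accuracy(entry_b, labels))  # 1 of 3 correct
```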
But in the 12 years since, AI researchers have applied that same method to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.
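SWE-Bench-style evaluations follow the same pattern for code: the harness applies a model-generated patch to a repository and checks whether the project’s tests pass. The sketch below is a simplified illustration under that assumption; the repository paths, patch text, and test command are hypothetical placeholders, not the benchmark’s real harness.

```python
# Simplified, hypothetical sketch of an exam-style coding evaluation in the spirit
# of SWE-Bench: apply a model-generated patch, then let the test suite decide.
import subprocess

def resolves_issue(repo_dir: str, model_patch: str, test_command: list[str]) -> bool:
    """Return True if the patch applies cleanly and the project's tests pass."""
    applied = subprocess.run(
        ["git", "apply", "-"],            # read the patch from stdin
        input=model_patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if applied.returncode != 0:
        return False                      # the patch did not even apply
    tests = subprocess.run(test_command, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0          # pass/fail is the only signal recorded

# A leaderboard number is then just the fraction of issues marked resolved, e.g.:
# resolved = [resolves_issue(repo, patch, ["pytest", "-q"]) for repo, patch in tasks]
# score = sum(resolved) / len(resolved)
```

A single pass/fail bit per issue is what ends up standing in for “coding ability,” and that gap between the measurement and the claim is exactly what troubles the researchers quoted below.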
Where things break down
Anka Reuel, a doctoral student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced that the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel said. “It’s no longer a single task but a whole bunch of tasks, so evaluation becomes harder.”
Like Jacobs of the University of Michigan, Reuel thinks “the main problem with benchmarks is validity, even more than the practical implementation,” noting, “that’s where a lot of things break down.” For a task as complex as coding, for instance, it is nearly impossible to incorporate every possible scenario into your problem set. As a result, it is hard to tell whether a model scores better because it is genuinely better at coding or because it has manipulated the problem set more effectively. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
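One way researchers probe for this kind of gaming (a general technique, not one attributed to Reuel or Jacobs here) is to rerun a model on a freshly written, unpublished set of comparable problems and compare the scores. The numbers below are hypothetical.

```python
# Hypothetical sketch of a held-out validity check: a model that has merely
# adapted to the public problem set tends to drop sharply on unseen but
# comparable tasks, while a genuine improvement in skill largely carries over.

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

public_results  = [True, True, True, True, False]    # score on the well-known benchmark
heldout_results = [True, False, True, False, False]  # score on a fresh, unpublished set

print(accuracy(public_results))   # 0.8
print(accuracy(heldout_results))  # 0.4
# A large gap between the two is a warning sign that the headline score reflects
# familiarity with the problem set rather than better coding.
```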
For developers, the hope is that success on many specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean that a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There are just many more knobs you can turn,” said Sayash Kapoor, a computer scientist at Princeton and a well-known critic of sloppy practices in the AI industry. “When it comes to agents, they have given up on the best practices for evaluation.”