
    How to build a better AI benchmark

By Daniel68 · May 8, 2025

    Limitations of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it is partly because the test-scoring approach has been so effective for so long.

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of predecessor to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held over 3 million images for AI systems to classify into 1,000 different categories.
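To make the scoring concrete: an ImageNet-style challenge boils down to comparing each submitted prediction against a ground-truth label and reporting the fraction that match (top-1 accuracy). The sketch below is a minimal illustration of that scoring rule, not ImageNet’s actual evaluation harness; the image IDs and class names are made up.

```python
def top1_accuracy(predictions: dict[str, str], labels: dict[str, str]) -> float:
    """Fraction of items whose predicted class matches the ground-truth label."""
    correct = sum(
        1 for item_id, true_class in labels.items()
        if predictions.get(item_id) == true_class
    )
    return correct / len(labels)

# Hypothetical toy data: image IDs mapped to one of 1,000 class names.
labels = {"img_001": "tabby_cat", "img_002": "golden_retriever", "img_003": "sports_car"}
predictions = {"img_001": "tabby_cat", "img_002": "golden_retriever", "img_003": "pickup_truck"}

print(f"top-1 accuracy: {top1_accuracy(predictions, labels):.2%}")  # 66.67%
```

Note that the scorer only ever sees predictions and labels, never the model that produced them, which is exactly the method-agnosticism described next.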

Crucially, the test was completely agnostic about methods: any algorithm that scored well quickly gained credibility, no matter how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural networks would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (Ilya Sutskever, one of AlexNet’s developers, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object-classification task and the actual process of asking a computer to recognize an image. Even amid disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in a real image-recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.
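Part of the trouble is that such a benchmark collapses everything into one headline number. Here is a minimal sketch of a SWE-Bench-style resolve rate; the task categories and results are hypothetical, not drawn from the real benchmark harness.

```python
def resolve_rate(results: list[bool]) -> float:
    """Share of benchmark tasks the system solved: the single headline score."""
    return sum(results) / len(results)

# Hypothetical outcomes: each entry records whether the system's patch made
# that issue's test suite pass.
small_string_fixes = [True, True, True, True]   # narrow, well-represented tasks
cross_module_refactors = [False, True, False]   # rarer, harder tasks

print(f"{resolve_rate(small_string_fixes + cross_module_refactors):.0%}")  # 71%
```

Two systems with very different strengths can land on the same 71%, which is one reason it is hard to say what “broader coding ability” the score actually measures.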

    Where things break down

Anka Reuel, a doctoral student who has been focusing on the benchmark problem as part of her research at Stanford University, has become convinced that the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel said. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like Jacobs of the University of Michigan, Reuel argues that “the main issue with benchmarks is validity, even more than the practical implementation,” noting, “that’s where a lot of things break down.” For a task as complicated as coding, for instance, it is nearly impossible to incorporate every possible scenario into your problem set. As a result, it is hard to gauge whether a model scores better because it is genuinely better at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
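One common heuristic for spotting that kind of problem-set manipulation is a contamination check: if long token sequences from a benchmark problem also appear in the training data, a high score may reflect memorization rather than skill. A minimal sketch, assuming toy strings in place of a real training corpus, which would be scanned at terabyte scale:

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token sequences in the text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(problem: str, training_text: str, n: int = 5) -> float:
    """Fraction of the problem's n-grams that also appear in the training text."""
    probe = ngrams(problem, n)
    return len(probe & ngrams(training_text, n)) / len(probe) if probe else 0.0

# Hypothetical strings standing in for a benchmark item and a training corpus.
problem = "write a function that returns the longest common prefix of a list of strings"
corpus = "snippet seen online: returns the longest common prefix of a list of strings etc"
print(f"{overlap_fraction(problem, corpus):.0%} of 5-grams overlap")  # 60%
```

A high overlap would not prove cheating, but it flags items where a benchmark score says little about genuine capability.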

For developers, the hope is that success on many specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There are just many more knobs you can turn,” said Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have given up on the best practices for evaluation.”
