
    A Chinese company has just launched a constantly changing AI benchmark

    By Daniel68, June 23, 2025

    Development of HongShan's benchmark began in 2022, after ChatGPT's breakout success, as an internal tool for evaluating which models were worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew, they decided to release it to the public.

    Xbench approaches evaluation with two different systems. One is similar to traditional benchmarking: an academic test that measures a model's aptitude across various subjects. The other is more like a technical interview for a job, assessing the real-world economic value a model might deliver.

    Xbench's method for evaluating raw intelligence currently includes two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA is not fundamentally different from existing graduate-level benchmarks such as GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the correct answer but also the chain of reasoning that leads to it.

    DeepResearch, by contrast, focuses on a model's ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance and literature that cannot simply be Googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model's willingness to admit when there is not enough data. One question in the public collection is “How many Chinese cities in the three northwestern provinces border a foreign country?” (If you're wondering, the answer is 12, and only 33% of the models got it right.)
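To make the three scoring criteria above concrete, here is a minimal sketch of how such a rubric could be combined into a single number. The article does not describe Xbench's actual formula, so everything in this block is an assumption for illustration: the `Response` fields, the `score_response` function, the cap on counted sources, and the equal weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical grading inputs; the article does not specify Xbench's real ones.
    distinct_sources: int      # number of independent sources the model cited
    consistent_facts: int      # checkable claims that agree with the reference
    total_facts: int           # all checkable claims in the response
    abstained: bool            # model said there was not enough data
    data_was_sufficient: bool  # grader's judgment that an answer was findable


def score_response(r: Response, max_sources: int = 5) -> float:
    """Return a score in [0, 1] from three illustrative rubric terms."""
    if r.abstained:
        # Abstaining earns credit only when the data really was insufficient.
        return 0.0 if r.data_was_sufficient else 1.0
    # Breadth of sources, capped so that piling on citations stops helping.
    breadth = min(r.distinct_sources, max_sources) / max_sources
    # Share of checkable claims that are consistent with the reference.
    consistency = r.consistent_facts / r.total_facts if r.total_facts else 0.0
    # Equal weights are an arbitrary choice for this sketch.
    return 0.5 * breadth + 0.5 * consistency
```

Under these assumptions, a response that cites two sources and gets one of two checkable claims right scores lower than one that cites five sources with every claim consistent, while a model that correctly declines to answer an unanswerable question is not penalized for doing so.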

    The researchers said on the company's website that they want to add more dimensions to the test, such as how creative a model is in its problem solving, how well it collaborates with other models, and how reliable it is.

    The team has committed to updating the test questions quarterly and to maintaining a half-public, half-private data set.

    To evaluate a model's real-world readiness, the team worked with experts to develop tasks based on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to identify suitable short-video creators from a pool of more than 800 influencers.

    The site also teases upcoming categories, including finance, law, accounting and design. The question sets for these categories have not been open-sourced yet.

    ChatGPT-o3 again ranks first in both of the current professional categories. For recruitment, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok and Gemini all perform well.

    “It is really difficult for benchmarks to include things that are so hard to quantify,” Zihan Zheng said. “But Xbench represents a promising start.”
