After ChatGPT’s breakthrough, development of the benchmark began in 2022 as an internal tool at HongShan for evaluating which models were worth investing in. Since then, the team, led by partner Gong Yuan, has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew, they decided to release it to the public.
Xbench tackles this problem with two distinct systems. One resembles a traditional benchmark: an academic test that gauges a model’s aptitude across various subjects. The other is more like a technical job interview, evaluating the actual economic value a model might deliver.
Xbench’s method of evaluating raw intelligence currently includes two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing graduate-level benchmarks such as GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the correct answer but also the chain of reasoning that leads to it.
DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature that can’t simply be googled but instead demand substantial research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. One question from the public set asks: “How many Chinese cities in the three northwestern provinces border a foreign country?” (If you’re wondering, the answer is 12, and only 33% of the models got it right.)
The researchers say on the company’s website that they want to add more dimensions to the testing, such as how creative a model is in problem solving, how well it collaborates with other models, and how reliable it is.
The team has committed to updating the test questions quarterly and maintaining a half-public, half-private data set.
To assess models’ real-world capabilities, the team worked with experts to develop tasks modeled on actual workflows, initially in recruiting and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to select suitable short-video creators from a pool of more than 800 influencers.
The site also teases upcoming categories, including finance, law, accounting, and design. The problem sets for these categories are not yet open-sourced.
ChatGPT-o3 again ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.
“It’s really hard for benchmarks to capture things that are difficult to quantify,” Zihan Zheng said. “But Xbench represents a promising start.”