After ChatGPT’s breakthrough, development of the benchmark began in 2022 as an internal tool at HongShan for evaluating which models were worth investing in. Since then, the team, led by partner Gong Yuan, has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew, they decided to release it to the public.
Xbench tackles this problem with two distinct systems. One resembles a traditional benchmark: an academic test that gauges a model’s aptitude across various subjects. The other is more like a technical job interview, evaluating the actual economic value a model might deliver.
Xbench’s method of evaluating raw intelligence currently includes two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing graduate-level benchmarks such as GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the correct answer but also the chain of reasoning that leads to it.
DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature that can’t simply be googled but instead demand substantial research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. One question from the public set asks: “How many Chinese cities in the three northwestern provinces border a foreign country?” (If you’re wondering, the answer is 12, and only 33% of the models got it right.)
The researchers say on the company’s website that they want to add more dimensions to the testing, such as how creative a model is in problem solving, how well it collaborates with other models, and how reliable it is.
The team has committed to updating the test questions quarterly and maintaining a half-public, half-private data set.
To assess models’ real-world capabilities, the team worked with experts to develop tasks modeled on actual workflows, initially in recruiting and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to select suitable short-video creators from a pool of more than 800 influencers.
The site also teases upcoming categories, including finance, law, accounting, and design. The problem sets for these categories are not yet open-sourced.
ChatGPT-o3 again ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.
“It’s really hard for benchmarks to capture things that are difficult to quantify,” Zihan Zheng said. “But Xbench represents a promising start.”