Did xAI lie about Grok 3’s benchmarks?

The debate over AI benchmarks, and over how AI labs report them, is spilling into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. xAI co-founder Igor Babuschkin defended the company’s actions, and the two sides disagree on how the results should be read.

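xAI posted a graph on its blog showing Grok 3’s performance on AIME 2025, a set of challenging math questions. While some experts question the validity of AIME as an AI benchmark, it is commonly used to assess a model’s math capabilities. The graph showed two Grok 3 variants, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees noted that the graph did not include o3-mini-high’s score at “cons@64.” Short for “consensus@64,” this setting gives a model 64 attempts at each problem and takes the most frequent answer as its final answer. Because cons@64 tends to raise scores considerably, leaving it off the graph can make one model appear to beat another when it does not.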

A closer look reveals that Grok 3 Reasoning Beta and Grok 3 mini Reasoning scored below o3-mini-high on AIME 2025 at “@1,” meaning the score each model got on its first attempt. Despite this, xAI promotes Grok 3 as the “world’s smartest AI.” Babuschkin countered that OpenAI has published similarly misleading benchmark charts in the past.
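To make the difference between the two settings concrete, here is a minimal Python sketch of how an “@1” score and a “cons@k” score could be computed from the same set of sampled answers. It assumes “@1” counts only the first sampled answer and “cons@k” takes the most frequent of k sampled answers, then checks that answer against the reference. The function names, the toy answers, and the use of five samples instead of 64 are all hypothetical, and this is not the evaluation code used by xAI or OpenAI.

```python
from collections import Counter


def score_at_1(samples, truth):
    """Fraction of problems solved when only the first sampled answer counts."""
    hits = sum(answers[0] == t for answers, t in zip(samples, truth))
    return hits / len(truth)


def score_cons_at_k(samples, truth, k):
    """Fraction solved when the most frequent of the first k answers counts.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    hits = 0
    for answers, t in zip(samples, truth):
        majority, _ = Counter(answers[:k]).most_common(1)[0]
        hits += majority == t
    return hits / len(truth)


# Hypothetical data: 4 problems, 5 sampled answers each (not real AIME items).
truth = [70, 588, 16, 204]
samples = [
    [12, 70, 70, 70, 3],        # first answer wrong, majority right
    [588, 588, 100, 588, 588],  # first answer right, majority right
    [9, 9, 16, 9, 2],           # first answer wrong, majority wrong
    [5, 204, 204, 7, 204],      # first answer wrong, majority right
]

print(score_at_1(samples, truth))          # 0.25
print(score_cons_at_k(samples, truth, 5))  # 0.75
```

On this toy data the same model scores 25% at @1 and 75% at cons@5, which is why comparing one model’s consensus score against another model’s single-attempt score is not a like-for-like comparison.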

A more neutral party in the debate put together a more “accurate” graph showing various models’ performance at cons@64. AI researcher Nathan Lambert pointed out that a key metric is still missing: the computational and monetary cost each model incurred to achieve its best score. That gap shows how little most AI benchmarks convey about a model’s limitations and strengths.
