DeepMind's Gemini Ultra 2 achieves gold-medal performance on FrontierMath
Google DeepMind has published results showing Gemini Ultra 2 solving 83% of the problems in the FrontierMath benchmark, a set of competition-level mathematics problems previously unsolved by AI. Expert mathematicians typically score in the same range on the benchmark.
Google DeepMind today reported that Gemini Ultra 2 has achieved 83% on the FrontierMath benchmark, a private evaluation set built by a panel of research mathematicians and previously considered out of reach for any model.
What FrontierMath Tests
FrontierMath problems are designed to require novel mathematical reasoning rather than pattern recall. Each problem has a single, unambiguous numeric answer and is held in escrow to prevent training-set leakage.
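The grading harness itself is private, but the single-answer property makes exact-match grading straightforward to sketch. The Problem type and grade function below are illustrative assumptions, not the benchmark's actual code; they show only the no-partial-credit comparison against an escrowed answer.

```python
# Illustrative sketch of exact-match grading for a single-numeric-answer
# benchmark. The Problem/grade names are assumptions, not FrontierMath's code.
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Problem:
    statement: str
    answer: Fraction  # exact escrowed answer, never shown to the model

def grade(problem: Problem, model_output: str) -> bool:
    """Return True only on an exact match; there is no partial credit."""
    try:
        submitted = Fraction(model_output.strip())  # accepts "3/4" or "0.75"
    except (ValueError, ZeroDivisionError):
        return False  # unparseable output scores zero
    return submitted == problem.answer
```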
Methodology
DeepMind ran Gemini Ultra 2 in an agentic configuration with access to a Python sandbox and a formal proof verifier. Each problem carried a budget of four hours of wall-clock time, and the model consumed an average of 600,000 reasoning tokens per problem.
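For concreteness, here is a minimal sketch of what an agentic loop under those budgets could look like. The model, sandbox, and verifier interfaces are hypothetical; DeepMind has not published its harness, so this illustrates only the budget accounting and tool-use pattern the methodology describes (treating the reported token average as a hard cap is itself a simplification).

```python
# Minimal sketch of an agentic solve loop with wall-clock and token budgets.
# The model / sandbox / verifier interfaces are hypothetical assumptions;
# DeepMind has not published its evaluation harness.
import time

WALL_CLOCK_BUDGET_S = 4 * 60 * 60  # 4 hours of wall-clock time per problem
TOKEN_BUDGET = 600_000             # reported *average* usage, treated here as a cap

def solve(problem: str, model, sandbox, verifier) -> str | None:
    """Step the model until it commits to an answer or a budget runs out."""
    deadline = time.monotonic() + WALL_CLOCK_BUDGET_S
    tokens_used = 0
    transcript = problem
    while time.monotonic() < deadline and tokens_used < TOKEN_BUDGET:
        step = model.generate(transcript)      # hypothetical model API
        tokens_used += step.token_count
        if step.code:                          # run proposed code in the sandbox
            transcript += sandbox.run(step.code)
        if step.proof:                         # check formal proof attempts
            transcript += verifier.check(step.proof)
        if step.final_answer is not None:
            return step.final_answer           # model committed to an answer
        transcript += step.text
    return None  # budget exhausted without a final answer
```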
Implications
An 83% score places Gemini Ultra 2 at the level of an IMO gold medalist on this benchmark and within a few points of the human-expert ceiling. The team notes that this is the first model to clear the 70% threshold researchers had set as a marker of genuine mathematical reasoning capability.