Background – The International Mathematical Olympiad (IMO)
- Nature of IMO
- Prestigious, annual, global mathematical problem-solving competition for high school students.
- Consists of 6 original problems over two consecutive days: one 4.5-hour session per day with 3 problems each (9 hours total).
- Problems test creativity, logical reasoning, and problem-solving skills rather than advanced formal mathematics.
- Problems are new and unique — never published before in literature or online.
- Medal Criteria
- Gold: score typically equivalent to fully solving about 5 of the 6 problems.
- Silver/Bronze: Lower score thresholds.
- Grading is strict: a single logical or calculation error can invalidate a solution or cost most of its marks.
Relevance: GS 3 (Science and Technology)
AI’s Entry into IMO 2025
- OpenAI’s Announcement
- Used a general-purpose reasoning model, not specialized or trained for IMO.
- Achieved Gold medal-level performance under the same time limits as humans.
- Solutions graded by former IMO medalists hired by OpenAI (led to some disputes over grading accuracy).
- Announcement made before the competition concluded, which some felt overshadowed human participants.
- Google DeepMind’s Attempt
- Used Gemini Deep Think (advanced reasoning model).
- Participated officially with IMO organisers’ permission.
- Scored 35/42 points — a confirmed Gold medal score.
- Solutions praised by IMO graders for clarity, precision, and ease of understanding.
Stages of AI Mathematical Capability Development
- Initial Challenges (ChatGPT launch phase)
- Frequent hallucinations (fabricated facts).
- Basic arithmetic mistakes and flawed reasoning.
- Incapable of reliably solving even moderate-level math problems.
- First Major Improvement – Agents
- Models given ability to:
- Search the web for accurate info.
- Use Python interpreters to perform calculations and verify reasoning (see the sketch below).
- Result: Dramatic increase in accuracy on moderately hard problems.
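A minimal sketch of this tool-use pattern, assuming a hypothetical `run_python_tool` helper and an invented query: instead of guessing at arithmetic, the model emits a short snippet and a Python interpreter returns the exact result. Nothing here is taken from any specific agent framework.

```python
# Illustrative only: a stand-in for the "give the model a Python tool" idea.
# Real agent frameworks sandbox the interpreter far more carefully.

def run_python_tool(code: str) -> str:
    # Hypothetical tool: execute the model's snippet and return `result`.
    namespace: dict = {}
    exec(code, {}, namespace)
    return str(namespace.get("result"))

# The "model" decides the question needs exact computation and writes code
# instead of estimating: 1^2 + 2^2 + ... + 100^2.
tool_call = "result = sum(k * k for k in range(1, 101))"
print(run_python_tool(tool_call))  # 338350, computed rather than guessed
```

The point is only that exact computation replaces estimation, which is what drove the jump in accuracy on moderately hard problems.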
- Second Breakthrough – Reasoning Models
- Examples: OpenAI o3, Gemini 2.5 Pro.
- Work through an internal monologue before answering (see the sketch below):
- Consider multiple approaches before deciding.
- Revisit and refine intermediate reasoning.
- Restart if necessary.
- Aim for a logically consistent final answer.
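A toy illustration of the "consider multiple approaches, check agreement, restart otherwise" loop described above. The problem (counting divisors) and both solution paths are invented for illustration; this is not how o3 or Gemini reason internally.

```python
# Illustrative only: try independent approaches, compare their answers,
# and only commit when they agree; otherwise restart with a new strategy.

def count_divisors_trial(n: int) -> int:
    # Approach 1: brute-force trial division over every candidate divisor.
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def count_divisors_paired(n: int) -> int:
    # Approach 2: pair each divisor d <= sqrt(n) with its partner n // d.
    count, d = 0, 1
    while d * d <= n:
        if n % d == 0:
            count += 1 if d * d == n else 2
        d += 1
    return count

def solve(n: int) -> int:
    candidates = {count_divisors_trial(n), count_divisors_paired(n)}
    if len(candidates) == 1:
        return candidates.pop()  # approaches agree: accept the answer
    raise ValueError("approaches disagree; restart with a new strategy")

print(solve(360))  # 360 = 2^3 * 3^2 * 5 has (3+1)(2+1)(1+1) = 24 divisors
```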
- Proof Verification Systems
- Integration with formal proof checkers like the Lean prover.
- Used to formally verify mathematical proofs for correctness (a minimal Lean example appears below).
- Example: AlphaProof (Google DeepMind, 2024) — Silver medal equivalent (but took 2 days).
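To give a sense of what "formal verification" means in practice, here is a minimal Lean 4 sketch: a statement whose proof the Lean checker accepts or rejects mechanically, the way systems like AlphaProof submit candidate proofs for machine checking. The theorem name is invented for illustration; `Nat.add_comm` is a lemma from Lean's standard library.

```lean
-- Illustrative only: if this file compiles, Lean has verified the proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```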
- Reinforcement Learning with Synthetic Data
- Models generate and test vast quantities of synthetic problems (see the toy sketch below).
- Similar to how AI mastered chess through self-play, starting only from the rules of the game.
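A toy, purely illustrative sketch of the idea: learn from self-generated problems whose answers an automatic check can verify, with the check supplying the reward. The task, strategies, and update rule below are invented; this is not OpenAI's or DeepMind's training method.

```python
import random

def make_problem():
    """Synthetic problem generator: decide whether n is divisible by 3."""
    n = random.randint(1, 1_000_000)
    return n, (n % 3 == 0)

def digit_sum_strategy(n):
    # Reliable rule: n is divisible by 3 iff its digit sum is.
    return sum(int(d) for d in str(n)) % 3 == 0

def random_guess_strategy(n):
    # Baseline that just guesses.
    return random.random() < 0.5

strategies = [digit_sum_strategy, random_guess_strategy]
weights = [1.0, 1.0]

for _ in range(5_000):
    n, truth = make_problem()
    # Pick a strategy in proportion to its current weight (exploration).
    i = random.choices(range(len(strategies)), weights=weights)[0]
    correct = strategies[i](n) == truth
    # The automatic check supplies the reward signal, playing the role
    # a formal verifier would play for real proofs.
    weights[i] += 0.1 if correct else -0.1
    weights[i] = max(weights[i], 0.01)

print({s.__name__: round(w, 1) for s, w in zip(strategies, weights)})
# The reliable strategy's weight grows steadily; the guesser's does not.
```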
Broader Implications
- Research and innovation acceleration:
- AI can assist in generating approaches, identifying related problems, and verifying solutions at high speed.
- Formal proof integration can prevent errors in complex, long-term projects.
- Shift in intellectual benchmarks:
- Benchmarks like the IMO, long regarded as tests of distinctly human ability, may no longer remain exclusive to humans.
- Potential need for redefining measures of human achievement.
- From problem-solving to sustained research:
- Short-term creativity ≠ long-term research reliability.
- Research requires sustained error-free progress over months or years — AI integration with proof systems is a step toward this.
Ethical & Governance Challenges
- Timing of announcements:
- Premature disclosures risk overshadowing human achievements.
- Fairness in evaluation:
- Company-appointed graders create conflict-of-interest perceptions.
- Need for independent verification standards for AI competition results.
- Motivational impact:
- Risk of diminishing incentive for human participation if AI dominance becomes the norm.
- Originality concerns:
- AI combines known ideas but its capacity for truly novel insights remains debated.