Demis Hassabis didn’t mince words. Responding on X, the Google DeepMind CEO summed up the situation in three blunt words: “This is embarrassing.” He was reacting to a celebratory post by Sébastien Bubeck of OpenAI, who claimed that GPT-5 had helped mathematicians solve 10 long-standing open problems. “Science acceleration via AI has officially begun,” Bubeck proclaimed. The moment quickly became a case study in what’s going wrong with AI discourse.
The excitement centered on so-called Erdős problems—mathematical puzzles left behind by the prolific Paul Erdős. A public database tracks more than a thousand of them, noting which are solved and which are not. Bubeck interpreted GPT-5’s findings as genuine breakthroughs. But the database’s curator pushed back, pointing out that “unsolved” often just means “no solution listed yet,” not that no solution exists anywhere. With millions of math papers in circulation, gaps are inevitable—and GPT-5 is exceptionally good at finding obscure prior work.
That’s exactly what happened. GPT-5 hadn’t produced new proofs; it had rediscovered existing ones that hadn’t yet been cataloged. An impressive feat of literature search, yes—but not the revolutionary leap it was advertised as.
Two lessons stand out. First, major scientific claims shouldn’t be rushed onto social media without careful validation. Second, the genuinely impressive part—the model’s ability to surface forgotten or overlooked research—was buried under hype. That capability alone could be transformative for mathematicians, who spend enormous time navigating prior results.
The pattern isn’t isolated. Similar overstatements followed claims that GPT-5 had cracked a previously unsolved math problem, triggering breathless comparisons to historic AI milestones. Experts quickly noted that the problem itself was relatively elementary; solving it was hardly a paradigm shift. As one researcher put it, there’s a persistent tendency to oversell incremental progress.
More grounded evaluations paint a mixed picture. Recent studies examining LLMs in medicine and law show promise in narrow tasks, like diagnosis or summarization, but serious weaknesses in treatment recommendations and legal reasoning. The evidence, researchers concluded, falls well short of what bold claims suggest.
Still, social media rewards spectacle. AI announcements land first on X, amplified by fear of missing out and public sparring among high-profile figures. The result is a feedback loop in which hype outruns scrutiny. Bubeck’s post was embarrassing only because the error was caught; many others never are, and until incentives change, exaggerated claims will keep circulating. As one observer dryly noted, on these platforms, huge claims simply work too well.