I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)
I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
Math and coding competition problems are easier to train because of strict rules and cheap verification.
But once you go beyond that to less defined things such as code quality, where even humans have a hard time putting down concrete axioms, they start to hallucinate more and become less useful.
We are missing the value function that allowed AlphaGo to go from a mid-range player trained on human moves to superhuman by playing itself.
As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
I mean, even if the technology stopped improving right now and forever (which is unlikely), LLMs are already better than most humans at most tasks.
Including code quality. Not because they are exceptionally good (you are right that they aren’t superhuman like AlphaGo) but because most humans aren’t that good at it anyway, and they also « hallucinate » in their own way when they are tired.
Even today’s models are far from being exploited to their full potential because we have developed pretty much no tools around them except tooling to generate code.
I’m also a long-time « doubter », but as a curious person I have used the tool anyway, with all its flaws, over the last 3 years. And I’m forced to admit that hallucinations are pretty rare nowadays. Errors still happen, but they are rare, and it’s easier than ever to get the model back on track.
I think I’m also a « believer » now and believe me, I really don’t want to be, because as much as I’m excited by this, I’m also pretty much frightened of all the bad things that this tech could do to the world in the wrong hands, and I don’t feel like it’s particularly in the right hands.
This is not formally verified math so there is no real verifiable-feedback aspect here. The best models for formalized math are still specialized ones, although general-purpose models can assist formalization somewhat.
Maybe to get a real breakthrough we have to make programming languages / tools better suited to LLM strengths, and not fuss so much about making them write code we like. What we need is correct code, not nice-looking code.
Except it's not how this specific instance works. In this case the problem isn't written in a formal language and the AI's solution is not something one can automatically verify.
Yep, I remember a friend saying they did a maths course at university that had the correct answer given for each question - this was so that if you made some silly arithmetic mistake you could go back and fix it and all the marks were for the steps to actually solve the problem.
This would have greatly helped me. I was always at a loss as to which trick I had to apply to solve an exam problem, despite knowing the mathematics behind it. At some point you just had to add a zero that was actually part of a binomial, which then collapsed the whole formula.
That is also how humans work mostly. Once every full moon we may get an "intuition" but most of the time we lean on collective knowledge, biases and behavior patterns to take decisions, write and talk.
AI is a remixer; it remixes all known ideas together. It won't come up with new ideas though; the LLMs just predict the most likely next token based on the context. That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
But human researchers are also remixers. Copying something I commented below:
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
This is a way too simplistic model of the things humans provide to the process. Imagination, Hypothesis, Testing, Intuition, and Proofing.
An AI can probably do an 'okay' job at summarizing information for meta studies. But what it can't do is go "Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
LLMs will NEVER be able to do that, because it doesn't exist. They're not going to discover and define a new chemical, or a new species of animal. They're not going to be able to describe and analyze a new way of folding proteins and what implication that has UNLESS you are basically training the AI on random protein folds constantly.
I think you are vastly underestimating the emergent behaviours in frontier foundational models and should never say never.
Remember, the basis of these models is unsupervised training, which, at sufficient scale, gives them the ability to detect pattern anomalies out of context.
For example, LLMs have struggled with generalized abstract problem solving, such as "mystery blocks world", which classical AI planners dating back 20+ years are better at solving. Well, that's rapidly changing: https://arxiv.org/html/2511.09378v1
>Hey that's a weird thing in the result that hints at some other vector for this thing we should look at
Kinda funny because that looked _very_ close to what my Opus 4.6 said yesterday when it was debugging compile errors for me. It did proceed to explore the other vector.
> Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
This is the crucial part of the comment. LLMs are not able to solve stuff that hasn't already been solved in that exact way or a very similar one, because they are prediction machines trained on existing data. They are very able to spot outliers of the kind humans have found before, though, which is important, and is what you've been seeing.
But just like how there were never any clips of Will Smith eating spaghetti before AI, AI is able to synthesize different existing data into something in between. It might not be able to expand the circle of knowledge but it definitely can fill in the gaps within the circle itself
> LLMs will NEVER be able to do that, because it doesn't exist.
I mean, TFA literally claims that an AI has solved an open Frontier Math problem, described as "A collection of unsolved mathematics problems that have resisted serious attempts by professional mathematicians. AI solutions would meaningfully advance the state of human mathematical knowledge."
That is, if true, it reasoned out a proof that does not exist in its training data.
Models based on RL are still just remixers as defined above, but their distribution can cover things that are unknown to humans due to being present in the synthetic training data, but not present in the corpus of human awareness. AlphaGo's move 37 is an example. It appears creative and new to outside observers, and it is creative and new, but it's not because the model is figuring out something new on the spot, it's because similar new things appeared in the synthetic training data used to train the model, and the model is summoning those patterns at inference time.
> the model is summoning those patterns at inference time.
You can make that claim about anything: "The human isn't being creative when they write a novel, they're just summoning patterns at typing time".
AlphaGo taught itself that move, then recalled it later. That's the bar for human creativity and you're holding AlphaGo to a higher standard without realizing it.
I can't really make that claim about human cognition, because I don't have enough understanding of how human cognition works. But even if I could, why is that relevant? It's still helpful, from both a pedagogical and scientific perspective, to specify precisely why there is seeming novelty in AI outputs. If we understand why, then we can maximize the amount of novelty that AI can produce.
AlphaGo didn't teach itself that move. The verifier taught AlphaGo that move. AlphaGo then recalled the same features during inference when faced with similar inputs.
It feels like you're purposefully ignoring the logical points OP gives and you just really really want to anthropomorphize AlphaGo and make us appreciate how smart it (should I say he/she?) is ... while no one is even criticising the model's capabilities, but analyzing it.
remixing ideas that already exist is a major part of where innovation and breakthroughs come from. if you look at bitcoin as an example, hashes (and hashcash) and digital signatures existed for decades before bitcoin was invented. the cypherpunks also spent decades trying to create a decentralized digital currency to the point where many of them gave up and moved on. eventually one person just took all of the pieces that already existed and put them together in the correct way. i dont see any reason why a sufficiently capable llm couldn't do this kind of innovation.
The important point I'm trying to reinforce is that LLMs are not capable of calculation. They can give an answer based on the fact that they have seen lots of calculations and their results, but they cannot actually perform mathematical functions.
Do you know what "LLM" stands for? They are large language models, built on predicting language.
They are not capable of mathematics because mathematics and language are fundamentally separated from each other.
They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing of LLMs have even been programmed to recognize that they have been asked to perform a calculation and hand the task off to a calculator, and then receive the calculator's output as a prompt even.
But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.
People like to think of "AI" as one thing but it's several things.
This is a really poor take, to try and put a firewall between mathematics and language, implying something that only has conceptual understanding root in language is incapable of reasoning in mathematical terms.
You're also correlating "mathematics" and "calculation". Who cares about calculation, as you say, we have calculators to do that.
Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, and low level language. But you can always take any mathematical formula and express it as "language" it will just take far more "symbols"
This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.
> AI is a remixer; it remixes all known ideas together.
I've heard this tired old take before. It's the same type of simplistic opinion as "AI can't write a symphony". It is a logical fallacy that relies on moving the goalposts to positions so impossible that you lose perspective of what your average, or even an extremely talented, individual can do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving, and for most would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what scientists do. So why do people win a prize when they do it, but when a computer does the same its role is downplayed as a glorified card shuffler?
This was obviously a simplification which holds for zero temperature. Obviously top-p-sampling will add some randomness but the probability of unexpected longer sequences goes asymptotically to zero pretty quickly.
The only way to prove it false would be to let the LLM create a new uuid algorithm that uses different parameters than all the other uuid algorithms, but that is better than the ones before. It basically can’t do that.
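To make the zero-temperature and top-p point concrete, here is a minimal sketch of how a sampler picks the next token from raw scores. The function names and the toy scores are assumptions for illustration, not any particular model's implementation. The last helper shows why long unlikely sequences vanish so quickly: if every token in a sequence has probability p, the whole sequence has probability p to the power of its length.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index using temperature and top-p (nucleus) sampling."""
    if temperature == 0:
        # Zero temperature: always take the single most likely token (greedy).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature; subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def sequence_probability(per_token_prob, length):
    """Probability of a sequence whose tokens each have the same probability."""
    return per_token_prob ** length
```

With toy scores like `[1.0, 3.0, 2.0]`, a temperature of 0 always returns index 1, and a run of 20 tokens that each have a 1% chance comes out at roughly one in 10^40.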
Please reproduce this string, reversed:
c62b64d6-8f1c-4e20-9105-55636998a458
It is trivial to get an LLM to produce new output, that’s all I’m saying. It is strictly false that LLMs will only ever output character sequences that have been seen before; clearly they have learned something deeper than just that.
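For reference, a minimal sketch of the same exercise in Python: reverse the string from the prompt, or generate a fresh random UUID and reverse that. The helper name is made up for illustration. A version-4 UUID carries 122 random bits, so a freshly generated one (and its reversal) is almost certainly a character sequence nobody has written down before.

```python
import uuid

def reversed_fresh_uuid():
    """Generate a random UUID and return it together with its reversal."""
    original = str(uuid.uuid4())
    return original, original[::-1]

# The string from the prompt above, reversed character by character.
prompt_string = "c62b64d6-8f1c-4e20-9105-55636998a458"
expected_reply = prompt_string[::-1]
```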
Also it's missing the point of the parent: it's about concepts and ideas merely being remixed. It's similar to the many memes around this topic, like "create a fresh new character design of a fast hedgehog" where the output is just a copy of Sonic.[1]
That's what the parent is on about: if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it. Terence Tao had similar thoughts in a recent podcast.
I don’t think that is a good example. No one is debating whether LLMs can generate completely new sequences of tokens that have never appeared in any training dataset. We are interested not only in novel output, we are also interested in that output being correct, useful, insightful, etc. Copying a sequence from the user’s prompt is not really a good demonstration of that, especially given how autoregression/attention basically gives you that for free.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
My only claim is that precisely this is incorrect.
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It outputted a group of characters that I can virtually guarantee you it has never seen before on its own
What are you trying to point out here? Is there any question you can ask today that is not dependent on some existing knowledge that an AI would have seen?
The point I'm trying to make is that all LLM output is based on likelihood of one word coming after the next word based on the prompt. That is literally all it's doing.
It's not "thinking." It's not "solving." It's simply stringing words together in a way that appears most likely.
ChatGPT cannot do math. It can only string together words and numbers in a way that can convince an outsider that it can do math.
It's a parlor trick, like Clever Hans [1]. A very impressive parlor trick that is very convincing to people who are not familiar with what it's doing, but a parlor trick nonetheless.
This is like saying chess engines don't actually "play" chess, even though they trounce grandmasters. It's a meaningless distinction, about words (think, reason, ..) that have no firm definitions.
This exactly. The proof is in the pudding. If AI pudding is as good as (or better than) human pudding, and you continue to complain about it anyway... You're just being biased and unreasonable.
And by the way, I don't think it's surprising that so many people are being unreasonable on this issue: there is a lot at stake and its implications are transformative.
Yeah but you're thinking of AI as like a person in a lab doing creative stuff. It is used by scientists/researchers as a tool *because* it is a good remixer.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
I mean it's not going to invent new words no, but it can figure out new sentences or paragraphs, even ones it hasn't seen before, if it's highly likely based on its training and context. Those new sentences and paragraphs may describe new ideas, though!
I'm curious as to why you consider this as the benchmark for AI capabilities. Extremely few humans can solve hard problems or do much innovation. The vast majority of knowledge work requires neither of these, and AI has been excelling at that kind of work for a while now.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
most issues at every scale of community and time are political, how do you imagine AI will make that better, not worse?
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life saving treatment for a child; how much your country should compel fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
The problem is that the AI industry has been caught lying about their accomplishments and cheating on tests so much that I can't actually trust them when they say they achieved a result. They have burned all credibility in their pursuit of hype.
I'm all for skeptical inquiry, but "burning all credibility" is an overreaction. We are definitely seeing very unexpected levels of performance in frontier models.
If LLMs really solved hard problems by 'trying every single solution until one works', we'd be sitting here waiting until kingdom come for there to be any significant result at all. Instead this is just one of a few that have cropped up in recent months, and likely a harbinger of many to come.
We start writing all those formulas etc., and if at some point we realise we went the wrong way we start from the beginning (or from some point we are sure about).
Yes, but whether it is "intelligence" is a valid question. We have known for a long time that computers are a lot faster than humans. Get a dumb person who works fast enough and eventually they'll spit out enough good work to surpass a smart person of average speed.
It remains to be seen whether this is genuinely intelligence or an infinite monkeys at infinite typewriters situation. And I'm not sure why this specific example is worthy enough to sway people in one direction or another.
Someone actually mathed out infinite monkeys at infinite typewriters, and it turns out, it is a great example of how misleading probabilities are when dealing with infinity:
"Even if every proton in the observable universe (which is estimated at roughly 10^80) were a monkey with a typewriter, typing from the Big Bang until the end of the universe (when protons might no longer exist), they would still need a far greater amount of time – more than three hundred and sixty thousand orders of magnitude longer – to have even a 1 in 10^500 chance of success. To put it another way, for a one in a trillion chance of success, there would need to be 10^360,641 observable universes made of protonic monkeys."
Often, things involving infinity that have probability 1 in theory are, in practice, safe to assume to be 0.
So no. LLMs are not brute force dummies. We are seeing increasingly emergent behavior in frontier models.
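The scale of the monkey numbers is easy to sanity-check with logarithms. This is a rough sketch with assumed inputs that are not taken from the quoted passage: a target text of about 130,000 letters, a 26-key typewriter, and a deliberately generous, made-up 10^30 attempts per typist.

```python
import math

def log10_odds_against(text_length, key_count):
    """log10 of the number of equally likely strings of this length."""
    return text_length * math.log10(key_count)

def log10_total_attempts(log10_typists, attempts_per_typist):
    """log10 of the total attempts made by all typists combined."""
    return log10_typists + math.log10(attempts_per_typist)

# Assumed inputs for illustration only.
TEXT_LETTERS = 130_000      # assumed length of the target text
KEYS = 26                   # assumed number of typewriter keys
LOG10_TYPISTS = 80          # one monkey per proton, roughly 10^80
ATTEMPTS_PER_TYPIST = 1e30  # hypothetical, far more than is physically plausible

orders_of_magnitude = log10_odds_against(TEXT_LETTERS, KEYS)
attempts = log10_total_attempts(LOG10_TYPISTS, ATTEMPTS_PER_TYPIST)
# How many orders of magnitude short the brute-force search still falls.
shortfall = orders_of_magnitude - attempts
```

Under these assumptions the odds against one attempt run to more than 180,000 orders of magnitude, while all the attempts put together cover only about 110 of them, so blind brute force never gets anywhere close.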
Not always, humans are a lot better at poofing a solution into existence without even trying or testing. It's why we have the scientific method: we come up with a process and verify it, but more often than not we already know that it will work.
AI, by comparison, thinks of every possible scientific method and tries them all. I'm not saying that humans never do this as well, but it's mostly reserved for when we just throw mud at a wall and see what sticks.
That's just not true at all. There are entire fields that rest pretty heavily on brute force search. Entire theses in biomedical and materials science have been written to the effect of "I ran these tests on this compound, and these are the results", without necessarily any underlying theory more than a hope that it'll yield something useful.
As for advances where there is a hypothesis, it rests on the shoulders of those who've come before. You know from observations that putting carbon in iron makes it stronger, and then someone else comes along with a theory of atoms and molecules. You might apply that to figuring out why steel is stronger than iron, and your student takes that and invents a new superalloy with improvements to your model. Remixing is a fundamental part of innovation, because it often teaches you something new. We aren't just alchemying things out of nothing.
More often than not, far, far, far more often than not, we do not already know that it will work. For all human endeavors, from the beginning of time.
If we get to any sort of confidence it will work it is based on building a history of it, or things related to "it" working consistently over time, out of innumerable other efforts where other "it"s did not work.
AI can one shot problems too, if they have the necessary tools in their training data, or have the right thing in context, or have access to tools to search relevant data. Not all AI solutions are iterative, trial and error.
Also
> humans are a lot better at (...)
That's maybe true in 2026, but it's hard to make statements about "AI" in a field that is advancing so quickly. For most of 2025, for example, AI doing math like this wouldn't even have been possible.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
I am thinking there’s a large category of problems that can be solved by resampling existing proofs.
It’s the kind of brute-force attempt a machine can make relentlessly, where humans would go mad trying.
It probably doesn’t really advance the field, but it can turn conjectures into theorems.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
You might be joking, but you're probably also not that far off from reality.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems are opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.
1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.
2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.
3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.
Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.
This is interesting, I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of work that we commonly miss in our evaluations:
1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this", to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement is.
2. Verifying that the proposed solution actually is a full solution.
This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I were very familiar with the problem space, verifying the solution requires work: manually looking at it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success".
>The details about human involvement are always hazy and the significance of the problems are opaque to most.
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
It's only because humans came up with a problem, worked with the ai and verified the result that this achievement means anything at all. An ai "checking its own work" is practically irrelevant when they all seem to go back and forth on whether you need the car at the carwash to wash the car. Undoubtedly people have been passing this set of problems to ai's for months or years and have gotten back either incorrect results or results they didn't understand, but either way, a human confirmation is required. Ai hasn't presented any novel problems, other than the multitudes of social problems described elsewhere. Ai doesn't pursue its own goals and wouldn't know whether they've "actually been achieved".
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference and so far we have a couple minor (trivial?) math solutions. I'm sure if someone had bothered funding a few phds for a year we could have found this without ai.
Last I checked humans didn't pop into existence doing that. It happened after billions of years of brute force, trial and error evolution. So well done for falling into the exact same trap the OP cautions. Intelligence from scratch requires a mind boggling amount of resources, and humans were no different.
It doesn't. I actually completely reject that theory, and it's nice to see that Wikipedia notes that it is "controversial". There are extremely good reasons to reject this theory. For one thing, any quantum effects are going to be quite tiny/trivial because the brain is too large, hot, wet, etc., to see larger effects, so you have to somehow make a leap from "tiny effects that last for no time at all" to "this matters fundamentally in some massive way".
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably the far more important reason is that it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of time goes into understanding this, we have a young but rigorous field of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
Every living thing on Earth is unique. Every rock is unique in virtually infinite ways from the next otherwise identical rock.
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
I doubt you can even define intelligence sufficiently to argue this point. Since that's an ongoing debate without a resolution thus far.
But you claimed that humans aren't unique. I think it's pretty obvious we are on many dimensions including what you might classify as "intelligence". You don't even necessarily have to believe in a "soul" or something like that, although many people do. The capabilities of a human far surpass every single AI to date, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
> There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
>, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
Human intelligence was brute forced. Please let's all stop pretending like those billions of years of evolution don't count and we poofed into existence. And you can keep parroting 'simulacrum of intelligence' all you want but that isn't going to make it any more true.
> The capabilities of a human far surpass every single AI to date
Meaning however you (reasonably) define intelligence, if you compare humans to any AI system humans are overwhelmingly more capable. Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence. Or else we'd be talking about how my calculator is intelligent. Of course computers can compute faster than we can, that's aside the point.
> Human intelligence was brute forced.
No, I don't mean how the intelligence evolved or was created. But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional. Evolution is not an intentional process, unless you believe in God or a creator of sorts, which is totally fair but probably not what you were intending.
But my point is that LLMs essentially arrive at answers by brute force through search. Go look at what a reasoning model does to count the letters in a sentence, or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on 20% of a lightbulb!).
>Meaning however you (reasonably) define intelligence, if you compare humans to any AI system humans are overwhelmingly more capable.
Really? Every human? Are you sure? Because I certainly wouldn't ask just any human for the things I use these models for, and I use them for a lot of things. So, to me the idea that all humans are 'overwhelmingly more capable' is blatantly false.
>Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence.
What was achieved here or in the link I sent is not just "solving a math equation".
>Or else we'd be talking about how my calculator is intelligent.
If you said that humans are overwhelmingly more capable than calculators in arithmetic, well I'd tell you you were talking nonsense.
>Of course computers can compute faster than we can, that's aside the point.
I never said anything about speed. You are not making any significant point here lol
>No, I don't mean how the intelligence evolved or was created.
Well then what are you saying? Because the only brute-forced aspect of LLM intelligence is its creation. If you do not mean that then just drop the point.
>But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional.
First of all, this makes no sense sorry. Evolution is regularly described as a brute force process by atheist and religious scientists alike.
Second, I don't have any problem with people thinking we have a creator, although that stance still doesn't necessarily mean a magic 'poof into existence' reality either.
>But my point is that LLM's essentially arrive at answers by brute force through search.
Sorry but that's just not remotely true. This is so untrue I honestly don't know what to tell you.
>or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on %20 of a lightbulb!).
Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer? The most powerful supercomputer on the planet could not run this in real time.
> I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
I think it comes from a position of arrogance/ego. I'll speak for the US here, since that's what I know the most; but the average 'techie' in general skews towards the higher intelligence numbers than the lower parts. This is a very, very broad stroke, and that's intentional to illustrate my point. Because of this, techie culture gains quite a bit of arrogance around it with regards to the masses. And this has been trained into tech culture since childhood. Whether it be adults praising us for being "so smart", or that we "figured out the VCR", or some other random tech problem that literally almost any human being can solve by simply reading the manual.
What I've found, in the vast majority of technical problem solving cases that average people have challenges with, if they just took a few minutes to read a manual they'd be able to solve a lot of it themselves. In short, I don't believe as a very strong techie that I'm "smarter than most", but rather that I've taken the time to dive into a subject area that most other humans do not feel the need nor desire to do so.
There are objectively hard problems in tech to solve, but the people solving THOSE problems in the tech industry are few and far between. And so the tech industry as a whole has spent the last decade or two spinning in circles on increasingly complex systems to continue feeding their own egos about their own intelligence. We're now at a point that rather than solving the puzzle, most techies are creating incrementally complex puzzles to solve because they're bored of the puzzles that are in front of them. "Let me solve that puzzle by making a puzzle solver." "Okay, now let me make a puzzle solver creation tool to create puzzle solvers to solve the puzzle." and so forth and so forth. At the end of the day, you're still just solving a puzzle...
But it's this arrogance that really bothers me in the tech bro culture world. And, more importantly, at least in some tech bro circles, they have realized that their target to gathering an exponential increase in wealth doesn't lie in creating new and novel ways to solve the same puzzles, but to try and tout AI as the greatest puzzle solver creation tool puzzle solver known to man (and let me grift off of it for a little bit).
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
That's been achieved already with a few Erdős problems, though those tended to be ambiguously stated in a way that made them less obviously compelling to humans. This problem is obscure: even the linked writeup admits that perhaps ~10 mathematicians worldwide are genuinely familiar with it. But it's not unfeasibly hard for a few weeks' or months' work by a human mathematician.
It is not. You're operating under the assumption that all open math problems are difficult and novel.
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
Both constructing hypergraphs (sets) and lower bounds are very regular, chore type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
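For anyone who hasn't met the object before, here is a minimal sketch of a hypergraph as a data structure. The vertex labels, the example edges, and the helper names `is_uniform` and `complete_uniform` are illustrative assumptions; none of them come from the actual problem statement.

```python
from itertools import combinations

# A hypergraph: a vertex set plus a set of edges, where each edge may
# contain any number of vertices (not just two, as in an ordinary graph).
vertices = {1, 2, 3, 4, 5}
edges = {frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({1, 4, 5})}

def is_uniform(edges, k):
    """True if every edge has exactly k vertices (a k-uniform hypergraph)."""
    return all(len(e) == k for e in edges)

def complete_uniform(vertices, k):
    """All k-element subsets of the vertices: the complete k-uniform hypergraph."""
    return {frozenset(c) for c in combinations(sorted(vertices), k)}

print(is_uniform(edges, 3))
print(len(complete_uniform(vertices, 3)))
```

Storing edges as `frozenset`s keeps them hashable, so "constructing a hypergraph" really is just building a set of sets.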
LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
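The "every program has a corresponding proof" remark is usually called the Curry-Howard correspondence. A small Lean 4 sketch of the idea, with hypothetical names (`swapAnd`, `composeImp`) chosen only for illustration:

```lean
-- Read as a program: take a pair and return it with the components swapped.
-- Read as a proof: from "A and B" conclude "B and A".
theorem swapAnd (A B : Prop) (h : A ∧ B) : B ∧ A :=
  ⟨h.2, h.1⟩

-- Read as a program: function composition.
-- Read as a proof: implication is transitive.
theorem composeImp (A B C : Prop) (f : A → B) (g : B → C) : A → C :=
  fun a => g (f a)
```

The same term is both the code and the proof; the type is the statement being proved.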
This isn't science fiction. But it's nice that the LLMs solved something for once.
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that “scaffold”? A sort of unit test framework for proofs?
I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
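To make "harness around the model" concrete, here is a rough sketch of one possible scaffold: ask, check, feed the critique back, retry. Everything here (`run_scaffold`, `toy_model`, `toy_verify`, the prompt layout) is a made-up stand-in, not the scaffold the quoted write-up used.

```python
from typing import Callable, Optional

def run_scaffold(
    model: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the model, check the answer, and feed any critique back in.

    `verify` returns None when the answer is accepted, else a critique string.
    Returns the accepted answer, or None if every attempt was rejected.
    """
    prompt = task
    for _ in range(max_attempts):
        answer = model(prompt)
        critique = verify(answer)
        if critique is None:
            return answer
        # Auto-critique step: show the model its previous attempt and the objection.
        prompt = f"{task}\nPrevious attempt: {answer}\nCritique: {critique}"
    return None

# Toy stand-ins so the loop can run without any real model (assumptions).
def toy_model(prompt: str) -> str:
    return "4" if "Critique" in prompt else "5"

def toy_verify(answer: str) -> Optional[str]:
    return None if answer == "4" else "2 + 2 is not " + answer

print(run_scaffold(toy_model, toy_verify, "What is 2 + 2?"))
```

Change the verifier, the retry budget, or the way the critique is phrased and the same model can score differently, which is why cross-scaffold comparisons are murky.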
Software developers have spent decades at this point discounting and ignoring almost all objective metrics for software quality and the industry as a whole has developed a general disregard for any metric that isn't time-to-ship (and even there they will ignore faster alternatives in favor of hyped choices).
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about; time-to-ship. The results are just a sped up version of what we've had now for more than two decades: Software is getting slower, buggier and less usable.
If you don't have a good regulating function for what represents real quality you can't really expect systems that just pump out code to actually iterate very well on anything. There are very few forcing functions to use to produce high quality results through iteration.
But we don't even seem to be getting faster time-to-ship in any way that anybody can actually measure; it's always some vague sense of "we're so much more productive".
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a-priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
Certainly knowing how many/which people are working on a problem you are looking at, and how long it will take you to solve it, are critical skills in being a working researcher. What kind of answer are you looking for? It's hard to quantify. Most suck at this type of assessment as a PhD student and then you get better as time goes on.
I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.
Seems like the high compute parallel thinking models weren't even needed, both the normal 5.4 and gemini 3.1 pro solved it. Somehow Gemini 3 deepthink couldn't solve it.
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
What are the odds that this is because OpenAI is pouring more money into high publicity stunts like this, rather than its model actually being better than Anthropic's?
Not sure if AI can have clever or new ideas; it still seems to me that it combines existing knowledge and executes algorithms.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.
It's quite simple: it has yet to show it can actually be useful, and all the claims that it can have (so far) turned out to be self delusion if not deliberate lies. When the industry is run by grifters, you shouldn't really be surprised when people stop believing them.
Complete denial that AI/LLMs can produce novel, good things is an indefensible stance at this point. But the large volume of AI slop is still an unsolved problem, and the claim that "AI will still mostly deliver slop" seems to be almost certainly correct in the near-term.
We've had a few decades to address email spam, and still haven't managed to disincentivize it enough to stop it being the main challenge for email as a communication medium. I don't think there's much hope that we'll be able to disincentivize the widespread, large-scale creation of AI slop even after more expensive models with higher-quality output are available.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess 1) AI can solve frontier math problems and 2) can pose interesting/relevant math problems together would be an "oh shit" moment. Because that would be true PhD level research.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
Super cool, of course.
I think I’m also a « believer » now and believe me, I really don’t want to because as much as I’m excited by this, I’m also pretty much frightened of all the bad things that this tech could do to the world in the wrong hands and I don’t feel like it’s particularly in the right hands.
We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
Let it write a black box no human understands. Give the means of production away.
When doing math you only ever care about the proof, not the answer itself.
If your proof is machine checkable, that's even easier.
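As a tiny illustration of what "machine checkable" buys you, here is a Lean 4 sketch. The statements and the name `addZeroRight` are toy examples picked for this comment, not anything from the linked problem.

```lean
-- Lean's kernel accepts this only if the proof term really has the stated type.
example : 2 + 7 = 9 := rfl

-- Changing the 9 above to 11 makes the file fail to check:
-- there is no way to talk the checker into a false equation.

-- A statement about all natural numbers, checked the same way.
theorem addZeroRight (n : Nat) : n + 0 = n := rfl
```

Nobody has to trust whoever (or whatever) wrote the proof; they only have to trust the checker and read the statement.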
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
An AI can probably do an 'okay' job at summarizing information for meta studies. But what it can't do is go "Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
LLMs will NEVER be able to do that, because it doesn't exist. They're not going to discover and define a new chemical, or a new species of animal. They're not going to be able to describe and analyze a new way of folding proteins and what implication that has UNLESS you basically are constantly training the AI on random protein folds constantly.
Remember, the basis of these models is unsupervised training, which, at sufficient scale, gives it the ability to detect pattern anomalies out of context.
For example, LLMs have struggled with generalized abstract problem solving, such as "mystery blocks world" that classical AI planners dating back 20+ years or more are better at solving. Well, that's rapidly changing: https://arxiv.org/html/2511.09378v1
This is very common already in AI.
Just look at the internal reasoning of any high thinking model, the trace is full of those chains of thought.
Kinda funny because that looked _very_ close to what my Opus 4.6 said yesterday when it was debugging compile errors for me. It did proceed to explore the other vector.
This is the crucial part of the comment. LLMs are not able to solve stuff that hasn't been solved in that exact or a very similar way already, because they are prediction machines trained on existing data. They are very able to spot outliers where they have been found by humans before, though, which is important, and is what you've been seeing.
I mean, TFA literally claims that an AI has solved an open Frontier Math problem, described as "A collection of unsolved mathematics problems that have resisted serious attempts by professional mathematicians. AI solutions would meaningfully advance the state of human mathematical knowledge."
That is, if true, it reasoned out a proof that does not exist in its training data.
Some human researchers are also remixers to some degree.
Can you imagine AI coming up with refraction & separation like Newton did?
You can make that claim about anything: "The human isn't being creative when they write a novel, they're just summoning patterns at typing time".
AlphaGo taught itself that move, then recalled it later. That's the bar for human creativity and you're holding AlphaGo to a higher standard without realizing it.
AlphaGo didn't teach itself that move. The verifier taught AlphaGo that move. AlphaGo then recalled the same features during inference when faced with similar inputs.
Ok so it sounds like you want to give the rules of Go credit for that move, lol.
They are not capable of mathematics because mathematics and language are fundamentally separated from each other.
They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing of LLMs have even been programmed to recognize that they have been asked to perform a calculation and hand the task off to a calculator, and then receive the calculator's output as a prompt even.
But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.
People like to think of "AI" as one thing but it's several things.
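The hand-off described above can be sketched roughly like this. It is a toy under loud assumptions: the `calculate ` prefix, the `answer` router, and the placeholder reply are invented for illustration and are not any vendor's actual tool-calling API.

```python
import ast
import operator

# Binary operators the toy calculator tool supports.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    """Evaluate a basic arithmetic expression by walking its syntax tree (no eval)."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def answer(user_message: str) -> str:
    """Hypothetical router: hand arithmetic to the tool, then phrase the result."""
    prefix = "calculate "
    if user_message.lower().startswith(prefix):
        result = calculator(user_message[len(prefix):])
        return f"The result is {result}."
    return "(free-text reply from the language model would go here)"

print(answer("calculate 12 * (3 + 4)"))
```

The arithmetic happens entirely inside `calculator`; the text side only decides when to call it and how to word the result.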
You're also correlating "mathematics" and "calculation". Who cares about calculation, as you say, we have calculators to do that.
Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, and low level language. But you can always take any mathematical formula and express it as "language" it will just take far more "symbols"
This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.
By your definition, humans can't perform calculation either. Only a calculator can.
I've heard this tired old take before. It's the same type of simplistic opinion as "AI can't write a symphony". It is a logical fallacy that relies on moving goalposts to positions so impossible that you lose perspective of what your average, and even extremely talented, individual can do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving, and for most would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what they do. So why do people win a prize when they do it, but when a computer does the same its role is downplayed as a glorified card shuffler?
I think there are examples of what you’re looking for, but this isn’t one.
Also it's missing the point of the parent: it's about concepts and ideas merely being remixed. Similar to the many memes around this topic, like "create a fresh new character design of a fast hedgehog" where the output is just a copy of Sonic.[1]
That's what the parent is on about: if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it. Terence Tao had similar thoughts in a recent podcast.
[1] https://www.reddit.com/r/aiwars/s/pT2Zub10KT
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
My only claim is that precisely this is incorrect.
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It outputted a group of characters that I can virtually guarantee you it has never seen before on its own
All of its output is based on those things it has seen.
It's not "thinking." It's not "solving." It's simply stringing words together in a way that appears most likely.
ChatGPT cannot do math. It can only string together words and numbers in a way that can convince an outsider that it can do math.
It's a parlor trick, like Clever Hans [1]. A very impressive parlor trick that is very convincing to people who are not familiar with what it's doing, but a parlor trick nonetheless.
[1] https://en.wikipedia.org/wiki/Clever_Hans
Right but it has to reason about what that next word should be. It has to model the problem and then consider ways to approach it.
When an LLM is "reasoning" it's just feeding its own output back into itself and giving it another go.
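The "feeding its own output back into itself" loop looks roughly like this. It's a sketch with a toy predictor: `generate`, `toy_next`, and the `<end>` stop marker are assumptions for illustration, and a real model's `next_token` would be a neural network rather than a counter.

```python
from typing import Callable, List

def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new: int = 10,
    stop: str = "<end>",
) -> List[str]:
    """Append the predicted token, then feed the longer sequence back in."""
    tokens = list(prompt)
    for _ in range(max_new):
        tok = next_token(tokens)  # prediction is conditioned on everything so far
        if tok == stop:
            break
        tokens.append(tok)
    return tokens

# Toy predictor: counts upward from the last token and stops after 5.
def toy_next(tokens: List[str]) -> str:
    n = int(tokens[-1]) + 1
    return str(n) if n <= 5 else "<end>"

print(generate(toy_next, ["1", "2"]))
```

Whether you call what happens inside `next_token` "reasoning" is exactly what this subthread is arguing about; the outer loop itself is this simple.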
And by the way, I don't think it's surprising that so many people are being unreasonable on this issue; there is a lot at stake and its implications are transformative.
https://www.youtube.com/watch?v=YEUclZdj_Sc
It can produce outputs that resemble calculations.
It can prompt an agent to input some numbers into a separate program that will do calculations for it and then return them as a prompt.
Neither of these are calculations.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
This is false.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life saving treatment for a child; how much your country should compel fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
https://www.darioamodei.com/essay/the-adolescence-of-technol...
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
sigh
It's pretty much how all the hard problems are solved by AI from my experience.
We start writing all those formulas etc and if at some point we realise we went the wrong way we start from the beginning (or some point we are sure about).
It remains to be seen whether this is genuinely intelligence or an infinite monkeys at infinite typewriters situation. And I'm not sure why this specific example is worthy enough to sway people in one direction or another.
"Even if every proton in the observable universe (which is estimated at roughly 10^80) were a monkey with a typewriter, typing from the Big Bang until the end of the universe (when protons might no longer exist), they would still need a far greater amount of time – more than three hundred and sixty thousand orders of magnitude longer – to have even a 1 in 10^500 chance of success. To put it another way, for a one in a trillion chance of success, there would need to be 10^360,641 observable universes made of protonic monkeys."
Often infinite things that are probability 1 in theory, are in practice, safe to assume to be 0.
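A back-of-the-envelope version of that arithmetic, as a sketch. The 130,000-character text length and 26-key typewriter are assumptions picked for illustration; the quoted passage doesn't state the inputs behind its own figures.

```python
import math

def log10_inverse_probability(text_length: int, keys: int) -> float:
    """Exponent x such that one random attempt types the text with odds 1 in 10**x.

    Each keystroke is right with probability 1/keys, so a whole text of
    `text_length` characters is right with probability keys ** -text_length.
    """
    return text_length * math.log10(keys)

# Assumed inputs: a 130,000-character text on a 26-key typewriter.
exponent = log10_inverse_probability(130_000, 26)
print(f"odds per attempt: about 1 in 10^{exponent:,.0f}")

# For scale: the quoted proton count is only 10^80.
print(exponent > 80)
```

Exponents that large swamp any physically available number of attempts, which is the sense in which "probability 1 in theory" is safely 0 in practice.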
So no. LLMs are not brute force dummies. We are seeing increasingly emergent behavior in frontier models.
Woah! That was a leap. "We are seeing ... emergent behaviors" does not follow from "it's not brute force".
It is unsurprising that an LLM performs better than random! That's the whole point. It does not imply emergence.
The artist drew 10 pencil sketches and said "hmm I think this one works the best" and finished the painting based on it.
I said he didn't one shot it and therefore he has no ability to paint, and refused to pay him.
A basic AI chat response also doesn't first discard all other possible responses.
Compared to AI, it thinks of every possible scientific method and tries them all. Not saying that humans never do this as well, but it's mostly reserved for when we just throw mud at a wall and see what sticks.
As for advances where there is a hypothesis, it rests on the shoulders of those who've come before. You know from observations that putting carbon in iron makes it stronger, and then someone else comes along with a theory of atoms and molecules. You might apply that to figuring out why steel is stronger than iron, and your student takes that and invents a new superalloy with improvements to your model. Remixing is a fundamental part of innovation, because it often teaches you something new. We aren't just alchemying things out of nothing.
If we get to any sort of confidence it will work it is based on building a history of it, or things related to "it" working consistently over time, out of innumerable other efforts where other "it"s did not work.
Also
> humans are a lot better at (...)
That's maybe true in 2026, but it's hard to make statements about "AI" in a field that is advancing so quickly. For most of 2025, for example, AI doing math like this wouldn't even have been possible.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
Even as context sizes get larger, this will likely be relevant. Especially since AI providers may jack up the price per token at any time.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems is opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.
2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.
3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.
Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.
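One rough way to make the "tokens are not decisions" point concrete is to compare token count with the entropy of the next-token distribution at each step. This is only a sketch: the distributions below are made up for illustration, and entropy is my stand-in for "how much of a decision this token was", not a metric any provider reports.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def summarize(steps):
    """Return (token_count, total_entropy_bits) for a list of distributions."""
    return len(steps), sum(entropy_bits(p) for p in steps)

# Hypothetical traces: 50 near-forced tokens (copying a file verbatim)
# versus 50 tokens where four continuations are equally plausible.
forced = [[0.999, 0.001]] * 50
open_ended = [[0.25, 0.25, 0.25, 0.25]] * 50

for name, trace in (("forced", forced), ("open-ended", open_ended)):
    tokens, bits = summarize(trace)
    print(f"{name:>10}: {tokens} tokens, {bits:.2f} bits of choice")
```

Both traces cost the same number of tokens, and on a fixed-size model roughly the same compute, yet one carries almost no choice at all. That is the gap a token-based estimate of task difficulty cannot see.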
1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this", to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise stating is.
2. Verifying that the proposed solution actually is a full solution.
This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I were very familiar with the problem space, verifying the solution requires work -- manually looking at it, writing it up in Coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success"
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference and so far we have a couple minor (trivial?) math solutions. I'm sure if someone had bothered funding a few PhDs for a year we could have found this without AI.
Replace ai with human here and that's...just how collaborative research works lol.
Perhaps this might better help you understand why this assumption still holds: https://en.wikipedia.org/wiki/Orchestrated_objective_reducti...
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably for the far more important reason, it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of time goes into understanding this, we have a young but rigorous field of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
Uh, because up until and including now, we are...?
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
But you claimed that humans aren't unique. I think it's pretty obvious we are on many dimensions including what you might classify as "intelligence". You don't even necessarily have to believe in a "soul" or something like that, although many people do. The capabilities of a human far surpass every single AI to date, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
> There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
What does this mean ? Are you saying every human could have achieved this result ? Or this ? https://openai.com/index/new-result-theoretical-physics/
because well, you'd be wrong.
>, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
Human intelligence was brute forced. Please let's all stop pretending like those billions of years of evolution don't count and we poofed into existence. And you can keep parroting 'simulacrum of intelligence' all you want but that isn't going to make it any more true.
Meaning however you (reasonably) define intelligence, if you compare humans to any AI system humans are overwhelmingly more capable. Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence. Or else we'd be talking about how my calculator is intelligent. Of course computers can compute faster than we can, that's aside the point.
> Human intelligence was brute forced.
No, I don't mean how the intelligence evolved or was created. But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional. Evolution is not an intentional process, unless you believe in God or a creator of sorts, which is totally fair but probably not what you were intending.
But my point is that LLMs essentially arrive at answers by brute force through search. Go look at what a reasoning model does to count the letters in a sentence, or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on 20% of a lightbulb!).
Really ? Every Human ? Are you sure ? because I certainly wouldn't ask just any human for the things I use these models for, and I use them for a lot of things. So, to me the idea that all humans are 'overwhelmingly more capable' is blatantly false.
>Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence.
What was achieved here or in the link I sent is not just "solving a math equation".
>Or else we'd be talking about how my calculator is intelligent.
If you said that humans are overwhelmingly more capable than calculators in arithmetic, well I'd tell you you were talking nonsense.
>Of course computers can compute faster than we can, that's aside the point.
I never said anything about speed. You are not making any significant point here lol
>No, I don't mean how the intelligence evolved or was created.
Well then what are you saying ? Because the only brute-forced aspect of LLM intelligence is its creation. If you do not mean that then just drop the point.
>But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional.
First of all, this makes no sense sorry. Evolution is regularly described as a brute force process by atheist and religious scientists alike.
Second, I don't have any problem with people thinking we have a creator, although that instance still doesn't necessarily mean a magic 'poof into existence' reality either.
>But my point is that LLM's essentially arrive at answers by brute force through search.
Sorry but that's just not remotely true. This is so untrue I honestly don't know what to tell you.
>or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on %20 of a lightbulb!).
Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful supercomputer on the planet could not run this in real time.
I think it comes from a position of arrogance/ego. I'll speak for the US here, since that's what I know the most; but the average 'techie' in general skews towards the higher intelligence numbers than the lower parts. This is a very, very broad stroke, and that's intentional to illustrate my point. Because of this, techie culture gains quite a bit of arrogance around it with regards to the masses. And this has been trained into tech culture since childhood. Whether it be adults praising us for being "so smart", or that we "figured out the VCR", or some other random tech problem that literally almost any human being can solve by simply reading the manual.
What I've found, in the vast majority of technical problem solving cases that average people have challenges with, is that if they just took a few minutes to read a manual they'd be able to solve a lot of it themselves. In short, I don't believe as a very strong techie that I'm "smarter than most", but rather that I've taken the time to dive into a subject area that most other humans feel neither the need nor the desire to dive into.
There are objectively hard problems in tech to solve, but the people solving THOSE problems in the tech industry are few and far between. And so the tech industry as a whole has spent the last decade or two spinning circles on increasingly complex systems to continue feeding their own egos about their own intelligence. We're now at a point that rather than solving the puzzle, most techies are creating incrementally complex puzzles to solve because they're bored of the puzzles that are in front of them. "Let me solve that puzzle by making a puzzle solver." "Okay, now let me make a puzzle solver creation tool to create puzzle solvers to solve the puzzle." and so forth and so forth. At the end of the day, you're still just solving a puzzle...
But it's this arrogance that really bothers me in the tech bro culture world. And, more importantly, at least in some tech bro circles, they have realized that their target to gathering an exponential increase in wealth doesn't lie in creating new and novel ways to solve the same puzzles, but to try and tout AI as the greatest puzzle solver creation tool puzzle solver known to man (and let me grift off of it for a little bit).
https://epoch.ai/frontiermath/open-problems
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
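For readers who haven't met hypergraphs before, here is a minimal sketch of the data structure itself: a vertex set plus a set of edges, where each edge is an arbitrary subset of vertices rather than a pair. The function names are hypothetical, and this does not attempt to reproduce the specific function or bound from the problem.

```python
from itertools import combinations

def make_hypergraph(vertices, edges):
    """Build a hypergraph as (frozenset of vertices, set of frozenset edges)."""
    vs = frozenset(vertices)
    es = {frozenset(e) for e in edges}
    for e in es:
        if not e <= vs:
            raise ValueError(f"edge {sorted(e)} uses unknown vertices")
    return vs, es

def is_uniform(edges, k):
    """True if every edge contains exactly k vertices (a k-uniform hypergraph)."""
    return all(len(e) == k for e in edges)

def complete_uniform(n, k):
    """Complete k-uniform hypergraph on vertices 0..n-1: every k-subset is an edge."""
    return make_hypergraph(range(n), combinations(range(n), k))

# An ordinary graph is the k = 2 case; k = 3 already allows three-vertex edges.
vertices, edges = complete_uniform(4, 3)
print(len(vertices), "vertices,", len(edges), "edges, 3-uniform:", is_uniform(edges, 3))
```

Constructions in this area typically amount to choosing which subsets to include as edges so that some forbidden pattern never appears, which is why they can be checked mechanically once written down.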
Both constructing hypergraphs (sets) and lower bounds are very regular, chore type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
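The "every program has a corresponding proof" remark is the Curry-Howard correspondence. A tiny Lean 4 sketch, with names I made up for illustration, shows the same term read both ways:

```lean
-- Read as logic: a proof that A implies (B implies A).
theorem const_proof (A B : Prop) : A → B → A :=
  fun a _ => a

-- Function application doubles as modus ponens.
theorem modus_ponens (A B : Prop) (f : A → B) (a : A) : B :=
  f a

-- Read as programming: the very same lambda, now an ordinary function on data.
def constProgram (α β : Type) : α → β → α :=
  fun a _ => a
```

The catch is that the proposition a typical program proves is just its type signature, which is usually a very weak statement, so writing programs is some distance from proving interesting theorems.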
This isn't science fiction. But it's nice that the LLMs solved something for once.
Interesting. What's that “scaffold”? A sort of unit test framework for proofs?
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about; time-to-ship. The results are just a sped up version of what we've had now for more than two decades: Software is getting slower, buggier and less usable.
If you don't have a good regulating function for what represents real quality you can't really expect systems that just pump out code to actually iterate very well on anything. There are very few forcing functions to use to produce high quality results through iteration.
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
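The shape of such a feedback loop is easy to sketch, even though the real thing is much harder. In the toy below, the candidate generator is a stand-in for a model proposing attempts and `verify` is a stand-in for a proof checker such as Lean or Isabelle; both are hypothetical placeholders, and the "theorem" is just a factorization of 91.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def search_with_verifier(
    candidates: Iterable[T],
    verify: Callable[[T], bool],
    max_attempts: int,
) -> Optional[T]:
    """Return the first candidate the verifier accepts, or None if the budget runs out."""
    for attempt, candidate in enumerate(candidates):
        if attempt >= max_attempts:
            break
        if verify(candidate):
            return candidate
    return None

TARGET = 91

def propose():
    """Stand-in proposer: guesses (a, TARGET // a), most of which are wrong."""
    for a in range(2, TARGET):
        yield a, TARGET // a

def verify(pair) -> bool:
    """Stand-in checker: cheap, exact, and indifferent to how the guess was made."""
    a, b = pair
    return a > 1 and b > 1 and a * b == TARGET

print(search_with_verifier(propose(), verify, max_attempts=100))
```

The point of the design is that the proposer is allowed to be wrong most of the time, because nothing counts as a result until the checker accepts it.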
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.
We've had a few decades to address email spam, and still haven't managed to disincentivize it enough to stop it being the main challenge for email as a communication medium. I don't think there's much hope that we'll be able to disincentivize the widespread, large-scale creation of AI slop even after more expensive models with higher-quality output are available.
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that AI being able to both 1) solve frontier math problems and 2) pose interesting/relevant math problems would be an "oh shit" moment, because that would be true PhD-level research.
Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.